Today I am going to try out a new way to aggregate scRNA-seq data from independent experiments. The method is called scvis. scvis posits that high-dimensional scRNA-seq data (many more genes than cells tested) is governed by low-dimensional factors (cell types or sample origins). scvis aims to learn these low-dimensional factors (assumed to have two or three dimensions), which allows new data (e.g. other scRNA-seq experiments) to be overlaid on top of the previous t-SNE-like embedding, easing interpretation of the results (i.e. inferring cell identity).
In principle, then, we should be able to analyse one dataset first, assign cell types, and then overlay the next dataset on the same topology. Let's test the code out.
Testing out scvis
The aim is to preserve the cell type structure of the data while mitigating batch effects. A classic example is the stimulated vs control PBMC dataset. In this case scvis should play the role that canonical correlation analysis (CCA) usually plays and help us preserve the cell types.
1. Prepare the training dataset
We will use the control matrix as the training dataset. scvis takes an input expression matrix and the cell identity labels. We will use the two groups of PBMCs from Kang 2017 as our toy dataset. We download the data from the Seurat tutorial and run FindClusters on each dataset individually.
The t-SNE plots of the ctrl and stim groups (Fig A and C) look quite different in topology. ctrl comprises 12 clusters according to Seurat, so we output the matrix of the genes in genes.use and the cluster identity of each PBMC for scvis.
2. Running scvis
scvis train --data_matrix_file crtl_matx.tsv \
--out_dir /home/a/Documents/trails_errors_bioinformatics/scvis \
--data_label_file /home/a/Documents/trails_errors_bioinformatics/scvis/crtl_ident.tsv \
--verbose \
--verbose_interval 50
scvis generated three outputs: a trained model, the 2D embedding of the cells (somewhat like a t-SNE?) and the log_likelihood of each embedding. Let's focus on the 2D embedding first. It seems that each data point (a cell) is colored by its cluster identity. We can check this by coloring the points by the Seurat-assigned cluster number. The t-SNE clustering structure is conserved overall but, interestingly, clusters four and six are much more loosely organised in scvis than in t-SNE. This suggests that these cells may be more diverse than the t-SNE showed, and were aggregated into one cluster only because they are all very different from the other clusters.
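To put a number on "loosely organised", one option is to measure how far the cells of each cluster sit from their cluster centroid in the 2D embedding. The sketch below is only an illustration under my own assumptions: the function name cluster_spread is made up, the coordinates are toy values rather than real scvis output, and loading the embedding file is left out because its exact layout is not shown here.

```python
import math
from collections import defaultdict


def cluster_spread(coords, labels):
    """Return {label: mean Euclidean distance of its cells to the cluster centroid}.

    coords: list of (x, y) embedding coordinates, one per cell.
    labels: list of cluster identities, in the same cell order as coords.
    """
    groups = defaultdict(list)
    for point, label in zip(coords, labels):
        groups[label].append(point)

    spread = {}
    for label, points in groups.items():
        # Centroid of the cluster in the 2D embedding.
        cx = sum(p[0] for p in points) / len(points)
        cy = sum(p[1] for p in points) / len(points)
        # Average distance of the cluster's cells to that centroid.
        spread[label] = sum(
            math.hypot(p[0] - cx, p[1] - cy) for p in points
        ) / len(points)
    return spread


# Toy values (hypothetical, not real scvis output): cluster "4" is spread out,
# cluster "1" is perfectly tight.
coords = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 10.0)]
labels = ["4", "4", "1", "1"]
spread = cluster_spread(coords, labels)
```

Running the same function on the t-SNE coordinates and on the scvis coordinates would give one spread value per cluster and per method. The two embeddings are on different scales, so the values are best compared as ranks or ratios within each embedding rather than as raw distances.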
3. Mapping with scvis
Now we use the mapping function to map the stimulated data onto the embedding learnt from the control data.
scvis map --data_matrix_file stim_matx.tsv \
--out_dir /home/a/Documents/trails_errors_bioinformatics/scvis \
--pretrained_model_file /home/a/Documents/trails_errors_bioinformatics/scvis/model/perplexity_10_regularizer_0.001_batch_size_512_learning_rate_0.01_latent_dimension_2_activation_ELU_seed_1_iter_3000.ckpt
The key question I wanted to address is this: can I use the cell-type assignment of the control data to accurately infer the cell-type assignment of the stimulated data? Let's look at the t-SNE generated by Seurat. Both datasets (Figure A and C) have 11 clusters, but the stimulated cells appear to have quite different inter-cluster relationships from the control. For instance, from the scvis results we know that clusters 7 and 8 in the control group are merged into cluster 3 in the stimulated data. Conversely, cluster five in the control group is split into two subpopulations of cluster 7 in the stimulated group in the t-SNE (Figure C). These aberrant clustering patterns are smoothed out in scvis, which gives a more realistic distribution of cells according to the genes in genes.use.
scvis also appears to offer a different clustering topology from the Seurat CCA output, where the combined and normalised dataset from CCA generated 12 instead of 11 clusters. One explanation for these contrasting results is that the two groups of cells do not completely overlap in terms of cell type, so the number of clusters increases after merging the two datasets. Alternatively, the effects of the stimulation treatment may still not be completely corrected by the CCA method. The most conflicting groupings are those of the B and B-activated cells (clusters 7 and 8 in Fig B and cluster 3 in Fig D), and the CD8T and NK cluster (cluster 6 (Fig A) in control vs clusters 7 and 8 in stimulated data (Fig C)).
More cross-checking of the cell labels will tell whether the cell-type assignment is correct. But still, what is the best way to rate whether a clustering is correct?
In conclusion, scvis appears very promising for merging multiple sets of scRNA-seq data and extracting useful biological information. It can serve as an alternative to CCA. Finally, there must be a way to quantify the performance of all these clustering methods. Let's find it and learn about it before we return to evaluate these cell-type groupings.
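One simple way to turn the overlay into actual label predictions would be a nearest-neighbour vote: each stimulated cell takes the majority label of the k closest control cells in the shared 2D embedding. This is my own illustrative sketch, not something scvis provides. The function name transfer_labels, the choice k=3 and the toy coordinates are all assumptions.

```python
import math
from collections import Counter


def transfer_labels(ref_coords, ref_labels, query_coords, k=3):
    """Give each query cell the majority label of its k nearest reference cells.

    ref_coords / ref_labels: embedding coordinates and labels of the control cells.
    query_coords: embedding coordinates of the mapped (stimulated) cells.
    """
    predicted = []
    for q in query_coords:
        # Indices of the k reference cells closest to this query cell.
        nearest = sorted(
            range(len(ref_coords)),
            key=lambda i: math.dist(q, ref_coords[i]),
        )[:k]
        votes = Counter(ref_labels[i] for i in nearest)
        predicted.append(votes.most_common(1)[0][0])
    return predicted


# Toy values (hypothetical): two well-separated control populations.
ref_coords = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
ref_labels = ["B", "B", "B", "NK", "NK", "NK"]
query_coords = [(0.5, 0.5), (10.5, 10.5)]
predicted = transfer_labels(ref_coords, ref_labels, query_coords)
```

The brute-force search is fine for a toy example but scales as (number of query cells) x (number of reference cells), so a real dataset would probably call for a spatial index instead.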
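One candidate I am aware of for such a score is the adjusted Rand index (ARI), which compares two labelings of the same cells and corrects for chance agreement. Below is a minimal pure-Python sketch, written from the standard pair-counting formula. The function name and the toy labels are mine, and I have not yet checked whether ARI is the most appropriate metric for this particular comparison.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same cells."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same cells")

    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two cells: trivially identical

    # Pairs of cells that share a cluster in both labelings.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs of cells that share a cluster within each labeling on its own.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = sum_a * sum_b / total_pairs  # agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (sum_cells - expected) / (max_index - expected)


# Toy check: the same partition with swapped cluster names scores 1.
same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# A labeling that splits every cluster of the other one scores below 0.
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

A score of 1 means the two labelings agree perfectly up to renaming of clusters, and values near 0 mean agreement no better than chance. It could be applied, for example, to the Seurat cluster labels versus labels transferred through the scvis embedding.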