With the growing number of published scRNA-seq studies, one ideal is to compare findings across studies to confirm 1) novel subtypes and 2) important cell markers. Pooling multiple studies should in theory increase the power to classify cell types (by increasing the number of cells assessed), but it may also bring in a lot of technical noise, so that cells cluster by experiment rather than by their biological (gene expression) signal.
Meta-analyses of scRNA-seq data have been limited by batch effects, which arise from lab-specific technical biases and the sequencing methods used. Many research groups have dedicated their efforts to removing these confounding effects (Eling et al. 2018; Butler et al. 2018).
Well, MetaNeighbor (Crow et al. 2018) demonstrated a new way to get around these confounding factors! Its implementation combines two practices: a correlation-based cell-cell similarity matrix and gene sets (from GO terms). Using a list of genes gives a measure that is more robust to dropouts than using individual genes, and using a cell-cell similarity matrix bypasses the noise of individual experimental settings and allows us to compare the biological signal directly. MetaNeighbor then performs network weighting and neighbor voting to tell us the predictive power of a gene list for a particular cell type, reported as the mean AUROC for that cell type and gene set across n-fold dataset cross-validation. (I will have to learn more about network construction and how the weights are calculated, but I will leave that for now.)
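To make the neighbor-voting idea a bit more concrete, here is a minimal sketch in plain Python. It is not MetaNeighbor's actual code: I am assuming Pearson correlation shifted into the range 0 to 1 as the network weight, a single train/test split instead of full cross-validation, and made-up expression values for a four-gene set. The function names (pearson, neighbor_vote, auroc) are my own.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two expression vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def neighbor_vote(train_expr, train_labels, test_expr):
    """Score each test cell by the share of its network weight
    that points at positively labelled training cells."""
    scores = []
    for cell in test_expr:
        # Shift correlations from [-1, 1] to [0, 1] so weights are non-negative
        # (an assumption of this sketch, not the package's weighting scheme).
        weights = [(pearson(cell, t) + 1) / 2 for t in train_expr]
        votes = sum(w for w, is_pos in zip(weights, train_labels) if is_pos)
        scores.append(votes / sum(weights))
    return scores

def auroc(scores, labels):
    """Probability that a random positive outscores a random negative."""
    pos = [s for s, is_pos in zip(scores, labels) if is_pos]
    neg = [s for s, is_pos in zip(scores, labels) if not is_pos]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Made-up data: rows are cells, columns are the four genes of one gene set.
# The "test" study is on a different scale, mimicking a batch effect.
train_expr = [[9, 8, 1, 2], [8, 9, 2, 1], [1, 2, 9, 8], [2, 1, 8, 9]]
train_labels = [True, True, False, False]
test_expr = [[19, 17, 3, 5], [17, 18, 4, 3], [3, 5, 18, 19], [4, 3, 19, 17]]
test_labels = [True, True, False, False]

scores = neighbor_vote(train_expr, train_labels, test_expr)
print(auroc(scores, test_labels))  # 1.0 for this toy example
```

Because correlation ignores shifts and rescaling of a cell's expression profile, the different scale of the second study does not hurt the prediction here, which is the intuition behind comparing cells through a similarity matrix rather than through raw expression values.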
The surprising result of the paper is that highly variable genes may not make biological sense! The authors showed that both the GO groups and randomly selected lists of highly variable genes perform well in classifying cell types, and that increasing the number of genes in the list (from 100 to 800) further increases accuracy. The explanation was that many genes from many different pathways are differentially expressed. In summary, MetaNeighbor appears to be worth trying if I have a marker gene list and would like to know whether it is universal across datasets!
At this point, I have downloaded and installed MetaNeighbor (https://github.com/gillislab/MetaNeighbor) in RStudio. MetaNeighbor consists of a supervised part and an unsupervised part. From what I gathered, if I have a gene list, I can use the supervised part to evaluate whether the gene list discriminates the cell types well from each other. The tutorial only tests two cell types, and I don’t know whether multiple cell types can be tested simultaneously. If I have two datasets with cells labelled by cell type, I can then use the unsupervised part of MetaNeighbor to evaluate which groups (cell types) are most similar to each other across experiments.
For the first part of the analysis, I followed Part 2 of the MetaNeighbor tutorial. Both Sst_Chodl-Int1 (AUROC=0.99) and Smad3-Int14 (AUROC=0.97) are among the top hits showing high similarity. It also showed that Int1 related well to other Sst cell types (Int1-Sst_Th, AUROC=0.89; Int1-Sst_Cdk6, AUROC=0.88). It also yielded a list of HVG (331 genes in total) for cell-type-based clustering. But can I actually identify gene markers that define a particular group of cells? I have not found a specific function for that yet, so I assume that the list of HVG is a pool of genes that define different clusters.
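One way to read a table of pairwise AUROCs like this is to look for reciprocal best matches: two cell types that are each other's highest-scoring partner. The sketch below only illustrates that idea on the four pairs quoted above; it is not the package's own top-hits code, and the function names are mine.

```python
# AUROC values quoted above: (GSE71585 cell type, GSE60361 cell type, AUROC).
pairs = [
    ("Sst_Chodl", "Int1", 0.99),
    ("Smad3", "Int14", 0.97),
    ("Sst_Th", "Int1", 0.89),
    ("Sst_Cdk6", "Int1", 0.88),
]

def best_partner(pairs):
    """Map every cell type to its highest-scoring partner."""
    best = {}
    for a, b, score in pairs:
        for me, other in ((a, b), (b, a)):
            if me not in best or score > best[me][1]:
                best[me] = (other, score)
    return best

def reciprocal_hits(pairs):
    """Keep only pairs where each cell type is the other's best partner."""
    best = best_partner(pairs)
    return [(a, b, s) for a, b, s in pairs
            if best[a][0] == b and best[b][0] == a]

hits = reciprocal_hits(pairs)
for a, b, s in hits:
    print(f"{a} <-> {b}: AUROC={s}")
```

With these four values, only Sst_Chodl/Int1 and Smad3/Int14 come out as reciprocal; Sst_Th and Sst_Cdk6 both point at Int1, but Int1's best partner is Sst_Chodl.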
Next, I compared the performance of Seurat and MetaNeighbor.
My aims were to 1) test whether I could use MetaNeighbor on my own data and 2) see how consistent its results are with Seurat's.
1) Split mn_data and rerun CCA to examine how cells cluster
It looks like Smad3 from GSE71585 and Int14 from GSE60361 clustered well together, as did Sst-Chodl from GSE71585 and Int1 from GSE60361. Following the standard Seurat PCA procedure showed that the two datasets had a huge batch effect. In addition, very few HVG (76 genes) were identified for running the downstream PCA and CCA in Seurat.
2) Run CCA
I then followed the CCA pipeline and saw that clustering improved, as most of the batch effects were removed. However, the clustering also became less precise (especially for Smad3).
3) FindMarker
I performed FindConservedMarkers() for cluster 2 (Smad3 and other cells) and cluster 5 (Sst-Chodl exclusive), then used these marker genes as gene lists and fed them back into MetaNeighbor. Surprisingly, all gene lists, even the cluster 1 markers, distinguished Smad3 from Sst-Chodl with AUROC > 0.89. This suggests that these genes may be expressed in multiple clusters but at distinct expression levels. Still, the marker genes of cluster 5 had the highest performance. This cluster also shares the highest number of genes with the var.gene list identified by MetaNeighbor (107 common genes out of 397 markers).
Finally, I swapped the HVG lists between the two packages: I used the HVG identified in MetaNeighbor (331 genes) to run Seurat CCA, and carried out MetaNeighborUS() using all the HVG (76 genes) that Seurat identified from the two individual datasets. The Seurat HVG set also performed well in MetaNeighbor, though not quite as well as the MetaNeighbor HVG (difference in AUROC=0.01). On the other hand, CCA only identified 6 clusters using the 331 HVG identified by MetaNeighbor. So maybe some cell types are so similar that the typical unsupervised method really is not sensitive enough to tell them apart.
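Counting the genes shared between a cluster's marker list and an HVG list is just a set intersection. Here is a small sketch with placeholder gene names (GeneA, GeneB and so on are invented for illustration, not my real lists); overlap_summary is a hypothetical helper.

```python
def overlap_summary(markers, hvgs):
    """Return the number of shared genes and the fraction of markers they cover."""
    marker_set = set(markers)
    shared = marker_set & set(hvgs)
    return len(shared), len(shared) / len(marker_set)

# Placeholder gene lists, for illustration only.
cluster_markers = ["GeneA", "GeneB", "GeneC", "GeneD", "GeneE"]
metaneighbor_hvgs = ["GeneA", "GeneB", "GeneC", "GeneX", "GeneY", "GeneZ"]

n_shared, fraction = overlap_summary(cluster_markers, metaneighbor_hvgs)
print(n_shared, fraction)  # 3 shared genes, 0.6 of the marker list

# The same arithmetic for the cluster 5 numbers above: 107 of 397 markers.
print(round(107 / 397, 2))  # 0.27
```

So the 107 shared genes correspond to roughly 27% of the 397 cluster 5 markers.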
To conclude: MetaNeighbor is very useful if we would like to compare the population composition of two individual studies; we can validate cell types and improve clustering. We can then perform supervised learning to test gene lists that define a particular cluster. It is especially useful if the data already have cell types labelled. My results were consistent with the publication: HVG, regardless of their function, can distinguish cell types very well using correlation networks. I think I will now go back to the basics and learn how the networks are generated; surely there will be a lot of new things to learn!
Have you used MetaNeighbor in your own analysis? What was your experience?