Cell differentiation is one of the fields that actively uses scRNA-seq to answer interesting questions. scRNA-seq has high enough resolution to identify rare cell types within a bulk cell population; one application is the identification of stem cells in a tissue. The stem-cell state can be fickle and irregular: stem cells are outliers relative to their more differentiated counterparts and are present in low numbers, which makes k-means and kNN clustering unreliable. Still, grouping differentiating cells effectively according to their differentiation trajectories would provide invaluable information on how the gene-expression profile changes over development, and would tell us about the cellular composition of a maturing tissue.
Investigating the differentiation process of a tissue takes two steps: 1) identify the stem cells and 2) rationally link the clusters together according to changes in differential gene expression. This is exactly what the packages RaceID, StemID and FateID offer. The three packages were developed by Grün and are meant to be used in sequence. In brief, RaceID identifies rare cell types, most likely stem cells; StemID then assembles the differentiation trajectories from the stem-cell subpopulation to the most differentiated cells. In this post, I go through the RaceID and StemID tutorial to see whether I can apply the pipeline to a set of differentiating embryoid body (EB) cells (Han et al. (2018)). I thought the dataset was perfect for testing RaceID and StemID, as it has a temporal component and consists of a mix of cells in many different developmental states.
In the first part, I imported the EB4 (EB cells at the fourth day of development) and EB8 (EB cells at the eighth day of development) datasets. In the original paper, EB4 and EB8 cells formed non-overlapping clusters in the t-SNE. However, the observation that the two cell populations share nothing in common seems more likely to be a consequence of batch effects than of development.
```r
library(Seurat)   # Seurat v2.x API (FindVariableGenes, TSNEPlot, dims.use)
library(ggplot2)  # for ggtitle()

# Read the count matrices (Seurat expects genes in rows, cells in columns)
EB8 <- read.csv(gzfile("GSM3015985_EB8.csv.gz"), row.names = 1)
EB4 <- read.csv(gzfile("GSM2871127_EB-4.csv.gz"), row.names = 1)
EB8 <- CreateSeuratObject(EB8, project = "EB8")
EB4 <- CreateSeuratObject(EB4, project = "EB4")

# Tag every cell with its time point
EB8@meta.data$experiment <- "EB8"
EB4@meta.data$experiment <- "EB4"

### Q1: do we expect day-4 EBs to have some overlap with day-8 EBs?
EB8 <- NormalizeData(EB8)
EB4 <- NormalizeData(EB4)
EB8 <- ScaleData(EB8)
EB4 <- ScaleData(EB4)
EB8 <- FindVariableGenes(EB8)
EB4 <- FindVariableGenes(EB4)

### individual t-SNE ###
# Union of the variable genes found in either dataset
genes.use <- unique(c(EB4@var.genes, EB8@var.genes))
EB8 <- RunPCA(EB8, pcs.compute = 30)
EB4 <- RunPCA(EB4, pcs.compute = 30)
EB8 <- FindClusters(EB8, genes.use = genes.use, reduction.type = "pca", dims.use = 1:30, resolution = 1.5)
EB4 <- FindClusters(EB4, genes.use = genes.use, reduction.type = "pca", dims.use = 1:30, resolution = 1.5)
EB8 <- RunTSNE(EB8, dims.use = 1:30, do.fast = TRUE)
EB4 <- RunTSNE(EB4, dims.use = 1:30, do.fast = TRUE)

# do.return = TRUE returns the ggplot object so that a title can be added
TSNEPlot(EB8, do.return = TRUE) + ggtitle("EB8")
TSNEPlot(EB4, do.return = TRUE) + ggtitle("EB4")
```
To look into the matter, I ran cluster identification in Seurat on the two datasets individually, then used CCA to align the datasets and repeated the cluster analysis. Interestingly, Seurat found four clusters in each individual dataset. I increased the resolution setting to 1.5, as described in the paper, and this did give more clusters, but the clusters do not separate well in the t-SNE. After CCA, most of the information is lost and the resulting t-SNE is very poorly clustered.
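The CCA alignment step is not shown in the code above. A minimal sketch of how it could look with the Seurat v2 API used in this post follows. The helper name `align_eb_cca` and the values `num.cc = 30` and `dims.align = 1:20` are illustrative assumptions, not the settings behind the result described here, and the API changed in Seurat v3, so this only applies to v2.x.

```r
# Sketch only: assumes Seurat v2.x. align_eb_cca is a hypothetical helper name;
# num.cc and dims.align are illustrative values, not tuned settings.
align_eb_cca <- function(obj1, obj2, genes.use, num.cc = 30, dims.align = 1:20) {
  if (!requireNamespace("Seurat", quietly = TRUE)) {
    stop("Seurat (v2.x) is required for this sketch")
  }
  # Find canonical correlation vectors shared by the two datasets
  combined <- Seurat::RunCCA(obj1, obj2, genes.use = genes.use, num.cc = num.cc)
  # Align the CCA subspaces across the 'experiment' metadata column
  combined <- Seurat::AlignSubspace(combined, reduction.type = "cca",
                                    grouping.var = "experiment",
                                    dims.align = dims.align)
  # Cluster and embed the cells on the aligned dimensions
  combined <- Seurat::FindClusters(combined, reduction.type = "cca.aligned",
                                   dims.use = dims.align, resolution = 1.5)
  combined <- Seurat::RunTSNE(combined, reduction.use = "cca.aligned",
                              dims.use = dims.align, do.fast = TRUE)
  combined
}
```

With the objects from the first code block, this would be called as `combined <- align_eb_cca(EB4, EB8, genes.use)`, and colouring the t-SNE by the `experiment` column shows how well the two time points mix.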
Next, I repeated the analysis using RaceID3. Using k-medoids clustering, I found that EB8 is best described by 4 clusters and EB4 by 3 clusters (as indicated by the Jaccard index).
```r
library(RaceID)

# Read the count matrices
EB8 <- read.csv(gzfile("GSM3015985_EB8.csv.gz"), row.names = 1)
EB4 <- read.csv(gzfile("GSM2871127_EB-4.csv.gz"), row.names = 1)

### EB8
EB8 <- SCseq(EB8)
EB8 <- filterdata(EB8, mintotal = 500)       # drop cells with fewer than 500 transcripts
fdata <- getfdata(EB8)                       # filtered expression matrix, for inspection
EB8 <- compdist(EB8, metric = "pearson")
EB8 <- clustexp(EB8)                         # first pass with default settings
plotjaccard(EB8)
title("EB8")
EB8 <- clustexp(EB8, cln = 4, clustnr = 4)   # re-cluster with 4 clusters
plotjaccard(EB8)
title("EB8")
EB8 <- findoutliers(EB8)
plotoutlierprobs(EB8)
EB8 <- comptsne(EB8)
EB8 <- compfr(EB8, knn = 5)                  # Fruchterman-Rheingold layout
plotmap(EB8, fr = TRUE)
plotmap(EB8, final = FALSE, fr = FALSE)      # clusters before outlier identification
title("EB8")

### EB4
EB4 <- SCseq(EB4)
EB4 <- filterdata(EB4, mintotal = 500)
fdata <- getfdata(EB4)
EB4 <- compdist(EB4, metric = "pearson")
EB4 <- clustexp(EB4)
plotjaccard(EB4)
title("EB4")
EB4 <- clustexp(EB4, cln = 3, clustnr = 3)   # re-cluster with 3 clusters
plotjaccard(EB4)
title("EB4")
EB4 <- findoutliers(EB4)
plotoutlierprobs(EB4)
EB4 <- comptsne(EB4)
EB4 <- compfr(EB4, knn = 5)
plotmap(EB4, final = FALSE, fr = FALSE)
title("EB4")
```
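To make the Jaccard index concrete: it measures how much of a cluster survives when the data are resampled and re-clustered, as the size of the intersection divided by the size of the union. The base-R sketch below uses made-up cell IDs and a hypothetical helper `jaccard`; it illustrates the index itself, not RaceID's internal bootstrapping code.

```r
# Jaccard similarity between two sets of cell IDs:
# size of the intersection divided by size of the union
jaccard <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

# Hypothetical cell IDs: one cluster from the full data and its best match
# after re-clustering a resampled dataset
original  <- c("cell1", "cell2", "cell3", "cell4")
resampled <- c("cell2", "cell3", "cell4", "cell5")

jaccard(original, resampled)  # 3 shared cells / 5 cells in total = 0.6
```

A value close to 1 means the cluster is reproduced almost exactly under resampling; low values suggest the cluster is unstable, which is the reasoning behind picking the cluster numbers above.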
I tried to follow the steps and proceed to StemID. However, it required the kNN clustering from RaceID to assemble the tree! The kNN clustering generated a lot of one-cell clusters (outliers) in the dataset, which made the tree look very strange.
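For reference, the StemID steps that follow RaceID could be sketched as below. This is a minimal sketch that assumes the function names from the RaceID package vignette (`Ltree`, `compentropy`, `projcells`, `projback`, `lineagegraph`, `comppvalue`). The wrapper name `run_stemid` and all parameter values are illustrative assumptions, not the settings behind the trees discussed here. The `cthr` argument of `projcells` restricts the analysis to clusters with more than `cthr` cells, which may be one way to keep one-cell clusters out of the tree.

```r
# Sketch only: run_stemid is a hypothetical wrapper and the parameter values
# are illustrative. Assumes `sc` is an SCseq object that has already been
# through the RaceID steps (clustexp, findoutliers, comptsne).
run_stemid <- function(sc, cthr = 5) {
  if (!requireNamespace("RaceID", quietly = TRUE)) {
    stop("The RaceID package is required for this sketch")
  }
  ltr <- RaceID::Ltree(sc)                      # initialise the lineage-tree object
  ltr <- RaceID::compentropy(ltr)               # per-cell transcriptome entropy
  # Project cells onto links between cluster medoids; only clusters with
  # more than cthr cells take part in the analysis
  ltr <- RaceID::projcells(ltr, cthr = cthr, nmode = FALSE)
  ltr <- RaceID::projback(ltr, pdishuf = 100)   # randomised background projections
  ltr <- RaceID::lineagegraph(ltr)              # assemble the lineage graph
  ltr <- RaceID::comppvalue(ltr, pthr = 0.1)    # significance of the links
  ltr
}
```

Calling `ltr <- run_stemid(EB4)` and then `plotgraph(ltr, scthr = 0.2)` would draw the resulting lineage graph; raising `cthr` trades away small clusters for a cleaner tree.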
The left spantree figure is from the EB4 data. Cluster 3 seems to be the base, with two lineages branching off from it. The right tree reflects the EB8 trajectory; cluster 1 seems to be the base, but it is very hard to tell. Ideally, the two datasets would be overlaid so that EB4 cells occupy the more basal part of the tree and EB8 cells sit at the leaves, showing a developmental progression.
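The idea behind a spanning tree of clusters can be shown with a small base-R sketch: given pairwise distances between cluster medoids, keep only the shortest set of links that connects every cluster. The distances, cluster names and the helper `mst_edges` are made up for illustration, and the method is plain Prim's algorithm rather than StemID's own implementation, which additionally projects cells onto the links and tests their significance.

```r
# Prim's algorithm: minimum spanning tree over a symmetric distance matrix.
# Returns a data.frame with one row per tree edge (from, to, dist).
mst_edges <- function(d) {
  labels <- rownames(d)
  in_tree <- c(TRUE, rep(FALSE, nrow(d) - 1))  # start from the first cluster
  from <- character(0)
  to <- character(0)
  dist <- numeric(0)
  while (!all(in_tree)) {
    # Distances from clusters in the tree (rows) to clusters outside (columns)
    sub <- d[in_tree, !in_tree, drop = FALSE]
    idx <- which(sub == min(sub), arr.ind = TRUE)[1, ]
    from <- c(from, rownames(sub)[idx[1]])
    to <- c(to, colnames(sub)[idx[2]])
    dist <- c(dist, sub[idx[1], idx[2]])
    in_tree[labels == colnames(sub)[idx[2]]] <- TRUE
  }
  data.frame(from = from, to = to, dist = dist, stringsAsFactors = FALSE)
}

# Hypothetical distances between four cluster medoids
clusters <- c("c1", "c2", "c3", "c4")
d <- matrix(0, 4, 4, dimnames = list(clusters, clusters))
d["c1", "c2"] <- 1; d["c1", "c3"] <- 4; d["c1", "c4"] <- 5
d["c2", "c3"] <- 2; d["c2", "c4"] <- 6; d["c3", "c4"] <- 3
d <- d + t(d)  # make the matrix symmetric

tree <- mst_edges(d)
print(tree)  # edges c1-c2, c2-c3, c3-c4
```

In this toy example the tree is a single chain, c1 to c2 to c3 to c4. A one-cell outlier cluster would enter such a tree as an extra node of its own, which is one way a tree can end up looking strange.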
I have to find a way to change the clustering settings to obtain a more reasonable tree. I also have to look into a better way to merge the two datasets to get more informative results!
Hopefully, I will get more answers in my next update.