In this post, I am looking into an earlier topic solved by Seurat infering spatial location of cells through marker gene expression in zebrafish embryo (Satija 2015). There is also a old tutorial for the paper published in Seurat tutorial.
What was interesting about the paper was that it addressed the main question that my initial project on scRNA-seq analysis tried to answer: How to match an expression profile to a fluorescent staining pattern? And can we further use the cells’ expression profiles to predict identity of other cells? It is also cool to dip into a few alternative ways of data analysis that are now completely standardized in the Seurat package. I have to say that it was not that easy to dig into the old Seurat code and work out how and why things were done that way, but it was quite rewarding as I learn a few new R functions on the way.
OK, so a brief summary of the paper is that the authors tried to identify a cell’s location at the zerbrafish embryo based on the on/off state of a particular set of differentiation genes. The first interesting aspect is that they used highly variable genes (HVG) identified in the scRNA-seq to project PCA loading of other genes; thus, they identified important genes (sharing a “larger” loading in the PC) contributing to defining cell identity but were not captured in the HVG gene set because they missed the dispersion threshold. The second aspect is that the author imputed the gene expression data of the landmark differentiated genes with specific expression pattern in the embryo. They demonstrated that imputing pooled the co-expressing genes information to refine the expression level of these marker genes, thus making the later spatial inference step less noisy. Next, the authors transformed the scRNA-seq data of these marker genes to a on/off stage visualized in the rmbryo for later spatial prediction processes. These three analyses were covered in tutorial 1 to 3 that I so far am able to update the codes and reproduce their results with small variations.
1) Using HVG genes to find other important genes via PC projection
The concept is new to me but it is relatively stright forward technically. I followed the paper’s specified parameter to run Seurat FindVariableGenes() and I found 184 instead of 160 genes.
‘’’
data<-read.table("GSE66688_zdata.expMatrix.txt")
## follow paper method session ###
zf9<-CreateSeuratObject(data, project="zf9", min.genes =2000 , min.cells=3)
## zf1-3 are batches ##
## check if nGene and nUMI are consistent ##
mean(zf9@meta.data$nGene)
mean(zf9@meta.data$nUMI)
zf9<-NormalizeData(zf9, scale.factor = 200000)
zf9<-ScaleData(zf9, do.center = T, vars.to.regress = c("nUMI", "orig.ident"))
### find HVG ###
zf9<-FindVariableGenes(zf9, y.cutoff = 2, x.low.cutoff = 0.25, x.high.cutoff = 7)
### we found 184 instead of 160 genes!
‘’’
In the paper, Seurat use the R base prcomp() analysis to perform PCA. I did both runPCA() and prcomp() in the exercise as I was not sure.
‘’’
zf9<-RunPCA(zf9)
zf9<-RunTSNE(zf9, dims.use = 1:20)
FindClusters(zf9, reduction.type = "pca", dims.use = 20)
TSNEPlot(zf9)
‘’’
t-SNE plot showed that the batch effects are elimiated but I don’t think I see any obvious clustering.
2) We use predict() to project the HVG PC on the full gene set.
‘’’
zf9_pca<-prcomp(zf9@scale.data[zf9@var.genes, ], center = F)
## project pca axis to full data ##
project.zf9 = predict(zf9_pca, zf9@scale.data)
‘’’
I have to say that my PCA results were not the same as the published results just by looking at the Seurat output in the supplementary info of the paper. But we moved on. The next step was to find the important but not highly variable genes. The authors surveyed the projected gene loading and find all genes with loading greater than 1e x -2 (1%) in PC1, 2, and 3. I repeated the procedure and found that this yielded 7199 genes! I increased the thresholds to 5 to reduce the risk of using too many genes (predictors) with too little cells (instances). Now, we ended up with 385 genes. We would use these 385 genes to perform the imputing of the 47 landmark genes with known spatial distribution.
‘’’
# insitu.matrix contains the expression details for 47 genes for 8 locations #
load("output_part1.Robj")
wb <- loadWorkbook("Spatial_ReferenceMap.xlsx"); insitu.genes <- getSheets(wb)
insitu.matrix <- data.frame(sapply(1:length(insitu.genes),function(x)as.numeric(as.matrix(wb[x][2:9,2:9]))))
V_D<-as.character(seq(0, 7))
row_wb<-c("24-30", "17-23", "13-16", "9-12", "7-8", "5-6", "3-4", "1-2")
rownames(insitu.matrix)<-melt(sapply(V_D, function(x)paste(x, row_wb, sep="_")))[,3]
insitu.genes <- toupper(insitu.genes); colnames(insitu.matrix) <- (insitu.genes)
‘’’
The imputation aimed at getting a better informed expression values of these marker genes by pooling in expressions of other related genes. So the fist step of the original code was to perform LASSO modeling on HGV and the marker genes (insitu.genes) and then use the predict() function to impute expression value of these insitu.genes. This step makes more sense to perform on normalized data not the scaled data.
‘’’
### impute insitu.gene.values
genes.use=zf9_full_pc_genes
lasso.input=t(zf9@scale.data[genes.use,])
genes.use=zf9_sig_genes[!zf9_sig_genes %in% insitu.genes]
genes.fit<-insitu.genes
### on normalised data
lasso.fits_data=data.frame(
t(sapply(genes.fit,function(x)
lasso.fxn(t(as.matrix(zf9@data[genes.use[genes.use!=x],])),
as.matrix(zf9@data[x,]),s.use=20,x,do.print=FALSE,gram=TRUE))))
‘’’
We can see that the correlation of co-expressed gene pairOSR1 and MIXL1 became stronger after imputation. This helps the later inference step for on/off spatial landmark genes. With the imputed data, we can now build expression model of the insitu.genes and define a range of on/off state.
3) Infer on/off state of insitu.genes
‘’’
### build expression model using the imputed insitu.genes
### partition imputed expression into 2 group (k=2) ###
fit_gene<-function(df, gene, do.k){
do.k=2
data.use=df[gene,]
data.cut=as.numeric(df[gene,])
cell.ident=as.numeric(cut(data.cut,do.k))
### NEW DIscovery of using CUT!!! ###
### ploting
plot(c(df[gene,]), cell.ident, pch=16)
title(main=gene)
cell.ident=order(tapply(as.numeric(data.use),cell.ident,mean))[cell.ident]
ident.table=table(cell.ident)
hist(as.numeric(data.use),probability = TRUE,xlab=gene,main=gene)
for(i in 1:do.k)
lines(seq(-10,10,0.01),(ident.table[i]/sum(ident.table)) * dnorm(seq(-10,10,0.01),mean(as.numeric(data.use[cell.ident==i])),sd(as.numeric(data.use[cell.ident==i]))),col=i,lwd=2)
### Save probability of the ident(1/2) for each cell , given the normalised exp ###
raw.probs=t(sapply(data.use,function(y) unlist(lapply(1:do.k,function(x) ((ident.table[x]/sum(ident.table))*dnorm(y,mean(as.numeric(data.use[cell.ident==x])),sd(as.numeric(data.use[cell.ident==x]))))))))
norm.probs=raw.probs/apply(raw.probs,1,sum)
### return prob row:gene, col:cells
return(norm.probs)
}
‘’’
This cut() function has blown my mind! I didn’t know that R could separate two sets of data simply using this function. Anyway, now we compute the probability distribution of the on and the off state for each gene on their expression values and inferred the probability whether a landmark gene is on or off for each cell.
‘’’
### use normalised data not Scaled data for fiting gene models on exp!!! ###
### collect a list of cell ident.prob either at ident 1/2 ###
partition_prob<-lapply(insitu.genes, function(x) fit_gene(lasso.fits_data, x, 2))
### make table ###
for (i in 1:length(partition_prob)){
p_df<-data.frame(partition_prob[[i]])
colnames(p_df)<-paste(insitu.genes[i], c(1, 2), sep=".")
if (i ==1){
all_prob<-p_df}
else {
all_prob<-cbind(all_prob, p_df)
}}
‘’’
We can see that most landmark genes showed clear separations of on/off state (ident.1/ident.2). Most genes had a non-zero baseline so that the on-state is defined by high expression values. And we can see that only a very small amount of cells are in the alternative state. Unfortunately this is where I am stuck at the moment, at the cell mapping step. Here, I have a few very complicated functions that I am still studying. I will write about them in the future posts if I manage to crack the code.
Conclusion
I have to say that I have learnt a few new stuff in R and I was glad to see how to use LASSO on scRNA-seq data! In summary, we learnt
1) use HVG to find other importnat genes via principle-component projection,
2) data imputing using LASSO,
3) cutting data into two normal distribution of different means and infer the probability of the cell belonging to either of the distribution in R.
Now we have all the probability of whether each marker gene is on/off for each cell. How can we use the combination these probabilities to infer the likely location of a cell? I am excited to dig deeper and learn more about it! It is also cool to see that Seurat is not applied on cluster identification but on other scientific problems.
Comments