It has been about one month now that I am getting my head spinning to learn a lot of bits and pieces about spectral clustering (useful introductory blog). I have to admit that all these stuff is really maths heavy to me from a user stand point but I am trying my best to understand how and why these methods work.
My basic understanding about these techniques are that clustering can be broken down into a few steps:
1) create a similarity/distance matrice
2) perform dimension reduction – identify latent key patterns
3) perform k-means/SNN clustering
Using different methods to create the distance matrices had profound impacts on the clustering results, so as using applying different dimension reduction technique. Another report (Priebe CE 2019) showed that different spectral graph embedding methods convey different sets of latent patterns, both could lead to meaningful groupings.
Today I am writing about my attempt on apply UMAP on the PBMC data and compare the output with t-SNE. The UMAP introduction video is so well done that I would encourage you to watch it to familiarize yourself with the UMAP theories. UMAP also showed promising results in giving biological meanings in the cluster topology, shown in Becht et al (2018). Thus, let see if I can extract new meanings from the PBMC data.
‘’’
library(Seurat)
library(data.table)
ctrl.data <- read.table("immune_control_expression_matrix.txt.gz",
sep = "\t")
stim.data <- read.table("immune_stimulated_expression_matrix.txt.gz",
sep = "\t")
ctrl<-CreateSeuratObject(ctrl.data, assay="RNA", project = "ctrl")
ctrl<-NormalizeData(ctrl)
ctrl<-ScaleData(ctrl, do.center = T, do.scale = T)
stim<-CreateSeuratObject(stim.data, assay="RNA", project = "stim")
stim<-NormalizeData(stim)
stim<-ScaleData(stim, do.center = T, do.scale = T)
#### How many HVG to use? ###
ctrl<-FindVariableFeatures(ctrl, mean.cutoff=c(0.025, 10), dispersion.cutoff=c(1, 10))
### 2000 HGV genes ###
ctrl_PCA<-RunPCA(ctrl, npcs = 100, features = ctrl@assays$RNA@var.features)
ctrl_PCA<-RunTSNE(ctrl_PCA, dims = 1:100)
ctrl_PCA<-FindNeighbors(ctrl_PCA, reduction="pca", dims=1:100)
ctrl_PCA<-FindClusters(ctrl_PCA)
### 13 clusters ###
ctrl_PCA<-FindNeighbors(ctrl_PCA, reduction="tsne", dims=1:2)
ctrl_PCA<-FindClusters(ctrl_PCA)
### 31 communities? ###
ctrl_PCA<-RunUMAP(ctrl_PCA, dims = 1:100, reduction = "pca")
ctrl_PCA<-FindNeighbors(ctrl_PCA, reduction="umap", dims=1:2)
ctrl_PCA<-FindClusters(ctrl_PCA)
### 33 communities ###
‘’’
I realized that Seurat had already started to phrased in UMAP when I revisited the integrated analysis of the ctrl and stim data. Interestingly, Seurat still use PCA for cluster and neighbor identification. The RunUMAP() function merely was used for data visualisation. Of course, UMAP generated more discrete clusters shown in the plot compared to t-SNE. Still, I am a bit puzzled then why not use UMAP loading for the downstream clustering? And why the UMAP function only return two dimensional loading?
With the current setting, I cannot really test the differences between PCA, t-SNE, and UMAP; FindClusters() did not really report final cluster numbers for the latter two methods; PCA reported 13 final clusters, t-SNE indicated 31 communities and UMAP 33 communities. Somehow, they report two-fold more clusters than PCA-SNN clustering results; maybe the resolution setting need some tinkering to accommodate the t-SNE and UMAp results... Nonetheless, UMAP appeared to form compact clusters with clear spatial separation compared to t-SNE.
Next, we turned to the integrated analysis of both stim and ctrl data. It seemed that the ICA pipeline is replaced by the IntegrateData() function.
‘’’
### integrate ctrl + stim ###?
merge.anchor <- FindIntegrationAnchors(object.list = list(ctrl, stim), dims = 1:20)
immune.combined <- IntegrateData(anchorset = merge.anchor, dims = 1:20)
DefaultAssay(immune.combined) <- "integrated"
# Run the standard workflow for visualization and clustering
immune.combined <- ScaleData(immune.combined, verbose = FALSE)
immune.combined <- RunPCA(immune.combined, npcs = 30, verbose = FALSE)
# UMAP and Clustering
immune.combined <- RunUMAP(immune.combined, reduction = "pca", dims = 1:20)
#### only stored 2 PC for UMAP, so the rest of the clustering are still based on PCA? ###
immune.combined <- FindNeighbors(immune.combined, reduction = "pca", dims = 1:20)
immune.combined <- FindClusters(immune.combined, resolution = 0.5)
### return 13 communities ###Z!
DimPlot(immune.combined, reduction = "pca", label = TRUE)
DimPlot(immune.combined, reduction = "umap", split.by = "orig.ident")
### independent use umap package ###
immune.combined <- FindNeighbors(immune.combined, reduction = "umap", dims = 1:2)
immune.combined <- FindClusters(immune.combined, resolution = 0.5)
### return 13 communities ###Z!
DimPlot(immune.combined, reduction = "umap", label = TRUE) +
theme(legend.position = "none")+
ggtitle("UMAP-clustering")
### 29 community!!! ###
‘’’
Again, UMAP identify double amount of communities compared to PCA. I wonder what if I use the original UMAP package on the ctrl data, will that give me different results? Going through the UMAP package tutorial (https://cran.r-project.org/web/packages/umap/vignettes/umap.html). It looked like that it is more orientated for model training: using UMAP to infer identity for new data. I was a bit confused of what data I shall feed into the UMAP model. For fear of crashing, I subsample the ctrl to 10000 cells and only look at the Marker genes (3654 genes) Seurat identified from the 13 cluster (almost like giving UMAP a cheat-sheet?!).
‘’’
library("umap")
### create dataframe that give count and labels to the data ###
### row:cell, col=gene + label ###
cells<-sample(rownames(ctrl_PCA@meta.data), size = 10000)
ctrl_markers<-FindAllMarkers(ctrl_PCA, only.pos = T, min.pct = 0.2)
ctrl_HVG<-ctrl@assays$RNA@var.features
ctrl_input<-data.frame(ctrl_PCA@assays$RNA@scale.data[ctrl_markers$gene, cells])
### only use marker genes and subset ctrl data to 10000 cells ###
ctrl_input<-t(ctrl_input)
ctrl_input<-merge.data.frame(ctrl_input, ctrl_PCA@meta.data, by= "row.names")
ctrl_input.data=ctrl_input[, 1:3546]
ctrl_input.labels=ctrl_input$seurat_clusters
### somehow rwoname became col 1?? ###
ctrl.umap=umap(ctrl_input.data[, 2:ncol(ctrl_input.data)])
### using gene exp as input is very slow, thus why Seurat uses PCA output instead ###
### but is that a valid move? ###
ctrl.umap
#umap embedding of 10000 items in 2 dimensions
#object components: layout, data, knn, config
plot(ctrl.umap$layout[, 1], ctrl.umap$layout[, 2], col=ctrl_input.labels, type="p")+
title("UMAP with scale.ctrl.data")
### using log_normalised data did not give back 13 cluster in umap plot ###
ctrl_UMAP_scale_marker<-kmeans(ctrl.umap$layout, 13)
### looked very strange ###
cluster_kmeans_compare<-data.frame(t=ctrl_input.labels, p=ctrl_UMAP_kmeans$cluster)
q<-cluster_kmeans_compare %>% group_by(t, p) %>% summarize(n=n()) %>%
ggplot(aes(x=t, y=n, group=interaction(t, p), fill=factor(p))) +
geom_bar(stat="identity", position="fill")+
labs(x="cluster assigned by Seurat", y="freq of items of the cluster",
fill="cluster assigned by UMAP")
p<-data.frame(UMAP_1=ctrl.umap$layout[, 1], UMAP_2=ctrl.umap$layout[, 2], col=ctrl_UMAP_kmeans$cluster) %>%
ggplot(aes(x=UMAP_1, y=UMAP_2, color=factor(col)))+ geom_point()
plot_grid(p, q)
‘’’
I also tried UMAP with the 100 PC loadings generated from Seurat as input; another attempt was using Seurat identified highly variable features (2000 HGV genes in my case) as input. All of them give somewhat conflicting results and very poor clustering. Performing k-means clustering from the UMAP loadings also yield some highly contaminated clusters (e.g. Cluster 2 in Seurat was evenly distributed into many 6 different groups in the UMAP-kmeans result), why?
What is missing in my code that I cannot reproduce the Seurat results? I will have to dig deeper to see what really happens. In addition, I would like to continue with the UMAP package for training model and identity inference... hopefully in the next post.
Comments