After deriving the distance matrix from the data, I can now move on to clustering the cells based on their expression profiles. I usually follow the standard Seurat guideline and use FindClusters() to split the cells into different groups. More specifically, Seurat implements shared nearest neighbor (SNN) clustering to find subpopulations, given the PCA run previously on the highly variable genes. Raising the resolution parameter in FindClusters() increases the number of clusters identified. To employ the SNN method, we first have to construct a graph that indicates cell-cell relatedness. But instead of using the distance matrix directly, we construct the graph in the transformed space of principal components! Rotating the data is a key part of normalisation (further explanation in http://mccormickml.com/2014/07/22/mahalanobis-distance/), helping us determine whether certain points are genuinely closer to or further from the centroid of a cluster.
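For reference, this is roughly what the call looks like (a minimal sketch using the Seurat v2-era API I use elsewhere in this post; argument names may differ in newer versions, and dims.use = 1:10 is just an illustrative choice):

```r
library(Seurat)

# Assuming `pbmc` is a Seurat object with HVGs identified and PCA already run.
# Build the SNN graph on the first 10 PCs and cluster; a higher resolution
# yields more, smaller clusters.
pbmc <- FindClusters(pbmc,
                     reduction.type = "pca",
                     dims.use = 1:10,
                     resolution = 0.6,
                     save.SNN = TRUE)
```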
In that light, I would like to further explore how PCA influences clustering. PCA can feed into a few algorithms, such as K-means and K-nearest neighbors (KNN). I will work on those two methods later; the first application of PCA I studied was in hierarchical clustering: pcaReduce. It was a comprehensive paper and gave me some insight into how hierarchical clustering is done.
The algorithm consists of 6 key steps (details in the paper's Algorithm 1 and Supplementary File 1):
1-PCA
2-K-means clustering (initially 30 clusters, projected on 29 PCs)
3-pairwise comparison of cluster similarity
4-merge the closest 2 clusters
5-repeat from step 3, with the data projected onto one fewer PC than in the last iteration.
6-iterate until only two clusters remain (an illustrative sketch of this loop is given below).
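To make the iteration concrete, here is a rough R sketch of the loop. This is not the package's actual implementation: merge_closest_pair() is a hypothetical helper standing in for the probabilistic merge described in the paper, and `expr` is assumed to be a cells-by-genes matrix.

```r
# Illustrative sketch of the pcaReduce iteration (not the package code).
q <- 30                                             # number of PCs to start with
pcs <- prcomp(expr)$x[, 1:q]                        # step 1: PCA
clusters <- kmeans(pcs, centers = q + 1)$cluster    # step 2: K-means, q+1 clusters

while (length(unique(clusters)) > 2) {
  # steps 3-4: compare all cluster pairs, merge the two most similar
  # (merge_closest_pair() is a hypothetical helper)
  clusters <- merge_closest_pair(pcs, clusters)
  # step 5: drop the last PC before the next round of merging
  q <- q - 1
  pcs <- pcs[, 1:q, drop = FALSE]
}
```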
So let's try pcaReduce in R. First, I used RunPCA(pcs.compute = 30) in Seurat. Then I ran t-SNE using all 30 dimensions; increasing the number of PCs used did not alter the overall topology of the t-SNE much.
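The calls were along these lines (again Seurat v2-era API; pc.genes defaults to the HVGs, but I spell it out here):

```r
# Compute 30 PCs on the HVGs, then embed with t-SNE using all 30 dimensions
pbmc <- RunPCA(pbmc, pc.genes = pbmc@var.genes, pcs.compute = 30)
pbmc <- RunTSNE(pbmc, dims.use = 1:30)
TSNEPlot(pbmc)
```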
Now I perform pcaReduce on only the HVGs identified in Seurat.
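Something along these lines (a sketch assuming the pcaReduce package from the authors' GitHub, whose PCAreduce() takes a cells-by-genes matrix, and the Seurat v2 slots @scale.data and @var.genes):

```r
library(pcaReduce)

# Cells-by-genes matrix restricted to the HVGs identified in Seurat
D_t <- t(pbmc@scale.data[pbmc@var.genes, ])

# 100 repeats, starting from 30 PCs (i.e. 31 clusters) and merging down to 2;
# method 'S' samples the merge probabilistically at each step
out <- PCAreduce(D_t, nbt = 100, q = 30, method = "S")
```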
pcaReduce performed 100 repeats of clustering and merging from 31 down to 2 clusters. The output is one matrix per repeat, with each column holding the assignments from one round of K-means/merging and each row giving the cluster identity of a cell. However, there is no accompanying statistic (e.g. a p-value) to justify which clusters were merged in each iteration. We can see that the initial clustering seemed unrealistic, as many clusters literally mix together.
Taking the B cells as an example, we can see that the clustering of the 305 B cells in the pbmc data converged in different ways across four runs of pcaReduce. By iteration 20, most B cells were grouped into the same cluster (indicated by the lines all converging). In 3 out of the 4 runs, one cell (CTAGGATGAGCCTA, indicated by the red dashed line) never grouped with the other B cells. It may be an outlier.
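To trace a single cell like this, you can pull its row out of one repeat's assignment matrix (a sketch assuming the output format described above, with `out` from the PCAreduce call; whether cell order is preserved from the Seurat object is an assumption here):

```r
# Cluster identity of one cell across the merge steps of the first repeat
cell <- "CTAGGATGAGCCTA"
assignments <- out[[1]]                        # matrix: cells x merge steps
rownames(assignments) <- colnames(pbmc@data)   # assumes cell order is preserved
assignments[cell, ]
```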
One drawback of pcaReduce is that the clustering confidence is calculated as the Adjusted Rand Index (ARANDI), which compares the assigned cluster of each cell to a predetermined cell identity. In that sense, there is not much predictive power there. We could not really use it to determine the optimal number of clusters in a newly sequenced population with unknown cell identities, could we?
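For completeness, the index itself is easy to compute once labels are known, e.g. with mclust's adjustedRandIndex() (used here in place of ARANDI; it needs known identities just the same, and `known_ids` is an assumed vector of predetermined labels per cell):

```r
library(mclust)

# Compare one round of pcaReduce assignments to known cell identities
ari <- adjustedRandIndex(out[[1]][, 1], known_ids)
ari  # 1 = perfect agreement; ~0 = what random assignment would give
```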
So in the end, how do we know we have the correct number of clusters in our data?