It looks like Euclidean and Manhattan both give a lower within-cluster distance and a higher between-cluster distance, whereas the inner-product distance stays the same. However, does that mean Euclidean and Manhattan distance would be the better choice for k-means clustering? Or should I expect the inner-product distance to generate a completely different clustering pattern?
One of the key steps in analysing scRNA-seq data is some form of clustering. To fully explore the uniqueness of each cell and identify subpopulations, it is sensible to first group the more similar cells (in terms of their expression profiles). After we have divided the cells into subpopulations, we can proceed to identify the differentially expressed genes marking each novel subpopulation (aka gene markers).
That all sounded quite straightforward until I encountered the notion that we can cluster the cells in quite a few different ways! What makes things more complicated is that there is also a choice of underlying method for determining cell-cell dissimilarities, and that choice comes before we even move on to clustering. Is it important to understand how to calculate cell-cell distance before performing PCA and t-SNE? Not really, because one can cluster the cells based on the raw counts or the normalised counts. Still, it would be useful to know how the distance matrices are computed. Another seemingly common practice in scRNA-seq is to select a gene set prior to clustering; for example, Seurat selects highly variable genes before running PCA.
The exercise: Looking at how different distance metrics potentially affect clustering.
The task: compute distance matrices using a few popular functions and see whether cells within the same cluster sit closer together, and cells from different clusters further apart.
So let's start with the 2,700-cell PBMC dataset available in the Seurat tutorial. I went through the tutorial up to the FindVariableGenes step. I changed the parameter to y.cutoff = 2 to get 652 highly variable genes (HVGs) that would be used for PCA in the later step. But now I will deviate and try to calculate cell-cell distance using different distance metrics.
Meanwhile, I went through the whole tutorial to get the subpopulation assignment of each cell (so that I know which cluster each cell is assigned to). The rationale is that cells from the same cluster should be more similar (shorter distance), and the metric that provides the shortest within-group distance and the largest between-group distance should be the best one to use for clustering. Well, I don’t know whether Seurat gives options for computing the cell-cell distance matrix (maybe it is predetermined in the SLM implemented in FindClusters?). There is still a lot of background to learn before I understand how these functions are written!
Anyhow, I managed to get five pretty groups of cells (quite different from the tutorial, perhaps because of the different genes I used?). So now we can move on to exploring the distance matrix computation!
First, I extracted the normalised expression counts of the 652 HVGs for the 2,700 cells. Then I wanted to systematically compute the difference in expression of each gene between every pair of cells, which gives a distance matrix. I found that the R package philentropy gave me a lot of options for computing a distance matrix. First things first, I transposed the expression matrix so that the columns are genes and the rows are cells, which is the layout needed to compute the multivariate distance matrix. I tried the "euclidean", "manhattan", "inner_product" and "pearson" methods.
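To make the four measures concrete, here is a minimal sketch in plain Python of what each one computes for a pair of cells. It is not the philentropy code: the function names and the toy matrix are made up for illustration, and "pearson" is taken here to mean one minus the Pearson correlation, which is how it is described in this post. Whether the package's "pearson" method computes exactly that is worth checking against its documentation.

```python
import math

def euclidean(x, y):
    # Square root of the summed squared per-gene differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    # Sum of the absolute per-gene differences.
    return sum(abs(a - b) for a, b in zip(x, y))

def inner_product(x, y):
    # Sum of per-gene products. Note this is a similarity:
    # larger values mean the two cells are MORE alike.
    return sum(a * b for a, b in zip(x, y))

def pearson_distance(x, y):
    # Assumption: "pearson" means 1 - Pearson correlation.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)

def distance_matrix(cells, metric):
    # cells: one row per cell, one column per gene.
    return [[metric(a, b) for b in cells] for a in cells]

# Hypothetical toy data: 3 cells x 4 genes (not the PBMC values).
cells = [
    [0.0, 1.0, 2.0, 0.0],
    [0.0, 2.0, 4.0, 0.0],
    [3.0, 0.0, 0.0, 1.0],
]

for metric in (euclidean, manhattan, inner_product, pearson_distance):
    print(metric.__name__, distance_matrix(cells, metric))
```

In the toy data the second cell is just the first cell doubled, so the correlation-based distance between them is zero even though their Euclidean and Manhattan distances are not. That already hints at why the metrics can rank the same pair of cells differently.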
There are two concerns to begin with: 1) I don’t know how these methods deal with 0 entries (resulting from dropouts or “unexpressed” genes); 2) it looks like the package takes the log of all values and I could not switch that setting off.
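On the first concern, one thing can be checked without relying on any package: what happens to a distance when two cells share extra genes that are 0 in both. The sketch below uses hand-written Euclidean and Pearson correlation functions and made-up numbers, so it says nothing about how philentropy itself treats zeros. It only shows the arithmetic of the definitions.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Two hypothetical cells measured on 3 expressed genes.
x = [1.0, 3.0, 2.0]
y = [2.0, 1.0, 4.0]

# The same two cells with 5 extra genes that are 0 in both.
pad = [0.0] * 5
x0, y0 = x + pad, y + pad

# A shared zero adds (0 - 0)^2 = 0, so Euclidean is unchanged.
print(euclidean(x, y), euclidean(x0, y0))

# Shared zeros shift both means and count as agreeing points,
# so the correlation moves (here from negative to positive).
print(pearson_r(x, y), pearson_r(x0, y0))
```

So shared zeros are invisible to Euclidean distance (and, by the same argument, to Manhattan distance and the inner product), while they pull a correlation upwards. With many dropouts, that could make cells look more correlated than their expressed genes alone would suggest.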
Next, I applied a Z-transform to the distances so that I could compare between the methods. To assess performance, I took the cluster 2 vs cluster 2 distances as the within-cluster distance and the cluster 2 vs cluster 3 distances as the between-cluster distance. I picked these two clusters because they both have a moderate number of cells (382 and 261 cells in clusters 2 and 3, respectively) and looked quite far away from each other on the t-SNE I generated. The ideal method should, in theory, give a lower distance for cluster 2 vs cluster 2 than for cluster 2 vs cluster 3.
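The comparison can be sketched as follows. This is an illustration under stated assumptions rather than the exact procedure used here: I assume the Z-transform is taken over all pairwise distances of one metric, the function names are hypothetical, and the six toy cells stand in for the real clusters 2 and 3.

```python
import math
from statistics import mean, pstdev

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def zscore(values):
    # Centre and scale so different metrics share one scale.
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd for v in values]

def within_between(cells, labels, metric, a, b):
    """Mean z-scored distance within cluster a, and between a and b."""
    n = len(cells)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    # Assumption: z-score over ALL pairwise distances of this metric.
    z = zscore([metric(cells[i], cells[j]) for i, j in pairs])
    within = [zv for (i, j), zv in zip(pairs, z)
              if labels[i] == a and labels[j] == a]
    between = [zv for (i, j), zv in zip(pairs, z)
               if {labels[i], labels[j]} == {a, b}]
    return mean(within), mean(between)

# Hypothetical toy data: 3 cells labelled cluster 2, 3 labelled cluster 3.
cells = [
    [1.0, 0.0], [1.2, 0.1], [0.9, 0.2],
    [5.0, 4.0], [5.1, 4.2], [4.8, 3.9],
]
labels = [2, 2, 2, 3, 3, 3]

w, b = within_between(cells, labels, euclidean, 2, 3)
print("within 2 vs 2:", w, "between 2 vs 3:", b)
```

A metric that separates the two clusters well gives a within value clearly below the between value. One caveat when plugging in other measures: anything that is a similarity rather than a distance (the inner product, or a raw correlation) needs its sign flipped or a conversion such as one minus the value first, otherwise "lower is better" reads the wrong way round.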
It looks like Euclidean and Manhattan both give a lower within-cluster distance and a higher between-cluster distance, whereas the inner-product distance stays the same. The Pearson correlation distance metric outperforms the others, generating the clearest separation between within-cluster and between-cluster distances.
However, there are other distance methods I would like to explore (e.g. Mahalanobis distance). There is a lot to consider before doing any clustering! On top of that, if I generate different clustering patterns, does that allow me to interpret the data (i.e. finding DE genes) in a different way? And how do I validate whether a clustering pattern is genuine or an artefact?
Further readings:
Kim et al. (2018) Impact of similarity metrics on single-cell RNA-seq data clustering