Single-cell sequencing generate a lot of noises. One of the sources comes from sequencing low-quality cells. During the high-throughput sequencing procedure, cells being separated in the mricofuildic system could be stressed, broken and killed (aka. low quality cells). These cells possess transcriptomic profile reflecting their viability but not the phenotypes we are interested. In addition, empty droplets/beads also produce sequencing artifacts that compromise deferential expression (DE) studies.
Ideally, we can identifying a few key features marking these low quality cells and remove them systematically. But we yet have a quantitative way to do so. In the Seurat packages, and generally other single-cell tutorials I have read, they focus on the total UMI count, percentage of mitochondria encoded genes, and percentage of spike-in. The cut-off is drawn in a relative manner with an assumption that most cells sequenced are healthy. Wouldn’t it be nice that we have a standardized means to set the cutt-off?
Ilicic et al (2016) “Classification of low quality cells from single-cell RNA-seq data” presented a generic approach to detect low quality cells in scRNA-seq data. They used microscopic inspected datasets (BMDCs) and (mESCs), which each cell had annotated whether they are broken or not, to extract biological and technical features of low quality cells.
So lets download these cells from Array express as described: E-MTAB-2600 and E-GEOD-48968. The mESCs (E-MTAB-2600) was previously used to assess technical noise. The cells were exposed to three culture conditions and then sequenced using Smart-seq. I could find the mESCs fastq files but not the count table, that’s a bit cumbersome to start from fastq files.... Lets move on to E-GEOD-48968, there are 2425 mouse dendritic cells (DCs) sequenced in Smart-seq after simulated. This set of data appears to have the count table available. The phenotypic metadata is at supplementary file 2 (Shalek et al 2014). But this dataset only have cells as “Good” or “Unknown”, so that is not ideal either. I have found that only mES1 (E-MTAB-3749) has a count-table (possessed) and a phenotype profile containing dead cells (19 out of 673 cells!). That does raise a question on how well the machine learning algorithm cope with imbalance classes (35 dead cells out of 961 cells in the training data set). Oh no! But they only made the bulk RNA dataset available on Array Express! Ok, so looks like I cannot really reproduce the experiment with any of these datasets...
Running out of options, I decided to look into other supplementary files as I would like to know what genes or pathways are shown highly expressed in broken cells. It seems like that the GO category Membrane (GO:00160200) and Mitochondrion (GO:0005739) are both shown to be downregulated in broken cells (goAdjustedPvalueMean_DDownregul) the the supplementary data.
Luckily, they seem to have original data (mSE and training datasets) with the published cellity package in Github (https://github.com/Teichlab/cellity). So lets download it and see what they look like. I decided to follow the vignette step by step.
Cellity output image
The training_mES_dataset consists of two lists, one contain more features than the other. The key features that is used for conducting the PCA are Mapped %, Transcriptome variance, Cytoplasm %, mtDNA %, mitochondria_downreg %. At this point, knowing what genes are involved in each category would be very nice but i could not find these information anywhere (e.g. Metabolism GO:0008152 have 12650 gene entries, are all genes used or only a subset of them?). Emm, looks like the gene lists are hidden in the extract_features() function. Wow, so actually these features are generated from the python pipeline (https://github.com/Teichlab/celloline)! We will look at it later, lets finish off the cellity tutorial first. Given a valid stat file, we will be about to run cellity extract features and assess_cell_quality_PCA. It nicely return a table of cell name and whether they are healthy (0) or low-quality (1).
So now the key is to find how the features are determined in the python pipeline celloline. Reading through the code, the stat file is generated via the stat.py code. But the stat did not really output the some key features used for assess_cell_quality_PCA, namely Cytoplasm %, mtDNA %, mitochondria_downreg %. It does though output the features of “intergenic:, “intragenic” and “ambigious” count.
Summary:
1. Certain gene sets are good indicator to QC cell quality.
2. Once the statistics of these features are computed using celloline, one can infer the cell quality using cellity.
3. Would be nice to have a full list of the genes used to compute feature stat including Cytoplasm %, mitochondria_upreg %, etc.
Final thoughts:
This packages would be useful. But my biggest interest was to find out what genes were used to identify low-quality cells, I am not sure I have gained anythings from it. Still it is a good exercise!
Commentaires