Athena Chu

Remove batch effects, just to make things more complicated

NB: This is just a commentary

It all began when I was looking into the latest Dynverse paper (https://dynverse.org/), now published in Nature Biotech. It is a neat piece of work that standardizes scRNA-seq trajectory inference tools, making them extremely easy to use and to compare with each other.

I was trying to replicate the trajectory inference of the fibroblast reprogramming dataset (GSE67310) (Fig. 6B in the paper) using different packages when I came across an issue: the output clustering did not share the same shape as the figure. The likely cause was a lack of filtering and normalisation of the input expression data. Even though the paper specified a typical preprocessing pipeline via scater and scran, the authors mentioned that they handled the data according to the original methods published along with the dataset. So the next logical step was to think about possible ways to normalise the data. On top of applying what the dynverse paper advised (include genes > 3 median absolute deviations, expressed in at least 5% of cells), I found that I might have to supply a normalised expression dataset for trajectory inference.
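
To make the gene filter more concrete, here is a minimal Python sketch of the "expressed in at least 5% of cells" criterion only. It is not the dynverse code: the function name `filter_genes`, the toy gene names, and the choice to count a value above zero as "expressed" are all assumptions made for illustration. The median absolute deviation criterion is left out, since the statistic it applies to is not spelled out here.

```python
def filter_genes(expression, min_fraction=0.05):
    """Keep genes detected in at least `min_fraction` of cells.

    `expression` maps a gene name to its list of per-cell values.
    Assumption: a gene counts as "expressed" in a cell when its value is > 0.
    """
    kept = {}
    for gene, values in expression.items():
        detected = sum(1 for v in values if v > 0)
        if detected / len(values) >= min_fraction:
            kept[gene] = values
    return kept


# Hypothetical toy data with 20 cells, so 5% corresponds to exactly one cell.
toy = {
    "GeneA": [0.0] * 19 + [3.2],  # detected in 1 of 20 cells
    "GeneB": [0.0] * 20,          # never detected
    "GeneC": [1.5] * 20,          # detected in every cell
}
filtered = filter_genes(toy)
print(sorted(filtered))
```

On this toy input the filter keeps GeneA and GeneC and drops GeneB. In practice the equivalent step would be done in R on the expression matrix before it goes into trajectory inference.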


The fibroblast reprogramming dataset (Treutlein et al. 2016) consists of FPKM gene expression values, with spike-ins available. The dataset contains cells collected at multiple time points, both sorted and unsorted. So basically, the dataset was stitched together from a series of sequencing experiments, and it is certainly confounded by batch effects. I saw that the authors had done a very decent job in the data analysis and managed to retrieve a similar shape for the single-cell clustering in Extended Fig. 6; still, I wonder if there is a consensus way to remove batch effects.


There are quite a few common pipelines to remove batch effects; the more common ones are scater normaliseExprs(), sva ComBat() and the limma removeBatchEffect() function. Both ComBat() and limma removeBatchEffect() require a batch variable that defines which cell belongs to which batch. I wonder if limma removeBatchEffect() would over-correct the data, as the batch variable would coincide with the different time points of cell harvest. Scater ICA only removes cell-cycle related noise, ignoring other latent factors contributing to the extreme cell heterogeneity. So there are only scater normaliseExprs() and sva ComBat() left to choose from, and I first have to understand what they do and how they model batch effects.
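
The over-correction worry can be illustrated with a small Python sketch. It uses plain per-batch mean centering as a stand-in for batch correction; this is a deliberate simplification and not the actual implementation of limma removeBatchEffect() or sva ComBat(). The function `center_batches`, the gene values, and the batch labels are all hypothetical.

```python
from statistics import mean


def center_batches(values, batches):
    """Shift every batch so that its mean equals the overall mean.

    A simplified stand-in for batch correction of a single gene.
    """
    overall = mean(values)
    batch_means = {
        b: mean(v for v, label in zip(values, batches) if label == b)
        for b in set(batches)
    }
    return [v - batch_means[b] + overall for v, b in zip(values, batches)]


# One hypothetical gene that rises over time. Each time point was
# sequenced as its own batch, so batch and time are fully confounded.
expression = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
batches = ["day2", "day2", "day2", "day5", "day5", "day5"]

corrected = center_batches(expression, batches)
gap_before = mean(expression[3:]) - mean(expression[:3])
gap_after = mean(corrected[3:]) - mean(corrected[:3])
print(gap_before, gap_after)
```

Before correction the two time points differ by about 4 units; after centering the difference is essentially zero. When the batch label is the same thing as the time point, removing the "batch effect" also removes the temporal signal that trajectory inference depends on.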


So what are sva and f-scLVM? SVA attempts to identify the latent factors causing batch effects, which can be attributed to both biological and technical processes. f-scLVM is a latent factor correction tool based on SVA, but it characterizes two sets of latent factors: annotated (gene-based) and hidden factors. It accommodates user-defined REACTOME gene sets for batch effect correction and assigns the residual batch effects to multiple hidden factors. Unfortunately, the f-scLVM model is no longer compatible with the scater (v1.12.1, R 3.6.0) SingleCellExperiment object, and normaliseExprs() seems to have been phased out. Still, it was interesting to go through both the f-scLVM and sva tutorials to learn more about how to interrogate batch effects, use the latent factors to correct expression values, and select highly variable genes. I have also learnt new things about design matrices and contrast matrices.
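
For the design matrix and contrast matrix, a tiny Python sketch may help. It builds a one-hot design matrix without an intercept (one column per group), fits it to one gene, and applies a contrast between two groups. This only mirrors the general idea behind the limma workflow; the helper names, group labels, and expression values are hypothetical.

```python
from statistics import mean


def design_matrix(groups):
    """One-hot design matrix without intercept: one column per group level."""
    levels = sorted(set(groups))
    rows = [[1 if g == level else 0 for level in levels] for g in groups]
    return levels, rows


def fit_group_means(values, rows):
    """Least-squares fit for a one-hot design.

    With one indicator column per group and no intercept, the fitted
    coefficients are simply the group means.
    """
    n_cols = len(rows[0])
    return [
        mean(v for v, row in zip(values, rows) if row[j] == 1)
        for j in range(n_cols)
    ]


# Hypothetical labels and values for a single gene in four cells.
groups = ["fibroblast", "fibroblast", "neuron", "neuron"]
expression = [2.0, 4.0, 7.0, 9.0]

levels, X = design_matrix(groups)
coefficients = fit_group_means(expression, X)

# Contrast "neuron minus fibroblast", in the column order given by `levels`.
contrast = [-1, 1]
effect = sum(c * b for c, b in zip(contrast, coefficients))
print(levels, X, coefficients, effect)
```

The design matrix says which cell belongs to which group, the coefficients summarise each group, and the contrast picks out the comparison of interest, here a difference of 5 units between the two group means.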


I have now learnt more about f-scLVM in the slalom package, as well as the sva and limma packages. Thanks to these tutorials, I will hopefully be able to apply them in my analysis using dyno.


Tutorials:


A few more examples of using f-scLVM:

