Handling new type of data: ECCITE-seq Part 1

Athena Chu.
Apr 26, 2019
3 min read

So before losing myself in learning about diffusion maps and many other spectral clustering techniques, I am quite excite to see the ECCITE-seq is finally published in Nature with a few new features to expand the power and scope of information one can extract from single-cell sequencing experiment. It is just fascinating to see the simple solutions ECCITE-seq came up with to sequence sgRNA, antibody tags, and transcriptome in one single experiment. It also demonstrated that the gel-bead system is so versatile for numerous adaptations; I am intrigued to see what future developments will be implemented in the 10x genomics system.

I would also like to try out the new Seurat HTO-demux function just released this month to see the additional benefits do cell hash tagging bring into the single-cell data. Cell hash-tagging makes use of oligo-nucleotide conjugated antibodies (HTO); each hastag labelled a sample prior to pooling for subjecting to the 10X Chromium system, maximizing the barcoding capacity of 10X and removing batch effect at its root. The Seurat tutorial focused on handling the HTO tag. My aim is to go through the tutorial and apply the new function on the ECCITE-seq data to see how other types of data could be incorporated into the analysis.

After briefly go through the HTO tutorial, I would like to now try to analyse the uploaded data for ECCITE-seq (GEO:GSE126310). There are five sets of data uploaded in the GEO accession. First I am trying the smallest dataset sp.CDNA, sp.ADT, and sp.GDO. My aim here is to see how I could integrate multi-modal dataset in Seurat. The powerful new feature of Seurat v3.0 us that it can accommodate several assays in the object, in this case we can add cDNA, ADT and GDO on the same object.

‘’’

library(data.table)

sp.cda<-read.csv(gzfile("GSM3596081_sp-cDNA.txt.gz"), as.is = TRUE, sep=" ")

### 112137 genes (human + mouse) x 6623 cell barcodes

sp.adt<-fread("GSM3596082_sp-ADT-count.csv.gz", sep=",")

sp.GDO<-fread("GSM3596083_sp-GDO-count.csv.gz")

### only include cells present in the cdna data ###

sp.bc<-Reduce(intersect, list(colnames(sp_seurat@assays$RNA),sp.adt$cell,sp.GDO$cell))

### build Seurate object ###

sp_seurat<-CreateSeuratObject(counts = sp.cda[, sp.bc], assay = "RNA")

sp_seurat<-NormalizeData(sp_seurat)

sp_seurat<-ScaleData(sp_seurat)

### Run PCA

sp_seurat<-FindVariableFeatures(object = sp_seurat, assay = "RNA")

sp_seurat<-RunPCA(object = sp_seurat, assay = "RNA", npcs=30)

ElbowPlot(object = sp_seurat, ndims = 30)

sp_seurat<-FindNeighbors(object = sp_seurat, assay = "RNA", dims = 1:25)

sp_seurat<- FindClusters(object = sp_seurat, assay = "RNA", resolution = 0.8)

### found 13 communities ###

sp_seurat<-RunTSNE(object = sp_seurat, npcs=30)

DimPlot(object = sp_seurat, label = TRUE) + NoLegend()+

labs(title = "sp.cDNA")

‘’’

Note that sp.adt and sp.GDO appeared to have ways more cell-bracodes than the RNA dataset. Regardless, we only look into cells that are present in all three sets of data. We first perform the typical log normalisation and scaling for the RNA dataset, find highly variable features (“genes” in this case), then perform PCA and tSNE. Seurat introduced divided the FindCluster() step into 2 steps prior to the t-SNE clustering: FindNeighbors() and FindClusters(). Finally, we saw that there seurat determined that there were 13 clusters in the data.

Next, we can add in the ADT data into the Seurat experiment.

‘’’

### add assay to Seurat object ### features (row) x cells (col)###

sp.adt<-sp.adt[sp.adt$cell %in% sp.bc, ]

rownames(sp.adt)<-sp.adt$cell

input<-t(sp.adt[, 2:ncol(sp.adt)])

colnames(input)<-sp.adt$cell

sp_seurat[["ADT"]]<-CreateAssayObject(counts=input)

sp_seurat <- NormalizeData(sp_seurat, assay = "ADT", normalization.method = "CLR")

sp_seurat <- ScaleData(object = sp_seurat, assay = "ADT", do.scale = T, do.center = T)

‘’’

I have to say that I am reading about the CLR transformation; there seems to be debates on whether CLR transformation is suitable for compositional data. Another thing I kinda don’t understand is what ScaleData() was doing to the normalised data:

1. What is the default for the scaling feature, I mean didn’t we already transformed the data with CLR? Do we have to regress based on Library size, which sounds redundant?

‘’’

### add-hoc codes for integrated feature plot

p<-data.frame(sp_seurat@reductions$tsne@cell.embeddings) %>%

ggplot(aes(x=tSNE_1, y=tSNE_2, colour=sp_seurat@assays$ADT@scale.data["hCD29", ])) +

geom_point(alpha=0.2) + scale_color_gradient(low="white", high = "purple") + NoLegend()+

labs(title = "hCD29")

q<-data.frame(sp_seurat@reductions$tsne@cell.embeddings) %>%

ggplot(aes(x=tSNE_1, y=tSNE_2, colour=sp_seurat@assays$ADT@scale.data["mCD29", ])) +

geom_point(alpha=0.2) + scale_color_gradient(low="white", high = "purple") + NoLegend()+

labs(title = "mCD29")

library(cowplot)

plot_grid(p, q, align=c("hv"))

‘’’

The next thing I did was to see the power of antibodies tagging on single cell data, I found that the human and mouse CD29 antibodies separated well in the t-SNE, showing that the clustering based on RNA data is congruent with the antibodies tagging. That’s pretty cool. But I have a feeling that I am just doing three independent analyses on RNA, ADT and GDO data but not an integrated study as I would hope for. I have to say that this week I just managed to learn about these new functions and the new organisation of Seurat subject.

Next week I will work with the HTO tagged data as well as the sgRNA data and see how these additional information render more power and resolution in the 10X system.

Handling new type of data: ECCITE-seq Part 1

Recent Posts

Comments