Integrated analysis of scRNA, HTO, sgRNA data : ECCITE-seq Part 2

Athena Chu.
May 3, 2019
4 min read

Okay, so this week I looked into another ECCITE-seq data: the K562 cells infected with a CRISPr library, tagged with cell-hashing antibodies, as well as labelled with surface protein CD29 and CD46. My aims here are 1) understand how the cells were clustered using different data, and 2) look into how sgRNA infection may change gene expression. Since now I am quite good at putting the Seurat subject together, I hastily put all the datasets into one Seurat subject.

‘’’

K_cdna<-fread("GSM3596090_K-cDNA.txt.gz", sep=" ", data.table=F)

### test if CD29 and CD46 expressed after CRISPr ###

K_ADT<-fread("GSM3596091_K-ADT-count.csv.gz", sep=",", data.table=F)

### HTO29, HTO30, HTO44 cell hashing ###

K_HTO<-fread("GSM3596092_K-HTO-count.csv.gz", sep=",", data.table=F)

### sgRNA count ###

K_GDO<-fread("GSM3596093_K-GDO-count.csv.gz", sep=",", data.table=F)

### process data.table, assign col 1 as row.names ###

process_data_table<-function(x){

rownames(x)<-x$V1

x<-x[, 2:ncol(x)]

return(x)

}

K_cdna<-process_data_table(K_cdna)

K_ADT<-process_data_table(K_ADT)

K_HTO<-process_data_table(K_HTO)

K_GDO<-process_data_table(K_GDO)

### Use cell hashing technique to remove doublets ###

### assign cells to one guide ###

### compare sgRNA and cDNA aboundance ###

### analysis intersect of data ###

K.bc<-c(Reduce(intersect, list(colnames(K_cdna),colnames(K_HTO),colnames(K_GDO), colnames(K_ADT))))

### 6274 common cell ID ###

### Create Seurat object ##

K_seurat<-CreateSeuratObject(assay = "RNA", counts=K_cdna[, K.bc], project = "K562_eccite-seq")

K_seurat<-NormalizeData(K_seurat, assay = "RNA", normalization.method = "LogNormalize", scale.factor = 10000)

K_seurat[["HTO"]]<-CreateAssayObject(counts = K_HTO[1:3, K.bc])

K_seurat[["ADT"]]<-CreateAssayObject(counts = K_ADT[1:2, K.bc])

K_seurat[["GDO"]]<-CreateAssayObject(counts = K_GDO[1:13, K.bc])

### CLR transform HTO, ADT and GDO objects ###

K_seurat<-NormalizeData(K_seurat, assay = "HTO", normalization.method = "CLR")

K_seurat<-ScaleData(K_seurat, assay = "HTO", do.scale = T, do.center = T)

K_seurat<-NormalizeData(K_seurat, assay = "ADT", normalization.method = "CLR")

K_seurat<-ScaleData(K_seurat, assay = "ADT", do.scale = T, do.center = T)

K_seurat<-NormalizeData(K_seurat, assay = "GDO", normalization.method = "CLR")

K_seurat<-ScaleData(K_seurat, assay = "GDO", do.scale = T, do.center = T)

‘’’

I log-normalised the expression data and performed CLR normalisation on other datasets as recommended in the methods. First, I used HTOdemux() to identify singlet cells. According to the methods, HTOdemux first performs a k-medoids clustering where k is the sum of cell-hashtag used plus one (negative cluster). Then, Seurat uses the negative cluster to develop thresholds for classification of which hashtag belong to the cell, and whether the cell is a singlet or a doublet.

‘’’

K_seurat<-HTODemux(K_seurat, assay = "HTO", nstarts=20)

table(K_seurat@meta.data$HTO_classification)

RidgePlot(K_seurat, features=c("HTO29-5P-TATCACATCGGT", "HTO30-5P-CACATAATGACG", "HTO44-5P-TAACGACGTGGT"), ncol=2)+NoLegend()

## 737 negative cells? ##

### is that possible to perform k-means clustering for the HTO with 4 clusters - to identify the negative cluster? ###

HTO_kmeans<-kmeans(t(K_seurat@assays$HTO@scale.data), 4)

### plot HTO_data? ###

K_HTO_subset<-RunPCA(K_seurat, assay = "HTO", npcs = 4,

features = c("HTO29-5P-TATCACATCGGT", "HTO30-5P-CACATAATGACG", "HTO44-5P-TAACGACGTGGT"))

library(gridExtra)

grid.arrange(

DimPlot(K_HTO_subset, cells = names(HTO_kmeans$cluster[HTO_kmeans$cluster==1])) + NoLegend()+ggtitle("1")+ylim(-4, 6)+ xlim(-4, 4),

DimPlot(K_HTO_subset, cells = names(HTO_kmeans$cluster[HTO_kmeans$cluster==2])) + NoLegend()+ggtitle("2")+ylim(-4, 6)+ xlim(-4, 4),

DimPlot(K_HTO_subset, cells = names(HTO_kmeans$cluster[HTO_kmeans$cluster==3])) + NoLegend()+ggtitle("3")+ylim(-4, 6)+ xlim(-4, 4),

DimPlot(K_HTO_subset, cells = names(HTO_kmeans$cluster[HTO_kmeans$cluster==4])) + NoLegend()+ggtitle("4")+ylim(-4, 6)+ xlim(-4, 4))

### kmeans included more negatives than Seurat HTOdemux ###

### the duplets in kmeans clustering is also kinda ambiguous ###

‘’’

In parallel, I clustered the cells using K-means clustering with K=4. I can see that both methods agreed with each other very well and HTOdemux is extremely sensitive to doublets. I think such clear clustering was because of the very sensitive HTO detection methods; the raw HTO counts seemed to always have one dominate HTO possessing counts that is of magnitude higher than other HTOs. And using the RunPCA() function, we can see that the HTOs project along three different direction and the negative cells are centered around space at [0, 0].

Next, we looked into the sgRNA infection dataset. Running PCA and FindClusters() did return the correct number of clusters corresponding to the number sgRNAs. I think I could use MULTIseqDemux() to assign with sgRNA infected which cells too, shall try it next time.

‘’’

### GDO clustering ###

K_GDO_subset<-RunPCA(K_seurat, assay = "GDO", npcs = 10,

features = rownames(K_seurat@assays$GDO@counts))

K_GDO_subset<-RunTSNE(K_GDO_subset, dims = 1:5)

DimPlot(K_GDO_subset)

FeaturePlot(K_GDO_subset, features = rownames(K_GDO_subset@assays$GDO@counts), ncol=3)+theme(title = element_text(size=4))

‘’’

However, the clustering did not alwayfollowed the sgRNAs read counts; there are always some small satillete clusters scattered around in addition to the main sgRNA cluster for each sgRNA.

Then, I turned to the RNA data. FindClusters() identified 9 clusters based on the RNA data but FindMarkers() could not find any DE genes! A caution note that the clustering is rely on rare events that occur in merely less than 10% of cells in the cluster, which was inferred by FindNeighbor(); real biological differences may not really exist. Another indirect conclusion was that the sgRNA infection did not really impose profound transcriptomic changes to the K562 cells.

‘’’

### somehow interesting as some sgRNA located at scattered clusters in t-SNE ###

K_RNA_subset<-FindVariableFeatures(K_seurat, assay = "RNA")

K_RNA_subset<-ScaleData(K_RNA_subset, assay="RNA", do.center = T, do.scale = T)

K_RNA_subset<-RunPCA(K_RNA_subset, assay="RNA", npcs=30, features=K_RNA_subset@assays$RNA@var.features)

K_RNA_subset<-FindNeighbors(K_RNA_subset, assay="RNA")

K_RNA_subset<-FindClusters(K_RNA_subset, assay="RNA")

DimPlot(K_RNA_subset, reduction="pca")

### 9 clusters ###

K_RNA_subset<-JackStraw(K_RNA_subset, dims=10, assay = "RNA")

K_RNA_subset<-ScoreJackStraw(K_RNA_subset)

JackStrawPlot(K_RNA_subset)

K_RNA_subset<-RunTSNE(K_RNA_subset, eduction = "pca", dims=1:5)

DimPlot(K_RNA_subset)

FeaturePlot(K_RNA_subset, features = rownames(K_RNA_subset@assays$GDO@counts), ncol=3)+NoLegend()

### What made these K652 ddifferent

K_RNA_subset_markers<-FindAllMarkers(K_RNA_subset, assay = "RNA", min.pct = 20, only.pos = T)

### Warning: No DE genes identified!!!

‘’’

Finally, I repeated the analysis that evaluate the impact of sgRNA infection to RNA and protein level of JAK1, and CD29. For JAK1 genes being targeted by three different sgRNAs, we can see that RNA levels did experience a 2 fold difference when we set the threshold of sgRNA level at 2. We can also see that JAK1.1 seemed to be more effective in infecting cells than JAK1.2 and JAK1.3.

On another hand, CD29 expression (ITGB1 expression level) and CD29 protein level were both affected by CD29 targeting sgRNAs. Significant changes is observed for cells possessing sgRNA expression level at 4. Of note, CD29 protein staining level correlated with ITGB1 expression at 0.25, r2 = 0.06. Therefore, seeing such significant effects of sgRNA on both protein and gene expression for CD29 is quite impressive.

To conclude, I am impressed with the good signals for the HTO, GDO and ADT datasets generated by ECCITE-seq. And I can see that the data analyses is quite straight forwards and I can start to see that we can now link up many different information at single-cell level, thanks to this new technological breakthrough. One complaint is that each Seurat subject appeared to be able to store one set of reductions and is incapable to perform t-SNE for different assays; so what the points of storing multiple assays into one object?

Next week, I will crack on to analyse the control and CTCL dataset, that will probably be a ICC analysis, We will see how HTO tagging can help remove batch effects and give us better signals.

Integrated analysis of scRNA, HTO, sgRNA data : ECCITE-seq Part 2

Recent Posts

Comments