You may click on the following links to fast forward to the codes directly producing the Figures:
Back to index page.
This vignette illustrates the steps by which the query data
(Smartseq2) was pre-processed. It also reproduces some of the
Supplementary Figures in the original publication involved with the
pre-processing of query data (Lin et al., 2022).
Here we only show how plain plots were generated directly from data. To
reach the final illustrations in the publication, these plots may
require further editing in graphic design softwares (i.e. Adobe
Illustrator). Those beautification steps are not central to our analysis
and therefore will be omitted.
The input file of this vignette is a feature-barcode-count table
(GSE197743_AAV_MG_SMARTSEQ2_counts.tsv.gz) produced by the
Snakemake pipeline ScRNAseq_smkpipe_at_Luolab.
The output is a processed .h5Seurat with metadata slot
filled.
Microglia were isolated from adult mice (~8w) using a method based on the “cold-mechanical dissociation” protocol (Bennett et al., 2016)1. The steps involving FACS sorting from the original protocol was omitted to minimize ex vivo activation of microglia.
We found that the single cell transcriptome acquired by 10X Chromium (3’ Chemistry v3) captured transcripts of the AAV-packaged transgene (mScarlet) at a very low rate (data not published). Therefore we carried out deep single cell sequencing using a modified protocol based on a customized Smart-seq2 protocol (Zhang et al., 2020)2 on microglia isolated from mice after stereotaxic AAV delivery. The Smart-seq2 based method captured mScarlet transcripts with substantially higher efficiently.
To characterize the microglia transcriptomes under quiescent and immune activation states, we isolated microglia from mice subjected to Saline, Poly(i:c) and LPS treatment (i.p.), and performed scRNA-seq using the 10X Chromium platform.
Upstream analysis pipelines (i.e. sequence alignment, generation of digital expression matrix, etc.):
cellranger count command and then
aggregated using the cellranger aggr command.Downstream analysis:
scater)Seurat)Seurat)Load reuiqred libraries.
suppressPackageStartupMessages({
library(Seurat)
library(SeuratDisk)
library(scater)
library(tidyverse)
library(RColorBrewer)
})
If you haven’t done so, download data from GEO.
if(!file.exists('../data/GSE197743/GSE197743_AAV_MG_SMARTSEQ2_counts.tsv.gz')){
library(GEOquery)
options(timeout = 6000) ## In case of slow Internet connections
getGEOSuppFiles(GEO = "GSE197743", makeDirectory = T, baseDir = '../data/')
}
Load the data matrix to workspace.
data_long <- read.table(file = '../data/GSE197743/GSE197743_AAV_MG_SMARTSEQ2_counts.tsv.gz', sep = "\t", header = TRUE) # Raw table produced by ScRNAseq_smkpipe_at_Luolab
data <- pivot_wider(data_long, names_from = "cell", values_from = "count", values_fill = 0) # Expression matrix
Rename Ensembl ID to Gene symbol.
# `../miscs/MM.GRCm38.102.annotation.tab` is generated by running `../misc/awk_extract_gtf.sh` on the annotation file (Mus_musculus.GRCm38.102.chr.mScarlet_WPRE.gtf)
anno <- read.table(file = '../miscs/MM.GRCm38.102.annotation.tab', sep = "\t", col.names = c("ensembl_id", "symbol"))
# Custom function to rename rows from dataframe
renameRows <- function(df, anno){
stopifnot(is.data.frame(df))
stopifnot("ensembl_id" %in% colnames(anno) | "symbol" %in% colnames(anno))
if(any(!rownames(df) %in% anno$ensembl_id)){
df <- dplyr::slice(df, -which(!rownames(df) %in% anno$ensembl_id)) # df may contain rows not in the annotation table. Remove these rows.
}
rn <- rownames(df)
rn <- anno[match(rn,anno$ensembl_id),"symbol"]
dup <- rn[duplicated(rn)]
print(paste0("The following genes names are duplicated: ", paste0(dup, collapse = ", "), ". Making unique."))
for (d in dup){
count = 0
for (i in 1:length(rn)){
if (rn[i] == d) {
count = count+1
if (count>1){
rn[i] = paste0(rn[i],"-",count)
}
}
}
}
rownames(df) <- rn
return(df)
}
data <- data %>% column_to_rownames(var = "gene") %>% as.data.frame()
data <- renameRows(data, anno)
## [1] "The following genes names are duplicated: 1700030C10Rik, Gm16701, Gm18645, 4933434E20Rik, Gm4430, Gm16499, Zkscan7, Gm35558, Gm35558, Aldoa, 4930594M22Rik, Pcdha11, Gm35438, Gm5089, Gm16364, Gm36638, Pakap. Making unique."
Transgene (mScarlet) should be moved to metadata slot to avoid its impact on downstream analyses.
ms <- data.frame(t(data["mScarlet",])); colnames(ms) <- "mScarlet"
data <- data[-which(rownames(data) == "mScarlet"),]
Create Seurat object.
cells <- CreateSeuratObject(data, project = "AAV_Microglia", names.field = 2, names.delim = "_", meta.data = ms)
cells$batch <- cells$orig.ident
cells$id <- "Query"
cells$tech <- "Smart-seq2"
The metadata slot is updated based on the experiment design.
If a cell contains mScarlet transcript, it is denoted as
Transduced, otherwise it is denoted as
Untransduced.
Idents(cells) <- "orig.ident"
cells <- SetIdent(cells, cells = WhichCells(cells, expression = mScarlet > 0), value = "Transduced")
cells <- SetIdent(cells, cells = WhichCells(cells, expression = mScarlet == 0), value = "Untransduced")
cells[["transduction"]] <- Idents(cells)
Update batch labels.
Two AAV variants were used for different experiments (AAV-MG1.2 and
AAV-MG1.1).
Idents(cells) <- "orig.ident"
cells <- RenameIdents(cells,
`yt-8-5` = "AAV-MG1.1", `yt-8-5-2` = "AAV-MG1.1", `yt-8-5-3` = "AAV-MG1.1",
`yt-8-5-4` = "AAV-MG1.1", `yt-8-5-6` = "AAV-MG1.1", `yt-8-5-5` = "AAV-MG1.1",
`yt-8-6-1` = "AAV-MG1.2", `yt-8-6-2` = "AAV-MG1.2", `yt-8-6-3` = "AAV-MG1.2",
`yt-8-6-4` = "AAV-MG1.2", `yt-8-6-5` = "AAV-MG1.2", `yt-8-6-6` = "AAV-MG1.2",
`yt-mgl-L1` = "LPS", `yt-mgl-L2` = "LPS", `yt-mgl-L3` = "LPS")
cells[["treatment"]] <- Idents(cells)
cells[["treatment"]] <- factor(cells$treatment,
levels = c("AAV-MG1.2", "AAV-MG1.1","LPS"))
Idents(cells) <- "treatment"
cells <- RenameIdents(cells, `AAV-MG1.1` = "AAV", `AAV-MG1.2` = "AAV")
cells[["group"]] <- Idents(cells)
Idents(cells) <- "treatment"
cells <- SetIdent(cells,
cells = WhichCells(cells,
expression = transduction == "Untransduced" &
treatment %in% c("AAV-MG1.1","AAV-MG1.2")),
value = "Untransduced")
cells[["transduce_serotype"]] <- Idents(cells)
cells[["transduce_serotype"]] <- factor(cells$transduce_serotype,
levels = c("Untransduced","AAV-MG1.2", "AAV-MG1.1","LPS"))
cells[["mScarlet_norm"]] <- log1p(cells[["mScarlet"]]/cells[["nCount_RNA"]]*1e4)
We use the scater3 package for quality control (QC) of the
single cell datasets
# cells <- DietSeurat(cells)
cells.sce <- as.SingleCellExperiment(cells)
cells.sce <- logNormCounts(cells.sce)
dim(cells.sce)
## [1] 30193 555
per.cell <- perCellQCMetrics(cells.sce,
subsets = list(Mito = grep("^mt-", rownames(cells.sce)),
Ribo = grep("\\bRp[sl]\\d+[^k]*\\b", rownames(cells.sce))))
qc.stats <- quickPerCellQC(per.cell, percent_subsets = c("subsets_Mito_percent","subsets_Ribo_percent"))
qc.stats$discard_size_complexity <- qc.stats$low_lib_size | qc.stats$low_n_features
qc.stats$discard_size_complexity <- qc.stats$low_lib_size | qc.stats$low_n_features
colData(cells.sce) <- cbind(colData(cells.sce), cbind(per.cell, qc.stats))
colSums(as.matrix(qc.stats))
## low_lib_size low_n_features high_subsets_Mito_percent
## 6 19 22
## high_subsets_Ribo_percent discard discard_size_complexity
## 17 54 19
cells.filtered <- cells.sce[,!qc.stats$discard]
keep_feature <- nexprs(cells.filtered, byrow=TRUE) > 0
## Remove mitochondrial and ribosomal genes which does not contribute to dimensionality reduction and clustering
keep_feature[grep("\\bRp[sl]\\d+[^k]*\\b", rownames(cells.sce))] <- FALSE
keep_feature[grep("mt-", rownames(cells.sce))] <- FALSE
cells.filtered <- cells.filtered[keep_feature,]
Back to Seurat object.
querydata <- as.Seurat(cells.filtered)
querydata[["nCount_originalexp"]] <- NULL; querydata[["nFeature_originalexp"]] <- NULL
querydata <- RenameAssays(querydata, originalexp = "RNA")
querydata$ident <- NULL
Cells expressing low level of the canonical microglia marker ‘Hexb’ were removed. These are putatively low quality cells.
querydata <- NormalizeData(querydata)
querydata <- subset(querydata, subset = Hexb > 2)
Raw data.
dims_use <- 1:25
cells <- cells %>%
NormalizeData() %>%
FindVariableFeatures() %>%
ScaleData() %>%
RunPCA(npcs = 50) %>%
FindNeighbors(dims = dims_use) %>%
RunUMAP(dims = dims_use)
Filtered data.
dims_use <- 1:25
querydata <- querydata %>%
NormalizeData() %>%
FindVariableFeatures() %>%
ScaleData() %>%
RunPCA(npcs = 50) %>%
FindNeighbors(dims = dims_use) %>%
FindClusters(resolution = 0.8) %>%
RunUMAP(dims = dims_use)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 488
## Number of edges: 24567
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.5671
## Number of communities: 5
## Elapsed time: 0 seconds
A general note on the UMAP visualizations:
Depending on the use of CPUs and versions of the umap
algorithm, the UMAP coordinates computed on different machines may not
entirely match what was shown in the original publication. This is a
known issue of
the umap algorithm. Nevertheless, the overall structure of
the UMAP embedding should be consistent, i.e. microglia from mouse
subjected to Saline, Poly(i:c) and LPS groups reside in different
clusters. We’ve tested this on different computing platforms and reached
the conclution that the shifts in UMAP coordinates does not affect our
biological interpretations.
as.data.frame(colData(cells.sce)[,c("nCount_RNA","nFeature_RNA","discard_size_complexity")]) %>%
ggplot(mapping = aes(x=nCount_RNA, y=nFeature_RNA)) +
geom_point(aes(color = discard_size_complexity), size = 1.2) +
# ggtitle("Reference Data (10x)") +
scale_color_manual(values = brewer.pal(n = 6,name = "Paired")[c(2,6)], labels = c("Kept","Removed"), name = "") +
scale_x_continuous(expand = c(0.15,0.1)) +
xlab("UMI counts") +
ylab("Gene counts") +
theme_classic() +
theme(
# plot.title = element_text(size = 12, face = "bold"),
axis.text = element_text(size = 16),
axis.title = element_text(size = 21),
legend.title = element_blank(),
legend.box.background = element_blank(),
legend.position = c(.95, .05),
legend.justification = c("right", "bottom"),
legend.text = element_text(size = 16),
legend.margin = margin(1, 5, 5, 5)
)
as.data.frame(colData(cells.sce)[,c("treatment","subsets_Mito_percent","high_subsets_Mito_percent")]) %>%
mutate(treatment = factor(treatment, levels = c("AAV-MG1.1","AAV-MG1.2","LPS"))) %>%
ggplot(mapping = aes(x=treatment, y=subsets_Mito_percent)) +
geom_violin() +
geom_jitter(mapping = aes(color = high_subsets_Mito_percent), size = 0.3, height = 0, width = 0.1) +
scale_x_discrete(labels = c("AAV-\nMG-1.1","AAV-\nMG-1.2","LPS")) +
scale_color_manual(values = brewer.pal(n = 6,name = "Paired")[c(2,6)]) +
xlab(NULL) +
ylab("Mitochondrial RNA %") +
theme_linedraw() +
theme(
legend.position = "none",
axis.text.y = element_text(size = 14),
axis.text.x = element_text(size = 16),
axis.title = element_text(size = 20),
)
as.data.frame(colData(cells.sce)[,c("treatment","subsets_Ribo_percent","high_subsets_Ribo_percent")]) %>%
mutate(treatment = factor(treatment, levels = c("AAV-MG1.1","AAV-MG1.2","LPS"))) %>%
ggplot(mapping = aes(x=treatment, y=subsets_Ribo_percent)) +
geom_violin() +
geom_jitter(mapping = aes(color = high_subsets_Ribo_percent), size = 0.3, height = 0, width = 0.1) +
scale_x_discrete(labels = c("AAV-\nMG-1.1","AAV-\nMG-1.2","LPS")) +
scale_color_manual(values = brewer.pal(n = 6,name = "Paired")[c(2,6)]) +
xlab(NULL) +
ylab("Ribosomal RNA %") +
theme_linedraw() +
theme(
legend.position = "none",
axis.text.y = element_text(size = 14),
axis.text.x = element_text(size = 16),
axis.title = element_text(size = 20),
)
umap_queryf1 <- FeaturePlot(cells, features = "Hexb") +
ggtitle("Hexb") +
theme(
plot.title = element_text(hjust = 0, size = 24),
axis.ticks = element_blank(),
axis.text = element_blank(),
axis.title = element_text(size = 18),
axis.line = element_line(arrow = grid::arrow(length = unit(0.3, "cm"),
ends = "last", type = "closed")),
legend.title = element_blank(),
legend.text = element_text(size = 16),
legend.position = c(.80, .25)
)
umap_queryf2 <- DimPlot(cells, cells.highlight = colnames(querydata)) +
NoLegend() +
ggtitle("Retained Cells") +
theme(
plot.title = element_text(hjust = 0, size = 24),
axis.ticks = element_blank(),
axis.text = element_blank(),
axis.title = element_text(size = 18),
axis.line = element_line(arrow = grid::arrow(length = unit(0.3, "cm"),
ends = "last", type = "closed"))
)
umap_queryf3 <- DimPlot(querydata, group.by = "treatment") +
scale_color_manual(values = brewer.pal(n = 12, name = "Paired")[c(12,8,6)],
labels = c("AAV-MG1.1", "AAV-MG1.2", "LPS")) +
ggtitle("Final Query") +
guides(color = guide_legend(override.aes = list(size=3), byrow = TRUE)) +
theme(
plot.title = element_text(hjust = 0, size = 24),
axis.ticks = element_blank(),
axis.text = element_blank(),
axis.title = element_text(size = 18),
axis.line = element_line(arrow = grid::arrow(length = unit(0.3, "cm"),
ends = "last", type = "closed")),
legend.title = element_blank(),
legend.text = element_text(size = 16),
legend.position = c(.55, .18),
legend.spacing.y = unit(0.25, 'cm')
)
In the figure below, we can clearly see that microglia from LPS
treated mice were primed and adopted a distinct transcriptomic state.
Later, we’ll use Label Transfer
to analyze them more closely.
umap_queryf1 | umap_queryf2 | umap_queryf3
Backup the processed Seurat object using h5 format.
## Avoid using factor while saving to other formats
querydata$orig.ident <- as.character(querydata$orig.ident)
querydata$batch <- as.character(querydata$batch)
querydata$transduction <- as.character(querydata$transduction)
querydata$treatment <- as.character(querydata$treatment)
querydata$group <- as.character(querydata$group)
querydata$transduce_serotype <- as.character(querydata$transduce_serotype)
querydata$seurat_clusters <- as.integer(as.character(querydata$seurat_clusters))
if(!dir.exists('../data/AAV_MG_scRNAseq')){dir.create('../data/AAV_MG_scRNAseq')}
SaveH5Seurat(querydata, filename = '../data/AAV_MG_scRNAseq/query_clean.h5Seurat', overwrite = T, verbose = F)
Convert('../data/AAV_MG_scRNAseq/query_clean.h5Seurat', dest = "h5ad", overwrite = T, verbose = F)
sessionInfo()
## R version 4.0.5 (2021-03-31)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.4 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=zh_CN.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=zh_CN.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=zh_CN.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=zh_CN.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] RColorBrewer_1.1-3 forcats_0.5.1
## [3] stringr_1.4.0 dplyr_1.0.9
## [5] purrr_0.3.4 readr_2.1.2
## [7] tidyr_1.2.0 tibble_3.1.7
## [9] tidyverse_1.3.1 scater_1.18.6
## [11] ggplot2_3.3.6 SingleCellExperiment_1.12.0
## [13] SummarizedExperiment_1.20.0 Biobase_2.50.0
## [15] GenomicRanges_1.42.0 GenomeInfoDb_1.26.7
## [17] IRanges_2.24.1 S4Vectors_0.28.1
## [19] BiocGenerics_0.36.1 MatrixGenerics_1.9.0
## [21] matrixStats_0.62.0 SeuratDisk_0.0.0.9020
## [23] sp_1.5-0 SeuratObject_4.1.0
## [25] Seurat_4.1.1
##
## loaded via a namespace (and not attached):
## [1] utf8_1.2.2 reticulate_1.25
## [3] tidyselect_1.1.2 htmlwidgets_1.5.4
## [5] grid_4.0.5 BiocParallel_1.24.1
## [7] Rtsne_0.16 munsell_0.5.0
## [9] codetools_0.2-18 ica_1.0-2
## [11] future_1.26.1 miniUI_0.1.1.1
## [13] withr_2.5.0 spatstat.random_2.2-0
## [15] colorspace_2.0-3 progressr_0.10.1
## [17] highr_0.9 knitr_1.39
## [19] rstudioapi_0.13 ROCR_1.0-11
## [21] tensor_1.5 listenv_0.8.0
## [23] labeling_0.4.2 GenomeInfoDbData_1.2.4
## [25] polyclip_1.10-0 farver_2.1.0
## [27] bit64_4.0.5 parallelly_1.32.0
## [29] vctrs_0.4.1 generics_0.1.2
## [31] xfun_0.31 R6_2.5.1
## [33] ggbeeswarm_0.6.0 rsvd_1.0.5
## [35] hdf5r_1.3.5 bitops_1.0-7
## [37] spatstat.utils_2.3-1 DelayedArray_0.16.3
## [39] assertthat_0.2.1 promises_1.2.0.1
## [41] scales_1.2.0 rgeos_0.5-9
## [43] beeswarm_0.4.0 gtable_0.3.0
## [45] beachmat_2.6.4 globals_0.15.0
## [47] goftest_1.2-3 rlang_1.0.2
## [49] splines_4.0.5 lazyeval_0.2.2
## [51] spatstat.geom_2.4-0 broom_0.8.0
## [53] yaml_2.3.5 reshape2_1.4.4
## [55] abind_1.4-5 modelr_0.1.8
## [57] backports_1.4.1 httpuv_1.6.5
## [59] tools_4.0.5 ellipsis_0.3.2
## [61] spatstat.core_2.4-4 jquerylib_0.1.4
## [63] ggridges_0.5.3 Rcpp_1.0.8.3
## [65] plyr_1.8.7 sparseMatrixStats_1.7.0
## [67] zlibbioc_1.36.0 RCurl_1.98-1.7
## [69] rpart_4.1-15 deldir_1.0-6
## [71] pbapply_1.5-0 viridis_0.6.2
## [73] cowplot_1.1.1 zoo_1.8-10
## [75] haven_2.5.0 ggrepel_0.9.1
## [77] cluster_2.1.3 fs_1.5.2
## [79] magrittr_2.0.3 RSpectra_0.16-1
## [81] data.table_1.14.2 scattermore_0.8
## [83] lmtest_0.9-40 reprex_2.0.1
## [85] RANN_2.6.1 fitdistrplus_1.1-8
## [87] hms_1.1.1 patchwork_1.1.1
## [89] mime_0.12 evaluate_0.15
## [91] xtable_1.8-4 readxl_1.4.0
## [93] gridExtra_2.3 compiler_4.0.5
## [95] KernSmooth_2.23-20 crayon_1.5.1
## [97] htmltools_0.5.2 mgcv_1.8-40
## [99] later_1.3.0 tzdb_0.3.0
## [101] lubridate_1.8.0 DBI_1.1.2
## [103] dbplyr_2.2.0 MASS_7.3-57
## [105] Matrix_1.4-1 cli_3.3.0
## [107] igraph_1.3.2 pkgconfig_2.0.3
## [109] plotly_4.10.0 scuttle_1.0.4
## [111] spatstat.sparse_2.1-1 xml2_1.3.3
## [113] vipor_0.4.5 bslib_0.3.1
## [115] XVector_0.30.0 rvest_1.0.2
## [117] digest_0.6.29 sctransform_0.3.3
## [119] RcppAnnoy_0.0.19 spatstat.data_2.2-0
## [121] rmarkdown_2.14 cellranger_1.1.0
## [123] leiden_0.4.2 uwot_0.1.11
## [125] DelayedMatrixStats_1.12.3 shiny_1.7.1
## [127] lifecycle_1.0.1 nlme_3.1-158
## [129] jsonlite_1.8.0 BiocNeighbors_1.8.2
## [131] viridisLite_0.4.0 fansi_1.0.3
## [133] pillar_1.7.0 lattice_0.20-45
## [135] fastmap_1.1.0 httr_1.4.3
## [137] survival_3.3-1 glue_1.6.2
## [139] png_0.1-7 bit_4.0.4
## [141] stringi_1.7.6 sass_0.4.1
## [143] BiocSingular_1.6.0 irlba_2.3.5
## [145] future.apply_1.9.0
Bennett, M. L. et al. New tools for studying microglia in the mouse and human CNS. Proc Natl Acad Sci USA 113, E1738 (2016).↩︎
Zhang, S. et al. Single-cell transcriptomics identifies divergent developmental lineage trajectories during human pituitary development. Nat Commun 11, 5275 (2020).↩︎
McCarthy, D. J., Campbell, K. R., Lun, A. T. L. & Wills, Q. F. Scater: pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R. Bioinformatics btw777 (2017) doi:10.1093/bioinformatics/btw777.↩︎