Lin et al.: AAV Microglia

Analyze the Data

Load libraries and data

Load required libraries.

suppressPackageStartupMessages({
  library(SummarizedExperiment)
  library(org.Mm.eg.db)
  library(DESeq2)
  library(ggplot2)
  library(tidyverse)
  library(RColorBrewer)
  library(pheatmap)
})

If you haven’t done so, download data from GEO.

if(!file.exists('../data/GSE197743/GSE197743_AAV_MG_RNAseq.tsv.gz')){
  library(GEOquery)
  options(timeout = 6000)  ## In case of slow Internet connections
  getGEOSuppFiles(GEO = "GSE197743", makeDirectory = T, baseDir = '../data/')
}

Load the data table and associated metadata to workspace.

counts = as.matrix(read.table(file = gzfile("../data/GSE197743/GSE197743_AAV_MG_RNAseq.tsv.gz"), sep = "\t"))
coldata = read.csv(file = "../miscs/RNAseq_sample_table.csv", header = T)

# To avoid warnings at creating dds, recode factor levels to replace "/" by "_" or ".", and convert integer type to factor
coldata <- coldata %>%
  mutate(
    stimuli = recode(stimuli, `LPS_200ng/mL` = "LPS_200ng_mL", `IL4_20ng/mL` = "IL4_20ng_mL"),
    virus = recode(virus, 
                   `AAV-cap18-msc, moi=50,000` = "cap18_mSc_moi_50000")
  ) %>%
  mutate(
    batch = factor(batch),
    stimuli = factor(stimuli),
    virus = factor(virus)
    )
coldata

##    SampleName TIC      stimuli               virus batch lib_type
## 1    yt0616.1 trt        blank               untrt     1       pe
## 2    yt0616.2 trt        blank               untrt     1       pe
## 3    yt0616.3 trt        blank               untrt     1       pe
## 4    yt0616.4 trt LPS_200ng_mL               untrt     1       pe
## 5    yt0616.5 trt LPS_200ng_mL               untrt     1       pe
## 6    yt0616.6 trt LPS_200ng_mL               untrt     1       pe
## 7    yt0616.7 trt  IL4_20ng_mL               untrt     1       pe
## 8    yt0616.8 trt  IL4_20ng_mL               untrt     1       pe
## 9    yt0616.9 trt  IL4_20ng_mL               untrt     1       pe
## 10  yt0616.16 trt        blank cap18_mSc_moi_50000     1       pe
## 11  yt0616.17 trt        blank cap18_mSc_moi_50000     1       pe
## 12  yt0616.18 trt        blank cap18_mSc_moi_50000     1       pe
## 13 YT.1111.01 trt  IL4_20ng_mL cap18_mSc_moi_50000     2       se
## 14 YT.1111.02 trt  IL4_20ng_mL cap18_mSc_moi_50000     2       se
## 15 YT.1111.03 trt  IL4_20ng_mL cap18_mSc_moi_50000     2       se
## 16 YT.1111.04 trt LPS_200ng_mL cap18_mSc_moi_50000     2       se
## 17 YT.1111.05 trt LPS_200ng_mL cap18_mSc_moi_50000     2       se
## 18 YT.1111.06 trt LPS_200ng_mL cap18_mSc_moi_50000     2       se
## 19 YT.1111.07 trt  IL4_20ng_mL               untrt     2       se
## 20 YT.1111.08 trt  IL4_20ng_mL               untrt     2       se
## 21 YT.1111.09 trt  IL4_20ng_mL               untrt     2       se
## 22 YT.1111.12 trt LPS_200ng_mL               untrt     2       se

Note that this matrix contains non-integer values, which are summarized counts from transcript level quantification by tximeta. Since we’re not working with tximeta here , we need to perform an additional step of approximation to round the values to integers.
If you’d like to work with non-approximated version of the data, you would have to download the raw fastq files and perform the analysis from there. Refer to vignettes/rnaseq_raw.Rmd for that version of the pipeline.

gse <- SummarizedExperiment(assays = round(counts), colData = coldata)

We first focus on analyzing the first experiment. Subset the data and create a DESeq2 object.

gse_1 <- gse[,1:12]
dds <- DESeqDataSet(gse_1, design = ~ stimuli + virus)

Manually retrieve rowData from org.Mm.eg.db.

gene_name <- mapIds(x = org.Mm.eg.db, keys = rownames(dds), column = "SYMBOL", keytype = "ENSEMBL", multiVals = "first")
ensembl_id <- rownames(dds)
mcols(dds) <- DataFrame(mcols(dds), gene_name, ensembl_id)

Pre-filter the genes.

nrow(dds)

## [1] 35896

keep <- rowSums(counts(dds) >= 10) >= 3  # at least 3 samples with a count of 10 or higher
dds <- dds[keep,]
nrow(dds)

## [1] 13668

Reset the orders of factor levels.

dds$stimuli <- factor(dds$stimuli, levels = c("blank","IL4_20ng_mL","LPS_200ng_mL"))
dds$virus <- factor(dds$virus, levels = c("untrt","cap18_mSc_moi_50000"))

Exploratory Data Analysis (EDA): Sample Distance & PCA

For exploratory analysis, it is common practice to work on the transformed count data. We choose to use variance stablizing transformation here for its robustness and speed.

vsd <- vst(dds, blind = FALSE)

Sample distance

Sample distance provides a powerful and intuitive way to explore the states of the transcriptomes across samples.
To perform this analysis, we compute a sample distance matrix by the dist() function using the euclidean distance as metric.

sampleDists <- dist(t(assay(vsd)))

Now plot the distance matrix with pheatmap.

sampleDistMatrix <- as.matrix( sampleDists )
rownames(sampleDistMatrix) <- paste0(sub("_.*","",vsd$stimuli), " - ", sub("_.*","",vsd$virus))
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
out <- pheatmap(sampleDistMatrix,
         clustering_distance_rows = sampleDists,
         clustering_distance_cols = sampleDists,
         show_colnames = F,
         col = colors)

# Store results from hierachical clustering
clusters <- cutree(out$tree_col, k = 3)
vsd$clusters <- as.factor(clusters)
dds$clusters <- as.factor(clusters)

From this plot we see that triggers such as LPS or IL4 promoted distinct states in the transcriptional states of microglia. These were conventionally described as “Reactive” (LPS) or “Alternative activation” (IL4) states. On the other hand, while microglia underwent no treatment and microglia transduced with AAV-cap18 demonstrated a clear within-group homogeneity, they were characterized by a similar transcriptional state, as illustrated by the hierachical clustering results.

PCA

Pricipal component analysis provides another way of summarizing the microglia transcription states under different conditions.

pcaData <- plotPCA(vsd, intgroup = c("SampleName","clusters","stimuli","virus"), returnData = TRUE)
pcaData$stimulus_virus <- factor(paste(pcaData$stimuli, pcaData$virus, sep = "."),
                                 levels = c("blank.untrt","blank.cap18_mSc_moi_50000","IL4_20ng_mL.untrt","LPS_200ng_mL.untrt"))
percentVar <- round(100* attr(pcaData, "percentVar"))
ggplot(pcaData, aes(x = PC1, y = PC2)) + 
  geom_point(aes(color = clusters, shape = stimuli, fill = stimulus_virus), size = 2, stroke = 1, position = position_dodge2(width = 0.8)) + 
  scale_shape_manual(values = c(21,22,24)) +
  scale_fill_manual(values=c("white",brewer.pal(n = 12, name = "Paired")[c(5,1,3)])) +
  guides(color = "none",
         fill = guide_legend(override.aes = list(shape = 21))) +  # Hack: github.com/tidyverse/ggplot2/issues/2322
  stat_ellipse(data=pcaData, mapping=aes(color = clusters), linetype = 2) +
  xlim(-27,37) + ylim(-18,16) +
  xlab(paste0("PC1: ", percentVar[1], "% variance")) + 
  ylab(paste0("PC2: ", percentVar[2], "% variance")) + 
  coord_fixed() +
  theme_bw() +
  theme(
    legend.position = "none"
  )

Using clusters obtained from sample distance matrix, we use the stat_ellipse function to draw a 95% confidence level for a multivariate t-distribution of the PC coordinates.

Visualize Gene Expression by Heatmap

To have a closer look at the microglia transcription states under different conditions, we use heatmap to plot the relative expression of canonical homeostatic and reactive marker genes.

genesubsets_to_plot <- rowData(vsd)[rowData(vsd)$gene_name %in% c("Cst3","Cxcl10","Il12b","Ccl2","Il1b","Nos2","Tnf","Il6","Il1a","Clec10a","Clec7a","Chil3","Mrc1"),]$ensembl_id
names(genesubsets_to_plot) <- rowData(vsd)[rowData(vsd)$gene_name %in% c("Cst3","Cxcl10","Il12b","Ccl2","Il1b","Nos2","Tnf","Il6","Il1a","Clec10a","Clec7a","Chil3","Mrc1"),]$gene_name  ## This keeps the order right
sample_order <- c("yt0616.1","yt0616.2","yt0616.3",
                  "yt0616.16","yt0616.17","yt0616.18",
                  "yt0616.4","yt0616.5","yt0616.6",
                  "yt0616.7","yt0616.8","yt0616.9"
                  )

pheatmap(
  assay(vsd)[genesubsets_to_plot,sample_order], 
  cluster_rows = TRUE, 
  show_rownames = TRUE, 
  cluster_cols = FALSE, 
  labels_row = names(genesubsets_to_plot),
  labels_col = colnames(vsd),
  scale = "row",
  angle_col = 90
    )

Session info

sessionInfo()

## R version 4.0.5 (2021-03-31)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.4 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=zh_CN.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=zh_CN.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=zh_CN.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=zh_CN.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] pheatmap_1.0.12             RColorBrewer_1.1-3         
##  [3] forcats_0.5.1               stringr_1.4.0              
##  [5] dplyr_1.0.9                 purrr_0.3.4                
##  [7] readr_2.1.2                 tidyr_1.2.0                
##  [9] tibble_3.1.7                tidyverse_1.3.1            
## [11] ggplot2_3.3.6               DESeq2_1.30.1              
## [13] org.Mm.eg.db_3.12.0         AnnotationDbi_1.52.0       
## [15] SummarizedExperiment_1.20.0 Biobase_2.50.0             
## [17] GenomicRanges_1.42.0        GenomeInfoDb_1.26.7        
## [19] IRanges_2.24.1              S4Vectors_0.28.1           
## [21] BiocGenerics_0.36.1         MatrixGenerics_1.9.0       
## [23] matrixStats_0.62.0         
## 
## loaded via a namespace (and not attached):
##  [1] fs_1.5.2               bitops_1.0-7           lubridate_1.8.0       
##  [4] bit64_4.0.5            httr_1.4.3             tools_4.0.5           
##  [7] backports_1.4.1        bslib_0.3.1            utf8_1.2.2            
## [10] R6_2.5.1               DBI_1.1.2              colorspace_2.0-3      
## [13] withr_2.5.0            tidyselect_1.1.2       bit_4.0.4             
## [16] compiler_4.0.5         rvest_1.0.2            cli_3.3.0             
## [19] xml2_1.3.3             DelayedArray_0.16.3    labeling_0.4.2        
## [22] sass_0.4.1             scales_1.2.0           genefilter_1.72.1     
## [25] digest_0.6.29          rmarkdown_2.14         XVector_0.30.0        
## [28] pkgconfig_2.0.3        htmltools_0.5.2        highr_0.9             
## [31] dbplyr_2.2.0           fastmap_1.1.0          rlang_1.0.2           
## [34] readxl_1.4.0           rstudioapi_0.13        RSQLite_2.2.14        
## [37] farver_2.1.0           jquerylib_0.1.4        generics_0.1.2        
## [40] jsonlite_1.8.0         BiocParallel_1.24.1    RCurl_1.98-1.7        
## [43] magrittr_2.0.3         GenomeInfoDbData_1.2.4 Matrix_1.4-1          
## [46] Rcpp_1.0.8.3           munsell_0.5.0          fansi_1.0.3           
## [49] lifecycle_1.0.1        stringi_1.7.6          yaml_2.3.5            
## [52] MASS_7.3-57            zlibbioc_1.36.0        grid_4.0.5            
## [55] blob_1.2.3             crayon_1.5.1           lattice_0.20-45       
## [58] haven_2.5.0            splines_4.0.5          annotate_1.68.0       
## [61] hms_1.1.1              locfit_1.5-9.4         knitr_1.39            
## [64] pillar_1.7.0           geneplotter_1.68.0     reprex_2.0.1          
## [67] XML_3.99-0.10          glue_1.6.2             evaluate_0.15         
## [70] modelr_0.1.8           vctrs_0.4.1            tzdb_0.3.0            
## [73] cellranger_1.1.0       gtable_0.3.0           assertthat_0.2.1      
## [76] cachem_1.0.6           xfun_0.31              xtable_1.8-4          
## [79] broom_0.8.0            survival_3.3-1         memoise_2.0.1         
## [82] ellipsis_0.3.2

The data submitted to GEO is the preprocessed table, therefore the gene-level summarization step was omitted from this vignette.↩︎

Lin et al.: AAV Microglia

RNA-seq (Extended Data Figure 1)

Ruiyu Ray Wang

2022-06-25

TL;DR

About this Vignette

Overview of the experiment

Analysis Outline