Skip to contents

This vignette demonstrates how to perform transcriptional analyses using the macpie package, focusing on differential gene expression and pathway enrichment in high-throughput transcriptomic screening datasets. The workflow builds on a tidySeurat object and applies standardised, scalable tools to compare treatment conditions against controls.

Key points:

  • use compute_single_de to perform a differential expression analysis for one treatment group vs control
  • use compute_multi_de to perform differential expression analyses for all treatment groups vs control
  • use volcano plot, box plot and heatmap to show results from the analyses and visualise gene expression levels
  • use enrichr for pathway enrichment analysis

1. Data import

Data access
The full dataset (>10 MB of .rds files) is currently under restricted release and will become publicly available upon publication (at Zenodo).
In the meantime, please contact us for early access.

Data is imported into a tidySeurat object, which allows the usage of both the regular Seurat functions, as well as the functionality of tidyverse.

If the samples are spread across multiple plates, users can submit a vector of directories (or named directories, where names will become barcode prefixes) instead of one directory to the Read10X function.

#install.packages("macpie")  # or devtools::install_github("PMCC-BioinformaticsCore/macpie")


library(macpie)
suppressPackageStartupMessages(
  library(enrichR)
)

# Define project variables
project_name <- "PMMSq033"
project_metadata <- system.file("extdata/PMMSq033_metadata.csv", package = "macpie")
# Load metadata
metadata <- read_metadata(project_metadata)


# 1. Load your own gene counts per sample or 2. data from the publication
project_rawdata <- paste0(dir, "/macpieData/PMMSq033/raw_matrix")

project_name <- "PMMSq033"
raw_counts <- Read10X(data.dir = project_rawdata)

# Create tidySeurat object
mac <- CreateSeuratObject(counts = raw_counts,
                          project = project_name,
                          min.cells = 1,
                          min.features = 1)

# Join with metadata
mac <- mac %>%
  inner_join(metadata, by = c(".cell" = "Barcode"))

# Add unique identifier
mac <- mac %>%
  mutate(combined_id = str_c(Treatment_1, Concentration_1, sep = "_")) %>%
  mutate(combined_id = gsub(" ", "", .data$combined_id))

# Filter by read count per sample group
mac <- filter_genes_by_expression(mac, 
                                  group_by = "combined_id", 
                                  min_counts = 10, 
                                  min_samples = 2)
# Subset the working dataset
mac <- mac %>%
  filter(Project == "Current")

2. Single comparison

Similar to scRNA-seq data, MAC-seq gene expression counts have an excess of zero counts compared to bulk RNA-seq. Statistical models assuming a Poisson or negative binomial distribution may not fit the data distribution well. Additionally, replicates can be quite variable due to a large number of potential latent effect during high-throughput screening, and should be assessed during the QC process.

One way to assess the quality of normalization methods is with the average coefficient of variation across the samples.

# First we will subset the data to look at control, DMSO samples only
mac_dmso <- mac %>%
  filter(Treatment_1 == "DMSO")

# Run the RLE function to compare data normalisations
plot_rle(mac_dmso, label_column = "Row", normalisation = "edgeR")
#> tidyseurat says: Key columns are missing. A data frame is returned for independent data analysis.

plot_rle(mac_dmso, label_column = "Row", normalisation = "limma_voom")
#> tidyseurat says: Key columns are missing. A data frame is returned for independent data analysis.

# For multiple plates, once can add a vector with batch factors, for example
# plot_rle(mac_dmso, label_column = "Row", normalisation = "limma_voom", batch = mac_dmso$Plate_ID)

Normalised data for evaluation of normalizations with other methods (such as plotMA etc) can be extracted with:

normalised_counts <- compute_normalised_counts(mac_dmso, method = "SCT", 1)

Similarly, influence of DE methods on volcano plots can be easily assessed.

# First perform the differential expression analysis
treatment_samples <- "Staurosporine_10"
control_samples <- "DMSO_0"
top_table <- compute_single_de(mac, treatment_samples, control_samples, method = "limma_voom")

top_table_2 <- compute_single_de(mac, treatment_samples, control_samples, method = "edgeR")
# Let's visualise the results with a volcano plot
p1 <- plot_volcano(top_table, max.overlaps = 18) + ggtitle("limma_voom")
p2 <- plot_volcano(top_table_2, max.overlaps = 18) + ggtitle("edgeR")
p1+p2
#> Warning: ggrepel: 853 unlabeled data points (too many overlaps). Consider
#> increasing max.overlaps
#> Warning: ggrepel: 384 unlabeled data points (too many overlaps). Consider
#> increasing max.overlaps

Based on the results, we can quickly check gene expression levels in counts per million (CPM) for selected genes between treatment and control samples as described below.

genes <- top_table$gene[1:6]
group_by <- "combined_id"
plot_counts(mac,genes, group_by, treatment_samples, control_samples, normalisation = "cpm", color_by = "combined_id")
#> Normalizing layer: counts

Some plotting functions also have a “summarise” version that provides collapsed versions of the results in a table format.

print(summarise_de(top_table, lfc_threshold = 1, padj_threshold = 0.01), width = Inf)
#> # A tibble: 1 × 6
#>   Total_genes_tested Significantly_upregulated Significantly_downregulated
#>                <int>                     <int>                       <int>
#> 1               5660                       128                         234
#>   Total_significant Padj_threshold Log2FC_threshold
#>               <int>          <dbl>            <dbl>
#> 1               362           0.01                1

3.2. Pathway analysis

Differential gene expression results for individual comparisons of treatment vs control in macpie are performed with enrichR, which has access to a number of curated gene sets available through enrichR::listEnrichrDbs(). In the following case, the effect of Staurosporine on breast cancer cells through Myc inactivation can be observed through pathway enrichment analyses.

top_genes <- top_table %>%
  filter(p_value_adj < 0.01) %>%
  select(gene) %>%
  pull()

enriched <- enrichR::enrichr(top_genes, c("MSigDB_Hallmark_2020","DisGeNET",
                                 "RNA-Seq_Disease_Gene_and_Drug_Signatures_from_GEO"))
#> Uploading data to Enrichr... Done.
#>   Querying MSigDB_Hallmark_2020... Done.
#>   Querying DisGeNET... Done.
#>   Querying RNA-Seq_Disease_Gene_and_Drug_Signatures_from_GEO... Done.
#> Parsing results... Done.
p1 <- enrichR::plotEnrich(enriched[[1]]) + 
  macpie_theme(legend_position_ = 'right') + 
  scale_fill_gradientn(colors = macpie_colours$divergent)
#> Scale for fill is already present.
#> Adding another scale for fill, which will replace the existing scale.

gridExtra::grid.arrange(p1, ncol = 1)

While using “MSigDB_Hallmark_2020” is a standard, if you check the data from “DisGeNET”, you will see that our MCF7 (breast cancer cell line) samples are correctly enriched for breast cancer profiles.

p1 <- enrichR::plotEnrich(enriched[[2]]) + 
  macpie_theme(legend_position_ = 'right') + 
  scale_fill_gradientn(colors = macpie_colours$divergent)
#> Scale for fill is already present.
#> Adding another scale for fill, which will replace the existing scale.
gridExtra::grid.arrange(p1, ncol = 1)

3.3. Differential gene expression - multiple comparisons

Since MAC-seq is commonly used for high-throughput screening of compound libraries, we often want to compare multiple samples in a screen vs the control. This process can easily be parallelised. First we select a vector of “treatments” as combined_ids that do not contain the word “DMSO”. (Warning, due to the limitations of “mclapply”, parallelisation speedup currently only works on OSX and Linux machines, and not on Windows.)

mac$combined_id <- make.names(mac$combined_id)

treatments <- mac %>%
  filter(Concentration_1 == 10) %>%
  select(combined_id) %>%
  filter(!grepl("DMSO", combined_id)) %>%
  pull() %>%
  unique()
#> tidyseurat says: Key columns are missing. A data frame is returned for independent data analysis.
mac <- compute_multi_de(mac, treatments, control_samples = "DMSO_0", method = "limma_voom", num_cores = 1)

If we want to see how individual genes are expressed across the treatment groups, we can use two approaches. First, we can visualise expression of a specific list of genes on a heatmap.

plot_multi_de(mac, group_by = "combined_id", value = "log2FC", p_value_cutoff = 0.01, direction="up", control = "DMSO_0", by="fc", gene_list = head(top_genes, 10))

Second, we can visualise shared differentially expressed (DE) genes, defined as the top 5 DE genes from each single drug comparison (treatment vs control) that are found in at least 2 different treatment groups. The heatmap below represents log2FC values of DE genes.

plot_multi_de(mac, group_by = "combined_id", value = "log2FC", p_value_cutoff = 0.01, direction="up", n_genes = 5, control = "DMSO_0", by="fc")

If you prefer to see the expression level on replicate level, you can specify logCPM = “lcpm”. Since we are observing log CPM of individual samples, and not the comparisons, we can also visualise the DMSO control.

plot_multi_de(mac, group_by = "combined_id", value = "lcpm", p_value_cutoff = 0.01, direction="up", n_genes = 5, control = "DMSO_0", by="fc")

The outputs from the analyses above can be represented in table format.

summarise_de(mac, lfc_threshold = 1, padj_threshold = 0.01, multi=TRUE)
#> # A tibble: 27 × 7
#>    combined_id  Total_genes_tested Significantly_upregu…¹ Significantly_downre…²
#>    <chr>                     <int>                  <int>                  <int>
#>  1 Paclitaxel_…               5660                    629                     91
#>  2 SN01004236_…               5660                      0                      0
#>  3 SN01004272_…               5660                      1                      0
#>  4 SN01004569_…               5660                    257                     49
#>  5 SN01004614_…               5660                    220                     36
#>  6 SN01004734_…               5660                      0                      0
#>  7 SN01005060_…               5660                     13                      4
#>  8 SN01005979_…               5660                    936                    853
#>  9 SN01731107_…               5660                      4                      3
#> 10 SN01731137_…               5660                    148                     69
#> # ℹ 17 more rows
#> # ℹ abbreviated names: ¹​Significantly_upregulated, ²​Significantly_downregulated
#> # ℹ 3 more variables: Total_significant <int>, padj_threshold <dbl>,
#> #   Log2FC_threshold <dbl>

3.4. Pathway analysis - multiple comparisons

Pathway enrichment analysis can then also be performed across all treatments, and summarised in a heatmap.


# Load genesets from enrichr for a specific species or define your own
enrichr_genesets <- download_geneset("human", "MSigDB_Hallmark_2020")
mac <- compute_multi_enrichr(mac, genesets = enrichr_genesets)

enriched_pathways_mat <- mac@tools$pathway_enrichment %>%
  bind_rows() %>%
  select(combined_id, Term, Combined.Score) %>%
  pivot_wider(names_from = combined_id, values_from = Combined.Score) %>%
  column_to_rownames(var = "Term") %>%
  mutate(across(everything(), ~ ifelse(is.na(.), 0, log1p(.)))) %>%  # Replace NA with 0 across all columns
  as.matrix()


pheatmap(enriched_pathways_mat, color = macpie_colours$continuous_rev)

Quick check of some treatments:

Nutlin.3a is a MDM2-P53 inhibitor and stablises the p53 protein. It induces cell autophagy and apotopsis. Nutlin-activated p53 induces G1 and G2 arrest in cancer cell lines (see in the pathway enrichment heatmap).

Ref: Tovar C, et al. Proc Natl Acad Sci USA. 2006;103(6):1888–1893. Shows Nutlin-3’s effect on various p53 targets in cancer cell lines.