- Preparation
- Results
- Visualize
- Transform
- Quality Assesment
- Exploratory Data Analysis
- Exporting results
- Next Analysis: Enrichment/Over-representation analysis
Work in progress (https://github.com/MikiasHWT/Bulk-RNA-Seq)
Purpose : Comparison of RNA expression of three innate immune cell populations across three patient groups.
Load libraries
library(prettydoc)
# library(GEOquery) # import datasets from GEO (unused)
library(DESeq2) # Diverential gene expression analysis
library(vsn) # Transformation
library(apeglm) # log fold shrinking (adaptive t prior shrinkage estimator)
library(ashr) # log fold shrinking (adaptive shrinkage estimator)
library(pheatmap) # heatmaps
library(RColorBrewer) # Color themes
library(EnhancedVolcano) # Pleasing volcano plots
library(gridExtra) # GGplot extention
library(DT) # nice data tables
library(SparseArray)
library(ggvenn)
library(qvalue)
library(org.Mm.eg.db) # Mouse database
library(org.Hs.eg.db) # Human database
library(DOSE)
library(pathview)
library(clusterProfiler) # over representation/gene enrichment
library(AnnotationHub)
library(ensembldb)
library(enrichplot)
library(biomaRt)
library(ReactomePA)
library(ggupset)
library(cowplot)
library(europepmc)
library(ggstance)
# library(GO.db) # Gene ontology
# library(GOstats) # Gene ontology
library(tidyverse) # data wrangling & visualization
library(conflicted)
Conflicted functions
# Set function preferences
conflicts_prefer(dplyr::select)
conflicts_prefer(dplyr::filter)
conflicts_prefer(dplyr::slice)
conflicts_prefer(SparseArray::rowSds)
Define MA Plotting function MA plot visualizes relationships between log ratio & mean values of two variables: • “M” = minus in the log scale. Log ratios on (y) axis. • “A” = average in the log scale. The mean values on (x) axis.
# MA Plots
# Arranged into 2 rows
# Color & translucency by DEG
# Plot titles fed in as arguments
# Mean counts (x-axis) Log10 transformed
plotMA_arrange_iterative <- function(res_list) {
plots <- list()
for (i in seq_along(res_list)) {
res <- res_list[[i]]
res_name <- names(res_list)[i] # Get the name of the res object
p <- plotMA(res, returnData = TRUE) # DESeq2's PlotMA function
p_plot <- ggplot(p, aes(x = mean, y = lfc, color = isDE)) +
geom_point(aes(alpha = ifelse(isDE == "TRUE", 1, 0.1)), show.legend = FALSE) + # Reduce non DEG genes transperancy
scale_x_continuous(trans = "log10") + # Scale mean counts
ylim(-2, 2) +
labs(title = res_name) # Use the name in the plot title
plots[[i]] <- p_plot
}
do.call(grid.arrange, c(plots, nrow = 2))
}
Define Heatmap plotting function
Preparation
Note: Using GEOquery will require some troubleshooting as the GSE submission for this dataset lacked feature information. Instead the CSV files for gene counts and metadata can be download directly from GEO and saved locally.
Load data
# Load CSV's for gene counts & metadata. Retain untidy names for ease
counts <- read.csv("Data/GSE183973_bulkRNA_gene_counts.csv",
check.names = FALSE
)
metadata <- read.csv("Data/GSE183973_metadata_samples.csv",
check.names = FALSE
)
head(counts)
Initially, Two sets of variables are considered for this analysis.
Cell types: mono, ma, mreg
Patient Groups: smoker, non_smoker, copd
There are 9 patients evenly spread across the 3 groups. Each patient contributing Blood Monocytes (mono), Alveolar Macrophages (ma), and Regulatory Macrophages (mreg), the latter two being extracted from bronchoalveolar lavage fluid. Cells were isolated by Flow Cytometry and sequenced in Bulk.
Clean up
Match samples
# Remove genes column (gene names retained as index)
# Assign to new df to retain genes column for later data exploration
geneCounts <- counts |>
select(-genes)
# Match counts column names and metadata row names
desired_order <- metadata$samples
geneCounts <- geneCounts[, desired_order]
# Confirm match
all(colnames(geneCounts) %in% metadata$samples)
## [1] TRUE
## [1] TRUE
Construct DESeqDataSet object
# Define design matrix for DESeq2
dds <- DESeqDataSetFromMatrix(
countData = geneCounts,
colData = metadata,
design = ~ cell_type + patient_group
)
# normalization by estimating size factor
dds <- estimateSizeFactors(dds)
# remove low expressed genes
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep, ]
# set factor level (reference group) : Im interested in comparing copd and smokers to the reference group non-smokers in this case.
dds$patient_group <- relevel(dds$patient_group, ref = "non_smoker")
# Run DEG analysis
dds <- DESeq(dds)
## using pre-existing size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
## class: DESeqDataSet
## dim: 21090 27
## metadata(1): version
## assays(4): counts mu H cooks
## rownames(21090): WASH7P RP11-34P13.7 ... RP11-65G9.1 PARP4P1
## rowData names(34): baseMean baseVar ... deviance maxCooks
## colnames(27): 10.smoker.ag.mono_NGS19.K315 11.smoker.ag.ma_NGS19.K316
## ... 8.smoker.lc.ma_NGS19.K313 9.smoker.lc.mreg_NGS19.K314
## colData names(5): samples patient_group patient_initials cell_type
## sizeFactor
## [1] "Intercept" "cell_type_mono_vs_ma"
## [3] "cell_type_mreg_vs_ma" "patient_group_copd_vs_non_smoker"
## [5] "patient_group_smoker_vs_non_smoker"
Results
Note U-shaped p-value histograms-can indicate that a one-sided test was performed on data where there is signal in both directions, or it can indicate that there is dependence among the variables in the data.
Might be worth checking cell_type:patient_group
interaction term. As Pvalue histogram has a little Ushape to it.
View results of default DEG analysis
# defaul results with padj = 0.1
res <- results(dds)
# Specifying a strict FDR (p-adj) cutoff for posterity. (default is 0.1)
res05 <- results(dds, alpha = 0.05)
# Default contrast will be "smoker" vs reference group "non_smoker"
head(as.data.frame(res05))
##
## out of 21090 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up) : 86, 0.41%
## LFC < 0 (down) : 102, 0.48%
## outliers [1] : 60, 0.28%
## low counts [2] : 4078, 19%
## (mean count < 2)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
## [1] 311
## [1] 188
FDR estimations
##
## Call:
## qvalue(p = res$pvalue)
##
## pi0: 1
##
## Cumulative number of significant calls:
##
## <1e-04 <0.001 <0.01 <0.025 <0.05 <0.1 <1
## p-value 97 240 698 1103 1673 2724 21030
## q-value 9 18 74 113 169 258 21030
## local FDR 8 10 38 67 89 143 2155
Test interaction term
# dds2 <- DESeqDataSetFromMatrix(
# countData = geneCounts,
# colData = metadata,
# design = ~ cell_type + patient_group + cell_type:patient_group
# )
#
# dds2 <- estimateSizeFactors(dds2)
#
# keep <- rowSums(counts(dds2)) >= 10
#
# dds2 <- dds2[keep, ]
#
# dds2$patient_group <- relevel(dds2$patient_group, ref = "non_smoker")
#
# dds2 <- DESeq(dds2)
#
# resultsNames(dds2)
#
# res105 <- results(dds2, alpha = 0.1)
#
# summary(res105)
#
# sum(res105$padj < 0.1, na.rm = TRUE)
#
# hist(res105$pvalue)
#
# qobj2 <- qvalue(res105$pvalue)
#
# summary(qobj2)
#
# hist(qobj2)
#
# plot(qobj2)
Viewing DEG’s alternate contracts
# Next im curious about seeing the remaining group comparisons
# COPD vs Non-Smoker
resCOPD <- results(dds, alpha = 0.05, contrast = c("patient_group", "copd", "non_smoker"))
head(as.data.frame(resCOPD))
##
## out of 21090 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up) : 779, 3.7%
## LFC < 0 (down) : 704, 3.3%
## outliers [1] : 60, 0.28%
## low counts [2] : 5699, 27%
## (mean count < 3)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
## [1] 1483
##
## Call:
## qvalue(p = resCOPD$pvalue)
##
## pi0: 0.7801735
##
## Cumulative number of significant calls:
##
## <1e-04 <0.001 <0.01 <0.025 <0.05 <0.1 <1
## p-value 327 791 2011 2934 3970 5580 21021
## q-value 60 154 463 926 1473 2317 21030
## local FDR 39 96 273 476 799 1257 11013
# COPD vs Smoker
resWild <- results(dds, alpha = 0.05, contrast = c("patient_group", "copd", "smoker"))
head(as.data.frame(resWild))
##
## out of 21090 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up) : 434, 2.1%
## LFC < 0 (down) : 357, 1.7%
## outliers [1] : 60, 0.28%
## low counts [2] : 8145, 39%
## (mean count < 13)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
## [1] 791
##
## Call:
## qvalue(p = resWild$pvalue)
##
## pi0: 0.7705233
##
## Cumulative number of significant calls:
##
## <1e-04 <0.001 <0.01 <0.025 <0.05 <0.1 <1
## p-value 167 472 1578 2493 3536 5112 21023
## q-value 21 36 174 327 708 1539 21030
## local FDR 13 25 88 187 356 785 19802
Log fold Skrink DEG’s
Aids in visualization and gene ranking
# Adaptive t prior shrinkage estimator
resLFC <- lfcShrink(dds,
coef = "patient_group_smoker_vs_non_smoker",
type = "apeglm"
)
## using 'apeglm' for LFC shrinkage. If used in published research, please cite:
## Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
## sequence count data: removing the noise and preserving large differences.
## Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895
# Original DESeq2 shrinkage estimator, an adaptive Normal distribution as prior.
resNorm <- lfcShrink(dds,
coef = "patient_group_smoker_vs_non_smoker",
type = "normal"
)
## using 'normal' for LFC shrinkage, the Normal prior from Love et al (2014).
##
## Note that type='apeglm' and type='ashr' have shown to have less bias than type='normal'.
## See ?lfcShrink for more details on shrinkage type, and the DESeq2 vignette.
## Reference: https://doi.org/10.1093/bioinformatics/bty895
# Adaptive shrinkage estimator. Fits a mixture of Normal distributions to form the prior
resAsh <- lfcShrink(dds,
coef = "patient_group_smoker_vs_non_smoker",
type = "ashr"
)
## using 'ashr' for LFC shrinkage. If used in published research, please cite:
## Stephens, M. (2016) False discovery rates: a new deal. Biostatistics, 18:2.
## https://doi.org/10.1093/biostatistics/kxw041
Visualize
Visualize DEG results
# Visualize log2 fold changes of a given variable over the mean of normalized counts for all samples
res_list <- list(
"padj=0.1" = res,
"padj=0.05" = res05,
"Log fold shrunk padj=0.1" = resLFC
)
plotMA_arrange_iterative(res_list)
## Warning: Removed 357 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Removed 357 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 38 rows containing missing values or values outside the scale range
## (`geom_point()`).
Visualizing alternate contrasts
res_list <- list(
"Smoker vs Non_Smoker DEG's" = res,
"COPD v Non Smoker DEG's" = resCOPD,
"COPD v Smoker DEG's" = resWild
)
plotMA_arrange_iterative(res_list)
## Warning: Removed 357 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 455 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 364 rows containing missing values or values outside the scale range
## (`geom_point()`).
Visualize Log Fold Shrunken MA Plots
# Plot log fold shrunken DEG's
res_list <- list(
"apeglm" = resLFC,
"normal" = resNorm,
"ashr" = resAsh
)
plotMA_arrange_iterative(res_list)
## Warning: Removed 38 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 18 rows containing missing values or values outside the scale range
## (`geom_point()`).
visualize subset results
# Plot log fold shrunken DEG's
res_list <- list(
"Subset 0.05" = subRes05,
"Subset 0.01" = subRes01,
"Subset 0.005" = subRes005
)
plotMA_arrange_iterative(res_list)
## Warning: Removed 58 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 24 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 15 rows containing missing values or values outside the scale range
## (`geom_point()`).
Note: If there is unwanted variation present in the data (e.g. batch effects) it is always recommend to correct for this, which can be accommodated in DESeq2 by including in the design any known batch variables or by using functions/packages such as svaseq in sva (Leek 2014) or the RUV functions in RUVSeq (Risso et al. 2014) to estimate variables that capture the unwanted variation. In addition, the ashr developers have a specific method for accounting for unwanted variation in combination with ashr (Gerard and Stephens 2017).
Transform
Raw gene variance
Raw RNA sequencing counts have a high degree of variability; which increases with a genes average expression level. This variability needs to be accounted before some downstream analysis such as visualization and clustering. We do this by transforming the data such that the variance is stabilized across the entire spectrum of expression values, allowing for use of statistical models/methods where that is a requirement.
# Filter out low expressed genes
keep <- rowSums(geneCounts) > 10
filtCounts <- as.matrix(geneCounts[keep, ])
# Visualize gene expression values by sample
boxplot(filtCounts, main = "Raw counts", ylab = "Gene Expression", las = 2)
Various Transformations
Log2 fold transformation were the standard in the past, but they can result in over representation of low expressed genes. As such various other methods are implemented. The goal is that the standard deviation stays consistent across all mean values of gene expression. vst and rlog have pros and cons determined by the number of samples.
# Get log2 counts
logcounts <- log2(filtCounts + 1)
# Plot mean vs SD
Log2 <- ggplot(logcounts, aes(
x = rowMeans(logcounts),
y = rowSds(logcounts)
)) +
geom_point() +
geom_smooth() +
labs(title = "Log2", x = "Mean", y = "SD")
# Variance stabilizing transformation
vst_counts <- vst(filtCounts)
# Plot
VST <- ggplot(vst_counts, aes(
x = rowMeans(vst_counts),
y = rowSds(vst_counts)
)) +
geom_point() +
geom_smooth() +
labs(title = "VST", x = "Mean", y = "SD")
# Rlog counts
rlog_counts <- rlog(filtCounts)
# Plot
Rlog <- ggplot(rlog_counts, aes(
x = rowMeans(rlog_counts),
y = rowSds(rlog_counts)
)) +
geom_point() +
geom_smooth() +
labs(title = "Rlog", x = "Mean", y = "SD")
# Arrange Plots
grid.arrange(Raw, Log2, VST, Rlog, nrow = 2)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Quality Assesment
Quality assesment with Heatmaps
select <- order(rowMeans(counts(dds, normalized = TRUE)),
decreasing = TRUE
)[1:20]
df <- as.data.frame(colData(dds)[, c("patient_group", "cell_type")])
pheatmap(assay(ntd)[select, ],
cluster_rows = TRUE, show_rownames = TRUE,
cluster_cols = TRUE, annotation_col = df, show_colnames = FALSE,
main = "ntd", scale = "row"
)
Sample distance
# Determine between group variety
sampleDists <- dist(t(assay(vsd)))
sampleDistMatrix <- as.matrix(sampleDists)
rownames(sampleDistMatrix) <- paste(vsd$cell_type, vsd$patient_group, sep = "-")
colnames(sampleDistMatrix) <- NULL
colors <- colorRampPalette(rev(brewer.pal(9, "Blues")))(255)
pheatmap(sampleDistMatrix,
clustering_distance_rows = sampleDists,
clustering_distance_cols = sampleDists,
col = colors
)
PCA plots
pcaData <- plotPCA(vsd,
intgroup = c("cell_type", "patient_group"),
returnData = TRUE
)
percentVar <- round(100 * attr(pcaData, "percentVar"))
ggplot(pcaData, aes(PC1, PC2, color = patient_group, shape = cell_type)) +
geom_point(size = 3) +
xlab(paste0("PC1: ", percentVar[1], "% variance")) +
ylab(paste0("PC2: ", percentVar[2], "% variance")) +
coord_fixed()
Plot Counts
# Plot the gene with the smallest p adj value across specified groupings
c1 <- plotCounts(dds, gene = which.min(res$padj), intgroup = c("cell_type"), returnData = TRUE)
c1_plot <- ggplot(c1, aes(x = cell_type, y = count)) +
geom_point(position = position_jitter(w = 0.1, h = 0)) +
scale_y_log10(breaks = c(25, 100, 400))
c2 <- plotCounts(dds, gene = which.min(res$padj), intgroup = c("patient_group"), returnData = TRUE)
c2_plot <- ggplot(c2, aes(x = patient_group, y = count)) +
geom_point(position = position_jitter(w = 0.1, h = 0)) +
scale_y_log10(breaks = c(25, 100, 400))
c3 <- plotCounts(dds, gene = which.min(res$padj), intgroup = c("cell_type", "patient_group"), returnData = TRUE)
c3_plot <- ggplot(c3, aes(x = patient_group, y = count, color = cell_type)) +
geom_point(position = position_jitter(w = 0.1, h = 0)) +
scale_y_log10(breaks = c(25, 100, 400))
gridExtra::grid.arrange(c1_plot, c2_plot, c3_plot, nrow = 2)
Exploratory Data Analysis
Merge counts & Metadata for Data exploration
# Pivot DF for exploratory data analysis
countsLong <- counts |>
pivot_longer(cols = !genes, names_to = "samples", values_to = "counts")
head(countsLong)
# Annotate counts with metadata
counts <- countsLong |>
left_join(metadata, by = c("samples" = "samples"))
head(counts)
Random EDA
# barplot
counts |>
filter(genes == "CD101") |>
ggplot(aes(x = samples, y = counts, fill = cell_type)) +
geom_col() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# density plot
counts |>
filter(genes == "FTL") |>
ggplot(aes(x = counts, fill = cell_type)) +
geom_density(alpha = 0.5)
# boxplot
counts |>
filter(genes == "LYZ") |>
ggplot(aes(x = cell_type, y = counts)) +
geom_boxplot()
# voilinplot
counts |>
filter(genes == "CD74") |>
ggplot(aes(x = cell_type, y = counts)) +
geom_violin()
# scatterplot
counts |>
filter(genes == "FTL" | genes == "LYZ") |>
pivot_wider(names_from = genes, values_from = counts) |>
ggplot(aes(x = FTL, y = LYZ, color = cell_type)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'
Volcano Plot
EnhancedVolcano(res,
lab = rownames(res),
x = "log2FoldChange",
y = "pvalue",
title = "Smoker vs Non-smoker",
pCutoff = 1e-05,
FCcutoff = 0.5,
pointSize = 2,
labSize = 6.0,
colAlpha = 0.3
)
resLFC_DF <- as.data.frame(resLFC) |>
rownames_to_column("GeneID")
resLFC_DF <- resLFC_DF |>
mutate(`-log10(pvalue)` = -log10(pvalue))
ggplot(resLFC_DF, aes(x = log2FoldChange, y = `-log10(pvalue)`)) +
geom_point(aes(colour = padj < 0.05), size = 1) +
geom_text(data = ~ top_n(.x, 5, wt = -padj), aes(label = GeneID)) +
labs(title = "Smoker vs NonSmoker")
## Warning: Removed 60 rows containing missing values or values outside the scale range
## (`geom_point()`).
Subset results
##
## out of 189 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up) : 86, 46%
## LFC < 0 (down) : 103, 54%
## outliers [1] : 0, 0%
## low counts [2] : 0, 0%
## (mean count < 2)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
Exporting results
## log2 fold change (MLE): patient group smoker vs non smoker
## Wald test p-value: patient group smoker vs non smoker
## DataFrame with 21090 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## HLA-DPB2 31.5128 -2.169615 0.2259472 -9.60231 7.81732e-22
## PLEKHG3 486.4914 -0.668327 0.0919471 -7.26860 3.63227e-13
## YPEL4 101.9154 1.368702 0.1981824 6.90628 4.97540e-12
## PLTP 546.4092 -2.160822 0.3264863 -6.61841 3.63075e-11
## SARDH 15.2484 4.876028 0.7729927 6.30799 2.82688e-10
## ... ... ... ... ... ...
## NRARP 3.15119 -2.9343118 0.927136 -3.1649196 NA
## P2RY8 641.71420 -0.8002904 0.569591 -1.4050256 NA
## FAM9B 2.24115 -4.5838024 1.719012 -2.6665326 NA
## XIST 2725.38469 0.0420529 2.928501 0.0143599 NA
## CTD-2328D6.1 3.14600 9.0527378 3.163472 2.8616462 NA
## padj
## <numeric>
## HLA-DPB2 1.29361e-17
## PLEKHG3 3.00534e-09
## YPEL4 2.74443e-08
## PLTP 1.50204e-07
## SARDH 9.03196e-07
## ... ...
## NRARP NA
## P2RY8 NA
## FAM9B NA
## XIST NA
## CTD-2328D6.1 NA
##
## out of 21090 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up) : 140, 0.66%
## LFC < 0 (down) : 171, 0.81%
## outliers [1] : 60, 0.28%
## low counts [2] : 4482, 21%
## (mean count < 2)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
## log2 fold change (MLE): patient group smoker vs non smoker
## Wald test p-value: patient group smoker vs non smoker
## DataFrame with 189 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue padj
## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## HLA-DPB2 31.5128 -2.169615 0.2259472 -9.60231 7.81732e-22 1.29361e-17
## PLEKHG3 486.4914 -0.668327 0.0919471 -7.26860 3.63227e-13 3.00534e-09
## YPEL4 101.9154 1.368702 0.1981824 6.90628 4.97540e-12 2.74443e-08
## PLTP 546.4092 -2.160822 0.3264863 -6.61841 3.63075e-11 1.50204e-07
## SARDH 15.2484 4.876028 0.7729927 6.30799 2.82688e-10 9.03196e-07
## ... ... ... ... ... ... ...
## PI3 7.24133 -3.157026 0.912284 -3.46058 0.000539023 0.0482149
## PFKFB3 1163.56592 -0.637206 0.184499 -3.45372 0.000552915 0.0489829
## OR2A1-AS1 35.49370 1.192636 0.345349 3.45342 0.000553529 0.0489829
## LPAR5 54.24131 -1.032522 0.299226 -3.45065 0.000559241 0.0492251
## PTPN3 14.19870 -1.807723 0.524275 -3.44805 0.000564660 0.0494391
#Venn Diagram
vennTest <- tibble(Geneid = rownames(res)) |>
mutate(NonSmokerVSSmoker = res$padj < 0.05 &
!is.na(res$padj) &
res$log2FoldChange > 0) |>
mutate(Smoker = res$padj < 0.05 &
!is.na(res$padj) &
res$log2FoldChange < 0) |>
mutate(NonSmokerVSCOPD = resCOPD$padj < 0.05 &
!is.na(resCOPD$padj) &
resCOPD$log2FoldChange > 0) |>
mutate(COPD = resCOPD$padj < 0.05 &
!is.na(resCOPD$padj) &
resCOPD$log2FoldChange < 0)
ggvenn(vennTest, set_name_size = 4)
Next Analysis: Enrichment/Over-representation analysis
https://yulab-smu.top/biomedical-knowledge-mining-book/index.html
file:///C:/Users/Owner/Downloads/RNASEQ20_Day3_HandsOn.pdf ## ORA & Enrichment
# Create background dataset for hypergeometric testing using all genes tested for significance in the results
all_genes <- as.character(rownames(res))
# Extract significant results
signif_res <- res[res$padj < 0.05 & !is.na(res$padj), ]
signif_genes <- as.character(rownames(signif_res))
# Run GO enrichment analysis
egoBP <- enrichGO(
gene = signif_genes,
universe = all_genes,
keyType = "SYMBOL",
OrgDb = org.Hs.eg.db,
ont = "BP",
pAdjustMethod = "BH",
qvalueCutoff = 0.05,
readable = TRUE
)
head(egoBP)
egoCC <- enrichGO(
gene = signif_genes,
universe = all_genes,
keyType = "SYMBOL",
OrgDb = org.Hs.eg.db,
ont = "CC",
pAdjustMethod = "BH",
qvalueCutoff = 0.05,
readable = TRUE
)
head(egoCC)
egoMF <- enrichGO(
gene = signif_genes,
universe = all_genes,
keyType = "SYMBOL",
OrgDb = org.Hs.eg.db,
ont = "MF",
pAdjustMethod = "BH",
qvalueCutoff = 0.05,
readable = TRUE
)
head(egoMF)
# Output results from GO analysis to a table
clusterBP_summary <- data.frame(egoBP)
clusterCC_summary <- data.frame(egoCC)
clusterMF_summary <- data.frame(egoMF)
# Visualize
dotplot(egoBP, showCategory = 12)
## Warning: ggrepel: 43 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
# dotplot(egoCC, showCategory = 12) # no results for Cellular Component
# goplot(egoCC)
dotplot(egoMF, showCategory = 12)
# # Plot enrichment map
# emapplot(ego, showCategory = 50)
#
# # Calculate term similarity matrix
# sim_matrix <- pairwise_termsim(ego)
#
# # Plot enrichment map
# emapplot(ego, showCategory = 50, sim_matrix = sim_matrix)
# To color genes by log2 fold changes
signif_res_lFC <- signif_res$log2FoldChange
cnetplot(egoBP,
categorySize = "pvalue",
showCategory = 5,
foldChange = signif_res_lFC,
vertex.label.font = 6
)
## Warning in cnetplot.enrichResult(x, ...): Use 'color.params = list(foldChange = your_value)' instead of 'foldChange'.
## The foldChange parameter will be removed in the next version.
## Scale for size is already present.
## Adding another scale for size, which will replace the existing scale.
cnetplot(egoMF,
categorySize = "pvalue",
showCategory = 5,
foldChange = signif_res_lFC,
vertex.label.font = 6
)
## Warning in cnetplot.enrichResult(x, ...): Use 'color.params = list(foldChange = your_value)' instead of 'foldChange'.
## The foldChange parameter will be removed in the next version.
## Scale for size is already present.
## Adding another scale for size, which will replace the existing scale.
GO over-representation analysis
data(geneList, package = "DOSE")
gene <- names(geneList)[abs(geneList) > 2]
# Entrez gene ID
head(gene)
## [1] "4312" "8318" "10874" "55143" "55388" "991"
# GroupGO
ggo <- groupGO(
gene = signif_genes,
OrgDb = org.Hs.eg.db,
keyType = "SYMBOL",
ont = "CC",
level = 3,
readable = TRUE
)
head(ggo)
# # EnrichGO
# ego <- enrichGO(gene = gene,
# universe = names(geneList),
# OrgDb = org.Hs.eg.db,
# ont = "CC",
# pAdjustMethod = "BH",
# pvalueCutoff = 0.01,
# qvalueCutoff = 0.05,
# readable = TRUE)
#
# head(ego)
#
# goplot(ego)
#
# gene.df <- bitr(gene, fromType = "ENTREZID",
# toType = c("ENSEMBL", "SYMBOL"),
# OrgDb = org.Hs.eg.db)
#
# ego2 <- enrichGO(gene = gene.df$ENSEMBL,
# OrgDb = org.Hs.eg.db,
# keyType = 'ENSEMBL',
# ont = "CC",
# pAdjustMethod = "BH",
# pvalueCutoff = 0.01,
# qvalueCutoff = 0.05)
#
# head(ego2, 3)
#
# goplot(ego2)
# Gene IDs can be mapped to gene Symbols by using the parameter readable=TRUE or setReadable() function.
GO Gene Set Enrichment Analysis
ego3 <- gseGO(
geneList = geneList,
OrgDb = org.Hs.eg.db,
ont = "CC",
minGSSize = 100,
maxGSSize = 500,
pvalueCutoff = 0.05,
verbose = FALSE
)
## Warning in fgseaMultilevel(pathways = pathways, stats = stats, minSize =
## minSize, : For some pathways, in reality P-values are less than 1e-10. You can
## set the `eps` argument to zero for better estimation.
KEGG enrichment analysis
## [1] 65 3
# KEGG pathway over-representation analysis
data(geneList, package = "DOSE")
gene <- names(geneList)[abs(geneList) > 2]
kk <- enrichKEGG(
gene = gene,
organism = "hsa",
pvalueCutoff = 0.05
)
## Reading KEGG annotation online: "https://rest.kegg.jp/link/hsa/pathway"...
## Warning in utils::download.file(url, quiet = TRUE, method = method, ...): the
## 'wininet' method is deprecated for http:// and https:// URLs
## Reading KEGG annotation online: "https://rest.kegg.jp/list/pathway/hsa"...
## Warning in utils::download.file(url, quiet = TRUE, method = method, ...): the
## 'wininet' method is deprecated for http:// and https:// URLs
# KEGG pathway gene set enrichment analysis
kk2 <- gseKEGG(
geneList = geneList,
organism = "hsa",
minGSSize = 120,
pvalueCutoff = 0.05,
verbose = FALSE
)
## Warning in fgseaMultilevel(pathways = pathways, stats = stats, minSize =
## minSize, : For some pathways, in reality P-values are less than 1e-10. You can
## set the `eps` argument to zero for better estimation.
# KEGG module over-representation analysis
mkk <- enrichMKEGG(
gene = gene,
organism = "hsa",
pvalueCutoff = 1,
qvalueCutoff = 1
)
## Reading KEGG annotation online: "https://rest.kegg.jp/link/hsa/module"...
## Warning in utils::download.file(url, quiet = TRUE, method = method, ...): the
## 'wininet' method is deprecated for http:// and https:// URLs
## Reading KEGG annotation online: "https://rest.kegg.jp/list/module"...
## Warning in utils::download.file(url, quiet = TRUE, method = method, ...): the
## 'wininet' method is deprecated for http:// and https:// URLs
# KEGG module gene set enrichment analysis
mkk2 <- gseMKEGG(
geneList = geneList,
organism = "hsa",
pvalueCutoff = 1
)
## preparing geneSet collections...
## GSEA analysis...
## leading edge analysis...
## done...
Visualize & Exporting enriched KEGG pathways
Reactome enrichment analysis
The input gene ID should be Entrez gene ID. Use clusterProfiler::bitr() to convert biological IDs.
# Reactome pathway over-representation analysis
data(geneList, package = "DOSE")
de <- names(geneList)[abs(geneList) > 1.5]
head(de)
## [1] "4312" "8318" "10874" "55143" "55388" "991"
# Reactome pathway gene set enrichment analysis
y <- gsePathway(geneList,
pvalueCutoff = 0.2,
pAdjustMethod = "BH",
verbose = FALSE
)
## Warning in fgseaMultilevel(pathways = pathways, stats = stats, minSize =
## minSize, : For some pathways, in reality P-values are less than 1e-10. You can
## set the `eps` argument to zero for better estimation.
# Pathway Visualization
viewPathway("E2F mediated regulation of DNA replication",
readable = TRUE,
foldChange = geneList
)
## Loading required package: graphite
Disease enrichment analysis
# Over-representation analysis for disease ontology
data(geneList)
gene <- names(geneList)[abs(geneList) > 1.5]
head(gene)
## [1] "4312" "8318" "10874" "55143" "55388" "991"
x <- enrichDO(
gene = gene,
ont = "DO",
pvalueCutoff = 0.05,
pAdjustMethod = "BH",
universe = names(geneList),
minGSSize = 5,
maxGSSize = 500,
qvalueCutoff = 0.05,
readable = FALSE
)
head(x)
Biological theme comparison
## List of 8
## $ X1: chr [1:216] "4597" "7111" "5266" "2175" ...
## $ X2: chr [1:805] "23450" "5160" "7126" "26118" ...
## $ X3: chr [1:392] "894" "7057" "22906" "3339" ...
## $ X4: chr [1:838] "5573" "7453" "5245" "23450" ...
## $ X5: chr [1:929] "5982" "7318" "6352" "2101" ...
## $ X6: chr [1:585] "5337" "9295" "4035" "811" ...
## $ X7: chr [1:582] "2621" "2665" "5690" "3608" ...
## $ X8: chr [1:237] "2665" "4735" "1327" "3192" ...
## Warning in utils::download.file(url, quiet = TRUE, method = method, ...): the
## 'wininet' method is deprecated for http:// and https:// URLs
## Warning in utils::download.file(url, quiet = TRUE, method = method, ...): the
## 'wininet' method is deprecated for http:// and https:// URLs
# Formula interface of compareCluster
# compareCluster() function also supports passing a formula to describe more complicated experimental designs
mydf <- data.frame(Entrez = names(geneList), FC = geneList)
mydf <- mydf[abs(mydf$FC) > 1, ]
mydf$group <- "upregulated"
mydf$group[mydf$FC < 0] <- "downregulated"
mydf$othergroup <- "A"
mydf$othergroup[abs(mydf$FC) > 2] <- "B"
formula_res <- compareCluster(Entrez ~ group + othergroup,
data = mydf,
fun = "enrichKEGG"
)
head(formula_res)
## Warning: ggrepel: 154 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
Visualization of functional enrichment result
# Bar Plot
data(geneList)
de <- names(geneList)[abs(geneList) > 2]
edo <- enrichDGN(de)
barplot(edo, showCategory = 20)
## preparing geneSet collections...
## GSEA analysis...
## Warning in fgseaMultilevel(pathways = pathways, stats = stats, minSize =
## minSize, : For some pathways, in reality P-values are less than 1e-10. You can
## set the `eps` argument to zero for better estimation.
## leading edge analysis...
## done...
# Gene-Concept Network
## convert gene ID to Symbol
edox <- setReadable(edo, "org.Hs.eg.db", "ENTREZID")
p1 <- cnetplot(edox, foldChange = geneList)
## Warning in cnetplot.enrichResult(x, ...): Use 'color.params = list(foldChange = your_value)' instead of 'foldChange'.
## The foldChange parameter will be removed in the next version.
## Scale for size is already present.
## Adding another scale for size, which will replace the existing scale.
## categorySize can be scaled by 'pvalue' or 'geneNum'
p2 <- cnetplot(edox, categorySize = "pvalue", foldChange = geneList)
## Warning in cnetplot.enrichResult(x, ...): Use 'color.params = list(foldChange = your_value)' instead of 'foldChange'.
## The foldChange parameter will be removed in the next version.
## Scale for size is already present.
## Adding another scale for size, which will replace the existing scale.
## Warning in cnetplot.enrichResult(x, ...): Use 'color.params = list(foldChange = your_value)' instead of 'foldChange'.
## The foldChange parameter will be removed in the next version.
## Warning in cnetplot.enrichResult(x, ...): Use 'color.params = list(edge = your_value)' instead of 'colorEdge'.
## The colorEdge parameter will be removed in the next version.
## Scale for size is already present.
## Adding another scale for size, which will replace the existing scale.
## Warning: ggrepel: 14 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
## Warning: ggrepel: 14 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
## Warning: ggrepel: 28 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
## Labelling nodes by selected subset.
p1 <- cnetplot(edox,
node_label = "category",
cex_label_category = 1.2
)
## Warning in cnetplot.enrichResult(x, ...): Use 'cex.params = list(category_label = your_value)' instead of 'cex_label_category'.
## The cex_label_category parameter will be removed in the next version.
## Warning in cnetplot.enrichResult(x, ...): Use 'cex.params = list(gene_label = your_value)' instead of 'cex_label_gene'.
## The cex_label_gene parameter will be removed in the next version.
p3 <- cnetplot(edox, node_label = "all")
p4 <- cnetplot(edox,
node_label = "none",
color_category = "firebrick",
color_gene = "steelblue"
)
## Warning in cnetplot.enrichResult(x, ...): Use 'color.params = list(category = your_value)' instead of 'color_category'.
## The color_category parameter will be removed in the next version.
## Warning in cnetplot.enrichResult(x, ...): Use 'color.params = list(gene = your_value)' instead of 'color_gene'.
## The color_gene parameter will be removed in the next version.
## Warning: Removed 57 rows containing missing values or values outside the scale range
## (`geom_text_repel()`).
## Warning: Removed 5 rows containing missing values or values outside the scale range
## (`geom_text_repel()`).
## Warning: ggrepel: 4 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
## Warning: ggrepel: 25 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
## Using cnetplot to visualize data relationships
set.seed(123)
x <- list(A = letters[1:10], B = letters[5:12], C = letters[sample(1:26, 15)])
p1 <- cnetplot(x)
set.seed(123)
d <- setNames(rnorm(26), letters)
p2 <- cnetplot(x, foldChange = d) +
scale_color_gradient2(name = "associated data", low = "darkgreen", high = "firebrick")
## Warning in cnetplot.enrichResult(x, ...): Use 'color.params = list(foldChange = your_value)' instead of 'foldChange'.
## The foldChange parameter will be removed in the next version.
## Scale for size is already present.
## Adding another scale for size, which will replace the existing scale.
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
# Heatmap-like functional classification
p1 <- heatplot(edox, showCategory = 5)
p2 <- heatplot(edox, foldChange = geneList, showCategory = 5)
cowplot::plot_grid(p1, p2, ncol = 1, labels = LETTERS[1:2])
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## Warning in treeplot.enrichResult(x, ...): Use 'cluster.params = list(method = your_value)' instead of 'hclust_method'.
## The hclust_method parameter will be removed in the next version.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
## ! # Invaild edge matrix for <phylo>. A <tbl_df> is returned.
# Enrichment Map
edo <- pairwise_termsim(edo)
p1 <- emapplot(edo)
p2 <- emapplot(edo, cex_category = 1.5)
## Warning in emapplot.enrichResult(x, showCategory = showCategory, ...): Use 'cex.params = list(category_node = your_value)' instead of 'cex_category'.
## The cex_category parameter will be removed in the next version.
## Warning in emapplot.enrichResult(x, showCategory = showCategory, ...): Use 'layout.params = list(layout = your_value)' instead of 'layout'.
## The layout parameter will be removed in the next version.
## Warning in emapplot.enrichResult(x, showCategory = showCategory, ...): Use 'layout.params = list(layout = your_value)' instead of 'layout'.
## The layout parameter will be removed in the next version.
## Warning in emapplot.enrichResult(x, showCategory = showCategory, ...): Use 'cex.params = list(category_node = your_value)' instead of 'cex_category'.
## The cex_category parameter will be removed in the next version.
## Warning: ggrepel: 20 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
## Warning: ggrepel: 15 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
## Warning: ggrepel: 28 unlabeled data points (too many overlaps). Consider increasing max.overlaps
## ggrepel: 28 unlabeled data points (too many overlaps). Consider increasing max.overlaps
# Biological theme comparison
data(gcSample)
xx <- compareCluster(gcSample,
fun = "enrichKEGG",
organism = "hsa", pvalueCutoff = 0.05
)
xx <- pairwise_termsim(xx)
p1 <- emapplot(xx)
p2 <- emapplot(xx, legend_n = 2)
## Warning in emapplot.compareClusterResult(x, showCategory = showCategory, : Use 'pie.params = list(legend_n = your_value)' instead of 'legend_n'.
## The legend_n parameter will be removed in the next version.
## Warning in emapplot.compareClusterResult(x, showCategory = showCategory, : Use 'pie.params = list(pie = your_value)' instead of 'pie'.
## The pie parameter will be removed in the next version.
## Warning in emapplot.compareClusterResult(x, showCategory = showCategory, : Use 'pie.params = list(pie = your_value)' instead of 'pie'.
## The pie parameter will be removed in the next version.
## Warning in emapplot.compareClusterResult(x, showCategory = showCategory, : Use 'layout.params = list(layout = your_value)' instead of 'layout'.
## The layout parameter will be removed in the next version.
## Warning in emapplot.compareClusterResult(x, showCategory = showCategory, : Use 'cex.params = list(category_node = your_value)' instead of 'cex_category'.
## The cex_category parameter will be removed in the next version.
## Warning: ggrepel: 34 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
## Warning: ggrepel: 36 unlabeled data points (too many overlaps). Consider increasing max.overlaps
## ggrepel: 36 unlabeled data points (too many overlaps). Consider increasing max.overlaps
## Warning: ggrepel: 38 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
# UpSet Plot
# For over-representation analysis, upsetplot will calculate the overlaps among different gene sets
upsetplot(edo)
# For GSEA result, it will plot the fold change distributions of different categories (e.g. unique to pathway, overlaps among different pathways).
upsetplot(kk2)
## Picking joint bandwidth of 0.224
# running score and preranked list of GSEA result
# gseaplot for GSEA result(by = "runningScore")
p1 <- gseaplot(edo2,
geneSetID = 1,
by = "runningScore",
title = edo2$Description[1]
)
p2 <- gseaplot(edo2,
geneSetID = 1,
by = "preranked",
title = edo2$Description[1]
)
p3 <- gseaplot(edo2,
geneSetID = 1,
title = edo2$Description[1]
)
cowplot::plot_grid(p1, p2, p3, ncol = 1, labels = LETTERS[1:3])
## Warning in as_grob.default(plot): Cannot convert object of class gglistlist
## into a grob.
# Gseaplot2 for GSEA result of multile gene sets(add pvalue_table)
gseaplot2(edo2,
geneSetID = 1:3, pvalue_table = TRUE,
color = c("#E495A5", "#86B875", "#7DB0DD"), ES_geom = "dot"
)
# Gseaplot2 for GSEA result of multile gene sets(add subplots)
p1 <- gseaplot2(edo2, geneSetID = 1:3, subplots = 1)
p2 <- gseaplot2(edo2, geneSetID = 1:3, subplots = 1:2)
cowplot::plot_grid(p1, p2, ncol = 1, labels = LETTERS[1:2])
## Warning in as_grob.default(plot): Cannot convert object of class gglistlist
## into a grob.
# Ranked list of genes belong to the specific gene set.
gsearank(edo2, 1, title = edo2[1, "Description"])
# Gsearank for multiple gene sets
pp <- lapply(1:3, function(i) {
anno <- edo2[i, c("NES", "pvalue", "p.adjust")]
lab <- paste0(names(anno), "=", round(anno, 3), collapse = "\n")
gsearank(edo2, i, edo2[i, 2]) + xlab(NULL) + ylab(NULL) +
annotate("text", 10000, edo2[i, "enrichmentScore"] * .75, label = lab, hjust = 0, vjust = 0)
})
plot_grid(plotlist = pp, ncol = 1)
Manipulating enrichment result
data(geneList)
de <- names(geneList)[1:100]
x <- enrichDO(de)
# Filter
filter(x, p.adjust < .05, qvalue < 0.2)
## #
## # over-representation test
## #
## #...@organism Homo sapiens
## #...@ontology DO
## #...@keytype ENTREZID
## #...@gene chr [1:100] "4312" "8318" "10874" "55143" "55388" "991" "6280" "2305" ...
## #...pvalues adjusted by 'BH' with cutoff <0.05
## #...77 enriched terms found
## 'data.frame': 77 obs. of 9 variables:
## $ ID : chr "DOID:11054" "DOID:1107" "DOID:10534" "DOID:10907" ...
## $ Description: chr "urinary bladder cancer" "esophageal carcinoma" "stomach cancer" "microcephaly" ...
## $ GeneRatio : chr "11/84" "10/84" "13/84" "6/84" ...
## $ BgRatio : chr "246/10312" "219/10312" "416/10312" "76/10312" ...
## $ pvalue : num 4.55e-06 1.06e-05 2.86e-05 3.38e-05 3.83e-05 ...
## $ p.adjust : num 0.00235 0.00275 0.00311 0.00311 0.00311 ...
## $ qvalue : num 0.00176 0.00206 0.00233 0.00233 0.00233 ...
## $ geneID : chr "4312/6280/4605/6279/7153/6241/983/5080/332/2146/6790" "4312/6280/3868/6279/8140/890/5918/332/2146/4321" "4312/2305/6279/10403/7153/259266/8140/890/4085/81930/332/2146/3887" "1062/23397/259266/4085/5080/1111" ...
## $ Count : int 11 10 13 6 10 10 7 10 8 6 ...
## #...Citation
## Guangchuang Yu, Li-Gen Wang, Guang-Rong Yan, Qing-Yu He. DOSE: an
## R/Bioconductor package for Disease Ontology Semantic and Enrichment
## analysis. Bioinformatics 2015, 31(4):608-609
## #
## # over-representation test
## #
## #...@organism Homo sapiens
## #...@ontology DO
## #...@keytype ENTREZID
## #...@gene chr [1:100] "4312" "8318" "10874" "55143" "55388" "991" "6280" "2305" ...
## #...pvalues adjusted by 'BH' with cutoff <0.05
## #...77 enriched terms found
## 'data.frame': 77 obs. of 10 variables:
## $ ID : chr "DOID:10534" "DOID:3459" "DOID:11054" "DOID:2394" ...
## $ Description: chr "stomach cancer" "breast carcinoma" "urinary bladder cancer" "ovarian cancer" ...
## $ GeneRatio : chr "13/84" "12/84" "11/84" "11/84" ...
## $ BgRatio : chr "416/10312" "493/10312" "246/10312" "441/10312" ...
## $ pvalue : num 2.86e-05 6.09e-04 4.55e-06 8.55e-04 1.06e-05 ...
## $ p.adjust : num 0.00311 0.01199 0.00235 0.01474 0.00275 ...
## $ qvalue : num 0.00233 0.00898 0.00176 0.01105 0.00206 ...
## $ geneID : chr "4312/2305/6279/10403/7153/259266/8140/890/4085/81930/332/2146/3887" "4312/6280/6279/7153/6241/4751/890/4085/332/6286/2146/6790" "4312/6280/4605/6279/7153/6241/983/5080/332/2146/6790" "4312/259266/820/983/10232/6362/332/6286/9212/4321/6790" ...
## $ Count : int 13 12 11 11 10 10 10 10 10 9 ...
## $ geneRatio : num 0.155 0.143 0.131 0.131 0.119 ...
## #...Citation
## Guangchuang Yu, Li-Gen Wang, Guang-Rong Yan, Qing-Yu He. DOSE: an
## R/Bioconductor package for Disease Ontology Semantic and Enrichment
## analysis. Bioinformatics 2015, 31(4):608-609
## #
## # over-representation test
## #
## #...@organism Homo sapiens
## #...@ontology DO
## #...@keytype ENTREZID
## #...@gene chr [1:100] "4312" "8318" "10874" "55143" "55388" "991" "6280" "2305" ...
## #...pvalues adjusted by 'BH' with cutoff <0.05
## #...77 enriched terms found
## 'data.frame': 77 obs. of 10 variables:
## $ ID : chr "DOID:11054" "DOID:1107" "DOID:10534" "DOID:10907" ...
## $ Description: chr "urinary bladder cancer" "esophageal carcinoma" "stomach cancer" "microcephaly" ...
## $ GeneRatio : chr "11/84" "10/84" "13/84" "6/84" ...
## $ BgRatio : chr "246/10312" "219/10312" "416/10312" "76/10312" ...
## $ pvalue : num 4.55e-06 1.06e-05 2.86e-05 3.38e-05 3.83e-05 ...
## $ p.adjust : num 0.00235 0.00275 0.00311 0.00311 0.00311 ...
## $ qvalue : num 0.00176 0.00206 0.00233 0.00233 0.00233 ...
## $ geneID : chr "4312/6280/4605/6279/7153/6241/983/5080/332/2146/6790" "4312/6280/3868/6279/8140/890/5918/332/2146/4321" "4312/2305/6279/10403/7153/259266/8140/890/4085/81930/332/2146/3887" "1062/23397/259266/4085/5080/1111" ...
## $ Count : int 11 10 13 6 10 10 7 10 8 6 ...
## $ richFactor : num 0.0447 0.0457 0.0312 0.0789 0.0394 ...
## #...Citation
## Guangchuang Yu, Li-Gen Wang, Guang-Rong Yan, Qing-Yu He. DOSE: an
## R/Bioconductor package for Disease Ontology Semantic and Enrichment
## analysis. Bioinformatics 2015, 31(4):608-609
# Visualizing rich factor of enriched terms using lolliplot.
ggplot(y,
showCategory = 20,
aes(richFactor, fct_reorder(Description, richFactor))
) +
geom_segment(aes(xend = 0, yend = Description)) +
geom_point(aes(color = p.adjust, size = Count)) +
scale_color_viridis_c(guide = guide_colorbar(reverse = TRUE)) +
scale_size_continuous(range = c(2, 10)) +
theme_minimal() +
xlab("rich factor") +
ylab(NULL) +
ggtitle("Enriched Disease Ontology")
## preparing geneSet collections...
## GSEA analysis...
## Warning in fgseaMultilevel(pathways = pathways, stats = stats, minSize =
## minSize, : For some pathways, in reality P-values are less than 1e-10. You can
## set the `eps` argument to zero for better estimation.
## leading edge analysis...
## done...
## Warning: The `add` argument of `group_by()` is deprecated as of dplyr 1.0.0.
## ℹ Please use the `.add` argument instead.
## ℹ The deprecated feature was likely used in the dplyr package.
## Please report the issue at <https://github.com/tidyverse/dplyr/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
y <- arrange(x, abs(NES)) |>
group_by(sign(NES)) |>
slice(1:5)
# ggplot(y, aes(NES, fct_reorder(Description, NES), fill = qvalues),
# showCategory = 10
# ) +
# geom_col(orientation = "y") +
# scale_fill_continuous(
# low = "red", high = "blue",
# guide = guide_colorbar(reverse = TRUE)
# ) +
# theme_minimal() +
# ylab(NULL)
# Summarise
pbar <- function(x) {
pi=seq(0, 1, length.out=11)
mutate(x, pp = cut(p.adjust, pi)) %>%
group_by(pp) %>%
summarise(cnt = n()) %>%
ggplot(aes(pp, cnt)) + geom_col() +
theme_minimal() +
xlab("p value intervals") +
ylab("Frequency") +
ggtitle("p value distribution")
}
x <- enrichDO(de, pvalueCutoff=1, qvalueCutoff=1)
set.seed(2020-09-10)
random_genes <- sample(names(geneList), 100)
y <- enrichDO(random_genes, pvalueCutoff=1, qvalueCutoff=1)
p1 <- pbar(x)
p2 <- pbar(y)
cowplot::plot_grid(p1, p2, ncol=1, labels = LETTERS[1:2])
Prepare geneList
GSEA analysis requires a ranked gene list, which contains three features:
numeric vector: fold change or other type of numerical variable
named vector: every number has a name, the corresponding gene ID
sorted vector: number should be sorted in decreasing order
If you import your data from a csv file, the file should contains two columns, one for gene ID (no duplicated ID allowed) and another one for fold change. You can prepare your own geneList via the following command:
Showing specific pathways
data(geneList)
de <- names(geneList)[1:100]
x <- enrichDO(de)
## show top 10 most significant pathways and want to exclude the second one
## dotplot(x, showCategory = x$Description[1:10][-2])
set.seed(123)
selected_pathways <- sample(x$Description, 10)
selected_pathways
## [1] "esophagus adenocarcinoma"
## [2] "aortic disease"
## [3] "bronchiolitis obliterans"
## [4] "constipation"
## [5] "endometrial cancer"
## [6] "lymph node disease"
## [7] "acute myeloid leukemia"
## [8] "myasthenia gravis"
## [9] "malignant pleural mesothelioma"
## [10] "gastrointestinal system benign neoplasm"
p1 <- dotplot(x, showCategory = 10, font.size=14)
p2 <- dotplot(x, showCategory = selected_pathways, font.size=14)
cowplot::plot_grid(p1, p2, labels=LETTERS[1:2])
## [1] "DOID:11054" "DOID:1107" "DOID:10534"
## [1] "4312" "6280" "4605" "6279" "7153" "6241" "983" "5080" "332" "2146"
## [11] "6790"
## $`DOID:11054`
## [1] "4312" "6280" "4605" "6279" "7153" "6241" "983" "5080" "332" "2146"
## [11] "6790"
##
## $`DOID:1107`
## [1] "4312" "6280" "3868" "6279" "8140" "890" "5918" "332" "2146" "4321"
##
## $`DOID:10534`
## [1] "4312" "2305" "6279" "10403" "7153" "259266" "8140" "890"
## [9] "4085" "81930" "332" "2146" "3887"
# Wrap long axis labels
y <- enrichPathway(de)
p1 <- dotplot(y, label_format = 20)
p2 <- dotplot(y, label_format = function(x) stringr::str_wrap(x, width=20))
cowplot::plot_grid(p1, p2, ncol=2, labels=c("A", "B"))
# # Gene set enrichment analysis (GSEA) using clusterProfiler and Pathview
# mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))
#
# mart <- useEnsembl(biomart = "ensembl",
# dataset = "hsapiens_gene_ensembl")
#
# genes <- getBM(
# filters = "chromosome_name",
# attributes = "ensembl_gene_id",
# values = all_genes,
# mart = mart
# )
#
#
# indNA <- which(is.na(genes$entrezgene_id))
#
# genes_noNA <- genes[-indNA, ]
#
# indnodup <- which(duplicated(genes_noNA$ entrezgene_id) == F)
#
# genes_noNA_nodup <- genes_noNA[indnodup, ]
#
# lFC <- res$log2FoldChange[-indNA]
#
# lFC <- lFC[indnodup]
#
# names(lFC) <- genes_noNA_nodup$entrezgene_id
#
# # Sort fold changes in decreasing order
# lFC <- sort(lFC, decreasing = TRUE)
#
#
# # Perform the GSEA using KEGG gene sets:
# gseaKEGG <- gseKEGG(
# geneList = lFC,
# organism = "mmu",
# nPerm = 1000, # default number permutations
# minGSSize = 5, # minimum gene set size
# pvalueCutoff = 0.1, # padj cutoff value
# verbose = FALSE
# )
#
# # Extract the GSEA results
# gseaKEGG_results <- gseaKEGG@result
Session Info
## R version 4.3.3 (2024-02-29 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19045)
##
## Matrix products: default
##
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## time zone: America/Los_Angeles
## tzcode source: internal
##
## attached base packages:
## [1] grid stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] graphite_1.46.0 conflicted_1.2.0
## [3] lubridate_1.9.3 forcats_1.0.0
## [5] stringr_1.5.1 purrr_1.0.2
## [7] readr_2.1.5 tidyr_1.3.1
## [9] tibble_3.2.1 tidyverse_2.0.0
## [11] ggstance_0.3.7 europepmc_0.4.3
## [13] cowplot_1.1.3 ggupset_0.3.0
## [15] ReactomePA_1.44.0 biomaRt_2.56.1
## [17] enrichplot_1.20.3 ensembldb_2.24.1
## [19] AnnotationFilter_1.24.0 GenomicFeatures_1.52.2
## [21] AnnotationHub_3.8.0 BiocFileCache_2.8.0
## [23] dbplyr_2.5.0 clusterProfiler_4.8.3
## [25] pathview_1.40.0 DOSE_3.26.2
## [27] org.Hs.eg.db_3.17.0 org.Mm.eg.db_3.17.0
## [29] AnnotationDbi_1.62.2 qvalue_2.32.0
## [31] ggvenn_0.1.10 dplyr_1.1.4
## [33] SparseArray_1.0.12 S4Arrays_1.0.6
## [35] abind_1.4-5 Matrix_1.6-5
## [37] DT_0.33 gridExtra_2.3
## [39] EnhancedVolcano_1.18.0 ggrepel_0.9.5
## [41] ggplot2_3.5.0 RColorBrewer_1.1-3
## [43] pheatmap_1.0.12 ashr_2.2-63
## [45] apeglm_1.22.1 vsn_3.68.0
## [47] DESeq2_1.40.2 SummarizedExperiment_1.30.2
## [49] Biobase_2.60.0 MatrixGenerics_1.12.3
## [51] matrixStats_1.3.0 GenomicRanges_1.52.1
## [53] GenomeInfoDb_1.36.4 IRanges_2.34.1
## [55] S4Vectors_0.38.2 BiocGenerics_0.46.0
## [57] prettydoc_0.4.1
##
## loaded via a namespace (and not attached):
## [1] fs_1.6.3 ProtGenerics_1.32.0
## [3] bitops_1.0-7 HDO.db_0.99.1
## [5] httr_1.4.7 Rgraphviz_2.44.0
## [7] numDeriv_2016.8-1.1 tools_4.3.3
## [9] utf8_1.2.4 R6_2.5.1
## [11] mgcv_1.9-1 lazyeval_0.2.2
## [13] withr_3.0.0 prettyunits_1.2.0
## [15] preprocessCore_1.62.1 cli_3.6.2
## [17] scatterpie_0.2.2 labeling_0.4.3
## [19] sass_0.4.9 KEGGgraph_1.60.0
## [21] SQUAREM_2021.1 mvtnorm_1.2-4
## [23] ggridges_0.5.6 mixsqp_0.3-54
## [25] Rsamtools_2.16.0 yulab.utils_0.1.4
## [27] gson_0.1.0 invgamma_1.1
## [29] bbmle_1.0.25.1 limma_3.56.2
## [31] rstudioapi_0.16.0 RSQLite_2.3.6
## [33] generics_0.1.3 gridGraphics_0.5-1
## [35] BiocIO_1.10.0 GO.db_3.17.0
## [37] fansi_1.0.6 lifecycle_1.0.4
## [39] yaml_2.3.8 blob_1.2.4
## [41] promises_1.3.0 crayon_1.5.2
## [43] bdsmatrix_1.3-7 lattice_0.22-5
## [45] KEGGREST_1.40.1 pillar_1.9.0
## [47] knitr_1.46 fgsea_1.26.0
## [49] rjson_0.2.21 codetools_0.2-19
## [51] fastmatch_1.1-4 glue_1.7.0
## [53] downloader_0.4 ggfun_0.1.4
## [55] data.table_1.15.4 vctrs_0.6.5
## [57] png_0.1-8 treeio_1.24.3
## [59] urltools_1.7.3 gtable_0.3.4
## [61] emdbook_1.3.13 cachem_1.0.8
## [63] xfun_0.43 mime_0.12
## [65] tidygraph_1.3.1 coda_0.19-4.1
## [67] interactiveDisplayBase_1.38.0 nlme_3.1-164
## [69] ggtree_3.8.2 bit64_4.0.5
## [71] progress_1.2.3 filelock_1.0.3
## [73] bslib_0.7.0 affyio_1.70.0
## [75] irlba_2.3.5.1 colorspace_2.1-0
## [77] DBI_1.2.2 tidyselect_1.2.1
## [79] bit_4.0.5 compiler_4.3.3
## [81] curl_5.2.1 graph_1.78.0
## [83] xml2_1.3.6 DelayedArray_0.26.7
## [85] triebeard_0.4.1 shadowtext_0.1.3
## [87] rtracklayer_1.60.1 scales_1.3.0
## [89] affy_1.78.2 rappdirs_0.3.3
## [91] digest_0.6.35 rmarkdown_2.26
## [93] XVector_0.40.0 htmltools_0.5.8.1
## [95] pkgconfig_2.0.3 highr_0.10
## [97] fastmap_1.1.1 rlang_1.1.3
## [99] htmlwidgets_1.6.4 shiny_1.8.1.1
## [101] farver_2.1.1 jquerylib_0.1.4
## [103] jsonlite_1.8.8 BiocParallel_1.34.2
## [105] GOSemSim_2.26.1 RCurl_1.98-1.14
## [107] magrittr_2.0.3 GenomeInfoDbData_1.2.10
## [109] ggplotify_0.1.2 patchwork_1.2.0
## [111] munsell_0.5.1 Rcpp_1.0.12
## [113] ggnewscale_0.4.10 ape_5.8
## [115] viridis_0.6.5 stringi_1.8.3
## [117] ggraph_2.2.1 zlibbioc_1.46.0
## [119] MASS_7.3-60.0.1 plyr_1.8.9
## [121] parallel_4.3.3 Biostrings_2.68.1
## [123] graphlayouts_1.1.1 splines_4.3.3
## [125] hms_1.1.3 locfit_1.5-9.9
## [127] igraph_2.0.3 reshape2_1.4.4
## [129] BiocVersion_3.17.1 XML_3.99-0.16.1
## [131] evaluate_0.23 BiocManager_1.30.22
## [133] tzdb_0.4.0 tweenr_2.0.3
## [135] httpuv_1.6.15 polyclip_1.10-6
## [137] ggforce_0.4.2 xtable_1.8-4
## [139] restfulr_0.0.15 reactome.db_1.84.0
## [141] tidytree_0.4.6 later_1.3.2
## [143] viridisLite_0.4.2 snow_0.4-4
## [145] truncnorm_1.0-9 aplot_0.2.2
## [147] memoise_2.0.1 GenomicAlignments_1.36.0
## [149] timechange_0.3.0