RNA-seq数据差异基因分析

本文描述了bulk RNA-seq分析流程中差异分析子模块,该模块可以得到差异表达基因,是后续功能富集等分析的基础。该模块用R语言编写。

分析流程示意图

该模块需要输入两方面的信息,利用基因水平的表达量和分组信息来完成差异基因的分析。表达量可用来自于DataSummary模块生成的raw.RData文件,分组信息需要自行提供,如下图所示。

Drawing

软件包准备

In [2]:
suppressWarnings(library("DESeq2"))
suppressWarnings(library("ggplot2"))

options(repr.plot.width=4, repr.plot.height=3, repr.plot.res = 200)

rawRdFile = "RNAseq/DataSummary/WPS_raw.RData"
samplesheet = "sampleSheet/LUSC_samplesheet_pass.txt"
outFolder = "RNAseq/DGE"

if(!dir.exists(outFolder))
    dir.create(outFolder)

两组的差异表达分析

首先导入数据和sample sheet

In [3]:
load(rawRdFile)
assign("raw", WPS_raw)
txi = raw$txi

sampleT = read.table(file = samplesheet, row.names = 1, header = TRUE, sep = "\t")
kSamples = intersect(colnames(txi$counts), rownames(sampleT))
sampleTable = sampleT[kSamples, , drop = FALSE]

newTxi = list(abundance = txi$abundance[, kSamples],
			  counts = txi$counts[, kSamples],
			  length = txi$length[, kSamples],
			  countsFromAbundance = "no")

head(sampleT)
A data.frame: 6 x 1
group
<fct>
1302003-CCancer
1302003-NNormal
1436156-CCancer
1436156-NNormal
1436468-CCancer
1436468-NNormal

我们要比较Cancer和Normal这两类的基因表达差异。

In [4]:
dds <- DESeqDataSetFromTximport(newTxi, sampleTable, ~group)
dds <- DESeq(dds)
res <- results(dds, contrast=c("group", "Cancer", "Normal"))
using counts and average transcript lengths from tximport

estimating size factors

using 'avgTxLength' from assays(dds), correcting for library size

estimating dispersions

gene-wise dispersion estimates

mean-dispersion relationship

final dispersion estimates

fitting model and testing

-- replacing outliers and refitting for 2020 genes
-- DESeq argument 'minReplicatesForReplace' = 7 
-- original counts are preserved in counts(dds)

estimating dispersions

fitting model and testing

DESeq2的差异表达分析结果如下所示:

In [5]:
rTable <- data.frame(res)
head(rTable)
A data.frame: 6 x 6
baseMeanlog2FoldChangelfcSEstatpvaluepadj
<dbl><dbl><dbl><dbl><dbl><dbl>
ENSG00000000003.15 928.020209 0.96824940.08919835 10.855015 1.887758e-27 1.549429e-26
ENSG00000000005.6 1.500884-1.94511030.42980104 -4.525606 6.022261e-06 1.476815e-05
ENSG00000000419.121073.031028 0.83085680.07264395 11.437384 2.719630e-30 2.509226e-29
ENSG00000000457.14 508.431133 0.18685440.06761521 2.763497 5.718560e-03 1.003688e-02
ENSG00000000460.17 307.574873 1.76263030.07891334 22.3362781.641644e-1101.511235e-108
ENSG00000000938.131332.132714-2.70901870.17133083-15.811625 2.587081e-56 5.884496e-55

结果保存

In [6]:
gene <- raw$geneAnn[rownames(rTable), "geneString"]
s1 <- rownames(sampleTable)[sampleTable[, "group"] == "Cancer"]
s2 <- rownames(sampleTable)[sampleTable[, "group"] == "Normal"]
treatment_TPM <- signif(apply(txi$abundance[rownames(rTable), s1], 1, mean), 4)
control_TPM <- signif(apply(txi$abundance[rownames(rTable), s2], 1, mean), 4)
DGE <- rep("NC", nrow(rTable))
DGE[((rTable$padj) < 0.05) & (rTable$log2FoldChange > 0)] = "UP"
DGE[((rTable$padj) < 0.05) & (rTable$log2FoldChange < 0)] = "DN"
rTable = data.frame(gene, treatment_TPM, control_TPM, rTable[, c("log2FoldChange", "pvalue", "padj")], DGE)
head(rTable)
A data.frame: 6 x 7
genetreatment_TPMcontrol_TPMlog2FoldChangepvaluepadjDGE
<fct><dbl><dbl><dbl><dbl><dbl><fct>
ENSG00000000003.15ENSG00000000003.15 TSPAN6 20.0400011.7700 0.9682494 1.887758e-27 1.549429e-26UP
ENSG00000000005.6ENSG00000000005.6 TNMD 0.03598 0.2433-1.9451103 6.022261e-06 1.476815e-05DN
ENSG00000000419.12ENSG00000000419.12 DPM1 73.9900048.8200 0.8308568 2.719630e-30 2.509226e-29UP
ENSG00000000457.14ENSG00000000457.14 SCYL3 7.25400 7.1880 0.1868544 5.718560e-03 1.003688e-02UP
ENSG00000000460.17ENSG00000000460.17 C1orf11210.13000 3.3990 1.76263031.641644e-1101.511235e-108UP
ENSG00000000938.13ENSG00000000938.13 FGR 7.9900065.7600-2.7090187 2.587081e-56 5.884496e-55DN
In [19]:
write.table(rTable, file = paste0(outFolder, "/WPS_Cancer_vs_Normal_DESeq2.txt"),
            row.names = FALSE, col.names = TRUE, sep = "\t")

差异表达火山图

In [7]:
rTable$LogFDR = -log10(rTable$padj)
p <- ggplot(rTable, aes(x = log2FoldChange, y = LogFDR, color = DGE)) + xlim(-10, 10)
p <- p + theme_bw() + labs(x = "Log2 fold-change", y = "-log10(FDR)", title = "")
p <- p + geom_point(size = 0.1) + theme(plot.title = element_text(hjust = 0.5))
p <- p + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
p <- p + scale_color_manual(values = c("UP" = "red", "DN" = "dodgerblue4", "NC" = "grey"))
p <- p + theme(legend.title=element_blank())
p
Warning message:
"Removed 19715 rows containing missing values (geom_point)."

由于Cancer和Normal之间的巨大差异,相当大一部分基因在这两组之间存在差异表达(FDR < 5%)。

In [8]:
sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server 7.6 (Maipo)

Matrix products: default
BLAS: /sibcb2/bioinformatics/software/BcbioNG/anaconda/lib/R/lib/libRblas.so
LAPACK: /sibcb2/bioinformatics/software/BcbioNG/anaconda/lib/R/lib/libRlapack.so

locale:
[1] C

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] ggplot2_3.2.1               DESeq2_1.22.2              
 [3] SummarizedExperiment_1.12.0 DelayedArray_0.8.0         
 [5] BiocParallel_1.16.6         matrixStats_0.55.0         
 [7] Biobase_2.42.0              GenomicRanges_1.34.0       
 [9] GenomeInfoDb_1.18.2         IRanges_2.16.0             
[11] S4Vectors_0.20.1            BiocGenerics_0.28.0        

loaded via a namespace (and not attached):
 [1] bit64_0.9-7            jsonlite_1.6           splines_3.5.1         
 [4] Formula_1.2-3          latticeExtra_0.6-28    blob_1.2.0            
 [7] GenomeInfoDbData_1.2.1 RSQLite_2.1.5          pillar_1.5.1          
[10] backports_1.1.5        lattice_0.20-38        glue_1.4.2            
[13] uuid_0.1-4             digest_0.6.23          RColorBrewer_1.1-2    
[16] XVector_0.22.0         checkmate_1.9.4        colorspace_1.4-1      
[19] htmltools_0.5.1.1      Matrix_1.2-18          XML_3.98-1.20         
[22] pkgconfig_2.0.3        genefilter_1.64.0      zlibbioc_1.28.0       
[25] xtable_1.8-4           purrr_0.3.3            scales_1.1.0          
[28] annotate_1.60.1        tibble_3.1.0           htmlTable_1.13.3      
[31] farver_2.0.1           generics_0.0.2         ellipsis_0.3.0        
[34] withr_2.1.2            repr_1.1.0             nnet_7.3-12           
[37] lazyeval_0.2.2         survival_3.1-8         magrittr_1.5          
[40] crayon_1.3.4           memoise_1.1.0          evaluate_0.14         
[43] fansi_0.4.0            foreign_0.8-74         data.table_1.12.8     
[46] tools_3.5.1            lifecycle_1.0.0        stringr_1.4.0         
[49] locfit_1.5-9.1         munsell_0.5.0          cluster_2.1.0         
[52] AnnotationDbi_1.44.0   compiler_3.5.1         rlang_0.4.10          
[55] grid_3.5.1             RCurl_1.95-4.12        pbdZMQ_0.3-3          
[58] IRkernel_1.1.1         rstudioapi_0.10        htmlwidgets_1.5.1     
[61] labeling_0.3           bitops_1.0-6           base64enc_0.1-3       
[64] gtable_0.3.0           DBI_1.1.0              R6_2.4.1              
[67] gridExtra_2.3          knitr_1.26             dplyr_1.0.5           
[70] bit_1.1-14             utf8_1.1.4             Hmisc_4.3-0           
[73] stringi_1.4.3          IRdisplay_0.7.0        Rcpp_1.0.3            
[76] geneplotter_1.60.0     vctrs_0.3.7            rpart_4.1-15          
[79] acepack_1.4.1          tidyselect_1.1.0       xfun_0.18             
In [ ]: