RNA-seq数据及质量控制汇集

本文描述了bulk RNA-seq分析流程中数据汇集子模块,该模块收集个样本质量控制及基因定量的信息,是后续差异表达分析、聚类、分类、功能富集分析的基础。该模块用R语言编写。

分析流程示意图

如下图所示,该模块汇集了两个方面的信息:质量控制和基因定量。质量控制主要来自于STAR比对及后续的RNASeQC分析,转录本定量来自Kallisto。

Drawing

软件包准备

In [2]:
suppressWarnings(library("ggplot2"))
suppressWarnings(library("ggpubr"))
suppressWarnings(library("tximport"))
suppressWarnings(library("reshape2"))

tx2g = "/sibcb2/bioinformatics/iGenome/STAR/GENCODE/human_hg38/ID/tx2g.txt"
runFolder = "/sibcb2/bioinformatics/WPS/Run_RNAseq/workflow/"
outFolder = "RNAseq/DataSummary"
studyTag = "WPS"

if(!dir.exists(outFolder))
    dir.create(outFolder)

读取序列比对信息

In [3]:
SRX = list.files(runFolder)
Ns = length(SRX)
cM = matrix(NA, 4, Ns)
for(i in 1:Ns){
	tag = SRX[i]
	fileIn = paste0(runFolder, "/", tag, "/STAR/", tag, "_STAR_QC.txt")
	text = readLines(fileIn)
	totalReads = as.numeric(gsub("\t", "", strsplit(grep("Number of input reads", text, value = TRUE), "\\|")[[1]][2]))
	umapReads = as.numeric(gsub("\t", "", strsplit(grep("Uniquely mapped reads number", text, value = TRUE), "\\|")[[1]][2]))
	multiReads = as.numeric(gsub("\t", "", strsplit(grep("Number of reads mapped to multiple loci", text, value = TRUE), "\\|")[[1]][2]))
	chimericReads = as.numeric(gsub("\t", "", strsplit(grep("Number of chimeric reads", text, value = TRUE), "\\|")[[1]][2]))
	unmappedReads = totalReads - umapReads - multiReads - chimericReads
	cM[,i] = c(umapReads, multiReads, chimericReads, unmappedReads)
}
dimnames(cM) = list(c("unique_mapping", "multi_mapping", "chimericReads", "unmapped"), SRX)
write.table(cM, file = paste0(outFolder,"/Alignment_summary.txt"), row.names = TRUE, col.names = TRUE, sep = "\t")

合并的结果保存在Alignment_summary.txt文件中,其中的内容示例如下:

In [4]:
t(cM[,1:8])
A matrix: 8 x 4 of type dbl
unique_mappingmulti_mappingchimericReadsunmapped
1302003-C2740463510664780319711
1302003-N25366544 7249550398773
1436156-C23897909 9156160520952
1436156-N25697479 8173960477152
1436468-C2878681414390510352613
1436468-N2986913913657270692314
1437862-C26464389 9134120519750
1437862-N25809821 9118270550134

读取比对位置分布信息

In [5]:
mM = matrix(NA, 3, Ns)
for(i in 1:Ns){
	tag = SRX[i]
	fileIn = paste0(runFolder, "/", tag, "/report/", tag, ".bam.metrics.tsv")
	text = readLines(fileIn)
	ExonicReads = as.numeric(gsub("\t", "", strsplit(grep("Exonic Reads", text, value = TRUE), "\t")[[1]][2]))
	IntronicReads = as.numeric(gsub("\t", "", strsplit(grep("Intronic Reads", text, value = TRUE), "\t")[[1]][2]))
	IntergenicReads = as.numeric(gsub("\t", "", strsplit(grep("Intergenic Reads", text, value = TRUE), "\t")[[1]][2]))
	mM[,i] = c(ExonicReads, IntronicReads, IntergenicReads)
}
dimnames(mM) = list(c("ExonicReads", "IntronicReads", "IntergenicReads"), SRX)
write.table(cM, file = paste0(outFolder,"/RNASeQC_summary.txt"), row.names = TRUE, col.names = TRUE, sep = "\t")

合并的结果保存在RNASeQC_summary.txt文件中,其中的内容示例如下:

In [6]:
t(mM[,1:8])
A matrix: 8 x 3 of type dbl
ExonicReadsIntronicReadsIntergenicReads
1302003-C4720186748471751134482
1302003-N428849414939891 943846
1436156-C3952384953971181251224
1436156-N450069723778992 906476
1436468-C5134393230875232692702
1436468-N5460178030120531081514
1437862-C4160905975312591648443
1437862-N447912964374256 892390

读取基因定量结果

In [9]:
files = paste0(runFolder, "/", SRX, "/kallisto/", SRX, ".tsv")
names(files) = SRX
tx2gene = read.table(file = tx2g, row.names = NULL, header = TRUE, sep = "\t")
# txi <- tximport(files, type = "kallisto", tx2gene = tx2gene, ignoreAfterBar = FALSE)

geneString = unique(paste(tx2gene$ENSG, tx2gene$Symbol))
geneAnn = data.frame(geneString)
rownames(geneAnn) = sapply(strsplit(geneString, " "), function(z) z[1])

findType = function(z) paste0(sort(unique(z)), collapse = " ")
aggT = aggregate(tx2gene$mType, by = list(tx2gene$ENSG),FUN="findType")
rownames(aggT) = as.character(aggT[,1])
geneAnn$mType = aggT[rownames(geneAnn),2]
geneAnn = geneAnn[rownames(txi$abundance),]

基因定量的结果保存在txi中:

In [10]:
names(txi)
  1. 'abundance'
  2. 'counts'
  3. 'length'
  4. 'countsFromAbundance'
In [11]:
txi$abundance[1:4, 1:4]
A matrix: 4 x 4 of type dbl
1302003-C1302003-N1436156-C1436156-N
ENSG00000000003.15 16.12379211.423050028.193645012.969580
ENSG00000000005.6 0.000000 0.0628489 0.0551964 0.579678
ENSG00000000419.12101.08748955.551720068.372538045.048764
ENSG00000000457.14 8.21967910.6502872 9.3021600 8.135405

基因注释保存在geneAnn中:

In [12]:
geneAnn[1:4, ]
A data.frame: 4 x 2
geneStringmType
<fct><chr>
ENSG00000000003.15ENSG00000000003.15 TSPAN6processed_transcript protein_coding
ENSG00000000005.6ENSG00000000005.6 TNMD processed_transcript protein_coding
ENSG00000000419.12ENSG00000000419.12 DPM1 processed_transcript protein_coding
ENSG00000000457.14ENSG00000000457.14 SCYL3 processed_transcript protein_coding
In [56]:
assign("WPS_raw", list(geneAnn = geneAnn, txi = txi))
save(list = "WPS_raw", file = paste0(outFolder, "/WPS_raw.RData"))