scRNA-seq数据比对及定量¶

本文描述了分子细胞科学卓越中心生物信息平台所开发的scRNA-seq分析流程的比对子模块，该模块基于kb_python软件，支持很多单细胞测序技术如10X genomic(V1-V3)、CELSEQ(V1-V3)、DROPSEQ、INDROPS、SCRUBSEQ、SURECELL。

分析流程示意图¶

该模块汇集了两个方面的内容：比对和基因定量。该模块输入原始fastq文件、参考基因组以及注释文件，使用软件kb_python完成分析,该软件包装了Kallisto和bustools两个软件，分别用来比对和定量。如下图所示。

数据下载¶

wget -c http://cf.10xgenomics.com/samples/cell-exp/3.0.0/pbmc_1k_v3/pbmc_1k_v3_fastqs.tar
mv pbmc_1k_v3_fastqs.tar scRNAseq/HumanPbmc3k/Fastq/pbmc_1k_v3_fastqs.tar
tar xvf scRNAseq/HumanPbmc3k/Fastq/pbmc_1k_v3_fastqs.tar

print(system("ls -la scRNAseq/HumanPbmc1k/Fastq/pbmc_1k_v3_fastqs", intern = TRUE)[c(-1,-2,-3)])

[1] "-rw-r--r-- 1 local1 bioinformatics  255933715 Jul  7 15:04 pbmc_1k_v3_S1_L001_I1_001.fastq.gz"
[2] "-rw-r--r-- 1 local1 bioinformatics  753851810 Jul  7 15:04 pbmc_1k_v3_S1_L001_R1_001.fastq.gz"
[3] "-rw-r--r-- 1 local1 bioinformatics 1772725195 Jul  7 15:04 pbmc_1k_v3_S1_L001_R2_001.fastq.gz"
[4] "-rw-r--r-- 1 local1 bioinformatics  254507460 Jul  7 15:04 pbmc_1k_v3_S1_L002_I1_001.fastq.gz"
[5] "-rw-r--r-- 1 local1 bioinformatics  748651163 Jul  7 15:04 pbmc_1k_v3_S1_L002_R1_001.fastq.gz"
[6] "-rw-r--r-- 1 local1 bioinformatics 1763628623 Jul  7 15:04 pbmc_1k_v3_S1_L002_R2_001.fastq.gz"

软件包准备¶

library("Matrix")
library("ggplot2")
library("scico")
library("tibble")
library("DropletUtils")
library("Seurat")

read_count_output <- function(dir, name) {
  dir <- normalizePath(dir, mustWork = TRUE)
  m <- readMM(paste0(dir, "/", name, ".mtx"))
  m <- Matrix::t(m)
  m <- as(m, "dgCMatrix")
  # The matrix read has cells in rows
  ge <- ".genes.txt"
  genes <- readLines(file(paste0(dir, "/", name, ge)))
  barcodes <- readLines(file(paste0(dir, "/", name, ".barcodes.txt")))
  colnames(m) <- barcodes
  rownames(m) <- genes
  return(m)
}

基因定量¶

如上文述所，我们推荐使用Kallisto|bustools进行定量，该方法快速准确，是scRNAseq数据基因定量的首选。

kb count -i /sibcb2/bioinformatics/iGenome/Kallisto/GENECODE_hg38/transcriptome.idx \
    -g /sibcb2/bioinformatics/iGenome/Kallisto/GENECODE_hg38/tx2g_symbol.txt \
    -x 10XV3 \
    -o scRNAseq/HumanPbmc1k/kb \
    --filter bustools \
    -t 2 \
    scRNAseq/HumanPbmc1k/Fastq/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L001_R1_001.fastq.gz \
    scRNAseq/HumanPbmc1k/Fastq/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L001_R2_001.fastq.gz \
    scRNAseq/HumanPbmc1k/Fastq/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L002_R1_001.fastq.gz \
    scRNAseq/HumanPbmc1k/Fastq/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L002_R2_001.fastq.gz

运行时间大约需要10分钟，结果文件目录大致如下：

    |-- 10xv3_whitelist.txt
    |-- counts_filtered
    |   |-- cells_x_genes
    |   |-- cells_x_genes.barcodes.txt
    |   |-- cells_x_genes.genes.txt
    |   `-- cells_x_genes.mtx
    |-- counts_unfiltered
    |   |-- cells_x_genes
    |   |-- cells_x_genes.barcodes.txt
    |   |-- cells_x_genes.genes.txt
    |   `-- cells_x_genes.mtx
    |-- filter_barcodes.txt
    |-- inspect.json
    |-- matrix.ec
    |-- output.bus
    |-- output.filtered.bus
    |-- output.unfiltered.bus
    |-- run_info.json
    `-- transcripts.txt

产生的结果文件大致如下：

10xv3_whitelist.txt：10X genomic V3的barcodes白名单文件
counts_filtered：根据barcodes白名单过滤后的表达矩阵文件，以MEF(Market Exchange Formats)格式存储
counts_unfiltered：原始的表达矩阵文件，以MEF格式存储
filter_barcodes.txt：根据barcodes白名单过滤后的barcodes文件
inspect.json：定量结果的概要信息
matrix.ec：表达量的矩阵等价类文件
output.bus：原始的BUS(Barcode,UMI,Set format)格式比对结果文件
output.filtered.bus：barcodes矫正且根据白名单过滤后的bus格式比对结果文件
output.unfiltered.bus：barcodes矫正未过滤的bus格式比对结果文件
run_info.json：比对结果的概要信息
transcripts.txt：定量生成的转录本文件

res_mat <- read_count_output("scRNAseq/HumanPbmc1k/kb/counts_unfiltered", name = "cells_x_genes")
dim(res_mat)