Processing scRNA-data with scanpy

Scanpy是非常流行的scRNA-seq处理软件之一。本文档复制了Pachter实验室的分析示例及官方文档,请参照原文获取更多信息。

数据下载

wget http://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz -O data/pbmc3k_filtered_gene_bc_matrices.tar.gz

加载Python包

In [1]:
import anndata
import matplotlib
import matplotlib.pyplot as pyplot
import numpy as np
import pandas as pd
import scanpy as sc
import os
import warnings

加载数据

In [2]:
adata = sc.read_10x_mtx('scRNAseq/HumanPbmc3k/filtered_gene_bc_matrices/hg19', var_names='gene_symbols')
adata.var_names_make_unique()

质量控制:Knee Plot

In [3]:
matplotlib.rcParams.update({'font.size': 16})
warnings.filterwarnings('ignore')

expected_num_cells =  min(adata.obs.shape[0] - 1, 3000)
knee = np.sort((np.array(adata.X.sum(axis=1))).flatten())[::-1]
fig = pyplot.figure(figsize=(8, 8))
ax = fig.add_subplot(111)
ax.loglog(knee, range(len(knee)), linewidth=5, color="g")
ax.axvline(x=knee[expected_num_cells], linewidth=3, color="k")
ax.axhline(y=expected_num_cells, linewidth=3, color="k")
ax.set_xlabel("UMI Counts")
ax.set_ylabel("Set of Barcodes")
pyplot.grid(True, which="both")

质量控制:文库饱和度

In [4]:
fig = pyplot.figure(figsize=(7, 7))
ax = fig.add_subplot(111)

x = np.asarray(adata.X.sum(axis=1))[:,0]
y = np.asarray(np.sum(adata.X>0, axis=1))[:,0]

ax.scatter(x, y, color="green", alpha=0.25)
ax.set_xlabel("UMI Counts")
ax.set_ylabel("Genes Detected")
ax.set_xscale('log')
ax.set_yscale('log', nonposy='clip')
ax.set_xlim((0.5, 10000))
ax.set_ylim((0.5, 10000))
pyplot.grid(True, which="both")

质控:表达量最高的基因

In [5]:
sc.set_figure_params(dpi_save=300, fontsize = 14)
sc.pl.highest_expr_genes(adata, n_top=20, )

质控:基因、细胞水平的过滤

In [6]:
adata.var['mt'] = adata.var_names.str.startswith('MT-')
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)
In [7]:
adata.obs
Out[7]:
n_genes_by_counts total_counts total_counts_mt pct_counts_mt
AAACATACAACCAC-1 781 2421.0 73.0 3.015283
AAACATTGAGCTAC-1 1352 4903.0 186.0 3.793596
AAACATTGATCAGC-1 1131 3149.0 28.0 0.889171
AAACCGTGCTTCCG-1 960 2639.0 46.0 1.743085
AAACCGTGTATGCG-1 522 981.0 12.0 1.223242
... ... ... ... ...
TTTCGAACTCTCAT-1 1155 3461.0 73.0 2.109217
TTTCTACTGAGGCA-1 1227 3447.0 32.0 0.928343
TTTCTACTTCCTCG-1 622 1684.0 37.0 2.197150
TTTGCATGAGAGGC-1 454 1024.0 21.0 2.050781
TTTGCATGCCTCAC-1 724 1985.0 16.0 0.806045

2700 rows × 4 columns

In [8]:
sc.pl.violin(adata, keys = ['pct_counts_mt'], multi_panel=False, rotation=60)
In [9]:
sc.pp.filter_cells(adata, min_genes = 200)
sc.pp.filter_genes(adata, min_cells = 3)
adata = adata[adata.obs.n_genes_by_counts < 2500, :]
adata = adata[adata.obs['pct_counts_mt'] < 5, :]
adata
Out[9]:
View of AnnData object with n_obs × n_vars = 2638 × 13714
    obs: 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'n_genes'
    var: 'gene_ids', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells'
In [10]:
sc.pl.violin(adata, ['pct_counts_mt'], jitter=0.4, multi_panel=False, rotation=60)

均一化

In [11]:
adata.counts = adata.X
sc.pp.normalize_total(adata, target_sum=1e4, inplace=True)
sc.pp.log1p(adata)

高可变基因

In [12]:
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
sc.pl.highly_variable_genes(adata)

主成分分析

In [13]:
sc.tl.pca(adata, svd_solver='arpack')
sc.pl.pca_variance_ratio(adata, log=True)
In [14]:
sc.pl.pca(adata, color='CST3')

降维

In [15]:
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)
sc.tl.umap(adata)
In [16]:
sc.pl.umap(adata, color=['CST3', 'NKG7', 'PPBP'])

细胞聚类

In [17]:
sc.tl.leiden(adata)
In [18]:
sc.pl.umap(adata, color=['leiden', 'CST3', 'NKG7'])

差异表达基因

In [19]:
sc.tl.rank_genes_groups(adata, 'leiden', method='logreg')
sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False)
In [20]:
sc.pl.violin(adata, ['CST3', 'NKG7', 'PPBP'], groupby='leiden')
In [21]:
sc.pl.rank_genes_groups_heatmap(adata, n_genes=3, show_gene_labels=True, cmap='bwr')
WARNING: dendrogram data not found (using key=dendrogram_leiden). Running `sc.tl.dendrogram` with default parameters. For fine tuning it is recommended to run `sc.tl.dendrogram` independently.

细胞群体的手工注释

In [22]:
sc.pl.umap(adata, color='leiden', legend_loc='on data', title='', frameon=False)
In [24]:
marker_genes = ['IL7R', 'CD79A', 'MS4A1', 'CD8A', 'CD8B', 'LYZ', 'CD14',
                'LGALS3', 'S100A8', 'GNLY', 'NKG7', 'KLRB1',
                'FCGR3A', 'MS4A7', 'FCER1A', 'CST3', 'PPBP']
sc.pl.dotplot(adata, marker_genes, groupby='leiden', dendrogram = False)
In [46]:
old_to_new = {
    '0': 'CD4 T', 
    '1': 'CD4 T', 
    '2': 'CD14 Monocytes',
    '3': 'B', 
    '4': 'CD8 T',
    '5': 'FCGR3A Monocytes', 
    '6': 'NK',
    '7': 'Dendritic', 
    '8': 'Megakaryocytes'}

adata.obs['new_clusters'] = adata.obs['leiden'].map(old_to_new).astype('category')

sc.pl.umap(adata, color='new_clusters', legend_loc='on data', title='', frameon=False)
In [50]:
sc.pl.dotplot(adata, marker_genes, groupby='new_clusters', dendrogram = True)
WARNING: dendrogram data not found (using key=dendrogram_['new_clusters']). Running `sc.tl.dendrogram` with default parameters. For fine tuning it is recommended to run `sc.tl.dendrogram` independently.
In [48]:
adata
Out[48]:
AnnData object with n_obs × n_vars = 2638 × 13714
    obs: 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'n_genes', 'leiden', 'new_clusters'
    var: 'gene_ids', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'log1p', 'hvg', 'pca', 'neighbors', 'umap', 'leiden', 'leiden_colors', 'rank_genes_groups', 'dendrogram_leiden', "dendrogram_['leiden']", 'new_clusters_colors'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'distances', 'connectivities'
In [49]:
adata.write('scRNAseq/HumanPbmc3k/HumanPbmc3k.h5ad', compression='gzip')