Accurate differential analysis of transcription factor activity from gene expression

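Predicting transcription factor activity from gene expression profiles is difficult, in large part because the target genes of different transcription factors overlap substantially. EPEE was developed to address this problem; see the full paper for details of the method.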

Installation

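The packages this program depends on are badly outdated, so for convenience we installed it with Docker on the GPU nodes. If you do not have login access to the GPU queue, contact the computing platform to have it enabled.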

ssh gpu002 -q gup.q

nvidia-docker run --rm -e NVIDIA_VISIBLE_DEVICES=all -v /sibcb2/bioinformatics/Projects/EPEE:/home epee:gpu python /home/script/run_epee.py -h

usage: run_epee.py [-h] -a CONDITIONA -b CONDITIONB -na NETWORKA -nb NETWORKB
                   [-o OUTPUT] [-reg1 LREGULARIZATION] [-reg2 GREGULARIZATION]
                   [-s STEP] [-c CONDITIONING] [-r RUNS] [-i ITERATIONS]
                   [-ag AGGREGATION] [-n NORMALIZE] [-m MODEL] [-v VERBOSE]
                   [-eval EVALUATE] [-pr PREFIX] [-w] [-mp] [-null] [-d SEED]
                   [-p PERTURB] [-sg]

optional arguments:
  -h, --help            show this help message and exit
  -a CONDITIONA, --conditiona CONDITIONA
                        RNA-seq data for Condition A
  -b CONDITIONB, --conditionb CONDITIONB
                        RNA-seq data for Condition B
  -na NETWORKA, --networka NETWORKA
                        Network for condition A
  -nb NETWORKB, --networkb NETWORKB
                        Network for condition B
  -o OUTPUT, --output OUTPUT
                        output directory
  -reg1 LREGULARIZATION, --lregularization LREGULARIZATION
                        lasso regularization parameter
  -reg2 GREGULARIZATION, --gregularization GREGULARIZATION
                        graph contrained regularization parameter
  -s STEP, --step STEP  optimizer learning-rate
  -c CONDITIONING, --conditioning CONDITIONING
                        Weight for the interactions not known
  -r RUNS, --runs RUNS  Number of independent runs
  -i ITERATIONS, --iterations ITERATIONS
                        Number of iterations
  -ag AGGREGATION, --aggregation AGGREGATION
                        Method for aggregating runs. Default: "sum" Valid
                        options: {"mean", "median", "sum"}
  -n NORMALIZE, --normalize NORMALIZE
                        Weight normalization strategy. Default:"minmax" Valid
                        options: {"minmax", "log", "log10", "no"}
  -m MODEL, --model MODEL
                        Model regularization choice. Default: "epee-gcl" Valid
                        options: {"epee-gcl","epee-l","no-penalty"
  -v VERBOSE, --verbose VERBOSE
                        logging info levels 10, 20, or 30
  -eval EVALUATE, --evaluate EVALUATE
                        Evaluation mode available for Th1, Th2, Th17, Bmem,
                        COAD, and AML
  -pr PREFIX, --prefix PREFIX
                        Add prefix to the log
  -w, --store_weights   Store all the inferred weights
  -mp, --multiprocess   multiprocess the calculation of perturb and regulator
                        scores
  -null, --null         Generate null scores by label permutation
  -d SEED, --seed SEED  Starting seed number
  -p PERTURB, --perturb PERTURB
                        True label perturb scores. Required when running
                        permutations for null model
  -sg, --shuffle_genes  Generate null scores by gene permutation

Test run

Running the program requires two kinds of input files:

Gene expression profiles:

head data/COAD_tumor.txt

gene    value
A1BG    6.6
A1CF    0
A2BP1   0
A2LD1   6.6
A2ML1   1
A2M     14
A4GALT  8.4
A4GNT   1.1
AAA1    0

head data/COAD_normal.txt

gene    value
A1BG    6.3
A1CF    0
A2BP1   0
A2LD1   6.1
A2ML1   0.68
A2M     16
A4GALT  8.6
A4GNT   2.3
AAA1    0
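Before launching a run, it can help to confirm that both expression files parse cleanly and cover the same genes. The following is a minimal sketch, assuming the files are whitespace- or tab-delimited with a single header row, as the `head` output above suggests. The helper names `read_expression` and `compare_gene_sets` are hypothetical and are not part of EPEE.

```python
import io


def read_expression(handle):
    """Parse a delimited expression table with a header row into {gene: value}.

    Assumes the first column is the gene symbol and the second a numeric value.
    """
    profile = {}
    next(handle, None)  # skip the header row ("gene value")
    for line in handle:
        fields = line.split()
        if len(fields) < 2:
            continue  # ignore blank or malformed lines
        profile[fields[0]] = float(fields[1])
    return profile


def compare_gene_sets(profile_a, profile_b):
    """Return (shared, only_in_a, only_in_b) gene lists, each sorted."""
    genes_a, genes_b = set(profile_a), set(profile_b)
    return (sorted(genes_a & genes_b),
            sorted(genes_a - genes_b),
            sorted(genes_b - genes_a))


if __name__ == "__main__":
    # Small in-memory stand-ins for data/COAD_tumor.txt and data/COAD_normal.txt.
    tumor = read_expression(io.StringIO("gene\tvalue\nA1BG\t6.6\nA2M\t14\n"))
    normal = read_expression(io.StringIO("gene\tvalue\nA1BG\t6.3\nA2M\t16\n"))
    shared, only_tumor, only_normal = compare_gene_sets(tumor, normal)
    print(len(shared), only_tumor, only_normal)
```

For real files, pass an open file object instead, for example `read_expression(open("data/COAD_tumor.txt"))`.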

The network files can be downloaded from the following web page: FANTOM5_individual_networks.tar and Network_compendium.zip.

zcat 20_gastrointestinal_system.txt.gz |head

FOXO3   MBP     2.16216012E-3
ALX1    CD209   2.06986338E-3
ZIC4    PDLIM7  1.25342086E-2
PAX8    NKD2    1.42917932E-3
MAFF    PRC1    3.17588671E-2
DBX1    LGALSL  5.93835673E-2
HOXC10  IRS1    3.02063335E-4
TCF12   FUCA2   9.88682006E-3
TFAP2A  PLCD1   8.69630824E-4
ZSCAN4  GCDH    2.7480217E-3
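Each network line reads as a transcription factor, a target gene, and an edge weight. As an illustration of the target overlap that makes activity inference hard, the sketch below groups targets by transcription factor and computes a Jaccard overlap between two target sets. It assumes the three-column layout shown above; `read_network`, `load_network`, and `jaccard` are hypothetical helper names and are not part of EPEE.

```python
import gzip
from collections import defaultdict


def read_network(lines):
    """Group target genes by transcription factor from 'TF target weight' lines."""
    targets = defaultdict(set)
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        tf, target, _weight = fields
        targets[tf].add(target)
    return dict(targets)


def load_network(path):
    """Read a gzip-compressed network file such as 20_gastrointestinal_system.txt.gz."""
    with gzip.open(path, "rt") as handle:
        return read_network(handle)


def jaccard(set_a, set_b):
    """Jaccard index: size of the intersection divided by size of the union."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


if __name__ == "__main__":
    # Toy edges for illustration only; they are not taken from the real network.
    demo = ["TF1\tGENE1\t0.01", "TF1\tGENE2\t0.02", "TF2\tGENE1\t0.03"]
    network = read_network(demo)
    print(jaccard(network["TF1"], network["TF2"]))
```

A high Jaccard value between two transcription factors means their expression signatures are hard to tell apart from target genes alone.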

Run the program as follows:

nvidia-docker run --rm -e NVIDIA_VISIBLE_DEVICES=all \
    -v /sibcb2/bioinformatics/Projects/EPEE:/home \
    -v /sibcb2/bioinformatics/iGenome/FANTOM5_Network/Network_compendium/Tissue-specific_regulatory_networks_FANTOM5-v1:/network \
    epee:gpu python /home/script/run_epee.py \
    -a /home/data/COAD_tumor.txt \
    -b /home/data/COAD_normal.txt \
    -na /network/32_high-level_networks/20_gastrointestinal_system.txt.gz \
    -nb /network/32_high-level_networks/20_gastrointestinal_system.txt.gz \
    -o /home/res2/ \
    -pr COAD
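When the same analysis has to be repeated for several cohorts, the command above can be assembled programmatically. This is a minimal sketch that only builds the argument list from the flags and paths shown above; `build_epee_command` is a hypothetical helper, not part of EPEE, and nothing is executed until the list is passed to `subprocess.run`.

```python
import shlex


def build_epee_command(condition_a, condition_b, network_a, network_b,
                       output_dir, prefix, mounts=(), image="epee:gpu"):
    """Assemble the nvidia-docker invocation of run_epee.py as an argument list.

    `mounts` is a sequence of (host_path, container_path) pairs for -v.
    """
    cmd = ["nvidia-docker", "run", "--rm", "-e", "NVIDIA_VISIBLE_DEVICES=all"]
    for host_path, container_path in mounts:
        cmd += ["-v", f"{host_path}:{container_path}"]
    cmd += [image, "python", "/home/script/run_epee.py",
            "-a", condition_a, "-b", condition_b,
            "-na", network_a, "-nb", network_b,
            "-o", output_dir, "-pr", prefix]
    return cmd


if __name__ == "__main__":
    network = "/network/32_high-level_networks/20_gastrointestinal_system.txt.gz"
    cmd = build_epee_command(
        "/home/data/COAD_tumor.txt", "/home/data/COAD_normal.txt",
        network, network, "/home/res2/", "COAD",
        mounts=[("/sibcb2/bioinformatics/Projects/EPEE", "/home")],
    )
    # Print a copy-pasteable shell command for inspection.
    print(shlex.join(cmd))
```

To launch the run from Python, pass the list to `subprocess.run(cmd, check=True)` on a GPU node.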

Interpreting the results

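The original authors removed the test data, so we extracted COAD tumor and adjacent normal tissue data from TCGA and saved them as COAD_tumor.txt and COAD_normal.txt. The run produces four result files in the scores directory: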

res/COAD_epee-gcl_0.01_0.01_0/
|-- log.txt
|-- model
|   |-- loss1_arr_y1.txt
|   |-- loss2_arr_y2.txt
|   `-- loss_runs.txt
`-- scores
    |-- all_perturb_scores.txt
    |-- all_regulator_scores.txt
    |-- perturb_scores.txt
    `-- regulator_scores.txt

Of these, regulator_scores.txt lists the top-ranked transcription factors:

    gene    score
0   HOXC13  0.6924611278809607
1   TP73    0.6397721986286342
2   PITX2   0.5111408117227256
3   POU4F1  0.4897183245047927
4   POU6F2  0.48886519484221935
5   NKX6-1  0.48417521407827735
6   DMBX1   0.4784999266266823
7   ONECUT1 0.4712283406406641
8   BHLHA15 0.46911837393417954
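The score table can be post-processed with a few lines of Python. The sketch below assumes the layout shown above (a row index, a gene symbol, and a score, separated by whitespace) and assumes that larger scores rank higher, as the listing suggests. `read_scores` and `top_regulators` are hypothetical helper names and are not part of EPEE.

```python
def read_scores(lines):
    """Parse 'index gene score' rows into a list of (gene, score) tuples."""
    scores = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank lines
        try:
            score = float(fields[-1])
        except ValueError:
            continue  # header row: the last field is the text "score"
        scores.append((fields[-2], score))
    return scores


def top_regulators(scores, n=5):
    """Return the n entries with the highest scores, in descending order."""
    return sorted(scores, key=lambda item: item[1], reverse=True)[:n]


if __name__ == "__main__":
    # The first three rows of the listing above.
    rows = [
        "    gene    score",
        "0   HOXC13  0.6924611278809607",
        "1   TP73    0.6397721986286342",
        "2   PITX2   0.5111408117227256",
    ]
    for gene, score in top_regulators(read_scores(rows), n=2):
        print(gene, round(score, 3))
```

For a real run, pass an open file object for `scores/regulator_scores.txt` to `read_scores` instead of the in-memory rows.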