Quickstart

This section gets you started quickly. The input and output formats are described in I/O File Formats, more usage examples are given in More Usage Examples, and each of the sub-modules is discussed in How it Works.

Running a Quick Test

Use the following command to run a quick test. All of the GenEpi outputs will be written to your current folder.

$ GenEpi -g example -p example -o ./

The progress is printed to the console:

step1: Down load UCSC Database. DONE!
step2: Estimate LD. DONE!
Warning of step3: .gen file should be sorted by chromosome and position
step3: Split by gene. DONE!
step4: Detect single gene epistasis. DONE!
step5: Detect cross gene epistasis. DONE! (Training score:0.63; 2-fold Test Score:0.61)
step6: Ensemble with covariates. DONE! (Training score:0.63; 2-fold Test Score:0.60)
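
The same quick test could also be launched from a Python script. The following is a minimal sketch, assuming the GenEpi executable is on your PATH; the helper names build_genepi_command and run_quick_test are hypothetical and are not part of GenEpi. Only the flags shown in the command above (-g, -p, -o) are used.

```python
import shutil
import subprocess


def build_genepi_command(genotype="example", phenotype="example", output_dir="./"):
    """Return the quick-test command as an argument list (hypothetical helper)."""
    return ["GenEpi", "-g", genotype, "-p", phenotype, "-o", output_dir]


def run_quick_test(output_dir="./"):
    """Run the quick test and return whatever GenEpi printed to stdout."""
    # Assumption: the GenEpi command-line tool is installed and on PATH.
    if shutil.which("GenEpi") is None:
        raise FileNotFoundError("GenEpi executable not found on PATH")
    completed = subprocess.run(
        build_genepi_command(output_dir=output_dir),
        check=True,           # raise CalledProcessError on a non-zero exit code
        capture_output=True,  # collect the progress messages instead of printing
        text=True,
    )
    return completed.stdout
```

Calling run_quick_test() should return the progress messages as one string, provided GenEpi writes them to standard output.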

GenEpi automatically generates three folders (snpSubsets, singleGeneResult, crossGeneResult) in the output path (arg: -o). The following tree shows the contents of the output folder. The main result table for epistasis is Result.csv, which you can find directly in the crossGeneResult folder.

./
├── GenEpi_Log_DATE-TIME.txt
├── crossGeneResult
│   ├── Classifier.pkl
│   ├── Classifier_Covariates.pkl
│   ├── Feature.csv
│   └── Result.csv
├── sample.LDBlock
├── sample.csv
├── sample.gen
├── sample_LDReduced.gen
├── singleGeneResult
│   ├── All_Logistic_k2.csv
│   ├── APOC1_Feature.csv
│   ├── APOC1_Result.csv
│   ├── APOE_Feature.csv
│   ├── APOE_Result.csv
│   ├── PVRL2_Feature.csv
│   ├── PVRL2_Result.csv
│   ├── TOMM40_Feature.csv
│   └── TOMM40_Result.csv
└── snpSubsets
    ├── APOC1_23.gen
    ├── APOE_11.gen
    ├── PVRL2_48.gen
    └── TOMM40_67.gen
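
To confirm that a run produced the layout shown above, a small check like the following may help. It is only a sketch: the function name missing_outputs is hypothetical, and it checks just the three folders and the main result table named in this section.

```python
from pathlib import Path

# Folder names taken from the output tree in this section.
EXPECTED_FOLDERS = ("snpSubsets", "singleGeneResult", "crossGeneResult")


def missing_outputs(output_dir="./"):
    """Return the expected folders/files that are absent from output_dir."""
    root = Path(output_dir)
    missing = [name for name in EXPECTED_FOLDERS if not (root / name).is_dir()]
    # The main result table lives inside crossGeneResult.
    if not (root / "crossGeneResult" / "Result.csv").is_file():
        missing.append("crossGeneResult/Result.csv")
    return missing
```

An empty list means that the three folders and Result.csv are all present in the output path.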

Interpreting the Main Result Table

Here are the contents of Result.csv, which lists the epistasis features selected by GenEpi.

RSID                         Weight   -Log10(χ2 p-value)  Odds Ratio  Genotype Frequency  Gene Symbol
rs157580_BB rs2238681_AA     0.9729   8.4002              9.3952      0.1044              TOMM40
rs449647_AA rs769449_AB      0.7065   8.0278              5.0877      0.2692              APOE
rs59007384_BB rs11668327_AA  1.0807   8.0158              12.0408     0.0824              TOMM40
rs283811_BB rs7254892_AA     1.0807   8.0158              12.0408     0.0824              PVRL2
rs429358_AA                  -0.7587  5.7628              0.1743      0.5962              APOE
rs73052335_AA rs429358_AA    -0.7289  5.6548              0.1867      0.5714              APOC1*APOE

Result.csv lists the statistical significance of the selected genetic features. The first column identifies each feature by its RSID and genotype (denoted as RSID_genotype); a pairwise epistasis feature is represented by two SNPs. The weights in the second column were extracted from the L1-regularized regression model. The third column gives the p-value of the χ2 test on a -log10 scale (quantitative tasks use Student's t-test instead). An odds ratio far from 1 indicates whether a feature is a potentially causal genotype (odds ratio above 1) or a potentially protective genotype (odds ratio below 1). Since a low genotype frequency may make the odds ratio unreliable, the genotype frequency is also listed in the table. The last column gives the gene in which the SNPs are located according to their genomic coordinates; a star sign (*) denotes epistasis between genes.
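
The reading rules above can be sketched in a few lines of Python. This is an illustration only: the rows are copied by hand from the table in this section rather than read from Result.csv (whose exact header names are not assumed here), the function name describe is hypothetical, and the frequency cut-off of 0.1 is an arbitrary value chosen for the example, not a GenEpi setting.

```python
# Rows copied from the table in this section:
# (feature, weight, -log10 p-value, odds ratio, genotype frequency, gene symbol)
ROWS = [
    ("rs157580_BB rs2238681_AA", 0.9729, 8.4002, 9.3952, 0.1044, "TOMM40"),
    ("rs449647_AA rs769449_AB", 0.7065, 8.0278, 5.0877, 0.2692, "APOE"),
    ("rs59007384_BB rs11668327_AA", 1.0807, 8.0158, 12.0408, 0.0824, "TOMM40"),
    ("rs283811_BB rs7254892_AA", 1.0807, 8.0158, 12.0408, 0.0824, "PVRL2"),
    ("rs429358_AA", -0.7587, 5.7628, 0.1743, 0.5962, "APOE"),
    ("rs73052335_AA rs429358_AA", -0.7289, 5.6548, 0.1867, 0.5714, "APOC1*APOE"),
]

MIN_FREQUENCY = 0.1  # hypothetical cut-off, chosen only for this example


def describe(row):
    """Summarize one row according to the reading rules in the text."""
    feature, weight, neg_log10_p, odds_ratio, frequency, gene = row
    snps = feature.split()  # one SNP: single feature; two SNPs: pairwise
    if odds_ratio > 1:
        direction = "potentially causal"
    elif odds_ratio < 1:
        direction = "potentially protective"
    else:
        direction = "neutral"
    return {
        "snps": snps,
        "pairwise": len(snps) == 2,
        "cross_gene": "*" in gene,        # a star joins two gene symbols
        "p_value": 10 ** (-neg_log10_p),  # undo the -log10 transform
        "direction": direction,
        "low_frequency": frequency < MIN_FREQUENCY,
    }


if __name__ == "__main__":
    for row in ROWS:
        info = describe(row)
        flag = " (low frequency, treat odds ratio with care)" if info["low_frequency"] else ""
        print(f"{row[0]}: {info['direction']}, p = {info['p_value']:.2e}{flag}")
```

Under these assumptions, the two features with a genotype frequency of 0.0824 are the ones flagged as low frequency, which is where their large odds ratios deserve the most caution.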