Quickstart
This section gets you started quickly. The input and output files are described in I/O File Formats, more usage examples can be found in More Usage Examples, and each of the sub-modules is discussed in How it Works.
Running a Quick Test
Use the following command to run a quick test. All GenEpi outputs will be written to your current folder.
$ GenEpi -g example -p example -o ./
Progress is printed to the console:
step1: Down load UCSC Database. DONE!
step2: Estimate LD. DONE!
Warning of step3: .gen file should be sorted by chromosome and position
step3: Split by gene. DONE!
step4: Detect single gene epistasis. DONE!
step5: Detect cross gene epistasis. DONE! (Training score:0.63; 2-fold Test Score:0.61)
step6: Ensemble with covariates. DONE! (Training score:0.63; 2-fold Test Score:0.60)
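If you want to record the reported scores programmatically, the progress lines can be parsed with a regular expression. The following is a minimal sketch that assumes the console text has been captured as a string and that score-reporting lines follow exactly the format shown above; `parse_scores` and `SCORE_LINE` are hypothetical names and are not part of GenEpi.

```python
import re

# Matches progress lines that end with a score report, for example:
# "step5: Detect cross gene epistasis. DONE! (Training score:0.63; 2-fold Test Score:0.61)"
# The line format is taken from the example output above (an assumption).
SCORE_LINE = re.compile(
    r"^step(?P<step>\d+): (?P<name>.+?)\. DONE! "
    r"\(Training score:(?P<train>[\d.]+); "
    r"(?P<folds>\d+)-fold Test Score:(?P<test>[\d.]+)\)$"
)


def parse_scores(log_text):
    """Return one dict per progress line that reports scores."""
    results = []
    for line in log_text.splitlines():
        match = SCORE_LINE.match(line.strip())
        if match:
            results.append({
                "step": int(match["step"]),
                "name": match["name"],
                "training_score": float(match["train"]),
                "folds": int(match["folds"]),
                "test_score": float(match["test"]),
            })
    return results


console_output = """\
step4: Detect single gene epistasis. DONE!
step5: Detect cross gene epistasis. DONE! (Training score:0.63; 2-fold Test Score:0.61)
step6: Ensemble with covariates. DONE! (Training score:0.63; 2-fold Test Score:0.60)
"""

for entry in parse_scores(console_output):
    print(entry["step"], entry["name"], entry["training_score"], entry["test_score"])
```

Lines without a score report (steps 1 to 4 and the warning) are skipped, so the function returns only the entries for steps 5 and 6 in this example.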
GenEpi automatically generates three folders (snpSubsets, singleGeneResult, crossGeneResult) in the output path (arg: -o). The tree below shows the contents of the output folder. The main epistasis result table is Result.csv in the crossGeneResult folder.
./
├── GenEpi_Log_DATE-TIME.txt
├── crossGeneResult
│ ├── Classifier.pkl
│ ├── Classifier_Covariates.pkl
│ ├── Feature.csv
│ └── Result.csv
├── sample.LDBlock
├── sample.csv
├── sample.gen
├── sample_LDReduced.gen
├── singleGeneResult
│ ├── All_Logistic_k2.csv
│ ├── APOC1_Feature.csv
│ ├── APOC1_Result.csv
│ ├── APOE_Feature.csv
│ ├── APOE_Result.csv
│ ├── PVRL2_Feature.csv
│ ├── PVRL2_Result.csv
│ ├── TOMM40_Feature.csv
│ └── TOMM40_Result.csv
└── snpSubsets
  ├── APOC1_23.gen
  ├── APOE_11.gen
  ├── PVRL2_48.gen
  └── TOMM40_67.gen
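After a run, it can be handy to confirm that the expected folders exist and to locate the main result table. The sketch below relies only on the folder and file names shown in the tree above. `summarize_output` is a hypothetical helper, and treating the text before the last underscore of each snpSubsets file name as the gene symbol is an assumption based on the example names.

```python
from pathlib import Path

# Folder names taken from the example output tree.
EXPECTED_FOLDERS = ("snpSubsets", "singleGeneResult", "crossGeneResult")


def summarize_output(output_dir):
    """Report missing folders, the main result table, and the genes found."""
    root = Path(output_dir)
    missing = [name for name in EXPECTED_FOLDERS if not (root / name).is_dir()]

    result_table = root / "crossGeneResult" / "Result.csv"

    # Files such as "APOE_11.gen": the part before the last underscore is
    # assumed to be the gene symbol.
    genes = sorted(
        path.stem.rsplit("_", 1)[0]
        for path in (root / "snpSubsets").glob("*.gen")
    )

    return {
        "missing_folders": missing,
        "result_table": result_table if result_table.is_file() else None,
        "genes": genes,
    }


summary = summarize_output("./")
print(summary["missing_folders"])
print(summary["result_table"])
print(summary["genes"])
```

If the quick test was run with `-o ./`, calling the helper on the current folder should report no missing folders and return the path to crossGeneResult/Result.csv.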
Interpreting the Main Result Table
Below are the contents of Result.csv, which lists the epistasis features selected by GenEpi.
RSID | Weight | -Log10(χ2 p-value) | Odds Ratio | Genotype Frequency | Gene Symbol |
---|---|---|---|---|---|
rs157580_BB rs2238681_AA | 0.9729 | 8.4002 | 9.3952 | 0.1044 | TOMM40 |
rs449647_AA rs769449_AB | 0.7065 | 8.0278 | 5.0877 | 0.2692 | APOE |
rs59007384_BB rs11668327_AA | 1.0807 | 8.0158 | 12.0408 | 0.0824 | TOMM40 |
rs283811_BB rs7254892_AA | 1.0807 | 8.0158 | 12.0408 | 0.0824 | PVRL2 |
rs429358_AA | -0.7587 | 5.7628 | 0.1743 | 0.5962 | APOE |
rs73052335_AA rs429358_AA | -0.7289 | 5.6548 | 0.1867 | 0.5714 | APOC1*APOE |
Result.csv lists the statistical significance of the selected genetic features. The first column identifies each feature by its RSID and genotype (denoted RSID_genotype); pairwise epistasis features are represented by two SNPs. The weights in the second column were extracted from the L1-regularized regression model. The third column gives the p-value of the χ2 test as -log10(p) (the quantitative task uses Student's t-test instead). An odds ratio that differs substantially from 1 indicates whether a feature is a potentially causal (greater than 1) or protective (less than 1) genotype. Because a low genotype frequency can make odds ratios unreliable, the genotype frequency is listed as well. The last column gives the gene in which the SNPs are located according to their genomic coordinates; a star sign denotes epistasis between genes.
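The column conventions described above can be applied row by row. The following is a minimal sketch, not GenEpi code: it assumes Result.csv is comma-separated with the header names shown in the table, `describe` is a hypothetical helper, and the `min_frequency` cutoff of 0.05 is an arbitrary illustrative value rather than a GenEpi default. The two sample rows are copied from the table.

```python
import csv
import io

# Two rows copied from the table above. The comma-separated layout and the
# exact header spellings are assumptions about the on-disk Result.csv.
SAMPLE = """\
RSID,Weight,-Log10(χ2 p-value),Odds Ratio,Genotype Frequency,Gene Symbol
rs157580_BB rs2238681_AA,0.9729,8.4002,9.3952,0.1044,TOMM40
rs73052335_AA rs429358_AA,-0.7289,5.6548,0.1867,0.5714,APOC1*APOE
"""


def describe(row, min_frequency=0.05):
    """Unpack one result row into SNPs, genes, p-value, and direction."""
    # A feature is one or two "RSID_genotype" tokens separated by a space.
    snps = [tuple(token.rsplit("_", 1)) for token in row["RSID"].split()]
    # A star sign joins the gene symbols of a cross-gene epistasis feature.
    genes = row["Gene Symbol"].split("*")
    odds_ratio = float(row["Odds Ratio"])
    if odds_ratio > 1:
        direction = "causal"
    elif odds_ratio < 1:
        direction = "protective"
    else:
        direction = "neutral"
    return {
        "snps": snps,
        "genes": genes,
        "cross_gene": len(genes) > 1,
        # The table stores -log10(p), so invert it to recover p.
        "p_value": 10 ** -float(row["-Log10(χ2 p-value)"]),
        "direction": direction,
        # Hypothetical cutoff: flag rows whose odds ratio may be unreliable.
        "low_frequency": float(row["Genotype Frequency"]) < min_frequency,
    }


rows = [describe(row) for row in csv.DictReader(io.StringIO(SAMPLE))]
for info in rows:
    print(info["snps"], info["genes"], info["direction"], info["low_frequency"])
```

To run this on a real output folder, replace `io.StringIO(SAMPLE)` with an open file handle for crossGeneResult/Result.csv, after checking that the header names match. The direction label only reflects which side of 1 the odds ratio falls on; it is not a significance test.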