Welcome to GenEpi’s docs!¶
GenEpi[1] is a package to uncover epistasis associated with phenotypes by a machine learning approach, developed by Yu-Chuan Chang at c4Lab of National Taiwan University and AILabs.
Introduction¶
GenEpi is designed to group SNPs by a set of loci in the gnome. For examples, a locus could be a gene. In other words, we use gene boundaries to group SNPs. A locus can be generalized to any particular regions in the genome, e.g. promoters, enhancers, etc. GenEpi first considers the genetic variants within a particular region as features in the first stage, because it is believed that SNPs within a functional region might have a higher chance to interact with each other and to influence molecular functions.
GenEpi adopts two-element combinatorial encoding when producing features and models them by L1-regularized regression with stability selection In the first stage (STAGE 1) of GenEpi, the genotype features from each single gene will be combinatorically encoded and modeled independently by L1-regularized regression with stability selection. In this way, we can estimate the prediction performance of each gene and detect within-gene epistasis with a low false positive rate. In the second stage (STAGE 2), both of the individual SNP and the within-gene epistasis features selected by STAGE 1 are pooled together to generate cross-gene epistasis features, and modeled again by L1-regularized regression with stability selection as STAGE 1. Finally, the user can combine the selected genetic features with environmental factors such as clinical features to build the final prediction models.
Citing¶
Please considering cite the following paper if you use GenEpi in a scientific publication.
[1] | Yu-Chuan Chang, June-Tai Wu, Ming-Yi Hong, Yi-An Tung, Ping-Han Hsieh, Sook Wah Yee, Kathleen M. Giacomini, Yen-Jen Oyang, and Chien-Yu Chen. “Genepi: Gene-Based Epistasis Discovery Using Machine Learning.” BMC Bioinformatics 21, 68 (2020). https://doi.org/10.1186/s12859-020-3368-2 |
Installation¶
How to Install¶
GenEpi supports Python 3.7 and up. Use pip to install GenEpi and its dependencies.
$ pip install GenEpi
Check that you installed the GenEpi sucessfully.
$ GenEpi --help
After executed previous command on console, you will see:
usage: GenEpi [-h] -g G -p P [-s S] [-o O] [-m {c,r}] [-k K] [-t T]
[--updatedb] [-b {hg19,hg38}] [--compressld] [-d D] [-r R]
optional arguments:
-h, --help show this help message and exit
-g G filename of the input .gen file
-p P filename of the input phenotype
-s S self-defined genome regions
-o O output file path
-m {c,r} choose model type: c for classification; r for regression
-k K k of k-fold cross validation
-t T number of threads
update UCSC database:
--updatedb enable this function
-b {hg19,hg38} human genome build
compress data by LD block:
--compressld enable this function
-d D threshold for compression: D prime
-r R threshold for compression: R square
Dependencies¶
Here is the dependency list for running GenEpi. pip takes care of these dependencies automatically when you install GenEpi.
- numpy >= 1.13.0
- psutil >= 4.3.0
- pymysql >= 0.8.0
- scipy >= 0.19.0
- scikit-learn == 0.21.2
System Requirements¶
For running a quick test, you could install GenEpi on any laptop e.g. a MacBook. When applying GenEpi on a real whole genome-wide dataset, here are recommended system requirements:
Processor: | 2.3 GHz Intel XEON® E5-2673 v4 * 32 |
---|---|
RAM: | 256 GiB |
Storage: | 500 GiB |
These requirements are refer to the specification of Microsoft Azure E32 v3.
Note
GenEpi is a memory-consuming package, which might cause memory errors when calculating the epistasis of a gene containing a large number of SNPs. We recommend that the memory for running GenEpi should be over 256 GB.
Quickstart¶
This section gets you started quickly, the I/O described in I/O File Fomats, more usage examples please find in More Usage Examples, discussing each of the sub-modules introduced in How it Work.
Running a Quick Test¶
Please use following command to run a quick test, you will obtain all the outputs of GenEpi in your current folder.
$ GenEpi -g example -p example -o ./
The progress will print on console:
step1: Down load UCSC Database. DONE!
step2: Estimate LD. DONE!
Warning of step3: .gen file should be sorted by chromosome and position
step3: Split by gene. DONE!
step4: Detect single gene epistasis. DONE!
step5: Detect cross gene epistasis. DONE! (Training score:0.63; 2-fold Test Score:0.61)
step6: Ensemble with covariates. DONE! (Training score:0.63; 2-fold Test Score:0.60)
GenEpi will automatically generate three folders (snpSubsets, singleGeneResult, crossGeneResult) in output path (arg: -o). The following tree structure is the contents of the output folder. You could go to the folder crossGeneResult directly to obtain your main result table for episatasis in Result.csv.
./
├── GenEpi_Log_DATE-TIME.txt
├── crossGeneResult
│ ├── Classifier.pkl
│ ├── Classifier_Covariates.pkl
│ ├── Feature.csv
│ └── Result.csv
├── sample.LDBlock
├── sample.csv
├── sample.gen
├── sample_LDReduced.gen
├── singleGeneResult
│ ├── All_Logistic_k2.csv
│ ├── APOC1_Feature.csv
│ ├── APOC1_Result.csv
│ ├── APOE_Feature.csv
│ ├── APOE_Result.csv
│ ├── PVRL2_Feature.csv
│ ├── PVRL2_Result.csv
│ ├── TOMM40_Feature.csv
│ └── TOMM40_Result.csv
└── snpSubsets
├── APOC1_23.gen
├── APOE_11.gen
├── PVRL2_48.gen
└── TOMM40_67.gen
Interpreting the Main Result Table¶
Here is the contents of Result.csv, which mean the episatasis seleted by GenEpi.
RSID | Weight | -Log10(χ2 p-value) | Odds Ratio | Genotype Frequency | Gene Symbol |
---|---|---|---|---|---|
rs157580_BB rs2238681_AA | 0.9729 | 8.4002 | 9.3952 | 0.1044 | TOMM40 |
rs449647_AA rs769449_AB | 0.7065 | 8.0278 | 5.0877 | 0.2692 | APOE |
rs59007384_BB rs11668327_AA | 1.0807 | 8.0158 | 12.0408 | 0.0824 | TOMM40 |
rs283811_BB rs7254892_AA | 1.0807 | 8.0158 | 12.0408 | 0.0824 | PVRL2 |
rs429358_AA | -0.7587 | 5.7628 | 0.1743 | 0.5962 | APOE |
rs73052335_AA rs429358_AA | -0.7289 | 5.6548 | 0.1867 | 0.5714 | APOC1*APOE |
We listed the statistical significance of the selected genetic features in Result.csv. The first column lists each feature by its RSID and the genotype (denoted as RSID_genotype), the pairwise epistatis features are represented using two SNPs. The weights in the second column were extracted from the L1-regularized regression model. The last column describes the genes where the SNPs are located according to the genomic coordinates. We used a star sign to denote the epistasis between genes. The p-values of the χ2 test (the quantitative task will use student t-test) are also included. The odds ratio significantly away from 1 also indicates whether the features are potential causal or protective genotypes. Since low genotype frequency may cause unreliable odds ratios, we also listed this information in the table.
I/O File Fomats¶
We provided test data sample.gen and sample.csv in example folder. After running a quick test, GenEpi will automatically copy these test data to output path and generate all of the output folders and files as following tree structure. Please see the following detail about the format of these input and output data.
./
├── GenEpi_Log_DATE-TIME.txt
├── crossGeneResult
│ ├── Classifier.pkl
│ ├── Classifier_Covariates.pkl
│ ├── Feature.csv
│ └── Result.csv
├── sample.LDBlock
├── sample.csv
├── sample.gen
├── sample_LDReduced.gen
├── singleGeneResult
│ ├── All_Logistic_k2.csv
│ ├── APOC1_Feature.csv
│ ├── APOC1_Result.csv
│ ├── APOE_Feature.csv
│ ├── APOE_Result.csv
│ ├── PVRL2_Feature.csv
│ ├── PVRL2_Result.csv
│ ├── TOMM40_Feature.csv
│ └── TOMM40_Result.csv
└── snpSubsets
├── APOC1_23.gen
├── APOE_11.gen
├── PVRL2_48.gen
└── TOMM40_67.gen
Input: Genotype Data¶
Sample.gen is an example of genotype data. GenEpi takes the Genotype File Format (.GEN) used by Oxford statistical genetics tools, such as IMPUTE2 and SNPTEST as the input format for genotype data. If your files are in PLINK format (.BED/.BIM/.FAM) or 1000 Genomes Project text Variant Call Format (.VCF), you could use PLINK with the following command to convert them to the .GEN file.
If your files are in the .BED/.BIM/.FAM format.
$ plink --bfile prefixOfTheFilename --recode oxford --out prefixOfTheFilename
If your file is in the .VCF format.
$ plink --vcf filename.vcf --recode oxford --out prefixOfTheFilename
Input: Phenotype Data¶
Sample.gen is an example of phenotype data with environmental factor (e.g. the age of each sample). GenEpi takes the common .CSV file without header line as the input format for phenotype and environmental factor data. The last column of the file will be considered as the phenotype data (e.g. 1 or 0, which indicate case/control, respectively) and the other columns will be considered as the environmental factor data.
Warning
The sequential order of the phenotype data should be the same as that in the .GEN file.
Output: LDBlock File¶
GenEpi has a moudle, which can reduce the dimension of the input feature by estimating the linkage disequilibrium (LD). After performing dimension reduction, GenEpi will generate two files, a dimension-reduced .GEN file and a file containing LD blocks (.LDBlock file). Each row in the .LDBlock file indicates a LD block (see below for examples). The SNPs in front of colon signs are the representative SNPs of each LD block, and only these SNPs will be retained in the dimension-reduced .GEN file.
rs429358:rs429358
rs7412:rs7412
rs117656888:rs117656888
rs1081105:rs1081105
rs1081106:rs1081106,rs191315680
Output: SnpSubsets Folder¶
Since GenEpi is a gene-based epistasis discovering method, the input genotype will first be splited into group of each gene. The subsets of the .GEN file for each gene will be stored in the folder snpSubsets. The naming rule of the filename is GeneSymbol_NumberOfVariantsOnGene.gen
Output: SingleGeneResult Folder¶
In first stage of GenEpi, all the .GEN file for each gene will be modeled gene by gene. Every models will output two kinds of data the GeneSymbol_Result.csv and GeneSymbol_Feature.csv. The format of GeneSymbol_Result.csv please refer to the section Interpreting the Main Result Table. The only difference is the GeneSymbol_Result.csv is a result table for single gene. Moreover, GeneSymbol_Feature.csv are the raw features corresponds to the episatsis in GeneSymbol_Result.csv. Beside these two types of files, there is a file named All_Logistic/Lasso.csv, which contains all the prediction scores of each gene, please see as following example.
$ head All_Logistic_k2.csv
GeneSymbol,F1Score
PVRL2,0.5745454545454547
APOC1,0.5681818181818181
TOMM40,0.602510460251046
APOE,0.592
Output: CrossGeneResult Folder¶
The results of the second stage of GenEpi - cross gene modeling will be generated in this folder. The formats are as same as the description in previous section SingleGeneResult Folder. Moreover, the final models will be persisted in this folder as Classifier/Regressor.pkl and Classifier/Regressor_Covariates.pkl, respectively. You could keep these models for future use without reconstructing them.
Output: Porcess Log¶
The performance of genetic feature only and genetic + environmental factor models will be logged into GenEpi_Log_DATE-TIME.txt. Other process information such as the setting of arguments, the time cost will also be recorded.
start analysis at: 20191009-17:54:45
Arguments in effect:
-g (input genotype filename): ./sample.gen
-p (input phenotype filename): ./sample.csv
-s (self-defined genome regions): None
-o (output filepath): ./
-m (model type): Classification -k (k-fold cross validation): 2
-t (number of threads): 4
--updatedb (enable function of update UCSC database): True
-b (human genome build): hg19
--compressld (enable function of LD data compression): True
-d (D prime threshold): 0.9
-r (R square threshold): 0.9
Number of variants: 223
Number of samples: 364
Overall genetic feature performance (F1 score)
Training: 0.6307053941908715
Testing (2-fold CV): 0.6134453781512604
Ensemble with co-variate performance (F1 score)
Training: 0.632
Testing (2-fold CV): 0.6016260162601627
end analysis at: 20191009-17:54:58
Seld-defined Genome Regions¶
GenEpi supports seld-defined genome regions for first stage to subset the data. Please prepare your genome regions in .TXT with the columns [chromosome, start, end, strand, geneSymbol], for eample:
1,10873,14409,+,DDX11L1
1,14361,30370,-,WASH7P
1,34610,37081,-,FAM138F
1,68090,70008,+,OR4F5
...
More Usage Examples¶
For checking all the optional arguments, please use –help.
$ GenEpi --help
You will obtain the following argument list:
usage: GenEpi [-h] -g G -p P [-s S] [-o O] [-m {c,r}] [-k K] [-t T]
[--updatedb] [-b {hg19,hg38}] [--compressld] [-d D] [-r R]
optional arguments:
-h, --help show this help message and exit
-g G filename of the input .gen file
-p P filename of the input phenotype
-s S self-defined genome regions
-o O output file path
-m {c,r} choose model type: c for classification; r for regression
-k K k of k-fold cross validation
-t T number of threads
update UCSC database:
--updatedb enable this function
-b {hg19,hg38} human genome build
compress data by LD block:
--compressld enable this function
-d D threshold for compression: D prime
-r R threshold for compression: R square
Applying on Your Data¶
$ GenEpi -g full_path_of_your_.GEN_file -p full_path_of_your_.CSV_file -o ./
For Quantitative Study¶
GenEpi can support both case/control and quantitative studies, for quantitative studies please modify the parameter -m. You could download the test data and excute the following command.
$ GenEpi -g sample.gen -p sample_q.csv -o ./ -m r
Changing the Genome Build¶
For changing the build of USCS genome browser, please modify the parameter -b.
$ GenEpi -g example -p example -o ./ --updatedb -b hg38
Threshold for Dimension Reduction¶
You could modify the threshold for Linkage Disequilibrium dimension reduction by following command.
$ GenEpi -g example -p example -o ./ --compressld -d 0.9 -r 0.9
How it Works¶
The main procedures of GenEpi described in Introduction. In addition to the main procedures, two pre-processing steps are also implemented in GenEpi: retrieving the gene information from public databases and reducing the gene information from public databases and reducing the dimensionality of the features using linkage disequilibrium. All of the precedures be implemented as six steps in GenEpi submoudles. The interations bewteen the I/O and these submodules please find in the figure below. For API documentations of these submodules please find in next section.
API Documentations¶
GenEpi module¶
Created on Apr 2019
@author: Chester (Yu-Chuan Chang)
-
genepi.GenEpi.
ArgumentsParser
()[source]¶ To obtain and parse the arguments from user.
Parameters: None – Returns: argparse.ArgumentParser
-
genepi.GenEpi.
InputChecking
(str_inputFileName_genotype, str_inputFileName_phenotype, args)[source]¶ To check the numbers of sample are consistent in genotype and phenotype data.
Parameters: Returns: tuple containing:
- int_num_genotype (int): The sample number of genotype data
- int_num_phenotype (int): The sample number of phenotype data
Return type: (tuple)
step1_downloadUCSCDB¶
Created on Feb 2018
@author: Chester (Yu-Chuan Chang)
-
genepi.step1_downloadUCSCDB.
DownloadUCSCDB
(str_outputFilePath='/home/docs/checkouts/readthedocs.org/user_builds/genepi/envs/stable/lib/python3.7/site-packages/genepi-2.0.10-py3.7.egg/genepi', str_hgbuild='hg19')[source]¶ To obtain the gene information such as official gene symbols and genomic coordinates, this function is for retrieving kgXref and knownGene data table from the UCSC human genome annotation database
Parameters: Returns: Expected Success Response:
"step1: Down load UCSC Database. DONE!"
step2_estimateLD¶
Created on Feb 2018
@author: Chester (Yu-Chuan Chang)
-
genepi.step2_estimateLD.
EstimateAlleleFrequency
(gen_snp)[source]¶ A function for estimating allele frequency of a single varaint
Parameters: gen_snp (list) – The genotypes of a variant of all samples Returns: tuple containing: - float_frequency_A (float): The reference allele type frequency
- float_frequency_B (float): The alternative allele type frequency
Return type: (tuple)
-
genepi.step2_estimateLD.
EstimateLDBlock
(str_inputFileName_genotype, str_outputFilePath='', float_threshold_DPrime=0.8, float_threshold_RSquare=0.8)[source]¶ A function for implementing linkage disequilibrium (LD) dimension reduction. In genotype data, a variant often exhibits high dependency with its nearby variants because of LD. In the practical implantation, we prefer to group these dependent features to reduce the dimension of features. In other words, we can take the advantages of LD to reduce the dimensionality of genetic features. In this regard, this function adopted the same approach developed by Lewontin (1964) to estimate LD. We used D’ and r2 as the criteria to group highly dependent genetic features as blocks. In each block, we chose the features with the largest minor allele frequency to represent other features in the same block.
Parameters: - str_inputFileName_genotype (str) – File name of input genotype data
- str_outputFilePath (str) – File path of output file
- float_threshold_DPrime (float) – The Dprime threshold for discriminating a LD block (default: 0.8)
- float_threshold_RSquare (float) – The RSquare threshold for discriminating a LD block (default: 0.8)
Returns: Expected Success Response:
"step2: Estimate LD. DONE!"
-
genepi.step2_estimateLD.
EstimatePairwiseLD
(gen_snp_1, gen_snp_2)[source]¶ Lewontin (1964) linkage disequilibrium (LD) estimation.
Parameters: Returns: tuple containing:
- float_D_prime (float): The DPrime of these two variants
- float_R_square (float): The RSquare of these two variants
Return type: (tuple)
step3_splitByGene¶
Created on Feb 2018
@author: Chester (Yu-Chuan Chang)
-
genepi.step3_splitByGene.
SplitByGene
(str_inputFileName_genotype, str_inputFileName_UCSCDB='/home/docs/checkouts/readthedocs.org/user_builds/genepi/envs/stable/lib/python3.7/site-packages/genepi-2.0.10-py3.7.egg/genepi/UCSCGenomeDatabase.txt', str_outputFilePath='')[source]¶ In order to extract genetic features for a gene, this function used the start and end positions of each gene from the local UCSC database to split the genetic features. Then, generate the .GEN files for each gene in the folder named snpSubsets.
Parameters: Returns: Expected Success Response:
"step3: Split by gene. DONE!"
Warning
“Warning of step3: .gen file should be sorted by chromosome and position”
-
genepi.step3_splitByGene.
SplitMegaGene
(list_snpsOnGene, int_window, int_step, str_outputFilePath, str_outputFileName)[source]¶ In order to extract genetic features for a gene, this function used the start and end positions of each gene from the local UCSC database to split the genetic features. Then, generate the .GEN files for each gene in the folder named snpSubsets.
Parameters: Returns: None
step4_singleGeneEpistasis_Lasso¶
Created on Feb 2018
@author: Chester (Yu-Chuan Chang)
-
genepi.step4_singleGeneEpistasis_Lasso.
BatchSingleGeneEpistasisLasso
(str_inputFilePath_genotype, str_inputFileName_phenotype, str_outputFilePath='', int_kOfKFold=2, int_nJobs=2)[source]¶ Batch running for the single gene workflow.
Parameters: - str_inputFilePath_genotype (str) – File path of input genotype data
- str_inputFileName_phenotype (str) – File name of input phenotype data
- str_outputFilePath (str) – File path of output file
- int_kOfKFold (int) – The k for k-fold cross validation (default: 2)
- int_nJobs (int) – The number of thread (default: 1)
Returns: Expected Success Response:
"step4: Detect single gene epistasis. DONE!"
-
genepi.step4_singleGeneEpistasis_Lasso.
FeatureEncoderLasso
(np_genotype_rsid, np_genotype, np_phenotype, int_dim)[source]¶ Implementation of the two-element combinatorial encoding.
Parameters: - np_genotype_rsid (ndarray) – 1D array containing rsid of genotype data with str type
- np_genotype (ndarray) – 2D array containing genotype data with int8 type
- np_phenotype (ndarray) – 2D array containing phenotype data with float type
- int_dim (int) – The dimension of a variant (default: 3. AA, AB and BB)
Returns: tuple containing:
- list_interaction_rsid (ndarray): 1D array containing rsid of genotype data with str type
- np_interaction (ndarray): 2D array containing genotype data with int8 type
Return type: (tuple)
-
genepi.step4_singleGeneEpistasis_Lasso.
FilterInLoading
(np_genotype, np_phenotype)[source]¶ This function is for filtering low quality varaint. Before modeling each subset of genotype features, two criteria were adopted to exclude low quality data. The first criterion is that the genotype frequency of a feature should exceed 5%, where the genotype frequency means the proportion of genotype among the total samples in the dataset. The second criterion is regarding the association between the feature and the phenotype. We used χ2 test to estimate the association between the feature and the phenotype, and the p-value should be smaller than 0.01.
Parameters: - np_genotype (ndarray) – 2D array containing genotype data with int8 type
- np_phenotype (ndarray) – 2D array containing phenotype data with float type
Returns: np_genotype
2D array containing genotype data with int8 type
Return type: (ndarray)
-
genepi.step4_singleGeneEpistasis_Lasso.
LassoRegressionCV
(np_X, np_y, int_kOfKFold=2, int_nJobs=1)[source]¶ Implementation of the L1-regularized Lasso regression with k-fold cross validation.
Parameters: Returns: estimator.scores
1D array containing the scores of each genetic features with float type
Return type: (ndarray)
-
genepi.step4_singleGeneEpistasis_Lasso.
RandomizedLassoRegression
(np_X, np_y)[source]¶ Implementation of the stability selection.
Parameters: - np_X (ndarray) – 2D array containing genotype data with int8 type
- np_y (ndarray) – 2D array containing phenotype data with float type
Returns: estimator.scores
1D array containing the scores of each genetic features with float type
Return type: (ndarray)
-
genepi.step4_singleGeneEpistasis_Lasso.
SingleGeneEpistasisLasso
(str_inputFileName_genotype, str_inputFileName_phenotype, str_outputFilePath='', int_kOfKFold=2, int_nJobs=1)[source]¶ A workflow to model a single gene containing two-element combinatorial encoding, stability selection, filtering low quality varaint and L1-regularized Lasso regression with k-fold cross validation.
Parameters: - str_inputFileName_genotype (str) – File name of input genotype data
- str_inputFileName_phenotype (str) – File name of input phenotype data
- str_outputFilePath (str) – File path of output file
- int_kOfKFold (int) – The k for k-fold cross validation (default: 2)
- int_nJobs (int) – The number of thread (default: 1)
Returns: float_AVG_S_P
The average of the Peason’s and Spearman’s correlation of the model
Return type: (float)
step4_singleGeneEpistasis_Logistic¶
Created on Feb 2018
@author: Chester (Yu-Chuan Chang)
-
genepi.step4_singleGeneEpistasis_Logistic.
BatchSingleGeneEpistasisLogistic
(str_inputFilePath_genotype, str_inputFileName_phenotype, str_outputFilePath='', int_kOfKFold=2, int_nJobs=2)[source]¶ Batch running for the single gene workflow.
Parameters: - str_inputFilePath_genotype (str) – File path of input genotype data
- str_inputFileName_phenotype (str) – File name of input phenotype data
- str_outputFilePath (str) – File path of output file
- int_kOfKFold (int) – The k for k-fold cross validation (default: 2)
- int_nJobs (int) – The number of thread (default: 1)
Returns: Expected Success Response:
"step4: Detect single gene epistasis. DONE!"
-
genepi.step4_singleGeneEpistasis_Logistic.
FeatureEncoderLogistic
(np_genotype_rsid, np_genotype, np_phenotype, int_dim)[source]¶ Implementation of the two-element combinatorial encoding.
Parameters: - np_genotype_rsid (ndarray) – 1D array containing rsid of genotype data with str type
- np_genotype (ndarray) – 2D array containing genotype data with int8 type
- np_phenotype (ndarray) – 2D array containing phenotype data with float type
- int_dim (int) – The dimension of a variant (default: 3. AA, AB and BB)
Returns: tuple containing:
- list_interaction_rsid (ndarray): 1D array containing rsid of genotype data with str type
- np_interaction (ndarray): 2D array containing genotype data with int8 type
Return type: (tuple)
-
genepi.step4_singleGeneEpistasis_Logistic.
FilterInLoading
(np_genotype, np_phenotype)[source]¶ This function is for filtering low quality varaint. Before modeling each subset of genotype features, two criteria were adopted to exclude low quality data. The first criterion is that the genotype frequency of a feature should exceed 5%, where the genotype frequency means the proportion of genotype among the total samples in the dataset. The second criterion is regarding the association between the feature and the phenotype. We used χ2 test to estimate the association between the feature and the phenotype, and the p-value should be smaller than 0.01.
Parameters: - np_genotype (ndarray) – 2D array containing genotype data with int8 type
- np_phenotype (ndarray) – 2D array containing phenotype data with float type
Returns: np_genotype
2D array containing genotype data with int8 type
Return type: (ndarray)
-
genepi.step4_singleGeneEpistasis_Logistic.
GenerateContingencyTable
(np_genotype, np_phenotype)[source]¶ Generating the contingency table for chi-square test.
Parameters: - np_X (ndarray) – 2D array containing genotype data with int8 type
- np_y (ndarray) – 2D array containing phenotype data with float type
Returns: np_contingency
2D array containing the contingency table with int type
Return type: (ndarray)
-
genepi.step4_singleGeneEpistasis_Logistic.
LogisticRegressionL1CV
(np_X, np_y, int_kOfKFold=2, int_nJobs=1)[source]¶ Implementation of the L1-regularized Logistic regression with k-fold cross validation.
Parameters: Returns: estimator.scores
1D array containing the scores of each genetic features with float type
Return type: (ndarray)
-
genepi.step4_singleGeneEpistasis_Logistic.
RandomizedLogisticRegression
(np_X, np_y)[source]¶ Implementation of the stability selection.
Parameters: - np_X (ndarray) – 2D array containing genotype data with int8 type
- np_y (ndarray) – 2D array containing phenotype data with float type
Returns: estimator.scores
1D array containing the scores of each genetic features with float type
Return type: (ndarray)
-
genepi.step4_singleGeneEpistasis_Logistic.
SingleGeneEpistasisLogistic
(str_inputFileName_genotype, str_inputFileName_phenotype, str_outputFilePath='', int_kOfKFold=2, int_nJobs=1)[source]¶ A workflow to model a single gene containing two-element combinatorial encoding, stability selection, filtering low quality varaint and L1-regularized Logistic regression with k-fold cross validation.
Parameters: - str_inputFileName_genotype (str) – File name of input genotype data
- str_inputFileName_phenotype (str) – File name of input phenotype data
- str_outputFilePath (str) – File path of output file
- int_kOfKFold (int) – The k for k-fold cross validation (default: 2)
- int_nJobs (int) – The number of thread (default: 1)
Returns: float_f1Score
The F1 score of the model
Return type: (float)
step5_crossGeneEpistasis_Lasso¶
Created on Feb 2018
@author: Chester (Yu-Chuan Chang)
-
genepi.step5_crossGeneEpistasis_Lasso.
CrossGeneEpistasisLasso
(str_inputFilePath_feature, str_inputFileName_phenotype, str_inputFileName_score='', str_outputFilePath='', int_kOfKFold=2, int_nJobs=1)[source]¶ A workflow to model a cross gene epistasis containing two-element combinatorial encoding, stability selection, filtering low quality varaint and L1-regularized Lasso regression with k-fold cross validation.
Parameters: - str_inputFilePath_feature (str) – File path of input feature files from stage 1 - singleGeneEpistasis
- str_inputFileName_phenotype (str) – File name of input phenotype data
- str_inputFileName_score (str) – File name of input score file from stage 1 - singleGeneEpistasis
- str_outputFilePath (str) – File path of output file
- int_kOfKFold (int) – The k for k-fold cross validation (default: 2)
- int_nJobs (int) – The number of thread (default: 1)
Returns: tuple containing:
- float_AVG_S_P_train (float): The average of the Peason’s and Spearman’s correlation of the model for training set
- float_AVG_S_P_test (float): The average of the Peason’s and Spearman’s correlation of the model for testing set
Expected Success Response:
"step5: Detect cross gene epistasis. DONE!"
Return type: (tuple)
-
genepi.step5_crossGeneEpistasis_Lasso.
LassoRegression
(np_X, np_y, int_nJobs=1)[source]¶ Implementation of the L1-regularized Lasso regression with k-fold cross validation.
Parameters: - np_X (ndarray) – 2D array containing genotype data with int8 type
- np_y (ndarray) – 2D array containing phenotype data with float type
- int_nJobs (int) – The number of thread (default: 1)
Returns: float_AVG_S_P
The average of the Peason’s and Spearman’s correlation of the model
Return type: (float)
step5_crossGeneEpistasis_Logistic¶
Created on Feb 2018
@author: Chester (Yu-Chuan Chang)
-
genepi.step5_crossGeneEpistasis_Logistic.
ClassifierModelPersistence
(np_X, np_y, str_outputFilePath='', int_nJobs=1)[source]¶ Dumping classifier for model persistence
Parameters: Returns: None
-
genepi.step5_crossGeneEpistasis_Logistic.
CrossGeneEpistasisLogistic
(str_inputFilePath_feature, str_inputFileName_phenotype, str_inputFileName_score='', str_outputFilePath='', int_kOfKFold=2, int_nJobs=1)[source]¶ A workflow to model a cross gene epistasis containing two-element combinatorial encoding, stability selection, filtering low quality varaint and L1-regularized Logistic regression with k-fold cross validation.
Parameters: - str_inputFilePath_feature (str) – File path of input feature files from stage 1 - singleGeneEpistasis
- str_inputFileName_phenotype (str) – File name of input phenotype data
- str_inputFileName_score (str) – File name of input score file from stage 1 - singleGeneEpistasis
- str_outputFilePath (str) – File path of output file
- int_kOfKFold (int) – The k for k-fold cross validation (default: 2)
- int_nJobs (int) – The number of thread (default: 1)
Returns: tuple containing:
- float_f1Score_train (float): The F1 score of the model for training set
- float_f1Score_test (float): The F1 score of the model for testing set
Expected Success Response:
"step5: Detect cross gene epistasis. DONE!"
Return type: (tuple)
-
genepi.step5_crossGeneEpistasis_Logistic.
GenerateContingencyTable
(np_genotype, np_phenotype)[source]¶ Generating the contingency table for chi-square test.
Parameters: - np_X (ndarray) – 2D array containing genotype data with int8 type
- np_y (ndarray) – 2D array containing phenotype data with float type
Returns: np_contingency
2D array containing the contingency table with int type
Return type: (ndarray)
-
genepi.step5_crossGeneEpistasis_Logistic.
LogisticRegressionL1
(np_X, np_y, int_nJobs=1)[source]¶ Implementation of the L1-regularized Logistic regression with k-fold cross validation.
Parameters: - np_X (ndarray) – 2D array containing genotype data with int8 type
- np_y (ndarray) – 2D array containing phenotype data with float type
- int_nJobs (int) – The number of thread (default: 1)
Returns: float_f1Score
The F1 score of the model
Return type: (float)
step6_ensembleWithCovariates¶
Created on Feb 2018
@author: Chester (Yu-Chuan Chang)
-
genepi.step6_ensembleWithCovariates.
ClassifierModelPersistence
(np_X, np_y, str_outputFilePath='', int_nJobs=1)[source]¶ Dumping ensemble classifier for model persistence
Parameters: Returns: None
-
genepi.step6_ensembleWithCovariates.
EnsembleWithCovariatesClassifier
(str_inputFileName_feature, str_inputFileName_phenotype, str_outputFilePath='', int_kOfKFold=2, int_nJobs=1)[source]¶ A workflow to ensemble genetic features with covariates for L1-regularized Logistic regression.
Parameters: - str_inputFilePath_feature (str) – File path of input feature files from stage 2 - crossGeneEpistasis
- str_inputFileName_phenotype (str) – File name of input phenotype data
- str_outputFilePath (str) – File path of output file
- int_kOfKFold (int) – The k for k-fold cross validation (default: 2)
- int_nJobs (int) – The number of thread (default: 1)
Returns: tuple containing:
- float_f1Score_train (float): The F1 score of the model for training set
- float_f1Score_test (float): The F1 score of the model for testing set
Expected Success Response:
"step6: Ensemble with covariates. DONE!"
Return type: (tuple)
-
genepi.step6_ensembleWithCovariates.
EnsembleWithCovariatesRegressor
(str_inputFileName_feature, str_inputFileName_phenotype, str_outputFilePath='', int_kOfKFold=2, int_nJobs=1)[source]¶ A workflow to ensemble genetic features with covariates for L1-regularized Lasso regression.
Parameters: - str_inputFilePath_feature (str) – File path of input feature files from stage 2 - crossGeneEpistasis
- str_inputFileName_phenotype (str) – File name of input phenotype data
- str_outputFilePath (str) – File path of output file
- int_kOfKFold (int) – The k for k-fold cross validation (default: 2)
- int_nJobs (int) – The number of thread (default: 1)
Returns: tuple containing:
- float_AVG_S_P_train (float): The average of the Peason’s and Spearman’s correlation of the model for training set
- float_AVG_S_P_test (float): The average of the Peason’s and Spearman’s correlation of the model for testing set
Expected Success Response:
"step6: Ensemble with covariates. DONE!"
Return type: (tuple)
-
genepi.step6_ensembleWithCovariates.
LoadDataForEnsemble
(str_inputFileName_feature, str_inputFileName_phenotype)[source]¶ Loading genetic features for ensembling with covariates
Parameters: Returns: tuple containing:
- np_genotype (ndarray): 2D array containing genotype data with int8 type
- np_phenotype (ndarray): 2D array containing phenotype data with float type
Return type: (tuple)
Release History¶
All notable changes to this project will be documented in this file.
[2.0.10] - 2019-07-29¶
Added
- Add genome database for intergenic region (hg19, hg38)
- Add functions for modeling intergenic region
- Add warning message if there is no variant past the feature selection
- Add warning message if there is no variant remained in previous step
Fixed
- Update the UCSC mysql host
- Fix error when modeling empty feature set (step5)
- Fix error when modeling empty feature set (step6)
- Refine PRS plot
[2.0.3] - 2019-10-13¶
Added
- Add GenEpi’s documentation to Read the Docs
- Add model persistance for classifier and regressor
[2.0.1] - 2019-07-12¶
Added
- Support self-defiend genome regions (for any species)
- Add argument parser (GenEpi -h)
- Add “GenEpi” to bin when installing
- Add output log
- Add quick test
- Add change log in Changelog.md
Changed
- Change feature encoder for reducing memory usage
Removed
- Remove random forest ensemble
[1.0.3] - 2018-04-22¶
First preview release