API Documentation¶
GenEpi module¶
Created on Apr 2019
@author: Chester (Yu-Chuan Chang)
-
genepi.GenEpi.
ArgumentsParser
()[source]¶ Obtains and parses the arguments supplied by the user.
Parameters: None – Returns: argparse.ArgumentParser
-
genepi.GenEpi.
InputChecking
(str_inputFileName_genotype, str_inputFileName_phenotype, args)[source]¶ Checks that the number of samples is consistent between the genotype and phenotype data.
Parameters: Returns: tuple containing:
- int_num_genotype (int): The number of samples in the genotype data
- int_num_phenotype (int): The number of samples in the phenotype data
Return type: (tuple)
step1_downloadUCSCDB¶
Created on Feb 2018
@author: Chester (Yu-Chuan Chang)
-
genepi.step1_downloadUCSCDB.
DownloadUCSCDB
(str_outputFilePath='/home/docs/checkouts/readthedocs.org/user_builds/genepi/envs/latest/lib/python3.7/site-packages/genepi-2.0.10-py3.7.egg/genepi', str_hgbuild='hg19')[source]¶ Retrieves the kgXref and knownGene data tables from the UCSC human genome annotation database to obtain gene information such as official gene symbols and genomic coordinates.
Parameters: Returns: Expected Success Response:
"step1: Down load UCSC Database. DONE!"
step2_estimateLD¶
Created on Feb 2018
@author: Chester (Yu-Chuan Chang)
-
genepi.step2_estimateLD.
EstimateAlleleFrequency
(gen_snp)[source]¶ Estimates the allele frequencies of a single variant.
Parameters: gen_snp (list) – The genotypes of a variant across all samples
Returns: tuple containing:
- float_frequency_A (float): The reference allele type frequency
- float_frequency_B (float): The alternative allele type frequency
Return type: (tuple)
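The calculation can be sketched as follows. This is a minimal illustration, not GenEpi's implementation: the function name estimate_allele_frequency and the 0/1/2 genotype coding (number of B alleles carried by each sample) are assumptions, and the actual gen_snp input may use a different representation.

```python
def estimate_allele_frequency(genotypes):
    """Return (freq_A, freq_B) for one variant.

    Assumption for this sketch: each genotype is coded as the number of
    B (alternative) alleles carried by a diploid sample: 0, 1 or 2.
    """
    if not genotypes:
        raise ValueError("genotypes must not be empty")
    total_alleles = 2 * len(genotypes)      # two alleles per diploid sample
    count_b = sum(genotypes)                # total number of B alleles
    freq_b = count_b / total_alleles
    return 1.0 - freq_b, freq_b


# Four samples: AA, AB, BB, AB
freq_a, freq_b = estimate_allele_frequency([0, 1, 2, 1])
```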
-
genepi.step2_estimateLD.
EstimateLDBlock
(str_inputFileName_genotype, str_outputFilePath='', float_threshold_DPrime=0.8, float_threshold_RSquare=0.8)[source]¶ Implements linkage disequilibrium (LD) dimension reduction. In genotype data, a variant often exhibits high dependency with its nearby variants because of LD. In practice, we prefer to group these dependent features to reduce the dimensionality of the genetic features. This function adopts the approach developed by Lewontin (1964) to estimate LD. D’ and r² are used as the criteria to group highly dependent genetic features into blocks. In each block, the feature with the largest minor allele frequency is chosen to represent the other features in the same block.
Parameters: - str_inputFileName_genotype (str) – File name of input genotype data
- str_outputFilePath (str) – File path of output file
- float_threshold_DPrime (float) – The Dprime threshold for discriminating an LD block (default: 0.8)
- float_threshold_RSquare (float) – The RSquare threshold for discriminating an LD block (default: 0.8)
Returns: Expected Success Response:
"step2: Estimate LD. DONE!"
-
genepi.step2_estimateLD.
EstimatePairwiseLD
(gen_snp_1, gen_snp_2)[source]¶ Lewontin (1964) linkage disequilibrium (LD) estimation.
Parameters: Returns: tuple containing:
- float_D_prime (float): The DPrime of these two variants
- float_R_square (float): The RSquare of these two variants
Return type: (tuple)
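For reference, the two LD measures can be sketched as below. The sketch assumes phased haplotypes, so the haplotype frequency can be counted directly; with unphased genotype data the haplotype frequencies would first have to be estimated, which is not shown here. The function name estimate_pairwise_ld and the input layout are hypothetical, and D’ is reported as an absolute value.

```python
def estimate_pairwise_ld(haplotypes):
    """Return (d_prime, r_square) for two biallelic variants.

    Assumption for this sketch: `haplotypes` is a list of pairs
    (allele_at_snp1, allele_at_snp2), each allele coded 0 (A) or 1 (B).
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("haplotypes must not be empty")
    p1 = sum(h[0] for h in haplotypes) / n            # freq of B at SNP 1
    p2 = sum(h[1] for h in haplotypes) / n            # freq of B at SNP 2
    p11 = sum(1 for h in haplotypes if h[0] == 1 and h[1] == 1) / n
    d = p11 - p1 * p2                                 # raw disequilibrium D
    denom = p1 * (1 - p1) * p2 * (1 - p2)
    if denom == 0:
        return 0.0, 0.0                               # a monomorphic variant
    # Lewontin's normalisation: divide D by its maximum attainable value
    if d >= 0:
        d_max = min(p1 * (1 - p2), (1 - p1) * p2)
    else:
        d_max = min(p1 * p2, (1 - p1) * (1 - p2))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    r_square = d * d / denom
    return d_prime, r_square


perfect = estimate_pairwise_ld([(0, 0), (0, 0), (1, 1), (1, 1)])
independent = estimate_pairwise_ld([(0, 0), (0, 1), (1, 0), (1, 1)])
```

A pair of variants would then be placed in the same block when both values exceed the thresholds (0.8 by default in EstimateLDBlock).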
step3_splitByGene¶
Created on Feb 2018
@author: Chester (Yu-Chuan Chang)
-
genepi.step3_splitByGene.
SplitByGene
(str_inputFileName_genotype, str_inputFileName_UCSCDB='/home/docs/checkouts/readthedocs.org/user_builds/genepi/envs/latest/lib/python3.7/site-packages/genepi-2.0.10-py3.7.egg/genepi/UCSCGenomeDatabase.txt', str_outputFilePath='')[source]¶ To extract the genetic features of each gene, this function uses the start and end positions of each gene from the local UCSC database to split the genetic features. It then generates a .GEN file for each gene in the folder named snpSubsets.
Parameters: Returns: Expected Success Response:
"step3: Split by gene. DONE!"
Warning
“Warning of step3: .gen file should be sorted by chromosome and position”
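The splitting idea can be illustrated with a small in-memory sketch. The tuple layouts, the inclusive gene boundaries and the name split_by_gene are assumptions made for this example; the real function reads a .GEN file and the local UCSC database and writes one file per gene.

```python
def split_by_gene(snps, genes):
    """Group variant IDs by the gene whose interval contains them.

    Assumptions for this sketch:
      snps  -- list of (rsid, chromosome, position)
      genes -- list of (symbol, chromosome, start, end), boundaries inclusive
    Genes without any variant are omitted from the result.
    """
    subsets = {}
    for symbol, gene_chrom, start, end in genes:
        members = [
            rsid
            for rsid, chrom, pos in snps
            if chrom == gene_chrom and start <= pos <= end
        ]
        if members:
            subsets[symbol] = members
    return subsets


snps = [("rs1", "1", 100), ("rs2", "1", 250), ("rs3", "2", 100)]
genes = [("G1", "1", 50, 200), ("G2", "1", 200, 300), ("G3", "2", 500, 600)]
subsets = split_by_gene(snps, genes)
```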
-
genepi.step3_splitByGene.
SplitMegaGene
(list_snpsOnGene, int_window, int_step, str_outputFilePath, str_outputFileName)[source]¶ To extract the genetic features of each gene, this function uses the start and end positions of each gene from the local UCSC database to split the genetic features. It then generates a .GEN file for each gene in the folder named snpSubsets.
Parameters: Returns: None
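The parameter names int_window and int_step suggest that a gene with many variants is cut into overlapping windows. The following sketch shows that interpretation only; it is inferred from the signature rather than confirmed by the description above, and the name split_into_windows is hypothetical.

```python
def split_into_windows(items, window, step):
    """Cut a list into consecutive windows of at most `window` items,
    advancing the start index by `step` each time.

    This is an assumed reading of int_window / int_step, not GenEpi's code.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    if not items:
        return []
    windows = []
    start = 0
    while True:
        windows.append(items[start:start + window])
        if start + window >= len(items):    # this window reached the end
            break
        start += step
    return windows


windows = split_into_windows(["rs1", "rs2", "rs3", "rs4", "rs5"], window=3, step=2)
```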
step4_singleGeneEpistasis_Lasso¶
Created on Feb 2018
@author: Chester (Yu-Chuan Chang)
-
genepi.step4_singleGeneEpistasis_Lasso.
BatchSingleGeneEpistasisLasso
(str_inputFilePath_genotype, str_inputFileName_phenotype, str_outputFilePath='', int_kOfKFold=2, int_nJobs=1)[source]¶ Batch running for the single gene workflow.
Parameters: - str_inputFilePath_genotype (str) – File path of input genotype data
- str_inputFileName_phenotype (str) – File name of input phenotype data
- str_outputFilePath (str) – File path of output file
- int_kOfKFold (int) – The k for k-fold cross validation (default: 2)
- int_nJobs (int) – The number of threads (default: 1)
Returns: Expected Success Response:
"step4: Detect single gene epistasis. DONE!"
-
genepi.step4_singleGeneEpistasis_Lasso.
FeatureEncoderLasso
(np_genotype_rsid, np_genotype, np_phenotype, int_dim)[source]¶ Implementation of the two-element combinatorial encoding.
Parameters: - np_genotype_rsid (ndarray) – 1D array containing rsid of genotype data with str type
- np_genotype (ndarray) – 2D array containing genotype data with int8 type
- np_phenotype (ndarray) – 2D array containing phenotype data with float type
- int_dim (int) – The dimension of a variant (default: 3. AA, AB and BB)
Returns: tuple containing:
- list_interaction_rsid (ndarray): 1D array containing rsid of genotype data with str type
- np_interaction (ndarray): 2D array containing genotype data with int8 type
Return type: (tuple)
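One way to read "two-element combinatorial encoding" is sketched below: every pair of variants is expanded into one indicator feature per combination of their genotype states (3 x 3 = 9 features per pair when int_dim is 3). The 0/1/2 coding for AA/AB/BB, the feature naming scheme and the function name are assumptions for illustration, and plain lists are used instead of ndarrays to keep the example dependency-free.

```python
from itertools import combinations

GENOTYPE_LABELS = ("AA", "AB", "BB")   # assumed order for codes 0, 1, 2


def encode_pairwise_interactions(rsids, genotypes):
    """Build indicator features for every genotype combination of every
    pair of variants.

    genotypes: list of samples, each a list of codes 0/1/2 (one per variant).
    Returns (feature_names, features) with features laid out samples x features.
    """
    names, columns = [], []
    for i, j in combinations(range(len(rsids)), 2):
        for code_i, label_i in enumerate(GENOTYPE_LABELS):
            for code_j, label_j in enumerate(GENOTYPE_LABELS):
                names.append(f"{rsids[i]}_{label_i}*{rsids[j]}_{label_j}")
                # 1 when the sample carries exactly this genotype combination
                columns.append(
                    [int(row[i] == code_i and row[j] == code_j) for row in genotypes]
                )
    features = [list(sample) for sample in zip(*columns)]   # transpose
    return names, features


names, features = encode_pairwise_interactions(["rs1", "rs2"], [[0, 2], [1, 1]])
```

Each sample activates exactly one of the nine features of a pair, so the encoded matrix is sparse and binary.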
-
genepi.step4_singleGeneEpistasis_Lasso.
FilterInLoading
(np_genotype, np_phenotype)[source]¶ Filters out low-quality variants. Before modeling each subset of genotype features, two criteria are applied to exclude low-quality data. First, the genotype frequency of a feature should exceed 5%, where the genotype frequency is the proportion of samples in the dataset that carry the genotype. Second, the feature should be associated with the phenotype: a χ² test is used to estimate the association between the feature and the phenotype, and the p-value should be smaller than 0.01.
Parameters: - np_genotype (ndarray) – 2D array containing genotype data with int8 type
- np_phenotype (ndarray) – 2D array containing phenotype data with float type
Returns: np_genotype
2D array containing genotype data with int8 type
Return type: (ndarray)
-
genepi.step4_singleGeneEpistasis_Lasso.
LassoRegressionCV
(np_X, np_y, int_kOfKFold=2, int_nJobs=1)[source]¶ Implementation of the L1-regularized Lasso regression with k-fold cross validation.
Parameters: Returns: estimator.scores
1D array containing the score of each genetic feature with float type
Return type: (ndarray)
-
genepi.step4_singleGeneEpistasis_Lasso.
RandomizedLassoRegression
(np_X, np_y)[source]¶ Implementation of the stability selection.
Parameters: - np_X (ndarray) – 2D array containing genotype data with int8 type
- np_y (ndarray) – 2D array containing phenotype data with float type
Returns: estimator.scores
1D array containing the score of each genetic feature with float type
Return type: (ndarray)
-
genepi.step4_singleGeneEpistasis_Lasso.
SingleGeneEpistasisLasso
(str_inputFileName_genotype, str_inputFileName_phenotype, str_outputFilePath='', int_kOfKFold=2, int_nJobs=1)[source]¶ A workflow to model a single gene, comprising two-element combinatorial encoding, stability selection, filtering of low-quality variants, and L1-regularized Lasso regression with k-fold cross validation.
Parameters: - str_inputFileName_genotype (str) – File name of input genotype data
- str_inputFileName_phenotype (str) – File name of input phenotype data
- str_outputFilePath (str) – File path of output file
- int_kOfKFold (int) – The k for k-fold cross validation (default: 2)
- int_nJobs (int) – The number of threads (default: 1)
Returns: float_AVG_S_P
The average of the Pearson’s and Spearman’s correlation of the model
Return type: (float)
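The returned metric, the mean of the Pearson and Spearman correlations between observed and predicted phenotypes, can be sketched with the standard library alone. The function names are hypothetical, ties are handled with average ranks, and returning 0.0 for a constant input is a convention chosen for this sketch.

```python
import math


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return cov / (norm_x * norm_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def average_spearman_pearson(y_true, y_pred):
    """Mean of Pearson's r and Spearman's rho (Pearson on the ranks)."""
    spearman = pearson(average_ranks(y_true), average_ranks(y_pred))
    return (pearson(y_true, y_pred) + spearman) / 2


score = average_spearman_pearson([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
```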
step4_singleGeneEpistasis_Logistic¶
Created on Feb 2018
@author: Chester (Yu-Chuan Chang)
-
genepi.step4_singleGeneEpistasis_Logistic.
BatchSingleGeneEpistasisLogistic
(str_inputFilePath_genotype, str_inputFileName_phenotype, str_outputFilePath='', int_kOfKFold=2, int_nJobs=1)[source]¶ Batch running for the single gene workflow.
Parameters: - str_inputFilePath_genotype (str) – File path of input genotype data
- str_inputFileName_phenotype (str) – File name of input phenotype data
- str_outputFilePath (str) – File path of output file
- int_kOfKFold (int) – The k for k-fold cross validation (default: 2)
- int_nJobs (int) – The number of threads (default: 1)
Returns: Expected Success Response:
"step4: Detect single gene epistasis. DONE!"
-
genepi.step4_singleGeneEpistasis_Logistic.
FeatureEncoderLogistic
(np_genotype_rsid, np_genotype, np_phenotype, int_dim)[source]¶ Implementation of the two-element combinatorial encoding.
Parameters: - np_genotype_rsid (ndarray) – 1D array containing rsid of genotype data with str type
- np_genotype (ndarray) – 2D array containing genotype data with int8 type
- np_phenotype (ndarray) – 2D array containing phenotype data with float type
- int_dim (int) – The dimension of a variant (default: 3. AA, AB and BB)
Returns: tuple containing:
- list_interaction_rsid (ndarray): 1D array containing rsid of genotype data with str type
- np_interaction (ndarray): 2D array containing genotype data with int8 type
Return type: (tuple)
-
genepi.step4_singleGeneEpistasis_Logistic.
FilterInLoading
(np_genotype, np_phenotype)[source]¶ Filters out low-quality variants. Before modeling each subset of genotype features, two criteria are applied to exclude low-quality data. First, the genotype frequency of a feature should exceed 5%, where the genotype frequency is the proportion of samples in the dataset that carry the genotype. Second, the feature should be associated with the phenotype: a χ² test is used to estimate the association between the feature and the phenotype, and the p-value should be smaller than 0.01.
Parameters: - np_genotype (ndarray) – 2D array containing genotype data with int8 type
- np_phenotype (ndarray) – 2D array containing phenotype data with float type
Returns: np_genotype
2D array containing genotype data with int8 type
Return type: (ndarray)
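The two criteria can be sketched for binary features and a binary phenotype as follows. This is an illustration under stated assumptions, not the library's code: the helper names are hypothetical, the χ² statistic is computed for a 2x2 table without continuity correction, and the p-value uses the closed form for one degree of freedom.

```python
import math


def chi_square_p_value_2x2(table):
    """P-value of the chi-square test on a 2x2 table (1 degree of freedom,
    no continuity correction). Returns 1.0 when a margin is empty."""
    (a, b), (c, d) = table
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    if 0 in rows or 0 in cols:
        return 1.0
    chi2 = n * (a * d - b * c) ** 2 / (rows[0] * rows[1] * cols[0] * cols[1])
    # Survival function of chi-square with 1 df: erfc(sqrt(x / 2))
    return math.erfc(math.sqrt(chi2 / 2.0))


def filter_features(features, phenotype, min_frequency=0.05, max_p_value=0.01):
    """Return the indices of the feature columns that pass both criteria.

    features: samples x features, entries 0/1; phenotype: list of 0/1.
    """
    n_samples = len(phenotype)
    kept = []
    for col in range(len(features[0])):
        values = [row[col] for row in features]
        if sum(values) / n_samples <= min_frequency:
            continue                        # criterion 1: frequency must exceed 5%
        table = [[0, 0], [0, 0]]
        for value, label in zip(values, phenotype):
            table[value][label] += 1
        if chi_square_p_value_2x2(table) < max_p_value:
            kept.append(col)                # criterion 2: p-value below 0.01
    return kept


phenotype = [0] * 10 + [1] * 10
associated = list(phenotype)                # identical to the phenotype
unrelated = [0, 1] * 10                     # evenly spread over both groups
rare = [1] + [0] * 19                       # carried by exactly 5% of samples
features = [[associated[i], unrelated[i], rare[i]] for i in range(20)]
kept = filter_features(features, phenotype)
```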
-
genepi.step4_singleGeneEpistasis_Logistic.
GenerateContingencyTable
(np_genotype, np_phenotype)[source]¶ Generating the contingency table for chi-square test.
Parameters: - np_X (ndarray) – 2D array containing genotype data with int8 type
- np_y (ndarray) – 2D array containing phenotype data with float type
Returns: np_contingency
2D array containing the contingency table with int type
Return type: (ndarray)
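A contingency table of this kind is a cross-tabulation of feature values against phenotype values. The sketch below handles a single feature column with plain lists; the function name and the ordering of rows and columns (sorted distinct values) are assumptions for illustration.

```python
def generate_contingency_table(feature, phenotype):
    """Cross-tabulate one feature column against the phenotype.

    Rows follow the sorted distinct feature values, columns the sorted
    distinct phenotype values; each cell counts the matching samples.
    """
    row_labels = sorted(set(feature))
    col_labels = sorted(set(phenotype))
    row_index = {value: i for i, value in enumerate(row_labels)}
    col_index = {value: j for j, value in enumerate(col_labels)}
    table = [[0] * len(col_labels) for _ in row_labels]
    for value, label in zip(feature, phenotype):
        table[row_index[value]][col_index[label]] += 1
    return table


table = generate_contingency_table([0, 0, 1, 1, 1], [0, 1, 1, 1, 0])
```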
-
genepi.step4_singleGeneEpistasis_Logistic.
LogisticRegressionL1CV
(np_X, np_y, int_kOfKFold=2, int_nJobs=1)[source]¶ Implementation of the L1-regularized Logistic regression with k-fold cross validation.
Parameters: Returns: estimator.scores
1D array containing the score of each genetic feature with float type
Return type: (ndarray)
-
genepi.step4_singleGeneEpistasis_Logistic.
RandomizedLogisticRegression
(np_X, np_y)[source]¶ Implementation of the stability selection.
Parameters: - np_X (ndarray) – 2D array containing genotype data with int8 type
- np_y (ndarray) – 2D array containing phenotype data with float type
Returns: estimator.scores
1D array containing the score of each genetic feature with float type
Return type: (ndarray)
-
genepi.step4_singleGeneEpistasis_Logistic.
SingleGeneEpistasisLogistic
(str_inputFileName_genotype, str_inputFileName_phenotype, str_outputFilePath='', int_kOfKFold=2, int_nJobs=1)[source]¶ A workflow to model a single gene, comprising two-element combinatorial encoding, stability selection, filtering of low-quality variants, and L1-regularized Logistic regression with k-fold cross validation.
Parameters: - str_inputFileName_genotype (str) – File name of input genotype data
- str_inputFileName_phenotype (str) – File name of input phenotype data
- str_outputFilePath (str) – File path of output file
- int_kOfKFold (int) – The k for k-fold cross validation (default: 2)
- int_nJobs (int) – The number of threads (default: 1)
Returns: float_f1Score
The F1 score of the model
Return type: (float)
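For readers unfamiliar with the metric, the F1 score is the harmonic mean of precision and recall. The sketch below computes it for the positive class of a binary phenotype; the function name is hypothetical, and whether GenEpi uses this exact averaging mode is an assumption.

```python
def f1_score_binary(y_true, y_pred):
    """F1 score of the positive class (label 1).

    Uses the equivalent form 2*TP / (2*TP + FP + FN); returns 0.0 when
    there are no positive labels and no positive predictions.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denominator = 2 * tp + fp + fn
    if denominator == 0:
        return 0.0
    return 2 * tp / denominator


f1 = f1_score_binary([1, 1, 0, 0], [1, 0, 1, 0])
```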
step5_crossGeneEpistasis_Lasso¶
Created on Feb 2018
@author: Chester (Yu-Chuan Chang)
-
genepi.step5_crossGeneEpistasis_Lasso.
CrossGeneEpistasisLasso
(str_inputFilePath_feature, str_inputFileName_phenotype, str_inputFileName_score='', str_outputFilePath='', int_kOfKFold=2, int_nJobs=1)[source]¶ A workflow to model cross-gene epistasis, comprising two-element combinatorial encoding, stability selection, filtering of low-quality variants, and L1-regularized Lasso regression with k-fold cross validation.
Parameters: - str_inputFilePath_feature (str) – File path of input feature files from stage 1 - singleGeneEpistasis
- str_inputFileName_phenotype (str) – File name of input phenotype data
- str_inputFileName_score (str) – File name of input score file from stage 1 - singleGeneEpistasis
- str_outputFilePath (str) – File path of output file
- int_kOfKFold (int) – The k for k-fold cross validation (default: 2)
- int_nJobs (int) – The number of threads (default: 1)
Returns: tuple containing:
- float_AVG_S_P_train (float): The average of the Pearson’s and Spearman’s correlation of the model for training set
- float_AVG_S_P_test (float): The average of the Pearson’s and Spearman’s correlation of the model for testing set
Expected Success Response:
"step5: Detect cross gene epistasis. DONE!"
Return type: (tuple)
-
genepi.step5_crossGeneEpistasis_Lasso.
LassoRegression
(np_X, np_y, int_nJobs=1)[source]¶ Implementation of the L1-regularized Lasso regression with k-fold cross validation.
Parameters: - np_X (ndarray) – 2D array containing genotype data with int8 type
- np_y (ndarray) – 2D array containing phenotype data with float type
- int_nJobs (int) – The number of threads (default: 1)
Returns: float_AVG_S_P
The average of the Pearson’s and Spearman’s correlation of the model
Return type: (float)
step5_crossGeneEpistasis_Logistic¶
Created on Feb 2018
@author: Chester (Yu-Chuan Chang)
-
genepi.step5_crossGeneEpistasis_Logistic.
ClassifierModelPersistence
(np_X, np_y, str_outputFilePath='', int_nJobs=1)[source]¶ Dumping classifier for model persistence
Parameters: Returns: None
-
genepi.step5_crossGeneEpistasis_Logistic.
CrossGeneEpistasisLogistic
(str_inputFilePath_feature, str_inputFileName_phenotype, str_inputFileName_score='', str_outputFilePath='', int_kOfKFold=2, int_nJobs=1)[source]¶ A workflow to model cross-gene epistasis, comprising two-element combinatorial encoding, stability selection, filtering of low-quality variants, and L1-regularized Logistic regression with k-fold cross validation.
Parameters: - str_inputFilePath_feature (str) – File path of input feature files from stage 1 - singleGeneEpistasis
- str_inputFileName_phenotype (str) – File name of input phenotype data
- str_inputFileName_score (str) – File name of input score file from stage 1 - singleGeneEpistasis
- str_outputFilePath (str) – File path of output file
- int_kOfKFold (int) – The k for k-fold cross validation (default: 2)
- int_nJobs (int) – The number of threads (default: 1)
Returns: tuple containing:
- float_f1Score_train (float): The F1 score of the model for training set
- float_f1Score_test (float): The F1 score of the model for testing set
Expected Success Response:
"step5: Detect cross gene epistasis. DONE!"
Return type: (tuple)
-
genepi.step5_crossGeneEpistasis_Logistic.
GenerateContingencyTable
(np_genotype, np_phenotype)[source]¶ Generating the contingency table for chi-square test.
Parameters: - np_X (ndarray) – 2D array containing genotype data with int8 type
- np_y (ndarray) – 2D array containing phenotype data with float type
Returns: np_contingency
2D array containing the contingency table with int type
Return type: (ndarray)
-
genepi.step5_crossGeneEpistasis_Logistic.
LogisticRegressionL1
(np_X, np_y, int_nJobs=1)[source]¶ Implementation of the L1-regularized Logistic regression with k-fold cross validation.
Parameters: - np_X (ndarray) – 2D array containing genotype data with int8 type
- np_y (ndarray) – 2D array containing phenotype data with float type
- int_nJobs (int) – The number of threads (default: 1)
Returns: float_f1Score
The F1 score of the model
Return type: (float)
-
genepi.step5_crossGeneEpistasis_Logistic.
PlotPolygenicScore
(list_target, list_predict, list_proba, str_outputFilePath='', str_label='')[source]¶ Plots figures for the polygenic score (PGS), including the group distribution and the prevalence against PGS.
Parameters: - list_target (list) – A list containing the target of each sample
- list_predict (list) – A list containing the predicted value of each sample
- list_proba (list) – A list containing the predicted probability of each sample
- str_outputFilePath (str) – File path of output file
- str_label (str) – The label of the output plots
Returns: None
step6_ensembleWithCovariates¶
Created on Feb 2018
@author: Chester (Yu-Chuan Chang)
-
genepi.step6_ensembleWithCovariates.
ClassifierModelPersistence
(np_X, np_y, str_outputFilePath='', int_nJobs=1)[source]¶ Dumping ensemble classifier for model persistence
Parameters: Returns: None
-
genepi.step6_ensembleWithCovariates.
EnsembleWithCovariatesClassifier
(str_inputFileName_feature, str_inputFileName_phenotype, str_outputFilePath='', int_kOfKFold=2, int_nJobs=1)[source]¶ A workflow to ensemble genetic features with covariates for L1-regularized Logistic regression.
Parameters: - str_inputFileName_feature (str) – File name of input feature file from stage 2 - crossGeneEpistasis
- str_inputFileName_phenotype (str) – File name of input phenotype data
- str_outputFilePath (str) – File path of output file
- int_kOfKFold (int) – The k for k-fold cross validation (default: 2)
- int_nJobs (int) – The number of threads (default: 1)
Returns: tuple containing:
- float_f1Score_train (float): The F1 score of the model for training set
- float_f1Score_test (float): The F1 score of the model for testing set
Expected Success Response:
"step6: Ensemble with covariates. DONE!"
Return type: (tuple)
-
genepi.step6_ensembleWithCovariates.
EnsembleWithCovariatesRegressor
(str_inputFileName_feature, str_inputFileName_phenotype, str_outputFilePath='', int_kOfKFold=2, int_nJobs=1)[source]¶ A workflow to ensemble genetic features with covariates for L1-regularized Lasso regression.
Parameters: - str_inputFileName_feature (str) – File name of input feature file from stage 2 - crossGeneEpistasis
- str_inputFileName_phenotype (str) – File name of input phenotype data
- str_outputFilePath (str) – File path of output file
- int_kOfKFold (int) – The k for k-fold cross validation (default: 2)
- int_nJobs (int) – The number of threads (default: 1)
Returns: tuple containing:
- float_AVG_S_P_train (float): The average of the Pearson’s and Spearman’s correlation of the model for training set
- float_AVG_S_P_test (float): The average of the Pearson’s and Spearman’s correlation of the model for testing set
Expected Success Response:
"step6: Ensemble with covariates. DONE!"
Return type: (tuple)
-
genepi.step6_ensembleWithCovariates.
LoadDataForEnsemble
(str_inputFileName_feature, str_inputFileName_phenotype)[source]¶ Loading genetic features for ensembling with covariates
Parameters: Returns: tuple containing:
- np_genotype (ndarray): 2D array containing genotype data with int8 type
- np_phenotype (ndarray): 2D array containing phenotype data with float type
Return type: (tuple)