API Documentation

GenEpi module

Created on Apr 2019

@author: Chester (Yu-Chuan Chang)

genepi.GenEpi.ArgumentsParser()

Obtains and parses the arguments supplied by the user.

Parameters:	None
Returns:	argparse.ArgumentParser

genepi.GenEpi.InputChecking(str_inputFileName_genotype, str_inputFileName_phenotype, args)

Checks that the number of samples is consistent between the genotype and phenotype data.

Parameters:

str_inputFileName_genotype (str) – File name of input genotype data
str_inputFileName_phenotype (str) – File name of input phenotype data

Returns:

tuple containing:

int_num_genotype (int): The number of samples in the genotype data

int_num_phenotype (int): The number of samples in the phenotype data

Return type:

(tuple)

genepi.GenEpi.main(args=None)

Main function for obtaining user arguments, controlling the workflow and recording the log file.

Parameters:	None
Returns:	None

step1_downloadUCSCDB

Created on Feb 2018

@author: Chester (Yu-Chuan Chang)

genepi.step1_downloadUCSCDB.DownloadUCSCDB(str_outputFilePath='/home/docs/checkouts/readthedocs.org/user_builds/genepi/envs/latest/lib/python3.7/site-packages/genepi-2.0.10-py3.7.egg/genepi', str_hgbuild='hg19')

Retrieves the kgXref and knownGene data tables from the UCSC human genome annotation database to obtain gene information such as official gene symbols and genomic coordinates.

Parameters:

str_outputFilePath (str) – File path of output database
str_hgbuild (str) – Genome build (e.g. “hg19”)

Returns:

Expected Success Response:

"step1: Down load UCSC Database. DONE!"

step2_estimateLD

Created on Feb 2018

@author: Chester (Yu-Chuan Chang)

genepi.step2_estimateLD.EstimateAlleleFrequency(gen_snp)

A function for estimating the allele frequency of a single variant.

Parameters:	gen_snp (list) – The genotypes of a variant across all samples

Returns:

tuple containing:

float_frequency_A (float): The reference allele type frequency

float_frequency_B (float): The alternative allele type frequency

Return type:

(tuple)
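As a rough illustration of the estimate described above, the following sketch assumes each genotype is coded as the number of alternative (B) alleles a sample carries (0 for AA, 1 for AB, 2 for BB), with None marking a missing call. That coding and the name estimate_allele_frequency_sketch are assumptions made for this example; the layout GenEpi actually expects for gen_snp is not specified here.

```python
def estimate_allele_frequency_sketch(genotypes):
    """Return (frequency_A, frequency_B) for one variant.

    Assumption: genotypes holds B-allele counts per sample
    (0 = AA, 1 = AB, 2 = BB) and None for missing calls.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        raise ValueError("no called genotypes for this variant")
    # Each diploid sample contributes two alleles.
    total_alleles = 2 * len(called)
    frequency_b = sum(called) / total_alleles
    return 1.0 - frequency_b, frequency_b
```

Missing calls are dropped before counting, so the two frequencies always sum to one.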

genepi.step2_estimateLD.EstimateLDBlock(str_inputFileName_genotype, str_outputFilePath='', float_threshold_DPrime=0.8, float_threshold_RSquare=0.8)

A function for linkage disequilibrium (LD) dimension reduction. In genotype data, a variant often exhibits high dependency with its nearby variants because of LD. In practice, it is preferable to group these dependent features to reduce the dimensionality of the genetic features. This function adopts the approach developed by Lewontin (1964) to estimate LD and uses D′ and r² as the criteria for grouping highly dependent genetic features into blocks. In each block, the feature with the largest minor allele frequency is chosen to represent the other features in the same block.

Parameters:

str_inputFileName_genotype (str) – File name of input genotype data
str_outputFilePath (str) – File path of output file
float_threshold_DPrime (float) – The Dprime threshold for discriminating an LD block (default: 0.8)
float_threshold_RSquare (float) – The RSquare threshold for discriminating an LD block (default: 0.8)

Returns:

Expected Success Response:
```
"step2: Estimate LD. DONE!"
```
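The block-and-representative idea in the description can be sketched as follows. This is not GenEpi's implementation: it assumes the variants are already ordered by position, that LD has been precomputed only between adjacent variants, and that a block is extended while both thresholds are met. The function name and the input layout are hypothetical.

```python
def group_ld_blocks_sketch(variant_ids, mafs, adjacent_ld,
                           d_prime_min=0.8, r_square_min=0.8):
    """Return one representative variant id per LD block.

    Assumptions: adjacent_ld[i] is the (D', r^2) pair between
    variant i and variant i + 1; mafs[i] is the minor allele
    frequency of variant i.
    """
    if not variant_ids:
        return []
    blocks = [[0]]
    for i, (d_prime, r_square) in enumerate(adjacent_ld):
        if d_prime >= d_prime_min and r_square >= r_square_min:
            blocks[-1].append(i + 1)   # still inside the current block
        else:
            blocks.append([i + 1])     # LD too weak: start a new block
    # The variant with the largest MAF represents its block.
    return [variant_ids[max(block, key=lambda j: mafs[j])]
            for block in blocks]
```

With four variants where only the middle pair is weakly linked, the sketch yields two blocks and therefore two representatives.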

genepi.step2_estimateLD.EstimatePairwiseLD(gen_snp_1, gen_snp_2)

Lewontin (1964) linkage disequilibrium (LD) estimation.

Parameters:

gen_snp_1 (list) – The genotypes of the first variant across all samples
gen_snp_2 (list) – The genotypes of the second variant across all samples

Returns:

tuple containing:

float_D_prime (float): The DPrime of these two variants

float_R_square (float): The RSquare of these two variants

Return type:

(tuple)
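For reference, the two statistics returned here have standard textbook definitions. The sketch below computes them from haplotype frequencies, which it assumes are already known: p_a and p_b are the frequencies of one allele at each variant and p_ab is the frequency of the haplotype carrying both. Estimating those frequencies from unphased genotypes is a separate step that is not shown, and the function name is hypothetical.

```python
def pairwise_ld_sketch(p_a, p_b, p_ab):
    """Return (D', r^2) from allele and haplotype frequencies."""
    d = p_ab - p_a * p_b  # raw disequilibrium coefficient D
    # D is normalised by its largest attainable magnitude given
    # the allele frequencies; the bound depends on the sign of D.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    r_square = d * d / denominator if denominator > 0 else 0.0
    return d_prime, r_square
```

Two variants that always travel together give D′ = 1 and r² = 1, while independent variants give 0 for both.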

step3_splitByGene

Created on Feb 2018

@author: Chester (Yu-Chuan Chang)

genepi.step3_splitByGene.SplitByGene(str_inputFileName_genotype, str_inputFileName_UCSCDB='/home/docs/checkouts/readthedocs.org/user_builds/genepi/envs/latest/lib/python3.7/site-packages/genepi-2.0.10-py3.7.egg/genepi/UCSCGenomeDatabase.txt', str_outputFilePath='')

To extract the genetic features of each gene, this function uses the start and end positions of each gene from the local UCSC database to split the genetic features, then generates a .GEN file for each gene in the folder named snpSubsets.

Parameters:

str_inputFileName_genotype (str) – File name of input genotype data
str_inputFileName_UCSCDB (str) – File name of input genome regions
str_outputFilePath (str) – File path of output file

Returns:

Expected Success Response:
```
"step3: Split by gene. DONE!"
```

Warning

“Warning of step3: .gen file should be sorted by chromosome and position”

genepi.step3_splitByGene.SplitMegaGene(list_snpsOnGene, int_window, int_step, str_outputFilePath, str_outputFileName)

Splits the SNPs on a single gene into subsets with a sliding window of size int_window and step int_step, and writes the result under the given output path and file name.

Parameters:

list_snpsOnGene (list) – A list containing the SNPs on a gene
int_window (int) – The size of the sliding window
int_step (int) – The step of the sliding window
str_outputFilePath (str) – File path of output file
str_outputFileName (str) – File name of output file

Returns:	None
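The int_window and int_step parameters suggest a sliding-window split. A minimal sketch of that idea, under the assumption that windows are taken over the ordered SNP list and that the last window may be shorter than the rest, could look like this. The function name is hypothetical and file output is left out.

```python
def sliding_windows_sketch(snps, window, step):
    """Split an ordered list of SNPs into overlapping subsets."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    if not snps:
        return []
    subsets = []
    start = 0
    while True:
        subsets.append(snps[start:start + window])
        # Stop once the current window reaches the end of the list.
        if start + window >= len(snps):
            break
        start += step
    return subsets
```

With ten SNPs, a window of 4 and a step of 3, the sketch yields three subsets that overlap by one SNP each.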

step4_singleGeneEpistasis_Lasso

Created on Feb 2018

@author: Chester (Yu-Chuan Chang)

genepi.step4_singleGeneEpistasis_Lasso.BatchSingleGeneEpistasisLasso(str_inputFilePath_genotype, str_inputFileName_phenotype, str_outputFilePath='', int_kOfKFold=2, int_nJobs=1)

Batch running for the single gene workflow.

Parameters:

str_inputFilePath_genotype (str) – File path of input genotype data
str_inputFileName_phenotype (str) – File name of input phenotype data
str_outputFilePath (str) – File path of output file
int_kOfKFold (int) – The k for k-fold cross validation (default: 2)
int_nJobs (int) – The number of threads (default: 1)

Returns:

Expected Success Response:

"step4: Detect single gene epistasis. DONE!"

genepi.step4_singleGeneEpistasis_Lasso.FeatureEncoderLasso(np_genotype_rsid, np_genotype, np_phenotype, int_dim)

Implementation of the two-element combinatorial encoding.

Parameters:

np_genotype_rsid (ndarray) – 1D array containing rsid of genotype data with str type
np_genotype (ndarray) – 2D array containing genotype data with int8 type
np_phenotype (ndarray) – 2D array containing phenotype data with float type
int_dim (int) – The dimension of a variant (default: 3. AA, AB and BB)

Returns:

tuple containing:

list_interaction_rsid (ndarray): 1D array containing rsid of genotype data with str type

np_interaction (ndarray): 2D array containing genotype data with int8 type

Return type:

(tuple)
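One way to read "two-element combinatorial encoding" is sketched below: each variant is one-hot encoded into its three genotypes (AA, AB, BB), and every pair of genotype indicators from two different variants is multiplied into an interaction feature. This reading, the 0/1/2 genotype coding, the feature naming scheme and the function name are all assumptions made for illustration; the sketch uses plain lists rather than ndarrays to stay self-contained.

```python
from itertools import combinations

GENOTYPE_LABELS = ("AA", "AB", "BB")  # assumed codes 0, 1, 2


def encode_pairs_sketch(rsids, genotype_matrix):
    """Return {feature_name: 0/1 column} for pairwise interactions.

    genotype_matrix[sample][variant] holds an assumed 0/1/2 code.
    """
    # Step 1: one-hot encode every variant into three indicators.
    columns = {}
    for v, rsid in enumerate(rsids):
        for code, label in enumerate(GENOTYPE_LABELS):
            columns[(rsid, label)] = [int(row[v] == code)
                                      for row in genotype_matrix]
    # Step 2: combine indicators from two different variants.
    interactions = {}
    for (key_1, col_1), (key_2, col_2) in combinations(columns.items(), 2):
        if key_1[0] == key_2[0]:
            continue  # genotypes of one variant are mutually exclusive
        name = f"{key_1[0]}_{key_1[1]}*{key_2[0]}_{key_2[1]}"
        interactions[name] = [a * b for a, b in zip(col_1, col_2)]
    return interactions
```

Two variants produce 3 × 3 = 9 interaction features under this scheme, each equal to 1 only for samples carrying both genotypes.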

genepi.step4_singleGeneEpistasis_Lasso.FilterInLoading(np_genotype, np_phenotype)

This function filters low-quality variants. Before modeling each subset of genotype features, two criteria are applied to exclude low-quality data. The first criterion is that the genotype frequency of a feature should exceed 5%, where the genotype frequency is the proportion of samples in the dataset carrying that genotype. The second criterion concerns the association between the feature and the phenotype: a χ² test estimates this association, and the p-value should be smaller than 0.01.

Parameters:

np_genotype (ndarray) – 2D array containing genotype data with int8 type
np_phenotype (ndarray) – 2D array containing phenotype data with float type

Returns:

np_genotype

2D array containing genotype data with int8 type

Return type:

(ndarray)

genepi.step4_singleGeneEpistasis_Lasso.LassoRegressionCV(np_X, np_y, int_kOfKFold=2, int_nJobs=1)

Implementation of the L1-regularized Lasso regression with k-fold cross validation.

Parameters:

np_X (ndarray) – 2D array containing genotype data with int8 type
np_y (ndarray) – 2D array containing phenotype data with float type
int_kOfKFold (int) – The k for k-fold cross validation (default: 2)
int_nJobs (int) – The number of threads (default: 1)

Returns:

estimator.scores

1D array containing the score of each genetic feature with float type

Return type:

(ndarray)

genepi.step4_singleGeneEpistasis_Lasso.RandomizedLassoRegression(np_X, np_y)

Implementation of the stability selection.

Parameters:

np_X (ndarray) – 2D array containing genotype data with int8 type
np_y (ndarray) – 2D array containing phenotype data with float type

Returns:

estimator.scores

1D array containing the score of each genetic feature with float type

Return type:

(ndarray)

genepi.step4_singleGeneEpistasis_Lasso.SingleGeneEpistasisLasso(str_inputFileName_genotype, str_inputFileName_phenotype, str_outputFilePath='', int_kOfKFold=2, int_nJobs=1)

A workflow for modeling a single gene, comprising two-element combinatorial encoding, stability selection, filtering of low-quality variants and L1-regularized Lasso regression with k-fold cross validation.

Parameters:

str_inputFileName_genotype (str) – File name of input genotype data
str_inputFileName_phenotype (str) – File name of input phenotype data
str_outputFilePath (str) – File path of output file
int_kOfKFold (int) – The k for k-fold cross validation (default: 2)
int_nJobs (int) – The number of threads (default: 1)

Returns:

float_AVG_S_P

The average of the Pearson’s and Spearman’s correlation of the model

Return type:

(float)
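The returned metric averages two correlation coefficients between observed and predicted phenotypes. The sketch below shows one standard-library way to compute such an average, with Spearman's coefficient obtained as Pearson's coefficient on average ranks. The helper names are hypothetical and this is not necessarily how GenEpi computes float_AVG_S_P internally.

```python
import math


def pearson_sketch(x, y):
    """Pearson correlation of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks_sketch(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def average_correlation_sketch(y_true, y_pred):
    """Mean of the Pearson and Spearman correlations."""
    spearman = pearson_sketch(ranks_sketch(y_true), ranks_sketch(y_pred))
    return (pearson_sketch(y_true, y_pred) + spearman) / 2
```

For a monotone but non-linear relationship the Spearman term is 1 while the Pearson term is slightly lower, so the average falls just below 1.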

step4_singleGeneEpistasis_Logistic

Created on Feb 2018

@author: Chester (Yu-Chuan Chang)

genepi.step4_singleGeneEpistasis_Logistic.BatchSingleGeneEpistasisLogistic(str_inputFilePath_genotype, str_inputFileName_phenotype, str_outputFilePath='', int_kOfKFold=2, int_nJobs=1)

Batch running for the single gene workflow.

Parameters:

str_inputFilePath_genotype (str) – File path of input genotype data
str_inputFileName_phenotype (str) – File name of input phenotype data
str_outputFilePath (str) – File path of output file
int_kOfKFold (int) – The k for k-fold cross validation (default: 2)
int_nJobs (int) – The number of threads (default: 1)

Returns:

Expected Success Response:

"step4: Detect single gene epistasis. DONE!"

genepi.step4_singleGeneEpistasis_Logistic.FeatureEncoderLogistic(np_genotype_rsid, np_genotype, np_phenotype, int_dim)

Implementation of the two-element combinatorial encoding.

Parameters:

np_genotype_rsid (ndarray) – 1D array containing rsid of genotype data with str type
np_genotype (ndarray) – 2D array containing genotype data with int8 type
np_phenotype (ndarray) – 2D array containing phenotype data with float type
int_dim (int) – The dimension of a variant (default: 3. AA, AB and BB)

Returns:

tuple containing:

list_interaction_rsid (ndarray): 1D array containing rsid of genotype data with str type

np_interaction (ndarray): 2D array containing genotype data with int8 type

Return type:

(tuple)

genepi.step4_singleGeneEpistasis_Logistic.FilterInLoading(np_genotype, np_phenotype)

This function filters low-quality variants. Before modeling each subset of genotype features, two criteria are applied to exclude low-quality data. The first criterion is that the genotype frequency of a feature should exceed 5%, where the genotype frequency is the proportion of samples in the dataset carrying that genotype. The second criterion concerns the association between the feature and the phenotype: a χ² test estimates this association, and the p-value should be smaller than 0.01.

Parameters:

np_genotype (ndarray) – 2D array containing genotype data with int8 type
np_phenotype (ndarray) – 2D array containing phenotype data with float type

Returns:

np_genotype

2D array containing genotype data with int8 type

Return type:

(ndarray)
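The two filtering criteria can be illustrated for a single feature as follows. The sketch assumes a binary (0/1) feature column and a binary (0/1) phenotype, builds a 2×2 contingency table, and applies a χ² test with one degree of freedom and no continuity correction, for which the p-value equals erfc(√(χ²/2)). The function names and these simplifications are assumptions; GenEpi's own test may differ in detail.

```python
import math


def contingency_2x2_sketch(feature, phenotype):
    """table[f][y] counts samples with feature f and phenotype y."""
    table = [[0, 0], [0, 0]]
    for f, y in zip(feature, phenotype):
        table[int(f)][int(y)] += 1
    return table


def chi_square_p_value_sketch(table):
    """P-value of a 2x2 chi-square test (1 degree of freedom)."""
    (a, b), (c, d) = table
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    if margins == 0:
        return 1.0  # an empty row or column carries no evidence
    statistic = (a + b + c + d) * (a * d - b * c) ** 2 / margins
    return math.erfc(math.sqrt(statistic / 2.0))


def passes_filter_sketch(feature, phenotype,
                         min_frequency=0.05, max_p_value=0.01):
    """Apply the frequency criterion, then the association criterion."""
    if sum(feature) / len(feature) <= min_frequency:
        return False
    table = contingency_2x2_sketch(feature, phenotype)
    return chi_square_p_value_sketch(table) < max_p_value
```

A feature carried by fewer than 5% of samples is rejected before the test runs, and a common feature with no association to the phenotype is rejected by the p-value threshold.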

genepi.step4_singleGeneEpistasis_Logistic.GenerateContingencyTable(np_genotype, np_phenotype)

Generating the contingency table for chi-square test.

Parameters:

np_genotype (ndarray) – 2D array containing genotype data with int8 type
np_phenotype (ndarray) – 2D array containing phenotype data with float type

Returns:

np_contingency

2D array containing the contingency table with int type

Return type:

(ndarray)

genepi.step4_singleGeneEpistasis_Logistic.LogisticRegressionL1CV(np_X, np_y, int_kOfKFold=2, int_nJobs=1)

Implementation of the L1-regularized Logistic regression with k-fold cross validation.

Parameters:

np_X (ndarray) – 2D array containing genotype data with int8 type
np_y (ndarray) – 2D array containing phenotype data with float type
int_kOfKFold (int) – The k for k-fold cross validation (default: 2)
int_nJobs (int) – The number of threads (default: 1)

Returns:

estimator.scores

1D array containing the score of each genetic feature with float type

Return type:

(ndarray)

genepi.step4_singleGeneEpistasis_Logistic.RandomizedLogisticRegression(np_X, np_y)

Implementation of the stability selection.

Parameters:

np_X (ndarray) – 2D array containing genotype data with int8 type
np_y (ndarray) – 2D array containing phenotype data with float type

Returns:

estimator.scores

1D array containing the score of each genetic feature with float type

Return type:

(ndarray)

genepi.step4_singleGeneEpistasis_Logistic.SingleGeneEpistasisLogistic(str_inputFileName_genotype, str_inputFileName_phenotype, str_outputFilePath='', int_kOfKFold=2, int_nJobs=1)

A workflow for modeling a single gene, comprising two-element combinatorial encoding, stability selection, filtering of low-quality variants and L1-regularized Logistic regression with k-fold cross validation.

Parameters:

str_inputFileName_genotype (str) – File name of input genotype data
str_inputFileName_phenotype (str) – File name of input phenotype data
str_outputFilePath (str) – File path of output file
int_kOfKFold (int) – The k for k-fold cross validation (default: 2)
int_nJobs (int) – The number of threads (default: 1)

Returns:

float_f1Score

The F1 score of the model

Return type:

(float)
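For readers unfamiliar with the metric, the F1 score is the harmonic mean of precision and recall. A minimal standard-library sketch, assuming binary labels with 1 as the positive class and using a hypothetical function name, is:

```python
def f1_score_sketch(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

With two true positives, one false positive and one false negative, precision and recall are both 2/3 and so is the F1 score.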

step5_crossGeneEpistasis_Lasso

Created on Feb 2018

@author: Chester (Yu-Chuan Chang)

genepi.step5_crossGeneEpistasis_Lasso.CrossGeneEpistasisLasso(str_inputFilePath_feature, str_inputFileName_phenotype, str_inputFileName_score='', str_outputFilePath='', int_kOfKFold=2, int_nJobs=1)

A workflow for modeling cross-gene epistasis, comprising two-element combinatorial encoding, stability selection, filtering of low-quality variants and L1-regularized Lasso regression with k-fold cross validation.

Parameters:

str_inputFilePath_feature (str) – File path of input feature files from stage 1 - singleGeneEpistasis
str_inputFileName_phenotype (str) – File name of input phenotype data
str_inputFileName_score (str) – File name of input score file from stage 1 - singleGeneEpistasis
str_outputFilePath (str) – File path of output file
int_kOfKFold (int) – The k for k-fold cross validation (default: 2)
int_nJobs (int) – The number of threads (default: 1)

Returns:

tuple containing:

float_AVG_S_P_train (float): The average of the Pearson’s and Spearman’s correlation of the model for training set

float_AVG_S_P_test (float): The average of the Pearson’s and Spearman’s correlation of the model for testing set

Expected Success Response:

"step5: Detect cross gene epistasis. DONE!"

Return type:

(tuple)

genepi.step5_crossGeneEpistasis_Lasso.LassoRegression(np_X, np_y, int_nJobs=1)

Implementation of the L1-regularized Lasso regression with k-fold cross validation.

Parameters:

np_X (ndarray) – 2D array containing genotype data with int8 type
np_y (ndarray) – 2D array containing phenotype data with float type
int_nJobs (int) – The number of threads (default: 1)

Returns:

float_AVG_S_P

The average of the Pearson’s and Spearman’s correlation of the model

Return type:

(float)

genepi.step5_crossGeneEpistasis_Lasso.RegressorModelPersistence(np_X, np_y, str_outputFilePath='', int_nJobs=1)

Dumps the regressor for model persistence.

Parameters:

np_X (ndarray) – 2D array containing genotype data with int8 type
np_y (ndarray) – 2D array containing phenotype data with float type
str_outputFilePath (str) – File path of output file
int_nJobs (int) – The number of threads (default: 1)

Returns:	None

step5_crossGeneEpistasis_Logistic

Created on Feb 2018

@author: Chester (Yu-Chuan Chang)

genepi.step5_crossGeneEpistasis_Logistic.ClassifierModelPersistence(np_X, np_y, str_outputFilePath='', int_nJobs=1)

Dumps the classifier for model persistence.

Parameters:

np_X (ndarray) – 2D array containing genotype data with int8 type
np_y (ndarray) – 2D array containing phenotype data with float type
str_outputFilePath (str) – File path of output file
int_nJobs (int) – The number of threads (default: 1)

Returns:	None

genepi.step5_crossGeneEpistasis_Logistic.CrossGeneEpistasisLogistic(str_inputFilePath_feature, str_inputFileName_phenotype, str_inputFileName_score='', str_outputFilePath='', int_kOfKFold=2, int_nJobs=1)

A workflow for modeling cross-gene epistasis, comprising two-element combinatorial encoding, stability selection, filtering of low-quality variants and L1-regularized Logistic regression with k-fold cross validation.

Parameters:

str_inputFilePath_feature (str) – File path of input feature files from stage 1 - singleGeneEpistasis
str_inputFileName_phenotype (str) – File name of input phenotype data
str_inputFileName_score (str) – File name of input score file from stage 1 - singleGeneEpistasis
str_outputFilePath (str) – File path of output file
int_kOfKFold (int) – The k for k-fold cross validation (default: 2)
int_nJobs (int) – The number of threads (default: 1)

Returns:

tuple containing:

float_f1Score_train (float): The F1 score of the model for training set

float_f1Score_test (float): The F1 score of the model for testing set

Expected Success Response:

"step5: Detect cross gene epistasis. DONE!"

Return type:

(tuple)

genepi.step5_crossGeneEpistasis_Logistic.GenerateContingencyTable(np_genotype, np_phenotype)

Generating the contingency table for chi-square test.

Parameters:

np_genotype (ndarray) – 2D array containing genotype data with int8 type
np_phenotype (ndarray) – 2D array containing phenotype data with float type

Returns:

np_contingency

2D array containing the contingency table with int type

Return type:

(ndarray)

genepi.step5_crossGeneEpistasis_Logistic.LogisticRegressionL1(np_X, np_y, int_nJobs=1)

Implementation of the L1-regularized Logistic regression with k-fold cross validation.

Parameters:

np_X (ndarray) – 2D array containing genotype data with int8 type
np_y (ndarray) – 2D array containing phenotype data with float type
int_nJobs (int) – The number of threads (default: 1)

Returns:

float_f1Score

The F1 score of the model

Return type:

(float)

genepi.step5_crossGeneEpistasis_Logistic.PlotPolygenicScore(list_target, list_predict, list_proba, str_outputFilePath='', str_label='')

Plots figures for the polygenic score (PGS), including the group distribution and the prevalence versus PGS.

Parameters:

list_target (list) – A list containing the target of each sample
list_predict (list) – A list containing the predicted value of each sample
list_proba (list) – A list containing the predicted probability of each sample
str_outputFilePath (str) – File path of output file
str_label (str) – The label of the output plots

Returns:	None

genepi.step5_crossGeneEpistasis_Logistic.fsigmoid(x, a, b)

genepi.step5_crossGeneEpistasis_Logistic.gaussian(x, mean, amplitude, standard_deviation)
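These two helpers are undocumented. Judging only from their names and parameter lists, they are probably curve shapes used when fitting the polygenic score plots. The sketch below shows the conventional forms such functions usually take; the exact formulas, including the factor of 0.5 in the Gaussian exponent and the roles of a and b, are assumptions and may not match the GenEpi source. The function names are hypothetical.

```python
import math


def sigmoid_curve_sketch(x, a, b):
    """Logistic curve; a is assumed to be the slope, b the midpoint."""
    return 1.0 / (1.0 + math.exp(-a * (x - b)))


def gaussian_curve_sketch(x, mean, amplitude, standard_deviation):
    """Bell curve that peaks at `amplitude` when x equals `mean`."""
    z = (x - mean) / standard_deviation
    return amplitude * math.exp(-0.5 * z * z)
```

Under these forms the sigmoid equals 0.5 at x = b and the Gaussian equals its amplitude at x = mean.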

step6_ensembleWithCovariates

Created on Feb 2018

@author: Chester (Yu-Chuan Chang)

genepi.step6_ensembleWithCovariates.ClassifierModelPersistence(np_X, np_y, str_outputFilePath='', int_nJobs=1)

Dumps the ensemble classifier for model persistence.

Parameters:

np_X (ndarray) – 2D array containing genotype data with int8 type
np_y (ndarray) – 2D array containing phenotype data with float type
str_outputFilePath (str) – File path of output file
int_nJobs (int) – The number of threads (default: 1)

Returns:	None

genepi.step6_ensembleWithCovariates.EnsembleWithCovariatesClassifier(str_inputFileName_feature, str_inputFileName_phenotype, str_outputFilePath='', int_kOfKFold=2, int_nJobs=1)

A workflow to ensemble genetic features with covariates for L1-regularized Logistic regression.

Parameters:

str_inputFileName_feature (str) – File path of input feature files from stage 2 - crossGeneEpistasis
str_inputFileName_phenotype (str) – File name of input phenotype data
str_outputFilePath (str) – File path of output file
int_kOfKFold (int) – The k for k-fold cross validation (default: 2)
int_nJobs (int) – The number of threads (default: 1)

Returns:

tuple containing:

float_f1Score_train (float): The F1 score of the model for training set

float_f1Score_test (float): The F1 score of the model for testing set

Expected Success Response:

"step6: Ensemble with covariates. DONE!"

Return type:

(tuple)

genepi.step6_ensembleWithCovariates.EnsembleWithCovariatesRegressor(str_inputFileName_feature, str_inputFileName_phenotype, str_outputFilePath='', int_kOfKFold=2, int_nJobs=1)

A workflow to ensemble genetic features with covariates for L1-regularized Lasso regression.

Parameters:

str_inputFileName_feature (str) – File path of input feature files from stage 2 - crossGeneEpistasis
str_inputFileName_phenotype (str) – File name of input phenotype data
str_outputFilePath (str) – File path of output file
int_kOfKFold (int) – The k for k-fold cross validation (default: 2)
int_nJobs (int) – The number of threads (default: 1)

Returns:

tuple containing:

float_AVG_S_P_train (float): The average of the Pearson’s and Spearman’s correlation of the model for training set

float_AVG_S_P_test (float): The average of the Pearson’s and Spearman’s correlation of the model for testing set

Expected Success Response:

"step6: Ensemble with covariates. DONE!"

Return type:

(tuple)

genepi.step6_ensembleWithCovariates.LoadDataForEnsemble(str_inputFileName_feature, str_inputFileName_phenotype)

Loads genetic features for ensembling with covariates.

Parameters:

str_inputFileName_feature (str) – File path of input feature files from stage 2 - crossGeneEpistasis
str_inputFileName_phenotype (str) – File name of input phenotype data

Returns:

tuple containing:

np_genotype (ndarray): 2D array containing genotype data with int8 type

np_phenotype (ndarray): 2D array containing phenotype data with float type

Return type:

(tuple)

genepi.step6_ensembleWithCovariates.RegressorModelPersistence(np_X, np_y, str_outputFilePath='', int_nJobs=1)

Dumps the ensemble regressor for model persistence.

Parameters:

np_X (ndarray) – 2D array containing genotype data with int8 type
np_y (ndarray) – 2D array containing phenotype data with float type
str_outputFilePath (str) – File path of output file
int_nJobs (int) – The number of threads (default: 1)

Returns:	None