API Documentation

GenEpi module

Created on Apr 2019

@author: Chester (Yu-Chuan Chang)

genepi.GenEpi.ArgumentsParser()[source]

Obtains and parses the arguments supplied by the user.

Parameters: None
Returns: argparse.ArgumentParser
genepi.GenEpi.InputChecking(str_inputFileName_genotype, str_inputFileName_phenotype, args)[source]

Checks that the number of samples is consistent between the genotype and phenotype data.

Parameters:
  • str_inputFileName_genotype (str) – File name of input genotype data
  • str_inputFileName_phenotype (str) – File name of input phenotype data
Returns:

tuple containing:

  • int_num_genotype (int): The sample number of genotype data
  • int_num_phenotype (int): The sample number of phenotype data

Return type:

(tuple)

genepi.GenEpi.main(args=None)[source]

Main function for obtaining user arguments, controlling the workflow, and recording the log file.

Parameters: None
Returns: None

step1_downloadUCSCDB

Created on Feb 2018

@author: Chester (Yu-Chuan Chang)

genepi.step1_downloadUCSCDB.DownloadUCSCDB(str_outputFilePath='/home/docs/checkouts/readthedocs.org/user_builds/genepi/envs/latest/lib/python3.7/site-packages/genepi-2.0.10-py3.7.egg/genepi', str_hgbuild='hg19')[source]

Retrieves the kgXref and knownGene data tables from the UCSC human genome annotation database in order to obtain gene information such as official gene symbols and genomic coordinates.

Parameters:
  • str_outputFilePath (str) – File path of output database
  • str_hgbuild (str) – Genome build (e.g. “hg19”)
Returns:

  • Expected Success Response:

    "step1: Down load UCSC Database. DONE!"
    

step2_estimateLD

Created on Feb 2018

@author: Chester (Yu-Chuan Chang)

genepi.step2_estimateLD.EstimateAlleleFrequency(gen_snp)[source]

A function for estimating the allele frequency of a single variant

Parameters: gen_snp (list) – The genotypes of a variant of all samples
Returns: tuple containing:
  • float_frequency_A (float): The reference allele type frequency
  • float_frequency_B (float): The alternative allele type frequency
Return type: (tuple)
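
The calculation can be illustrated with a minimal sketch. It is not the GenEpi implementation: it assumes each genotype is already coded as the count of alternative (B) alleles (0 = AA, 1 = AB, 2 = BB) with None for a missing call, and the name estimate_allele_frequency is hypothetical.

```python
def estimate_allele_frequency(genotypes):
    """Return (freq_A, freq_B) for one variant.

    Assumption: each genotype is the count of alternative (B) alleles,
    i.e. 0 = AA, 1 = AB, 2 = BB; None marks a missing call.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        raise ValueError("no called genotypes")
    # Each diploid sample carries two alleles.
    freq_b = sum(called) / (2 * len(called))
    return 1.0 - freq_b, freq_b


freq_a, freq_b = estimate_allele_frequency([0, 1, 2, 1, 0, None])
print(freq_a, freq_b)
```

The minor allele frequency used later for picking block representatives would then be min(freq_a, freq_b).
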
genepi.step2_estimateLD.EstimateLDBlock(str_inputFileName_genotype, str_outputFilePath='', float_threshold_DPrime=0.8, float_threshold_RSquare=0.8)[source]

A function for implementing linkage disequilibrium (LD) dimension reduction. In genotype data, a variant often exhibits high dependency with its nearby variants because of LD. In practice, we prefer to group these dependent features to reduce the dimensionality of the genetic features. To this end, this function adopts the approach developed by Lewontin (1964) to estimate LD. D’ and r² are used as the criteria for grouping highly dependent genetic features into blocks. In each block, the feature with the largest minor allele frequency is chosen to represent the other features in the same block.

Parameters:
  • str_inputFileName_genotype (str) – File name of input genotype data
  • str_outputFilePath (str) – File path of output file
  • float_threshold_DPrime (float) – The D’ threshold for delimiting an LD block (default: 0.8)
  • float_threshold_RSquare (float) – The r² threshold for delimiting an LD block (default: 0.8)
Returns:

  • Expected Success Response:

    "step2: Estimate LD. DONE!"
    
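
The grouping idea described above can be sketched as follows. This is only an illustration under stated assumptions: it treats a block as a run of consecutive variants whose neighbouring pairs pass both thresholds, and it takes the pairwise LD values and minor allele frequencies as precomputed inputs. The function name group_ld_blocks and the exact blocking rule are hypothetical and may differ from what EstimateLDBlock does.

```python
def group_ld_blocks(adjacent_ld, mafs, d_prime_min=0.8, r_square_min=0.8):
    """Group consecutive variants into LD blocks and pick one per block.

    adjacent_ld[i] is the (D', r^2) pair between variant i and variant i + 1;
    mafs[i] is the minor allele frequency of variant i.
    Returns the index of the representative variant of each block.
    """
    if not mafs:
        return []
    blocks = [[0]]
    for i, (d_prime, r_square) in enumerate(adjacent_ld):
        if d_prime >= d_prime_min and r_square >= r_square_min:
            blocks[-1].append(i + 1)   # still in LD with its neighbour
        else:
            blocks.append([i + 1])     # LD broken: start a new block
    # Represent each block by its variant with the largest MAF.
    return [max(block, key=lambda idx: mafs[idx]) for block in blocks]


ld = [(0.95, 0.90), (0.30, 0.10), (0.99, 0.85)]
print(group_ld_blocks(ld, mafs=[0.10, 0.25, 0.40, 0.05]))  # → [1, 2]
```

Here four variants collapse into two blocks, so only two features would be carried forward.
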

genepi.step2_estimateLD.EstimatePairwiseLD(gen_snp_1, gen_snp_2)[source]

Lewontin (1964) linkage disequilibrium (LD) estimation.

Parameters:
  • gen_snp_1 (list) – The genotypes of first variant of all samples
  • gen_snp_2 (list) – The genotypes of second variant of all samples
Returns:

tuple containing:

  • float_D_prime (float): The DPrime of these two variants
  • float_R_square (float): The RSquare of these two variants

Return type:

(tuple)
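
For reference, the two statistics can be computed as in the sketch below. It assumes phased haplotypes are available, which is a simplification: the function documented here takes genotype lists, and how it derives haplotype frequencies from them is not stated. The name pairwise_ld and the input layout are hypothetical.

```python
def pairwise_ld(haplotypes):
    """Return (D', r^2) from phased haplotypes.

    Assumption: `haplotypes` is a list of (allele_1, allele_2) pairs, one
    per chromosome, where 1 marks the alternative allele at each variant.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0, 0.0  # a monomorphic variant carries no LD information
    # Lewontin's normalisation: divide |D| by its largest attainable value.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return abs(d) / d_max, d * d / denom


print(pairwise_ld([(1, 1), (1, 1), (0, 0), (0, 0)]))  # perfectly linked
print(pairwise_ld([(1, 1), (1, 0), (0, 1), (0, 0)]))  # independent
```
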

step3_splitByGene

Created on Feb 2018

@author: Chester (Yu-Chuan Chang)

genepi.step3_splitByGene.SplitByGene(str_inputFileName_genotype, str_inputFileName_UCSCDB='/home/docs/checkouts/readthedocs.org/user_builds/genepi/envs/latest/lib/python3.7/site-packages/genepi-2.0.10-py3.7.egg/genepi/UCSCGenomeDatabase.txt', str_outputFilePath='')[source]

To extract the genetic features of each gene, this function uses the start and end positions of each gene from the local UCSC database to split the genetic features. It then generates a .GEN file for each gene in the folder named snpSubsets.

Parameters:
  • str_inputFileName_genotype (str) – File name of input genotype data
  • str_inputFileName_UCSCDB (str) – File name of input genome regions
  • str_outputFilePath (str) – File path of output file
Returns:

  • Expected Success Response:

    "step3: Split by gene. DONE!"
    

Warning

“Warning of step3: .gen file should be sorted by chromosome and position”
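
A small sketch may help show why the input has to be sorted. It assumes SNPs on a single chromosome given as (rsid, position) pairs and gene regions given as (symbol, start, end) with inclusive coordinates; these layouts and the name split_by_gene are hypothetical, not the formats GenEpi reads.

```python
from bisect import bisect_left, bisect_right


def split_by_gene(snps, genes):
    """Map each gene symbol to the SNP ids that fall inside its region.

    snps:  list of (rsid, position) sorted by position, on one chromosome.
    genes: list of (gene_symbol, start, end) with inclusive coordinates.
    """
    positions = [pos for _, pos in snps]
    subsets = {}
    for symbol, start, end in genes:
        # Binary search is only valid because positions are sorted.
        lo = bisect_left(positions, start)
        hi = bisect_right(positions, end)
        if lo < hi:  # skip genes that contain no SNP
            subsets[symbol] = [rsid for rsid, _ in snps[lo:hi]]
    return subsets


snps = [("rs1", 100), ("rs2", 250), ("rs3", 300), ("rs4", 900)]
genes = [("GENE_A", 200, 300), ("GENE_B", 280, 1000), ("GENE_C", 400, 500)]
print(split_by_gene(snps, genes))
# → {'GENE_A': ['rs2', 'rs3'], 'GENE_B': ['rs3', 'rs4']}
```

A SNP that lies in two overlapping genes (rs3 here) appears in both subsets.
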

genepi.step3_splitByGene.SplitMegaGene(list_snpsOnGene, int_window, int_step, str_outputFilePath, str_outputFileName)[source]

To extract the genetic features of each gene, this function uses the start and end positions of each gene from the local UCSC database to split the genetic features. It then generates a .GEN file for each gene in the folder named snpSubsets.

Parameters:
  • list_snpsOnGene (list) – A list containing the SNPs on a gene
  • int_window (int) – The size of the sliding window
  • int_step (int) – The step of the sliding window
  • str_outputFilePath (str) – File path of output file
  • str_outputFileName (str) – File name of output file
Returns:

None
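
The int_window and int_step parameters suggest a sliding-window split. A minimal sketch of that idea follows; how the real function treats the final, shorter window and how it names the output files are not documented here, so those details are assumptions.

```python
def split_mega_gene(snps_on_gene, window, step):
    """Yield overlapping windows of SNPs; the last window may be shorter."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    start = 0
    while True:
        yield snps_on_gene[start:start + window]
        if start + window >= len(snps_on_gene):
            break  # the current window already reaches the end of the gene
        start += step


snps = ["rs1", "rs2", "rs3", "rs4", "rs5", "rs6", "rs7"]
for subset in split_mega_gene(snps, window=4, step=2):
    print(subset)
```

With a step smaller than the window, neighbouring subsets overlap, so SNP pairs near a window boundary are still modeled together at least once.
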

step4_singleGeneEpistasis_Lasso

Created on Feb 2018

@author: Chester (Yu-Chuan Chang)

genepi.step4_singleGeneEpistasis_Lasso.BatchSingleGeneEpistasisLasso(str_inputFilePath_genotype, str_inputFileName_phenotype, str_outputFilePath='', int_kOfKFold=2, int_nJobs=1)[source]

Batch running for the single gene workflow.

Parameters:
  • str_inputFilePath_genotype (str) – File path of input genotype data
  • str_inputFileName_phenotype (str) – File name of input phenotype data
  • str_outputFilePath (str) – File path of output file
  • int_kOfKFold (int) – The k for k-fold cross validation (default: 2)
  • int_nJobs (int) – The number of threads (default: 1)
Returns:

  • Expected Success Response:

    "step4: Detect single gene epistasis. DONE!"
    

genepi.step4_singleGeneEpistasis_Lasso.FeatureEncoderLasso(np_genotype_rsid, np_genotype, np_phenotype, int_dim)[source]

Implementation of the two-element combinatorial encoding.

Parameters:
  • np_genotype_rsid (ndarray) – 1D array containing rsid of genotype data with str type
  • np_genotype (ndarray) – 2D array containing genotype data with int8 type
  • np_phenotype (ndarray) – 2D array containing phenotype data with float type
  • int_dim (int) – The dimension of a variant (default: 3, i.e. AA, AB and BB)
Returns:

tuple containing:

  • list_interaction_rsid (ndarray): 1D array containing rsid of genotype data with str type
  • np_interaction (ndarray): 2D array containing genotype data with int8 type

Return type:

(tuple)
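
One plausible reading of two-element combinatorial encoding is sketched below: each variant is one-hot encoded into its genotypes, and every pair of genotype indicators from two different variants is multiplied into an interaction feature. The genotype coding (0 = AA, 1 = AB, 2 = BB), the feature naming scheme, and the decision to keep the single-variant columns are all assumptions for illustration, not details confirmed by this reference.

```python
from itertools import combinations

GENOTYPE_LABELS = ("AA", "AB", "BB")  # corresponds to int_dim = 3


def one_hot(rsids, genotypes):
    """Expand each variant into one 0/1 column per genotype."""
    names, columns = [], []
    for j, rsid in enumerate(rsids):
        for code, label in enumerate(GENOTYPE_LABELS):
            names.append(f"{rsid}_{label}")
            columns.append([1 if row[j] == code else 0 for row in genotypes])
    return names, columns


def encode_pairs(rsids, genotypes):
    """Return names and columns of single features plus pairwise products."""
    names, columns = one_hot(rsids, genotypes)
    out_names, out_columns = list(names), list(columns)
    dim = len(GENOTYPE_LABELS)
    for a, b in combinations(range(len(names)), 2):
        if a // dim == b // dim:
            continue  # two genotypes of the same variant never co-occur
        out_names.append(f"{names[a]}*{names[b]}")
        out_columns.append([x * y for x, y in zip(columns[a], columns[b])])
    return out_names, out_columns


# Rows are samples, columns are variants.
names, columns = encode_pairs(["rs1", "rs2"], [[0, 2], [1, 1], [0, 2]])
print(len(names))
print(columns[names.index("rs1_AA*rs2_BB")])
```

Two variants give 6 single features and 9 interaction features, which is why the later filtering and selection steps matter.
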

genepi.step4_singleGeneEpistasis_Lasso.FilterInLoading(np_genotype, np_phenotype)[source]

This function filters out low-quality variants. Before modeling each subset of genotype features, two criteria are applied to exclude low-quality data. First, the genotype frequency of a feature must exceed 5%, where the genotype frequency is the proportion of samples in the dataset that carry the genotype. Second, the feature must be associated with the phenotype: a χ² test is used to estimate the association, and the p-value must be smaller than 0.01.

Parameters:
  • np_genotype (ndarray) – 2D array containing genotype data with int8 type
  • np_phenotype (ndarray) – 2D array containing phenotype data with float type
Returns:

np_genotype

2D array containing genotype data with int8 type

Return type:

(ndarray)
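
The two criteria can be sketched for a single binary feature as follows. The sketch assumes the feature and the phenotype are both 0/1 integer vectors, so the test has one degree of freedom, and it applies no continuity correction; whether GenEpi does either is not stated here. The function names are hypothetical.

```python
import math


def chi_square_p_2x2(feature, phenotype):
    """P-value of the chi-square test (1 d.o.f.) for two 0/1 vectors."""
    n = len(feature)
    table = [[0, 0], [0, 0]]
    for f, p in zip(feature, phenotype):
        table[f][p] += 1
    rows = [sum(r) for r in table]
    cols = [table[0][c] + table[1][c] for c in range(2)]
    if 0 in rows or 0 in cols:
        return 1.0  # a constant vector cannot show an association
    chi2 = 0.0
    for r in range(2):
        for c in range(2):
            expected = rows[r] * cols[c] / n
            chi2 += (table[r][c] - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 d.o.f.
    return math.erfc(math.sqrt(chi2 / 2))


def keep_feature(feature, phenotype, min_freq=0.05, max_p=0.01):
    """Apply both criteria: genotype frequency > 5% and p-value < 0.01."""
    freq = sum(feature) / len(feature)
    return freq > min_freq and chi_square_p_2x2(feature, phenotype) < max_p


phenotype = [1] * 20 + [0] * 20
print(keep_feature([1] * 20 + [0] * 20, phenotype))  # frequent and associated
print(keep_feature([1] + [0] * 39, phenotype))       # too rare
print(keep_feature([1, 0] * 20, phenotype))          # frequent, not associated
```
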

genepi.step4_singleGeneEpistasis_Lasso.LassoRegressionCV(np_X, np_y, int_kOfKFold=2, int_nJobs=1)[source]

Implementation of the L1-regularized Lasso regression with k-fold cross validation.

Parameters:
  • np_X (ndarray) – 2D array containing genotype data with int8 type
  • np_y (ndarray) – 2D array containing phenotype data with float type
  • int_kOfKFold (int) – The k for k-fold cross validation (default: 2)
  • int_nJobs (int) – The number of threads (default: 1)
Returns:

estimator.scores

1D array containing the score of each genetic feature with float type

Return type:

(ndarray)
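
To make the role of int_kOfKFold concrete, the sketch below shows how k-fold cross validation partitions sample indices. It uses contiguous, unshuffled folds for clarity; the splitting strategy actually used by this function is not documented here, and the name k_fold_indices is hypothetical.

```python
def k_fold_indices(n_samples, k=2):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


for train, test in k_fold_indices(5, k=2):
    print("train", train, "test", test)
```

With the default k of 2, each model is fitted on roughly half of the samples and evaluated on the other half.
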

genepi.step4_singleGeneEpistasis_Lasso.RandomizedLassoRegression(np_X, np_y)[source]

Implementation of the stability selection.

Parameters:
  • np_X (ndarray) – 2D array containing genotype data with int8 type
  • np_y (ndarray) – 2D array containing phenotype data with float type
Returns:

estimator.scores

1D array containing the score of each genetic feature with float type

Return type:

(ndarray)
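
Stability selection scores a feature by how often it is selected when a sparse model is refitted on random subsamples. The sketch below shows only that resampling loop. To stay dependency-free it takes the selector as a callback and demonstrates it with a toy rule instead of an L1-regularized model, so it is an illustration of the idea and not of this function's internals; all names are hypothetical.

```python
import random


def stability_scores(X, y, select, n_rounds=100, fraction=0.5, seed=0):
    """Fraction of random subsamples in which each feature is selected.

    select(X_sub, y_sub) must return the set of selected column indices.
    """
    rng = random.Random(seed)
    n, n_features = len(X), len(X[0])
    counts = [0] * n_features
    size = max(2, int(n * fraction))
    for _ in range(n_rounds):
        rows = rng.sample(range(n), size)  # subsample without replacement
        chosen = select([X[i] for i in rows], [y[i] for i in rows])
        for j in chosen:
            counts[j] += 1
    return [c / n_rounds for c in counts]


def select_matching(X_sub, y_sub):
    """Toy stand-in for a sparse model: keep columns identical to y."""
    return {j for j in range(len(X_sub[0]))
            if all(row[j] == target for row, target in zip(X_sub, y_sub))}


y = [0, 1] * 10
X = [[t, 1 - t] for t in y]  # column 0 equals y, column 1 is its complement
print(stability_scores(X, y, select_matching))
```

Features with a score near 1 are selected regardless of which samples are drawn, which is the property the returned scores are meant to capture.
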

genepi.step4_singleGeneEpistasis_Lasso.SingleGeneEpistasisLasso(str_inputFileName_genotype, str_inputFileName_phenotype, str_outputFilePath='', int_kOfKFold=2, int_nJobs=1)[source]

A workflow for modeling a single gene, comprising two-element combinatorial encoding, stability selection, low-quality variant filtering, and L1-regularized Lasso regression with k-fold cross validation.

Parameters:
  • str_inputFileName_genotype (str) – File name of input genotype data
  • str_inputFileName_phenotype (str) – File name of input phenotype data
  • str_outputFilePath (str) – File path of output file
  • int_kOfKFold (int) – The k for k-fold cross validation (default: 2)
  • int_nJobs (int) – The number of threads (default: 1)
Returns:

float_AVG_S_P

The average of the Pearson’s and Spearman’s correlation of the model

Return type:

(float)
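
The returned metric can be reproduced on any pair of observed and predicted values. The following is a dependency-free sketch of the average of the two correlations; it assumes Spearman's coefficient is computed as Pearson's correlation of average ranks, which is the standard definition, and the helper names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def average_pearson_spearman(y_true, y_pred):
    """Mean of Pearson's and Spearman's correlation."""
    spearman = pearson(ranks(y_true), ranks(y_pred))
    return (pearson(y_true, y_pred) + spearman) / 2


print(average_pearson_spearman([1, 2, 3, 4], [1, 4, 9, 16]))
```

In this example the relationship is monotonic but not linear, so Spearman's coefficient is 1 while Pearson's is slightly lower.
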

step4_singleGeneEpistasis_Logistic

Created on Feb 2018

@author: Chester (Yu-Chuan Chang)

genepi.step4_singleGeneEpistasis_Logistic.BatchSingleGeneEpistasisLogistic(str_inputFilePath_genotype, str_inputFileName_phenotype, str_outputFilePath='', int_kOfKFold=2, int_nJobs=1)[source]

Batch running for the single gene workflow.

Parameters:
  • str_inputFilePath_genotype (str) – File path of input genotype data
  • str_inputFileName_phenotype (str) – File name of input phenotype data
  • str_outputFilePath (str) – File path of output file
  • int_kOfKFold (int) – The k for k-fold cross validation (default: 2)
  • int_nJobs (int) – The number of threads (default: 1)
Returns:

  • Expected Success Response:

    "step4: Detect single gene epistasis. DONE!"
    

genepi.step4_singleGeneEpistasis_Logistic.FeatureEncoderLogistic(np_genotype_rsid, np_genotype, np_phenotype, int_dim)[source]

Implementation of the two-element combinatorial encoding.

Parameters:
  • np_genotype_rsid (ndarray) – 1D array containing rsid of genotype data with str type
  • np_genotype (ndarray) – 2D array containing genotype data with int8 type
  • np_phenotype (ndarray) – 2D array containing phenotype data with float type
  • int_dim (int) – The dimension of a variant (default: 3, i.e. AA, AB and BB)
Returns:

tuple containing:

  • list_interaction_rsid (ndarray): 1D array containing rsid of genotype data with str type
  • np_interaction (ndarray): 2D array containing genotype data with int8 type

Return type:

(tuple)

genepi.step4_singleGeneEpistasis_Logistic.FilterInLoading(np_genotype, np_phenotype)[source]

This function filters out low-quality variants. Before modeling each subset of genotype features, two criteria are applied to exclude low-quality data. First, the genotype frequency of a feature must exceed 5%, where the genotype frequency is the proportion of samples in the dataset that carry the genotype. Second, the feature must be associated with the phenotype: a χ² test is used to estimate the association, and the p-value must be smaller than 0.01.

Parameters:
  • np_genotype (ndarray) – 2D array containing genotype data with int8 type
  • np_phenotype (ndarray) – 2D array containing phenotype data with float type
Returns:

np_genotype

2D array containing genotype data with int8 type

Return type:

(ndarray)

genepi.step4_singleGeneEpistasis_Logistic.GenerateContingencyTable(np_genotype, np_phenotype)[source]

Generating the contingency table for chi-square test.

Parameters:
  • np_genotype (ndarray) – 2D array containing genotype data with int8 type
  • np_phenotype (ndarray) – 2D array containing phenotype data with float type
Returns:

np_contingency

2D array containing the contingency table with int type

Return type:

(ndarray)

genepi.step4_singleGeneEpistasis_Logistic.LogisticRegressionL1CV(np_X, np_y, int_kOfKFold=2, int_nJobs=1)[source]

Implementation of the L1-regularized Logistic regression with k-fold cross validation.

Parameters:
  • np_X (ndarray) – 2D array containing genotype data with int8 type
  • np_y (ndarray) – 2D array containing phenotype data with float type
  • int_kOfKFold (int) – The k for k-fold cross validation (default: 2)
  • int_nJobs (int) – The number of threads (default: 1)
Returns:

estimator.scores

1D array containing the score of each genetic feature with float type

Return type:

(ndarray)

genepi.step4_singleGeneEpistasis_Logistic.RandomizedLogisticRegression(np_X, np_y)[source]

Implementation of the stability selection.

Parameters:
  • np_X (ndarray) – 2D array containing genotype data with int8 type
  • np_y (ndarray) – 2D array containing phenotype data with float type
Returns:

estimator.scores

1D array containing the score of each genetic feature with float type

Return type:

(ndarray)

genepi.step4_singleGeneEpistasis_Logistic.SingleGeneEpistasisLogistic(str_inputFileName_genotype, str_inputFileName_phenotype, str_outputFilePath='', int_kOfKFold=2, int_nJobs=1)[source]

A workflow for modeling a single gene, comprising two-element combinatorial encoding, stability selection, low-quality variant filtering, and L1-regularized Logistic regression with k-fold cross validation.

Parameters:
  • str_inputFileName_genotype (str) – File name of input genotype data
  • str_inputFileName_phenotype (str) – File name of input phenotype data
  • str_outputFilePath (str) – File path of output file
  • int_kOfKFold (int) – The k for k-fold cross validation (default: 2)
  • int_nJobs (int) – The number of threads (default: 1)
Returns:

float_f1Score

The F1 score of the model

Return type:

(float)
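
For readers unfamiliar with the metric, the F1 score is the harmonic mean of precision and recall. The sketch below computes it for one positive class; whether GenEpi averages over classes or folds is not stated here, so treat it as an illustration of the metric only.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 score of the positive class; 0.0 when nothing is correctly found."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# 2 true positives, 1 false positive, 1 false negative
print(f1_score([1, 1, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0]))
```
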

step5_crossGeneEpistasis_Lasso

Created on Feb 2018

@author: Chester (Yu-Chuan Chang)

genepi.step5_crossGeneEpistasis_Lasso.CrossGeneEpistasisLasso(str_inputFilePath_feature, str_inputFileName_phenotype, str_inputFileName_score='', str_outputFilePath='', int_kOfKFold=2, int_nJobs=1)[source]

A workflow for modeling cross-gene epistasis, comprising two-element combinatorial encoding, stability selection, low-quality variant filtering, and L1-regularized Lasso regression with k-fold cross validation.

Parameters:
  • str_inputFilePath_feature (str) – File path of input feature files from stage 1 - singleGeneEpistasis
  • str_inputFileName_phenotype (str) – File name of input phenotype data
  • str_inputFileName_score (str) – File name of input score file from stage 1 - singleGeneEpistasis
  • str_outputFilePath (str) – File path of output file
  • int_kOfKFold (int) – The k for k-fold cross validation (default: 2)
  • int_nJobs (int) – The number of threads (default: 1)
Returns:

tuple containing:

  • float_AVG_S_P_train (float): The average of the Pearson’s and Spearman’s correlation of the model for training set
  • float_AVG_S_P_test (float): The average of the Pearson’s and Spearman’s correlation of the model for testing set
  • Expected Success Response:

    "step5: Detect cross gene epistasis. DONE!"
    

Return type:

(tuple)

genepi.step5_crossGeneEpistasis_Lasso.LassoRegression(np_X, np_y, int_nJobs=1)[source]

Implementation of the L1-regularized Lasso regression with k-fold cross validation.

Parameters:
  • np_X (ndarray) – 2D array containing genotype data with int8 type
  • np_y (ndarray) – 2D array containing phenotype data with float type
  • int_nJobs (int) – The number of threads (default: 1)
Returns:

float_AVG_S_P

The average of the Pearson’s and Spearman’s correlation of the model

Return type:

(float)

genepi.step5_crossGeneEpistasis_Lasso.RegressorModelPersistence(np_X, np_y, str_outputFilePath='', int_nJobs=1)[source]

Dumping regressor for model persistence

Parameters:
  • np_X (ndarray) – 2D array containing genotype data with int8 type
  • np_y (ndarray) – 2D array containing phenotype data with float type
  • str_outputFilePath (str) – File path of output file
  • int_nJobs (int) – The number of threads (default: 1)
Returns:

None
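
Model persistence means writing the fitted model to disk so that it can be reloaded for prediction later. The sketch below shows the round trip with the standard-library pickle module and a plain dictionary standing in for a fitted regressor. The serializer, the file name and the helper names are assumptions; the format GenEpi writes is not described in this reference.

```python
import os
import pickle
import tempfile


def dump_model(model, output_path, file_name="model.pkl"):
    """Serialize a fitted model into output_path and return the file path."""
    os.makedirs(output_path, exist_ok=True)
    path = os.path.join(output_path, file_name)
    with open(path, "wb") as handle:
        pickle.dump(model, handle)
    return path


def load_model(path):
    """Read a model back from disk."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# A dictionary of coefficients stands in for a fitted estimator.
model = {"intercept": 0.5, "coef": [0.2, 0.0, -1.3]}
with tempfile.TemporaryDirectory() as tmp_dir:
    saved_path = dump_model(model, os.path.join(tmp_dir, "crossGeneResult"))
    restored = load_model(saved_path)
print(restored == model)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.
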

step5_crossGeneEpistasis_Logistic

Created on Feb 2018

@author: Chester (Yu-Chuan Chang)

genepi.step5_crossGeneEpistasis_Logistic.ClassifierModelPersistence(np_X, np_y, str_outputFilePath='', int_nJobs=1)[source]

Dumping classifier for model persistence

Parameters:
  • np_X (ndarray) – 2D array containing genotype data with int8 type
  • np_y (ndarray) – 2D array containing phenotype data with float type
  • str_outputFilePath (str) – File path of output file
  • int_nJobs (int) – The number of threads (default: 1)
Returns:

None

genepi.step5_crossGeneEpistasis_Logistic.CrossGeneEpistasisLogistic(str_inputFilePath_feature, str_inputFileName_phenotype, str_inputFileName_score='', str_outputFilePath='', int_kOfKFold=2, int_nJobs=1)[source]

A workflow for modeling cross-gene epistasis, comprising two-element combinatorial encoding, stability selection, low-quality variant filtering, and L1-regularized Logistic regression with k-fold cross validation.

Parameters:
  • str_inputFilePath_feature (str) – File path of input feature files from stage 1 - singleGeneEpistasis
  • str_inputFileName_phenotype (str) – File name of input phenotype data
  • str_inputFileName_score (str) – File name of input score file from stage 1 - singleGeneEpistasis
  • str_outputFilePath (str) – File path of output file
  • int_kOfKFold (int) – The k for k-fold cross validation (default: 2)
  • int_nJobs (int) – The number of threads (default: 1)
Returns:

tuple containing:

  • float_f1Score_train (float): The F1 score of the model for training set
  • float_f1Score_test (float): The F1 score of the model for testing set
  • Expected Success Response:

    "step5: Detect cross gene epistasis. DONE!"
    

Return type:

(tuple)

genepi.step5_crossGeneEpistasis_Logistic.GenerateContingencyTable(np_genotype, np_phenotype)[source]

Generating the contingency table for chi-square test.

Parameters:
  • np_genotype (ndarray) – 2D array containing genotype data with int8 type
  • np_phenotype (ndarray) – 2D array containing phenotype data with float type
Returns:

np_contingency

2D array containing the contingency table with int type

Return type:

(ndarray)

genepi.step5_crossGeneEpistasis_Logistic.LogisticRegressionL1(np_X, np_y, int_nJobs=1)[source]

Implementation of the L1-regularized Logistic regression with k-fold cross validation.

Parameters:
  • np_X (ndarray) – 2D array containing genotype data with int8 type
  • np_y (ndarray) – 2D array containing phenotype data with float type
  • int_nJobs (int) – The number of threads (default: 1)
Returns:

float_f1Score

The F1 score of the model

Return type:

(float)

genepi.step5_crossGeneEpistasis_Logistic.PlotPolygenicScore(list_target, list_predict, list_proba, str_outputFilePath='', str_label='')[source]

Plot figure for polygenic score, including group distribution and prevalence to PGS

Parameters:
  • list_target (list) – A list containing the target of each samples
  • list_predict (list) – A list containing the prediction value of each sample
  • list_proba (list) – A list containing the prediction probability of each sample
  • str_outputFilePath (str) – File path of output file
  • str_label (str) – The label of the output plots
Returns:

None
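
The "prevalence to PGS" panel can be understood as the fraction of cases within each score bin. The sketch below computes those numbers without plotting. It assumes the polygenic score is the predicted probability in [0, 1] and that cases are labeled 1; the binning scheme and the name prevalence_by_score_bin are hypothetical.

```python
def prevalence_by_score_bin(targets, probabilities, n_bins=5):
    """Return one (bin_start, bin_end, prevalence) triple per non-empty bin."""
    cases = [0] * n_bins
    totals = [0] * n_bins
    for target, proba in zip(targets, probabilities):
        # A probability of exactly 1.0 is placed in the last bin.
        index = min(int(proba * n_bins), n_bins - 1)
        totals[index] += 1
        cases[index] += 1 if target == 1 else 0
    return [(i / n_bins, (i + 1) / n_bins, cases[i] / totals[i])
            for i in range(n_bins) if totals[i]]


targets = [0, 0, 1, 1, 1, 0]
probabilities = [0.10, 0.15, 0.55, 0.90, 0.95, 0.85]
print(prevalence_by_score_bin(targets, probabilities, n_bins=2))
# → [(0.0, 0.5, 0.0), (0.5, 1.0, 0.75)]
```

A well-calibrated model should show prevalence rising with the score bin.
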

genepi.step5_crossGeneEpistasis_Logistic.fsigmoid(x, a, b)[source]
genepi.step5_crossGeneEpistasis_Logistic.gaussian(x, mean, amplitude, standard_deviation)[source]

step6_ensembleWithCovariates

Created on Feb 2018

@author: Chester (Yu-Chuan Chang)

genepi.step6_ensembleWithCovariates.ClassifierModelPersistence(np_X, np_y, str_outputFilePath='', int_nJobs=1)[source]

Dumping ensemble classifier for model persistence

Parameters:
  • np_X (ndarray) – 2D array containing genotype data with int8 type
  • np_y (ndarray) – 2D array containing phenotype data with float type
  • str_outputFilePath (str) – File path of output file
  • int_nJobs (int) – The number of threads (default: 1)
Returns:

None

genepi.step6_ensembleWithCovariates.EnsembleWithCovariatesClassifier(str_inputFileName_feature, str_inputFileName_phenotype, str_outputFilePath='', int_kOfKFold=2, int_nJobs=1)[source]

A workflow to ensemble genetic features with covariates for L1-regularized Logistic regression.

Parameters:
  • str_inputFileName_feature (str) – File name of input feature file from stage 2 - crossGeneEpistasis
  • str_inputFileName_phenotype (str) – File name of input phenotype data
  • str_outputFilePath (str) – File path of output file
  • int_kOfKFold (int) – The k for k-fold cross validation (default: 2)
  • int_nJobs (int) – The number of threads (default: 1)
Returns:

tuple containing:

  • float_f1Score_train (float): The F1 score of the model for training set
  • float_f1Score_test (float): The F1 score of the model for testing set
  • Expected Success Response:

    "step6: Ensemble with covariates. DONE!"
    

Return type:

(tuple)

genepi.step6_ensembleWithCovariates.EnsembleWithCovariatesRegressor(str_inputFileName_feature, str_inputFileName_phenotype, str_outputFilePath='', int_kOfKFold=2, int_nJobs=1)[source]

A workflow to ensemble genetic features with covariates for L1-regularized Lasso regression.

Parameters:
  • str_inputFileName_feature (str) – File name of input feature file from stage 2 - crossGeneEpistasis
  • str_inputFileName_phenotype (str) – File name of input phenotype data
  • str_outputFilePath (str) – File path of output file
  • int_kOfKFold (int) – The k for k-fold cross validation (default: 2)
  • int_nJobs (int) – The number of threads (default: 1)
Returns:

tuple containing:

  • float_AVG_S_P_train (float): The average of the Pearson’s and Spearman’s correlation of the model for training set
  • float_AVG_S_P_test (float): The average of the Pearson’s and Spearman’s correlation of the model for testing set
  • Expected Success Response:

    "step6: Ensemble with covariates. DONE!"
    

Return type:

(tuple)

genepi.step6_ensembleWithCovariates.LoadDataForEnsemble(str_inputFileName_feature, str_inputFileName_phenotype)[source]

Loading genetic features for ensembling with covariates

Parameters:
  • str_inputFileName_feature (str) – File name of input feature file from stage 2 - crossGeneEpistasis
  • str_inputFileName_phenotype (str) – File name of input phenotype data
Returns:

tuple containing:

  • np_genotype (ndarray): 2D array containing genotype data with int8 type
  • np_phenotype (ndarray): 2D array containing phenotype data with float type

Return type:

(tuple)

genepi.step6_ensembleWithCovariates.RegressorModelPersistence(np_X, np_y, str_outputFilePath='', int_nJobs=1)[source]

Dumping ensemble regressor for model persistence

Parameters:
  • np_X (ndarray) – 2D array containing genotype data with int8 type
  • np_y (ndarray) – 2D array containing phenotype data with float type
  • str_outputFilePath (str) – File path of output file
  • int_nJobs (int) – The number of threads (default: 1)
Returns:

None