Enrichment Predictions¶
This tutorial assumes the PIDGINv3 repository is located at $PV3
and is concerned with
the script predict_enriched.py
This script calculates target prediction enrichment (using Fishers’ t-test) between two input SMILES/SDF files as in [1]. Target predictions are extended with NCBI Biosystems pathways and DisGeNET diseases. Pathway or disease-gene association enrichment (using chi-square test) enrichment is calculated for the two input SMILES/SDF files.
The approach is used to annotate which targets/pathways/diseases are statistically associated between two compound sets given their input SMILES/SDF files. This analysis is important since a (predicted) target is not necessarily responsible for eliciting an observed mechanism-of-action. Some target prediction models also behave promiscuously, due to biases in training data (chemical space) and the nature of the target.
The analysis must use a cut-off for the probability of activity from the random forest for each target. Predictions are generated for the models using the reliability-density neighbourhood Applicability Domain (AD) analysis by Aniceto from: doi.org/10.1186/s13321-016-0182-y
biosystems.txt
contains pathway data from the NCBI biosystems used to annotate target
predictions. Pathway results can be filtered by source (e.g. KEGG/Reactome/GO) afterward.
DisGeNET_diseases.txt
contains disease data used to annotate target predictions.
DisGeNET gene-disease score takes into account the number and type of sources (level of
curation, organisms), and the number of publications supporting the association. The score
ranges from 0 to 1 in accordance to increasing confidence in annotations, resepctively. A
DisGeNET_threshold can be supplied at runtime when annotating predictions with diseases
(0.06 threshold applied by default, which includes associations from curated
sources/animal models supporting the association or reported in 20-200 papers). More info
on the score here: http://disgenet.org/web/DisGeNET/menu/dbinfo#score
List of available arguments¶
To see all available options, run
$ python $PV3/predict_enriched.py -h
Usage: predict_enriched.py [options]
Options:
-h, --help show this help message and exit
--f1=FILE Firest input smiles or sdf file (required)
--f2=FILE Second input smiles or sdf file (required)
-d DELIM, --smiles_delim=DELIM
Input file (smiles) delimiter char (default: white
space ' ')
--smiles_column=SMICOL
Input file (smiles) delimiter column (default: 0)
--smiles_id_column=IDCOL
Input file (smiles) ID column (default: 1)
-o FILE Optional output prediction file name
-n NCORES, --ncores=NCORES
No. cores (default: 1)
-b BIOACTIVITY, --bioactivity=BIOACTIVITY
Bioactivity Um threshold (required). Use either
100/10/1/0.1 (default:10)
-p PROBA, --proba=PROBA
RF probability threshold (default: None)
--ad=AD Applicability Domain (AD) filter using percentile of
weights [float]. Default: 90 (integer for percentile)
--known_flag Set known activities (annotate duplicates betweem
input to train with correct label)
--orthologues Set to use orthologue bioactivity data in model
generation
--organism=ORGANISM Organism filter (multiple can be specified using
commas ',')
--target_class=TARGETCLASS
Target classification filter
--min_size=MINSIZE Minimum number of actives used in model generation
(default: 10)
--performance_filter=P_FILT
Comma-seperated performance filtering using following
nomenclature: validation_set[tsscv,l50so,l50po],metric
[bedroc,roc,prauc,brier],performance_threshold[float].
E.g 'tsscv,bedroc,0.5'
--se_filter Optional setting to restrict to models which do not
require Sphere Exclusion (SE)
--training_log Optional setting to add training_details to the
prediction file (large increase in output file size)
--ntrees=NTREES Specify the minimum number of trees for warm-start
random forest models (N.B Potential large
latency/memory cost)
--preprocess_off Turn off preprocessing using the flatkinson (eTox)
standardizer (github.com/flatkinson/standardiser),
size filter (100 >= Mw >= 1000 and organic mol check
(C count >= 1)
--dgn=DGN_THRESHOLD DisGeNET score threshold (default: 0.06)
Generating enrichment predictions¶
In this example, we will work with a two SMILES input files, comprising cytotoxic
compounds in the file named cytotox_library.smi
and (putative) non-toxic compounds in
the file named nontoxic_background.smi
. Both are located in the examples directory.
The corresponding top 5 SMILES strings are:
C1Cc2cc3OCOc3c4c5ccccc5C[C@@H](N1)c24
[Cl-].COc1ccc2cc3c4cc5OCOc5cc4CC[n+]3cc2c1OC
COc1cc2c3CN4CCC[C@H]4[C@@H](O)c3c5ccc(O)cc5c2cc1OC
COc1cc2c3CN4CCC[C@H]4[C@@H](O)c3c5ccc(O)c(OC)c5c2cc1OC
C[C@H](CCC(=O)[C@@H](C)[C@H]1[C@H](C[C@H]2[C@@H]3CC[C@H]4C[C@H](CC[C@]4(C)[C@H]3CC(=O)[C@]12C)OC(=O)C)OC(=O)C)CO[C@@H]5O[C@H](CO)[C@@H](O)[C@H](O)[C@H]5O
and
CC12CC3CC(C)(C1)CC(C3)(C2)NCc1ccccc1OCc1ccccc1F
CCCn1ccnc1C(=O)c1cc2ccccc2s1
Cc1nn(C)c(C)c1S(=O)(=O)NC1CCCCC1N
CC(C)CN(CC(C)C)C(=O)c1ccc(nc1C)C(F)(F)F
CC1CCc2c(C1)sc1nc(CN3CCOCC3)nc(Oc3ccccc3F)c21
The following code will generate cow target prediction enrichment at 1μM (with lenient AD filters of 30 percentiles and probability of activity cut-off of 0.45) along with enriched pathways and diseases (0.06 score threshold) for the cytotoxic compounds, when compared to the non-toxic compounds.
$ python $PV3/predict_enriched.py --f1 cytotox_library.smi --f2 nontoxic_background.smi --organism "Bos taurus" -b 1 -p 0.45 --ad 30 -n 4
Three files are output for the target, pathway and disease enrichment calculations, with the naming convention:
[f1]_vs[f2]_out_[disease/pathway]_predictions_enriched[timestamp].txt
The rows in each file correspond to the ranked enriched list of targets/pathways/diseases
that are more statistically associated with the first SMILES/SDF file (--f1
) of
(e.g. cytotoxic) compounds. A higher Odd’s Ratio (column Odds_Ratio
) or Risk Ratio
(Risk_Ratio
) indicates a larger degree of enrichment for a given
target/pathway/disease compared to the second input --f2
(nontoxic) compound set.
The output has columns for the number of compound predictions (column
[f1/f2]_[In]Actives_[probability_activity]
) and the associated percentage
Percent_[f1/f2]_[In]Actives_[probability_activity]
) of compounds with that prediction.
The Fishers or Chi-squared p-values are provided ([Fishers_Test/Chisquared]_P_Value
)
including the Benjamini & Hochberg corrected values in the column named
[Fishers_Test/Chisquared]_P_Value_Corrected
. The output should be filtered for a
given preference.
The percentage NaN predictions (compounds outside the Applicability Domain (AD) filter
that were not given an active/inactive target prediction) are also provided in the column
entitled [f1/f2]_Proportion_Nan_Predictions_[ad]
.
Note
Please note that the Odd’s and Risk ratios are implemented in a different way to the previous version of PIDGIN. For this version, larger numbers indicate larger enrichments.
In this example, there are six targets with a corrected p-value less than 0.05 with a Odds or Risk ratio greater than 1.0. All targets have known links to cytotoxicity, for example three are related to Tublin with known mechanisms to cytotoxicity (via cytoskeletal machinery).
More complicated example¶
Target/pathway/disease enrichment analysis can be combined with all model filters outlined in the previous section “Getting started”.
For example, the following code:
$ python $PV3/predict_enriched.py --f1 cytotox_library.smi --f2 nontoxic_background.smi --organism Drosophila -b 100 --known_flag --ad 0 -n 4 -p 0.8 --min_size 50 --se_filter --performance_filter l50po,bedroc,0.8
would filter for Drosophila models that did not require Sphere Exlusion (SE) (i.e. sufficient number of inactives available) and a minimum number of 50 actives in the training set, with a minimum BEDROC performance of 0.8 for leave out 50% of ChEMBL publications from training data over 4-fold cross validation (L50PO), to produce enrichment predictions at a 0.8 probability cut-off at a threshold of 100μM, with the Applicability Domain (AD) filter silenced and where known activities (in ChEMBL or PubChem) are set.