Getting started
This tutorial assumes the PIDGINv3 repository is located at $PV3.
Generating predictions for human targets
In this example, we will work with the input file named test.smi in the examples directory, which contains two molecules whose SMILES strings are defined as:
COc1cc2c3CN4CCC[C@H]4[C@@H](O)c3c5ccc(O)cc5c2cc1OC CompoundID1
COc1cc2c3CN4CCC[C@H]4[C@@H](O)c3c5ccc(O)c(OC)c5c2cc1OC CompoundID2
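For readers who want to inspect such a file before running predictions, the layout above (a SMILES string, whitespace, then a compound identifier) can be read with a few lines of Python. This is a minimal sketch, not PIDGINv3's own parser: `parse_smi_lines` is a hypothetical helper, and the fallback naming for lines without an identifier is an assumption made for illustration.

```python
# Illustrative reader for .smi-style lines: "<SMILES> <compound ID>".
# parse_smi_lines is a hypothetical helper, not part of PIDGINv3.

def parse_smi_lines(lines):
    """Return a list of (smiles, compound_id) tuples."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split(None, 1)  # split on the first run of whitespace
        smiles = parts[0]
        # Assumption: give unnamed compounds a positional identifier.
        if len(parts) > 1:
            compound_id = parts[1]
        else:
            compound_id = "compound_%d" % (len(records) + 1)
        records.append((smiles, compound_id))
    return records


example = [
    "COc1cc2c3CN4CCC[C@H]4[C@@H](O)c3c5ccc(O)cc5c2cc1OC CompoundID1",
    "COc1cc2c3CN4CCC[C@H]4[C@@H](O)c3c5ccc(O)c(OC)c5c2cc1OC CompoundID2",
]

for smiles, compound_id in parse_smi_lines(example):
    print(compound_id, smiles)
```

To read the actual file, the same function can be passed an open file handle, for example `parse_smi_lines(open("test.smi"))`.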
The following code will generate the RF probabilities at 1μM for all human targets for the input file:
$ python $PV3/predict.py -f test.smi --organism "Homo sapiens" -b 1
This script writes the output of each Random Forest classifier, for every target and every compound, into a probability matrix in which the columns are compounds and the rows are targets.
If using --organism, the value must match an organism name as specified in uniprot_information.txt and, if it contains spaces, must be enclosed in quotes ("") as in the above example. The organism filter uses fuzzy matching, so --organism homo would achieve a similar filtered list.
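The effect of a fuzzy organism filter can be pictured with a small sketch. The matching rule used here (a case-insensitive substring test) is an assumption chosen for illustration; the rule PIDGINv3 actually applies is not described in this tutorial. `filter_by_organism` and the organism list are hypothetical.

```python
# Illustrative fuzzy filter: case-insensitive substring matching.
# Assumption: this is NOT necessarily the rule PIDGINv3 uses internally.

def filter_by_organism(organisms, query):
    """Return the organism names that contain the query, ignoring case."""
    query = query.strip().lower()
    return [name for name in organisms if query in name.lower()]


# Hypothetical organism names, as they might appear in an annotation file.
organisms = ["Homo sapiens", "Rattus norvegicus", "Mus musculus"]

print(filter_by_organism(organisms, "Homo sapiens"))
print(filter_by_organism(organisms, "homo"))
print(filter_by_organism(organisms, "Rattus"))
```

Under this rule a short query such as homo selects every organism whose name contains that text, which is why a partial query gives a similar, but not necessarily identical, list to the full name.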
Generating binary predictions
The following code will generate binary predictions at 0.1 and 1μM for all human targets, at a threshold of 0.5 (i.e. the compound was predicted active more often than inactive):
$ python $PV3/predict.py -f test.smi --organism "Homo sapiens" -b 0.1,1 -p 0.5
The threshold can be increased to increase the confidence in the prediction.
Note
These probabilities are different from PIDGIN version 2 in that they have not been Platt-scaled, since this increased the number of false positives.
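The relationship between the probability matrix and the binary matrix can be sketched as follows. The target names and probability values are invented for illustration, `binarize` is a hypothetical helper, and treating a probability exactly equal to the threshold as active is an assumption rather than documented PIDGINv3 behaviour.

```python
# Illustrative thresholding of a probability matrix (rows: targets, columns: compounds).
# All names and values below are made up; this is not PIDGINv3's implementation.

def binarize(prob_matrix, threshold=0.5):
    """Map each probability to 1 (active) or 0 (inactive)."""
    return {
        target: [1 if p >= threshold else 0 for p in probs]  # assumption: inclusive cut-off
        for target, probs in prob_matrix.items()
    }


# Hypothetical RF probabilities for two targets and two compounds.
probabilities = {
    "TARGET_A": [0.82, 0.31],
    "TARGET_B": [0.50, 0.64],
}

print(binarize(probabilities, threshold=0.5))
print(binarize(probabilities, threshold=0.7))  # stricter cut-off, fewer actives
```

Raising the threshold from 0.5 to 0.7 in this toy matrix removes the borderline calls for TARGET_B, which mirrors the trade-off between confidence and the number of positive predictions.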
Decreasing applicability domain (AD) filter
To reduce the stringency of the AD filter, the --ad parameter (default: 90) can be reduced, as in the following snippet:
$ python $PV3/predict.py -f test.smi --organism "Homo sapiens" -b 1 -p 0.5 --ad 60
In this case, the threshold for the applicability domain weights calculated across the targets has been reduced from 90% to 60%, and thus compounds that are further from the AD are now accepted.
Outputting the AD results
The following snippet calculates the weights for each of the input compounds and outputs their corresponding percentile value, so that a user can view the matrix of percentiles for each compound and accept/reject predictions at a percentile threshold without the need to re-run predictions:
$ python $PV3/predict.py -f test.smi --organism "Homo sapiens" -b 1 --percentile
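Once a percentile matrix is available, accepting or rejecting predictions at a chosen threshold is a simple comparison. The sketch below assumes that a prediction is accepted when its percentile is at or above the threshold, which is consistent with lower --ad values accepting more compounds; the helper name, the use of `None` for rejected predictions, and all values are hypothetical.

```python
# Illustrative post-hoc AD filter on a percentile matrix.
# Assumption: a prediction is kept when percentile >= threshold; rejected cells become None.

def apply_ad_filter(predictions, percentiles, ad_threshold=90):
    """Return predictions with out-of-domain entries replaced by None."""
    filtered = {}
    for target, values in predictions.items():
        filtered[target] = [
            value if pct >= ad_threshold else None
            for value, pct in zip(values, percentiles[target])
        ]
    return filtered


# Hypothetical binary predictions and AD percentiles (rows: targets, columns: compounds).
predictions = {"TARGET_A": [1, 0], "TARGET_B": [1, 1]}
percentiles = {"TARGET_A": [95.0, 72.0], "TARGET_B": [40.0, 88.0]}

for threshold in (90, 60, 0):
    print(threshold, apply_ad_filter(predictions, percentiles, threshold))
```

With these toy values, a threshold of 90 keeps a single prediction, 60 keeps three, and 0 keeps all four, so the same percentile matrix can be re-filtered at several stringencies without re-running the models.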
Silencing the AD filter
Setting --ad to 0 accepts all predictions, so the following snippet turns off the AD filter:
$ python $PV3/predict.py -f test.smi --organism "Homo sapiens" -b 1 --ad 0
Combining model filters
If the user is interested in a given target class (for example “Lipase”) then the following can be used:
$ python $PV3/predict.py -f test.smi --organism "Homo sapiens" --target_class Lipase
Filters can be combined, for instance:
$ python $PV3/predict.py -f test.smi --organism "Homo sapiens" --target_class GPCR --min_size 25 --performance_filter tsscv,prauc,0.7
would filter human models for GPCRs with a minimum number of 25 actives in the training set and with a minimum precision-recall AUC (PR-AUC) performance of 0.7 during time-series split cross validation (TSSCV).
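The way a combined filter narrows the set of models can be illustrated in plain Python. This is a sketch under stated assumptions: the three-part value of --performance_filter is read as validation scheme, metric and minimum score (as the description above suggests), while the model records, their field names, and both helper functions are hypothetical and do not reflect PIDGINv3's internal data structures.

```python
# Illustrative combination of a minimum-size filter and a performance filter.
# The record layout and helper names are hypothetical, not PIDGINv3 internals.

def parse_performance_filter(spec):
    """Split 'validation,metric,minimum' into (str, str, float)."""
    validation, metric, minimum = spec.split(",")
    return validation.strip().lower(), metric.strip().lower(), float(minimum)


def select_models(models, min_size, performance_filter):
    """Return names of models passing both the size and the performance criteria."""
    validation, metric, minimum = parse_performance_filter(performance_filter)
    selected = []
    for model in models:
        score = model["scores"].get((validation, metric))
        # A model with no recorded score for this scheme/metric is rejected.
        if model["n_actives"] >= min_size and score is not None and score >= minimum:
            selected.append(model["name"])
    return selected


# Hypothetical model metadata.
models = [
    {"name": "model_1", "n_actives": 120, "scores": {("tsscv", "prauc"): 0.81}},
    {"name": "model_2", "n_actives": 18, "scores": {("tsscv", "prauc"): 0.90}},
    {"name": "model_3", "n_actives": 60, "scores": {("tsscv", "prauc"): 0.55}},
]

print(select_models(models, min_size=25, performance_filter="tsscv,prauc,0.7"))
```

Here model_2 fails the size criterion and model_3 fails the performance criterion, so only model_1 remains; each additional filter can only shrink the selection.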
Additional criteria can be added, for instance:
$ python $PV3/predict.py -f test.smi --organism "Rattus" -b 0.1,1 -p 0.5 --min_size 50 --se_filter --performance_filter l50po,bedroc,0.8
would filter rat models to those that did not require Sphere Exclusion (SE) (i.e. a sufficient number of inactives was available), that have a minimum of 50 actives in the training set, and that reach a minimum BEDROC performance of 0.8 when leaving 50% of ChEMBL publications out of the training data over 4-fold cross validation (L50PO). The output is a binary matrix of predictions at a probability cut-off of 0.5, for models trained with bioactivity data at thresholds of 0.1 and 1μM.