Enrichment Predictions
======================

This tutorial assumes the PIDGINv3 repository is located at ``$PV3`` and is concerned with
the script ``predict_enriched.py``

This script calculates target prediction enrichment (using Fishers' t-test) between two
input SMILES/SDF files as in [1]_. Target predictions are extended with NCBI Biosystems 
pathways and DisGeNET diseases. Pathway or disease-gene association enrichment
(using chi-square test) enrichment is calculated for the two input SMILES/SDF files.

The approach is used to annotate which targets/pathways/diseases are
statistically associated between two compound sets given their input SMILES/SDF files.
This analysis is important since a (predicted) target is not necessarily responsible for 
eliciting an observed mechanism-of-action. Some target prediction models also behave
promiscuously, due to biases in training data (chemical space) and the nature of the 
target.

The analysis must use a cut-off for the probability of activity from the random forest
for each target. Predictions are generated for the models using the reliability-density
neighbourhood Applicability Domain (AD) analysis by Aniceto from:
doi.org/10.1186/s13321-016-0182-y

``biosystems.txt`` contains pathway data from the NCBI biosystems used to annotate target
predictions. Pathway results can be filtered by source (e.g. KEGG/Reactome/GO) afterward.

``DisGeNET_diseases.txt`` contains disease data used to annotate target predictions.
DisGeNET gene-disease score takes into account the number and type of sources (level of
curation, organisms), and the number of publications supporting the association. The score
ranges from 0 to 1 in accordance to increasing confidence in annotations, resepctively. A
DisGeNET_threshold can be supplied at runtime when annotating predictions with diseases
(0.06 threshold applied by default, which includes associations from curated
sources/animal models supporting the association or reported in 20-200 papers). More info
on the score here: http://disgenet.org/web/DisGeNET/menu/dbinfo#score

List of available arguments
---------------------------

To see all available options, run

.. code-block:: shell-session

	$ python $PV3/predict_enriched.py -h
	Usage: predict_enriched.py [options]

	Options:
	  -h, --help            show this help message and exit
	  --f1=FILE             Firest input smiles or sdf file (required)
	  --f2=FILE             Second input smiles or sdf file (required)
	  -d DELIM, --smiles_delim=DELIM
							Input file (smiles) delimiter char (default: white
							space ' ')
	  --smiles_column=SMICOL
							Input file (smiles) delimiter column (default: 0)
	  --smiles_id_column=IDCOL
							Input file (smiles) ID column (default: 1)
	  -o FILE               Optional output prediction file name
	  -n NCORES, --ncores=NCORES
							No. cores (default: 1)
	  -b BIOACTIVITY, --bioactivity=BIOACTIVITY
							Bioactivity Um threshold (required). Use either
							100/10/1/0.1 (default:10)
	  -p PROBA, --proba=PROBA
							RF probability threshold (default: None)
	  --ad=AD               Applicability Domain (AD) filter using percentile of
							weights [float]. Default: 90 (integer for percentile)
	  --known_flag          Set known activities (annotate duplicates betweem
							input to train with correct label)
	  --orthologues         Set to use orthologue bioactivity data in model
							generation
	  --organism=ORGANISM   Organism filter (multiple can be specified using
							commas ',')
	  --target_class=TARGETCLASS
							Target classification filter
	  --min_size=MINSIZE    Minimum number of actives used in model generation
							(default: 10)
	  --performance_filter=P_FILT
							Comma-seperated performance filtering using following
							nomenclature: validation_set[tsscv,l50so,l50po],metric
							[bedroc,roc,prauc,brier],performance_threshold[float].
							E.g 'tsscv,bedroc,0.5'
	  --se_filter           Optional setting to restrict to models which do not
							require Sphere Exclusion (SE)
	  --training_log        Optional setting to add training_details to the
							prediction file (large increase in output file size)
	  --ntrees=NTREES       Specify the minimum number of trees for warm-start
							random forest models (N.B Potential large
							latency/memory cost)
	  --preprocess_off      Turn off preprocessing using the flatkinson (eTox)
							standardizer (github.com/flatkinson/standardiser),
							size filter (100 >= Mw >= 1000 and organic mol check
							(C count >= 1)
	  --dgn=DGN_THRESHOLD   DisGeNET score threshold (default: 0.06)


Generating enrichment predictions
---------------------------------
	  
In this example, we will work with a two SMILES input files, comprising cytotoxic
compounds in the file named ``cytotox_library.smi`` and (putative) non-toxic compounds in
the file named ``nontoxic_background.smi``. Both are located in the examples directory.

The corresponding top 5 SMILES strings are:

.. literalinclude:: ../../examples/cytotox_library.smi
   :caption: cytotox_library.smi
   :lines: 1-5
   
and

.. literalinclude:: ../../examples/nontoxic_background.smi
   :caption: nontoxic_background.smi
   :lines: 1-5

The following code will generate cow target prediction enrichment at 1μM (with lenient AD
filters of 30 percentiles and probability of activity cut-off of 0.45) along with enriched
pathways and diseases (0.06 score threshold) for the cytotoxic compounds, when compared to
the non-toxic compounds.

.. code-block:: shell-session

	$ python $PV3/predict_enriched.py --f1 cytotox_library.smi --f2 nontoxic_background.smi --organism "Bos taurus" -b 1 -p 0.45 --ad 30 -n 4

Three files are output for the target, pathway and disease enrichment calculations, with 
the naming convention: 

``[f1]_vs[f2]_out_[disease/pathway]_predictions_enriched[timestamp].txt``

The rows in each file correspond to the ranked enriched list of targets/pathways/diseases 
that are more statistically associated with the first SMILES/SDF file (``--f1``) of 
(e.g. cytotoxic) compounds. A higher Odd's Ratio (column ``Odds_Ratio``) or Risk Ratio
(``Risk_Ratio``) indicates a larger degree of enrichment for a given
target/pathway/disease compared to the second input ``--f2`` (nontoxic) compound set.

The output has columns for the number of compound predictions (column
``[f1/f2]_[In]Actives_[probability_activity]``) and the associated percentage
``Percent_[f1/f2]_[In]Actives_[probability_activity]``) of compounds with that prediction.

The Fishers or Chi-squared p-values are provided (``[Fishers_Test/Chisquared]_P_Value``) 
including the Benjamini & Hochberg corrected values in the column named 
``[Fishers_Test/Chisquared]_P_Value_Corrected``. The output should be filtered for a
given preference.

The percentage NaN predictions (compounds outside the Applicability Domain (AD) filter 
that were not given an active/inactive target prediction) are also provided in the column 
entitled ``[f1/f2]_Proportion_Nan_Predictions_[ad]``.

.. note::
	Please note that the Odd's and Risk ratios are implemented in a different way to the
	previous version of PIDGIN. For this version, larger numbers indicate larger
	enrichments.
	
In this example, there are six targets with a corrected p-value less than 0.05 with a Odds
or Risk ratio greater than 1.0. All targets have known links to cytotoxicity, for example
three are related to Tublin with known mechanisms to cytotoxicity (via cytoskeletal
machinery).

More complicated example
------------------------

Target/pathway/disease enrichment analysis can be combined with all model filters outlined
in the previous section "Getting started".
 
For example, the following code:

.. code-block:: shell-session

	$ python $PV3/predict_enriched.py --f1 cytotox_library.smi --f2 nontoxic_background.smi --organism Drosophila -b 100 --known_flag --ad 0 -n 4 -p 0.8 --min_size 50 --se_filter --performance_filter l50po,bedroc,0.8
	
would filter for Drosophila models that did not require Sphere Exlusion (SE) (i.e. sufficient number of inactives available) and a minimum number of 50 actives in the training set, with a minimum BEDROC performance of 0.8 for leave out 50% of ChEMBL publications from training data over 4-fold cross validation (L50PO), to produce enrichment predictions at a 0.8 probability cut-off at a threshold of 100μM, with the Applicability Domain (AD) filter silenced and where known activities (in ChEMBL or PubChem) are set.


References
----------

.. [1] |mervin2016|

.. include:: ../substitutions.rst