Computing Plausibility of Substance-Effect Asociations Described Online

Computing Plausibility of Substance-Effect Asociations Described Online


  1. Read Applying Bradford-Hill in the Modern Era
  2. Make Clickable Map for Computation.
  3. Read the druggable schizophrenia genome

To identify which substance-effect associations mentioned online can be used to generate hypotheses it is immportant to demonstrate that the association is specific and plausible. I use these terms as they are used in the Bradford-Hill Criteria.

The more specific the association between two tokens, A and B, the more often B occurs whenever A occurs and doesn’t occur unless A occurs. In a limited domain of discourse, tf-idf does well in identifying the specificity of association after correcting for baseline rates (citation needed). The main difference between tf-idf and dimension reduction on a covariance matrix is the idf component. Tf-idf weights each entry by the number of documents the token appears in. This provides a more context-relevant scaling than subtracting a measure of central tendency from the tf term at the expense of making it impossible to compare absolute tf-idf scores across corpora.

The more plausible an association between two tokens, A and B, are, the more the association between A and B can be explained in terms of accepted knowledge (e.g., things one would find in a knowlwedge base). Right off the bat we can anticipate that all computational of plausibility are preliminary because the knowledge base is incomplete, at least until the last question is answered. Said another way, computations of plausibility are necessarily limited by the open world assumption. In this notebook, I discuss computing plausibility in terms of fractional overlap of gene targets. See Alternative Approaches for approaches that use additional information, like network structure, that provide more information but are also less available.

We start with the following resources:

Resource Role (n)
KEGG Map Substance to Signal Transduction Cascade  
DrugBank Maps Substances to Biochemical Targets 14,594
GO Map Biocheimcal Intermediary to Effect  
DisGeNet Maps Gene Target to Clinical Phenotype 441 (for OUD or PTSD)


flowchart TD
	subgraph Statistic[Similarity of Gene Profile]
		direction LR
        A[Drugs froms Social Media<br>In Drug Bank with Known Targets] --> B[Genes of Drug Targets]
		B --> C[Co-expressed Genes] & D[First Neighbors of Drug]
		C & D --> E[Expanded Gene Set]
        E --> F[<p align='left'>Jaccard Similarity Between Expanded Gene Set<br> and Genes Differentially Expressed in PTSD, OUD, or Both]
    subgraph Significance[Calculation of Significance]
        direction LR
       a[Drugs in DrugaBank with Known Targets]
       a --> b[Calculate Jaccard Similarity for Their Expanded Gene Sets]
       b --> c[Create Empirical Probability Density Function]
       b --> d["<p align='left'>Identify Median Jaccard Similarity for Currently Approved PTSD, OUD Therapeutics: <li align='left'>ROC Curve  <li align='left'>(Is there a Deault Network?)"]

    Statistic --> Significance
(+) Controls (-) Controls

The Computation

I obtained the XML dump of DrugBank (current as of March 2022) from DrugBank, licensed for academic use. I converted the XML to a JSON array using xmltodict and then imported the array into a MongoDB server. Even though DrugBank provides an XSD, I found MongoDB faster and more intuitive than XPath.

I obtained the genes that were direct targets of methadone with the following query (example is for methadone):

> db.drugs.aggregate([{$match:{"name":"Methadone"}},{$project:{_id:0,"name":"$","gene-name":"$"}}])

 # "{"name" : [ "Mu-type opioid receptor", "NMDA receptor", "Delta-type opioid receptor", "5-hydroxytryptamine receptor 3A", "Neuronal acetylcholine receptor subunit alpha-7", "Neuronal acetylcholine receptor subunit alpha-3", "Neuronal acetylcholine receptor subunit alpha-4", "Neuronal acetylcholine receptor subunit beta-2" ], "gene-name" : [ "OPRM1", [ "GRIN1", "GRIN2A", "GRIN2B", "GRIN2C", "GRIN2D", "GRIN3A", "GRIN3B" ], "OPRD1", "HTR3A", "CHRNA7", "CHRNA3", "CHRNA4", "CHRNB2" ] }

# Not sure why a nested list was returned. 

Let’s use Enrichr (API) to identify the 100 most commonly co-exressed genes. For methadone this created a set of 1400 genes, of which 994 were unique and has 2 genes overlapping, OPRM1, which we knew and GAL, a gene that encodes for galanin and galanin-map associated peptide. Indeed this article suggest that heteromers between the receptor for galanin and the opioid receptors may attenuate the addictive potential of methadone. This paper suggests that galanin modulates stress reactions. The idea

Then we ask KEGG what human pathways those genes are implicated in

DisGeNet provides a REST API. I queried the API for PTSD (CUI:C0038436) and OUD (CUI:C4324621), which returned 441 genes.

#import requests
PTSD = "C0038436"
OUD = "C4324621"
diseases = f'{PTSD},{OUD}'

if api_key:
    s.headers.update({"Authorization": "Bearer %s" % api_key}) 
    gda_response = s.get(f'{api_host}/gda/disease/{diseases}', params={'source':'ALL'})
if s:

I imported the JSON array into MongoDB so queries could reference both data sets. We can now compute the fraction of targets for a drug of interest that lie in the intersection of OUD- and PTSD-associated genes.

An alternative approach would be to export the gene names for each condition to a text file. This worked for PTSD and OUD. In bash, one can compute the intersection as sort OUD.txt PTSD.txt | uniq -d if we treat gene expression as a categorical variable.

sort OUD.txt PTSD.txt | uniq -d
Condition Genes
PTSD & OUD & Methadone OPRM1

My interpretation is that there is only one intersection because the Methadone row contains genes of direct targets, but the PTSD & OUD row contains genes for intracellular components. To add genes for intracellular components for (i.e., downstream effectors of) the direct targets of Methadone. Luckily, this post describes exactly how to do that.

First, download the ids of all targets.

curl -Lfv -o -u $username:$password -vs > ./external/drugbank_targets .txt 2>&1

mv uniprot ./external uniprot kegg_uniprot_ids.tsv

My file structure now looks like.

tree -d 
├── Area 51
└── external

A peak at the files:

#we are in the external subdirectory of data/
head -n 3 drug-bank-target-ids.csv | csvlook | less -S #has very long columns
ID Name Gene Name GenBank Protein ID GenBank Gene ID UniProt ID Uniprot Title PDB ID GeneCar
1 Peptidoglycan synthase FtsI ftsI 1,574,687 L42023 P45059 FTSI_HAEIN      
2 Histidine decarboxylase HDC 32,109 X54297 P19113 DCHS_HUMAN 4E1O    
head -n 2 kegg_uniprot_ids.tsv 
up:P0AD86	eco:b0001
up:P00561	eco:b0002

Alternative Approaches to Computing Plausibility

The goal is to create a graph like the following

	subgraph Path Length = 3
	A[Substance] --> B --> C --> D[Effect]

	subgraph Path Length = 2
	AA[Substance] --> BB[B] --> DD[Effect]

We infer that a path with a shorter length is more biologically plausible than a path with a longer length because the shorter path involves fewer intermediaries and so fewer dependencies on other signal transduction cascades or cellular processes.

Our corpus lists substance and effects. We assemble a graph by linking (substance, target), (target, target), and (target, effect) pairs. Here target means any biochemical intermediary, such as the target receptor or a carrier protein.