Creating A Library of Chemical Structures

Outstanding Questions.
  1. How to quantify the structural heterogeneity of each field?
  2. How to distinguish between agonists and antagonists?


For my Analysis of Latent Spaces, I created a convolutional neural network (link) to classify compounds by drug category. This page describes how I obtained the drug categories and their labels.

The divide between the everyday terminology clinicians use and the formal language of the NDCs can affect observational studies (DeFalco et al., 2013), but our input here is SMILES strings.

Creating a Representative Training Set.

One consideration when creating the training data sets was to have enough samples of novel psychoactive substances so that the relative magnitudes of the fractional classifications would be meaningful.

Training Set Should Have

  1. Comparable Numbers of Compounds from Each Class
  2. Comparable Subclasses of Compounds within Each Class (e.g., opiates, semisynthetic opioids, opioids, novel opioids)
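The two criteria above can be checked with a quick count before training. The following is a minimal sketch, assuming each compound is stored as a (class, subclass) pair; the function name, the toy records, and the subclass labels are hypothetical and are not the real data set.

```python
from collections import Counter

def balance_report(records):
    """Summarise class balance for an iterable of (class, subclass) pairs.

    Returns per-class counts, per-(class, subclass) counts, and the ratio
    of the largest class to the smallest class (1.0 means perfectly even).
    """
    records = list(records)
    class_counts = Counter(cls for cls, _ in records)
    subclass_counts = Counter(records)
    ratio = max(class_counts.values()) / min(class_counts.values())
    return class_counts, subclass_counts, ratio

# Hypothetical toy records, for illustration only.
toy = (
    [("opioid", "opiate")] * 2
    + [("opioid", "novel opioid")] * 2
    + [("tryptamine", "simple")] * 3
    + [("benzodiazepine", "classical")] * 2
)

class_counts, subclass_counts, ratio = balance_report(toy)
print(dict(class_counts))
print(f"largest/smallest class ratio: {ratio:.1f}")
```

A ratio close to 1 suggests the classes are comparable in size; the subclass counts show whether one subclass dominates a class.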

Classes (Is there a pre-existing classification I could use?). Links to CSV only work if you are a known collaborator for this project.

  1. Opioid (CSV one and two)
  2. Tryptamine (CSV)
  3. Phenethylamine (CSV)
  4. Benzodiazepine (CSV)
  5. Beta-blocker (CSV)
  6. Calcium Channel Blocker (CSV)
  7. Cannabinoids (CSV)

Testing sets:

  1. Ergolines (CSV)
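Keeping a class such as the ergolines out of training entirely can be sketched as a split by label. This is only an illustration, assuming each record is a (SMILES, class) pair; the function name and the placeholder strings stand in for real data and are hypothetical.

```python
def split_by_class(records, held_out_classes):
    """Split (smiles, label) pairs so held-out classes never reach training."""
    train, test = [], []
    for smiles, label in records:
        target = test if label in held_out_classes else train
        target.append((smiles, label))
    return train, test

# Placeholder strings stand in for real SMILES; illustration only.
records = [
    ("SMILES_A", "opioid"),
    ("SMILES_B", "tryptamine"),
    ("SMILES_C", "ergoline"),
    ("SMILES_D", "phenethylamine"),
    ("SMILES_E", "ergoline"),
]

train, test = split_by_class(records, held_out_classes={"ergoline"})
print(len(train), "training compounds;", len(test), "held-out compounds")
```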

I used PyBioMed (GitHub repo, tutorial) and ChemSpiPy. Here is a good overview of the databases describing approved substances (NB: openFDA might be useful).

Code Narration.

I used requests to query RxNorm.

import requests

URL = ""

def query(classId):
    # relaSource is required; here the drug-class relationships come from ATC
    return requests.get(url=URL, params={"classId": classId, "relaSource": "ATC"})

The keyword relaSource refers to the relationship that must hold between the drug and class for a drug to be considered a member of the class. I’m not sure why, but while the source of the relationship has to be specified, the object of the relationship is optional. The documentation does not explain this design choice.
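To make the required and optional parameters concrete, here is a hedged sketch of building the query string and reading names out of a response. It uses the standard library's urlencode rather than requests so that it runs offline. The optional `rela` parameter name, the placeholder base URL, and the JSON layout (`drugMemberGroup` > `drugMember` > `minConcept`) are my assumptions about the service, and the sample payload is made up for illustration.

```python
from urllib.parse import urlencode

def build_query(base_url, class_id, rela_source="ATC", rela=None):
    """Build a class-members query URL.

    relaSource is always sent; `rela` (assumed name for the optional
    relationship filter) is only added when the caller supplies it.
    """
    params = {"classId": class_id, "relaSource": rela_source}
    if rela is not None:
        params["rela"] = rela
    return f"{base_url}?{urlencode(params)}"

def member_names(payload):
    """Return sorted, de-duplicated drug names from an assumed response layout."""
    members = payload.get("drugMemberGroup", {}).get("drugMember", [])
    return sorted({m["minConcept"]["name"] for m in members})

# Placeholder base URL; N02A is the ATC code for opioids.
url = build_query("https://example.invalid/classMembers.json", "N02A")

# Made-up payload in the assumed layout, not real service output.
sample = {
    "drugMemberGroup": {
        "drugMember": [
            {"minConcept": {"rxcui": "0", "name": "morphine", "tty": "IN"}},
            {"minConcept": {"rxcui": "1", "name": "codeine", "tty": "IN"}},
        ]
    }
}

print(url)
print(member_names(sample))
```

Returning an empty list for an unexpected payload, rather than raising, makes it easier to loop over many class IDs and inspect the empty ones afterwards.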

Compound Class Data Source

Resources I used.

  1. Wikipedia: List of psychedelic substances
  2. Wikipedia: List of designer drugs
  3. Wikipedia: List of substances used in rituals
  4. Wikipedia: Substituted phenethylamines

I also used the following resource.

  1. Oxford’s Catalogue of Opioids (Richards et al., 2021)

The Wikipedia page List of Psychoactive Substances details serotonergic agonists and cannabinoid receptor agonists, which I extracted as

I didn’t include the benzofuran derivatives dimemebfe (also known as 5-MeO-BFE) and 5-MeO-DiBF, because neither is, structurally, a tryptamine.


I began with the Wikipedia page. I excluded LSD derivatives from the training set because they contain a phenethylamine substructure as well as a tryptamine one. (The D stands for diethylamide.) I saved them in [filename] because I think they are a good test for the CNN: I expect it to classify LSD and LSD derivatives as partly tryptamine and partly phenethylamine.


I excluded HU-211 because it binds to NMDA receptors as well as CB receptors. (Wikipedia suggests it has therapeutic uses.) HU-211 is the enantiomer of HU-210. The structural diversity of cannabinoids could be a second paper.


  1. Richards, G. C., Sitkowski, K., Heneghan, C. & Aronson, J. K. The Oxford Catalogue of Opioids: A systematic synthesis of opioid drug names and their pharmacology. British Journal of Clinical Pharmacology 87, 3790–3812 (2021).
  2. DeFalco, F. J., Ryan, P. B. & Soledad Cepeda, M. Applying standardized drug terminologies to observational healthcare databases: a case study on opioid exposure. Health Services and Outcomes Research Methodology 13, 58–67 (2013).