25 Jun 2022 (Last Modified 23 Dec 2022)
Named entity linking (NEL) links mentions in a corpus with what they signify. One has to:
In my project on understanding DNP toxicity, I use NEL (as distinct from NER) to interpret social media posts in terms of biomedical knowledge. I view NEL as a step along the path from strings of characters in posts to statements on which one can perform logical operations such as inference, deduction, or abstraction.
I’m following SpaCy’s v3 NEL tutorial.
We have a NER module, ner, that we trained on about 4k posts, tested on 1k, then updated and validated on another 1k (manuscript/preprint under preparation). ner recognizes substances and symptoms.¹
An important feature of NEL is mapping all mentions of an entity to the standardized representation of that entity. For example, test and testosterone should both be mapped to a universal representation of the concept of testosterone. SpaCy’s example uses Wikipedia identifiers. This is not a generalizable approach, but it suffices to demonstrate SpaCy’s approach to NEL. We use IRIs (Internationalized Resource Identifiers).
See here for a fuller description of creating the knowledge base. SpaCy’s Knowledge Base is formatted as follows.
#QID,identifier,description
"Q312545","Roy Stanley Emerson","Australian tennis player"
"Q48226","Ralph Waldo Emerson","American philosopher, essayist, and poet"
"Q215952","Emerson Ferreira da Rosa","Brazilian footballer"
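To make the layout concrete, here is a minimal sketch of reading rows in this format with Python's standard csv module. The function name load_entities and the two-dictionary return value are my own choices for illustration, not something the format requires, and the sample rows are simply the ones shown above.

```python
import csv
import io

# Sample rows copied from the listing above (header line included).
KB_CSV = '''#QID,identifier,description
"Q312545","Roy Stanley Emerson","Australian tennis player"
"Q48226","Ralph Waldo Emerson","American philosopher, essayist, and poet"
"Q215952","Emerson Ferreira da Rosa","Brazilian footballer"
'''

def load_entities(handle):
    """Return two dicts keyed by QID: names and descriptions.

    Hypothetical helper for illustration only.
    """
    names, descriptions = {}, {}
    for row in csv.reader(handle):
        # Skip blank lines and the '#'-prefixed header line.
        if not row or row[0].startswith("#"):
            continue
        qid, name, description = row
        names[qid] = name
        descriptions[qid] = description
    return names, descriptions

names, descriptions = load_entities(io.StringIO(KB_CSV))
for qid in names:
    print(qid, names[qid], "-", descriptions[qid])
```

The csv module handles the quoted description that contains commas, so each row still splits into exactly three fields.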
Ours is formatted as follows.
ENTITY,TERMS,STANDARDISED MAPPING TERM,GROUP
Myalgia,Myalgia,Feeling Tired,Fatigue issue
Fatigue,Fatigue,Feeling Tired,Fatigue issue
Gassed,Fatigue,Feeling Tired,Fatigue issue
Exhausted,Fatigue,Feeling Tired,Fatigue issue
“identifier” in SpaCy’s schema corresponds to “entity” in ours.
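As a rough sketch of what this mapping buys us, the rows above can be turned into a lookup from a surface mention to its standardized term and group. This is plain Python rather than SpaCy's knowledge base, and the names build_lookup and normalize are hypothetical. Columns are read by position (entity, term, standardized mapping term, group), and matching is case-insensitive, which is an assumption on my part.

```python
import csv
import io

# Data rows copied from the listing above (header omitted; columns are
# read by position: entity, term, standardized mapping term, group).
ROWS_CSV = """Myalgia,Myalgia,Feeling Tired,Fatigue issue
Fatigue,Fatigue,Feeling Tired,Fatigue issue
Gassed,Fatigue,Feeling Tired,Fatigue issue
Exhausted,Fatigue,Feeling Tired,Fatigue issue
"""

def build_lookup(handle):
    """Map each case-folded entity string to its term, standard term, and group."""
    lookup = {}
    for entity, term, standard, group in csv.reader(handle):
        lookup[entity.casefold()] = {
            "term": term,
            "standard": standard,
            "group": group,
        }
    return lookup

def normalize(mention, lookup):
    """Return the record for a mention, or None if it is not in the table."""
    return lookup.get(mention.strip().casefold())

lookup = build_lookup(io.StringIO(ROWS_CSV))
print(normalize("gassed", lookup))
print(normalize("Exhausted", lookup))
print(normalize("bloated", lookup))  # not in the table, so None
```

An exact-match table like this only covers mentions that appear verbatim in the knowledge base; choosing among several candidates from context is the part a trained linker is meant to handle.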
The SpaCy example trains on Wikidata. We use our online forum DNP corpus.
1. ner is currently better at recognizing substances and symptoms associated with bodybuilding and DNP use. This predilection reflects its training.