23 Dec 2022 (Last Modified 23 Dec 2022)
For named entity recognition and linking in SpaCy, the knowledge base is the set of entries to which entities found in a specific text are linked. SpaCy assumes a format like the following:
#QID,identifier,description
"Q312545","Roy Stanley Emerson","Australian tennis player"
"Q48226","Ralph Waldo Emerson","American philosopher, essayist, and poet"
"Q215952","Emerson Ferreira da Rosa","Brazilian footballer"
Let $\textrm{proj}_s(w)$ denote the embedding (projection) of a word $w$ into space $s$.
Let $t$ denote a token, and denote the set of tokens that describe an entity $e_i \in \{e\}$ by $\{t\}^{e_i}$.
SpaCy will suggest that entity $e_i$ should be linked with $w$ if $\left| \textrm{proj}_s(w) - \overline{\textrm{proj}_s(t_j)} \right| < \epsilon$.
The bar denotes the average, with the summation implied over all tokens $t_j \in \{t\}^{e_i}$.
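The rule above can be sketched in a few lines of Python. This is only an illustration of the formula as stated here, not SpaCy's internal implementation. It assumes that $|\cdot|$ is the Euclidean norm, and the function names, the three-dimensional vectors, and the value of $\epsilon$ are all made up for the example.

```python
import math

def mean_vector(vectors):
    """Component-wise average of equal-length vectors (the 'bar' in the formula)."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def should_link(word_vec, token_vecs, epsilon):
    """True if the word embedding lies within epsilon of the mean token embedding."""
    centroid = mean_vector(token_vecs)
    # Assumption: distance is Euclidean; the text does not name the norm.
    return math.dist(word_vec, centroid) < epsilon

# Hypothetical embeddings, invented purely for illustration.
word = [1.0, 0.0, 0.0]
entity_tokens = {
    "e_1": [[0.9, 0.1, 0.0], [1.1, -0.1, 0.0]],
    "e_2": [[0.0, 1.0, 0.0], [0.0, 0.8, 0.2]],
}

for name, vectors in entity_tokens.items():
    print(name, should_link(word, vectors, epsilon=0.5))
```

Because only the average of the description tokens enters the comparison, adding unrelated tokens to a description pulls the average away from the word and can push the entity outside $\epsilon$.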
As the mathematical formulation suggests, the performance of the NER module hinges on the descriptions.
The word "descriptions" is ambiguous. For SpaCy, it means synonyms and context in the form of associated tokens. It does not mean anything approaching a definition, concordance, or etymology of the term.
Our entry for DNP might look like this:
#tab-separated values
#id alias description
n DNP DNP, 2,4-dinitrophenol, weight loss agent, uncoupler of oxidative phosphorylation, phenolic, fat burner, fatburner ...
n 2,4-DNP ditto
n 2,4-dinitrophenol ditto
Each alias for an entity contains the same description. Each description begins with an enumeration of all aliases and then provides synonyms and closely related words.
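A minimal sketch of how such rows could be generated, assuming the layout shown above (id, alias, description, separated by tabs). The helper name `build_rows` is hypothetical, and the placeholder id `n` is carried over from the example rather than being a real identifier.

```python
import csv
import io

def build_rows(entity_id, aliases, related_terms):
    """Return one (id, alias, description) row per alias, all sharing one description."""
    # The description lists every alias first, then synonyms and related words.
    description = ", ".join(list(aliases) + list(related_terms))
    return [(entity_id, alias, description) for alias in aliases]

aliases = ["DNP", "2,4-DNP", "2,4-dinitrophenol"]
related = ["weight loss agent", "uncoupler of oxidative phosphorylation", "fat burner"]

# Write to an in-memory buffer; a real run would open a .tsv file instead.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerows(build_rows("n", aliases, related))
print(buffer.getvalue(), end="")
```

Generating the description once and reusing it for every alias keeps the rows from drifting apart when a synonym is added later.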
We assemble our preliminary list by creating a TSV file with two columns: an index and all the signs and symptoms from our curated data sets.
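# Collect the text of every annotated span into a draft entity list.
ENTITIES=./data/kb/dnp.entities.draft.tsv
touch "${ENTITIES}"
for file in ./data/bb.corpus.deduplicated.cleaned.reconciled.ner.*.jsonl
do
    echo "${file}"
    # Print each span's text, skipping spans that have none, and append it to the list.
    jq -r '.spans[].text | select(. != null)' "${file}" >> "${ENTITIES}"
done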
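The loop above appends every span text, duplicates included, and writes no index column. One possible follow-up, sketched here in Python, deduplicates the texts and numbers them. It assumes each JSONL line is an object with a `spans` list whose items may carry a `text` field, as the jq filter implies. The function names and the sample records are invented for illustration.

```python
import json

def collect_span_texts(lines):
    """Yield the text of each span found in an iterable of JSONL lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Tolerate records with a missing or null "spans" field.
        for span in record.get("spans") or []:
            text = span.get("text")
            if text is not None:
                yield text

def index_unique(texts):
    """Return (index, text) pairs for the unique texts, in first-seen order."""
    unique = dict.fromkeys(texts)  # dicts preserve insertion order
    return list(enumerate(unique, start=1))

# Hypothetical records standing in for the annotated corpus.
sample = [
    '{"spans": [{"text": "fever"}, {"text": "sweating"}]}',
    '{"spans": [{"text": "fever"}, {"start": 0, "end": 4}]}',
]

for index, text in index_unique(collect_span_texts(sample)):
    print(f"{index}\t{text}")
```

Keeping first-seen order makes the draft list easy to compare against the source files during manual review.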