Named Entity Recognition in Python

22 Mar 2023 (Last Modified 23 Apr 2023)

Introduction

Named Entity Recognition (NER) refers to mapping groups of characters to known entities in the real world, for example recognizing that the sequence of characters "ball" refers to the round object that bounces. NER is closely linked with named entity linking (NEL), in which the recognized entity is mapped onto a unique identifier.
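To make the distinction between recognition and linking concrete, here is a minimal dictionary-based sketch. It is not how a trained NER model works; the gazetteer, the entity type labels, and the `EX:` identifiers are all made up for illustration, and the function names `recognize` and `link` are hypothetical.

```python
import re

# Hypothetical gazetteer: surface form -> (entity type, made-up identifier).
GAZETTEER = {
    "fentanyl": ("DRUG", "EX:0001"),
    "carfentanil": ("DRUG", "EX:0002"),
    "naloxone": ("DRUG", "EX:0003"),
}


def recognize(text):
    """NER step: return (start, end, surface form) for each known mention."""
    pattern = r"\b(" + "|".join(map(re.escape, GAZETTEER)) + r")\b"
    return [
        (m.start(), m.end(), m.group(0))
        for m in re.finditer(pattern, text, flags=re.IGNORECASE)
    ]


def link(mention):
    """NEL step: map a recognized surface form onto its identifier."""
    return GAZETTEER[mention.lower()][1]


text = "Naloxone reverses fentanyl overdoses."
mentions = recognize(text)
linked = [(surface, link(surface)) for _, _, surface in mentions]
print(linked)  # [('Naloxone', 'EX:0003'), ('fentanyl', 'EX:0001')]
```

A lookup like this only finds exact spellings that are already in the dictionary, which is one reason statistical and neural approaches are used for text with many unseen terms.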

Literature Review

Biomedical NER is challenging because biomedical texts contain many compound words and a large number of out-of-vocabulary terms. Except for RNN-based models, whose F1 scores are around 0.60, models using word embeddings (GloVe, Word2Vec) have F1 scores between 0.70 and 0.75 (Song et al., 2018). An ensemble model can achieve an F1 score of 0.93 on biomedical texts, but no validation on social media text has been reported (Sung et al., 2022).
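For readers unfamiliar with the metric, the F1 score is the harmonic mean of precision and recall. The sketch below computes it at the entity level, assuming exact matching of span and type; the cited papers may use different matching criteria, and the annotations here are invented for illustration.

```python
def f1_score(gold, predicted):
    """Entity-level F1: a prediction counts only if span and type both match."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Made-up annotations as (start, end, type) tuples, for illustration only.
gold = [(0, 8, "DRUG"), (18, 26, "DRUG"), (40, 47, "DISEASE"), (50, 58, "DRUG")]
predicted = [(0, 8, "DRUG"), (18, 26, "DRUG"), (40, 47, "DRUG")]

score = f1_score(gold, predicted)
print(round(score, 2))  # 0.57
```

Here two of three predictions are correct (precision 2/3) and two of four gold entities are found (recall 1/2), giving F1 = 4/7. The third prediction has the right span but the wrong type, so under exact matching it counts as an error.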

Including character-level features can increase the generalizability of word-level embeddings. Indeed, adding a bidirectional LSTM to state-of-the-art biomedical NER systems based on word embeddings improved the F1 score from 0.70 to 0.75 (Gridach, 2017). This increase is modest: it stays within the 0.70 to 0.75 range that most biomedical NER systems relying on word embeddings achieve.

Character-level features matter because biomedical terms frequently vary in morphology even while preserving phonology, for example "carfentanil" but "fentanyl" (Kim & Kang, 2022).
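The carfentanil and fentanyl example can be illustrated with character n-grams. This is a deliberately simple stand-in, not the bidirectional LSTM from Gridach (2017): it only shows that two words a word-level vocabulary treats as unrelated tokens still share subword material. The helper names `char_ngrams` and `jaccard`, the trigram size, and the `<` and `>` boundary markers are assumptions made for this sketch.

```python
def char_ngrams(word, n=3):
    """Return the set of character n-grams of a word, padded with boundary markers."""
    padded = f"<{word.lower()}>"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    return len(a & b) / len(a | b)


a = char_ngrams("carfentanil")
b = char_ngrams("fentanyl")

shared = a & b
similarity = jaccard(a, b)

print(sorted(shared))        # ['ent', 'fen', 'nta', 'tan']
print(round(similarity, 3))  # 0.267
```

The shared trigrams cover the common stem "fentan", while the differing endings ("nil" versus "nyl") contribute nothing to the overlap. A model with access to characters can exploit the shared stem even when one of the two words never appeared in training.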

Named Entity Linking in Python

Bibliography

1. Kim, H. & Kang, J. How do your biomedical named entity recognition models generalize to novel entities? IEEE Access 10, 31513–31523 (2022).
2. Sung, M. et al. BERN2: an advanced neural biomedical named entity recognition and normalization tool. Bioinformatics 38, 4837–4839 (2022).
3. Song, H.-J., Jo, B.-C., Park, C.-Y., Kim, J.-D. & Kim, Y.-S. Comparison of named entity recognition methodologies in biomedical documents. BioMedical Engineering OnLine 17, 1–14 (2018).
4. Gridach, M. Character-level neural network for biomedical named entity recognition. Journal of Biomedical Informatics 70, 85–91 (2017).