MedBERT: A Pre-trained Language Model for Biomedical Named Entity Recognition
This paper introduces MedBERT, a new pre-trained transformer-based model for biomedical named entity recognition. MedBERT is trained on 57.46M tokens collected from biomedical data sources: datasets acquired from the N2C2, BioNLP, and CRAFT challenges, and biomedical articles crawled from Wikipedia. We validate the effectiveness of MedBERT by comparing it with four publicly available pre-trained models on ten biomedical datasets from the BioNLP and CRAFT shared tasks.
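To make the fine-tuning setup concrete, below is a minimal sketch of adapting a BERT-style checkpoint for token-level NER with the Hugging Face transformers library. The checkpoint ID and BIO label set are assumptions for illustration (the paper does not name a public model ID), with bert-base-cased standing in for MedBERT.

    # Minimal sketch: a BERT-family encoder prepared for biomedical NER
    # as token classification. "bert-base-cased" is a stand-in; a public
    # MedBERT checkpoint ID, if released, would replace it.
    from transformers import AutoTokenizer, AutoModelForTokenClassification

    # Illustrative BIO tag set for one entity type (Protein); the paper's
    # datasets also cover Gene, Chemical, Gene Ontology, and Taxonomy tags.
    labels = ["O", "B-Protein", "I-Protein"]
    id2label = {i: l for i, l in enumerate(labels)}
    label2id = {l: i for i, l in enumerate(labels)}

    model_name = "bert-base-cased"  # hypothetical stand-in for MedBERT
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForTokenClassification.from_pretrained(
        model_name,
        num_labels=len(labels),
        id2label=id2label,
        label2id=label2id,
    )

    # Run one sentence through the model (classification head untrained
    # here) and decode the per-subword predictions back to BIO labels.
    enc = tokenizer("BRCA1 regulates DNA repair", return_tensors="pt")
    pred_ids = model(**enc).logits.argmax(dim=-1)[0].tolist()
    print([id2label[i] for i in pred_ids])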
Our experimental results show that models fine-tuned from MedBERT achieve state-of-the-art performance on nine datasets covering Protein, Gene, Chemical, Cellular Component, Gene Ontology, and Taxonomy entities. Specifically, the model achieved an average micro-F1 score of 84.04% across the ten test sets from the BioNLP and CRAFT challenges, an improvement of 3.7% and 7.83% over models fine-tuned on BioBERT and Bio_ClinicalBERT, respectively.
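For reference, micro-averaged F1 for NER is computed at the entity level by pooling true positives, false positives, and false negatives across all entity types. The sketch below shows this with the seqeval library; the BIO tag sequences are invented for illustration.

    # Entity-level micro F1 over BIO-tagged sequences, the metric style
    # reported above. Tags here are made up for illustration only.
    from seqeval.metrics import f1_score

    y_true = [["B-Protein", "I-Protein", "O", "B-Chemical"]]
    y_pred = [["B-Protein", "I-Protein", "O", "O"]]

    # average="micro" pools TP/FP/FN across entity types before
    # computing F1: here 1 TP, 0 FP, 1 FN -> F1 = 2/3.
    print(f1_score(y_true, y_pred, average="micro"))  # ~0.667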
History
Journal/Conference/Book title
2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 07-10 November 2022, Chiang Mai, Thailand.
Publication date
2022-12-21
Version
- Post-print