MedBERT: A Pre-trained Language Model for Biomedical Named Entity Recognition

posted on 2023-08-31, 05:17 authored by Charangan Vasantharajan, Kyaw Zin Tun, Ho Thi-Nga, Sparsh Jain, Rong Tong, Eng-Siong Chng

This paper introduces MedBERT, a new pre-trained transformer-based model for biomedical named entity recognition. MedBERT is trained on 57.46M tokens collected from biomedical data sources, namely datasets from the N2C2, BioNLP, and CRAFT challenges, and biomedical articles crawled from Wikipedia. We validate the effectiveness of MedBERT by comparing it with four publicly available pre-trained models on ten biomedical datasets from the BioNLP and CRAFT shared tasks.

Our experimental results show that models fine-tuned on MedBERT achieve state-of-the-art performance on nine datasets that predict Protein, Gene, Chemical, Cellular/Component, Gene Ontology, and Taxonomy entities. Specifically, the model achieved an average F1-micro score of 84.04% on ten test sets from the BioNLP and CRAFT challenges, an improvement of 3.7% and 7.83% over models fine-tuned on BioBERT and Bio ClinicalBERT, respectively.
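For readers unfamiliar with the metric, the following is a minimal sketch of how a micro-averaged F1 score is conventionally computed: true positives, false positives, and false negatives are pooled across all entity types before precision and recall are taken. The function name `micro_f1` and the counts in `example_counts` are hypothetical and purely illustrative; they are not taken from the paper, and the abstract does not specify the authors' evaluation script.

```python
from typing import Dict, Tuple


def micro_f1(counts: Dict[str, Tuple[int, int, int]]) -> float:
    """Micro-averaged F1 from per-entity-type (tp, fp, fn) counts.

    Counts are pooled over all entity types first, so frequent
    types weigh more than rare ones.
    """
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    if tp == 0:
        # No correct predictions: F1 is defined as 0 here.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for illustration only (not results from the paper).
example_counts = {
    "Protein": (80, 10, 10),   # (tp, fp, fn)
    "Chemical": (40, 20, 10),
}
score = micro_f1(example_counts)
print(f"micro-F1 = {score:.4f}")
```

Under this definition, the reported average would correspond to computing one such score per test set and taking the mean over the ten test sets, assuming a plain arithmetic mean.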


2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 07-10 November 2022, Chiang Mai, Thailand.

  • Post-print

© 2022 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
