Singapore Institute of Technology
A Modular OCR and NLP Pipeline for Digitizing Calibration Reports in Metrology

conference contribution
posted on 2025-10-16, 13:23 authored by Kah Kian Fong, Nicholas Heng Loong Wong
Calibration reports are essential for traceability and quality assurance in metrology. Yet, their common storage as scanned PDFs severely limits digital usability. This paper introduces a modular optical character recognition (OCR) and natural language processing (NLP) pipeline to digitize and structure such reports into JSON format. The pipeline integrates layout-aware text detection using PaddleOCR, robust table extraction via PP-StructureV2 with TableMaster, and domain-specific named-entity recognition using SciBERT fine-tuned on calibration-specific terminology. Tested on real-world calibration reports for single-axis and roundness measuring machines, the pipeline achieved a median processing time of 95 seconds per report, a 48% reduction compared to manual transcription. A Flask-based front end enables data verification, while a MongoDB database supports flexible querying and trend analysis. These features collectively deliver quantifiable improvements in processing speed, structured data quality, and traceability for metrology operations.
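The modular structure described in the abstract (text detection, then table extraction, then entity recognition, then JSON output) could be organized roughly as in the sketch below. This is an illustration under stated assumptions, not the authors' implementation: the `Report` class, the stage functions, and `run_pipeline` are hypothetical names, the stage bodies are stand-ins that do not call PaddleOCR, PP-StructureV2, or SciBERT, and the sample strings and numbers are invented placeholders rather than data from the paper.

```python
import json
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Report:
    """Hypothetical intermediate state handed from one stage to the next."""
    source: str
    text_lines: list = field(default_factory=list)
    tables: list = field(default_factory=list)
    entities: dict = field(default_factory=dict)


# Each stage takes a Report and returns it, so stages can be swapped freely.
Stage = Callable[[Report], Report]


def detect_text(report: Report) -> Report:
    # Stand-in for the OCR stage; a real stage would run a text detector
    # on the scanned PDF. The lines below are made-up placeholder text.
    report.text_lines = [
        "Certificate No: CAL-0001",
        "Instrument: Roundness measuring machine",
    ]
    return report


def extract_tables(report: Report) -> Report:
    # Stand-in for the table-extraction stage; values are invented.
    report.tables = [[["Nominal (mm)", "Measured (mm)"], ["10.000", "10.002"]]]
    return report


def tag_entities(report: Report) -> Report:
    # Stand-in for named-entity recognition: a naive "key: value" split,
    # used here only so the sketch runs without a trained model.
    for line in report.text_lines:
        key, sep, value = line.partition(":")
        if sep:
            report.entities[key.strip()] = value.strip()
    return report


def run_pipeline(source: str, stages: List[Stage]) -> str:
    """Run the stages in order and serialize the result as JSON."""
    report = Report(source=source)
    for stage in stages:
        report = stage(report)
    return json.dumps(
        {
            "source": report.source,
            "entities": report.entities,
            "tables": report.tables,
        },
        indent=2,
    )


if __name__ == "__main__":
    print(run_pipeline("report.pdf", [detect_text, extract_tables, tag_entities]))
```

Keeping every stage behind the same `Report -> Report` signature is one way to get the modularity the abstract refers to: any stage can be replaced or tested in isolation without touching the others.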

Journal/Conference/Book title

IEEE International Conference on Service Operations and Logistics, and Informatics (SOLI)

Publication date

2025-09-29

Version

  • Pre-print

Corresponding author

nicholas.wong@singaporetech.edu.sg