Singapore Institute of Technology

ATTN: towards practical automated tabular semantic analysis

conference contribution
posted on 2024-09-30, 02:52 authored by Kan Chen, Teck Wei Low, Alex Q. Chen

In many organisations, legal restrictions mean that documents are accessible only as printed copies. Digitising these documents poses several challenges, such as overlapping text and cancellations from manual editing, varying layouts, low contrast, physical damage, and the high cost of cloud-based (e.g., AWS) bulk processing. This paper introduces a low-cost, practical method for analysing tabular semantics in printed document digitisation. We propose to first extract the text labels, then the text values and table structure semantics, and finally refine the extraction. Our method leverages fuzzy matching and spatial hashing to facilitate the extraction. The results show that our method is effective and efficient, costing less than 1 cent per page on AWS.
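The abstract names fuzzy matching and spatial hashing but gives no implementation details, so the following is only an illustrative sketch of how the two techniques might combine, not the authors' method. It assumes OCR output arrives as words with pixel coordinates. The names `fuzzy_match`, `build_grid`, `nearby`, the cell size, and the similarity threshold are all hypothetical, and Python's standard `difflib` stands in for whatever matcher the paper uses.

```python
from collections import defaultdict
from difflib import SequenceMatcher

CELL = 50  # grid cell size in pixels (assumed value, not from the paper)


def fuzzy_match(text, labels, threshold=0.8):
    """Return the known label most similar to the OCR text, or None."""
    best, best_score = None, threshold
    for label in labels:
        score = SequenceMatcher(None, text.lower(), label.lower()).ratio()
        if score >= best_score:
            best, best_score = label, score
    return best


def build_grid(words):
    """Spatial hash: bucket OCR words by the grid cell holding their position."""
    grid = defaultdict(list)
    for word in words:
        grid[(word["x"] // CELL, word["y"] // CELL)].append(word)
    return grid


def nearby(grid, x, y):
    """Return the words in the 3x3 block of cells around (x, y)."""
    cx, cy = x // CELL, y // CELL
    return [w for dx in (-1, 0, 1) for dy in (-1, 0, 1)
            for w in grid.get((cx + dx, cy + dy), [])]


# Hypothetical OCR output: a misread label, its value, and a distant label.
words = [
    {"text": "Invoce No", "x": 40, "y": 100},
    {"text": "A-1029", "x": 90, "y": 102},
    {"text": "Total", "x": 40, "y": 700},
]
labels = ["Invoice No", "Total"]
grid = build_grid(words)
label = fuzzy_match(words[0]["text"], labels)
candidates = [w["text"] for w in nearby(grid, 40, 100)]
```

In this sketch the fuzzy step tolerates OCR misspellings when recognising a label, and the grid lookup restricts the search for the label's value to words in neighbouring cells rather than scanning the whole page.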

History

Journal/Conference/Book title

International Workshop on Advanced Imaging Technology (IWAIT) 2024, 2024, Langkawi, Malaysia

Publication date

2024-05-02

Version

  • Published

Rights statement

Copyright 2024 Society of Photo‑Optical Instrumentation Engineers (SPIE). One print or electronic copy may be made for personal use only. Systematic reproduction and distribution, duplication of any material in this publication for a fee or for commercial purposes, and modification of the contents of the publication are prohibited.
