ATTN: towards practical automated tabular semantic analysis
In many organisations, documents are available only as printed copies due to legal restrictions. Digitising these documents poses several challenges, such as overlapping text and cancellations caused by manual editing, varying layouts, low contrast, physical damage, and the high cost of cloud-based (e.g., AWS) bulk processing. This paper introduces a low-cost, practical method for analysing tabular semantics in printed document digitisation. We propose to first extract the text labels, followed by the text values and table structure semantics, and then refine the extraction. Our method leverages fuzzy matching and spatial hashing to facilitate the extraction. The results show that our method is effective and efficient, at a cost of less than 1 cent per page on AWS.
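The abstract names fuzzy matching and spatial hashing but gives no implementation details, so the following is only a minimal sketch of how the two techniques might combine to pair labels with nearby values. It is not the authors' implementation. The word format `(text, x, y)`, the cell size, the similarity threshold, the nearest-neighbour pairing rule, and all function names are assumptions made for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher

CELL = 50  # assumed grid cell size in pixels (hypothetical value)


def fuzzy_label(text, known_labels, threshold=0.8):
    """Return the known label most similar to `text`, or None if below threshold."""
    best, best_score = None, 0.0
    for label in known_labels:
        score = SequenceMatcher(None, text.lower(), label.lower()).ratio()
        if score > best_score:
            best, best_score = label, score
    return best if best_score >= threshold else None


def build_grid(words, cell=CELL):
    """Spatial hash: bucket each (text, x, y) word by its grid cell."""
    grid = defaultdict(list)
    for word in words:
        _, x, y = word
        grid[(int(x // cell), int(y // cell))].append(word)
    return grid


def nearby(grid, x, y, cell=CELL):
    """Yield words in the cell containing (x, y) and its 8 neighbouring cells."""
    cx, cy = int(x // cell), int(y // cell)
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            yield from grid.get((cx + dx, cy + dy), [])


def pair_labels_with_values(words, known_labels):
    """Map each fuzzily recognised label to its nearest non-label neighbour."""
    grid = build_grid(words)
    pairs = {}
    for text, x, y in words:
        label = fuzzy_label(text, known_labels)
        if label is None:
            continue  # this word is not a label
        # Only words in adjacent cells are considered, so no full scan is needed.
        candidates = [w for w in nearby(grid, x, y)
                      if fuzzy_label(w[0], known_labels) is None]
        if candidates:
            nearest = min(candidates,
                          key=lambda w: (w[1] - x) ** 2 + (w[2] - y) ** 2)
            pairs[label] = nearest[0]
    return pairs


# Hypothetical OCR output: (text, x, y), with typical recognition noise.
words = [
    ("lnvoice No", 10, 10), ("A-1042", 70, 12),
    ("Date:", 10, 40), ("2024-05-02", 72, 41),
    ("Total", 400, 400),
]
result = pair_labels_with_values(words, ["Invoice No", "Date"])
```

In this sketch, fuzzy matching tolerates OCR noise in the label text, while the grid limits each value lookup to words in adjacent cells instead of the whole page.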
History
Journal/Conference/Book title
International Workshop on Advanced Imaging Technology (IWAIT) 2024, 2024, Langkawi, Malaysia
Publication date
2024-05-02
Version
Published