Our client was an Austrian bank that utilized its proprietary accounting software. Their database contained extensive private client data, including information related to the energy parameters of their clients' buildings.
These parameters needed to be extracted from energy certificates, typically presented as PDF documents with a standardized format. Our team developed an application designed to extract values for a predefined set of parameters from these PDF energy certificates.
Austrian Bank Energy Certificates PDFs Extraction
The Business and Technical Challenges:
To address this challenge, we leveraged state-of-the-art machine learning techniques to ensure high-quality Optical Character Recognition (OCR). The primary technical challenge stemmed from the variability in energy certificates. These certificates could have different numbers of pages, contain additional data, feature similar fields, include high-quality text layers, or present scanned images. Our goal was to identify and extract critical field/value pairs from these energy certificate PDFs and store them in a structured format.
Our task was to develop a backend application responsible for the extraction of vital information from these energy certificates.
To tackle this problem, we designed a cloud-based application using an event-driven architecture. The core service responsible for text recognition within PDF documents was Amazon Textract. It detected text regions and extracted the text content. Subsequently, a custom application with tailored logic was employed to identify the desired field/value pairs and save them to a database.
The general event-driven workflow included the following steps:
1. The energy certificate (PDF document) was uploaded to an input S3 bucket.
2. AWS Lambda sent the PDF document to Amazon Textract for essential text recognition. Upon completion, Amazon Textract triggered a notification to an Amazon SNS topic.
3. Amazon SNS relayed a message to an Amazon SQS queue.
4. The message in the queue initiated an AWS Lambda function that analyzed the recognition results, extracted the required information, and saved it to the output S3 bucket.
The Tech Stack Used in the Project:
Python for backend application code
React Native JS for frontend application
AWS services as a cloud platform
Amazon Textract as a OCR service
We successfully developed a cloud-based application capable of accurately recognizing and extracting field/value pairs from energy certificates. The application adopted a modern event-driven approach, ensuring scalability and high accuracy in data extraction.
Our Clients Say:
We are very happy with the actual results of this project, especially when using UI a user can see the PDF as well as the extracted data. It demonstrated that we can achieve great results in a short time with the DataEngi team and AWS-Textract.