
If AWS Textract itself makes a mistake by extracting the same table into different columns, what should be done? I want to merge all the tables to a single table in CSV.


I am developing a Bank Statement PDF Converter that converts PDF files into CSV format. I have resolved many bugs in the output and am nearing completion.

However, during testing with sample files, I discovered an issue: AWS Textract extracts one table as a 5-column table, while a similar table on the next page is extracted as a 6-column table.

This inconsistency in the number of columns makes it challenging to merge all the tables, resulting in an improper CSV output.

How can I fix this? I need to resolve it before I can deploy the project to my server.

asked a year ago · 275 views
1 Answer

When dealing with inconsistencies in table extraction by Amazon Textract, such as extracting similar tables with different numbers of columns, there are a few approaches you can consider:

  1. Use text detection as a workaround: The Amazon Textract documentation recommends text detection (the DetectDocumentText API) instead of table extraction when a table has merged cells that span multiple columns, or cells, rows, or columns that differ from other parts of the same table.

  2. Post-processing: Implement a post-processing step in your code to normalize the extracted data. This could involve:

    • Analyzing the structure of all extracted tables
    • Identifying the most common column structure
    • Mapping the inconsistent tables to match the common structure

  3. Custom parsing: Instead of relying solely on Textract's table extraction, you could use the raw text output (LINE and WORD blocks with their bounding boxes) and implement your own table parsing logic. This gives you more control over how the data is structured.

  4. Confidence scores: Utilize the confidence scores provided by Amazon Textract to identify potentially problematic extractions. You may need to manually review or adjust extractions with lower confidence scores.

  5. Pre-processing: If possible, try to standardize the input PDFs before processing them with Textract. This might involve converting the PDFs to a consistent format or resolution.

  6. Iterative approach: Process the document multiple times with different settings or APIs (e.g., using both table extraction and text detection) and compare the results to find the most consistent output.
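As a rough illustration of the post-processing idea (approach 2), here is a minimal sketch that picks the most common column count, takes the first table with that count as a reference, and reassigns every cell of every table to the reference column it overlaps most horizontally. It assumes the tables share the same page layout, which is typical for bank statements but not guaranteed. The `Cell` class and the `column_spans`, `realign`, and `merge_tables` helpers are hypothetical names for this example; the fields mirror the `RowIndex`, `ColumnIndex`, and `Geometry.BoundingBox` values that Textract returns on CELL blocks.

```python
import csv
import io
from collections import Counter
from dataclasses import dataclass


@dataclass
class Cell:
    """Simplified stand-in for a Textract CELL block (hypothetical helper type)."""
    row: int      # 1-based row index (Textract: RowIndex)
    col: int      # 1-based column index (Textract: ColumnIndex)
    left: float   # left edge as a fraction of page width (BoundingBox.Left)
    width: float  # width as a fraction of page width (BoundingBox.Width)
    text: str


def n_cols(table):
    """Number of columns Textract reported for this table."""
    return max(c.col for c in table)


def column_spans(table):
    """Return [(left, right), ...] per column, measured on a reference table."""
    spans = {}
    for c in table:
        lo, hi = spans.get(c.col, (c.left, c.left + c.width))
        spans[c.col] = (min(lo, c.left), max(hi, c.left + c.width))
    return [spans[k] for k in sorted(spans)]


def realign(table, spans):
    """Place every cell in the reference column it overlaps most."""
    n_rows = max(c.row for c in table)
    rows = [[""] * len(spans) for _ in range(n_rows)]
    for c in sorted(table, key=lambda cell: (cell.row, cell.left)):
        right = c.left + c.width
        # Horizontal overlap with each reference column; a negative value is
        # the size of the gap, so max() still picks the closest column.
        overlaps = [min(hi, right) - max(lo, c.left) for lo, hi in spans]
        target = overlaps.index(max(overlaps))
        # Cells that land in the same column are joined left to right.
        rows[c.row - 1][target] = (rows[c.row - 1][target] + " " + c.text).strip()
    return rows


def merge_tables(tables):
    """Merge all tables into one CSV string with a consistent column count."""
    width = Counter(n_cols(t) for t in tables).most_common(1)[0][0]
    reference = next(t for t in tables if n_cols(t) == width)
    spans = column_spans(reference)
    out = io.StringIO()
    writer = csv.writer(out)
    for t in tables:
        writer.writerows(realign(t, spans))
    return out.getvalue()
```

Matching on position rather than on cell contents is a deliberate choice here: in a bank statement the debit and credit columns are rarely filled in the same row, so a content-based heuristic could wrongly merge them. Building `Cell` objects from a real response is left out; the cell text would have to be assembled from each CELL block's child WORD blocks.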

Remember that while these approaches can help, they may require additional development effort and testing to ensure accuracy across various document types. It's important to thoroughly test your solution with a diverse set of sample documents to ensure it handles different scenarios effectively before deploying your project.
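When testing across sample documents, it can help to see which tables deviate before merging anything. The sketch below summarises each TABLE block in a Textract response: its page, the number of columns Textract reported, and the lowest cell confidence. It works on the `Blocks` list from a table-analysis response and uses only the `BlockType`, `Id`, `Page`, `Confidence`, `ColumnIndex`, and `Relationships` fields. The function name `table_report` and the threshold of 90 are assumptions made for this example, not recommended values.

```python
def table_report(blocks, min_confidence=90.0):
    """Summarise each TABLE block: page, column count, weakest cell confidence."""
    by_id = {b["Id"]: b for b in blocks}
    report = []
    for table in (b for b in blocks if b["BlockType"] == "TABLE"):
        # Collect the ids of all child blocks of this table.
        child_ids = [
            block_id
            for rel in table.get("Relationships", [])
            if rel["Type"] == "CHILD"
            for block_id in rel["Ids"]
        ]
        cells = [
            by_id[i]
            for i in child_ids
            if i in by_id and by_id[i]["BlockType"] == "CELL"
        ]
        if not cells:
            continue
        lowest = min(c["Confidence"] for c in cells)
        report.append({
            "page": table.get("Page", 1),  # Page may be absent on single-page calls
            "columns": max(c["ColumnIndex"] for c in cells),
            "lowest_confidence": lowest,
            "needs_review": lowest < min_confidence,  # arbitrary example threshold
        })
    return report
```

Tables whose `columns` value differs from the majority, or whose `needs_review` flag is set, are the ones worth inspecting by hand or routing through a normalization step.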

Sources
Best Practices - Amazon Textract
Handling Connection Errors - Amazon Textract
Tables - Amazon Textract

answered a year ago
EXPERT
reviewed a year ago
