
Issue with Textract Converting Decimal Points to Commas in Extracted Data.


Hello,

I am using Amazon Textract to extract tabular data from PDFs. The tables primarily contain numerical values with decimals. However, I have noticed that in some cases Textract incorrectly converts decimal points (.) into commas (,), which causes miscalculations when I process the extracted data.

For example, the actual value in the table is "14.5%", but Textract extracts it as "14,5%". When converting this to a float, it results in 145 instead of 14.5, leading to incorrect calculations.

Has anyone else encountered this issue? Are there any recommended solutions or workarounds to ensure decimal points are correctly retained during extraction?

Thanks!

[Images: actual table in PDF; extracted table using Textract]

asked 8 months ago, 102 views
1 Answer

Amazon Textract reading a decimal point as a comma is a character-recognition error, and it matters most for numerical data in tables, where a single misread character changes the value and can lead to miscalculations downstream. Here are some potential solutions and workarounds:

  1. Post-processing: Add a post-processing step that corrects the separator after extraction but before any calculations are performed. For example, a function can search for patterns like "number,number%" and replace the comma with a decimal point. Keep the pattern narrow enough that a genuine thousands separator (such as "1,234") is not rewritten by mistake.

  2. Custom parsing: Instead of relying solely on Textract's table extraction, you could use plain text detection (the DetectDocumentText API, which returns LINE and WORD blocks) and implement your own parsing logic. This gives you more control over how the data is interpreted and structured, allowing you to handle decimal points correctly.

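  3. Locale settings: Check whether the issue is related to locale handling. Some regions use the comma as the decimal separator, so "14,5" can be a legitimate rendering of 14.5. As far as I know, Textract's analysis APIs do not take a locale parameter, so the place to check is your own processing code: a result of 145 suggests the comma is being stripped as a thousands separator before the float conversion. Make sure that step uses the number format your documents actually follow.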

  4. Pre-processing: If possible, try to standardize the input PDFs before processing them with Textract. This might involve converting the PDFs to a consistent format or adjusting the way decimal points are represented in the source documents.

  5. Use confidence scores: Each block Textract returns carries a Confidence value between 0 and 100. Use it to identify potentially problematic extractions, and manually review or adjust numerical values with lower scores. Keep in mind that a misread character can still receive a high score, so confidence works best alongside a format check rather than on its own.

  6. Alternative representation: If feasible, consider modifying how decimal numbers are represented in your source documents. For example, using a different format or font for decimal points might improve Textract's accuracy in recognizing them correctly.

  7. Feedback to AWS: It's valuable to provide feedback to Amazon about this limitation. While it won't provide an immediate solution, it can help prioritize improvements to the service in future updates.
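For the post-processing approach in item 1, here is a minimal sketch in Python. The function names `normalize_decimal_comma` and `parse_percent` are hypothetical, not part of any AWS SDK. The rule itself is an assumption about your data: a single comma followed by one or two digits is treated as a decimal separator, while a comma followed by three digits (such as "1,234") is left untouched because it may be a thousands separator. Adjust the pattern if your tables use more decimal places.

```python
import re

# Assumption: a lone comma followed by 1-2 digits is a misread decimal point,
# e.g. "14,5%" or "-0,75". Three digits after the comma are NOT matched,
# because "1,234" could be a thousands separator.
_DECIMAL_COMMA = re.compile(r"^([+-]?\d+),(\d{1,2})(\s*%?)$")


def normalize_decimal_comma(text: str) -> str:
    """Replace a decimal comma with a point; return other strings unchanged."""
    match = _DECIMAL_COMMA.match(text.strip())
    if match is None:
        return text
    whole, fraction, suffix = match.groups()
    return f"{whole}.{fraction}{suffix}"


def parse_percent(text: str) -> float:
    """Convert a cell such as "14,5%" or "14.5%" to a float (14.5).

    Raises ValueError for anything that still is not a plain number,
    so ambiguous cells fail loudly instead of producing a wrong value.
    """
    cleaned = normalize_decimal_comma(text).strip().rstrip("%").strip()
    return float(cleaned)


if __name__ == "__main__":
    for cell in ["14,5%", "14.5%", "0,75"]:
        print(cell, "->", parse_percent(cell))
```

Letting `float()` raise on ambiguous input such as "1,234" is a deliberate choice in this sketch: those cells are better routed to manual review than silently converted.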

Remember to thoroughly test any solution you implement with a diverse set of sample documents to ensure it handles different scenarios effectively. You may need to combine multiple approaches to achieve the most accurate results for your specific use case.
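As one way of combining approaches, the sketch below pairs a format check with the confidence scores from item 5. It assumes you already have the `Blocks` list from a Textract response, where WORD blocks carry `BlockType`, `Text`, and `Confidence` fields. The function name `flag_suspect_numbers`, the threshold of 95, and the sample blocks are all illustrative assumptions, not recommended values or real output.

```python
import re

# Loose pattern for tokens that look numeric: digits with optional
# separators and an optional trailing percent sign.
NUMERIC_LIKE = re.compile(r"^[+-]?\d[\d.,]*\s*%?$")


def flag_suspect_numbers(blocks, min_confidence=95.0):
    """Return numeric-looking WORD blocks that deserve a manual review.

    A word is flagged if it contains a comma (a possible misread decimal
    point) or if its confidence is below min_confidence (an assumed cutoff).
    """
    suspects = []
    for block in blocks:
        if block.get("BlockType") != "WORD":
            continue
        text = block.get("Text", "")
        if not NUMERIC_LIKE.match(text):
            continue
        reasons = []
        if "," in text:
            reasons.append("contains comma")
        if block.get("Confidence", 0.0) < min_confidence:
            reasons.append("low confidence")
        if reasons:
            suspects.append({"text": text, "reasons": reasons})
    return suspects


if __name__ == "__main__":
    # Hypothetical blocks shaped like a Textract response, for illustration only.
    sample_blocks = [
        {"BlockType": "WORD", "Text": "14,5%", "Confidence": 99.1},
        {"BlockType": "WORD", "Text": "12.0%", "Confidence": 80.0},
        {"BlockType": "WORD", "Text": "7.25", "Confidence": 99.9},
    ]
    for item in flag_suspect_numbers(sample_blocks):
        print(item)
```

Flagging every comma regardless of confidence reflects the fact that the misread value in the question would not necessarily come back with a low score.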

If the issue persists despite these measures, you might want to consider using additional OCR tools in conjunction with Textract or exploring third-party libraries specialized in handling numerical data extraction from PDFs.


answered 8 months ago
