Amazon Textract truncating data in table cells is not common behavior, and there are a few ways to address the problem:
- Use the TABLES feature type: Instead of using only the "FORMS" feature type, try including "TABLES" in your FeatureTypes list. This might provide better accuracy for table extraction:
  FeatureTypes=["FORMS", "TABLES"]
- Check cell merging: Ensure that the cells in your PDF are not merged in a way that confuses Textract. Merged cells or unusual formatting can reduce extraction accuracy.
- Increase image quality: If possible, try increasing the resolution or quality of your PDF. Higher quality images generally lead to better text extraction.
- Use raw text extraction: If the issue persists, you could try the DetectDocumentText API instead, which extracts raw text without trying to interpret the document structure. This might capture all the text, which you could then parse yourself.
- Queries: If you're consistently having issues with specific fields, consider using Textract queries (the "QUERIES" feature type in AnalyzeDocument) to target those fields specifically.
- Check for hidden characters: Ensure there are no hidden characters or formatting issues in the PDF that might be interfering with the extraction.
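As a rough illustration of the TABLES and queries suggestions above, here is a minimal Python sketch. The helper names `build_analyze_request` and `extract_tables` are hypothetical (they are not part of boto3), and the code assumes the usual AnalyzeDocument response layout: a `Blocks` list in which TABLE blocks point to CELL blocks, and CELL blocks point to WORD blocks, through `CHILD` relationships. It only assembles the request arguments and reads a response dictionary; the actual service call appears as a comment.

```python
def build_analyze_request(document_bytes, queries=None):
    """Build keyword arguments for an AnalyzeDocument call (hypothetical helper)."""
    feature_types = ["FORMS", "TABLES"]
    request = {"Document": {"Bytes": document_bytes}, "FeatureTypes": feature_types}
    if queries:
        # Queries target specific fields with natural-language questions.
        feature_types.append("QUERIES")
        request["QueriesConfig"] = {"Queries": [{"Text": q} for q in queries]}
    return request


def extract_tables(response):
    """Return every table as a list of rows; each row is a list of cell strings."""
    blocks = {b["Id"]: b for b in response["Blocks"]}

    def child_ids(block):
        # Yield the IDs referenced by the block's CHILD relationships, if any.
        for rel in block.get("Relationships", []):
            if rel["Type"] == "CHILD":
                yield from rel["Ids"]

    tables = []
    for block in response["Blocks"]:
        if block["BlockType"] != "TABLE":
            continue
        cells = {}
        for cell_id in child_ids(block):
            cell = blocks[cell_id]
            if cell["BlockType"] != "CELL":
                continue
            words = [blocks[i]["Text"] for i in child_ids(cell)
                     if blocks[i]["BlockType"] == "WORD"]
            cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)
        n_rows = max((r for r, _ in cells), default=0)
        n_cols = max((c for _, c in cells), default=0)
        # RowIndex and ColumnIndex are 1-based; missing cells become empty strings.
        tables.append([[cells.get((r, c), "") for c in range(1, n_cols + 1)]
                       for r in range(1, n_rows + 1)])
    return tables


# Assumed usage with boto3 (not executed here):
#   import boto3
#   textract = boto3.client("textract")
#   response = textract.analyze_document(**build_analyze_request(pdf_bytes))
#   for table in extract_tables(response):
#       print(table)
```

Printing the reconstructed rows makes it easier to see whether the truncation is already present in the CELL blocks Textract returns or is introduced later by your own parsing code.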
If these suggestions don't resolve the issue, it would be worth contacting AWS support, as this behavior seems unusual for Textract, especially if it's consistently truncating the same field across multiple documents. They might be able to provide more specific guidance or identify if there's an underlying issue with the service for your particular use case.
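Before contacting support, it can help to pin down exactly which cells look truncated. The sketch below is one possible heuristic, not an official Textract feature: it compares cell text from AnalyzeDocument against the LINE blocks returned by DetectDocumentText and flags any cell whose text is a strict prefix of a longer line. The function names are hypothetical, and the prefix test can produce false positives (for example, a short cell that happens to begin an unrelated line), so treat the output as a list of candidates to check by eye.

```python
def line_texts(detect_response):
    """Collect the text of every LINE block from a DetectDocumentText response."""
    return [b["Text"] for b in detect_response["Blocks"] if b["BlockType"] == "LINE"]


def find_truncated_cells(cell_texts, lines):
    """Return (cell, line) pairs where the cell is a strict prefix of a raw line."""
    suspects = []
    for cell in cell_texts:
        cell = cell.strip()
        if not cell:
            continue  # empty cells cannot be matched meaningfully
        for line in lines:
            if line.startswith(cell) and len(line) > len(cell):
                suspects.append((cell, line))
                break  # one matching line per cell is enough to flag it
    return suspects
```

For example, with cell texts `["INV-20", "Paid"]` and raw lines `["INV-2024-0017", "Paid"]`, only the first cell is flagged, paired with the longer line it appears to have been cut from.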
Sources
Analyzing Documents - Amazon Textract
Detecting Text - Amazon Textract
Tables - Amazon Textract
Hi, thanks for sharing the feedback, and sorry for the inconvenience. Amazon Textract's prediction results depend heavily on document quality. Without access to the original document, I took a screenshot of your document, sent it to the Textract API, and was able to get all the text. Can you share your original document, or try improving the document quality before calling Textract? Thanks.