Dealing with large dimensioned, small-data PDFs in Textract

I am getting an INVALID_DOCUMENT_TYPE error when trying to process a PDF with Textract, even though the PDF is only 1 MB. However, the PDF is about 105" x 35", which I know is greater than the allowed quota limit. I had two primary questions:

  • Is there a way to get more expressive errors from Textract? It took me quite a while to trace the failure to the page size, as there seems to be only one overarching exception, UnsupportedDocumentException, for these types of errors, while there are many possible document quota issues.
  • Are there best practices for splitting up large PDFs within the Textract system? The file has a large amount of white space, which causes this mismatch between file size and page dimensions.
Zac Dan
asked a year ago, 230 views
1 Answer
Accepted Answer
  1. I understand that you would like to know whether you can get more detailed logs from Textract. Unfortunately, Textract logging is limited: what you are currently seeing is all the logging that is currently supported. You can find more information by checking the CloudTrail API calls, either manually in the CloudTrail console or by setting up logging with CloudWatch to view your CloudTrail logs [1]. That error usually happens when the document does not meet the criteria listed here [2], or when the document is corrupted or encoded incorrectly.

  2. For PDFs with more than 3000 pages, I recommend splitting the PDF into batches so that each batch falls within the acceptable page range. I have also provided an external link to PDF splitter code that you can implement [3]. As extra information, for images above 10 MB I recommend decreasing the resolution of the images until they are at or below the 10 MB mark. OpenCV is one tool you can use to achieve this.
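
As a rough illustration of the batching idea in point 2, the sketch below only computes which page ranges each batch would cover; it does not read or write any PDF. The function name `batch_page_ranges` and the constant `MAX_PAGES_PER_JOB` are hypothetical names chosen for this example, and the 3000-page figure is simply the one quoted above. The resulting ranges could then be handed to whatever PDF splitting tool you use, such as the one linked in [3].

```python
from typing import List, Tuple

# Assumption: the 3000-page figure quoted in the answer above.
MAX_PAGES_PER_JOB = 3000


def batch_page_ranges(
    total_pages: int, max_pages: int = MAX_PAGES_PER_JOB
) -> List[Tuple[int, int]]:
    """Return 1-based, inclusive (first_page, last_page) ranges,
    each covering at most max_pages pages."""
    if total_pages < 0:
        raise ValueError("total_pages must be non-negative")
    if max_pages < 1:
        raise ValueError("max_pages must be at least 1")
    ranges = []
    for first in range(1, total_pages + 1, max_pages):
        # The final batch may be shorter than max_pages.
        last = min(first + max_pages - 1, total_pages)
        ranges.append((first, last))
    return ranges


if __name__ == "__main__":
    for first, last in batch_page_ranges(7500):
        print(f"pages {first}-{last}")
```

Note that this only addresses the page-count criterion; a single page that is physically too large would still need to be cropped or rescaled separately.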

Resources: [1] https://docs.aws.amazon.com/textract/latest/dg/logging-using-cloudtrail.html

[2] https://docs.aws.amazon.com/textract/latest/dg/API_Document.html

[3] https://github.com/x4nth055/pythoncode-tutorials/tree/master/handling-pdf-files/split-pdf

AWS
answered 10 months ago
EXPERT
reviewed a month ago
  • Thanks for the note! Specifically, I wanted to see if there was a way to get more granularity on which of the criteria a document failed when it fails. For my documents, I found that it was their overall physical size, not the data size, but the error did not specify that. Appreciate your response.
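
Since the service error does not say which criterion failed, one possible workaround is a local pre-flight check before calling Textract. The sketch below is not part of Textract or any AWS SDK: `failed_criteria` is a hypothetical helper, and the limits are assumptions (a 40 inch maximum page side for PDFs, as I understand the Textract quotas page, and the 3000-page figure from the answer), so please verify them against the current documentation. It expects the page size in PDF points, which is what most PDF libraries report for a page's media box.

```python
from typing import List

POINTS_PER_INCH = 72.0        # PDF user-space units per inch
# Assumed limits; check the current Textract quotas before relying on them.
MAX_PDF_SIDE_INCHES = 40.0    # assumed maximum page width and height
MAX_PAGES = 3000              # page figure quoted in the accepted answer


def failed_criteria(width_pt: float, height_pt: float, page_count: int) -> List[str]:
    """Return human-readable descriptions of each assumed limit that is exceeded."""
    problems = []
    width_in = width_pt / POINTS_PER_INCH
    height_in = height_pt / POINTS_PER_INCH
    if width_in > MAX_PDF_SIDE_INCHES:
        problems.append(
            f"page width {width_in:.1f} in exceeds {MAX_PDF_SIDE_INCHES:.0f} in"
        )
    if height_in > MAX_PDF_SIDE_INCHES:
        problems.append(
            f"page height {height_in:.1f} in exceeds {MAX_PDF_SIDE_INCHES:.0f} in"
        )
    if page_count > MAX_PAGES:
        problems.append(f"page count {page_count} exceeds {MAX_PAGES}")
    return problems


if __name__ == "__main__":
    # The 105" x 35" single-page document described in the question.
    for problem in failed_criteria(105 * 72, 35 * 72, page_count=1):
        print(problem)
```

For the document in the question, this flags only the width, which matches the finding that physical size rather than data size was the cause.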
