Does Textract DetectDocumentText support PDF format?

I was checking the JavaScript @aws-sdk/client-textract documentation.

The DetectDocumentTextCommand docs page says it supports JPEG, PNG, PDF, or TIFF format.

But the DetectDocumentTextCommandInput docs page says it supports only JPEG or PNG format.

I tried the command with a PDF file, both as an S3 object and as a blob, and it throws UnsupportedDocumentException. I'm trying to figure out whether PDF is actually unsupported or whether this is a bug.

DetectDocumentTextCommand docs page: https://docs.aws.amazon.com/AWSJavaScriptSDK/v3/latest/clients/client-textract/classes/detectdocumenttextcommand.html

DetectDocumentTextCommandInput docs page: https://docs.aws.amazon.com/AWSJavaScriptSDK/v3/latest/clients/client-textract/interfaces/detectdocumenttextcommandinput.html

asked 2 years ago
648 views
2 Answers
Accepted Answer

Yes, the synchronous DetectDocumentText API does support PDF documents. However, the document must be at most one page long and no larger than 10 MB (source). These limits exist because the API is synchronous and is expected to return a result quickly. A multi-page PDF takes longer to process and can only be handled by the asynchronous StartDocumentTextDetection API.
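
To make the rule above concrete, here is a minimal TypeScript sketch that picks between the two APIs from a PDF's page count and size. The function name chooseTextractApi and the constants are hypothetical, not part of the SDK. The limits are the ones quoted in this answer (one page, 10 MB, taken here as 10 * 1024 * 1024 bytes, which is an assumption), so check the current Textract quotas before relying on them. The limits of the asynchronous API are not covered here.

```typescript
// Limits for the synchronous API as quoted in the answer above.
// Assumption: "10MB" is treated as 10 * 1024 * 1024 bytes.
const MAX_SYNC_PAGES = 1;
const MAX_SYNC_BYTES = 10 * 1024 * 1024;

type TextractApi = "DetectDocumentText" | "StartDocumentTextDetection";

// Hypothetical helper: returns the name of the Textract operation to use
// for a PDF with the given page count and size in bytes.
function chooseTextractApi(pageCount: number, sizeBytes: number): TextractApi {
  const fitsSync = pageCount <= MAX_SYNC_PAGES && sizeBytes <= MAX_SYNC_BYTES;
  return fitsSync ? "DetectDocumentText" : "StartDocumentTextDetection";
}
```

With this check in front of the call, a multi-page PDF would be routed to the asynchronous API instead of failing with UnsupportedDocumentException on the synchronous one.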

I agree that the documentation you linked in your question is unclear on this, so I will report it to the Textract documentation team and ask them to update it.

AWS
answered 2 years ago

Hi Moose, thanks for the clarification. I was trying a multi-page PDF, so no wonder it didn't work. I'll check out the async solution. Ideally I need a synchronous solution, though, so maybe I'll have to do it with Step Functions.

answered 2 years ago
  • I have seen customers use a Lambda function (or other compute) to split the document into individual pages, make one synchronous call per page, and merge the results afterwards. Just make sure you check the Textract quotas applied to your account, because you may see throttling errors if you hit a quota. Most quotas can be increased on request.
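
The split-and-merge approach described above could be sketched roughly as follows. This is an illustration under stated assumptions, not a reference implementation. Splitting the PDF into single-page documents is assumed to have happened already. The per-page call is passed in as a hypothetical detectPage function, which in real code would presumably wrap DetectDocumentTextCommand with Document: { Bytes } and collect the text of the returned LINE blocks. The names detectAllPages, maxRetries and baseDelayMs are made up for this example. Retries with exponential backoff are attempted only for the two throttling-related error names.

```typescript
// Hypothetical per-page detector: takes the bytes of a single-page document
// and resolves to the lines of text found on that page.
type DetectPage = (pageBytes: Uint8Array) => Promise<string[]>;

// Error names treated as "throttled, retry later".
const THROTTLE_ERRORS = ["ThrottlingException", "ProvisionedThroughputExceededException"];

// Calls detectPage once per page, in order, and merges the results into
// one array with one entry (a list of lines) per page.
async function detectAllPages(
  pages: Uint8Array[],
  detectPage: DetectPage,
  maxRetries = 3,
  baseDelayMs = 200,
): Promise<string[][]> {
  const results: string[][] = [];
  for (const page of pages) {
    let attempt = 0;
    for (;;) {
      try {
        results.push(await detectPage(page));
        break; // this page succeeded, move on to the next one
      } catch (err) {
        const name = (err as { name?: string }).name ?? "";
        const throttled = THROTTLE_ERRORS.includes(name);
        // Rethrow anything that is not throttling, or when retries run out.
        if (!throttled || attempt >= maxRetries) throw err;
        // Exponential backoff: baseDelayMs, then 2x, then 4x, ...
        const delayMs = baseDelayMs * 2 ** attempt;
        await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
        attempt++;
      }
    }
  }
  return results;
}
```

Pages are processed one at a time here to stay well under the quota. Running them in parallel would be faster but makes throttling more likely, so that trade-off depends on the limits of your account.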
