Does Textract DetectDocumentText support PDF format?


I was checking the JavaScript @aws-sdk/client-textract documentation.

The DetectDocumentTextCommand docs page says it supports JPEG, PNG, PDF, or TIFF format.

But the DetectDocumentTextCommandInput docs page says it only supports JPEG or PNG format.

I tried the command with a PDF file, both as an S3 object and as raw bytes (Blob), and it throws UnsupportedDocumentException in both cases. I'm trying to figure out whether PDF is simply not supported or whether there is a bug here.

DetectDocumentTextCommand docs page:

DetectDocumentTextCommandInput docs page:

asked a year ago, 377 views
2 Answers
Accepted Answer

Yes, the synchronous DetectDocumentText API does support PDF documents. However, the document must have at most 1 page and cannot be larger than 10 MB (source). These limits are in place because the API is synchronous and the result is expected to be returned quickly. A multi-page PDF document takes longer to process and can only be handled by the asynchronous StartDocumentTextDetection API.
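As a rough illustration of the limits above, here is a minimal sketch of a pre-check that picks an operation before calling Textract. The function name `chooseTextractOperation` and its input shape are hypothetical (they are not part of the SDK), and reading "10 MB" as 10 * 1024 * 1024 bytes is an assumption. The page count has to come from your own PDF tooling.

```javascript
// Limits taken from the answer above: at most 1 page and at most 10 MB
// for the synchronous API. Treating 10 MB as 10 * 1024 * 1024 bytes is
// an assumption of this sketch.
const SYNC_MAX_BYTES = 10 * 1024 * 1024;
const SYNC_MAX_PAGES = 1;

// Hypothetical helper (not an SDK function): returns the name of the
// Textract operation that fits a document of the given size and page count.
function chooseTextractOperation({ sizeBytes, pageCount }) {
  if (!Number.isInteger(pageCount) || pageCount < 1) {
    throw new RangeError("pageCount must be a positive integer");
  }
  if (!Number.isFinite(sizeBytes) || sizeBytes < 0) {
    throw new RangeError("sizeBytes must be a non-negative number");
  }
  const fitsSync = pageCount <= SYNC_MAX_PAGES && sizeBytes <= SYNC_MAX_BYTES;
  return fitsSync ? "DetectDocumentText" : "StartDocumentTextDetection";
}

// Example: a 3-page PDF has to go through the asynchronous API.
console.log(chooseTextractOperation({ sizeBytes: 250000, pageCount: 3 }));
```

Checking this up front avoids sending a multi-page PDF to the synchronous API only to get UnsupportedDocumentException back.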

I agree that the documentation you linked in your question is unclear on this, so I will report it to the Textract documentation team and ask them to update it.

answered a year ago
reviewed 9 days ago

Hi Moose, thanks for the clarification. I was trying it on a multi-page PDF, so no wonder it didn't work. I will check out the async solution. Ideally I need a synchronous solution, though, so maybe I'll have to do it with Step Functions.

answered a year ago
  • I have seen customers use a Lambda function (or other compute) to split the document into individual pages, make one synchronous call per page, and merge the results afterwards. Just make sure you check the Textract quotas applied to your account, because you may see a throttling error if you hit a quota. Most quotas can be increased on request.
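The split, call, and merge approach described in this comment could be sketched roughly as follows. This is only an illustration under stated assumptions. `detectPage` is a hypothetical callback that you would implement yourself, for example by sending a `DetectDocumentTextCommand` with `Document: { Bytes }` for one single-page document. Splitting the PDF into pages is left to a PDF library of your choice and is not shown. The helper names and the retry settings are likewise made up for this example.

```javascript
// Merge per-page responses into one flat list of blocks, rewriting each
// block's Page to the page's position in the original document.
function mergePageResponses(responses) {
  return responses.flatMap((response, index) =>
    (response.Blocks ?? []).map((block) => ({ ...block, Page: index + 1 }))
  );
}

// Retry a call with exponential backoff, but only for throttling errors.
async function withRetry(fn, maxAttempts, baseDelayMs) {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      const throttled =
        err != null &&
        (err.name === "ThrottlingException" ||
          err.name === "ProvisionedThroughputExceededException");
      if (!throttled || attempt >= maxAttempts) throw err;
      const delayMs = baseDelayMs * 2 ** (attempt - 1);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}

// pages: array of single-page documents (e.g. Uint8Array per page).
// detectPage: hypothetical callback (pageBytes) => Promise<{ Blocks: [...] }>.
// Pages are processed one at a time to keep the request rate low.
async function detectAllPages(pages, detectPage, options = {}) {
  const { maxAttempts = 3, baseDelayMs = 200 } = options;
  const responses = [];
  for (const pageBytes of pages) {
    const response = await withRetry(
      () => detectPage(pageBytes),
      maxAttempts,
      baseDelayMs
    );
    responses.push(response);
  }
  return mergePageResponses(responses);
}
```

Processing pages sequentially is the most conservative choice with respect to quotas. If your quota allows it, a small fixed number of parallel calls would shorten the total time at the cost of a higher chance of throttling.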
