Does Textract DetectDocumentText support PDF format?

0

I was checking the JavaScript @aws-sdk/client-textract documentation.

The DetectDocumentTextCommand docs page says the operation supports JPEG, PNG, PDF, or TIFF format.

But the DetectDocumentTextCommandInput docs page says it only supports JPEG or PNG format.

I tried the command with a PDF file, both as an S3 object and as a blob of bytes, and it throws UnsupportedDocumentException in both cases. I'm trying to figure out whether the operation really doesn't support PDF, or whether there is a bug here.

DetectDocumentTextCommand docs page: https://docs.aws.amazon.com/AWSJavaScriptSDK/v3/latest/clients/client-textract/classes/detectdocumenttextcommand.html

DetectDocumentTextCommandInput docs page: https://docs.aws.amazon.com/AWSJavaScriptSDK/v3/latest/clients/client-textract/interfaces/detectdocumenttextcommandinput.html

Austin
asked a year ago · 390 views
2 Answers
1
Accepted Answer

Yes, the synchronous DetectDocumentText API does support PDF documents. However, the document must have at most 1 page and cannot be larger than 10 MB (source). These limits are in place because the API is synchronous and the result is expected to be returned quickly. A multi-page PDF document takes longer to process and can only be handled by the asynchronous StartDocumentTextDetection API.

I agree that the documentation you linked in your question is unclear on this, so I will report it to the Textract documentation team and ask to have it updated.
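To make the rule above concrete, here is a minimal sketch of a pre-check that picks between the two operations before calling Textract. The helper name `chooseTextractOperation` is hypothetical, the page count has to come from your own PDF tooling, and treating "10 MB" as 10 × 1024 × 1024 bytes and sending oversized single-page files to the asynchronous API are both assumptions on my part, not something stated in the answer.

```typescript
// Limits quoted in the answer for the synchronous DetectDocumentText API.
// Assumption: "10 MB" is interpreted here as 10 * 1024 * 1024 bytes.
const SYNC_MAX_BYTES = 10 * 1024 * 1024;
const SYNC_MAX_PAGES = 1;

type TextractOperation = "DetectDocumentText" | "StartDocumentTextDetection";

// Hypothetical helper: decides which Textract operation fits a document.
// pageCount must be determined by the caller (Textract is not consulted here).
function chooseTextractOperation(sizeBytes: number, pageCount: number): TextractOperation {
  const fitsSync = pageCount <= SYNC_MAX_PAGES && sizeBytes <= SYNC_MAX_BYTES;
  // Assumption: anything that does not fit the synchronous limits goes async.
  return fitsSync ? "DetectDocumentText" : "StartDocumentTextDetection";
}
```

A multi-page PDF like the one in the question would come back as `"StartDocumentTextDetection"`, which matches the UnsupportedDocumentException seen when it was sent to the synchronous call.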

AWS
S_Moose
answered a year ago
EXPERT
reviewed 16 days ago
0

Hi Moose, thanks for the clarification. I was trying it on a multi-page PDF, so no wonder it didn't work. I will check out the async solution. Ideally I need a synchronous solution, though, so maybe I will have to do it with Step Functions.

Austin
answered a year ago
  • I have seen customers use a Lambda function (or other compute) to split the document into individual pages, make one synchronous call per page, and merge the results together afterwards. Just make sure you check the Textract quotas applied to your account, because you may see a throttling error if you hit a quota. Most quotas can be increased on request.
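A rough sketch of that per-page approach, under a few assumptions: the PDF has already been split into single-page byte arrays by some other tool, and `detectPage` is a hypothetical callback that would wrap one `DetectDocumentTextCommand` call and return the detected lines. The function name, retry count, and backoff delays are illustrative choices; the two error names are the ones Textract uses for rate limiting.

```typescript
// Hypothetical callback: one synchronous Textract call for a single-page document.
// In real use it would wrap DetectDocumentTextCommand and return the text lines.
type DetectPage = (page: Uint8Array) => Promise<string[]>;

// Error names Textract reports when a quota or rate limit is hit.
const THROTTLE_ERRORS = new Set(["ThrottlingException", "ProvisionedThroughputExceededException"]);

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Calls detectPage once per page, in order, and merges the results page by page.
// Throttling errors are retried with exponential backoff; other errors propagate.
async function detectTextPerPage(
  pages: Uint8Array[],
  detectPage: DetectPage,
  maxRetries = 3,
  baseDelayMs = 200,
): Promise<string[][]> {
  const results: string[][] = [];
  for (const page of pages) {
    for (let attempt = 0; ; attempt++) {
      try {
        results.push(await detectPage(page));
        break; // this page succeeded, move on to the next one
      } catch (err) {
        const name = (err as { name?: string } | null)?.name ?? "";
        if (!THROTTLE_ERRORS.has(name) || attempt >= maxRetries) throw err;
        await sleep(baseDelayMs * 2 ** attempt);
      }
    }
  }
  return results;
}
```

Pages are processed one at a time here to stay well under the request rate quota; running them in parallel would finish sooner but makes throttling more likely.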
