Does Textract DetectDocumentText support the PDF format?

I was checking the JavaScript @aws-sdk/client-textract documentation.

The DetectDocumentTextCommand docs page says that it supports JPEG, PNG, PDF, or TIFF format.

But the DetectDocumentTextCommandInput docs page says that it supports only JPEG or PNG format.

I tried the command with a PDF file, both as an S3 object and as a blob, and it throws UnsupportedDocumentException. I'm trying to figure out whether PDF is simply not supported or whether this is a bug.

DetectDocumentTextCommand docs page: https://docs.aws.amazon.com/AWSJavaScriptSDK/v3/latest/clients/client-textract/classes/detectdocumenttextcommand.html

DetectDocumentTextCommandInput docs page: https://docs.aws.amazon.com/AWSJavaScriptSDK/v3/latest/clients/client-textract/interfaces/detectdocumenttextcommandinput.html

Austin
asked a year ago · 377 views
2 Answers
Accepted Answer

Yes, the synchronous DetectDocumentText API does support PDF documents. However, the document must have at most 1 page and cannot be larger than 10 MB (source). These limits exist because the API is synchronous and is expected to return a result quickly. A multi-page PDF document takes longer to process and can only be handled with the asynchronous StartDocumentTextDetection API.
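
As a rough illustration, the limits above can be turned into a small pre-flight check before calling Textract. This is only a sketch: `chooseTextDetectionApi` and `DocumentInfo` are hypothetical names that are not part of the AWS SDK, and the 1-page and 10 MB thresholds are simply the figures quoted in this answer (with 10 MB read as 10 × 1024 × 1024 bytes), so check them against the current Textract quotas.

```typescript
// Limits quoted in the answer for the synchronous API. These are assumed
// values for illustration; verify them against the current Textract quotas.
const SYNC_MAX_PAGES = 1;
const SYNC_MAX_BYTES = 10 * 1024 * 1024; // "10 MB", interpreted as 10 MiB

// Hypothetical description of a document; not an AWS SDK type.
interface DocumentInfo {
  pageCount: number;
  sizeBytes: number;
}

type TextDetectionApi = "DetectDocumentText" | "StartDocumentTextDetection";

// Pick the synchronous API only when the document fits both quoted limits.
// Anything else is routed to the asynchronous API. (The asynchronous API has
// limits of its own, which this sketch does not check.)
function chooseTextDetectionApi(doc: DocumentInfo): TextDetectionApi {
  const fitsSync =
    doc.pageCount <= SYNC_MAX_PAGES && doc.sizeBytes <= SYNC_MAX_BYTES;
  return fitsSync ? "DetectDocumentText" : "StartDocumentTextDetection";
}
```

For example, `chooseTextDetectionApi({ pageCount: 12, sizeBytes: 2000000 })` would pick the asynchronous API, which matches the multi-page case that fails with UnsupportedDocumentException on the synchronous call.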

I agree that the documentation you linked in your question is unclear on this, so I will report it to the Textract documentation team and ask them to update it.

AWS
S_Moose
answered a year ago
EXPERT
reviewed 9 days ago

Hi Moose, thanks for the clarification. I was trying it on a multi-page PDF, so no wonder it didn't work. I'll check out the async solution. Ideally I need a synchronous solution, though, so maybe I'll have to do it with Step Functions.

Austin
answered a year ago
  • I have seen customers use a Lambda function (or other compute) to split the document into individual pages, make one synchronous call per page, and merge the results together afterwards. Just make sure you check the Textract quotas applied to your account, because you may see a throttling error if you hit a quota. Most quotas can be increased on request.
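
The split, call, and merge approach described in the comment above could look roughly like the following TypeScript sketch. It makes several assumptions: the pages have already been split into single-page byte arrays by some other tool, `detectPage` is a hypothetical function you supply (in practice it might wrap a synchronous Textract call for one page and return its lines of text), and the error names treated as throttling are assumed and should be checked against what your client actually throws. Pages are processed one at a time, which is the gentlest option with respect to quotas.

```typescript
// Hypothetical per-page detector supplied by the caller. It receives the
// bytes of ONE single-page document and resolves to its lines of text.
type DetectPage = (page: Uint8Array) => Promise<string[]>;

// Error names treated as "throttled, try again later". Assumed for this
// sketch; confirm against the errors your Textract client really throws.
const RETRYABLE = new Set([
  "ThrottlingException",
  "ProvisionedThroughputExceededException",
]);

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Call detectPage, retrying with exponential backoff when throttled.
// Non-throttling errors, or running out of attempts, rethrow the error.
async function detectWithRetry(
  detectPage: DetectPage,
  page: Uint8Array,
  maxAttempts = 4,
  baseDelayMs = 200,
): Promise<string[]> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await detectPage(page);
    } catch (err) {
      const name = err instanceof Error ? err.name : "";
      if (!RETRYABLE.has(name) || attempt >= maxAttempts) throw err;
      await sleep(baseDelayMs * 2 ** (attempt - 1));
    }
  }
}

// Process the pages sequentially and merge the per-page lines back
// together in document order.
async function detectAllPages(
  pages: Uint8Array[],
  detectPage: DetectPage,
  baseDelayMs = 200,
): Promise<string[]> {
  const merged: string[] = [];
  for (const page of pages) {
    merged.push(...(await detectWithRetry(detectPage, page, 4, baseDelayMs)));
  }
  return merged;
}
```

Running pages in parallel would be faster, but it also makes it easier to hit the account quota, so a sequential loop with backoff is a conservative starting point.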
