Does Textract DetectDocumentText support PDF format?


I was checking the JavaScript @aws-sdk/client-textract documentation.

The DetectDocumentTextCommand docs page says the command supports JPEG, PNG, PDF, or TIFF format.

But the DetectDocumentTextCommandInput docs page says it supports only JPEG or PNG format.

I tried the command with a PDF file, both from S3 and as a Blob, and it throws UnsupportedDocumentException. I'm trying to figure out whether the command doesn't support PDF, or whether there is a bug here.

DetectDocumentTextCommand docs page: https://docs.aws.amazon.com/AWSJavaScriptSDK/v3/latest/clients/client-textract/classes/detectdocumenttextcommand.html

DetectDocumentTextCommandInput docs page: https://docs.aws.amazon.com/AWSJavaScriptSDK/v3/latest/clients/client-textract/interfaces/detectdocumenttextcommandinput.html

Austin
Asked 1 year ago, 377 views
2 answers
Accepted answer

Yes, the synchronous DetectDocumentText API does support PDF documents. However, the document must have at most 1 page and cannot be larger than 10 MB (source). These limits are in place because the API is synchronous and the result is expected to come back quickly. A multi-page PDF document takes longer to process and can only be handled with the asynchronous StartDocumentTextDetection API.

I agree that the documentation you linked in your question is unclear on this, so I will report it to the Textract documentation team and ask them to update it.
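To make the rule in this answer concrete, here is a minimal sketch of how you might route a document to the right operation. It assumes you already know the page count and byte size of the file in S3. `planTextDetection` and its parameter names are hypothetical, not part of the SDK; only the operation names and the `Document` / `DocumentLocation` input shapes come from Textract.

```javascript
// Limits for the synchronous API, as stated in the answer above.
const MAX_SYNC_PAGES = 1;
const MAX_SYNC_BYTES = 10 * 1024 * 1024; // 10 MB

/**
 * Hypothetical helper: pick the Textract operation for a document stored in S3
 * and build the matching request input.
 */
function planTextDetection({ bucket, key, pageCount, sizeBytes }) {
  const s3Object = { Bucket: bucket, Name: key };
  const fitsSync = pageCount <= MAX_SYNC_PAGES && sizeBytes <= MAX_SYNC_BYTES;

  if (fitsSync) {
    // Single page and small enough: the synchronous call returns the text directly.
    return {
      operation: "DetectDocumentText",
      input: { Document: { S3Object: s3Object } },
    };
  }

  // Otherwise start an asynchronous job and fetch the results later.
  return {
    operation: "StartDocumentTextDetection",
    input: { DocumentLocation: { S3Object: s3Object } },
  };
}
```

You would then pass `input` to `new DetectDocumentTextCommand(input)` or `new StartDocumentTextDetectionCommand(input)` from `@aws-sdk/client-textract`, depending on `operation`.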

AWS
S_Moose
Answered 1 year ago
Reviewed by an expert 9 days ago

Hi Moose, thanks for the clarification. I was trying it on a multi-page PDF, so no wonder it didn't work. I'll check out the async solution. Ideally I need a synchronous solution, though, so maybe I'll have to do it with Step Functions.

Austin
Answered 1 year ago
  • I have seen customers use a Lambda function (or other compute) to split the document into individual pages, make one synchronous call per page, and merge the results afterwards. Just make sure you check the Textract quotas applied to your account, because you may see a throttling error if you hit a quota. Most quotas can be increased on request.
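
A rough sketch of that per-page pattern, with a simple backoff for throttling errors, might look like the following. Splitting the PDF is not shown, since it needs a PDF library of your choice. `detectPage` is a hypothetical function you supply, for example a wrapper that sends one page's bytes with `DetectDocumentTextCommand` and returns the response; `detectTextPerPage` and `withRetry` are illustrative names, and the retry settings are arbitrary assumptions.

```javascript
// Error names Textract uses when a request is throttled.
const THROTTLE_ERRORS = new Set([
  "ThrottlingException",
  "ProvisionedThroughputExceededException",
]);

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Retry fn with exponential backoff, but only for throttling errors.
async function withRetry(fn, { maxAttempts = 5, baseDelayMs = 200 } = {}) {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      const retryable = THROTTLE_ERRORS.has(err && err.name);
      if (!retryable || attempt >= maxAttempts) throw err;
      await sleep(baseDelayMs * 2 ** (attempt - 1));
    }
  }
}

/**
 * pages: array of single-page documents (e.g. byte arrays), in page order.
 * detectPage: hypothetical async function (page) => Textract-style response
 *             containing a Blocks array.
 * Returns the LINE blocks of all pages, tagged with their page number.
 */
async function detectTextPerPage(pages, detectPage, retryOptions = {}) {
  const lines = [];
  // Pages are processed one at a time to stay gentle on account quotas.
  for (let i = 0; i < pages.length; i++) {
    const response = await withRetry(() => detectPage(pages[i]), retryOptions);
    for (const block of response.Blocks ?? []) {
      if (block.BlockType === "LINE") {
        lines.push({ page: i + 1, text: block.Text });
      }
    }
  }
  return lines;
}
```

Processing pages sequentially is the conservative choice here; if your quota allows it, you could run a few pages in parallel instead.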
