- Newest
- Most votes
- Most comments
Yes, the synchronous DetectText API does support PDF documents. However, the document must have maximum 1 page and cannot be larger than 10MB (source). These limits are in place because the API is synchronous and there is an expectation that the result will be returned quickly. A multi-page PDF document takes longer to process and can only be done with the asynchronous StartDocumentTextDetection API.
I agree that the documentation you link in your question is unclear on this, so I will report this to the Textract documentation team and ask to have this updated.
Hi Moose, thanks for the clarification. I was trying on a multi page PDF, no wonder it doesn't work. Will check out the async solution. But ideally I need a synchronous solution, mayb I have to do it with step functions.
Relevant content
- asked 2 years ago
- asked 6 years ago
- AWS OFFICIALUpdated 3 years ago
- AWS OFFICIALUpdated 3 months ago
- AWS OFFICIALUpdated 9 months ago
- AWS OFFICIALUpdated 2 years ago
- AWS OFFICIALUpdated 4 months ago
I have seen customers use a Lambda function (or other compute) to split the document into individual pages, then make several synchronous calls for each page, and merge the results together afterwards. Just make sure you check the quotas applied to your account for Textract, because you may see a throttling error if you hit the quota. Most quotas can be increased if you request it.