When you deal with paginated responses, you first need to fetch the entire JSON and combine the pages.
To get the full JSON, check out the Textract Caller (https://github.com/aws-samples/amazon-textract-textractor/tree/master/caller) and the get_full_json method (https://github.com/aws-samples/amazon-textract-textractor/blob/4b1e55426fc7fa623afcf210a2e3f5b51edc614c/caller/textractcaller/t_call.py#L256).
For a small number of pages and jobs, those work fine, but they poll the Textract Get* APIs. For larger numbers of pages and multiple concurrent jobs, please pass in the OutputConfig and use get_full_json_from_output_config after the notification from SNS; otherwise you may get throttled on the Textract Get* calls.
With that full JSON the order_blocks_by_geo should work just fine.
For large PDFs with thousands of pages, memory consumption will be high (in my experience, Python consumes about 5x the RAM that the file consumes on disk), so make sure you have enough RAM available.
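To illustrate what "combine" means for paginated responses, here is a minimal sketch. It is not the Textract Caller's implementation. It assumes each page is a dict with a "Blocks" list and an optional "NextToken", as in the Textract Get* responses. The names combine_pages and fetch_page are hypothetical and used only for illustration.

```python
from typing import Callable, Optional


def combine_pages(fetch_page: Callable[[Optional[str]], dict]) -> dict:
    """Follow NextToken until it runs out and merge all Blocks into one dict.

    fetch_page is a hypothetical callable: given a token (None for the
    first page), it returns one response page as a dict.
    """
    combined: Optional[dict] = None
    blocks: list = []
    token: Optional[str] = None
    while True:
        page = fetch_page(token)
        if combined is None:
            # Keep top-level fields (e.g. DocumentMetadata) from the first page only.
            combined = {
                key: value
                for key, value in page.items()
                if key not in ("Blocks", "NextToken")
            }
        # Append this page's blocks in the order they were returned.
        blocks.extend(page.get("Blocks", []))
        token = page.get("NextToken")
        if not token:
            break
    combined["Blocks"] = blocks
    return combined
```

With boto3, fetch_page could wrap a call such as get_document_analysis, passing NextToken only when a token is present. Note that this kind of loop polls the Get* APIs, so the throttling caveat above still applies for large or concurrent jobs.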
Hi @ddanger,
Thanks for reaching out to Textract. It would be helpful if you could share the sample document you are testing with via customer support. Our team can then have a look at the output and get back to you.
Thanks for using AWS Textract.
I am happy to share my sample document, but it looks like I need to sign up for at least developer support in order to share. Since I am experimenting prior to making a decision (for my company) to use Textract, I think I will simply solve this issue on my own. Please let me know if I have misunderstood and there is a way to share my PDF and get help without having to pay for support as I am just experimenting. Thank you kindly. Note: If I do find that Textract will work for my company, I will sign up for support and may pursue this at that time (unless there is a way to do so at this time without paying).
Don (ddanger)
Thanks, Martin. Your response makes sense and is most appreciated. It is exactly what I was looking for.
Don