Textract Value Error "No Block Id:"

0

Hi. I am experimenting with Amazon Textract to extract clinical data from PDF files. Specifically, I am using Python and experimenting with "order_blocks_by_geo" (from trp.t_pipeline). I notice that when the extracted text is split across two batches (where it is necessary to use 'NextToken' with get_document_text_detection), "order_blocks_by_geo" cannot find a specific LINE block that is referenced by a relationship in a PAGE block; in that case I get a "No Block Id: xxxxxxxx" Value Error, where xxxxxxxx is the child LINE block ID. How am I supposed to present data to order_blocks_by_geo so that it has all the necessary relationships, when the parent PAGE block can be in one get_document_text_detection response and the associated child LINE block in another? In my case, I have a small 10-page PDF split across two batches, where the PAGE block is in the first batch and the child LINE block is in the second batch (using NextToken). What happens when I send a large PDF of thousands of pages? Thank you for your consideration of this question.

ddanger
asked 2 years ago, 276 views
4 Answers
1
Accepted Answer

You need to first grab the entire JSON and combine it when you deal with paginated responses.

To get the full JSON, check out the Textract Caller (https://github.com/aws-samples/amazon-textract-textractor/tree/master/caller) and its get_full_json method (https://github.com/aws-samples/amazon-textract-textractor/blob/4b1e55426fc7fa623afcf210a2e3f5b51edc614c/caller/textractcaller/t_call.py#L256).
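If you want to see what "combining" means without the helper library, here is a minimal sketch of the idea. It is not the Textract Caller implementation. It assumes the job has already finished successfully, and the function name get_all_blocks is made up for illustration. The client is passed in so the function works with boto3.client("textract") or with a stub.

```python
from typing import Any, Dict, List


def get_all_blocks(textract_client: Any, job_id: str) -> Dict[str, Any]:
    """Follow NextToken until exhausted and merge all Blocks into one response.

    Hypothetical helper: assumes the job status is already SUCCEEDED.
    """
    blocks: List[Dict[str, Any]] = []
    kwargs: Dict[str, Any] = {"JobId": job_id}
    first_response: Dict[str, Any] = {}

    while True:
        response = textract_client.get_document_text_detection(**kwargs)
        if not first_response:
            first_response = response
        # Each batch carries only part of the document's blocks.
        blocks.extend(response.get("Blocks", []))

        next_token = response.get("NextToken")
        if not next_token:
            break
        kwargs["NextToken"] = next_token

    # Keep the metadata from the first batch, drop the paging token,
    # and attach the complete block list.
    merged = {
        key: value
        for key, value in first_response.items()
        if key not in ("Blocks", "NextToken")
    }
    merged["Blocks"] = blocks
    return merged
```

With boto3 installed this would be called as get_all_blocks(boto3.client("textract"), job_id), and the returned dictionary is what you would hand to the trp tooling instead of a single batch.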

For a small number of pages and jobs those work fine, but they poll against the Textract Get* APIs. For larger numbers of pages and multiple concurrent jobs, pass an OutputConfig when you start the job and call get_full_json_from_output_config after the notification from SNS arrives; otherwise you may get throttled on the Textract Get* calls.
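As a rough sketch of the OutputConfig route, the request for an asynchronous job could be assembled like this. The function name build_start_request and all bucket, key, and ARN values are placeholders I made up; the dictionary keys are the documented parameters of StartDocumentTextDetection.

```python
from typing import Any, Dict


def build_start_request(
    input_bucket: str,
    document_key: str,
    output_bucket: str,
    output_prefix: str,
    sns_topic_arn: str,
    sns_role_arn: str,
) -> Dict[str, Any]:
    """Build keyword arguments for start_document_text_detection.

    Hypothetical helper: it only assembles the request, it does not call AWS.
    """
    return {
        # Where Textract reads the PDF from.
        "DocumentLocation": {
            "S3Object": {"Bucket": input_bucket, "Name": document_key}
        },
        # Where Textract writes the result JSON, so it can be read from S3
        # instead of polling the Get* APIs.
        "OutputConfig": {"S3Bucket": output_bucket, "S3Prefix": output_prefix},
        # Where Textract announces that the job has completed.
        "NotificationChannel": {
            "SNSTopicArn": sns_topic_arn,
            "RoleArn": sns_role_arn,
        },
    }
```

You would then pass it on with boto3.client("textract").start_document_text_detection(**request) and wait for the SNS message before reading the output from the bucket and prefix you configured.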

With that full JSON the order_blocks_by_geo should work just fine.
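If you want to confirm that the combined JSON really is complete before ordering it, a quick check along these lines can help. It is only a sketch, and the name find_missing_child_ids is my own; it relies only on the Id and Relationships fields that Textract puts on each block. Any pair it reports is a child that would trigger the "No Block Id" error.

```python
from typing import Any, Dict, List, Tuple


def find_missing_child_ids(blocks: List[Dict[str, Any]]) -> List[Tuple[str, str]]:
    """Return (parent_id, child_id) pairs whose child block is not in `blocks`."""
    known_ids = {block["Id"] for block in blocks}
    missing: List[Tuple[str, str]] = []

    for block in blocks:
        # Relationships may be absent (or None) on leaf blocks such as WORD.
        for relationship in block.get("Relationships") or []:
            for child_id in relationship.get("Ids", []):
                if child_id not in known_ids:
                    missing.append((block["Id"], child_id))
    return missing
```

Run against a single batch of a paginated result, this should list the LINE IDs that live in a later batch; run against the fully combined block list, it should come back empty.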

For a large PDF with thousands of pages, memory consumption will be high (in my experience, Python uses about 5x the RAM that the file takes on disk), so make sure you have enough RAM available.

AWS
answered 2 years ago
0

Hi @ddanger,

Thanks for reaching out to Textract. It'd be helpful if you could share the sample document you are using to test with us via customer support. Our team can then have a look at the output and get back to you.

Thanks for using Amazon Textract.

answered 2 years ago
0

I am happy to share my sample document, but it looks like I need to sign up for at least developer support in order to share. Since I am experimenting prior to making a decision (for my company) to use Textract, I think I will simply solve this issue on my own. Please let me know if I have misunderstood and there is a way to share my PDF and get help without having to pay for support as I am just experimenting. Thank you kindly. Note: If I do find that Textract will work for my company, I will sign up for support and may pursue this at that time (unless there is a way to do so at this time without paying).

Don (ddanger)

ddanger
answered 2 years ago
0

Thanks, Martin. Your response makes sense and is most appreciated. It is exactly what I was looking for.

Don

ddanger
answered 2 years ago
