Textract/A2I Asynchronous Review


I have a Textract/A2I process set up, and it works as expected. However, I need to change the workflow and am looking for suggestions.

Context: we are using Textract/A2I to process historical documents for research purposes (i.e., digital humanities). The current process analyzes a subset of the documents with Textract and then presents them for human review in A2I/SageMaker. Because of the nature of these documents, typically image-quality issues, we have opted to have every document presented for review. We have over 10,000 documents, so we rely on volunteers (a private team) for this review. That takes time, and it means the outputs are held until the review is completed or SageMaker terminates the job for a server reboot.

One option that seems promising is a form of asynchronous review, where the Textract process is completed first and the SageMaker/A2I review then runs against the stored Textract output. I have not been able to find any documentation that describes this kind of asynchronous review. Is this even possible? TIA!

2 Answers

You can choose the "Custom Task Type" instead of the built-in integration with Textract: https://docs.aws.amazon.com/sagemaker/latest/dg/a2i-task-types-custom.html

With a custom task type, you can use the start_human_loop API to trigger the human review loop yourself: https://docs.aws.amazon.com/sagemaker/latest/dg/a2i-start-human-loop.html

For an example of a custom task type, see also this sample notebook: https://github.com/aws-samples/amazon-a2i-sample-jupyter-notebooks/blob/master/A2I-Video-Transcription-with-Amazon-Transcribe.ipynb

answered 2 months ago
  • If I understand this correctly, I can basically keep what I have developed but make two small changes.

    1. Instead of invoking textract.analyze_document inside of the human loop, I would call a2i.start_human_loop.
    2. Instead of reading from the bucket with the image files, I would read from the bucket with the JSON files. Is that correct? It seems like I would need to construct the human loop so that it presents the user with both the image and its annotations derived from the Textract output.
  • Right. You need to pass a JSON string in the "InputContent" field when calling start_human_loop(). It doesn't necessarily have to come from S3, but you do need to construct the JSON for this field.
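
The call described in the comments above can be sketched roughly as follows with boto3. This is a minimal sketch, not a tested pipeline: it assumes a flow definition for a custom task type already exists, and the helper names (build_input_content, start_review) and the keys inside InputContent ("taskObject", "lines") are hypothetical. The keys only need to match whatever your custom worker task template reads.

```python
import json
import uuid


def build_input_content(image_s3_uri, textract_blocks):
    """Serialize the image location and stored Textract text as a JSON string.

    The key names ("taskObject", "lines") are hypothetical; use the names
    your own worker task template expects.
    """
    lines = [b["Text"] for b in textract_blocks if b.get("BlockType") == "LINE"]
    return json.dumps({"taskObject": image_s3_uri, "lines": lines})


def start_review(a2i_client, flow_definition_arn, image_s3_uri, textract_blocks):
    """Start one human loop from Textract output that was stored earlier."""
    # Loop names must be unique; a UUID suffix is one simple way to get that.
    loop_name = f"review-{uuid.uuid4()}"
    a2i_client.start_human_loop(
        HumanLoopName=loop_name,
        FlowDefinitionArn=flow_definition_arn,
        HumanLoopInput={
            "InputContent": build_input_content(image_s3_uri, textract_blocks)
        },
    )
    return loop_name
```

Here a2i_client would be created with boto3.client("sagemaker-a2i-runtime"), and textract_blocks would be the "Blocks" list loaded from the stored Textract JSON. Because the loop is started from stored output, Textract can finish long before any volunteer opens the task.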


Is it possible to pass the response from textract.get_document_analysis() as the InputContent?
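
For anyone trying this, a hedged sketch of what passing that response would involve. Two things are worth noting: InputContent must be a JSON string rather than a Python dict, and get_document_analysis returns results in pages linked by NextToken, so the blocks have to be collected first. The helper names and the "blocks" key are hypothetical, and max_chars is a placeholder for the InputContent length limit, which should be checked against the current StartHumanLoop API reference.

```python
import json


def collect_blocks(textract_client, job_id):
    """Gather every Block from a finished asynchronous Textract job."""
    blocks, token = [], None
    while True:
        kwargs = {"JobId": job_id}
        if token:
            kwargs["NextToken"] = token
        page = textract_client.get_document_analysis(**kwargs)
        blocks.extend(page.get("Blocks", []))
        token = page.get("NextToken")
        if not token:
            return blocks


def to_input_content(blocks, max_chars):
    """Serialize blocks for InputContent, failing loudly if too large.

    max_chars is an assumption supplied by the caller; look up the real
    limit in the StartHumanLoop API reference.
    """
    payload = json.dumps({"blocks": blocks})  # "blocks" is a hypothetical key
    if len(payload) > max_chars:
        raise ValueError(
            f"InputContent is {len(payload)} characters; limit is {max_chars}"
        )
    return payload
```

If the serialized blocks turn out to be too large, one option is to pass only the S3 location of the stored Textract JSON plus the fields the review template actually displays.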

answered 2 months ago
