Hi, as shown here, you can use a Lambda function to process the document extraction: https://aws.amazon.com/it/blogs/machine-learning/store-output-in-custom-amazon-s3-bucket-and-encrypt-using-aws-kms-for-multi-page-document-processing-with-amazon-textract/
Please take note of this wrapper for Textract, which will ease your work with the APIs: https://github.com/aws-samples/amazon-textract-textractor
Also, you can trigger the Lambda with EventBridge when a new file is uploaded to the specific S3 bucket, and trigger an email notification on an SNS topic when processing finishes. https://aws.amazon.com/it/blogs/aws/new-use-amazon-s3-event-notifications-with-amazon-eventbridge/
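If it helps, here is a minimal sketch of the EventBridge side using the AWS SDK for PHP. The bucket name is a placeholder, and this only enables S3-to-EventBridge delivery; the rule that targets your Lambda is created separately in EventBridge:

```php
<?php
require 'vendor/autoload.php';

use Aws\S3\S3Client;

$s3 = new S3Client([
    'version' => 'latest',
    'region'  => 'us-east-1',
]);

// Forward all S3 events from this bucket to EventBridge, where a rule
// can match "Object Created" events and invoke your Lambda function.
$s3->putBucketNotificationConfiguration([
    'Bucket' => 'my-input-bucket', // placeholder: your upload bucket
    'NotificationConfiguration' => [
        'EventBridgeConfiguration' => [],
    ],
]);
```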
I hope this helps
Hi,
I'd strongly recommend that you read about the limitations: https://docs.aws.amazon.com/textract/latest/dg/limits-document.html
Based on the following limits and the size of your project, you will have to work asynchronously:
File Size and Page Count Limits
- Synchronous operations: JPEG, PNG, PDF, and TIFF files have a limit of 10 MB in memory; PDF and TIFF files also have a limit of 1 page.
- Asynchronous operations: JPEG and PNG files have a limit of 10 MB in memory; PDF and TIFF files have a limit of 500 MB in memory and 3,000 pages.
So, you should call Textract asynchronously, and you can be notified via a Lambda when the results are written back to your S3 bucket: https://docs.aws.amazon.com/lambda/latest/dg/with-s3-example.html
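For example, here is a minimal sketch of the asynchronous call using the AWS SDK for PHP; the bucket, key, topic, and role names are all placeholders:

```php
<?php
require 'vendor/autoload.php';

use Aws\Textract\TextractClient;

$textract = new TextractClient([
    'version' => 'latest',
    'region'  => 'us-east-1',
]);

// Start an asynchronous text-detection job on a multi-page PDF in S3.
$result = $textract->startDocumentTextDetection([
    'DocumentLocation' => [
        'S3Object' => [
            'Bucket' => 'my-input-bucket',    // placeholder
            'Name'   => 'scans/document.pdf', // placeholder
        ],
    ],
    // Write the JSON results back to your own S3 bucket.
    'OutputConfig' => [
        'S3Bucket' => 'my-output-bucket',     // placeholder
        'S3Prefix' => 'output',
    ],
    // Optional: publish a completion message to SNS so a Lambda can react.
    'NotificationChannel' => [
        'SNSTopicArn' => 'arn:aws:sns:us-east-1:123456789012:textract-done', // placeholder
        'RoleArn'     => 'arn:aws:iam::123456789012:role/TextractSNSRole',   // placeholder
    ],
]);

echo 'Started Textract job: ' . $result['JobId'] . PHP_EOL;
```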
Best,
Didier
Thank you! So far I have been able to figure out how to submit a request that takes input from an S3 bucket and writes output to an S3 bucket. I also figured out how to automatically download the JSON object from the bucket and decode it into text locally. My concern, as you point out, is what happens with a fairly large document. I will study the Lambda tutorial and report my progress. Thank you for the assistance and direction.
Oh no! Can't do it! I got halfway through this tutorial: https://docs.aws.amazon.com/lambda/latest/dg/with-s3-example.html
Then I realized PHP isn't supported as a native Lambda runtime. But I did find this: https://medium.com/@nwosuonyedikachi/how-to-run-a-php-based-aws-lambda-function-e1b7a0254036 I don't know whether it's a viable workaround or not. Should I create a new topic, "Using an Amazon S3 trigger to invoke a Lambda function using PHP"?
Also check out the scale samples, which include a DocumentSplitter that can be configured to split documents when they exceed 3,000 pages.
I tested it with the OpenSearchWorkflow as well, using documents that are 10,000 pages. There is a blog post going into the details of the setup, which I tested with 100,000 documents and 1.6 million pages (fully processed in 4.5 hours in us-east-1).
It helps! I want to output to S3 and don't need to encrypt. If I can just find the code to do it in PHP, I'll be good!
I've modified my code to upload to my S3 bucket, but this is what I get:
output/
output//f968fe264ba627742badd701c2a7aed5cf02ed64424c6d98d4ea74963438bbda/.s3_access_check
output//f968fe264ba627742badd701c2a7aed5cf02ed64424c6d98d4ea74963438bbda/1
output//f968fe264ba627742badd701c2a7aed5cf02ed64424c6d98d4ea74963438bbda/2
This is a bunch of stuff, but not the text I'm looking for. How do I get it to upload the scanned text as a text file? If I can't do that, how do I decode this stuff? Thanks!
You are looking at the paginated output from the asynchronous Textract processing and need to concatenate the files. Check out this implementation for a way to achieve this in Python, which you can convert to PHP: https://github.com/aws-samples/amazon-textract-textractor/blob/6e7125c51a351900089102bee1ef2c679c635df2/caller/textractcaller/t_call.py#L195
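A rough PHP sketch of that approach, assuming the numbered-file layout you posted (the bucket and prefix are placeholders, and the .s3_access_check marker file is skipped):

```php
<?php
require 'vendor/autoload.php';

use Aws\S3\S3Client;

$s3 = new S3Client([
    'version' => 'latest',
    'region'  => 'us-east-1',
]);

$bucket = 'my-output-bucket'; // placeholder
$prefix = 'output/JOB_ID/';   // placeholder: the per-job prefix you listed above

// Collect the numbered result files written by the asynchronous job.
$pages = $s3->getPaginator('ListObjectsV2', [
    'Bucket' => $bucket,
    'Prefix' => $prefix,
]);

$parts = [];
foreach ($pages as $page) {
    foreach ($page['Contents'] ?? [] as $object) {
        $name = basename($object['Key']);
        if (ctype_digit($name)) {          // keep only the numbered parts
            $parts[(int) $name] = $object['Key'];
        }
    }
}
ksort($parts);                             // concatenate in page order

// Download each part, decode the JSON, and collect the LINE blocks.
$lines = [];
foreach ($parts as $key) {
    $body = (string) $s3->getObject(['Bucket' => $bucket, 'Key' => $key])['Body'];
    $json = json_decode($body, true);
    foreach ($json['Blocks'] ?? [] as $block) {
        if ($block['BlockType'] === 'LINE') {
            $lines[] = $block['Text'];
        }
    }
}

// Write the concatenated text out as a single file.
file_put_contents('document.txt', implode(PHP_EOL, $lines));
```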
Got it, thanks! I'm able to concatenate and download as text. I'm just concerned about what happens when it's 800 pages or more.