Which Textract approach is faster/cheaper/better for 1M+ documents?
I have a pile of 1 million+ TIF files (single- and multi-page) that I need to OCR, search for particular terms, and, by the end of my workflow, turn into individual page images with matching text. I have built a Step Functions state machine with different Lambda functions handling different parts of the pipeline. After the state machine has run on all of our raw images, I use Python to collect the results and load pointers to the processed images, text, and JSON into a database.

I'm wondering whether it makes more sense to split multi-page files into separate images before I use Textract to OCR them, or whether there are efficiencies in cost, speed, rate limiting, or something else to be gained by not splitting the pages until after OCR. I get errors when I run detect_document_text() on my multi-page TIFs, yet I can somehow run it on at least some single-page TIFs, so I'm also wondering about the trade-offs between detect_document_text() and start_document_text_detection(). Sketches of both options follow below.

We are a small nonprofit, but we are designing a repeatable workflow for other datasets of this size or larger. At a scale of 1 million+ documents, the costs and time add up.
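To make the splitting option concrete, this is roughly how I would break a multi-page TIF into single-page files before OCR (a minimal sketch using Pillow; the paths and naming scheme are just placeholders):

```python
from PIL import Image, ImageSequence

def split_tiff(path, out_prefix):
    """Split a multi-page TIFF into one single-page TIFF per page."""
    out_paths = []
    with Image.open(path) as img:
        for i, page in enumerate(ImageSequence.Iterator(img)):
            out_path = f"{out_prefix}-page{i:04d}.tif"
            page.save(out_path)  # writes only the current frame
            out_paths.append(out_path)
    return out_paths
```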
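For the API question, the two boto3 calls I'm comparing look roughly like this (bucket and key names are placeholders, and the result-collection sketch assumes the async job has already finished, e.g. after its SNS notification, so it skips JobStatus polling):

```python
import boto3

textract = boto3.client("textract")

# Synchronous: one request, one response. Errors out on my
# multi-page TIFs, but works on (some) single-page ones.
def ocr_sync(bucket, key):
    resp = textract.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    return resp["Blocks"]

# Asynchronous: submit a job against the file in S3, then page
# through the results once the job has completed.
def ocr_async_start(bucket, key):
    job = textract.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    return job["JobId"]

def ocr_async_collect(job_id):
    blocks, next_token = [], None
    while True:
        kwargs = {"JobId": job_id}
        if next_token:
            kwargs["NextToken"] = next_token
        resp = textract.get_document_text_detection(**kwargs)
        blocks.extend(resp.get("Blocks", []))
        next_token = resp.get("NextToken")
        if not next_token:
            return blocks
```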