How does Textract process PDFs with searchable, selectable text compared to "scanned" PDFs?


I couldn't find any information on whether Textract works differently with these PDFs. I wonder whether Textract is even needed if the PDF already contains text (which is typically the case for machine-generated invoices and other documents). That said, Textract still works very well with searchable PDFs.

My question is whether it makes sense to assess other services for extracting the text. We're going to embed it with an LLM, so we do not care much about form and shape, exact text locations, overlays, and so on.

Thank you!

asked a year ago, 317 views
1 Answer

Assuming the text is always searchable/selectable, if you only plan on extracting the raw text and using a standard library does the job, then I'd agree with your assessment that Textract might be overkill. Where Textract really shines is when you do care about the format, structure, location of information, and relationship between blocks / sections of the document.
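
As a rough illustration of the "standard library" route, here is a minimal sketch that pulls the embedded text layer with pypdf and only flags a file for OCR (Textract or otherwise) when that layer looks empty. The choice of pypdf, the helper names (`has_text_layer`, `extract_page_texts`, `choose_extractor`), and the 20-character threshold are all assumptions for the example, not anything Textract or AWS prescribes.

```python
from typing import List


def has_text_layer(page_texts: List[str], min_chars_per_page: int = 20) -> bool:
    """Heuristic (assumed threshold): the PDF counts as searchable only if
    every page yields at least `min_chars_per_page` non-whitespace-trimmed chars."""
    if not page_texts:
        return False
    return all(len(text.strip()) >= min_chars_per_page for text in page_texts)


def extract_page_texts(path: str) -> List[str]:
    """Read the embedded text of each page with pypdf (third-party package)."""
    # Imported lazily so the heuristic above works without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(path)
    # Image-only pages typically yield an empty string here.
    return [page.extract_text() or "" for page in reader.pages]


def choose_extractor(page_texts: List[str]) -> str:
    """Return "local" when the text layer looks usable, otherwise "ocr"."""
    return "local" if has_text_layer(page_texts) else "ocr"
```

For a machine-generated invoice you would call `choose_extractor(extract_page_texts("invoice.pdf"))` (hypothetical file name) and send only the "ocr" cases to an OCR service. The per-page check is deliberately strict, so a mixed document with even one scanned page is routed to OCR.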

answered a year ago

