Sort and extract full text


I need to extract text from some PDF documents.

The layout of these texts varies (the column where each one starts and ends changes), but the beginning and end are always similar: each text starts with the word DECRETO and ends with the word CHEFE followed by something else.

In this link you can see an example of the original text:

Is it possible to do this with AWS tools? What's the best way?

Note: The text is in Brazilian Portuguese (PT-BR).

2 Answers

Hi there, thank you for using Textract. At the moment, we do not provide a mechanism that supports your use case directly. We recommend handling it on the client side by post-processing the response, using the bounding boxes of the lines it returns. I hope this helps!
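
As a rough sketch of that kind of post-processing (not an official Textract feature), the Python below takes the `Blocks` list from a Textract text-detection response, groups the `LINE` blocks into columns by the `Left` coordinate of their bounding boxes, reads each column top to bottom, and then pulls out every span that starts with DECRETO and ends on the line containing CHEFE. The function names `lines_in_reading_order` and `extract_decrees` and the `column_gap` threshold of 0.15 are assumptions made for illustration. The threshold will likely need tuning for your layouts, and centered or heavily indented lines may end up in the wrong column.

```python
import re

# Matches from DECRETO up to the end of the first following line containing CHEFE.
DECREE = re.compile(r"DECRETO.*?CHEFE[^\n]*", re.DOTALL)


def _box(block):
    return block["Geometry"]["BoundingBox"]


def lines_in_reading_order(blocks, column_gap=0.15):
    """Return LINE texts page by page, column by column, top to bottom.

    column_gap is a hypothetical threshold (fraction of page width): a line
    whose left edge is further than this from the current column's left edge
    starts a new column.
    """
    by_page = {}
    for block in blocks:
        if block.get("BlockType") == "LINE":
            by_page.setdefault(block.get("Page", 1), []).append(block)

    ordered = []
    for page in sorted(by_page):
        columns = []
        column_left = None
        # Walk lines from left to right, opening a new column on a large jump.
        for block in sorted(by_page[page], key=lambda b: _box(b)["Left"]):
            left = _box(block)["Left"]
            if column_left is None or left - column_left > column_gap:
                columns.append([])
                column_left = left
            columns[-1].append(block)
        # Inside each column, read from top to bottom.
        for column in columns:
            column.sort(key=lambda b: _box(b)["Top"])
            ordered.extend(b["Text"] for b in column)
    return ordered


def extract_decrees(blocks):
    """Join the reordered lines and return every DECRETO ... CHEFE span."""
    text = "\n".join(lines_in_reading_order(blocks))
    return DECREE.findall(text)
```

Here `blocks` would be `response["Blocks"]` from a Textract text-detection call. The ordering is a simple heuristic, so it is worth checking the result against a few of your real pages before relying on it.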

answered 5 months ago
  • Could Rekognition help by identifying each column, so that after some client-side processing I could put the texts in sequence and then run Textract on them? Will Rekognition be able to identify each column separately?


Yes, this is possible with Amazon Textract (which supports Portuguese). To learn more about how to extract text from PDFs, you can check out the documentation.

answered 5 months ago
  • With Textract this is not possible, because it extracts and aligns the words line by line across the page rather than by column, as I marked in the image.
