Textract Form data + Raw data


I am having trouble extracting text when a page contains both form data and plain (raw) text. For instance, part of the page is plain text and the rest is a form. When we analyze the document, Textract reports that it contains form data. If we extract it as form data, the downloaded file contains only the form portion. If we extract it as raw text, the whole page is captured, including the form, but the output is not structured. Is there a way to extract the form fields as structured data and exclude them from the raw text, or some other way to get output that matches the uploaded document?

asked 5 months ago · 83 views
1 Answer

Hey! This one is a bit tricky, but once you understand how the JSON response is structured, you should be able to post-process the AnalyzeDocument API response to get the data you need.

  • The AnalyzeDocument API returns all the text it found in the document, plus results for the FeatureTypes you requested (in this case, FORMS).
  • Each word is a block with its own ID, and the form key and value blocks reference those word IDs through their relationships.
  • First, process and save the form results by following those relationships.
  • Then remove from the response the words whose IDs belong to the form fields. What remains is the raw text you are looking for.
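
The steps above could be sketched roughly like this in Python. It is only a minimal illustration, assuming the response has the usual Textract shape (a `Blocks` list where `KEY_VALUE_SET` blocks point to `WORD` blocks through `CHILD` relationships, and key blocks point to their value blocks through `VALUE` relationships). The helper names `child_ids` and `split_forms_and_raw_text` and the tiny sample response are made up for the example; with a real document you would pass in `response["Blocks"]` from an AnalyzeDocument call made with `FeatureTypes=["FORMS"]`.

```python
def child_ids(block, rel_type="CHILD"):
    """Return the IDs a block references through one relationship type."""
    return [
        block_id
        for rel in block.get("Relationships") or []
        if rel["Type"] == rel_type
        for block_id in rel["Ids"]
    ]


def split_forms_and_raw_text(blocks):
    """Return (form fields, remaining text lines) from a list of Textract blocks."""
    by_id = {b["Id"]: b for b in blocks}

    def text_of(block):
        # Join the WORD children of a key or value block into one string.
        words = [by_id[i] for i in child_ids(block)]
        return " ".join(w["Text"] for w in words if w["BlockType"] == "WORD")

    # Step 1: process the form results and remember which words they use.
    form_word_ids = set()
    fields = []
    for block in blocks:
        if block["BlockType"] != "KEY_VALUE_SET":
            continue
        form_word_ids.update(child_ids(block))
        if "KEY" in block.get("EntityTypes", []):
            values = [by_id[i] for i in child_ids(block, "VALUE")]
            fields.append((text_of(block), " ".join(text_of(v) for v in values)))

    # Step 2: rebuild each LINE from the words that are not part of a form field.
    raw_lines = []
    for block in blocks:
        if block["BlockType"] != "LINE":
            continue
        kept = [by_id[i]["Text"] for i in child_ids(block) if i not in form_word_ids]
        if kept:
            raw_lines.append(" ".join(kept))
    return fields, raw_lines


# Hand-made sample that mimics the response shape (not real Textract output).
sample_blocks = [
    {"Id": "w1", "BlockType": "WORD", "Text": "Intro"},
    {"Id": "w2", "BlockType": "WORD", "Text": "paragraph"},
    {"Id": "w3", "BlockType": "WORD", "Text": "Name:"},
    {"Id": "w4", "BlockType": "WORD", "Text": "Jane"},
    {"Id": "l1", "BlockType": "LINE", "Text": "Intro paragraph",
     "Relationships": [{"Type": "CHILD", "Ids": ["w1", "w2"]}]},
    {"Id": "l2", "BlockType": "LINE", "Text": "Name: Jane",
     "Relationships": [{"Type": "CHILD", "Ids": ["w3", "w4"]}]},
    {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                       {"Type": "CHILD", "Ids": ["w3"]}]},
    {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w4"]}]},
]

fields, raw_lines = split_forms_and_raw_text(sample_blocks)
print(fields)
print(raw_lines)
```

Filtering at the word level rather than dropping whole lines means a line that mixes form text and plain text still keeps its plain part. Multi-page responses, tables, and selection elements (checkboxes) would need extra handling that this sketch leaves out.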

Take a look at the Textractor AWS sample, which can help you process the JSON results!

Hope this helps.

Dani M
answered 5 months ago
