Textract Form data + Raw data

I'm having trouble extracting text from pages that mix form data with plain (raw) text, for instance when part of the page is plain text and the rest is a form. When we analyze the document with Textract, it reports that the page contains form data. If we extract it as form data, the downloaded file captures only the form portion. If we extract it as raw text, the whole page comes through, including the form fields, but without any structure. Is there a way to extract the form fields in structured form and exclude them from the raw text, or some other approach that yields output matching the layout of the uploaded document?

Asked 2 years ago · 676 views
1 Answer

Hey! This one is a bit tricky, but once you understand how the JSON response is structured, you should be able to post-process the AnalyzeDocument API response to get the data you need.

  • The AnalyzeDocument API returns all the text it finds in the document, plus blocks for the FeatureTypes you requested (in this case, FORMS).
  • Each word is a block with its own ID, and the form key and value blocks reference those IDs in their relationships.
  • First, process the form relationships and save the resulting key-value pairs.
  • Then remove the words whose IDs belong to the forms from the response; what remains is the raw text you are looking for.
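
The steps above can be sketched roughly as follows. This is a minimal illustration, not an official recipe: it assumes `response` is the parsed JSON of a single AnalyzeDocument call made with the FORMS feature type, and the function name `split_forms_and_raw_text` is made up for this example.

```python
def split_forms_and_raw_text(response):
    """Return (form_fields, raw_lines) from an AnalyzeDocument response dict.

    form_fields: list of (key_text, value_text) tuples.
    raw_lines:   text of LINE blocks with the form words removed.
    Hypothetical helper for illustration only.
    """
    # Index every block by its Id so relationships can be resolved.
    blocks = {b["Id"]: b for b in response["Blocks"]}

    def related_ids(block, rel_type="CHILD"):
        # Collect the Ids listed under one relationship type.
        return [
            block_id
            for rel in block.get("Relationships", [])
            if rel["Type"] == rel_type
            for block_id in rel["Ids"]
        ]

    def text_of(block):
        # Join the WORD children; checkboxes report their selection status.
        parts = []
        for child_id in related_ids(block):
            child = blocks[child_id]
            if child["BlockType"] == "WORD":
                parts.append(child["Text"])
            elif child["BlockType"] == "SELECTION_ELEMENT":
                parts.append(child["SelectionStatus"])
        return " ".join(parts)

    form_fields = []
    form_word_ids = set()

    # Step 1: walk the KEY blocks, resolve their VALUE blocks, and
    # remember every word Id that belongs to a form field.
    for block in response["Blocks"]:
        is_key = (
            block["BlockType"] == "KEY_VALUE_SET"
            and "KEY" in block.get("EntityTypes", [])
        )
        if not is_key:
            continue
        value_blocks = [blocks[i] for i in related_ids(block, "VALUE")]
        value_text = " ".join(text_of(v) for v in value_blocks)
        form_fields.append((text_of(block), value_text))
        form_word_ids.update(related_ids(block))
        for value_block in value_blocks:
            form_word_ids.update(related_ids(value_block))

    # Step 2: rebuild each LINE from the words that are not part of a form.
    raw_lines = []
    for block in response["Blocks"]:
        if block["BlockType"] != "LINE":
            continue
        remaining = [
            blocks[i]["Text"]
            for i in related_ids(block)
            if i not in form_word_ids
        ]
        if remaining:
            raw_lines.append(" ".join(remaining))

    return form_fields, raw_lines
```

With boto3, `response` would typically be what `analyze_document(Document=..., FeatureTypes=["FORMS"])` returns on a Textract client. Working at the word level rather than the line level matters here, because a single LINE block can contain both form words and surrounding plain text.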

Take a look at the Textractor AWS sample, which can help you process the JSON results!

Hope this helps.

AWS
Dani M
answered 2 years ago
