Textract Form data + Raw data


I am having trouble extracting text when a page contains both form data and plain (raw) text, for instance when half of the page is plain text and the other half is a form. When we analyze the document with Textract, it reports that the page has form data. If we extract using the form data, the downloaded file contains only the form portion. If we extract the raw text instead, the whole page is captured, including the form, but not in a structured format. Is there a way to extract the form as structured data and exclude it from the raw text, or some other way to get an extraction that matches the uploaded document?

Asked 2 years ago, viewed 692 times
1 Answer

Hey! This one is a bit tricky, but once you understand how the JSON response is structured, you should be able to post-process the AnalyzeDocument API JSON response to get the data you need.

  • The AnalyzeDocument API returns all the text it found in the document (as LINE and WORD blocks) plus the FeatureTypes you requested (in this case FORMS, returned as KEY_VALUE_SET blocks).
  • Each word is a WORD block with its own Id, and the KEY_VALUE_SET blocks reference those Ids in their Relationships.
  • First, process and save the form key-value results, and collect the Ids of the words they reference.
  • Then remove the words with those Ids from the text blocks of the response. What remains is the raw text you are looking for.
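The steps above can be sketched roughly as follows. This is a minimal illustration, not an official sample: it assumes `response` is the dictionary returned by boto3's `analyze_document` call with `FeatureTypes=["FORMS"]`, the function name `split_forms_and_raw_text` is made up for this example, and `sample_response` is a tiny hand-written stand-in for a real Textract response.

```python
def split_forms_and_raw_text(response):
    """Return (form_fields, raw_lines) from an AnalyzeDocument-style response."""
    blocks = {b["Id"]: b for b in response["Blocks"]}

    def related_ids(block, rel_type):
        # Collect the Ids listed under one relationship type (CHILD or VALUE).
        return [i for rel in block.get("Relationships", [])
                if rel["Type"] == rel_type for i in rel["Ids"]]

    def text_of(block):
        # Join the text of a block's WORD children (checkbox status if present).
        parts = []
        for cid in related_ids(block, "CHILD"):
            child = blocks[cid]
            if child["BlockType"] == "WORD":
                parts.append(child["Text"])
            elif child["BlockType"] == "SELECTION_ELEMENT":
                parts.append(child["SelectionStatus"])
        return " ".join(parts)

    # Step 1: save the form key-value pairs and remember which words they use.
    form_fields, form_word_ids = [], set()
    for block in response["Blocks"]:
        if block["BlockType"] != "KEY_VALUE_SET":
            continue
        form_word_ids.update(related_ids(block, "CHILD"))
        if "KEY" in block.get("EntityTypes", []):
            value = " ".join(text_of(blocks[v]) for v in related_ids(block, "VALUE"))
            form_fields.append((text_of(block), value))

    # Step 2: rebuild each LINE from the words that do not belong to a form field.
    raw_lines = []
    for block in response["Blocks"]:
        if block["BlockType"] != "LINE":
            continue
        remaining = [blocks[c]["Text"] for c in related_ids(block, "CHILD")
                     if c not in form_word_ids and blocks[c]["BlockType"] == "WORD"]
        if remaining:
            raw_lines.append(" ".join(remaining))
    return form_fields, raw_lines


# Hand-written example: one form field ("Name: Alice") and one plain-text line.
sample_response = {"Blocks": [
    {"Id": "w1", "BlockType": "WORD", "Text": "Name:"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Alice"},
    {"Id": "w3", "BlockType": "WORD", "Text": "Hello"},
    {"Id": "w4", "BlockType": "WORD", "Text": "world"},
    {"Id": "l1", "BlockType": "LINE",
     "Relationships": [{"Type": "CHILD", "Ids": ["w1", "w2"]}]},
    {"Id": "l2", "BlockType": "LINE",
     "Relationships": [{"Type": "CHILD", "Ids": ["w3", "w4"]}]},
    {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                       {"Type": "CHILD", "Ids": ["w1"]}]},
    {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w2"]}]},
]}

form_fields, raw_lines = split_forms_and_raw_text(sample_response)
print(form_fields)
print(raw_lines)
```

Working at the word level rather than dropping whole lines means a line that mixes form text and plain text keeps its non-form words. A real document may also need the lines sorted by their Geometry to restore reading order, which this sketch does not attempt.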

Take a look at the Textractor AWS sample, which can help you process the JSON results!

Hope this helps.

AWS
Dani M
Answered 2 years ago
