Textract Form data + Raw data


I am having trouble extracting text when a page contains both form data and raw (plain) text, for instance when part of the page is plain text and the rest is a form. When we analyze the document with Textract, it reports that the page has form data. If we extract as form data, the downloaded file contains only the form portion. If we extract as raw text, the whole page is captured, including the form data, but not in a structured format. Is there a way to extract the form data in structured form and exclude it from the raw text, or some other way to get output that matches the uploaded document?

Asked 2 years ago, 676 views
1 Answer

Hey! This one is a bit tricky, but if you understand how the JSON response is structured, you should be able to post-process the AnalyzeDocument API JSON response to get the data you need.

  • The AnalyzeDocument API returns all the text it found in the document plus the results for the FeatureTypes you requested (in this case, FORMS).
  • Each word is a block with its own ID, and the form key and value blocks reference those word IDs in their relationships.
  • First, process and save the form relationship results (the key-value pairs).
  • Then remove from the response the word IDs that belong to the forms. What remains is the raw text you are looking for.
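The steps above can be sketched roughly as follows. This is a minimal illustration, not the Textractor sample's code: it assumes the standard AnalyzeDocument response shape (a `Blocks` list whose items have `Id`, `BlockType`, `Text`, `EntityTypes` and `Relationships`), and the helper names `child_ids`, `text_of` and `split_forms_and_raw_text` are made up for this example.

```python
def child_ids(block):
    """Return the IDs listed under a block's CHILD relationships."""
    ids = []
    for rel in block.get("Relationships") or []:
        if rel["Type"] == "CHILD":
            ids.extend(rel["Ids"])
    return ids


def text_of(block, blocks_by_id):
    """Join the text of a block's WORD children (selection elements are skipped)."""
    children = [blocks_by_id[i] for i in child_ids(block)]
    return " ".join(c["Text"] for c in children if c["BlockType"] == "WORD")


def split_forms_and_raw_text(response):
    """Split an AnalyzeDocument (FORMS) response into key-value pairs and leftover lines.

    Hypothetical helper: returns (forms, raw_lines), where forms maps key text
    to value text and raw_lines holds the text not claimed by any form field.
    Repeated keys overwrite each other in this simple version.
    """
    blocks_by_id = {b["Id"]: b for b in response["Blocks"]}
    forms = {}
    form_word_ids = set()

    # Step 1: walk the KEY blocks, follow their VALUE relationships,
    # and remember every word ID that belongs to a form field.
    for block in response["Blocks"]:
        if block["BlockType"] != "KEY_VALUE_SET":
            continue
        if "KEY" not in (block.get("EntityTypes") or []):
            continue
        form_word_ids.update(child_ids(block))
        value_parts = []
        for rel in block.get("Relationships") or []:
            if rel["Type"] != "VALUE":
                continue
            for value_id in rel["Ids"]:
                value_block = blocks_by_id[value_id]
                form_word_ids.update(child_ids(value_block))
                value_parts.append(text_of(value_block, blocks_by_id))
        forms[text_of(block, blocks_by_id)] = " ".join(value_parts)

    # Step 2: rebuild each LINE from the words that are not part of a form.
    raw_lines = []
    for block in response["Blocks"]:
        if block["BlockType"] != "LINE":
            continue
        remaining = [
            blocks_by_id[i]["Text"]
            for i in child_ids(block)
            if i not in form_word_ids
        ]
        if remaining:
            raw_lines.append(" ".join(remaining))

    return forms, raw_lines
```

With boto3, the input would typically be the dictionary returned by the Textract client's `analyze_document` call with `FeatureTypes=["FORMS"]`. Filtering at the word level, rather than dropping whole lines, keeps any plain text that happens to share a line with a form field.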

Take a look at the Textractor AWS sample, which can help you process the JSON results!

Hope this helps.

AWS
Dani M
Answered 2 years ago
