Textract Form data + Raw data


I am having trouble extracting text from pages that mix form data with plain text, for instance a page where one portion is plain text and the rest is a form. When we analyze the document with Textract, it reports that the page has form data. If we extract using the form data, the downloaded file contains only the form portion. If we extract the raw text instead, everything on the page is extracted, including the form data, but without any structure. Is there a way to extract the form fields as structured data and exclude them from the raw text, or some other way to get output that matches the uploaded document?

asked 2 years ago, 704 views
1 Answer

Hey! This one is a bit tricky, but once you understand how the JSON response is structured, you should be able to post-process the AnalyzeDocument API JSON response to get the data you need.

  • The AnalyzeDocument API returns all the text it found in the document plus the results for the FeatureTypes you requested (in this case, FORMS).
  • Each word is a WORD block with its own Id, and the form's KEY_VALUE_SET blocks reference those Ids in their Relationships.
  • First, process and save the form key-value results.
  • Then remove from the response the words whose Ids belong to the forms. What remains is the raw text you are looking for.

Take a look at the Textractor AWS sample, which can help you process the JSON results!

Hope this helps.

AWS
Dani M
answered 2 years ago
