Keeping extracted FORMS in document order

Hello.

I am using AWS Textract, specifically the FORMS functionality, to extract form fields. It works really well. The issue I have is that the extracted FORMS are not returned in the natural order in which they appear in the document. Is there any way to keep the natural order in the returned object? Or can I map back to the document order using the coordinates? This is how I currently run the extraction:

from trp import Document  # assumed: Document is the amazon-textract-response-parser class

def ocr(document):
    # start_job, is_job_complete, get_job_results, client and BUCKET are defined elsewhere
    job_id = start_job(client, BUCKET, document)
    is_job_complete(client, job_id)
    response = get_job_results(client, job_id)  # This is the full object of the OCR
    doc = Document(response)

    # Collect the key/value pairs of each page, in the order the parser yields them
    field_list = []
    for page in doc.pages:
        lst = []
        for field in page.form.fields:
            lst.append("Key: {} Value: {}".format(field.key, field.value))
        field_list.append(lst)

    # Also extract the raw text
    text_list = []
    for i in range(0, len(response)):
        for item in response[i]["Blocks"]:
            if item["BlockType"] == "LINE":
                text_list.append(item["Text"])
    text = " ".join(text_list)
    return (field_list, text)

To illustrate with a concrete scenario, suppose the example document contains the following form fields:

A: 123

B: 432

C: 000

D: 126

But the above function returns:

B: 432

A: 123

D: 126

C: 000

So the natural reading order (starting at the top, going left to right, and working down to the bottom of the document) is not preserved. Is there a setting I can change earlier in the process, or something I can change in my current function, so that it returns the original/natural order?

3 Answers
Accepted Answer

Textract is a machine learning service so it may not be 100% accurate. However, we are always trying to improve the accuracy of our models. We will forward this particular use case to our science teams so that hopefully a future model update will not miss this case.

Regarding the issue with the reading order, Textract currently does not support configuration for reading order. However, we will surface this feature request to our PMs to see if this can be added to a future release.

AWS
answered 2 years ago

Hi, glad to hear Textract AnalyzeDoc's FORMS feature is working well for you generally.

Textract currently does not support configuration for reading order. For this particular case, it would be good if you share your document with us so we can investigate further as to why you're not getting the expected order.

AWS
answered 2 years ago

Hi, thanks for the reply! I can share a snippet of a FORMS example. In the link below you can find the picture. After converting the PNG to PDF and running the code above, I get the following result:

LINK: https://gyazo.com/b7b949cd06538bf7931cb9f6117ac581

[['Key: Straße, Hausnummer Value: None',

'Key: Ggf. weitere Angaben Value: None',

'Key: Geburtsname Value: None',

'Key: Postleitzahl, Wohnort, bei Soldaten Standort Value: None',

'Key: Familienname Value: None',

'Key: Geburtsdatum Value: Geburtsort']]

Naturally, the order should have been:

[['Key: Vorname Value: None',

'Key: Familienname Value: None',

'Key: Straße, Hausnummer Value: None',

'Key: Postleitzahl, Wohnort, bei Soldaten Standort Value: None',

'Key: Geburtsdatum Value: Geburtsort',

'Key: Geburtsname Value: None',

'Key: Ggf. weitere Angaben Value: None']]

So in this example, it even missed the "Vorname" field. Usually it doesn't miss any. If anyone can help explain this behavior, I would be very grateful!

answered 2 years ago
