How to keep extracted FORMS in document order


Hello.

I am using AWS Textract, specifically the FORMS functionality, to extract form fields. It works really well. The issue is that the extracted FORMS are not returned in the natural order in which they appear in the document. Is there any way to keep the natural order in the returned object? Or can I map back to the document order using the coordinates? This is how I currently use the extraction:

# Assumption: Document is the class from the amazon-textract-response-parser package (trp).
from trp import Document

def ocr(document):
    # start_job, is_job_complete, get_job_results, client and BUCKET are defined elsewhere.
    job_id = start_job(client, BUCKET, document)
    is_job_complete(client, job_id)
    response = get_job_results(client, job_id)  # This is the full object of the OCR

    # Collect the key/value pairs of each page's form.
    doc = Document(response)
    field_list = []
    for page in doc.pages:
        fields = []
        for field in page.form.fields:
            fields.append("Key: {} Value: {}".format(field.key, field.value))
        field_list.append(fields)

    # Also extract the raw text from the LINE blocks.
    text_list = []
    for part in response:
        for item in part["Blocks"]:
            if item["BlockType"] == "LINE":
                text_list.append(item["Text"])
    text = " ".join(text_list)
    return field_list, text

To put it in a real scenario, the example document contains the following FORMS:

A: 123

B: 432

C: 000

D: 126

But the above function returns:

B: 432

A: 123

D: 126

C: 000

Hence it does not keep the natural order of working from the top, left to right, down to the bottom of the document. Is there any setting I can alter earlier, or something I can change in my current function, to return the original/natural order?

3 Answers

Accepted Answer

Textract is a machine learning service so it may not be 100% accurate. However, we are always trying to improve the accuracy of our models. We will forward this particular use case to our science teams so that hopefully a future model update will not miss this case.

Regarding the issue with the reading order, Textract currently does not support configuration for reading order. However, we will surface this feature request to our PMs to see if this can be added to a future release.
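Until such an option exists, the coordinates mentioned in the question can be used to restore the order on the client side. Below is a minimal sketch, not an official Textract feature. It assumes you work on the raw `Blocks` list of the response, where each KEY block carries `Geometry.BoundingBox` values (`Top`, `Left`) as fractions of the page size. The names `key_text` and `keys_in_reading_order` and the `row_tolerance` default of 0.01 are made up for illustration. Keys whose top edges lie within the tolerance are treated as one row and sorted left to right; rows are sorted top to bottom.

```python
def key_text(block, blocks_by_id):
    """Join the WORD children of a KEY block into the key's text."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for child_id in rel["Ids"]:
            child = blocks_by_id.get(child_id)
            if child and child["BlockType"] == "WORD":
                words.append(child["Text"])
    return " ".join(words)


def keys_in_reading_order(blocks, row_tolerance=0.01):
    """Return key texts sorted by page, then top to bottom, then left to right.

    row_tolerance is an assumed value (fraction of page height); tune it
    for your documents.
    """
    blocks_by_id = {b["Id"]: b for b in blocks}
    keys = [
        b for b in blocks
        if b["BlockType"] == "KEY_VALUE_SET" and "KEY" in b.get("EntityTypes", [])
    ]
    # First pass: order by page and vertical position.
    keys.sort(key=lambda b: (b.get("Page", 1), b["Geometry"]["BoundingBox"]["Top"]))

    # Second pass: group keys with nearly the same top edge into rows.
    rows = []
    for b in keys:
        top = b["Geometry"]["BoundingBox"]["Top"]
        page = b.get("Page", 1)
        last = rows[-1] if rows else None
        if last and last["page"] == page and abs(top - last["top"]) <= row_tolerance:
            last["items"].append(b)
        else:
            rows.append({"page": page, "top": top, "items": [b]})

    # Third pass: inside each row, order from left to right.
    ordered = []
    for row in rows:
        row["items"].sort(key=lambda b: b["Geometry"]["BoundingBox"]["Left"])
        ordered.extend(key_text(b, blocks_by_id) for b in row["items"])
    return ordered
```

With the `response` list from the question, the input could be built as `[b for part in response for b in part["Blocks"]]`. The matching value of each key can be looked up the same way by following the KEY block's relationship of type `VALUE`. Multi-column layouts may need a different grouping rule than this simple row heuristic.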

AWS
Answered 2 years ago

Hi, glad to hear Textract AnalyzeDocument's FORMS feature is generally working well for you.

Textract currently does not support configuration for reading order. For this particular case, it would be good if you share your document with us so we can investigate further as to why you're not getting the expected order.

AWS
Answered 2 years ago

Hi, thanks for the reply! I can share a snippet of a FORMS example. In the link below you can find the picture. After converting the PNG to PDF and running the code above, I get the following result:

LINK: https://gyazo.com/b7b949cd06538bf7931cb9f6117ac581

[['Key: Straße, Hausnummer Value: None',

'Key: Ggf. weitere Angaben Value: None',

'Key: Geburtsname Value: None',

'Key: Postleitzahl, Wohnort, bei Soldaten Standort Value: None',

'Key: Familienname Value: None',

'Key: Geburtsdatum Value: Geburtsort']]

Naturally, the order should have been:

[['Key: Vorname Value: None',

'Key: Familienname Value: None',

'Key: Straße, Hausnummer Value: None',

'Key: Postleitzahl, Wohnort, bei Soldaten Standort Value: None',

'Key: Geburtsdatum Value: Geburtsort',

'Key: Geburtsname Value: None',

'Key: Ggf. weitere Angaben Value: None']]

So in this example, it even missed the "Vorname" field. Usually it doesn't miss any. If anyone can help explain this behavior, I would be very grateful!

Answered 2 years ago
