Textract is not working as it should


I have an automation that extracts text from PDFs. I built it in Python with the boto3 SDK, using Textract to extract the text from PDFs and images. The program downloads the PDFs from S3, runs Textract to extract the text, cleans and organizes it with text mining into a JSON document, and sends that JSON to an endpoint. Locally it works well, but when I run it in a Lambda function, some parts of the extraction come out wrong. Here is an example:

In the Lambda execution: Agencia E Expedidora. In the local execution: Agencia Expedidora.

In this case it wouldn't be much of a problem, but I have other fields that are numeric, and I could not reliably fix those by modifying the text afterwards. Example: in the Lambda execution: 773747; in the local execution: 273747.

Please help me solve this, because I don't know what the problem could be. I have already tried updating Docker and aligning the packages with the versions I have locally, but nothing has changed.

Asked 4 months ago · Viewed 222 times
1 Answer

Hi,

The problem you are facing is indeed pretty strange. If you are passing the same bytes to the Textract API (I assume one of AnalyzeDocument, AnalyzeExpense or AnalyzeID), the result should be the same regardless of whether the call is made from your local computer or from a Lambda function.
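One way to check the "same bytes" assumption is to log a fingerprint of the document right before the Textract call in both environments and compare the two. The following is only a minimal sketch: the `fingerprint` helper and the `pdf_bytes` variable are hypothetical names, not part of boto3 or of the original program.

```python
import hashlib


def fingerprint(document_bytes: bytes) -> str:
    """Return the size and SHA-256 digest of the bytes about to be sent to Textract."""
    digest = hashlib.sha256(document_bytes).hexdigest()
    return f"{len(document_bytes)} bytes, sha256={digest}"


# Hypothetical usage: log this just before the Textract call, locally and in Lambda.
# print(fingerprint(pdf_bytes))
```

If the two fingerprints differ, the document is being altered somewhere before it reaches Textract (for example during download or conversion), and the difference is not in Textract itself.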

From your description it seems you are performing some redundant steps: as your documents are already on S3, you can pass the S3 object location directly to the Textract APIs, thus avoiding the download step.

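import boto3

client = boto3.client('textract')

# Pass either 'Bytes' or 'S3Object', not both.
# With 'S3Object', Textract reads the file from S3 directly.
response = client.analyze_document(
    Document={
        'S3Object': {
            'Bucket': 'string',
            'Name': 'string',
            'Version': 'string'  # optional
        }
    },
    FeatureTypes=['FORMS'],  # required parameter; 'FORMS' is only an example value
)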

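If you have multiple documents to process, you can also use the asynchronous operations, like start_document_analysis, which take the S3 location of the document and return a job ID that you use to fetch the results later.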
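As a rough sketch of the asynchronous flow, assuming the document is already in S3: the function name `analyze_s3_document`, the polling interval and the default feature type are illustrative choices, while `start_document_analysis` and `get_document_analysis` are the boto3 Textract client calls. The client is passed in as an argument, for example `boto3.client('textract')`.

```python
import time


def analyze_s3_document(client, bucket, key, feature_types=("FORMS",), poll_seconds=5):
    """Start an asynchronous Textract analysis of an S3 object and return all blocks."""
    job = client.start_document_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=list(feature_types),
    )
    job_id = job["JobId"]

    # Poll until the job leaves the IN_PROGRESS state.
    while True:
        result = client.get_document_analysis(JobId=job_id)
        if result["JobStatus"] != "IN_PROGRESS":
            break
        time.sleep(poll_seconds)

    if result["JobStatus"] not in ("SUCCEEDED", "PARTIAL_SUCCESS"):
        raise RuntimeError(f"Textract job {job_id} ended with status {result['JobStatus']}")

    # Results are paginated; follow NextToken until it is absent.
    blocks = list(result.get("Blocks", []))
    while "NextToken" in result:
        result = client.get_document_analysis(JobId=job_id, NextToken=result["NextToken"])
        blocks.extend(result.get("Blocks", []))
    return blocks
```

Polling is the simplest option for a sketch like this; in a Lambda function it may be preferable to have Textract publish job completion to an SNS topic instead of waiting inside the invocation.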

[1] https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/textract/client/analyze_document.html

AWS
Expert
Answered 4 months ago
  • Hi, thanks for your response, but this does not solve my problem. I am using start_document_analysis to process the documents one at a time. From some tests I ran, the problem seems to be in Docker: I am installing the same package versions there, but it still doesn't work.
