Textract is not working as it should


I have an automation for extracting text from PDFs. I put it together in Python, using the boto3 SDK to call Textract and extract the text from those PDFs and images. The program automates the whole flow: it downloads the PDFs from S3, runs Textract to extract the text, cleans and organizes that text with some text mining into a JSON document, and sends the JSON to an endpoint. The problem is that it works well locally, but when I run it in a Lambda function, the extraction of some parts does not give the same result. Here is an example:

In Lambda execution: Agencia E Expedidora:
In local execution: Agencia Expedidora

Of course, in this particular case it wouldn't be much of a problem, but I have other fields that are numeric, and those would be impossible for me to fix by modifying the text afterwards. For example:
In Lambda execution: 773747
In local execution: 273747

Please help me solve this, because I don't know what the problem could be. I have already tried updating the Docker image and aligning its packages with the versions I have locally, but nothing changed.

asked 6 months ago, 278 views
1 Answer

Hi,

The problem you are facing is indeed pretty strange. If you are passing the same bytes to the Textract API (I assume one of AnalyzeDocument, AnalyzeExpense or AnalyzeID), the result should be the same regardless of whether the call is made from your local computer or from a Lambda function.
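One way to check the "same bytes" assumption is to log a fingerprint of the payload right before the Textract call, once locally and once in Lambda, and compare the two lines. The following is only a minimal sketch: the helper name `fingerprint` is hypothetical, and it assumes the document is sent as raw bytes.

```python
import hashlib


def fingerprint(document_bytes: bytes) -> str:
    """Return the size and SHA-256 digest of the payload sent to Textract."""
    digest = hashlib.sha256(document_bytes).hexdigest()
    return f"{len(document_bytes)} bytes, sha256={digest}"


# Example: log this just before the Textract call in both environments.
# Here a literal stands in for the real document bytes.
print(fingerprint(b"%PDF-1.4 example payload"))
```

If the two fingerprints differ, the input is already different before Textract sees it (for example after a local conversion or preprocessing step), which would point at the container environment rather than at the API.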

From your description it seems you are performing some redundant steps: as your documents are already on S3, you can pass the S3 object location directly to the Textract APIs, thus avoiding the download step.

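import boto3

client = boto3.client('textract')

# Supply either 'Bytes' or 'S3Object'; here only the S3 location is passed.
response = client.analyze_document(
    Document={
        'S3Object': {
            'Bucket': 'string',   # bucket name
            'Name': 'string',     # object key of the document
            'Version': 'string'   # optional; omit to use the latest version
        }
    },
    FeatureTypes=['FORMS'],  # required, for example 'TABLES' and/or 'FORMS'
)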

If you have multiple documents to process, you can also use the asynchronous operations, such as start_document_analysis.
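As a rough sketch of the asynchronous flow, the function below starts a job for an object that is already in S3, polls until it finishes, and then follows `NextToken` to collect every block. The function name, the polling interval and the default feature types are assumptions for illustration; `client` is expected to be a Textract client created with `boto3.client('textract')`.

```python
import time


def analyze_s3_document(client, bucket, key, feature_types=("TABLES", "FORMS"),
                        poll_seconds=5, max_polls=60):
    """Run an asynchronous Textract analysis on an S3 object and return all blocks."""
    job = client.start_document_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=list(feature_types),
    )
    job_id = job["JobId"]

    # Poll until the job is no longer IN_PROGRESS (or give up after max_polls).
    for _ in range(max_polls):
        result = client.get_document_analysis(JobId=job_id)
        if result["JobStatus"] != "IN_PROGRESS":
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"Textract job {job_id} did not finish in time")

    if result["JobStatus"] not in ("SUCCEEDED", "PARTIAL_SUCCESS"):
        raise RuntimeError(f"Textract job {job_id} ended as {result['JobStatus']}")

    # Results are paginated: keep requesting pages while a NextToken is returned.
    blocks = list(result.get("Blocks", []))
    while "NextToken" in result:
        result = client.get_document_analysis(JobId=job_id,
                                              NextToken=result["NextToken"])
        blocks.extend(result.get("Blocks", []))
    return blocks
```

In a production setup, a completion notification through the job's SNS topic is usually preferable to polling; polling is used here only to keep the example short.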

[1] https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/textract/client/analyze_document.html

AWS
EXPERT
answered 6 months ago
  • Hi, thanks for your response, but this does not solve my problem. I am using start_document_analysis to process the documents one at a time. The problem actually seems to be in Docker: I ran some tests and they point to Docker, but even though I am installing the same package versions, it still doesn't work.
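To confirm that the container really resolves to the same versions, one option is to print an environment report in both places and diff the output. This is only a sketch: the function name `environment_report` is hypothetical, and the package list should be replaced with whatever the project actually imports (only boto3 is mentioned in this thread; botocore is listed because boto3 depends on it).

```python
import platform
from importlib import metadata


def environment_report(packages):
    """Return the Python version and installed version of each named package."""
    report = {"python": platform.python_version()}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


# Run this locally and inside the Lambda container, then compare the output.
print(environment_report(["boto3", "botocore"]))
```

Pinned Python packages can still sit on top of different system libraries in the image, so a matching report narrows the search but does not rule the container out.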
