Textract returns JSON with only the first 2 pages

0

Hello. We are developing an app that uses Textract to perform OCR on documents. When we upload a PDF to a bucket and process it via the API, the JSON it returns contains only the first 2 pages of a document that has more than 30. Is this happening because I am still within the 3-month trial period? If so, I want to pay for the service to lift that restriction, but I haven't found where to make the change. Or maybe the problem is something else. I upload the PDF to an S3 bucket first and then process it from there with start_document_text_detection followed by get_document_text_detection. Thanks

asked a year ago · 490 views
3 Answers
1
Accepted Answer

Hi Carlos,

A call to "get_document_text_detection" returns paginated results. When more results remain, the response includes a "NextToken".
Pass that "NextToken" to the next call and repeat until the response no longer contains one; that fetches the rest of the results. [1]

See [2] and [3] for examples of how to use the "NextToken".
Example Code Snippet:

import boto3

def getJobResults(jobId):
    """Collect every paginated response for one Textract job."""
    pages = []
    client = boto3.client('textract')

    # The first call is made without a NextToken.
    response = client.get_document_text_detection(JobId=jobId)
    pages.append(response)
    print("Result set page received: {}".format(len(pages)))

    # Keep fetching while the previous response carried a NextToken.
    nextToken = response.get('NextToken')
    while nextToken:
        response = client.get_document_text_detection(JobId=jobId, NextToken=nextToken)
        pages.append(response)
        print("Result set page received: {}".format(len(pages)))
        nextToken = response.get('NextToken')
    return pages
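
Once all the responses are collected, the detected text still has to be pulled out of them. The following is a minimal sketch of one way to do that, assuming each response carries a "Blocks" list whose LINE entries have "Text" and "Page" fields. The helper name lines_by_page is made up for this illustration and is not part of boto3.

```python
from collections import defaultdict


def lines_by_page(responses):
    """Group the text of LINE blocks by page number across all responses."""
    pages = defaultdict(list)
    for response in responses:
        for block in response.get("Blocks", []):
            # Skip PAGE and WORD blocks so each line of text is counted once.
            if block.get("BlockType") == "LINE":
                # Fall back to page 1 when a block carries no "Page" field.
                pages[block.get("Page", 1)].append(block.get("Text", ""))
    # Return a plain dict ordered by page number.
    return dict(sorted(pages.items()))
```

Called on the list returned by the pagination loop, this should yield one entry per page of the PDF, which is a quick way to confirm that all 30+ pages came back.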

References:
[1] https://docs.aws.amazon.com/textract/latest/dg/API_GetDocumentTextDetection.html
[2] https://github.com/aws-samples/amazon-textract-serverless-large-scale-document-processing/blob/master/src/jobresultsproc.py
[3] https://medium.com/petabytz/automatically-extract-data-using-aws-textract-7a599b80b92

answered a year ago
0

Thanks!!! Works perfectly!!!

answered a year ago
  • Glad that it helped! Thanks.

0

For reference: if you set OutputConfig in asynchronous Textract API calls (which you probably should, because it saves on Get* calls, which are TPS-limited), you can use the function get_full_json_from_output_config(output_config: OutputConfig, job_id: str, s3_client=None) -> dict from the amazon-textract-caller PyPI package.
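
If you would rather not add the package, the same idea can be sketched by hand. This is only a rough illustration, not the package's implementation. It assumes Textract wrote its results under <prefix>/<job id>/ as files with numeric names (1, 2, ...) next to a .s3_access_check marker. The function name read_textract_output and its parameters are hypothetical; the S3 client is passed in, for example boto3.client("s3").

```python
import json


def read_textract_output(s3_client, bucket, prefix, job_id):
    """Merge the Blocks of every result file written for one Textract job."""
    job_prefix = f"{prefix.rstrip('/')}/{job_id}/"
    keys = []
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=job_prefix):
        for obj in page.get("Contents", []):
            name = obj["Key"].rsplit("/", 1)[-1]
            # Assumption: result files have purely numeric names; anything
            # else (such as .s3_access_check) is skipped.
            if name.isdigit():
                keys.append((int(name), obj["Key"]))
    blocks = []
    # Sort numerically so that file 10 comes after file 2.
    for _, key in sorted(keys):
        body = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
        blocks.extend(json.loads(body).get("Blocks", []))
    return blocks
```

The bucket and prefix here are the same values given as OutputConfig={"S3Bucket": ..., "S3Prefix": ...} when starting the job, so no Get* calls are needed to read the results.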

AWS
answered a year ago
