Textract returns JSON with only the first 2 pages

0

Hello. We are trying to develop an app that uses Textract to perform OCR on documents, but when we upload PDF documents to a bucket via the API, it returns a JSON file with only the first 2 pages of a document that has more than 30. My question is: is this happening because I am still within the 3-month trial period? If so, I want to pay for the service to unlock that restriction, but I haven't found where to make the change. Or maybe the problem is something else. I am using an S3 bucket to upload the PDF first and then process it from there with start_document_text_detection followed by get_document_text_detection. Thanks

Asked a year ago, 495 views
3 Answers
1
Accepted Answer

Hi Carlos,

A call to "get_document_text_detection" returns paginated results, so a single response holds only part of the output. When more results remain, the response also includes a "NextToken".
Pass that "NextToken" to the next call, and repeat until no token comes back, to fetch the rest of the results. [1]

Please have a look at [2] and [3] for examples of how to use the "NextToken".
Example Code Snippet:

import boto3

def getJobResults(jobId):
    """Collect every paginated response for a Textract text detection job."""
    pages = []
    client = boto3.client('textract')

    # The first call takes no token and returns the first batch of results.
    response = client.get_document_text_detection(JobId=jobId)
    pages.append(response)
    print("Resultset page received: {}".format(len(pages)))

    # Keep requesting batches for as long as a NextToken comes back.
    nextToken = response.get('NextToken')
    while nextToken:
        response = client.get_document_text_detection(JobId=jobId, NextToken=nextToken)
        pages.append(response)
        print("Resultset page received: {}".format(len(pages)))
        nextToken = response.get('NextToken')
    return pages
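
Once you have the list returned by getJobResults, you will probably want to confirm that all 30+ document pages are really there. Here is a minimal sketch of one way to do that, assuming each response carries a "Blocks" list whose LINE blocks have "Page" and "Text" fields. The helper name lines_by_page is my own, not part of boto3 or Textract.

```python
from collections import defaultdict

def lines_by_page(responses):
    """Group detected text lines by document page number.

    `responses` is a list of get_document_text_detection responses,
    such as the list returned by getJobResults.
    """
    pages = defaultdict(list)
    for response in responses:
        for block in response.get("Blocks", []):
            # Only LINE blocks are kept; PAGE and WORD blocks are skipped.
            if block.get("BlockType") == "LINE":
                # Assumption: fall back to page 1 if "Page" is missing.
                pages[block.get("Page", 1)].append(block.get("Text", ""))
    # Return a plain dict ordered by page number.
    return dict(sorted(pages.items()))
```

Calling len(lines_by_page(getJobResults(jobId))) should then give the number of pages that contain at least one line of text.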

References:
[1] https://docs.aws.amazon.com/textract/latest/dg/API_GetDocumentTextDetection.html
[2] https://github.com/aws-samples/amazon-textract-serverless-large-scale-document-processing/blob/master/src/jobresultsproc.py
[3] https://medium.com/petabytz/automatically-extract-data-using-aws-textract-7a599b80b92

Answered a year ago
0

Thanks!!! Works perfectly!!!

Answered a year ago
0

Also, for reference: if you set OutputConfig in asynchronous Textract API calls (which you probably should, because it saves on Get* calls, which are TPS-limited), you can use the function get_full_json_from_output_config(output_config: OutputConfig, job_id: str, s3_client=None) -> dict from the amazon-textract-caller PyPI package.
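
If you would rather not add the package, here is a rough sketch of the same idea with plain boto3 clients passed in as arguments. It rests on some assumptions: that Textract writes the result files under "<prefix>/<job id>/" with numeric names (1, 2, ...), that the job has already finished successfully before you read them, and that the bucket and prefix are your own. Both function names are hypothetical helpers, not library functions.

```python
import json

def start_job_with_output_config(textract, bucket, key, output_bucket, output_prefix):
    """Start an async text detection job that writes its results to S3."""
    response = textract.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        OutputConfig={"S3Bucket": output_bucket, "S3Prefix": output_prefix},
    )
    return response["JobId"]

def load_blocks_from_s3(s3, output_bucket, output_prefix, job_id):
    """Read every result file for a finished job and merge the Blocks."""
    # Assumption: results live under "<prefix>/<job_id>/" as files named 1, 2, ...
    prefix = "{}/{}/".format(output_prefix.rstrip("/"), job_id)
    numbered_keys = []
    paginator = s3.get_paginator("list_objects_v2")
    for listing in paginator.paginate(Bucket=output_bucket, Prefix=prefix):
        for obj in listing.get("Contents", []):
            name = obj["Key"][len(prefix):]
            # Skip anything that is not a numbered result file.
            if name.isdigit():
                numbered_keys.append((int(name), obj["Key"]))
    blocks = []
    # Sort numerically so that file 10 comes after file 2.
    for _, object_key in sorted(numbered_keys):
        body = s3.get_object(Bucket=output_bucket, Key=object_key)["Body"].read()
        blocks.extend(json.loads(body).get("Blocks", []))
    return blocks
```

You would call these with textract = boto3.client("textract") and s3 = boto3.client("s3"). Reading from S3 this way makes no Get* calls at all, so the TPS limit on those calls does not apply.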

AWS
Answered a year ago
