Hi Carlos,
A call to "get_document_text_detection" returns paginated results. When more results remain, the response includes a "NextToken".
Pass that "NextToken" to the next call, and repeat until no token is returned, to fetch the rest of the results. [1]
Please have a look at [2] and [3] for examples of how to use the "NextToken".
Example Code Snippet:
import boto3

def getJobResults(jobId):
    # Collect every page of results for the given Textract job.
    pages = []
    client = boto3.client('textract')
    response = client.get_document_text_detection(JobId=jobId)
    pages.append(response)
    print("Resultset page received: {}".format(len(pages)))
    # NextToken is present only when more pages remain.
    nextToken = response.get('NextToken')
    while nextToken:
        response = client.get_document_text_detection(JobId=jobId, NextToken=nextToken)
        pages.append(response)
        print("Resultset page received: {}".format(len(pages)))
        nextToken = response.get('NextToken')
    return pages
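Once all pages are collected, you will usually want the detected text rather than the raw responses. Here is a minimal sketch of that step, assuming each page is a response dict with a "Blocks" list and that you only want the LINE blocks. The helper name extract_lines is hypothetical and not part of boto3 or Textract.

```python
def extract_lines(pages):
    """Return the text of every LINE block, in the order the pages were fetched."""
    lines = []
    for page in pages:
        # Each response page carries its own list of Blocks.
        for block in page.get("Blocks", []):
            # Keep only LINE blocks; PAGE and WORD blocks are skipped.
            if block.get("BlockType") == "LINE":
                lines.append(block.get("Text", ""))
    return lines
```

Usage would look like extract_lines(getJobResults(jobId)), which gives a flat list of strings, one per detected line.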
References:
[1] https://docs.aws.amazon.com/textract/latest/dg/API_GetDocumentTextDetection.html
[2] https://github.com/aws-samples/amazon-textract-serverless-large-scale-document-processing/blob/master/src/jobresultsproc.py
[3] https://medium.com/petabytz/automatically-extract-data-using-aws-textract-7a599b80b92
Also, for reference: if you use the OutputConfig in asynchronous Textract API calls (which you probably should, because it saves on Get* calls, which are TPS limited), you can read the results with this function from the amazon-textract-caller PyPI package:
def get_full_json_from_output_config(output_config: OutputConfig, job_id: str, s3_client=None) -> dict:
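To give an idea of what reading the OutputConfig results involves, here is a rough sketch. It is not the amazon-textract-caller implementation. It assumes Textract wrote the result parts to the bucket under "<prefix>/<job_id>/" as objects named "1", "2", and so on, each holding JSON with a "Blocks" list; please verify that layout against your own bucket. The function name read_textract_output is hypothetical, and s3_client is expected to be a boto3 S3 client.

```python
import json

def read_textract_output(s3_client, bucket, prefix, job_id):
    """Merge the Blocks from every result part found under prefix/job_id/."""
    base = "{}/{}/".format(prefix.rstrip("/"), job_id)
    keys = []
    kwargs = {"Bucket": bucket, "Prefix": base}
    while True:
        listing = s3_client.list_objects_v2(**kwargs)
        for obj in listing.get("Contents", []):
            name = obj["Key"][len(base):]
            # Assumption: result parts have purely numeric names; skip anything else.
            if name.isdigit():
                keys.append((int(name), obj["Key"]))
        if not listing.get("IsTruncated"):
            break
        kwargs["ContinuationToken"] = listing["NextContinuationToken"]
    blocks = []
    # Sort numerically so that part "10" comes after part "2".
    for _, key in sorted(keys):
        body = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
        blocks.extend(json.loads(body).get("Blocks", []))
    return blocks
```

Usage would be something like read_textract_output(boto3.client('s3'), bucket, prefix, job_id), with the same bucket and prefix you passed in the OutputConfig. This reads from S3 only, so it does not count against the Get* TPS limits.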
Glad that it helped! Thanks.