You can do this as follows. I am assuming your project uses Python. To get the text from a document, use the Amazon Textract DetectDocumentText API for single-page documents, or the StartDocumentTextDetection API to run an asynchronous job for multi-page documents. Once you have the response from the API call, you can get the word count as follows.
Assuming response is the JSON response from the Amazon Textract API, you can count the words in it like this in Python.
For single-page documents, using DetectDocumentText:
def get_word_count(response):
    words = [k for k in response['Blocks'] if k['BlockType'] == 'WORD']
    total_word_count = len(words)
    return total_word_count
For multi-page documents, using StartDocumentTextDetection:
def get_word_count(response, page_num):
    words = [k for k in response['Blocks'] if k['BlockType'] == 'WORD' and k['Page'] == page_num]
    total_word_count = len(words)
    return total_word_count
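If you need the count for every page at once rather than one page at a time, the per-page function above can be generalized. The following is a minimal sketch, not part of the original answer: the function name get_word_counts_by_page and the sample_response data are hypothetical, and treating a missing Page field as page 1 is an assumption made here for single-page responses.

```python
from collections import Counter


def get_word_counts_by_page(response):
    """Return a dict mapping page number -> number of WORD blocks."""
    counts = Counter(
        block.get("Page", 1)  # assumption: a block without Page belongs to page 1
        for block in response["Blocks"]
        if block["BlockType"] == "WORD"
    )
    return dict(counts)


# Hand-written sample that only mimics the shape of a Textract response.
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE", "Page": 1},
        {"BlockType": "LINE", "Page": 1, "Text": "Hello world"},
        {"BlockType": "WORD", "Page": 1, "Text": "Hello"},
        {"BlockType": "WORD", "Page": 1, "Text": "world"},
        {"BlockType": "PAGE", "Page": 2},
        {"BlockType": "WORD", "Page": 2, "Text": "Bye"},
    ]
}

page_counts = get_word_counts_by_page(sample_response)
total_words = sum(page_counts.values())
```

Summing the values gives the document-wide word count, so the same pass covers both the per-page and the whole-document case.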
Note: for multi-page documents, StartDocumentTextDetection may generate multiple JSON outputs. The function above expects a single combined JSON output. To combine the multiple JSON files from an asynchronous job started with StartDocumentTextDetection, you can use amazon-textract-caller, which you can install with:
python -m pip install amazon-textract-caller
Usage:
from textractcaller import get_full_json_from_output_config
from textractcaller.t_call import OutputConfig
import boto3
s3 = boto3.client('s3')
op_config = OutputConfig(s3_bucket=<bucket>, s3_prefix=<s3Prefix>)
full_json = get_full_json_from_output_config(output_config=op_config, job_id=<job_id>, s3_client=s3)
Where <bucket> is the Amazon S3 bucket where the async job's output is stored, <s3Prefix> is the prefix, and <job_id> is the async "Job ID" returned by the StartDocumentTextDetection API call.
For example, for s3://my-bucket/my-prefix/43d6d8af4a2b7c850e20c8b135ad2c4d8a04eadb44874a2cada98aa61c/, the bucket is my-bucket, the prefix is my-prefix, and the "Job ID" is 43d6d8af4a2b7c850e20c8b135ad2c4d8a04eadb44874a2cada98aa61c.
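If you would rather not add a helper library, another way to get a combined response is to page through the job results yourself with GetDocumentTextDetection and its NextToken. This is a rough sketch under stated assumptions, not the method described above: the function name get_all_blocks is hypothetical, textract_client is assumed to be a boto3 Textract client that you create yourself, and the job is assumed to have already finished.

```python
def get_all_blocks(textract_client, job_id):
    """Collect the Blocks from every result page of a finished async job.

    textract_client is assumed to be something like boto3.client("textract").
    Returns a dict shaped like a single combined response: {"Blocks": [...]}.
    """
    blocks = []
    next_token = None
    while True:
        kwargs = {"JobId": job_id}
        if next_token:
            kwargs["NextToken"] = next_token
        page = textract_client.get_document_text_detection(**kwargs)

        # This sketch does not poll; it simply refuses unfinished or failed jobs.
        if page.get("JobStatus") != "SUCCEEDED":
            raise RuntimeError(f"Job not finished: {page.get('JobStatus')}")

        blocks.extend(page.get("Blocks", []))
        next_token = page.get("NextToken")
        if not next_token:
            return {"Blocks": blocks}
```

The returned dict can be passed to the multi-page get_word_count function in the same way as the combined JSON, for example get_word_count(get_all_blocks(client, job_id), 1).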
Hope this helps!
Use the DetectDocumentText API:
https://docs.aws.amazon.com/textract/latest/dg/detecting-document-text.html
or its async counterpart: https://docs.aws.amazon.com/textract/latest/dg/api-async.html
In the Blocks section of the result, count the entries with "BlockType": "WORD".
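For the synchronous case, the call and the count can be put together roughly like this. It is only a sketch: the function name count_words_in_s3_document and the bucket and key arguments are hypothetical, the document is assumed to be a single-page file already stored in Amazon S3, and textract_client is assumed to be a boto3 Textract client you create yourself.

```python
def count_words_in_s3_document(textract_client, bucket, key):
    """Run DetectDocumentText on one S3 object and count its WORD blocks.

    textract_client is assumed to be something like boto3.client("textract").
    """
    response = textract_client.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    return sum(1 for block in response["Blocks"] if block["BlockType"] == "WORD")
```

You would call it as count_words_in_s3_document(boto3.client("textract"), "my-bucket", "my-file.png"), where the bucket and file name are placeholders.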