Textract - Word counter implementation in WordPress


I need to add a word counter to my WordPress site that works on an uploaded file (png, jpg, pdf, doc, etc.), like the one on translated.com.

The goal is also to generate a quotation based on the number of characters in the uploaded file (image or text).

I'd be grateful for any help in understanding how to achieve this.


0952
Asked a year ago · 379 views
2 Answers

You can do this as follows. I am assuming your project uses Python. To extract the text from a document, use the Amazon Textract DetectDocumentText API for single-page documents, or the StartDocumentTextDetection API, which runs an asynchronous job, for multi-page documents. Once you have the response from the API call, you can count the words in it as shown below.

Assuming response is the JSON response from the Amazon Textract API, you can count all the words in it with something like this in Python.

For single-page documents, using DetectDocumentText:

def get_word_count(response):
    words = [ k for k in response['Blocks'] if k['BlockType'] == 'WORD']
    total_word_count = len(words)
    return total_word_count

For multi-page documents, using StartDocumentTextDetection:

def get_word_count(response, page_num):
    words = [ k for k in response['Blocks'] if k['BlockType'] == 'WORD' and  k['Page'] == page_num]
    total_word_count = len(words)
    return total_word_count
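Since the question also asks for a character count, here is a minimal sketch that returns both numbers from the same response. It assumes that each WORD block carries its recognized text in the `Text` field; `get_text_stats` and `sample_response` are hypothetical names used for illustration only, and the sample dictionary is a hand-made stand-in, not real Textract output. Spaces and line breaks are not counted, because only the text of the WORD blocks is summed.

```python
def get_text_stats(response):
    """Return (word_count, char_count) for a Textract response dict."""
    # Keep only the text of WORD blocks; PAGE and LINE blocks are skipped
    # so that no word is counted twice.
    words = [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "WORD"
    ]
    word_count = len(words)
    # Characters are summed per word, so whitespace is not included.
    char_count = sum(len(word) for word in words)
    return word_count, char_count


# Hand-made stand-in for a Textract response (illustrative only).
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Hello world"},
        {"BlockType": "WORD", "Text": "Hello"},
        {"BlockType": "WORD", "Text": "world"},
    ]
}

print(get_text_stats(sample_response))  # prints (2, 10)
```

Either number can then be multiplied by your own per-word or per-character rate to produce the quotation.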

Note: for multi-page documents, StartDocumentTextDetection may generate multiple JSON outputs. The function above expects a single combined JSON output. To combine the JSON outputs of an asynchronous job started with StartDocumentTextDetection, you can use amazon-textract-caller, which you can install with:

python -m pip install amazon-textract-caller

Usage:

from textractcaller import get_full_json_from_output_config
from textractcaller.t_call import OutputConfig
import boto3

s3 = boto3.client('s3')

# Replace the quoted placeholders with your own values.
op_config = OutputConfig(s3_bucket='<bucket>', s3_prefix='<s3Prefix>')
full_json = get_full_json_from_output_config(output_config=op_config, job_id='<job_id>', s3_client=s3)

Here <bucket> is the Amazon S3 bucket where the async job's output is stored, <s3Prefix> is the prefix, and <job_id> is the "Job ID" returned by the StartDocumentTextDetection API call.

For example, for s3://my-bucket/my-prefix/43d6d8af4a2b7c850e20c8b135ad2c4d8a04eadb44874a2cada98aa61c/, the bucket is my-bucket, the prefix is my-prefix, and the "Job ID" is 43d6d8af4a2b7c850e20c8b135ad2c4d8a04eadb44874a2cada98aa61c.
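If you would rather not read the output files from S3, an alternative is to page through the job results with the GetDocumentTextDetection API and merge the blocks yourself. The following is a rough sketch of that approach, not the method textract-caller uses. It assumes the job has already finished successfully, and `get_all_blocks` is a hypothetical helper name; `textract_client` is expected to be a boto3 Textract client, for example `boto3.client('textract')`.

```python
def get_all_blocks(textract_client, job_id):
    """Merge every result page of a finished async job into one dict."""
    blocks = []
    next_token = None
    while True:
        kwargs = {"JobId": job_id}
        if next_token:
            # Ask for the next page of results.
            kwargs["NextToken"] = next_token
        result = textract_client.get_document_text_detection(**kwargs)
        blocks.extend(result.get("Blocks", []))
        next_token = result.get("NextToken")
        if not next_token:
            # No token means this was the last page.
            break
    # Same shape as a single response, so the counting functions can use it.
    return {"Blocks": blocks}
```

The returned dictionary has the same `Blocks` layout as a single response, so it can be passed to the multi-page get_word_count function.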

Hope this helps!

AWS
Expert
Anjan
Answered a year ago

Use the detect-document-text API (https://docs.aws.amazon.com/textract/latest/dg/detecting-document-text.html) or its async counterpart (https://docs.aws.amazon.com/textract/latest/dg/api-async.html). In the Blocks section of the result, count the entries with "BlockType": "WORD".
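As a rough illustration of the synchronous path, the sketch below sends the raw bytes of an uploaded file to DetectDocumentText and counts the WORD blocks in the result. `count_words_in_file` is a hypothetical helper name, and `textract_client` is assumed to be a boto3 Textract client, for example `boto3.client('textract')`, with credentials and region already configured.

```python
def count_words_in_file(textract_client, file_bytes):
    """Run synchronous text detection on raw file bytes and count words."""
    response = textract_client.detect_document_text(
        Document={"Bytes": file_bytes}
    )
    # Each detected word is returned as one block of type WORD.
    return sum(
        1
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "WORD"
    )
```

In a WordPress setup, the upload handler would need to pass the file to a small service running this code, since the example is in Python rather than PHP.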

AWS
Answered a year ago
