Questions tagged with Amazon Textract

Textract completion msg not published to SNS Topic using Cognito user

I have read the instructions at https://docs.aws.amazon.com/textract/latest/dg/api-async-roles.html#api-async-roles-all-topics. My setup is somewhat different because I am using a Cognito user. To enable Textract to publish messages to SNS, I pass a role with the relevant permissions to Textract so that it can call SNS. I am able to call the StartDocumentAnalysis method and get a response, but the SNS message is never published.

The weird thing is that on a few occasions I did see several data points in CloudWatch's SNS metric 'NumberOfNotificationsDelivered', indicating that the messages were published. However, they are almost all gone now. What is wrong with the below?

The Cognito authorized user has the CognitoAuthRole role:

    CognitoAuthRole:
      Type: AWS::IAM::Role
      Properties:
        AssumeRolePolicyDocument:
          Version: '2012-10-17'
          Statement:
            - Effect: Allow
              Principal:
                Federated: cognito-identity.amazonaws.com
              Action: sts:AssumeRoleWithWebIdentity
              Condition:
                StringEquals:
                  cognito-identity.amazonaws.com:aud: !Ref CoginitoIdentityPool
                ForAnyValue:StringLike:
                  cognito-identity.amazonaws.com:amr: authenticated
            - Effect: Allow
              Principal:
                Service: lambda.amazonaws.com
              Action: sts:AssumeRole
            - Effect: Allow
              Principal:
                Service: textract.amazonaws.com
              Action: sts:AssumeRole
        Description: Used by cognito authenticated users
        ManagedPolicyArns:
          - !Ref DesktopPolicy  # definition is immediately below

And the desktop policy is:

    DesktopPolicy:
      Type: AWS::IAM::ManagedPolicy
      Properties:
        ManagedPolicyName: DesktopBackup
        PolicyDocument:
          Version: '2012-10-17'
          Statement:
            - Effect: Allow
              Action:
                - 'iam:GetRole'
                - 'iam:PassRole'
              Resource: !GetAtt "TextractEc2Role.Arn"  # definition is below
            - Effect: Allow
              Action:
                - "sns:Publish"
              Resource:
                - arn:aws:sns:us-east-1:xxxxxxxxxxxx:AmazonTextractTopic
            - Effect: Allow
              Action:
                - "textract:GetDocumentAnalysis"
                - "textract:GetDocumentTextDetection"
                - "textract:StartDocumentAnalysis"
                - "textract:StartDocumentTextDetection"
              Resource:
                - "*"

The role that is passed to the Textract service using iam:PassRole is:

    TextractEc2Role:
      Type: AWS::IAM::Role
      Properties:
        AssumeRolePolicyDocument:
          Version: '2012-10-17'
          Statement:
            - Effect: Allow
              Principal:
                Service: textract.amazonaws.com
              Action: sts:AssumeRole
        ManagedPolicyArns:
          - arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess
          - arn:aws:iam::aws:policy/AmazonSNSFullAccess
          - arn:aws:iam::aws:policy/AmazonTextractFullAccess
          - arn:aws:iam::aws:policy/service-role/AmazonTextractServiceRole
        RoleName: TextractEc2

Edited by: L Jones on Sep 1, 2020 6:40 PM
Edited by: L Jones on Sep 1, 2020 6:41 PM
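For comparison, here is a minimal sketch of how the request for an asynchronous analysis job is usually assembled. The RoleArn inside NotificationChannel is the role that Textract itself assumes to publish the completion message (TextractEc2Role in the template above), so that role's trust policy and SNS permissions are the ones to check first. The helper name, bucket, key, and account ID are hypothetical and only illustrate the shape of the call.

```python
def build_analysis_request(bucket, key, topic_arn, textract_role_arn,
                           feature_types=("TABLES", "FORMS"), job_tag=None):
    """Build keyword arguments for Textract's StartDocumentAnalysis call.

    Hypothetical helper: it only assembles the request dictionary and does
    not contact AWS.
    """
    request = {
        "DocumentLocation": {"S3Object": {"Bucket": bucket, "Name": key}},
        "FeatureTypes": list(feature_types),
        "NotificationChannel": {
            # Topic that receives the job-completion message.
            "SNSTopicArn": topic_arn,
            # Role assumed by Textract (not by the caller) to publish to SNS.
            "RoleArn": textract_role_arn,
        },
    }
    if job_tag:
        request["JobTag"] = job_tag
    return request


# Typical use with boto3 (requires AWS credentials, so it is left commented):
#   import boto3
#   textract = boto3.client("textract")
#   response = textract.start_document_analysis(**build_analysis_request(...))
```

Keeping the request in a plain dictionary makes it easy to log exactly which role and topic were sent to Textract when a notification does not arrive.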
1 answer, 0 votes, 1 view. Asked by L Jones a year ago.

Textract Error When PDF Is Uploaded to Folder

Hi, I currently have a Lambda that is triggered when a PDF document is uploaded to an S3 bucket. Once called, the Lambda calls the start_document_text_detection method to extract the document text. The job gets published to an SNS topic, and on completion another Lambda is triggered that calls the get_document_text_detection method to retrieve the results and upload them into the same bucket.

If I upload a document to the root of the S3 bucket, everything works great. My issue is that when I create a folder within the S3 bucket and upload a PDF there, the trigger fires but I get the following error:

```
[ERROR] InvalidParameterException: An error occurred (InvalidParameterException) when calling the StartDocumentTextDetection operation: Request has invalid parameters
Traceback (most recent call last):
  File "/var/task/lambda_function.py", line 33, in lambda_handler
    raise e
  File "/var/task/lambda_function.py", line 26, in lambda_handler
    'SNSTopicArn': os.environ['sns_arn']
  File "/opt/python/botocore/client.py", line 357, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/opt/python/botocore/client.py", line 661, in _make_api_call
    raise error_class(parsed_response, operation_name)
```

SNSTopicArn is part of the NotificationChannel argument to the start_document_text_detection call. The role given as RoleArn also has the FullSNSAccess policy, and the JobTag is the document name, if that helps. I checked the Textract limits and everything is within bounds. I also tried adjusting the prefix in the Events property of the S3 bucket to match the folder, with no luck.

Any advice would be appreciated.
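One possible cause, offered as an assumption rather than a confirmed diagnosis: the Textract API reference documents JobTag as limited to 64 characters matching `[a-zA-Z0-9_.\-:]+`, so a tag built from a key such as `folder/file.pdf` contains a `/` that the pattern does not allow. Object keys in S3 event notifications are also URL-encoded. The sketch below decodes the key and derives a tag that fits the documented pattern; the function names and the fallback tag are hypothetical.

```python
import re
from urllib.parse import unquote_plus

# Characters outside the pattern documented for Textract's JobTag.
_JOB_TAG_DISALLOWED = re.compile(r"[^a-zA-Z0-9_.\-:]")
JOB_TAG_MAX_LEN = 64


def object_location(record):
    """Return (bucket, key) from one S3 event record, URL-decoding the key."""
    bucket = record["s3"]["bucket"]["name"]
    key = unquote_plus(record["s3"]["object"]["key"])
    return bucket, key


def make_job_tag(key):
    """Derive a JobTag from an object key.

    Disallowed characters (including '/') become '_', and only the last
    64 characters are kept so the file name survives truncation.
    """
    tag = _JOB_TAG_DISALLOWED.sub("_", key)
    return tag[-JOB_TAG_MAX_LEN:] or "untagged"
```

The decoded key is what goes into DocumentLocation, while the sanitized tag goes into JobTag; the two no longer need to be the same string.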
1 answer, 0 votes, 1 view. Asked by kylem1989 2 years ago.

How to feed Textract output into Comprehend Medical

Hi,

I'm trying to modify this code: https://github.com/aws-samples/amazon-textract-enhancer/blob/master/functions/detect-text-postprocess-page.py

I want to be able to pipe the results of the Textract analysis into the Comprehend Medical API. I've tried hacking the code, but unfortunately I'm not a Python expert. Any ideas how best to approach this?

Cheers, Jon

    from textract_util import *
    import io
    import os
    import json
    import time
    import boto3

    def lambda_handler(event, context):
        # Initialize Boto resources and clients
        s3 = boto3.resource('s3')
        textract = boto3.client('textract')
        dynamodb = boto3.client('dynamodb')
        table_name = os.environ['table_name']
        file_list = []

        if "Records" in event:
            records = event['Records']
            numRecords = len(records)
            print("{} messages received".format(numRecords))
            for record in records:
                documentBlocks = None
                num_pages = 0
                num_lines = 0
                bucket = ""
                upload_prefix = ""
                textractJobId = ""
                textractStatus = ""
                textractAPI = ""
                textractJobTag = ""
                textractS3ObjectName = ""
                textractS3Bucket = ""
                textractTimestamp = ""

                # Read the Textract completion message delivered through SNS
                if 'Sns' in record.keys():
                    sns = record['Sns']
                    if 'Message' in sns.keys():
                        message = json.loads(sns['Message'])
                        textractJobId = message['JobId']
                        print("{} = {}".format("JobId", textractJobId))
                        textractStatus = message['Status']
                        print("{} = {}".format("Status", textractStatus))
                        textractTimestamp = str(int(float(message['Timestamp'])/1000))
                        print("{} = {}".format("Timestamp", textractTimestamp))
                        textractAPI = message['API']
                        print("{} = {}".format("API", textractAPI))
                        textractJobTag = message['JobTag']
                        print("{} = {}".format("JobTag", textractJobTag))
                        documentLocation = message['DocumentLocation']
                        textractS3ObjectName = documentLocation['S3ObjectName']
                        print("{} = {}".format("S3ObjectName", textractS3ObjectName))
                        textractS3Bucket = documentLocation['S3Bucket']
                        print("{} = {}".format("S3Bucket", textractS3Bucket))

                # Split the object key into folder path, base name and extension
                bucket = textractS3Bucket
                document_path = textractS3ObjectName[:textractS3ObjectName.rfind("/")] if textractS3ObjectName.find("/") >= 0 else ""
                document_name = textractS3ObjectName[textractS3ObjectName.rfind("/")+1:textractS3ObjectName.rfind(".")] if textractS3ObjectName.find("/") >= 0 else textractS3ObjectName[:textractS3ObjectName.rfind(".")]
                document_type = textractS3ObjectName[textractS3ObjectName.rfind(".")+1:].upper()

                if document_path == "":
                    upload_prefix = textractJobId
                else:
                    upload_prefix = "{}/{}".format(document_path, textractJobId)
                print("upload_prefix = " + upload_prefix)

                num_pages, documentBlocks = GetTextDetectionResult(textract, textractJobId)

                if documentBlocks is not None and len(documentBlocks) > 0:
                    print("{} Blocks retrieved".format(len(documentBlocks)))

                    # Extract lines of text into a Python dictionary by parsing the raw JSON from Textract
                    blocks = groupBlocksByType(documentBlocks)
                    document_text, num_lines = extractTextBody(blocks)

                    # Write the extracted text to a JSON document and upload it to S3
                    json_document = "{}-text.json".format(document_name)
                    json_file = open("/tmp/" + json_document, 'w')
                    json_file.write(json.dumps(document_text, indent=4, sort_keys=True))
                    json_file.close()
                    s3.meta.client.upload_file("/tmp/" + json_document, bucket, "{}/{}".format(upload_prefix, json_document))

                    try:
                        response = dynamodb.update_item(
                            TableName=table_name,
                            Key={
                                'JobId': {'S': textractJobId},
                                'JobType': {'S': 'TextDetection'}
                            },
                            ExpressionAttributeNames={"#tf": "TextFiles", "#jst": "JobStatus", "#jct": "JobCompleteTimeStamp", "#nl": "NumLines", "#np": "NumPages"},
                            UpdateExpression='SET #tf = list_append(#tf, :text_files), #jst = :job_status, #jct = :job_complete, #nl = :num_lines, #np = :num_pages',
                            ExpressionAttributeValues={
                                ":text_files": {"L": [{"S": "{}/{}".format(upload_prefix, json_document)}]},
                                ":job_status": {"S": textractStatus},
                                ":job_complete": {"N": str(textractTimestamp)},
                                ":num_lines": {"N": str(num_lines)},
                                ":num_pages": {"N": str(num_pages)}
                            }
                        )
                    except Exception as e:
                        print('DynamoDB Insertion Error is: {0}'.format(e))
                else:
                    # No blocks retrieved: record only the job status and completion time
                    try:
                        response = dynamodb.update_item(
                            TableName=table_name,
                            Key={
                                'JobId': {'S': textractJobId},
                                'JobType': {'S': 'TextDetection'}
                            },
                            ExpressionAttributeNames={"#jst": "JobStatus", "#jct": "JobCompleteTimeStamp"},
                            UpdateExpression='SET #jst = :job_status, #jct = :job_complete',
                            ExpressionAttributeValues={
                                ":job_status": {"S": textractStatus},
                                ":job_complete": {"N": str(textractTimestamp)}
                            }
                        )
                    except Exception as e:
                        print('DynamoDB Insertion Error is: {0}'.format(e))

                # Collect the URLs of all JSON files written under the upload prefix
                s3_result = s3.meta.client.list_objects_v2(Bucket=bucket, Prefix="{}/".format(upload_prefix), Delimiter="/")
                if 'Contents' in s3_result:
                    for key in s3_result['Contents']:
                        if key['Key'].endswith("json"):
                            file_list.append("https://s3.amazonaws.com/{}/{}".format(bucket, key['Key']))
                while s3_result['IsTruncated']:
                    continuation_key = s3_result['NextContinuationToken']
                    s3_result = s3.meta.client.list_objects_v2(Bucket=bucket, Prefix="{}/".format(upload_prefix), Delimiter="/", ContinuationToken=continuation_key)
                    for key in s3_result['Contents']:
                        if key['Key'].endswith("json"):
                            file_list.append("https://s3.amazonaws.com/{}/{}".format(bucket, key['Key']))

        print(file_list)
        return file_list

Thanks, Jon
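One way to approach this, sketched under stated assumptions: collect the text of the LINE blocks that Textract returns, split it into chunks, and pass each chunk to Comprehend Medical's `detect_entities_v2` operation. The function names and the 10,000-character default are hypothetical choices; the actual size limit for the Text parameter should be checked in the Comprehend Medical API reference. The client is passed in as an argument, so nothing here contacts AWS until a real `boto3.client("comprehendmedical")` is supplied.

```python
def lines_from_blocks(blocks):
    """Return the text of every LINE block in a list of Textract blocks."""
    return [b["Text"] for b in blocks
            if b.get("BlockType") == "LINE" and b.get("Text")]


def chunk_lines(lines, max_chars=10000):
    """Join lines with newlines into chunks of at most max_chars characters.

    A single line longer than max_chars still becomes its own chunk.
    """
    chunks, current, size = [], [], 0
    for line in lines:
        extra = len(line) + (1 if current else 0)  # +1 for the joining newline
        if current and size + extra > max_chars:
            chunks.append("\n".join(current))
            current, size = [], 0
            extra = len(line)
        current.append(line)
        size += extra
    if current:
        chunks.append("\n".join(current))
    return chunks


def detect_medical_entities(blocks, comprehend_medical, max_chars=10000):
    """Send Textract line text to Comprehend Medical and collect the entities.

    Offsets in the returned entities are relative to each chunk, not to the
    whole document.
    """
    entities = []
    for chunk in chunk_lines(lines_from_blocks(blocks), max_chars):
        response = comprehend_medical.detect_entities_v2(Text=chunk)
        entities.extend(response.get("Entities", []))
    return entities
```

In the Lambda above, `documentBlocks` could be passed as `blocks` right after `GetTextDetectionResult` returns, and the resulting entity list written to S3 alongside the text JSON.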
1 answer, 0 votes, 0 views. Asked by cloudstartuptech 2 years ago.