Textract S3 Permissions


Hi, I'm trying to analyze a multipage PDF using Textract and the start_document_analysis API. I understand that the document I'm analyzing must be in an S3 bucket. However, when calling this function I receive the following error message:

InvalidS3ObjectException: An error occurred (InvalidS3ObjectException) when calling the StartDocumentAnalysis operation: Unable to get object metadata from S3. Check object key, region and/or access permissions.

I've verified that the bucket name and key are correct, and the document works in the test console, which leads me to think this is related to permissions. Here is my test script (note that I am running this from my local computer, not Lambda):

import boto3

# One session, so both clients share the same credentials and default region
session = boto3.Session(profile_name="default")

s3 = session.client("s3")
tx = session.client("textract")
doc = "test.pdf"
bucket = "test"

s3.upload_file(doc, bucket, doc)
resp = tx.start_document_analysis(
    DocumentLocation = {
        "S3Object": {
            "Bucket": bucket,
            "Name": doc
        }
    },
    FeatureTypes = ["TABLES"]
)

How do I configure my bucket to allow access from Textract?

Thanks

2 Answers

Hi there. In my testing, this API call needs only the following S3 permission:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::rafaxu-bucket/example.pdf"
        }
    ]
}

There are a few other things you can check as well:

  1. Is there an S3 bucket policy that limits access?
  2. Is a KMS key applied to the object? If so, your IAM user/role may also need the related KMS permissions.
  3. You can add boto3.set_stream_logger(name='botocore') to your code to get debug output, which may help.

I recommend separating the S3 upload and the Textract API call into different code snippets for troubleshooting.
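To look at checks 1 and 2 separately from the Textract call, something like the following may help. It is only a minimal sketch: it assumes boto3's `head_object` call (which is authorized by `s3:GetObject`) is a reasonable stand-in for the metadata lookup that fails, and the helper names `describe_encryption` and `check_object` are hypothetical, not part of boto3.

```python
def describe_encryption(head_response):
    """Summarize the encryption fields of an S3 head_object response dict."""
    sse = head_response.get("ServerSideEncryption")
    if sse is None:
        return "no server-side encryption reported"
    if sse.startswith("aws:kms"):
        key_id = head_response.get("SSEKMSKeyId", "unknown key")
        # Assumption: reading a KMS-encrypted object also needs KMS permissions
        return f"SSE-KMS with key {key_id}: check KMS permissions for the caller"
    return f"server-side encryption: {sse}"


def check_object(bucket, key, profile_name="default"):
    """Fetch object metadata with the same profile the Textract call uses."""
    # Imported here so describe_encryption can be used without boto3 installed
    import boto3
    from botocore.exceptions import ClientError

    s3 = boto3.Session(profile_name=profile_name).client("s3")
    try:
        head = s3.head_object(Bucket=bucket, Key=key)
    except ClientError as err:
        # A 403 usually points at IAM or bucket policy, a 404 at a wrong key
        return f"head_object failed: {err.response['Error']['Code']}"
    return describe_encryption(head)
```

Calling `check_object("rafaxu-bucket", "example.pdf")` with the profile used for Textract should then show whether the object is readable at all and whether a KMS key is involved.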

Here is my testing code, which is a working example:

import boto3
boto3.set_stream_logger(name='botocore')

s = boto3.Session(profile_name="default")

tx = s.client("textract")
doc = "example.pdf"
bucket = "rafaxu-bucket"

resp = tx.start_document_analysis(
    DocumentLocation={
        "S3Object": {
            "Bucket": bucket,
            "Name": doc
        }
    },
    FeatureTypes=["TABLES"]
)

print(resp)

Here is the IAM policy attached to my IAM user:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::rafaxu-bucket/example.pdf"
        }
    ]
}
  • Plus Textract full access, which covers only the Textract API itself.

If I remove the policy, I do get this error: botocore.errorfactory.InvalidS3ObjectException: An error occurred (InvalidS3ObjectException) when calling the StartDocumentAnalysis operation: Unable to get object metadata from S3. Check object key, region and/or access permissions.
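For a different bucket or key, the same policy document can be generated rather than edited by hand. This is a minimal sketch; the function name `build_get_object_policy` is hypothetical, and it assumes the policy is attached to the IAM user or role that makes the Textract call (an identity policy, as in this answer), not to the bucket.

```python
import json


def build_get_object_policy(bucket, key):
    """Build an identity policy allowing s3:GetObject on a single object."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "s3:GetObject",
                # S3 object ARNs have the form arn:aws:s3:::<bucket>/<key>
                "Resource": f"arn:aws:s3:::{bucket}/{key}",
            }
        ],
    }


policy = build_get_object_policy("rafaxu-bucket", "example.pdf")
print(json.dumps(policy, indent=4))
```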

Hope that helps.

answered a year ago
  • Sorry, I'm not sure where that IAM policy needs to be applied. Not on the bucket, right? I've created a new user with the S3FullAccess and TextractFullAccess policies applied, and I'm now using that as the account executing the code, but I'm still running into the same issue. Thank you for your help.

Accepted Answer

Another potential cause is that the document's bucket and the Textract client are in different AWS regions. Make sure the Textract call is made in the same region as the bucket.

import boto3

# when your bucket is in us-east-2
textract_client = boto3.client('textract', region_name='us-east-2')
AWS
answered a year ago
  • Hi, thank you for the response. Incredibly, that seems to have worked... Doesn't instantiating both the s3 and textract clients from the same session object ensure they all use the same region?

  • @danem: The bucket region is defined when the bucket is created, not when the boto3 client session is instantiated. So every S3 bucket is 'bound' to a specific region. Textract on the other hand is available in most regions and when a boto3 client session is instantiated, it will execute the Textract API call against that region.
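If the bucket's region is not known up front, it can be looked up and passed to the Textract client instead of being hard-coded. The sketch below assumes the caller has the `s3:GetBucketLocation` permission; the helper names `region_from_location_constraint` and `textract_client_for_bucket` are hypothetical, not boto3 functions.

```python
def region_from_location_constraint(constraint):
    """Map the LocationConstraint from get_bucket_location to a region name."""
    if constraint is None:
        return "us-east-1"  # S3 reports no constraint for us-east-1 buckets
    if constraint == "EU":
        return "eu-west-1"  # legacy value that some older buckets return
    return constraint


def textract_client_for_bucket(bucket, profile_name="default"):
    """Create a Textract client in the same region as the given bucket."""
    # Imported here so the mapping helper can be used without boto3 installed
    import boto3

    session = boto3.Session(profile_name=profile_name)
    location = session.client("s3").get_bucket_location(Bucket=bucket)
    region = region_from_location_constraint(location["LocationConstraint"])
    return session.client("textract", region_name=region)
```

With this, `tx = textract_client_for_bucket("test")` would replace the plain `client("textract")` call in the original script, so the Textract request goes to the bucket's region regardless of the profile's default.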
