Textract S3 Permissions

0

Hi, I'm trying to analyze a multipage PDF using Textract and the start_document_analysis API. I understand that the document I'm analyzing must be present in an S3 bucket. However, when calling this function, I receive the following error message:

InvalidS3ObjectException: An error occurred (InvalidS3ObjectException) when calling the StartDocumentAnalysis operation: Unable to get object metadata from S3. Check object key, region and/or access permissions.

I've verified that the bucket name and key are correct, and the document works in the test console, which leads me to think this is related to permissions. Here is my test script (note: I am running this from my local computer, NOT Lambda):

import boto3
session = boto3.Session(profile_name="default")

s3 = session.client("s3")
tx = session.client("textract")
doc = "test.pdf"
bucket = "test"

s3.upload_file(doc, bucket, doc)
resp = tx.start_document_analysis(
    DocumentLocation = {
        "S3Object": {
            "Bucket": bucket,
            "Name": doc
        }
    },
    FeatureTypes = ["TABLES"]
)

How do I configure my bucket to allow access from Textract?

Thanks

2 Answers
2
Accepted Answer

Another potential cause of this error is that the document and the Textract client are not in the same AWS region. Make sure the Textract call is made in the same region as the bucket.

# when your bucket is in us-east-2
textract_client = boto3.client('textract', region_name='us-east-2')
AWS
answered 3 years ago
  • Hi, thank you for the response. Incredibly, that seems to have worked... Doesn't instantiating both the s3 and textract clients from the same session object ensure they all use the same region?

  • @danem: The bucket region is defined when the bucket is created, not when the boto3 session or client is instantiated, so every S3 bucket is bound to a specific region. Textract, on the other hand, is a regional service available in most regions, and a boto3 Textract client sends its API calls to the region it was created with. If that region differs from the bucket's region, the call can fail with the error above.
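
To make the point above concrete, here is a minimal sketch that looks up the bucket's region and creates the Textract client in that same region. The helper names bucket_region and textract_client_for_bucket are made up for illustration. It assumes your credentials are allowed to call s3:GetBucketLocation, and it relies on get_bucket_location returning an empty LocationConstraint for us-east-1 buckets (and the legacy value "EU" for some older eu-west-1 buckets).

```python
from typing import Optional


def bucket_region(location_constraint: Optional[str]) -> str:
    """Map the LocationConstraint value from get_bucket_location to a region name."""
    if not location_constraint:
        # Buckets in us-east-1 report no location constraint.
        return "us-east-1"
    if location_constraint == "EU":
        # Legacy value that some older eu-west-1 buckets report.
        return "eu-west-1"
    return location_constraint


def textract_client_for_bucket(bucket: str):
    """Create a Textract client in the same region as the given bucket (hypothetical helper)."""
    import boto3  # imported here so bucket_region() can be used without boto3 installed

    s3 = boto3.client("s3")
    constraint = s3.get_bucket_location(Bucket=bucket).get("LocationConstraint")
    return boto3.client("textract", region_name=bucket_region(constraint))
```

With this, tx = textract_client_for_bucket("test") would replace the session-based Textract client in the question's script, so the region no longer depends on the profile's default.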

0

I know this is an old post, but it's still relevant. I was getting this error and found that the raw key I was passing to Textract often contained URL-encoded special characters, which triggered the error. To fix it, I decoded the key and passed the decoded value to Textract.

Here is an example of what to add (not the entire code).

First, import this module to decode the key:

import urllib.parse

Then add this code to use the decoded key:

decoded_key = urllib.parse.unquote_plus(key)

# Start Textract asynchronous processing with the decoded key
response = textract.start_document_text_detection(
    DocumentLocation={
        'S3Object': {
            'Bucket': bucket,
            'Name': decoded_key
        }
    }
)
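
To show what the decoding actually does, here is a small standalone sketch. The key value is invented for illustration, and decode_s3_key is just a hypothetical wrapper around urllib.parse.unquote_plus. It assumes the key arrived URL-encoded, which is how S3 event notifications deliver object keys (spaces become "+", and characters such as parentheses become percent escapes).

```python
import urllib.parse


def decode_s3_key(raw_key: str) -> str:
    """Decode a URL-encoded S3 object key: '+' becomes a space, '%XX' becomes the character."""
    return urllib.parse.unquote_plus(raw_key)


# Made-up example of a key as it might appear in an S3 event notification.
raw_key = "scanned+docs/report%282%29.pdf"
print(decode_s3_key(raw_key))  # scanned docs/report(2).pdf
```

One caveat: only decode keys that really are encoded. If a key is already in its plain form and contains a literal "+", unquote_plus will turn that "+" into a space and the lookup will fail again.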
answered 6 months ago
