Textract S3 Permissions


Hi, I'm trying to analyze a multipage PDF using Textract's start_document_analysis API. I understand that the document I'm analyzing must be present in an S3 bucket. However, when calling this function I receive the following error message:

InvalidS3ObjectException: An error occurred (InvalidS3ObjectException) when calling the StartDocumentAnalysis operation: Unable to get object metadata from S3. Check object key, region and/or access permissions.

I've verified that the bucket name and key are correct, and the document works in the test console, which leads me to think this is related to permissions. Here is my test script (note: I am running this from my local computer, not Lambda):

import boto3
s = boto3.Session(profile_name="default")

s3 = s.client("s3")
tx = s.client("textract")
doc = "test.pdf"
bucket = "test"

s3.upload_file(doc, bucket, doc)
resp = tx.start_document_analysis(
    DocumentLocation = {
        "S3Object": {
            "Bucket": bucket,
            "Name": doc
        }
    },
    FeatureTypes = ["TABLES"]
)

How do I configure my bucket to allow access from Textract?

Thanks

2 Answers

Hi there. According to my testing, this API call only needs the s3:GetObject permission on the object (arn:aws:s3:::rafaxu-bucket/example.pdf in my test; the full IAM policy is further down).

There are a few other things you can check:

  1. Is there an S3 bucket policy that restricts access?
  2. Is a KMS key applied to the object? If so, your IAM user/role may also need the related KMS permissions.
  3. You can add boto3.set_stream_logger(name='botocore') to your code to get debug output, which may help.

I recommend separating the S3 upload and the Textract API call into different code snippets for troubleshooting purposes.
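One way to test the S3 side on its own is to read the object's metadata with the same credentials before calling Textract. The following is a minimal sketch, not part of my original test: the helper names explain_s3_error and check_object are made up for illustration, and the hint texts are my own reading of the usual causes. It relies on head_object, which reports failures as a ClientError whose error code is the HTTP status ("404", "403").

```python
def explain_s3_error(code):
    """Map an error code returned by head_object to a likely cause (illustrative hints)."""
    hints = {
        "404": "bucket name or object key is wrong",
        "403": "credentials lack s3:GetObject, or a bucket/KMS policy denies access",
    }
    return hints.get(code, "unexpected error, check the botocore debug log")


def check_object(bucket, key, profile="default"):
    """Try to read the object's metadata with the given profile's credentials."""
    # boto3 is imported here so explain_s3_error can be used without it installed
    import boto3
    from botocore.exceptions import ClientError

    s3 = boto3.Session(profile_name=profile).client("s3")
    try:
        s3.head_object(Bucket=bucket, Key=key)
    except ClientError as err:
        return explain_s3_error(err.response["Error"]["Code"])
    return "object metadata is readable with these credentials"
```

For example, check_object("rafaxu-bucket", "example.pdf") would exercise only the S3 permissions. If that succeeds and Textract still fails, the cause is more likely something other than IAM.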

Here is my test code, which works:

import boto3
boto3.set_stream_logger(name='botocore')

s = boto3.Session(profile_name="default")

tx = s.client("textract")
doc = "example.pdf"
bucket = "rafaxu-bucket"

resp = tx.start_document_analysis(
    DocumentLocation={
        "S3Object": {
            "Bucket": bucket,
            "Name": doc
        }
    },
    FeatureTypes=["TABLES"]
)

print(resp)

Here is the IAM policy attached to my IAM user:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::rafaxu-bucket/example.pdf"
        }
    ]
}
  • Textract full access, for the Textract API only.

If I remove the policy, I do get this error: botocore.errorfactory.InvalidS3ObjectException: An error occurred (InvalidS3ObjectException) when calling the StartDocumentAnalysis operation: Unable to get object metadata from S3. Check object key, region and/or access permissions.

Hope that helps.

answered a year ago
  • Sorry, I'm not sure where that IAM policy needs to be applied. Not on the bucket, right? I've created a new user with the S3FullAccess and TextractFullAccess policies applied, and I'm now using that as the account executing the code, but still running into the same issue. Thank you for your help.

Accepted answer

Another potential cause of this error is the document and the Textract client not being in the same AWS region. Make sure the Textract call is made against the same region as the bucket.

# when your bucket is in us-east-2
textract_client = boto3.client('textract', region_name='us-east-2')
AWS
answered a year ago
  • Hi, thank you for the response. Incredibly, that seems to have worked... Doesn't instantiating both the s3 and textract clients from the same session object ensure they all use the same region?

  • @danem: The bucket region is defined when the bucket is created, not when the boto3 client session is instantiated. So every S3 bucket is 'bound' to a specific region. Textract on the other hand is available in most regions and when a boto3 client session is instantiated, it will execute the Textract API call against that region.
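To avoid hard-coding the region, the bucket's region can be looked up and passed to the Textract client. This is a rough sketch under some assumptions: the helper names region_from_constraint and textract_client_for_bucket are hypothetical, and the caller needs permission to call get_bucket_location. That call returns the region in LocationConstraint, with None meaning us-east-1 and the legacy value "EU" meaning eu-west-1.

```python
def region_from_constraint(constraint):
    """Translate a get_bucket_location LocationConstraint into a region name."""
    if constraint is None:
        # Buckets in us-east-1 report no location constraint
        return "us-east-1"
    if constraint == "EU":
        # Legacy value used for some older eu-west-1 buckets
        return "eu-west-1"
    return constraint


def textract_client_for_bucket(bucket, profile="default"):
    """Create a Textract client in the same region as the given bucket."""
    # boto3 is imported here so region_from_constraint can be used without it installed
    import boto3

    session = boto3.Session(profile_name=profile)
    resp = session.client("s3").get_bucket_location(Bucket=bucket)
    region = region_from_constraint(resp.get("LocationConstraint"))
    return session.client("textract", region_name=region)
```

With this, tx = textract_client_for_bucket("test") would replace the s.client("textract") call in the question's script, so the Textract call is made against the bucket's region whatever the profile's default region is.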
