Can't use document file names containing symbols for textract.start_document_text_detection

Hi,

I am trying to use Textract to extract text from a PDF stored in an S3 bucket:

response = textract.start_document_text_detection(
    DocumentLocation={
        'S3Object': {
            'Bucket': 'sample-bucket',
            'Name': 'scanned_pdf_#1.pdf'
        }
    },
    JobTag='scanned_pdf_#1.pdf_job',
    NotificationChannel={
        'RoleArn': 'arn:aws:iam::*******:role/AWSSNSFullAccessRole',
        'SNSTopicArn': 'arn:aws:sns:us-east-1:*********:PDF_TextProcess_Completed'
    }
)

When the file name contains a symbol, I get the following error:

InvalidParameterException: An error occurred (InvalidParameterException) when calling the StartDocumentTextDetection operation: Request has invalid parameters

How do I get past this without changing the names of the files?

I've also tried decoding the file name like this, but it did not work:

file = urllib.parse.unquote_plus(file, encoding='utf-8')

asked 4 years ago, 706 views
2 Answers
Accepted Answer

Hi, thank you for using Amazon Textract. According to the documentation at https://docs.aws.amazon.com/textract/latest/dg/API_StartDocumentTextDetection.html, the **JobTag** parameter allows only these characters:

Type: String

Length Constraints: Minimum length of 1. Maximum length of 64.

Pattern: [a-zA-Z0-9_.\-:]+

The JobTag in your request, 'scanned_pdf_#1.pdf_job', contains '#', which is outside this pattern. Please provide a valid JobTag value and retry the request.
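If the JobTag is built from the file name, one way to keep the files as they are is to clean up only the tag. The following is a minimal sketch, not an official recipe: `make_job_tag` and `start_detection` are hypothetical helper names, and it assumes the JobTag (rather than the S3 object name) is what triggers the error, as the constraints above indicate. Replacing disallowed characters with `_` is just one possible choice.

```python
import re

# Anything outside the documented JobTag pattern [a-zA-Z0-9_.\-:]+
_DISALLOWED = re.compile(r"[^a-zA-Z0-9_.\-:]")
MAX_JOB_TAG_LENGTH = 64  # documented maximum length


def make_job_tag(object_key: str, suffix: str = "_job") -> str:
    """Derive a JobTag from an S3 key (hypothetical helper).

    Disallowed characters (spaces, '#', parentheses, ...) become '_',
    and the result is truncated so that tag + suffix fits in 64 characters.
    """
    cleaned = _DISALLOWED.sub("_", object_key)
    return cleaned[: MAX_JOB_TAG_LENGTH - len(suffix)] + suffix


def start_detection(textract, bucket: str, key: str):
    """Start a job with the original key and a sanitized JobTag.

    `textract` is assumed to be a boto3 Textract client.
    """
    return textract.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        JobTag=make_job_tag(key),
    )


print(make_job_tag("scanned_pdf_#1.pdf"))  # scanned_pdf__1.pdf_job
```

Note that different file names can map to the same tag after cleaning (for example `a#1.pdf` and `a 1.pdf`), so if the tag is used to tell jobs apart, a hash or counter may be a safer suffix.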

AWS
answered 4 years ago

Thank you. I was including spaces in the name. It worked after I removed them.

answered 4 years ago
