Kendra + S3 Connector = Fails to index PDF files

Hello.

I'm trying to create a Kendra index with an S3 connector. I have some custom fields that we need for searching content. I created the fields in the Kendra index and understood that I'd need a metadata file alongside my files in the S3 bucket so that they are indexed with these fields. I uploaded a PDF file to S3 together with a metadata JSON file like this:

{
  "Title": "Document title",
  "ContentType": "pdf",
  "Id": "100000241",
  "Attributes": {
    "asset_id": "100000241",
    "asset_title": "Some title here",
    "description": "Some description",
    "asset_type": "Brochure:",
    "audience_level_1": "Client",
    "mru_control_id": "1004862-00009-00",
    "persistent_url": "Some URL",
    "_source_uri": "SOme other URL",
    "date_published": "1970-01-01T00:00:00.000Z",
    "expiration_date": "2026-05-06T17:10:00.000Z",
    "modified_date_time": "2024-06-06T18:04:14.722Z",
    "thumbnail": "Some thumb URL",
    "file_format": "pdf",
    "asset_subtype": "Brochure:",
    "created_date": "2024-03-07T18:12:49.559Z",
    "upload_date_time": "2024-03-07T18:12:49.559Z",
    "keywords": ["keyword1", "keyword2", "keyword3"],
    "type": "assets"
  }
}

But when I click Sync Now, I get these errors in the logs:

Error fetching item id from repository. Content/Metadata retrieval failed for document: Metadata validation failed with errors [{
instance failed to match at least one required schema among 4
(the line above repeats 30 more times, 31 in total)

Does anyone know what this error is about?

Thanks a lot.

asked 10 months ago, 170 views
2 Answers

The error you're encountering with your Amazon Kendra S3 connector is related to the format of your metadata JSON file. The error message indicates that your metadata file is not matching the required schema format that Kendra expects.

Here are some potential issues and solutions:

  1. Metadata File Naming: Make sure your metadata file follows the correct naming convention. The metadata file should have the same name as your document with ".metadata.json" appended to it. For example, if your document is "mydocument.pdf", the metadata file should be "mydocument.pdf.metadata.json".

  2. JSON Structure: The structure of your JSON appears to have some issues. For Amazon Kendra S3 data sources, the metadata JSON should follow a specific format. The "_source_uri" field should be at the top level of your JSON, not nested within the "Attributes" object.

  3. Field Formatting: Some of your fields might not be in the expected format. For example, date fields should be in the ISO 8601 format.
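Points 1 and 3 can be checked locally before starting a sync. The following is a minimal Python sketch, not part of any Kendra tooling: the helper names and the choice of which fields count as dates are assumptions for illustration, and the date check only confirms that a value parses as an ISO 8601 timestamp.

```python
from datetime import datetime

METADATA_SUFFIX = ".metadata.json"


def expected_metadata_name(document_name: str) -> str:
    """Return the metadata file name for a document, per the naming rule above."""
    return document_name + METADATA_SUFFIX


def is_iso8601(value) -> bool:
    """Return True if value is a string that parses as an ISO 8601 timestamp."""
    if not isinstance(value, str):
        return False
    # Older Python versions do not accept a trailing "Z", so map it to +00:00.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    try:
        datetime.fromisoformat(value)
        return True
    except ValueError:
        return False


def invalid_date_fields(attributes: dict, date_fields: list) -> list:
    """List the date fields that are present but do not parse as ISO 8601."""
    return [
        name
        for name in date_fields
        if name in attributes and not is_iso8601(attributes[name])
    ]


# Hypothetical sample: one value from the question, one deliberately malformed.
sample = {
    "date_published": "1970-01-01T00:00:00.000Z",
    "created_date": "07/03/2024",
}
print(expected_metadata_name("mydocument.pdf"))
print(invalid_date_fields(sample, ["date_published", "created_date"]))
```

Any field name printed by the last line holds a value worth correcting before the next sync.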

Here's a suggested correction to your metadata JSON format:

{
  "Title": "Document title",
  "ContentType": "pdf",
  "Id": "100000241",
  "_source_uri": "Your source URI here",
  "asset_id": "100000241",
  "asset_title": "Some title here",
  "description": "Some description",
  "asset_type": "Brochure:",
  "audience_level_1": "Client",
  "mru_control_id": "1004862-00009-00",
  "persistent_url": "Some URL",
  "date_published": "1970-01-01T00:00:00.000Z",
  "expiration_date": "2026-05-06T17:10:00.000Z",
  "modified_date_time": "2024-06-06T18:04:14.722Z",
  "thumbnail": "Some thumb URL",
  "file_format": "pdf",
  "asset_subtype": "Brochure:",
  "created_date": "2024-03-07T18:12:49.559Z",
  "upload_date_time": "2024-03-07T18:12:49.559Z",
  "keywords": ["keyword1", "keyword2", "keyword3"],
  "type": "assets"
}

Also, ensure that all the custom fields you're using in the metadata file have been properly defined in your Kendra index before attempting to sync. If you've created custom fields in your Kendra index, make sure the field names in your metadata JSON match exactly with the field names in your index.
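One way to catch a field-name mismatch is to compare the names in the metadata file against the names defined in the index. This is a rough Python sketch under stated assumptions: `unknown_attributes` is a hypothetical helper, the `index_fields` set is typed in by hand rather than read from the index, and names starting with an underscore are assumed to be reserved Kendra fields and skipped.

```python
def unknown_attributes(attributes: dict, index_fields: set) -> list:
    """Return custom attribute names that have no exact match in the index."""
    return sorted(
        name
        for name in attributes
        # Names starting with "_" (such as _source_uri) are treated as reserved
        # fields here and skipped; that handling is an assumption.
        if not name.startswith("_") and name not in index_fields
    )


# Hypothetical field names; replace with the fields defined in your index.
index_fields = {"asset_id", "asset_title", "asset_type"}

# Hypothetical metadata attributes; "Asset_Title" differs from the index in case.
attributes = {
    "asset_id": "100000241",
    "Asset_Title": "Some title here",
    "_source_uri": "Some URL",
}

print(unknown_attributes(attributes, index_fields))
```

The comparison is case-sensitive on purpose, since the names need to match exactly.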

Additionally, check that your IAM role for Kendra has the necessary permissions to access the S3 bucket and its objects. The role should have permissions for actions like s3:GetObject and s3:ListBucket on your S3 bucket.
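As an illustration of those two permissions, here is a small Python sketch that builds a policy document for a bucket. The bucket name `my-kendra-docs` and the helper name are placeholders, and the sketch covers only the two S3 actions mentioned above; a working Kendra data source role will likely need further statements, so treat it as a starting point to compare against your own role.

```python
import json


def kendra_s3_read_policy(bucket_name: str) -> dict:
    """Build a policy document granting read access to one S3 bucket."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # s3:GetObject applies to objects, so the resource is bucket/*.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"{bucket_arn}/*"],
            },
            {
                # s3:ListBucket applies to the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [bucket_arn],
            },
        ],
    }


# "my-kendra-docs" is a placeholder bucket name.
policy = kendra_s3_read_policy("my-kendra-docs")
print(json.dumps(policy, indent=2))
```

A common mistake is attaching s3:ListBucket to the object path (bucket/*) instead of the bucket ARN, which leaves the bucket unlistable even though the policy looks complete.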

answered 10 months ago

Thanks a lot for answering. About your points:

  1. Yup, I made sure about the names for the PDF and the metadata JSON.
  2. Sorry, I had to mask some values before posting here, but I'll remove the dates just for testing.
  3. That's probably my issue. I didn't know that the order of attributes mattered, so I'm going to follow the order from the AWS docs.

UPDATE: I changed the JSON to put _source_uri at the top of the list, but I got the same errors. Also, to create the metadata JSON file, I followed this format: https://docs.aws.amazon.com/kendra/latest/dg/s3-metadata.html, with my fields inside the Attributes object.

answered 10 months ago
