AWS Glue not properly crawling S3 bucket populated by "Resource Data Sync" -- specifically, "AWS:InstanceInformation" is not made into a table
I set up an S3 bucket that collects inventory data from multiple AWS accounts using the Systems Manager "Resource Data Sync". I was able to set up the data syncs to feed into the single bucket without issue, and the Glue crawler was created automatically.
Now that I'm trying to query the data in Athena, I've noticed a problem with how the crawler parses the data in the bucket. The folder "AWS:InstanceInformation" is not being turned into a table. Instead, the crawler turns each of the "region=us-east-1/" and "test.json" sub-items into its own table, and none of those tables are queryable.
To illustrate further, each of the following paths is being turned into its own table.
- s3://resource-data-sync-bucket/AWS:InstanceInformation/accountid=12345679012/region=us-east-1
- s3://resource-data-sync-bucket/AWS:InstanceInformation/accountid=12345679012/test.json
- s3://resource-data-sync-bucket/AWS:InstanceInformation/accountid=23456790123/region=us-east-1
- s3://resource-data-sync-bucket/AWS:InstanceInformation/accountid=23456790123/test.json
- s3://resource-data-sync-bucket/AWS:InstanceInformation/accountid=34567901234/region=us-east-1
- s3://resource-data-sync-bucket/AWS:InstanceInformation/accountid=34567901234/test.json
This is ONLY happening with the "AWS:InstanceInformation" folder. All of the other folders (e.g. "AWS:DetailedInstanceInformation") are being properly turned into tables.
Since all of this data was populated automatically, I assume this is a bug. Is there anything I can do to fix it?
After testing more, I've determined that the issue is caused by the "test.json" file, which is added when the Resource Data Sync is created.
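If the stray "test.json" objects really are the cause, one possible workaround is to find them and remove them before re-running the crawler. The following is only a minimal sketch under that assumption, not a confirmed fix: the function names, the regular expression, and the `dry_run` flag are all hypothetical, and it assumes the keys follow the layout listed above (a "test.json" sitting directly inside an "accountid=..." folder). The tables the crawler already created would probably still need to be deleted by hand.

```python
import re

# Assumption: a stray file looks like "<folder>/accountid=<id>/test.json",
# i.e. it sits directly in the accountid= folder, next to the region= folders.
STRAY_KEY = re.compile(r"^[^/]+/accountid=[^/]+/test\.json$")


def find_stray_test_objects(keys):
    """Return the keys that match the stray test.json layout, in input order."""
    return [key for key in keys if STRAY_KEY.match(key)]


def delete_stray_test_objects(bucket, prefix="AWS:InstanceInformation/", dry_run=True):
    """List objects under prefix and (optionally) delete the stray test.json files.

    With dry_run=True nothing is deleted; the matching keys are only returned.
    """
    import boto3  # imported here so the helper above works without boto3 installed

    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")

    stray = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent when a page has no objects.
        keys = [obj["Key"] for obj in page.get("Contents", [])]
        stray.extend(find_stray_test_objects(keys))

    if not dry_run:
        # delete_objects accepts at most 1000 keys per request.
        for start in range(0, len(stray), 1000):
            batch = stray[start:start + 1000]
            s3.delete_objects(
                Bucket=bucket,
                Delete={"Objects": [{"Key": key} for key in batch]},
            )
    return stray
```

Calling `delete_stray_test_objects("resource-data-sync-bucket")` with the default `dry_run=True` would only report which keys match, which is a safer first step than deleting right away. An alternative that leaves the bucket untouched would be to add an exclude pattern for "test.json" to the crawler's S3 target, though I have not verified that against the automatically created crawler.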