We deployed the Foundational dashboards a few months back for one Billing Org, and everything worked without any issues. A few weeks ago we added five more Billing Orgs. All of their CURs have been backfilled to 1/1/2023, and we can see the data replicated to the shared bucket, where it has been for at least 3 days. The required stack (step 1 - https://catalog.workshops.aws/awscid/en-US/dashboards/foundational/cudos-cid-kpi/deploy#step-1.-(in-destinationdata-collection-account)-create-destination-for-cur-aggregation) was also updated to include all the additional payers.
The issue we are facing is that only two of the new payer accounts appear in the dashboards and only for August 2024.
We opened a case with AWS Support because, when we investigated, we found errors in the Glue jobs. The error we are getting is:
Service Principal: glue.amazonaws.com is not authorized to perform: glue:GetTable on resource: arn:aws:glue:us-east-1:{redacted}:table/cid_cur/{redacted} because no identity-based policy allows the glue:GetTable action (Database name: cid_cur, Table Name: {redacted}) (Service: AWSGlue; Status Code: 400; Error Code: AccessDeniedException; Request ID: {redacted}; Proxy: null).
The Tech's analysis resulted in the following:
*I reviewed the IAM role for the crawler and found that it is only allowed to perform actions on a single table cur under the database cid_cur, which is the reason for the error messaging. After reviewing the CID process with you, we confirmed that ideally, this permission should be enough. However, the crawler sees the schema of some of the tables as different enough to warrant creating new tables instead of adding partitions as expected.
You walked me through some of the setup for the Cloud Intelligent Dashboard. While I offered a couple of possibilities to resolve the issue, you explained how you need a solution that does not modify the crawler or IAM role, since this setup is generated via Cloud Formation stacks provided as part of the CID community program, and that would require manually adjusting them again when those stacks are updated or redeployed.
I took the investigation offline so I could continue digging through crawler logs looking for specifics. I also compared the parquet files provided to the case, in order to check how much the schema differs. The parts of the schema that overlap are matches, however, there are still a significant number of column differences. Referring to a documented explanation of how schema similarity is determined [^1], the crawler compares the schema of the existing table with the schema of the new data detected.
My own comparison from just these 2 files shows that the older file contains 66 columns that are not present in the newer file; the newer file contains 91 columns that are not present in the older file; and there are 212 columns present in both that have matching data types. These schema differences are also weighted against the schema differences across all of the newer files, which is described as the "cluster" in the referenced re:Post article. If there are more than 5 files with sufficiently different schema in the new folders, then the crawler will try to create a new table rather than updating the partitions of the existing table. I also compared the newer file to the existing table, and found that all of its columns are present in the table schema, and the table schema contains 195 columns that are not present in that file.
In the crawler's previous runs, it had already identified the account folders ending in {redacted} and {redacted} as new table targets, and in the latest run it identified the {redacted} folder as a new table target today, and encountered the GetTable permissions issue when trying to see if the table already existed. Because this metadata has already been established, the crawler may need to be deleted and re-created in order to remove the stale (and invalid) table associations with the account folders.
I still recommend updating the crawler's configuration to set the table level [^2] to 2, which will tell the crawler that all new tables will be at the second level of the S3 path: "s3://cid-{redacted}-shared/cur/". This will force it to try to recognize that the newer folders are partitions instead of tables. Since the Cloud Formation template comes from the open source community for Cloud Intelligent Dashboards, you may find it beneficial to request that they allow configuring the table level in the stack, or they can hard-code it into the crawler's template based on the depth of table's CUR folder.*
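For anyone who wants to reproduce the kind of column comparison described in the analysis above, here is a minimal sketch. It assumes each file's schema has already been reduced to a mapping of column name to type string; the function name `diff_schemas` and the sample column names are illustrative only and are not taken from our CUR files or from the support case.

```python
from typing import Dict, Mapping, Set


def diff_schemas(older: Mapping[str, str], newer: Mapping[str, str]) -> Dict[str, Set[str]]:
    """Compare two {column_name: type_string} mappings.

    Returns the columns found only in the older schema, only in the newer
    schema, and the shared columns split by whether their types match.
    """
    older_cols, newer_cols = set(older), set(newer)
    shared = older_cols & newer_cols
    return {
        "only_in_older": older_cols - newer_cols,
        "only_in_newer": newer_cols - older_cols,
        "shared_same_type": {c for c in shared if older[c] == newer[c]},
        "shared_type_mismatch": {c for c in shared if older[c] != newer[c]},
    }


if __name__ == "__main__":
    # Illustrative schemas only. With pyarrow installed (an assumption), real
    # mappings could be built roughly like this:
    #   import pyarrow.parquet as pq
    #   schema = pq.read_schema("some-file.parquet")
    #   columns = {field.name: str(field.type) for field in schema}
    older_file = {"bill_payer_account_id": "string", "line_item_usage_amount": "double",
                  "legacy_only_column": "string"}
    newer_file = {"bill_payer_account_id": "string", "line_item_usage_amount": "double",
                  "new_only_column": "string"}

    result = diff_schemas(older_file, newer_file)
    for bucket, columns in result.items():
        print(f"{bucket}: {len(columns)} column(s) {sorted(columns)}")
```

Running the same comparison between a new payer's file and the existing table schema would show whether the file's columns are a subset of the table's, which is what the engineer reported for the newer file.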
We are very wary of making changes to the configuration of Glue and the jobs outside of what was deployed via the CID templates.
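For reference, my understanding is that the table-level change support suggested would amount to something like the sketch below. It is not something we have applied. It assumes boto3 is installed with credentials for the data collection account, the crawler name passed in is a placeholder, and the helper names `merge_table_level` and `apply_table_level` are my own. As noted above, a manual change like this would presumably be overwritten the next time the CID stack is updated or redeployed.

```python
import json
from typing import Optional


def merge_table_level(existing_configuration: Optional[str], level: int) -> str:
    """Return a crawler Configuration JSON string with the table level set.

    Any settings already present in the existing configuration are preserved;
    only Grouping.TableLevelConfiguration is added or overwritten.
    """
    if level < 1:
        raise ValueError("table level must be a positive integer")
    config = json.loads(existing_configuration) if existing_configuration else {}
    config.setdefault("Version", 1.0)
    config.setdefault("Grouping", {})["TableLevelConfiguration"] = level
    return json.dumps(config)


def apply_table_level(crawler_name: str, level: int = 2) -> None:
    """Read the crawler's current configuration and update its table level.

    Assumption: boto3 is available and the caller may call glue:GetCrawler
    and glue:UpdateCrawler on the named crawler.
    """
    import boto3  # imported lazily so the helper above works without boto3

    glue = boto3.client("glue")
    current = glue.get_crawler(Name=crawler_name)["Crawler"].get("Configuration")
    glue.update_crawler(
        Name=crawler_name,
        Configuration=merge_table_level(current, level),
    )


if __name__ == "__main__":
    # Dry run: show the configuration string without touching any crawler.
    print(merge_table_level(None, 2))
```

The merge step is there because `update_crawler` takes the whole configuration string, so passing only the grouping setting could drop options the template already set.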
Has anyone run into this before, or does anyone have any recommendations? Our deployment has been essentially unusable since we added the additional billing orgs.