This is a well-structured question and the approaches you have already explored are on the right track. I will try to add specifics to each of your four questions based on what AWS provides and what has been documented by others working at similar scale.
- Querying findings at scale: Athena with partition projection
Your Athena and Glue Crawler approach is the recommended AWS-native path. AWS published a detailed walkthrough of this exact pattern in their security blog: How to query and visualize Macie sensitive data discovery results with Athena and QuickSight. The post includes table definitions for Macie's JSONL schema and sample queries.
One specific improvement worth considering for your scale: rather than relying on a Glue Crawler for partition management, use Athena partition projection. With 40+ accounts and continuous ingestion, the Crawler will need to run frequently to pick up new partitions, and each run adds latency and cost. Partition projection calculates partition values and locations from table properties you configure directly, so Athena resolves partitions at query time without a metadata lookup. Given your path structure (AWSLogs/<Admin-Account-ID>/Macie/eu-west-1/<SDLC-Account-ID>/), you can define the account ID as an injected partition type and the date components as date-type projections. AWS documents this in detail here: Use partition projection with Amazon Athena.
There is also the Amazon Macie Results Analytics repository on GitHub (maintained by AWS), which provides pre-built Athena table definitions and sample queries for Macie discovery results. If you have not already looked at it, it may save you time on the schema mapping.
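To make the partition projection setup above a bit more concrete, here is a minimal Python sketch that builds the table properties for an injected account ID column plus one date column. The column names `account_id` and `dt`, the date format, and the location template you pass in are my assumptions for illustration, so adjust them to the layout you actually see under each account prefix.

```python
def projection_properties(location_template, start_date, date_format="yyyy/MM/dd"):
    """Build Athena partition projection properties.

    Assumes two partition columns with hypothetical names:
    account_id (injected) and dt (date).
    """
    return {
        "projection.enabled": "true",
        # Injected: the value comes from the WHERE clause of each query.
        "projection.account_id.type": "injected",
        # Date: Athena computes every value between start_date and now.
        "projection.dt.type": "date",
        "projection.dt.range": f"{start_date},NOW",
        "projection.dt.format": date_format,
        "projection.dt.interval": "1",
        "projection.dt.interval.unit": "DAYS",
        # Template that maps partition values onto S3 prefixes.
        "storage.location.template": location_template,
    }


def to_tblproperties_sql(props):
    """Render the properties as a TBLPROPERTIES clause for a CREATE/ALTER TABLE."""
    pairs = ",\n  ".join(f"'{key}' = '{value}'" for key, value in props.items())
    return f"TBLPROPERTIES (\n  {pairs}\n)"
```

One thing to keep in mind with the injected type: every query has to filter on that column with a concrete value (for example `WHERE account_id = '...'`), because Athena cannot enumerate the possible values on its own.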
- Continuous ingestion
If you move to partition projection as described above, the ingestion problem largely resolves itself because there is no Crawler to re-run. Athena will pick up new files in existing partitions automatically, and new account-level partitions are resolved by the projection configuration without manual intervention.
If you still want event-driven processing on top of this (for example, to trigger downstream alerts when new sensitive data is found), you could add S3 Event Notifications on your centralised findings bucket to invoke a Lambda function when new JSONL.GZ files land. That Lambda can parse the finding, extract the key fields (bucket ARN, sensitivity score, finding type), and push them to an SNS topic or EventBridge for downstream consumers. This avoids polling and gives you near real-time notification without replacing the Athena layer for analytical queries.
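As a rough illustration of the parsing step in such a Lambda, the sketch below pulls the object keys out of an S3 event notification and extracts a few fields from a gzipped JSONL payload. The field paths inside each finding (`resourcesAffected.s3Bucket.arn`, `type`, `severity.score`) are assumptions on my part, so check them against a real result file before relying on them. Fetching the object and publishing to SNS or EventBridge are deliberately left out.

```python
import gzip
import json
from urllib.parse import unquote_plus


def object_keys_from_event(event):
    """Return (bucket, key) pairs from an S3 event notification payload."""
    pairs = []
    for record in event.get("Records", []):
        s3 = record.get("s3") or {}
        bucket = (s3.get("bucket") or {}).get("name")
        key = (s3.get("object") or {}).get("key")
        if bucket and key:
            # Object keys arrive URL-encoded in S3 event notifications.
            pairs.append((bucket, unquote_plus(key)))
    return pairs


def summarise_findings(gz_bytes):
    """Decompress a JSONL.GZ payload and keep a few fields per line.

    The field paths below are assumed, not taken from the Macie schema docs.
    """
    summaries = []
    text = gzip.decompress(gz_bytes).decode("utf-8")
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines between records
        finding = json.loads(line)
        bucket = (finding.get("resourcesAffected") or {}).get("s3Bucket") or {}
        summaries.append({
            "bucket_arn": bucket.get("arn"),
            "finding_type": finding.get("type"),
            "severity_score": (finding.get("severity") or {}).get("score"),
        })
    return summaries
```

Keeping these two functions free of AWS calls makes them easy to unit test locally; the handler itself would only need to download each object and pass the bytes in.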
- Application attribution
This is the hardest of your four problems, and your multi-signal fallback approach is the right architecture. A few specific tools to layer in:
- AWS Service Catalog AppRegistry is purpose-built for this. It lets you define applications as logical groupings and associate AWS resources (including S3 buckets) to them. It supports cross-account applications through AWS RAM sharing, so you can define applications centrally and associate resources from your 40 member accounts. AWS published a walkthrough here: How to manage multi-account applications with AppRegistry and Resource Access Manager. The challenge is that AppRegistry requires resources to be associated explicitly, so it works best going forward once you enforce it as part of your deployment pipeline. It will not retroactively solve the problem for existing untagged buckets.
- AWS Resource Explorer with multi-account search can help you find untagged or inconsistently tagged buckets across your organisation. You can query for S3 buckets that have no tags at all (using tag:none queries) or that are missing a specific tag key, which gives you a remediation target list for your tagging compliance effort.
- AWS Config Aggregator in your management or delegated admin account can collect resource configuration data across all 40 accounts. You can write Config rules that check for the presence of required tags (such as an ApplicationOwner or Team tag) on S3 buckets, and report non-compliant resources. This does not solve attribution directly, but it gives you a compliance mechanism to enforce tagging going forward.
- Tag policies through AWS Organizations let you define and enforce standardised tag keys and allowed values across all accounts. If you have not implemented these yet, they are worth setting up in parallel so that new buckets are tagged correctly even while you work through the backlog.
For your existing untagged buckets, the CloudTrail creator event approach you mentioned is probably the most reliable retroactive signal. The CreateBucket API call in CloudTrail records the IAM principal that created the bucket, which in most environments maps back to a deployment role or CI/CD pipeline that is associated with a specific application or team. Combining that with bucket naming conventions and account-level ownership gives you reasonable coverage for the historical backlog.
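A minimal sketch of that retroactive attribution, assuming you already have the CloudTrail records as parsed JSON: the first function extracts the creating principal from a CreateBucket event, and the second applies a simple fallback chain (tag, then creator role, then account owner). The lookup tables `role_to_app` and `account_to_app` are hypothetical mappings you would maintain yourself, and the default tag key is just one of the examples mentioned above.

```python
def creator_from_cloudtrail(record):
    """Extract the creating principal from a CloudTrail CreateBucket record."""
    if record.get("eventName") != "CreateBucket":
        return None
    identity = record.get("userIdentity") or {}
    issuer = (identity.get("sessionContext") or {}).get("sessionIssuer") or {}
    return {
        "bucket": (record.get("requestParameters") or {}).get("bucketName"),
        "account_id": record.get("recipientAccountId"),
        # Prefer the underlying role over the per-session assumed-role ARN.
        "principal_arn": issuer.get("arn") or identity.get("arn"),
    }


def attribute_bucket(tags, creator, role_to_app, account_to_app,
                     tag_key="ApplicationOwner"):
    """Return (application, signal) using the first signal that matches.

    role_to_app and account_to_app are hypothetical lookup tables.
    """
    if tags.get(tag_key):
        return tags[tag_key], "tag"
    if creator:
        app = role_to_app.get(creator.get("principal_arn"))
        if app:
            return app, "creator_role"
        app = account_to_app.get(creator.get("account_id"))
        if app:
            return app, "account_owner"
    return None, "unattributed"
```

Returning the signal alongside the application lets you report how confident each attribution is, since a tag match is usually more trustworthy than an account-level fallback.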
- Classification Errors
Classification errors from Macie automated discovery typically fall into a few specific categories, documented here: Remediating coverage issues for automated sensitive data discovery. The most common causes at scale are:
- KMS key access: objects encrypted with customer managed KMS keys where the key policy does not grant Macie's service-linked role decrypt permissions. This is the most frequent cause in multi-account setups because each account may have its own CMKs. You need to update the key policy for each relevant CMK to allow the AWSServiceRoleForAmazonMacie role to perform kms:Decrypt.
- Restrictive bucket policies: if a bucket policy has explicit Deny statements, these override Macie's service-linked role permissions even if the role otherwise has access.
- Unsupported object types or size: Macie has limits on which file types it can classify and the maximum object size it will analyse.
For bulk remediation, you can use your Athena table to query for all findings with a classification error status, group them by account and error type, and then generate targeted remediation scripts. For the KMS issue specifically, if you use AWS Config across your accounts, you can write a custom Config rule that checks whether each CMK's policy includes the Macie service role, and flag non-compliant keys.
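For the KMS check, the core logic of such a rule could look something like the sketch below. It takes a key policy document as a JSON string and looks for an Allow statement that grants kms:Decrypt to a principal whose ARN ends in AWSServiceRoleForAmazonMacie. This is a simplification under my own assumptions: it ignores conditions, wildcard or account-root principals, Deny statements, and KMS grants, so treat a negative result as "needs review" rather than proof of a problem.

```python
import json

MACIE_ROLE_NAME = "AWSServiceRoleForAmazonMacie"


def _as_list(value):
    """Policy fields can be a single value or a list; normalise to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def policy_allows_macie_decrypt(policy_json):
    """Return True if any Allow statement grants kms:Decrypt to the Macie role.

    Simplified check: conditions, Deny statements and grants are not evaluated.
    """
    policy = json.loads(policy_json)
    for statement in _as_list(policy.get("Statement")):
        if statement.get("Effect") != "Allow":
            continue
        actions = _as_list(statement.get("Action"))
        if not any(action in ("kms:Decrypt", "kms:*") for action in actions):
            continue
        principal = statement.get("Principal")
        aws_principals = _as_list(principal.get("AWS")) if isinstance(principal, dict) else []
        if any(isinstance(arn, str) and arn.endswith("/" + MACIE_ROLE_NAME)
               for arn in aws_principals):
            return True
    return False
```

You could run the same function offline over key policies exported from each account to build the remediation list, before wiring it into a Config rule.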
Thanks for the detailed answer with multiple options to achieve the desired outcome. I will implement one of the solutions and will update the status accordingly.

Sure sounds like you don’t have a start-up and you are asking this community to build what you can’t for free.