
Automated PII/PCI Sensitive Data Discovery and Application Attribution Across Multi-Account AWS S3 Estate Using Amazon Macie


We are working on a GDPR-driven data discovery initiative across a multi-account AWS environment managed through a centralised Macie Administrator account. The setup spans 40+ SDLC accounts, with Automated Sensitive Data Discovery enabled to evaluate all S3 buckets across the estate.

To view the current discovery statistics in your own environment, navigate to: AWS Console → Amazon Macie → Summary → S3 Buckets (under the Automated Sensitive Data Discovery section)

This provides a live breakdown across four classification states: Sensitive, Not Sensitive, Not Yet Analysed, and Classification Error. The same information is available programmatically via:

  1. aws macie2 get-automated-discovery-configuration
  2. aws macie2 list-resource-profile-artifacts --resource-arn <s3-bucket-arn>
  3. aws macie2 get-resource-profile --resource-arn <s3-bucket-arn>

All findings are stored in a centralised S3 bucket within the Macie Administrator account, encrypted using a CMK, following this path structure:

<s3-bucket>/AWSLogs/<Macie-Admin-Account-ID>/Macie/eu-west-1/<SDLC-Account-ID>/*.jsonl.gz

Each SDLC account folder contains several hundred JSONL.GZ files, and the bucket is continuously populated as discovery progresses.

CHALLENGES WE ARE TRYING TO SOLVE

  1. Querying findings at scale: Amazon Macie does not natively support aggregating or running SQL-style analysis across the raw discovery output stored in S3. Manually downloading and parsing hundreds of compressed files per account does not scale across 40 accounts.

  2. Continuous ingestion: Automated Sensitive Data Discovery is an ongoing process. The S3 bucket is continuously populated with new findings, so any solution needs to handle incremental data rather than requiring a full re-scan each time.

  3. Application attribution: The findings identify which S3 buckets contain sensitive data, but not which application or system owns each bucket. With 40 accounts and thousands of buckets, linking a bucket back to an owning application, team, or business system is non-trivial, particularly where S3 tagging is inconsistent or absent.

WHAT WE HAVE EXPLORED SO FAR

• A Python-based approach to download, decompress, and parse the JSONL.GZ files, flatten the nested Macie schema, and load results into SQLite for SQL analysis.

• An AWS-native approach using a Glue Crawler with CRAWL_NEW_FOLDERS_ONLY policy for incremental ingestion, and Amazon Athena to query findings directly in S3 without any data movement.

• A multi-signal bucket attribution resolver that attempts to map bucket names to owning applications using S3 tags, bucket naming patterns, AWS Resource Groups, account names, and CloudTrail creator events as layered fallback signals.
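
For readers who want to try the first approach above, here is a minimal sketch of the parse, flatten, and load steps using only the Python standard library. The dotted field paths in COLUMNS (for example resourcesAffected.s3Bucket.name) are assumptions about the record layout and should be checked against a real file; the table name and the load helper are hypothetical. Downloading the objects from S3 is left out.

```python
import gzip
import json
import sqlite3
from typing import Any, Dict, Iterator


def flatten(obj: Any, prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts into dotted keys; lists are kept as JSON strings."""
    flat: Dict[str, Any] = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            flat.update(flatten(value, f"{prefix}{key}."))
    elif isinstance(obj, list):
        flat[prefix.rstrip(".")] = json.dumps(obj)
    else:
        flat[prefix.rstrip(".")] = obj
    return flat


def read_records(gz_bytes: bytes) -> Iterator[Dict[str, Any]]:
    """Yield one flattened record per non-empty line of a .jsonl.gz payload."""
    for line in gzip.decompress(gz_bytes).decode("utf-8").splitlines():
        if line.strip():
            yield flatten(json.loads(line))


# Assumed field paths (verify against your own output files).
COLUMNS = {
    "account_id": "accountId",
    "bucket_name": "resourcesAffected.s3Bucket.name",
    "object_key": "resourcesAffected.s3Object.key",
    "status_code": "classificationDetails.result.status.code",
}


def load(conn: sqlite3.Connection, source: str, gz_bytes: bytes) -> int:
    """Insert the selected fields of every record into SQLite; return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS findings "
        "(source TEXT, account_id TEXT, bucket_name TEXT, object_key TEXT, status_code TEXT)"
    )
    rows = [
        (source, *(record.get(path) for path in COLUMNS.values()))
        for record in read_records(gz_bytes)
    ]
    conn.executemany("INSERT INTO findings VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)
```

Recording the source file name alongside each row makes it possible to skip files that were already loaded, which is one simple way to handle incremental ingestion in this approach.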

QUESTIONS FOR THE COMMUNITY

  1. Has anyone implemented a scalable pipeline for aggregating and querying Macie Automated Sensitive Data Discovery output across a large multi-account estate? What approach did you take?

  2. For those who have solved the application attribution problem — how did you reliably link S3 buckets back to owning systems, particularly where tagging compliance is inconsistent?

  3. Are there AWS-native features, third-party tools, or integration patterns — such as AWS Config, Service Catalog AppRegistry, or Resource Explorer — that you have found effective for bucket-to-application mapping at scale?

  4. How are others handling Classification Error findings? Is there a recommended approach to diagnose and remediate these in bulk?

Any experiences, architectural patterns, or lessons learned would be greatly appreciated.

  • Sure sounds like you don’t have a start-up and you are asking this community to build what you can’t for free.

1 Answer

This is a well-structured question and the approaches you have already explored are on the right track. I will try to add specifics to each of your four questions based on what AWS provides and what has been documented by others working at similar scale.

  1. Querying findings at scale: Athena with partition projection

     Your Athena and Glue Crawler approach is the recommended AWS-native path. AWS published a detailed walkthrough of this exact pattern on their security blog: How to query and visualize Macie sensitive data discovery results with Athena and QuickSight. The post includes table definitions for Macie's JSONL schema and sample queries.

     One specific improvement worth considering at your scale: rather than relying on a Glue Crawler for partition management, use Athena partition projection. With 40+ accounts and continuous ingestion, the Crawler has to run frequently to pick up new partitions, and each run adds latency and cost. Partition projection calculates partition values and locations from table properties you configure directly, so Athena resolves partitions at query time without a metadata lookup. Given your path structure (AWSLogs/<Admin-Account-ID>/Macie/eu-west-1/<SDLC-Account-ID>/), you can project the SDLC account ID as a partition, using either the enum type (a fixed list of account IDs) or the injected type (which requires every query to filter on that column with an equality condition). Date-type projection only applies if your prefix also contains date components, which the path shown does not. AWS documents this in detail here: Use partition projection with Amazon Athena.

     There is also the Amazon Macie Results Analytics repository on GitHub (maintained by AWS), which provides pre-built Athena table definitions and sample queries for Macie discovery results. If you have not already looked at it, it may save you time on the schema mapping.
  2. Continuous ingestion

     If you move to partition projection as described above, the ingestion problem largely resolves itself because there is no Crawler to re-run. Athena will pick up new files in existing partitions automatically, and new account-level partitions are resolved by the projection configuration without manual intervention.

     If you still want event-driven processing on top of this (for example, to trigger downstream alerts when new sensitive data is found), you could add S3 Event Notifications on your centralised findings bucket to invoke a Lambda function when new JSONL.GZ files land. That Lambda can parse the finding, extract the key fields (bucket ARN, sensitivity score, finding type), and push them to an SNS topic or EventBridge for downstream consumers. This avoids polling and gives you near real-time notification without replacing the Athena layer for analytical queries.
  3. Application attribution

     This is the hardest of your four problems, and your multi-signal fallback approach is the right architecture. A few specific tools to layer in:

     • AWS Service Catalog AppRegistry is purpose-built for this. It lets you define applications as logical groupings and associate AWS resources (including S3 buckets) with them. It supports cross-account applications through AWS RAM sharing, so you can define applications centrally and associate resources from your 40 member accounts. AWS published a walkthrough here: How to manage multi-account applications with AppRegistry and Resource Access Manager. The challenge is that AppRegistry requires resources to be associated explicitly, so it works best going forward, once you enforce it as part of your deployment pipeline. It will not retroactively solve the problem for existing untagged buckets.

     • AWS Resource Explorer with multi-account search can help you find untagged or inconsistently tagged buckets across your organisation. A tag:none query returns resources with no user-created tags at all; to find buckets missing one specific tag key, exclude that key with a negated tag.key filter. Either way, the result is a remediation target list for your tagging compliance effort.

     • AWS Config Aggregator in your management or delegated admin account can collect resource configuration data across all 40 accounts. You can write Config rules that check for the presence of required tags (such as an ApplicationOwner or Team tag) on S3 buckets and report non-compliant resources. This does not solve attribution directly, but it gives you a compliance mechanism to enforce tagging going forward.

     • Tag policies through AWS Organizations let you define and enforce standardised tag keys and allowed values across all accounts. If you have not implemented these yet, they are worth setting up in parallel so that new buckets are tagged correctly while you work through the backlog.

     For your existing untagged buckets, the CloudTrail creator event approach you mentioned is probably the most reliable retroactive signal. The CreateBucket API call in CloudTrail records the IAM principal that created the bucket, which in most environments maps back to a deployment role or CI/CD pipeline associated with a specific application or team. Bear in mind that CloudTrail Event history only covers the last 90 days, so older buckets need a trail whose logs have been retained in S3. Combining the creator signal with bucket naming conventions and account-level ownership gives you reasonable coverage for the historical backlog.
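
To make the layered fallback concrete, here is a minimal sketch of a resolver that tries each signal in order and reports which one matched. Everything specific in it is hypothetical: the tag keys, the bucket naming convention (app-env-...), and the "<App>-deploy" role naming pattern are assumptions to be replaced with your own conventions, and the Resource Groups signal is left out for brevity. The creator ARN is assumed to have been looked up from CloudTrail beforehand.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class BucketContext:
    """Signals gathered for one bucket before attribution."""
    name: str
    account_name: str = ""
    tags: Dict[str, str] = field(default_factory=dict)
    creator_arn: str = ""  # principal from the CloudTrail CreateBucket event, if found


# Hypothetical conventions; adjust to your estate.
TAG_KEYS = ("Application", "ApplicationOwner", "Team")
NAME_PATTERN = re.compile(r"^(?P<app>[a-z0-9]+)-(?:dev|test|stage|prod)-")
CREATOR_ROLE_PATTERN = re.compile(r":(?:assumed-role|role)/(?P<app>[A-Za-z0-9]+)-deploy")


def resolve_owner(ctx: BucketContext) -> Tuple[str, str]:
    """Return (owner, signal) using the first signal that yields a value."""
    # 1. Explicit tags are the strongest signal.
    for key in TAG_KEYS:
        if ctx.tags.get(key):
            return ctx.tags[key], f"tag:{key}"
    # 2. Bucket naming convention.
    match = NAME_PATTERN.match(ctx.name)
    if match:
        return match.group("app"), "bucket-name"
    # 3. Deployment role that created the bucket.
    match = CREATOR_ROLE_PATTERN.search(ctx.creator_arn)
    if match:
        return match.group("app").lower(), "cloudtrail-creator"
    # 4. Account-level ownership as the weakest fallback.
    if ctx.account_name:
        return ctx.account_name, "account-name"
    return "unknown", "unresolved"
```

Returning the matching signal alongside the owner lets you report attribution confidence per bucket and prioritise manual review of the weaker matches.
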
  4. Classification Errors

     Classification errors from Macie automated discovery typically fall into a few specific categories, documented here: Remediating coverage issues for automated sensitive data discovery. The most common causes at scale are:

• KMS key access: objects encrypted with customer managed KMS keys where the key policy does not grant Macie's service-linked role decrypt permissions. This is the most frequent cause in multi-account setups because each account may have its own CMKs. You need to update the key policy for each relevant CMK to allow the AWSServiceRoleForAmazonMacie role to perform kms:Decrypt.

• Restrictive bucket policies: if a bucket policy has explicit Deny statements, these override Macie's service-linked role permissions even if the role otherwise has access.

• Unsupported object types or size: Macie has limits on which file types it can classify and the maximum object size it will analyse.
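
As a rough illustration of the KMS check described above, the following sketch inspects a key policy document and reports whether any Allow statement names the Macie service-linked role explicitly and grants kms:Decrypt. It is deliberately simplified: it ignores Condition blocks, Deny statements, and access delegated through the account root principal, so a False result means "worth reviewing" rather than "definitely blocked". The function name is hypothetical.

```python
from typing import Any, Dict, List


def _as_list(value: Any) -> List[Any]:
    """IAM policy fields may be a single value or a list; normalise to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def allows_macie_decrypt(policy: Dict[str, Any], role_arn: str) -> bool:
    """True if an Allow statement explicitly grants kms:Decrypt to role_arn."""
    for statement in _as_list(policy.get("Statement")):
        if statement.get("Effect") != "Allow":
            continue
        principal = statement.get("Principal")
        # Only explicit AWS principals are considered; "*" is not treated as a match.
        principals = _as_list(principal.get("AWS")) if isinstance(principal, dict) else []
        actions = _as_list(statement.get("Action"))
        if role_arn in principals and any(a in ("kms:Decrypt", "kms:*") for a in actions):
            return True
    return False
```

The policy document itself can be fetched per key with the KMS GetKeyPolicy API and parsed with json.loads before being passed in; the role ARN follows the form arn:aws:iam::<account-id>:role/aws-service-role/macie.amazonaws.com/AWSServiceRoleForAmazonMacie.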

For bulk remediation, you can use your Athena table to query for all findings with a classification error status, group them by account and error type, and then generate targeted remediation scripts. For the KMS issue specifically, if you use AWS Config across your accounts, you can write a custom Config rule that checks whether each CMK's policy includes the Macie service role, and flag non-compliant keys.
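
The Athena table referred to above could be defined with partition projection, as suggested under point 1. The sketch below generates such a DDL statement from Python using the enum projection type over a known list of member account IDs. The table name, the partition column name member_account, and the two data columns are placeholders and assumptions; the full column list should come from the table definitions in the AWS blog post or the Macie Results Analytics repository.

```python
from typing import Sequence


def build_projection_ddl(
    table: str,
    bucket: str,
    admin_account_id: str,
    region: str,
    member_account_ids: Sequence[str],
) -> str:
    """Return a CREATE EXTERNAL TABLE statement that uses enum partition projection."""
    base = f"s3://{bucket}/AWSLogs/{admin_account_id}/Macie/{region}/"
    # Athena substitutes ${member_account} with each projected partition value.
    template = base + "${member_account}/"
    properties = {
        "projection.enabled": "true",
        "projection.member_account.type": "enum",
        "projection.member_account.values": ",".join(member_account_ids),
        "storage.location.template": template,
    }
    props_sql = ",\n".join(f"  '{key}' = '{value}'" for key, value in properties.items())
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n"
        # Placeholder columns only; replace with the full Macie schema.
        "  accountid string,\n"
        "  category string\n"
        ")\n"
        "PARTITIONED BY (member_account string)\n"
        "ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'\n"
        f"LOCATION '{base}'\n"
        f"TBLPROPERTIES (\n{props_sql}\n)"
    )
```

With the enum type, onboarding a new account means adding its ID to the projection.member_account.values property. The injected type avoids that step but requires every query to filter on the partition column with an equality condition.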

answered a month ago
  • Thanks for the detailed answer with multiple options to achieve the desired outcome. I will implement one of the solutions and will update the status accordingly.
