Recommended Approach: Export → Scan with AWS Glue
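Amazon Macie scans only S3, so for OpenSearch, DocumentDB, and Neptune the recommended pattern is to put the data in front of AWS Glue Sensitive Data Detection, which supports 100+ PII entity types via pattern matching and ML. For OpenSearch and Neptune that means exporting or sampling data to S3 first; for DocumentDB, Glue can connect to the cluster directly.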
Amazon OpenSearch
- Use the _mapping API to export field mappings across all domains/indices (first-pass heuristic on field names like email, phone, name).
- Sample documents via the _search API with scroll/point-in-time, export to S3.
- In a Glue ETL job, run Sensitive Data Detection's classifyColumns() call (classify_columns() in PySpark) against the exported data to detect PII at the column level.
Reference: Detect, mask, and redact PII using AWS Glue before loading into OpenSearch (https://aws.amazon.com/blogs/big-data/detect-mask-and-redact-pii-data-using-aws-glue-before-loading-into-amazon-opensearch-service/)
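The field-name heuristic from the first step can be sketched in Python. This is a minimal illustration, assuming the `_mapping` response has already been fetched and parsed into a dict; the name patterns and the function names (`flatten_properties`, `suspect_fields`) are hypothetical and should be tuned to your own conventions. It only inspects names, so it will miss PII in generically named fields and will flag harmless ones such as `hostname`.

```python
import re

# Hypothetical name patterns (an assumption, not an AWS-provided list).
SUSPECT_NAME = re.compile(r"(e[-_]?mail|phone|name|address|ssn|birth|dob)", re.IGNORECASE)

def flatten_properties(properties, prefix=""):
    """Yield (dotted_path, field_type) for every leaf field in a mapping."""
    for name, spec in properties.items():
        path = f"{prefix}{name}"
        if "properties" in spec:  # object or nested field: recurse into it
            yield from flatten_properties(spec["properties"], prefix=f"{path}.")
        else:
            yield path, spec.get("type", "object")

def suspect_fields(mapping_response):
    """Return {index: [field paths whose names look like PII]}."""
    result = {}
    for index, body in mapping_response.items():
        props = body.get("mappings", {}).get("properties", {})
        hits = [path for path, _type in flatten_properties(props) if SUSPECT_NAME.search(path)]
        if hits:
            result[index] = sorted(hits)
    return result

# Made-up sample shaped like a GET /_mapping response.
sample = {
    "customers": {"mappings": {"properties": {
        "email": {"type": "keyword"},
        "contact": {"properties": {"phone_number": {"type": "keyword"}}},
        "order_total": {"type": "float"},
    }}}
}
print(suspect_fields(sample))
```

Treat the output as a shortlist for prioritizing which indices to sample, not as a detection result; the Glue scan on exported documents is what actually confirms PII.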
Amazon DocumentDB
- AWS Glue has a native DocumentDB connector — create a Glue connection to each cluster.
- Use a Glue Crawler to catalog collections into the Glue Data Catalog.
- Run the Glue PII Detection transform (classifyColumns()) against the cataloged tables — returns a map of field names → detected PII entity types.
No export to S3 needed — Glue connects directly.
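The detection step might look like the sketch below. It assumes the PySpark form of the call is `EntityDetector.classify_columns(frame, entity_types, sample_fraction, threshold_fraction)` from `awsglueml.transforms`, as shown in the "Using Sensitive Data Detection outside Glue Studio" reference; check that page for the current signature. The wrapper `classify_sample` and the `FakeDetector` stand-in are hypothetical and exist only so the example runs outside a Glue job.

```python
def classify_sample(detector, frame, entity_types, sample_fraction=0.1, threshold=0.1):
    """Run column-level detection on a sample and keep only columns with hits."""
    classified = detector.classify_columns(frame, entity_types, sample_fraction, threshold)
    return {column: sorted(types) for column, types in classified.items() if types}

# Inside a Glue job (assumed usage, not runnable locally):
#   from awsglueml.transforms import EntityDetector
#   hits = classify_sample(EntityDetector(), dynamic_frame, ["EMAIL", "PHONE_NUMBER"])

class FakeDetector:
    """Local stand-in that mimics the assumed return shape: column -> entity types."""
    def classify_columns(self, frame, entity_types, sample_fraction, threshold):
        return {"email": ["EMAIL"], "notes": [], "mobile": ["PHONE_NUMBER"]}

print(classify_sample(FakeDetector(), frame=None, entity_types=["EMAIL", "PHONE_NUMBER"]))
```

Here `dynamic_frame` would be the DynamicFrame read from the cataloged DocumentDB table, and the entity type names are examples of Glue's managed types.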
Amazon Neptune
- Schema discovery:
  - Property graph (Gremlin): g.V().label().dedup(), g.E().label().dedup(), g.V().hasLabel('X').properties().key().dedup()
  - RDF (SPARQL): SELECT DISTINCT ?p WHERE { ?s ?p ?o }, plus the Neptune statistics API
- Export string-type property values to S3 (via bulk export or targeted queries).
- Run Glue PII Detection on the exported data.
Neptune has the least native tooling support — the export → scan pattern is the primary approach here.
AWS-Native Tools
- AWS Glue Sensitive Data Detection — Primary scanner. Pattern matching + ML, 100+ PII types. Works on anything in the Glue Data Catalog.
- Amazon Comprehend (DetectPiiEntities API) — NLP-based PII detection for unstructured text fields.
- AWS Glue DataBrew — Visual PII profiling with built-in statistics.
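For free-text fields, a call to Comprehend's DetectPiiEntities could be wrapped roughly as follows. The helper `find_pii`, the 0.8 score cut-off, and the `FakeComprehend` stand-in are assumptions for illustration; the request parameters (`Text`, `LanguageCode`) and response keys (`Entities`, `Type`, `Score`, `BeginOffset`, `EndOffset`) follow the boto3 `detect_pii_entities` call.

```python
def find_pii(comprehend, text, min_score=0.8, language="en"):
    """Return (entity_type, matched_text) pairs at or above a confidence cut-off."""
    response = comprehend.detect_pii_entities(Text=text, LanguageCode=language)
    return [
        (entity["Type"], text[entity["BeginOffset"]:entity["EndOffset"]])
        for entity in response["Entities"]
        if entity["Score"] >= min_score
    ]

# Real usage (requires AWS credentials and boto3):
#   import boto3
#   pairs = find_pii(boto3.client("comprehend"), "Contact jane@example.com")

class FakeComprehend:
    """Local stand-in returning one made-up EMAIL entity so the sketch runs offline."""
    def detect_pii_entities(self, Text, LanguageCode):
        needle = "jane@example.com"
        start = Text.index(needle)
        return {"Entities": [{"Type": "EMAIL", "Score": 0.99,
                              "BeginOffset": start, "EndOffset": start + len(needle)}]}

print(find_pii(FakeComprehend(), "Contact jane@example.com for details"))
```

Passing the client in as an argument keeps the function easy to test and lets the orchestration layer supply a client built from assumed-role credentials.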
At Scale (40 accounts)
- Use Step Functions to orchestrate cross-account discovery (assume roles)
- Sample first — use classifyColumns() with sampling (~1000 docs per collection/index) to identify which fields contain PII before doing full scans
- Centralize results in a compliance account's S3 bucket + DynamoDB for the PII inventory
- Tag identified PII fields in the Glue Data Catalog for ongoing governance
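One way to shape the centralized inventory is to turn each classification result into one record per field. The sketch below is an assumption about what such a record could look like: the key layout (`pk`, `sk`), the attribute names, and the function `inventory_items` are hypothetical, not a prescribed DynamoDB schema.

```python
from datetime import datetime, timezone

def inventory_items(account_id, resource, classified, scanned_at=None):
    """Build one inventory record per field that had at least one PII hit.

    classified maps field name -> list of detected entity types,
    the shape the answer describes for classifyColumns() results.
    """
    scanned_at = scanned_at or datetime.now(timezone.utc).isoformat()
    return [
        {
            "pk": f"{account_id}#{resource}",   # partition key: account + data store
            "sk": field,                         # sort key: field or column name
            "entity_types": sorted(types),
            "scanned_at": scanned_at,
        }
        for field, types in sorted(classified.items())
        if types
    ]

# Made-up account ID, resource name, and classification result.
items = inventory_items(
    "111122223333",
    "docdb/orders/customers",
    {"email": ["EMAIL"], "notes": [], "mobile": ["PHONE_NUMBER"]},
    scanned_at="2024-01-01T00:00:00+00:00",
)
for item in items:
    print(item)
```

Each dict could then be written with a DynamoDB `put_item` call from the compliance account, and the same field list can drive tagging in the Glue Data Catalog.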
Third-Party Options (AWS Marketplace)
If you need a turnkey solution with pre-built connectors and compliance dashboards, BigID, Securiti.ai, and OneTrust offer automated PII classification and GDPR reporting, with support for OpenSearch and DocumentDB. Confirm current connector coverage with each vendor before committing.
References
- AWS Glue Sensitive Data Detection (https://docs.aws.amazon.com/glue/latest/dg/detect-PII.html)
- Using Sensitive Data Detection outside Glue Studio (https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-sensitive-data-example.html)
- AWS GDPR Center (https://aws.amazon.com/compliance/gdpr-center/)
