GDPR PII data discovery across OpenSearch, DocumentDB and Neptune for multiple AWS accounts

Hi AWS Support, we are conducting a GDPR compliance exercise across 40 AWS accounts and need to identify Personally Identifiable Information (PII) stored in Amazon OpenSearch, Amazon DocumentDB, and Amazon Neptune. Our goal is to systematically discover and document PII attributes across all three services. We are aware that Amazon Macie does not natively support these services, so we are looking for alternative approaches or tooling recommendations.

Below is a summary of what we are trying to achieve per service:

  1. Amazon OpenSearch
  • Export index mappings across all domains.
  • Identify and list potential PII-related fields on a per-index basis.
  2. Amazon DocumentDB
  • Enumerate all collections and capture key document structures.
  • Identify and list potential PII-related fields on a per-collection basis.
  3. Amazon Neptune
  • Enumerate graph schemas, including node labels, edge labels, and associated properties (for both Property Graph / Gremlin and RDF / SPARQL models).
  • Identify vertex and edge properties that may contain PII (e.g. names, email addresses, phone numbers, identifiers).
  • Capture schema metadata from the Gremlin schema or via SPARQL introspection queries where applicable.

The key questions are:

  1. What is the recommended approach for PII discovery across these three services at scale?
  2. Are there any AWS-native tools, third-party integrations, or partner solutions that support automated PII scanning for OpenSearch, DocumentDB, and Neptune?
  3. Are there any AWS Professional Services engagements, Well-Architected guidance, or reference architectures for GDPR data discovery at this scale?

Any guidance, best practices, or community experience would be greatly appreciated.

1 Answer

Recommended Approach: Export → Scan with AWS Glue

Macie scans only Amazon S3, so the general pattern is to put data from each service in front of AWS Glue Sensitive Data Detection, which supports 100+ PII entity types via pattern matching and ML. For OpenSearch and Neptune that means exporting or sampling data to S3 first; for DocumentDB, Glue can connect to the cluster directly.

Amazon OpenSearch

  1. Use the _mapping API to export field mappings across all domains/indices (first-pass heuristic on field names like email, phone, name).
  2. Sample documents via the _search API with scroll/point-in-time, export to S3.
  3. Run AWS Glue's classifyColumns() API against the exported data to detect PII at the column level.
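
The field-name heuristic in step 1 can be sketched in a few lines of Python. This is a minimal sketch, not a definitive implementation: it assumes you have already fetched the JSON body returned by GET _mapping for a domain, and the name patterns and function names (flatten_mapping, candidate_pii_fields) are hypothetical and should be tuned to your own naming conventions.

```python
import re

# Hypothetical name patterns (an assumption, not an AWS-provided list).
PII_NAME_PATTERN = re.compile(
    r"(e?mail|phone|mobile|name|address|postcode|zip|dob|birth|ssn|passport|iban)",
    re.IGNORECASE,
)


def flatten_mapping(properties, prefix=""):
    """Yield (dotted_field_path, field_type) for every leaf field in a mapping."""
    for name, spec in properties.items():
        path = f"{prefix}{name}"
        if "properties" in spec:  # object or nested field: recurse into children
            yield from flatten_mapping(spec["properties"], prefix=f"{path}.")
        else:
            yield path, spec.get("type", "object")


def candidate_pii_fields(mapping_response):
    """Map each index name to the sorted field paths whose names look PII-related."""
    result = {}
    for index, body in mapping_response.items():
        props = body.get("mappings", {}).get("properties", {})
        result[index] = sorted(
            path for path, _ in flatten_mapping(props) if PII_NAME_PATTERN.search(path)
        )
    return result
```

A name match only flags a candidate (for example, "filename" would also match "name"); the content scan in step 3 is what confirms whether the field actually holds PII.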

Reference: Detect, mask, and redact PII using AWS Glue before loading into OpenSearch (https://aws.amazon.com/blogs/big-data/detect-mask-and-redact-pii-data-using-aws-glue-before-loading-into-amazon-opensearch-service/)

Amazon DocumentDB

  1. AWS Glue has a native DocumentDB connector — create a Glue connection to each cluster.
  2. Use a Glue Crawler to catalog collections into the Glue Data Catalog.
  3. Run the Glue PII Detection transform (classifyColumns()) against the cataloged tables — returns a map of field names → detected PII entity types.

No export to S3 needed — Glue connects directly.
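
If you also want to document the key document structures per collection independently of the crawler, a small helper over a sample of documents may be enough. The sketch below assumes the sampled documents are already available as Python dicts (how you fetch them is up to you); the function names and the "[]" notation for array elements are illustrative choices, not part of any AWS API.

```python
from collections import defaultdict


def field_paths(document, prefix=""):
    """Yield (dotted_path, python_type_name) for every leaf value in a document."""
    for key, value in document.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            yield from field_paths(value, prefix=f"{path}.")
        elif isinstance(value, list):
            for item in value:
                if isinstance(item, dict):
                    yield from field_paths(item, prefix=f"{path}[].")
                else:
                    yield f"{path}[]", type(item).__name__
        else:
            yield path, type(value).__name__


def summarize_collection(sampled_documents):
    """Map each field path to the sorted list of value types seen in the sample."""
    seen = defaultdict(set)
    for doc in sampled_documents:
        for path, type_name in field_paths(doc):
            seen[path].add(type_name)
    return {path: sorted(types) for path, types in sorted(seen.items())}
```

The resulting path list can be reviewed next to the Glue classification output, since schemaless collections often contain fields that appear in only a fraction of documents.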

Amazon Neptune

  1. Schema discovery:
     • Property Graph: g.V().label().dedup(), g.E().label().dedup(), g.V().hasLabel('X').properties().key().dedup()
     • RDF: SELECT DISTINCT ?p WHERE { ?s ?p ?o }, plus the Neptune statistics API
  2. Export string-type property values to S3 (via bulk export or targeted queries).
  3. Run Glue PII Detection on the exported data.

Neptune has the least native tooling support — the export → scan pattern is the primary approach here.
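
For the property graph side, the per-label property query has to be repeated for every label returned by the label queries. A minimal sketch of generating those Gremlin strings is below; the function name is hypothetical, and submitting the queries to the cluster (for example through a Gremlin client of your choice) is left out.

```python
def property_key_queries(vertex_labels, edge_labels):
    """Build one property-key discovery query per vertex label and edge label."""

    def quote(label):
        # Escape backslashes and single quotes so the label is a valid string literal.
        return "'" + label.replace("\\", "\\\\").replace("'", "\\'") + "'"

    queries = {}
    for label in vertex_labels:
        queries[("vertex", label)] = (
            f"g.V().hasLabel({quote(label)}).properties().key().dedup()"
        )
    for label in edge_labels:
        queries[("edge", label)] = (
            f"g.E().hasLabel({quote(label)}).properties().key().dedup()"
        )
    return queries
```

On large graphs these traversals touch every element with the given label, so it may be worth adding a limit() step before properties() when only a schema sample is needed.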

AWS-Native Tools

  • AWS Glue Sensitive Data Detection — Primary scanner. Pattern matching + ML, 100+ PII types. Works on anything in the Glue Data Catalog.
  • Amazon Comprehend (DetectPiiEntities API) — NLP-based PII detection for unstructured text fields.
  • AWS Glue DataBrew — Visual PII profiling with built-in statistics.

At Scale (40 accounts)

  • Use Step Functions to orchestrate cross-account discovery, assuming an IAM role in each target account
  • Sample first — use classifyColumns() with sampling (~1000 docs per collection/index) to identify which fields contain PII before doing full scans
  • Centralize results in a compliance account's S3 bucket + DynamoDB for the PII inventory
  • Tag identified PII fields in the Glue Data Catalog for ongoing governance
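
One possible shape for the centralized PII inventory is sketched below. Everything here is an assumption for illustration: the PiiFinding record, the pk/sk key layout, and the function name are hypothetical, and writing the items to DynamoDB or S3 is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PiiFinding:
    account_id: str
    service: str         # "opensearch", "documentdb", or "neptune"
    resource: str        # domain, cluster, or graph endpoint name
    container: str       # index, collection, or label
    field: str
    entity_types: tuple  # detected PII entity types, e.g. ("EMAIL",)


def to_inventory_items(findings):
    """Merge duplicate findings and build one keyed item per field."""
    merged = {}
    for f in findings:
        key = (f.account_id, f.service, f.resource, f.container, f.field)
        merged.setdefault(key, set()).update(f.entity_types)

    items = []
    for key in sorted(merged):
        account_id, service, resource, container, field = key
        items.append(
            {
                "pk": f"{account_id}#{service}#{resource}",
                "sk": f"{container}#{field}",
                "entity_types": sorted(merged[key]),
            }
        )
    return items
```

Merging on the account, service, resource, container, and field tuple means repeated sampling runs update an existing inventory entry instead of creating duplicates.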

Third-Party Options (AWS Marketplace)

If you need a turnkey solution with pre-built connectors and compliance dashboards: BigID, Securiti.ai, and OneTrust all support OpenSearch and DocumentDB with automated PII classification and GDPR reporting.

AWS
answered 13 days ago
