How do I automatically redact PII from Amazon Connect call transcripts and recordings using Contact Lens?

Contact centers regularly capture PII (names, addresses, SSNs, credit card numbers) in call audio and transcripts. To meet PCI DSS, GDPR, and HIPAA controls, this data must be removed from those artifacts before they are made available for QA, analytics, or model training. This article shows how Amazon Connect Contact Lens produces redacted transcripts and audio, where the artifacts land in Amazon S3, how to lock down access to original (un-redacted) files, and the limits of what redaction can and cannot do.

Short description

Contact Lens for Amazon Connect performs natural-language redaction of PII as part of conversational analytics. For voice and chat, redaction is applied post-contact (after the call disconnects or the chat ends). For email — which is asynchronous — analysis and redaction begin when the email is received, not on a "post-contact" trigger. When you enable redaction in the contact flow, Contact Lens writes a redacted analyzed transcript (and, optionally, the original analyzed transcript) to your instance's S3 bucket under separate prefixes — Analysis/Voice/, Analysis/Voice/Redacted/, Analysis/Chat/, Analysis/Chat/Redacted/, Analysis/Email/, Analysis/Email/Redacted/. For voice, a redacted audio file (with silences over the PII segments) is also produced. By denying access to the un-redacted prefixes at the bucket-policy or KMS-key-policy level, you can ensure that day-to-day users see only redacted artifacts.

Resolution

1. Enable redaction in the contact flow

Redaction is configured per contact flow, not per instance — different flows can have different settings.

  1. Open the contact flow in the Amazon Connect flow editor.
  2. Add (or open) a Set recording, analytics and processing behavior block. (The older Set recording and analytics behavior block is now legacy; it still works but new flows should use the newer block, which also supports email and in-flight chat redaction.)
  3. Under Recording, enable both Agent and Customer audio tracks. This is a hard requirement, not a recommendation — Contact Lens analytics (and therefore redaction) can only be enabled when RecordedParticipants contains both Agent and Customer. Single-sided recording silently disables analytics.
  4. Under Analytics, enable Contact Lens and choose the analytics mode:
    • Post-call analytics — analysis runs after the call ends. Redaction is applied here.
    • Real-time analytics — segment-by-segment analysis emitted to a Kinesis stream during the call (for supervisor alerts and live coaching). Real-time analytics does not redact in the live segment stream — redaction is still applied post-call to the stored transcript.
  5. Choose the language. Contact Lens redaction is supported across many English variants and additional languages including Spanish, French, German, Italian, Japanese, Korean, and Portuguese. The exact list expands periodically, and the supported set for redaction is a subset of the languages supported for analytics, so verify the current list against the Contact Lens supported-languages documentation. Use the xx-XX locale format (for example en-US, es-US, fr-FR).
  6. Check Redact sensitive data.
  7. Choose what to retain:
    • Get redacted transcripts with redacted audio — only the redacted artifacts are kept.
    • Get redacted and original transcripts with redacted audio — both versions are kept.
  8. (Optional) Choose the mask mode:
    • PII (default) — every detected entity is replaced with the literal [PII].
    • EntityType — each entity is replaced with its specific type, for example [NAME], [SSN], [CREDIT_DEBIT_NUMBER]. Useful for downstream QA tooling that needs to distinguish entity classes without seeing the values.
  9. (Optional) Choose specific entity types via AnalyticsRedactionEntities if you don't want to redact all categories — for example, redact only financial entities and leave NAME un-redacted for QA workflows.

Important — irreversibility of "redacted only". If you choose redacted-only retention, the original analyzed file is never written. The redacted JSON does not preserve the original text of the redacted segments, and the redacted audio replaces those segments with silence. If the model misses a PII instance, you cannot recover it for review later. For most regulated workloads, retaining originals in a hardened, separately-keyed location (see Section 4) is safer than redacted-only.
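The constraints in the steps above can be checked before a flow is published. The following is a minimal sketch, not the flow language schema: the settings dictionary and its key names (recorded_participants, language, redact, retention, mask_mode) are hypothetical and only mirror the choices described in this section.

```python
import re

# xx-XX locale format, for example en-US or fr-FR.
LOCALE_RE = re.compile(r"^[a-z]{2}-[A-Z]{2}$")


def check_redaction_settings(settings):
    """Return a list of problems found in a hypothetical settings dict."""
    problems = []
    participants = set(settings.get("recorded_participants", []))
    if not {"Agent", "Customer"} <= participants:
        problems.append("record both Agent and Customer, or analytics is disabled")
    if not LOCALE_RE.match(settings.get("language", "")):
        problems.append("language must use the xx-XX format, for example en-US")
    if not settings.get("redact", False):
        problems.append("redaction is not enabled")
    if settings.get("retention") not in ("redacted_only", "redacted_and_original"):
        problems.append("retention must be redacted_only or redacted_and_original")
    if settings.get("mask_mode", "PII") not in ("PII", "EntityType"):
        problems.append("mask mode must be PII or EntityType")
    return problems
```

An empty list means the settings satisfy the requirements listed above. It does not confirm that the chosen locale is supported for redaction, which still has to be checked against the documentation.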

2. What gets redacted

Contact Lens detects and redacts the following entity types (the exact list as accepted by AnalyticsRedactionEntities). If AnalyticsRedactionEntities is omitted, all of these are redacted:

  • Financial: BANK_ACCOUNT_NUMBER, BANK_ROUTING, CREDIT_DEBIT_NUMBER, CREDIT_DEBIT_CVV, CREDIT_DEBIT_EXPIRY, INTERNATIONAL_BANK_ACCOUNT_NUMBER, PIN, SWIFT_CODE
  • Government / national IDs: SSN, US_INDIVIDUAL_TAX_IDENTIFICATION_NUMBER, PASSPORT_NUMBER, DRIVER_ID, CA_HEALTH_NUMBER, CA_SOCIAL_INSURANCE_NUMBER, UK_NATIONAL_HEALTH_SERVICE_NUMBER, UK_NATIONAL_INSURANCE_NUMBER, UK_UNIQUE_TAXPAYER_REFERENCE_NUMBER, IN_AADHAAR, IN_PERMANENT_ACCOUNT_NUMBER, IN_NREGA, IN_VOTER_NUMBER
  • Personal: NAME, AGE, EMAIL, PHONE, ADDRESS, DATE_TIME, AGENT_DISPLAY_NAME, CUSTOMER_DISPLAY_NAME
  • Vehicle / asset: LICENSE_PLATE, VEHICLE_IDENTIFICATION_NUMBER
  • Technical / credentials: AWS_ACCESS_KEY, AWS_SECRET_KEY, IP_ADDRESS, MAC_ADDRESS, PASSWORD, URL, USERNAME, ATTACHMENT_NAME

Note: DATE_TIME covers all dates, not just date of birth — set it carefully if you need to retain timestamps for analytics.

In the redacted audio (*_call_recording_redacted_*.wav), each redacted segment is replaced with silence. The silence offsets are stored in the analysis JSON under Transcript[].Redaction.RedactedTimestamps[] (each entry has BeginOffsetMillis and EndOffsetMillis). You can use those offsets to overlay a tone in post-processing if your QA team prefers a beep over silence. Caveat: silenced segments are not flagged as non-talk time in the Connect admin UI, so silence inserted by redaction can distort talk-time / non-talk-time analytics if you don't account for it.
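The tone overlay mentioned above could look roughly like the following sketch. It assumes the redacted recording is 16-bit PCM WAV (check your own files) and uses only the Python standard library. The function names, the 1 kHz tone, and the amplitude are illustrative choices, not part of Contact Lens.

```python
import math
import struct
import wave


def redaction_offsets(analysis):
    """Collect (begin_ms, end_ms) pairs from Transcript[].Redaction.RedactedTimestamps[]."""
    pairs = []
    for turn in analysis.get("Transcript", []):
        for ts in turn.get("Redaction", {}).get("RedactedTimestamps", []):
            pairs.append((ts["BeginOffsetMillis"], ts["EndOffsetMillis"]))
    return pairs


def overlay_tone(wav_in, wav_out, offsets_ms, freq_hz=1000.0, amplitude=8000):
    """Write a copy of a 16-bit PCM WAV with a sine tone over each offset range."""
    with wave.open(wav_in, "rb") as src:
        params = src.getparams()
        if params.sampwidth != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        frames = bytearray(src.readframes(params.nframes))

    frame_size = 2 * params.nchannels  # bytes per frame across all channels
    total = len(frames) // frame_size
    for begin_ms, end_ms in offsets_ms:
        first = min(total, begin_ms * params.framerate // 1000)
        last = min(total, end_ms * params.framerate // 1000)
        for i in range(first, last):
            value = int(amplitude * math.sin(2 * math.pi * freq_hz * i / params.framerate))
            # Same sample on every channel, little-endian signed 16-bit.
            frames[i * frame_size:(i + 1) * frame_size] = struct.pack("<h", value) * params.nchannels

    with wave.open(wav_out, "wb") as dst:
        dst.setparams(params)
        dst.writeframes(bytes(frames))
```

Both wav_in and wav_out can be file paths or binary file objects. The sketch overwrites the samples in each range, which is equivalent to an overlay only because those samples are already silent in the redacted file.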

Important timing note. Contact Lens redaction is applied after the contact ends (or, for email, after the email is received and analyzed). It does not prevent an agent from hearing or seeing PII in real time during a voice call. If you need to ensure the agent never observes the data — for example, a card PAN — pair redaction with a secure-input pattern that pauses recording and encrypts the input at capture time. For chat, you can additionally enable in-flight redaction, which masks PII in chat messages before they are delivered to the agent UI. (See the related article on secure input with Amazon Lex and AWS Lambda, and Enable in-flight sensitive data redaction.)

3. Where the files land in Amazon S3

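Contact Lens writes its output under date-partitioned prefixes. The paths referenced in this article are listed here; verify the exact layout against the output file locations documentation:

  • Original analyzed transcript (if retained): s3://<bucket>/Analysis/Voice/<YYYY>/<MM>/<DD>/<contactId>_analysis_*.json
  • Redacted analyzed transcript: s3://<bucket>/Analysis/Voice/Redacted/<YYYY>/<MM>/<DD>/<contactId>_analysis_redacted_*.json
  • Redacted audio: <contactId>_call_recording_redacted_*.wav, under Analysis/Voice/Redacted/
  • Chat and email output: the Analysis/Chat/ and Analysis/Email/ prefixes, each with a Redacted/ sub-prefix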

Original call audio (the un-redacted WAV) lives under your CALL_RECORDINGS storage prefix (commonly call-recordings/), not under Analysis/. To delete a call's PII completely you must remove the original audio under call-recordings/, the original transcript under Analysis/Voice/ (if retained), and the redacted artifacts under Analysis/Voice/Redacted/.
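To make the deletion scope concrete, here is a small sketch that picks out one contact's objects from a key listing. It assumes that file names begin with the contact ID followed by an underscore, as in the patterns shown in this article. The naming of original audio under call-recordings/ is an assumption to confirm against your bucket, and keys_for_contact is a hypothetical helper.

```python
import posixpath

# Prefixes named in this article; adjust to your instance's storage configuration.
DEFAULT_PREFIXES = ("call-recordings/", "Analysis/Voice/", "Analysis/Voice/Redacted/")


def keys_for_contact(keys, contact_id, prefixes=DEFAULT_PREFIXES):
    """Return the object keys that belong to one contact under the given prefixes."""
    selected = []
    for key in keys:
        name = posixpath.basename(key)
        # str.startswith accepts a tuple, so one call covers every prefix.
        if key.startswith(prefixes) and name.startswith(contact_id + "_"):
            selected.append(key)
    return selected
```

The returned keys can then be passed to your S3 delete call. On a versioned bucket a plain delete only adds a delete marker, so earlier object versions have to be removed as well for the data to be gone.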

4. Lock down access to original artifacts

The principle: agents, analytics roles, and ML training pipelines should read only Analysis/*/Redacted/*. A small compliance role can read originals when legally required, with all access logged.

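The Amazon Connect service-linked role for an instance has the form arn:aws:iam::<account>:role/aws-service-role/connect.amazonaws.com/AWSServiceRoleForAmazonConnect_<instance-id>. Because the SLR ARN includes the instance ID, allow it through with a wildcard suffix (as in the policy below) or a service principal condition. Note that the first statement uses NotResource, so it denies s3:GetObject on every object in the bucket outside the three Redacted prefixes, including call-recordings/; the second statement restates that rule for call-recordings/ explicitly: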

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyOriginalAnalysisExceptComplianceAndConnectSLR",
      "Effect": "Deny",
      "Principal": "*",
      "Action": ["s3:GetObject", "s3:GetObjectVersion"],
      "NotResource": [
        "arn:aws:s3:::my-connect-bucket/Analysis/Voice/Redacted/*",
        "arn:aws:s3:::my-connect-bucket/Analysis/Chat/Redacted/*",
        "arn:aws:s3:::my-connect-bucket/Analysis/Email/Redacted/*"
      ],
      "Condition": {
        "ArnNotLike": {
          "aws:PrincipalArn": [
            "arn:aws:iam::123456789012:role/ConnectComplianceAuditor",
            "arn:aws:iam::123456789012:role/aws-service-role/connect.amazonaws.com/AWSServiceRoleForAmazonConnect_*"
          ]
        }
      }
    },
    {
      "Sid": "DenyOriginalCallRecordingExceptComplianceAndConnectSLR",
      "Effect": "Deny",
      "Principal": "*",
      "Action": ["s3:GetObject", "s3:GetObjectVersion"],
      "Resource": "arn:aws:s3:::my-connect-bucket/call-recordings/*",
      "Condition": {
        "ArnNotLike": {
          "aws:PrincipalArn": [
            "arn:aws:iam::123456789012:role/ConnectComplianceAuditor",
            "arn:aws:iam::123456789012:role/aws-service-role/connect.amazonaws.com/AWSServiceRoleForAmazonConnect_*"
          ]
        }
      }
    }
  ]
}

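Pair this with a KMS key policy (or grant scope) that also restricts kms:Decrypt on the originals to the same compliance role, as defense in depth. A key policy cannot reference S3 prefixes directly, so either encrypt originals with a separate key or add a condition on the S3 encryption context (kms:EncryptionContext:aws:s3:arn), which carries the object ARN only when S3 Bucket Keys are disabled. Even if the bucket policy is mis-edited, the missing decrypt path prevents un-redacted reads.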
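Before deploying, it can help to reason through which reads the bucket policy above explicitly denies. The sketch below is a deliberately simplified model of the two Deny statements for s3:GetObject, using fnmatch for the wildcard matching. It is not the IAM policy evaluator, and is_denied is a hypothetical helper.

```python
from fnmatch import fnmatchcase

BUCKET = "arn:aws:s3:::my-connect-bucket/"
REDACTED = [
    BUCKET + "Analysis/Voice/Redacted/*",
    BUCKET + "Analysis/Chat/Redacted/*",
    BUCKET + "Analysis/Email/Redacted/*",
]
EXEMPT_PRINCIPALS = [
    "arn:aws:iam::123456789012:role/ConnectComplianceAuditor",
    "arn:aws:iam::123456789012:role/aws-service-role/connect.amazonaws.com/AWSServiceRoleForAmazonConnect_*",
]


def is_denied(principal_arn, object_key):
    """Model the explicit Deny outcome for s3:GetObject under the policy above."""
    # ArnNotLike is false (no Deny) when the principal matches any listed pattern.
    if any(fnmatchcase(principal_arn, p) for p in EXEMPT_PRINCIPALS):
        return False
    resource = BUCKET + object_key
    # Statement 1: NotResource, so everything outside the Redacted prefixes matches.
    outside_redacted = not any(fnmatchcase(resource, p) for p in REDACTED)
    # Statement 2: Resource is call-recordings/*.
    in_recordings = fnmatchcase(resource, BUCKET + "call-recordings/*")
    return outside_redacted or in_recordings
```

A False result only means that no explicit Deny applies. The principal still needs an Allow from an identity or bucket policy before the read succeeds.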

For an extra layer, use S3 Event Notifications to trigger a Lambda when an original transcript is created. The Lambda moves the object to a hardened bucket (my-connect-originals-locked) with MFA-required reads and CloudTrail Data Events enabled, then deletes the source object. This isolates originals to a single audited location.
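A minimal sketch of that Lambda logic follows. It assumes the standard S3 event notification shape and takes a boto3 S3 client as a parameter, so the function itself has no AWS dependency. The relocate_originals name is hypothetical, and the destination bucket name comes from the example above.

```python
from urllib.parse import unquote_plus

DEST_BUCKET = "my-connect-originals-locked"


def relocate_originals(event, s3, dest_bucket=DEST_BUCKET):
    """Copy each un-redacted object named in an S3 event to dest_bucket, then delete the source."""
    moved = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        # A prefix filter such as Analysis/Voice/ also matches Redacted/, so skip those here.
        if "/Redacted/" in key:
            continue
        s3.copy_object(Bucket=dest_bucket, Key=key, CopySource={"Bucket": bucket, "Key": key})
        s3.delete_object(Bucket=bucket, Key=key)
        moved.append(key)
    return moved


# In a Lambda function, the entry point could be wired up like this:
#   import boto3
#   def handler(event, context):
#       return relocate_originals(event, boto3.client("s3"))
```

The copy runs before the delete, so a failed copy raises and leaves the source object in place. Test in a non-production instance first to confirm that nothing else still expects the original at its initial location.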

5. Validate redaction

After deployment, place a test call and read aloud:

  • A test SSN: 123-45-6789
  • A test PAN: 4111 1111 1111 1111
  • A name and address: John Doe, 123 Main Street, Seattle Washington

Then:

  1. Check the redacted JSON at s3://<bucket>/Analysis/Voice/Redacted/<YYYY>/<MM>/<DD>/<contactId>_analysis_redacted_*.json. Search the file for the test values — they should not appear. With default mask mode, you'll see the literal [PII] in the affected transcript turns; with EntityType mode, you'll see specific tags such as [SSN], [NAME], or [CREDIT_DEBIT_NUMBER]. Note that Contact Lens-generated transcripts cannot be downloaded through the Amazon Connect admin website — pull them from S3 directly.

  2. Check the redacted audio. Download <contactId>_call_recording_redacted_*.wav and play the segments where the values were spoken — you should hear silence.

  3. Check the original JSON (if retained) at s3://<bucket>/Analysis/Voice/<YYYY>/<MM>/<DD>/<contactId>_analysis_*.json. The original values should be present, along with Redaction.RedactedTimestamps entries on each turn that contains PII. (Rule-based contact categorization lives separately under Categories.MatchedDetails and is unrelated to PII redaction.)

  4. Run Amazon Macie on the redacted prefix. Configure a Macie sensitive-data discovery job (or automated sensitive data discovery) to scan Analysis/Voice/Redacted/. Findings should be empty. If Macie flags PII in the redacted prefix, it indicates either a redaction miss (raise it with AWS Support) or — more commonly — non-PII text that triggered Macie's broader patterns (review case-by-case). Note: Macie discovery jobs are billed per GB scanned; price-test on a small prefix first.
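The search in step 1 can be scripted. The sketch below walks every string in a parsed analysis document and reports which test values still appear, comparing digits separately so that "123 45 6789" still matches "123-45-6789". The find_leaks helper is hypothetical, and the digit comparison can produce false positives when unrelated numbers sit next to each other.

```python
import re


def iter_strings(node):
    """Yield every string value found anywhere in a parsed JSON document."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from iter_strings(value)
    elif isinstance(node, list):
        for value in node:
            yield from iter_strings(value)


def find_leaks(analysis, test_values):
    """Return the test values that still appear anywhere in the analysis document."""
    texts = list(iter_strings(analysis))
    leaks = []
    for value in test_values:
        digits = re.sub(r"\D", "", value)
        for text in texts:
            literal_hit = value.lower() in text.lower()
            # Only compare digit runs for values that are mostly numeric (4+ digits).
            digit_hit = len(digits) >= 4 and digits in re.sub(r"\D", "", text)
            if literal_hit or digit_hit:
                leaks.append(value)
                break
    return leaks
```

Load the redacted file with json.load and pass it in along with the values read aloud on the test call. An empty list is the expected result for the redacted JSON.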

6. Caveats and known limits

  • HIPAA de-identification. AWS explicitly states that Contact Lens redaction does not by itself meet HIPAA Safe Harbor de-identification standards. Treat redacted transcripts as still-protected information (encryption, access controls, retention).

  • Redaction quality is probabilistic. The model can miss domain-specific identifiers (member IDs, claim numbers, internal account references) that are not in the supported entity list. For domain-specific patterns, run a custom Lambda that applies regex replacement on the redacted JSON before publishing it to downstream consumers.

  • Real-time stream is not redacted. Real-time Contact Lens analytics emitted to Kinesis is not redacted in transit. Restrict consumer roles on the Kinesis stream and apply redaction in your consumer if you need to display the live stream outside the supervisor console.

  • Multi-party calls. When a third party joins (warm transfer, conference), redaction quality on the joined leg can be lower until enough audio is sampled. Validate with your specific transfer flows.

  • Languages. Confirm your contact flow language matches a redaction-supported locale. The supported set for redaction is a subset of the analytics-supported languages, and selecting an analytics-only locale silently leaves transcripts un-redacted.

  • Silence is not non-talk time. Audio silenced by redaction is not flagged as non-talk time in the Connect admin website, so QA dashboards built on talk-time metrics may need to subtract redaction offsets to avoid skew.

  • Console download limits. Through the Amazon Connect admin website, agents and supervisors with the right security profile can play (and, for voice, download) redacted audio, but redacted chat files and voice transcripts must be retrieved from S3 directly.
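For the domain-specific identifiers mentioned in the caveats above, the regex pass over the redacted JSON could be sketched as follows. It assumes each Transcript[] turn carries its text in a Content field. The two patterns are invented examples of member and claim formats, so replace them with your own.

```python
import copy
import re

# Hypothetical identifier formats; substitute the patterns your business uses.
CUSTOM_PATTERNS = {
    "MEMBER_ID": re.compile(r"\bM-\d{8}\b"),
    "CLAIM_NUMBER": re.compile(r"\bCLM\d{6,}\b"),
}


def mask_custom_identifiers(analysis, patterns=CUSTOM_PATTERNS):
    """Return a copy of the analysis with custom identifiers masked in each turn."""
    result = copy.deepcopy(analysis)  # leave the caller's document untouched
    for turn in result.get("Transcript", []):
        text = turn.get("Content")
        if not isinstance(text, str):
            continue
        for label, pattern in patterns.items():
            text = pattern.sub("[" + label + "]", text)
        turn["Content"] = text
    return result
```

This masks transcript text only. The redacted audio is unchanged, and spoken identifiers may be transcribed in a different shape than the written form, so test the patterns against real transcripts.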

Related information