Not able to select the columns that are nested in PII detection script in AWS Glue, facing IllegalArgumentException: Invalid column name used in sourceColumns!

I am encountering an IllegalArgumentException while running a PII detection script in AWS Glue. The error message indicates that an invalid column name is being used in sourceColumns.

S3bucket_node1_df = spark.read.format("delta").options(**additionalOptions).load("/path/to/delta/")
S3bucket_node1 = DynamicFrame.fromDF(S3bucket_node1_df, glueContext, "S3bucket_node1")

# Script generated for node PIIDetection
detection_parameters = {
  "INDIA_AADHAAR_NUMBER": [{
    "action": "REDACT",
    "sourceColumns": ["kyc[0].kycnumber"],
    "actionOptions": {"redactText": "******"}
  }]
}

entity_detector = EntityDetector()
PIIDetection_node2 = entity_detector.detect(S3bucket_node1, detection_parameters, "DetectedEntities", "HIGH")

**Schema overview:**

|-- kyc: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- kycnumber: string (nullable = true)
asked 9 months ago · 151 views
2 Answers

You're referencing a field inside an array of structs (kyc[0].kycnumber). Glue's PII detection does not appear to support array indexing ([0]) in sourceColumns, which would explain the IllegalArgumentException. One option is to flatten the nested structure before running PII detection, for example:

from pyspark.sql.functions import explode

# One output row per element of the kyc array; the struct lands in "kyc_element"
S3bucket_node1_df_flattened = S3bucket_node1_df.withColumn("kyc_element", explode("kyc"))
# Keep only the nested field, which becomes the top-level column "kycnumber"
S3bucket_node1_flattened = S3bucket_node1_df_flattened.select("kyc_element.kycnumber")
S3bucket_node1_flattened_dynamic = DynamicFrame.fromDF(S3bucket_node1_flattened, glueContext, "S3bucket_node1_flattened")

detection_parameters = {
  "INDIA_AADHAAR_NUMBER": [{
    "action": "REDACT",
    "sourceColumns": ["kycnumber"],
    "actionOptions": {"redactText": "******"}
  }]
}

PIIDetection_node2 = entity_detector.detect(S3bucket_node1_flattened_dynamic, detection_parameters, "DetectedEntities", "HIGH")

AWS
answered 9 months ago
  • Consider this example:

    |-- fathername: struct (nullable = true)
    |    |-- completename: string (nullable = true)
    |    |-- firstname: string (nullable = true)

    Even for the above schema, I am not able to use a source column like

    "sourceColumns": ["fathername.firstname"],

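For a plain struct like the one in this comment, one possible workaround is to copy the nested field into a top-level column and point sourceColumns at that column instead of the dotted path. The following is a minimal sketch under that assumption; it has not been verified against Glue's EntityDetector. The helper names `flat_alias`, `flatten_struct_fields` and `detection_parameters_for` are hypothetical and not part of any Glue API, and the entity type is left to the caller.

```python
def flat_alias(path):
    """Turn a dotted path such as 'fathername.firstname' into a flat column name."""
    return path.replace(".", "_")


def flatten_struct_fields(df, paths):
    """Add one top-level column per dotted struct path to a Spark DataFrame."""
    # Imported lazily so the pure-Python helpers work without a Spark runtime.
    from pyspark.sql.functions import col

    for path in paths:
        df = df.withColumn(flat_alias(path), col(path))
    return df


def detection_parameters_for(entity_type, paths, redact_text="******"):
    """Build detect() parameters whose sourceColumns use the flat aliases."""
    return {
        entity_type: [{
            "action": "REDACT",
            "sourceColumns": [flat_alias(p) for p in paths],
            "actionOptions": {"redactText": redact_text},
        }]
    }


# Inside a Glue job this might be used as follows (not executed here):
#   df = flatten_struct_fields(S3bucket_node1.toDF(), ["fathername.firstname"])
#   frame = DynamicFrame.fromDF(df, glueContext, "flattened")
print(flat_alias("fathername.firstname"))  # fathername_firstname
```

The original struct column stays in the frame untouched, so only the copied top-level column would be redacted; the nested value would still need to be dropped or rebuilt afterwards.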

The error you're encountering is likely due to the way you're trying to access nested columns in your PII detection script. When working with nested structures in AWS Glue, you need to use a different approach to select and process nested fields.

For nested columns, you can't directly use dot notation like "kyc[0].kycnumber" in the sourceColumns parameter. Instead, you should first flatten or unnest the structure before applying the PII detection.

Here's a modified approach you can try:

  1. First, explode the array and select the nested field:
from pyspark.sql.functions import explode, col

flattened_df = S3bucket_node1.toDF().select(explode("kyc").alias("kyc_exploded"), "*")
flattened_df = flattened_df.select("kyc_exploded.kycnumber", "*")

# Convert back to DynamicFrame
flattened_dynamic_frame = DynamicFrame.fromDF(flattened_df, glueContext, "flattened_dynamic_frame")
  2. Then, update your PII detection parameters:
detection_parameters = {
  "INDIA_AADHAAR_NUMBER": [{
    "action": "REDACT",
    "sourceColumns": ["kycnumber"],
    "actionOptions": {"redactText": "******"}
  }]
}

entity_detector = EntityDetector()
PIIDetection_node2 = entity_detector.detect(flattened_dynamic_frame, detection_parameters, "DetectedEntities", "HIGH")

This approach first flattens the nested structure, making the "kycnumber" field accessible at the top level. Then you can apply the PII detection on this flattened structure.
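If sourceColumns only accepts top-level column names, as the answers in this thread suggest, a small check before calling detect can surface a bad name earlier and more readably than the IllegalArgumentException. This is a sketch in plain Python; `invalid_source_columns` is a hypothetical helper, not a Glue function.

```python
def invalid_source_columns(detection_parameters, top_level_columns):
    """Return sourceColumns entries that are not top-level column names."""
    known = set(top_level_columns)
    missing = []
    for rules in detection_parameters.values():
        for rule in rules:
            for name in rule.get("sourceColumns", []):
                if name not in known and name not in missing:
                    missing.append(name)
    return missing


detection_parameters = {
    "INDIA_AADHAAR_NUMBER": [{
        "action": "REDACT",
        "sourceColumns": ["kyc[0].kycnumber"],
        "actionOptions": {"redactText": "******"},
    }]
}

# Before flattening, only "kyc" is top-level, so the nested path is flagged.
print(invalid_source_columns(detection_parameters, ["kyc"]))
# prints ['kyc[0].kycnumber']

# After flattening, "kycnumber" exists at the top level and nothing is flagged.
detection_parameters["INDIA_AADHAAR_NUMBER"][0]["sourceColumns"] = ["kycnumber"]
print(invalid_source_columns(detection_parameters, ["kycnumber", "kyc"]))
# prints []
```

In a Glue job, the second argument would typically be the column list of the flattened DataFrame, for example `flattened_df.columns`.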

Remember that after applying PII detection, you might need to restructure your data back to its original nested format if required for downstream processes.
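One way to make that restructuring possible is to tag each row with an id before exploding, then group by that id after detection to rebuild the array. The sketch below assumes the struct contains only the field being redacted, as in the schema shown in the question. The function names and the `_row_id` and `_elem` column names are hypothetical. Two caveats: collect_list does not guarantee element order, and rows whose array was null or empty come back as a one-element array holding a null field.

```python
def explode_with_row_id(df, array_col="kyc", field="kycnumber", id_col="_row_id"):
    """Tag each row with an id, explode array_col, and lift `field` to the top level."""
    # Imported lazily: requires a Spark runtime such as a Glue job.
    from pyspark.sql import functions as F

    return (
        df.withColumn(id_col, F.monotonically_increasing_id())
          .withColumn("_elem", F.explode_outer(array_col))  # keeps null/empty arrays
          .withColumn(field, F.col("_elem." + field))
          .drop("_elem")
    )


def renest(df, array_col="kyc", field="kycnumber", id_col="_row_id"):
    """Group the exploded rows by id and rebuild array_col from the (redacted) field."""
    from pyspark.sql import functions as F

    # Every other column is assumed to be constant within one original row.
    other_cols = [c for c in df.columns if c not in (array_col, field, id_col)]
    rebuilt = df.groupBy(id_col).agg(
        F.collect_list(F.struct(F.col(field))).alias(array_col),
        *[F.first(c).alias(c) for c in other_cols],
    )
    return rebuilt.drop(id_col)


# Possible use in the job above (not executed here):
#   flat = explode_with_row_id(S3bucket_node1.toDF())
#   detected = entity_detector.detect(
#       DynamicFrame.fromDF(flat, glueContext, "flat"),
#       detection_parameters, "DetectedEntities", "HIGH")
#   restored = renest(detected.toDF())
```

Grouping on an explicit id avoids depending on a business key being unique, at the cost of carrying one extra column through the detection step.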
Sources
AWS Glue Scala DynamicFrame class - AWS Glue
Detect and process sensitive data - AWS Glue

answered 9 months ago
