You’re referencing a column inside an array of structs (kyc[0].kycnumber). This is likely to fail because Glue does not support array indexing ([0]) in sourceColumns. Flatten the nested structure before running the PII detection, for example:
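from pyspark.sql.functions import explode
from awsglue.dynamicframe import DynamicFrame
# Assumes S3bucket_node1_df = S3bucket_node1.toDF() and entity_detector = EntityDetector() are defined earlier in the job
# explode() emits one row per element of the kyc array
S3bucket_node1_df_flattened = S3bucket_node1_df.withColumn("kyc_element", explode("kyc"))
# Note: this select keeps only kycnumber; add any other columns you need downstream
S3bucket_node1_flattened = S3bucket_node1_df_flattened.select("kyc_element.kycnumber")
S3bucket_node1_flattened_dynamic = DynamicFrame.fromDF(S3bucket_node1_flattened, glueContext, "S3bucket_node1_flattened")
detection_parameters = { "INDIA_AADHAAR_NUMBER": [{ "action": "REDACT", "sourceColumns": ["kycnumber"], "actionOptions": {"redactText": "******"} }] }
PIIDetection_node2 = entity_detector.detect(S3bucket_node1_flattened_dynamic, detection_parameters, "DetectedEntities", "HIGH")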
The error you're encountering is likely due to the way you're trying to access nested columns in your PII detection script. When working with nested structures in AWS Glue, you need to use a different approach to select and process nested fields.
For nested columns, you can't directly use dot notation like "kyc[0].kycnumber" in the sourceColumns parameter. Instead, you should first flatten or unnest the structure before applying the PII detection.
Here's a modified approach you can try:
- First, explode the array and select the nested field:
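from pyspark.sql.functions import explode
from awsglue.dynamicframe import DynamicFrame
# One row per element of the kyc array, keeping all original columns
flattened_df = S3bucket_node1.toDF().select(explode("kyc").alias("kyc_exploded"), "*")
# Promote kycnumber to a top-level column
flattened_df = flattened_df.select("kyc_exploded.kycnumber", "*")
# Note: kyc and kyc_exploded still carry the unredacted values; drop them if they must not reach the output
# Convert back to DynamicFrame
flattened_dynamic_frame = DynamicFrame.fromDF(flattened_df, glueContext, "flattened_dynamic_frame")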
- Then, update your PII detection parameters:
from awsglueml.transforms import EntityDetector
detection_parameters = { "INDIA_AADHAAR_NUMBER": [{ "action": "REDACT", "sourceColumns": ["kycnumber"], "actionOptions": {"redactText": "******"} }] }
entity_detector = EntityDetector()
PIIDetection_node2 = entity_detector.detect(flattened_dynamic_frame, detection_parameters, "DetectedEntities", "HIGH")
This approach first flattens the nested structure, making the "kycnumber" field accessible at the top level. Then you can apply the PII detection on this flattened structure.
Remember that after applying PII detection, you might need to restructure your data back to its original nested format if required for downstream processes.
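To make the "flatten, redact, restructure" round trip concrete, here is a minimal plain-Python sketch on ordinary dicts. It is not Glue or Spark code; it only illustrates the shape of the transformation. The customer_id key, the kyctype field, the _pos helper column, and the sample values are all made up for illustration and are not part of the original job.

```python
from collections import defaultdict

def explode_records(records, array_field, id_field):
    """Emit one flat row per array element, tagged with the parent id and position."""
    rows = []
    for rec in records:
        # Like Spark's explode(), records with an empty or missing array yield no rows.
        for pos, element in enumerate(rec.get(array_field) or []):
            rows.append({id_field: rec[id_field], "_pos": pos, **element})
    return rows

def regroup_records(rows, array_field, id_field):
    """Rebuild the nested array from flat rows, restoring the original element order."""
    grouped = defaultdict(list)
    for row in sorted(rows, key=lambda r: (r[id_field], r["_pos"])):
        element = {k: v for k, v in row.items() if k not in (id_field, "_pos")}
        grouped[row[id_field]].append(element)
    return [{id_field: key, array_field: elements} for key, elements in grouped.items()]

# Hypothetical sample data (made-up values, not real identifiers)
records = [
    {"customer_id": 1, "kyc": [
        {"kycnumber": "0000-0000-0001", "kyctype": "aadhaar"},
        {"kycnumber": "0000-0000-0002", "kyctype": "pan"},
    ]},
    {"customer_id": 2, "kyc": [
        {"kycnumber": "0000-0000-0003", "kyctype": "aadhaar"},
    ]},
]

rows = explode_records(records, "kyc", "customer_id")

# Stand-in for the redaction that the PII detection step would perform
redacted = [{**row, "kycnumber": "******"} for row in rows]

restored = regroup_records(redacted, "kyc", "customer_id")
```

In PySpark, the regrouping step is typically a groupBy on the key column with collect_list over a struct of the element fields. Because collect_list does not guarantee element order, carrying a position column (for example from posexplode) is one way to restore it. This sketch also assumes each record has a unique key; without one, the exploded rows cannot be matched back to their parent record.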
Consider this example:
|-- fathername: struct (nullable = true)
|    |-- completename: string (nullable = true)
|    |-- firstname: string (nullable = true)
Even with the above schema, I am not able to use a source column like
"sourceColumns": ["fathername.firstname"],