I have an AWS Glue PII data detector job, and it is taking around 47 minutes to complete for a 17.9 MB file, which is a very long time for any Spark job.
Sharing the code snippet used in the job:
```python
from awsglueml.transforms import EntityDetector

S3bucket_node1 = glueContext.create_dynamic_frame.from_options(
    format_options={
        "quoteChar": '"',
        "withHeader": True,
        "separator": ",",
        "optimizePerformance": False,
    },
    connection_type="s3",
    format="csv",
    connection_options={"paths": [f"{input_location}{file_name}"]},
    transformation_ctx="S3bucket_node1",
)

# Detect PII entities across all columns
entity_detector = EntityDetector()
classified_map = entity_detector.classify_columns(
    S3bucket_node1,
    [
        "PERSON_NAME",
        "EMAIL",
        "CREDIT_CARD",
        "IP_ADDRESS",
        "MAC_ADDRESS",
        "PHONE_NUMBER",
        "USA_PASSPORT_NUMBER",
        "USA_SSN",
        "USA_ITIN",
        "BANK_ACCOUNT",
        "USA_DRIVING_LICENSE",
        "USA_HCPCS_CODE",
        "USA_NATIONAL_DRUG_CODE",
        "USA_NATIONAL_PROVIDER_IDENTIFIER",
        "USA_DEA_NUMBER",
        "USA_HEALTH_INSURANCE_CLAIM_NUMBER",
        "USA_MEDICARE_BENEFICIARY_IDENTIFIER",
        "JAPAN_BANK_ACCOUNT",
        "JAPAN_DRIVING_LICENSE",
        "JAPAN_MY_NUMBER",
        "JAPAN_PASSPORT_NUMBER",
        "UK_BANK_ACCOUNT",
        "UK_BANK_SORT_CODE",
        "UK_DRIVING_LICENSE",
        "UK_ELECTORAL_ROLL_NUMBER",
        "UK_NATIONAL_HEALTH_SERVICE_NUMBER",
        "UK_NATIONAL_INSURANCE_NUMBER",
        "UK_PASSPORT_NUMBER",
        "UK_PHONE_NUMBER",
        "UK_UNIQUE_TAXPAYER_REFERENCE_NUMBER",
        "UK_VALUE_ADDED_TAX",
        "CANADA_SIN",
        "CANADA_PASSPORT_NUMBER",
        "GENDER",
    ],
    1.0,   # sample fraction: scan 100% of the rows
    0.55,  # detection threshold fraction
)
```
I have the Spark application log file as well, but I can't attach it to this question.
What is the root cause of the long runtime of this job?
In my use case, I receive files from different sources, such as customer, bank-loans, credit-risks, and many more. I profile the data files and try to detect PII data at the same time. If I reduce the number of entities, then I might miss some of the PII columns. Since I am using a Spark environment in Glue, parallel processing should kick in and the job should complete within a few minutes, not 47.
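For reference, here is a quick back-of-envelope check I did on how much parallelism the read stage can even get. This is a hypothetical estimate, assuming Spark's default input split size of 128 MiB (the `spark.sql.files.maxPartitionBytes` default); the `estimated_splits` helper is my own, not a Glue API:

```python
import math

# Estimate how many input splits Spark would create for a single file,
# assuming the default max partition size of 128 MiB.
def estimated_splits(file_size_bytes: int,
                     split_size_bytes: int = 128 * 1024 * 1024) -> int:
    return max(1, math.ceil(file_size_bytes / split_size_bytes))

# My file is 17.9 MB, far below one split:
print(estimated_splits(int(17.9 * 1024 * 1024)))  # → 1
```

So a single 17.9 MB file likely produces one partition, meaning one task does the whole scan; but even single-threaded, 47 minutes for 17.9 MB still seems excessive to me.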