AWS Glue PII detector job taking too much time


I have an AWS Glue PII data detector job that takes around 47 minutes to complete for a 17.9 MB file, which is a very long time for any Spark job.

Sharing the code snippet used in the job:

# Standard Glue job setup (assumed; not shown in the original snippet)
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglueml.transforms import EntityDetector

sc = SparkContext()
glueContext = GlueContext(sc)

# Read the CSV file from S3 into a DynamicFrame
# (input_location and file_name are defined elsewhere in the job)
S3bucket_node1 = glueContext.create_dynamic_frame.from_options(
    format_options={
        "quoteChar": '"',
        "withHeader": True,
        "separator": ",",
        "optimizePerformance": False,
    },
    connection_type="s3",
    format="csv",
    connection_options={"paths": [f'{input_location}{file_name}']},
    transformation_ctx="S3bucket_node1",
)
# Detect Sensitive Data node: scan every column for the listed PII entity types
entity_detector = EntityDetector()
classified_map = entity_detector.classify_columns(
    S3bucket_node1,
    [
        "PERSON_NAME",
        "EMAIL",
        "CREDIT_CARD",
        "IP_ADDRESS",
        "MAC_ADDRESS",
        "PHONE_NUMBER",
        "USA_PASSPORT_NUMBER",
        "USA_SSN",
        "USA_ITIN",
        "BANK_ACCOUNT",
        "USA_DRIVING_LICENSE",
        "USA_HCPCS_CODE",
        "USA_NATIONAL_DRUG_CODE",
        "USA_NATIONAL_PROVIDER_IDENTIFIER",
        "USA_DEA_NUMBER",
        "USA_HEALTH_INSURANCE_CLAIM_NUMBER",
        "USA_MEDICARE_BENEFICIARY_IDENTIFIER",
        "JAPAN_BANK_ACCOUNT",
        "JAPAN_DRIVING_LICENSE",
        "JAPAN_MY_NUMBER",
        "JAPAN_PASSPORT_NUMBER",
        "UK_BANK_ACCOUNT",
        "UK_BANK_SORT_CODE",
        "UK_DRIVING_LICENSE",
        "UK_ELECTORAL_ROLL_NUMBER",
        "UK_NATIONAL_HEALTH_SERVICE_NUMBER",
        "UK_NATIONAL_INSURANCE_NUMBER",
        "UK_PASSPORT_NUMBER",
        "UK_PHONE_NUMBER",
        "UK_UNIQUE_TAXPAYER_REFERENCE_NUMBER",
        "UK_VALUE_ADDED_TAX",
        "CANADA_SIN",
        "CANADA_PASSPORT_NUMBER",
        "GENDER",
    ],
    1.0,  # sample portion: fraction of rows scanned for each PII entity
    0.55,  # detection threshold: fraction of rows that must match for a column to be flagged
)

I have the Spark application log file as well, but I can't attach it to this question.

What is the root cause of the long run time of this job?

asked a year ago · 421 views
1 Answer
Accepted Answer

Hi,

It could be because you are scanning for a very large list of entity types against all of the rows.

Does your use case allow you to reduce the sample portion (which defines the percentage of rows scanned for each PII entity) as well as the detection threshold (which defines the percentage of rows that must contain the PII entity for the entire column to be identified as having that entity)?
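For illustration, and assuming the same classify_columns signature used in your snippet (frame, entity list, sample portion, detection threshold), a reduced scan could look like the sketch below; the 0.1 values and the trimmed entity list are placeholders, not recommendations:

# Sketch: scan only 10% of rows and flag a column once 10% of the sampled
# rows match -- illustrative values only, tune them to your use case.
classified_map = entity_detector.classify_columns(
    S3bucket_node1,
    ["PERSON_NAME", "EMAIL", "USA_SSN"],  # trimmed entity list, as an example
    0.1,  # sample portion
    0.1,  # detection threshold
)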

EXPERT
answered a year ago
  • In my use case, I receive files from many different sources, such as customers, bank loans, and credit risks. I am profiling the data files and, at the same time, trying to detect PII data. If I reduce the number of entities, I might miss some PII columns. Since I am using the Spark environment in Glue, processing should happen in parallel and the job should complete within a few minutes, not 47 minutes.
