I have an AWS Glue PII data detection job that takes roughly 47 minutes to complete on a 17.9 MB file, which is an extremely long runtime for any Spark job.
Here is the code snippet used in the job:
```python
S3bucket_node1 = glueContext.create_dynamic_frame.from_options(
    format_options={
        "quoteChar": '"',
        "withHeader": True,
        "separator": ",",
        "optimizePerformance": False,
    },
    connection_type="s3",
    format="csv",
    connection_options={"paths": [f"{input_location}{file_name}"]},
    transformation_ctx="S3bucket_node1",
)

# Script generated for node ApplyMapping
entity_detector = EntityDetector()
classified_map = entity_detector.classify_columns(
    S3bucket_node1,
    [
        "PERSON_NAME",
        "EMAIL",
        "CREDIT_CARD",
        "IP_ADDRESS",
        "MAC_ADDRESS",
        "PHONE_NUMBER",
        "USA_PASSPORT_NUMBER",
        "USA_SSN",
        "USA_ITIN",
        "BANK_ACCOUNT",
        "USA_DRIVING_LICENSE",
        "USA_HCPCS_CODE",
        "USA_NATIONAL_DRUG_CODE",
        "USA_NATIONAL_PROVIDER_IDENTIFIER",
        "USA_DEA_NUMBER",
        "USA_HEALTH_INSURANCE_CLAIM_NUMBER",
        "USA_MEDICARE_BENEFICIARY_IDENTIFIER",
        "JAPAN_BANK_ACCOUNT",
        "JAPAN_DRIVING_LICENSE",
        "JAPAN_MY_NUMBER",
        "JAPAN_PASSPORT_NUMBER",
        "UK_BANK_ACCOUNT",
        "UK_BANK_SORT_CODE",
        "UK_DRIVING_LICENSE",
        "UK_ELECTORAL_ROLL_NUMBER",
        "UK_NATIONAL_HEALTH_SERVICE_NUMBER",
        "UK_NATIONAL_INSURANCE_NUMBER",
        "UK_PASSPORT_NUMBER",
        "UK_PHONE_NUMBER",
        "UK_UNIQUE_TAXPAYER_REFERENCE_NUMBER",
        "UK_VALUE_ADDED_TAX",
        "CANADA_SIN",
        "CANADA_PASSPORT_NUMBER",
        "GENDER",
    ],
    1.0,   # portion of rows to sample (100%)
    0.55,  # detection threshold
)
```
I also have the Spark application log files, but I'm unable to post them here.
What is the root cause of this job taking so long?