AWS Glue - DataSink is taking a long time to write


Hello

I'm using DataSink to write the results of the job. The input file is only 70 MB. The job reads from the Data Catalog, writes to S3, and updates the target Data Catalog table. I have no clue why it is taking so long (> 2 hours). When I write a simple job that just reads the raw sample CSV and writes Parquet to S3, it takes 2 minutes. The reason I am using the DataSink is to avoid running a Crawler on the target data source. Please suggest.

result_sink = glueContext.getSink(
    path=fld_path,
    connection_type="s3",
    updateBehavior="LOG",
    partitionKeys=partition_cols_list,
    compression="snappy",
    enableUpdateCatalog=True,
    transformation_ctx="result_sink"
)
result_sink.setCatalogInfo(
    catalogDatabase=target_db, catalogTableName=dataset_dict_obj.get("table_name")
)

# Convert the raw input format from CSV/txt to Parquet
result_sink.setFormat("glueparquet")

# Convert the Spark DataFrame to a DynamicFrame
final_df = DynamicFrame.fromDF(
    inc_df, glueContext, "final_df"
)
# The job takes about 1 hour just to reach this point.

print("final_df size is that:",final_df.count())

# Write the DynamicFrame to the S3 bucket and also update the Data Catalog.
result_sink.writeFrame(final_df)

job.commit()
Asked 9 months ago · 386 views
1 Answer

If writing a plain file is fast, I suspect the performance issue is with "partitionKeys=partition_cols_list": maybe those columns are too granular and force writing lots of tiny files. Also, calling count() on the converted DynamicFrame can result in double processing, because the whole pipeline is evaluated once for the count and again for the write.
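
For example, a minimal sketch (assuming inc_df is the Spark DataFrame from your job) that avoids recomputing the pipeline for the count and again for the write:

# Cache the DataFrame so the count and the write share the same computed data
inc_df.cache()
print("record count:", inc_df.count())  # triggers the computation once and fills the cache

final_df = DynamicFrame.fromDF(inc_df, glueContext, "final_df")
result_sink.writeFrame(final_df)  # reuses the cached data instead of recomputing

Alternatively, drop the count() entirely if you only need it for logging.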

Since you already have a DataFrame, the DataFrame writer is faster at doing table partitioning. You can achieve the same result (as long as you are not writing to an S3 Lake Formation location) with something like this (haven't tested it):

inc_df.writeTo(f'{target_db}.{dataset_dict_obj.get("table_name")}').partitionedBy(*partition_cols_list).createOrReplace()
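
A slightly fuller sketch of the same idea, assuming target_db, dataset_dict_obj, and partition_cols_list come from the question and that the table data should be stored as Parquet:

from pyspark.sql.functions import col

table_name = f'{target_db}.{dataset_dict_obj.get("table_name")}'

(
    inc_df.writeTo(table_name)
    .using("parquet")                                       # store the table data as Parquet
    .partitionedBy(*[col(c) for c in partition_cols_list])  # partitionedBy takes columns, not a single list
    .createOrReplace()                                      # create the table, or replace it if it already exists
)

Note that createOrReplace() replaces the existing table definition and data; use append() on the same writer if you only want to add new data to an existing table.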
AWS
EXPERT
Answered 9 months ago
