How to Efficiently Perform a Count Query in AWS Glue Job Before Data Processing


I'm using an AWS Glue job for my data processing tasks, and my source system provides monthly snapshots of data. Before executing the create_dynamicframe function, I want to execute a select count(*) query on the source to get an idea of the data volume.

Here's my current function:


def create_dynamicframe(database, table, push_down_predicate=None, filter_function=None, primary_keys=None):
    outputsource = glueContext.create_dynamic_frame.from_catalog(
        database=database,
        table_name=table,
        transformation_ctx="outputsource",
        push_down_predicate=push_down_predicate,
    )
    if filter_function is not None:
        outputsource = Filter.apply(frame=outputsource, f=filter_function)
        # Guard against primary_keys being None when only a filter is supplied
        if primary_keys is not None:
            outputsource = outputsource.select_fields(primary_keys)
    return outputsource

However, when dealing with monthly snapshots, the job times out due to the large volume of data in the source system.

I'm looking for suggestions on how to modify this function, or for an alternative approach that efficiently performs the count in Athena before data processing. Specifically, I want to run a query similar to:


select count(*) from [database].[table]
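For context, a count like this can be launched from a Glue job through the Athena API with boto3. Below is an untested sketch; the `build_count_query` helper, the result S3 location, and the region default are my own assumptions, not part of any existing script:

```python
import time


def build_count_query(database, table):
    """Build the COUNT(*) statement for Athena (assumed helper)."""
    return f'SELECT COUNT(*) FROM "{database}"."{table}"'


def athena_count(database, table, output_location, region="us-east-1"):
    """Run the count in Athena and block until the result is available."""
    import boto3  # AWS SDK; assumed available in the Glue job environment

    client = boto3.client("athena", region_name=region)
    qid = client.start_query_execution(
        QueryString=build_count_query(database, table),
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]

    # Poll until the query reaches a terminal state
    while True:
        state = client.get_query_execution(QueryExecutionId=qid)[
            "QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(2)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query {qid} finished as {state}")

    rows = client.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
    # Row 0 is the CSV header; the count is the first column of row 1
    return int(rows[1]["Data"][0]["VarCharValue"])
```

Usage would look like `athena_count("mydb", "snapshots", "s3://my-athena-results/")`, where the bucket is a placeholder for your own Athena query-result location. Since Athena is serverless, this keeps the count outside the Glue job's Spark executors entirely.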

Any advice or best practices to optimize this process and prevent timeouts would be greatly appreciated. Thank you!

Vinod
Asked 6 months ago · 372 views

1 Answer

If the files in the table are Parquet, running that SELECT COUNT on the SparkSession should use the Parquet file statistics, though it still needs to open each file.
If a single system updates the table, you could have that system update the count every time data is added (or removed). I don't see any other way.
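The suggestion above can be sketched as follows. This is a minimal, untested example that assumes the standard Glue boilerplate (`spark` is the job's SparkSession) and a hypothetical partition column named `snapshot_month`; limiting the count to one monthly partition keeps Spark from touching every snapshot's files:

```python
def partition_count_query(database, table, partition_col=None, partition_val=None):
    """Build a COUNT(*) query, optionally limited to one partition.

    Backtick-quoting keeps identifiers with unusual characters valid in Spark SQL.
    """
    query = f"SELECT COUNT(*) FROM `{database}`.`{table}`"
    if partition_col is not None:
        query += f" WHERE `{partition_col}` = '{partition_val}'"
    return query


# In the Glue job, before calling create_dynamicframe (spark comes from the
# usual glueContext.spark_session boilerplate):
#
# row_count = spark.sql(
#     partition_count_query("mydb", "snapshots", "snapshot_month", "2024-01")
# ).collect()[0][0]
# print(f"Rows to process: {row_count}")
```

Because the count goes through the SparkSession rather than a DynamicFrame, Spark can answer it from the Parquet footers' row counts instead of deserializing full row groups.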

AWS EXPERT · Answered 6 months ago
