I'm using an AWS Glue job for my data processing tasks, and my source system provides monthly snapshots of data. Before calling my create_dynamicframe function, I want to run a select count(*) query against the source to get an idea of the data volume.
Here's my current function:
def create_dynamicframe(database, table, push_down_predicate=None, filter_function=None, primary_keys=None):
    outputsource = glueContext.create_dynamic_frame.from_catalog(
        database=database,
        table_name=table,
        transformation_ctx="outputsource",
        push_down_predicate=push_down_predicate,
    )
    if filter_function is not None:
        outputsource = Filter.apply(frame=outputsource, f=filter_function)
        # Only project down to the key columns when they were actually passed in
        if primary_keys is not None:
            outputsource = outputsource.select_fields(primary_keys)
    return outputsource
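For context, this is roughly how I build the push-down predicate for a single monthly snapshot, so the read only scans one partition (a sketch; `snapshot_month` is my assumed partition column name, so adjust it to the table's actual layout):

```python
def monthly_predicate(year: int, month: int, partition_col: str = "snapshot_month") -> str:
    """Build a Glue push-down predicate limiting the read to one monthly partition.

    The partition column name is an assumption here; it must match the
    table's real partition key for partition pruning to kick in.
    """
    return f"{partition_col} = '{year:04d}-{month:02d}'"


# Example: restrict create_dynamicframe to the January 2024 snapshot
predicate = monthly_predicate(2024, 1)
# predicate == "snapshot_month = '2024-01'"
```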
However, when dealing with monthly snapshots, the job times out due to the large volume of data in the source system.
I'm looking for suggestions on how to modify this function, or an alternative approach that lets me efficiently run a count query in Athena before processing the data. Specifically, I want to execute a query similar to:
select count(*) from [database].[table]
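In case it helps, this is a rough sketch of the kind of pre-count I have in mind using boto3's Athena client (the output bucket, database/table names, and polling interval are placeholders I'd configure for my environment):

```python
import time


def count_query(database: str, table: str) -> str:
    # Quote the identifiers so reserved words or unusual names don't break the SQL
    return f'SELECT count(*) FROM "{database}"."{table}"'


def athena_count(database: str, table: str,
                 output_s3: str = "s3://my-athena-results/") -> int:
    """Run SELECT count(*) through Athena and return the row count.

    output_s3 is a placeholder; Athena needs a real S3 location it can
    write query results to.
    """
    import boto3  # imported lazily so the sketch can be read without boto3 installed

    athena = boto3.client("athena")
    resp = athena.start_query_execution(
        QueryString=count_query(database, table),
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_s3},
    )
    qid = resp["QueryExecutionId"]

    # Poll until the query reaches a terminal state
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(2)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query {qid} ended in state {state}")

    rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
    # Row 0 is the header row; row 1 holds the count value
    return int(rows[1]["Data"][0]["VarCharValue"])
```

The idea would be to call `athena_count(database, table)` at the start of the Glue job and decide (or log) before kicking off `create_dynamicframe`.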
Any advice or best practices to optimize this process and prevent timeouts would be greatly appreciated. Thank you!