Are partitions advantageous for groupby operations in Glue jobs?
I have a table with columns A, B, C, D, ..., where A is a partition key. In a Glue job I want to group the records of this table by column A. Is there a way to make the Glue workers aware of the partitions in A, so that no worker has to process dataframes created from different partitions, and the processing time goes down?
Yes. The way to do this is with a pushdown predicate. When reading a dynamic frame, you would set the push_down_predicate field so that only the matching partitions are read from S3.
https://aws.amazon.com/premiumsupport/knowledge-center/glue-job-specific-s3-partition/
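As a rough sketch of what that could look like in a Glue PySpark job: the helper below builds a predicate string for one or more values of the partition key A and passes it to create_dynamic_frame.from_catalog. The helper names build_partition_predicate and read_partitions, and the database/table names in the usage comment, are made up for illustration; glue_context is assumed to be the awsglue GlueContext that the job already has.

```python
def build_partition_predicate(column, values):
    """Build a predicate string such as "A in ('x', 'y')" for a partition column."""
    values = [str(v) for v in values]
    if not values:
        raise ValueError("at least one partition value is required")
    for v in values:
        # Keep the sketch simple: refuse values that would need SQL escaping.
        if "'" in v or "\\" in v:
            raise ValueError(f"unsupported character in partition value: {v!r}")
    quoted = ", ".join(f"'{v}'" for v in values)
    return f"{column} in ({quoted})"


def read_partitions(glue_context, database, table_name, predicate):
    """Read only the partitions matching `predicate` from a Data Catalog table.

    `glue_context` is assumed to be an awsglue.context.GlueContext created
    inside the Glue job; it is passed in so this file imports nothing from Glue.
    """
    return glue_context.create_dynamic_frame.from_catalog(
        database=database,
        table_name=table_name,
        push_down_predicate=predicate,
    )


# Hypothetical usage inside a Glue job ("my_db" and "my_table" are placeholders):
#   predicate = build_partition_predicate("A", ["x", "y"])
#   dyf = read_partitions(glue_context, "my_db", "my_table", predicate)
#   counts = dyf.toDF().groupBy("A").count()
```

Because the predicate can name several partition values at once, one job can read a batch of partitions and group them with Spark, rather than needing one job per partition value.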
Sorry, I haven't tested how this scales when the number of partitions is that high and each partition holds little data. My assumption is that it does scale and that using partitions is still better, since, as I understand it, the query made to S3 uses Presto to optimize which data is fetched and how it is fetched, based on how the partitions are organized.
This might help. https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-presto-s3select.html
In my case, the table has on the order of 1e9 rows, which can be grouped into around 1e6 groups. Do I understand correctly that I should then start 1e6 Glue jobs in parallel, one for each partition/group, and perform the selection with push_down_predicate in each of them? This does not sound practical to me, as I assume it would be better to make efficient use of Glue's internal parallelisation.