Grouping of partitioned dataframes
I have a large dataset (table) with >1e9 records (rows) in Glue. The tables are partitioned by column A, which is a n-letters subtring of column B. For example:
A (partition key) | B | ... |
---|---|---|
abc | abc123... | ... |
abc | abc123... | ... |
abc | abc456... | ... |
abc | abc456... | ... |
abc | abc456... | ... |
abc | abc789... | ... |
abc | abc789... | ... |
... | ... | ... |
xyz | xyz123... | ... |
xyz | xyz123... | ... |
xyz | xyz123... | ... |
xyz | xyz456... | ... |
xyz | xyz456... | ... |
xyz | xyz456... | ... |
xyz | xyz789... | ... |
xyz | xyz789... | ... |
There are >1e6 possible different values of column B and correspondingly significantly less for column A (maybe 1e3). Now I need to group records/rows by column B and the assumption is that it could be advantageous if the table was partitioned by column A, as it would be sufficient to load dataframes from single partitions for grouping instead of running the operation on the entire table. (Partitioning by column B would lead to unreasonably large numbers partitions.) Is my assumption right? How would I tell my Glue job the link between column A and B and profit from the partitioning?
Alternatively I could handle the 1e3 dataframes (one for each partition) separately in my Glue job and merge them lateron. But this looks a bit complicated to me.
This question is a follow-up question to https://repost.aws/questions/QUwxdl4EwTQcKBuL8MKCU0EQ/are-partitions-advantageous-for-groupby-operations-in-glue-jobs.
Hello,
Partitioning by column A could be advantageous in your use case and then groupBy column B in order to limit the number of partitions. You can create a dynamic frame and refer to partitions in it with predicate pushdown. Do you want to create multiple dynamic frames for multiple partitions and then combine them using join operator ? If you have any sample code you are using on a sample dataset, please share it with us on a support case and we can assist you by getting on a screen-share.
Relevant questions
Athena Partition Projection and Column Stats
asked 4 months agoHIVE_METASTORE_ERROR when running an Athena query to select the first 10 rows from a partitioned table created by a Glue Crawler.
asked 3 months agoUpdate Records with AWS Glue
asked 3 months agoAWS DMS doing a SELECT on TEXT column during continuous replication
asked 4 months agoQuickSight Freeze First Column in a table view
asked 2 months agoWhat's the best way to filter out duplicated records in a Glue ETL Job with bookmarking enabled?
asked 6 months agoPartition schema mismatch in Glue Table
asked a month agoalter column varchar resizing regression
asked 3 years agoGrouping of partitioned dataframes
asked 4 months agoCan Glue crawler be configured to include only the most recent partition in a table?
asked a month ago