1 Answer
Yes. The way to do this is with a pushdown predicate: when reading a dynamic frame, set the push_down_predicate field.
https://aws.amazon.com/premiumsupport/knowledge-center/glue-job-specific-s3-partition/
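In Glue this is the push_down_predicate parameter of glue_context.create_dynamic_frame.from_catalog(database=..., table_name=...). Conceptually, the predicate prunes the partition list before any S3 objects are read, so non-matching partitions cost nothing to skip. A minimal plain-Python sketch of that pruning step (bucket path, partition keys, and the predicate are all hypothetical):

```python
# Hypothetical catalog partitions for a table partitioned by year/month.
partitions = [
    {"year": "2022", "month": "12", "path": "s3://bucket/year=2022/month=12/"},
    {"year": "2023", "month": "01", "path": "s3://bucket/year=2023/month=01/"},
    {"year": "2023", "month": "02", "path": "s3://bucket/year=2023/month=02/"},
]

def matches(partition):
    # Stands in for e.g. push_down_predicate="year == '2023'".
    return partition["year"] == "2023"

# Only the surviving paths would ever be listed/read from S3.
pruned = [p["path"] for p in partitions if matches(p)]
print(pruned)
```

The real call evaluates the predicate string against the partition keys in the Data Catalog, which is why the columns it references must be partition columns.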
answered 2 years ago
In my case, the table has on the order of 1e9 rows, which can be grouped into around 1e6 groups. Do I understand correctly that I should then start 1e6 Glue jobs in parallel, one per partition/group, each performing its selection with push_down_predicate? That does not sound practical to me; I assume it would be better to use Glue's internal parallelisation efficiently.

Yeah, sorry, I haven't run tests on scaling to that many partitions with so little data per partition, but my assumption is that it scales, and that using partitions is better, since the query you make against S3 uses Presto to optimize what data is grabbed, and how, based on the partition organization.
This might help. https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-presto-s3select.html
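On the 1e6-jobs concern: a single Glue/Spark job can produce all groups in one pass with a groupBy over the whole dataset, parallelized across workers, rather than launching one job per group. A plain-Python sketch of that single-pass grouping (the rows and keys are hypothetical; in Glue this would be a Spark groupBy or a DynamicFrame transform):

```python
from collections import defaultdict

# Hypothetical (group_key, value) rows; imagine 1e9 of these.
rows = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]

# One pass builds every group; no per-group job launch is needed.
groups = defaultdict(list)
for key, value in rows:
    groups[key].append(value)

print(dict(groups))  # → {'a': [1, 3], 'b': [2, 5], 'c': [4]}
```

Partition pruning with push_down_predicate and in-job grouping are complementary: the predicate limits what is read, and the shuffle inside one job handles the per-group work.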