1 Answer
The visual job is going to read the source with DynamicFrame, which does not apply pushdown filters automatically and will probably read the files to determine the schema when converting to a DataFrame so you can run the SQL. In your case, if you write a script job that runs the query directly with spark.sql(), you will get something closer (it still won't be as fast as Athena, especially with that modest capacity).
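Something like this, as a minimal sketch, assuming the job is configured to use the Glue Data Catalog as its metastore (the database, table, and column names below are placeholders):

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Querying the catalog table directly lets Spark prune partitions and
# push filters down, instead of materializing a DynamicFrame first.
df = spark.sql("""
    SELECT id, status, year, month
    FROM my_database.my_table          -- placeholder names
    WHERE year = '2024' AND month = '01'
""")
df.show()

job.commit()
```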
Forgive me if I am totally wrong here, but I think my job is already doing this. The Visual tool generates a script automatically, and I'm seeing nodes in it that use a pushdown predicate, such as:
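Roughly like this (the database, table, and partition values here are placeholders for my real ones):

```python
# Generated for a Data Catalog source node; push_down_predicate
# prunes partitions before any data is read.
S3bucket_node1 = glueContext.create_dynamic_frame.from_catalog(
    database="my_database",    # placeholder
    table_name="my_table",     # placeholder
    push_down_predicate="year == '2024' and month == '01'",
    transformation_ctx="S3bucket_node1",
)
```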
Does this not imply that it is in fact filtering the dataset using the pushdown predicate as it is creating the dynamic frame?
It's important to note that not all of the tables I am using are partitioned, so I am unable to apply a pushdown predicate to all of them. I have yet to find another way to filter these tables.
Later in the script, it uses sparkSqlQuery to run the query:
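Along these lines (the query text and node names are placeholders):

```python
# Exposes the incoming DynamicFrame under the alias "myDataSource"
# and runs the SQL against it.
SQLQuery_node2 = sparkSqlQuery(
    glueContext,
    query="SELECT * FROM myDataSource WHERE status = 'ACTIVE'",  # placeholder query
    mapping={"myDataSource": S3bucket_node1},
    transformation_ctx="SQLQuery_node2",
)
```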
sparkSqlQuery is defined as the following:
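Roughly this helper (a sketch of what Glue Studio typically emits):

```python
from awsglue.dynamicframe import DynamicFrame


def sparkSqlQuery(glueContext, query, mapping, transformation_ctx) -> DynamicFrame:
    # Register each input DynamicFrame as a temp view so the SQL can
    # reference it by alias, then wrap the result back into a DynamicFrame.
    # `spark` is the SparkSession created earlier in the generated script.
    for alias, frame in mapping.items():
        frame.toDF().createOrReplaceTempView(alias)
    result = spark.sql(query)
    return DynamicFrame.fromDF(result, glueContext, transformation_ctx)
```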
Is this the method you were referring to or is there another way?
In this case, you are going to have more control if you use a script job, and the query plan in the Spark UI will be simpler to understand, so you can see where the time is spent.