Athena requires less effort to optimize a job; Glue, since it allows so many ways of using it, doesn't self-tune so well yet.
Before jumping to conclusions, check whether your job is making good use of those 60 DPUs (using the Spark UI or Glue job metrics); maybe the data is unbalanced at the source.
For that specific use case DynamicFrame is not as efficient as DataFrame, especially when writing partitioned output.
In addition, you are converting back and forth between the two, which often degrades performance.
Since your data is not really dynamic, I suggest you do the same with a plain DataFrame and compare, e.g. spark.sql("your query").write.partitionBy("od").json("your path"), as in the sketch below.
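A minimal sketch of that DataFrame-only route, assuming the table, filter, od derivation, and output path from the job script shared in this thread (adjust names to your setup):

from pyspark.sql import SparkSession
from pyspark.sql.functions import concat, substring, lit

spark = SparkSession.builder.getOrCreate()

# Read straight from the Glue catalog with the partition filter in SQL
# (backticks because "partition" is also a SQL keyword)
df = spark.sql("""
    SELECT * FROM estr_db.estr_warehouse
    WHERE `partition` = 'main' AND pos = 'DE' AND pcc = 'YYYY'
      AND year = '2023' AND month = '3' AND day = '8'
""")

# Derive od with native column functions; substring is 1-based, so these
# match the Python slices [8:11] and [11:14] in the DynamicFrame version
df = df.withColumn(
    "od",
    concat(substring("originalrequest", 9, 3), lit("_"), substring("originalrequest", 12, 3)),
)

# Partitioned, gzip-compressed JSON output, same as the DynamicFrame sink
df.write.mode("overwrite").partitionBy("od").option("compression", "gzip").json("s3://mybucket/temp/booo")

This skips the DynamicFrame conversion entirely and replaces the per-record Python map with column expressions that Spark can optimize.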
Finally, Glue 4.0 has significant performance improvements; I encourage you to try your job on that version.
Thanks for getting back to me! It seems like the learning curve for Glue and Spark is quite steep, almost like climbing Everest.
I was under the impression that DynamicFrame was a crucial concept in Glue Jobs, which is why I used it. Unfortunately, I'm having trouble getting Spark SQL to recognize any columns other than partitioned ones.
Even after running
spark.sql('use default')
spark.sql('describe my_table').show()
only the partition columns are consistently displayed. I've spent half a day poring over output stacks and documentation with no success. As a result, I'm unable to benchmark the Spark DataFrame approach, since I just can't get my data columns to show up.
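For reference, a couple of standard Spark calls that dump everything the catalog has registered for a table (using the same my_table name as above):

spark.sql('use default')
# Full schema as Spark sees it, including nested fields
spark.table('my_table').printSchema()
# Table metadata: columns, partition columns, storage location, SerDe
spark.sql('describe formatted my_table').show(truncate=False)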
DynamicFrame is very relevant when the data schema is not well defined, but not necessarily in all cases. For catalog tables it's easier to use Spark SQL. If Spark doesn't see those columns, it means the table is not well defined, and Athena shouldn't see them either. How was the table created? It's better to let tools create tables for you (e.g. Athena CTAS or Spark saveAsTable()).
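For instance, a hedged sketch of the saveAsTable route, assuming a DataFrame df with the data already loaded (table and path names here are illustrative):

# Let Spark create and register the table, so the catalog schema is
# guaranteed to match the written data (names are placeholders)
(df.write
   .mode("overwrite")
   .format("json")
   .option("compression", "gzip")
   .option("path", "s3://mybucket/temp/booo")
   .partitionBy("od")
   .saveAsTable("default.my_table"))

After that, describe my_table should list the data columns as well as the od partition column.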
You may be doing something wrong, but Athena is generally faster and more cost-effective than Glue for ad-hoc querying of data in S3. Athena uses Presto, a high-performance SQL query engine optimized for reading and aggregating large data sets from S3. Glue, on the other hand, is a fully managed ETL service intended for data extraction, transformation, and loading, not for ad-hoc querying. So while it's possible to use Glue for basic partitioning, it may not be the most efficient or cost-effective option, especially if your data is already partitioned and your queries are simple.
Another try, without conversion to DataFrames: 17 GB of data processed in 40 minutes on 50 DPUs, with all workers loaded at ~100%. So the effective processing speed is 17,000 MB / 40 min / 50 DPUs ≈ 8.5 MB/min per DPU. I'd just like to know: is that a normal processing speed for AWS Glue, or is there room for tuning? To me it seems kind of slow-ish...
my_partition_predicate = "partition='main' and pos='DE' and pcc='YYYY' and year='2023' and month='3' and day='8'"

# Read the source partitions from the Glue catalog, pruned by the pushdown predicate
source_dyf = glueContext.create_dynamic_frame.from_catalog(
    database="estr_db",
    table_name="estr_warehouse",
    push_down_predicate=my_partition_predicate,
    additional_options={"recurse": True}
)

# Derive the od column from fixed positions of the originalrequest string
def add_od(record):
    od = record['originalrequest'][8:11] + "_" + record['originalrequest'][11:14]
    record['od'] = od
    return record

source_dyf = source_dyf.map(add_od)

# Set a suitable number of partitions and repartition the DynamicFrame
num_partitions = 100  # Adjust this value based on your data size and cluster resources
source_dyf = source_dyf.repartition(num_partitions)

# Write the output DynamicFrame to an S3 bucket in JSON format with Gzip compression
glueContext.write_dynamic_frame.from_options(
    frame=source_dyf,
    connection_type="s3",
    connection_options={
        "path": "s3://mybucket/temp/booo",
        "partitionKeys": ["od"],
        "compression": "gzip"
    },
    format="json",
    transformation_ctx="sink"
)
In both cases, are you writing to JSON and compressing as GZIP?
Yep. Output is exactly the same. Partitioned JSON with GZIP compression.