Glue Spark Job is incomparably slower and more expensive than Athena

0

I've tried to use a Glue Spark job for very basic partitioning of about 50 GB of GZIP JSON data. The reason for trying a Glue job is that my data could have more than 100 partitions, and that is not really convenient to do in Athena. The code is almost straight from the AWS template.

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import concat, col, lit, substring

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

my_partition_predicate = "partition='main' and year='2022' and month='12' and day='17'"
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "my_db", table_name = "my_table", push_down_predicate = my_partition_predicate, transformation_ctx = "datasource0", additional_options = {"recurse": True})

datasource0 = datasource0.toDF()
datasource0.printSchema()
datasource0 = datasource0.withColumn("od", concat(substring('originalrequest',9,3), lit("_"), substring('originalrequest',12,3)))
datasource0 = DynamicFrame.fromDF(datasource0, glueContext, "datasource0")

datasink4 = glueContext.write_dynamic_frame.from_options(
    frame = datasource0, 
    connection_type = "s3", 
    connection_options = {"path": "s3://my_b/glue4",  "partitionKeys": ["od"] , "compression": "gzip"}, 
    format = "json"
    )
job.commit()

The job was executed with 60 DPUs and timed out after 20 minutes... This failed execution cost $8.80.

Meanwhile, exactly the same job was done in Athena in about 2 minutes and cost $0.25.

Am I doing something wrong, or is Athena (Presto) leaps ahead of Spark in terms of speed and cost-effectiveness?

  • In both cases are you writing to JSON and compressing as GZIP?

  • Yep. Output is exactly the same. Partitioned JSON with GZIP compression.

Smotrov
asked a year ago · 954 views
2 Answers
1

Athena requires less effort to optimize a job; Glue, since it allows so many ways of being used, doesn't self-tune as well yet.
Before jumping to conclusions, check whether your job is making good use of those 60 DPUs (using the Spark UI or Glue metrics); maybe the data is unbalanced at the source.

For that specific use case, DynamicFrame is not as efficient as DataFrame, especially when doing the partitioning.
In addition, you are converting back and forth, which often degrades performance. Since your data is not dynamic, I suggest you do the same with only a DataFrame and compare (e.g. spark.sql("your query").write.partitionBy("od").json("your path")).

Finally, Glue 4.0 has significant performance improvements; I encourage you to try it out on that version.

AWS
EXPERT
answered a year ago
  • Thanks for getting back to me! It seems like the learning curve for Glue and Spark is quite steep, almost like climbing Everest.

    I was under the impression that DynamicFrame was a crucial concept in Glue Jobs, which is why I used it. Unfortunately, I'm having trouble getting Spark SQL to recognize any columns other than partitioned ones.

    Even after running

    spark.sql('use default')
    spark.sql('describe my_table').show()

    only the partitioned columns are consistently displayed. I've spent half a day poring over output stacks and documentation with no success. As a result, I'm unable to benchmark Spark DataFrame since I just can't get my data column to show up.

  • DynamicFrame is very relevant when the data schema is not well defined, but not necessarily in all cases. For catalog tables it's easier to use Spark SQL. If Spark doesn't see those columns, it means the table is not well defined, and Athena shouldn't see them either. How was the table created? It's better if you let the tools create tables for you (e.g. Athena CTAS or Spark saveAsTable()).

0

You may be doing something wrong, but Athena is generally faster and more cost-effective than Glue for ad-hoc querying of data in S3. Athena uses Presto, which is a high-performance SQL query engine optimized for reading and aggregating large data sets from S3. On the other hand, Glue is a fully-managed ETL service that is intended for data extraction, transformation, and loading, not for ad-hoc querying. So while it's possible to use Glue for basic partitioning, it may not be the most efficient or cost-effective option, especially if your data is partitioned and your queries are simple.

answered a year ago
  • Another try, without conversion to DataFrames. 17 GB of data processed in 40 minutes on 50 DPUs, with all workers loaded at ~100%. So the effective processing speed is 17000/40/50 = 8.5 MB/min per DPU. I'd just like to know: is that normal processing speed for AWS Glue, or is there space for tuning? To me it seems kind of slow...

    my_partition_predicate = "partition='main' and pos='DE' and pcc='YYYY' and year='2023' and month='3' and day='8'"
    source_dyf = glueContext.create_dynamic_frame.from_catalog(database = "estr_db", table_name = "estr_warehouse", push_down_predicate = my_partition_predicate, additional_options = {"recurse": True})
    def add_od(record):
        od = record['originalrequest'][8:11] + "_" + record['originalrequest'][11:14]
        record['od'] = od
        return record
    source_dyf = source_dyf.map(add_od)
    # Set a suitable number of partitions and repartition the DynamicFrame
    num_partitions = 100  # Adjust this value based on your data size and cluster resources
    source_dyf = source_dyf.repartition(num_partitions)
    # Write the output DynamicFrame to an S3 bucket in JSON format with Gzip compression
    glueContext.write_dynamic_frame.from_options(
        frame=source_dyf,
        connection_type="s3",
        connection_options={
            "path": "s3://mybucket/temp/booo",
            "partitionKeys": ["od"],
            "compression": "gzip"
        },
        format="json",
        transformation_ctx="sink"
    )
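A side note on the indexing in `add_od` above: Spark SQL's `substring(col, pos, len)` is 1-based, while Python slicing is 0-based, so `substring('originalrequest', 9, 3)` in the first job and `record['originalrequest'][8:11]` here extract the same characters. A quick pure-Python check (the request string is made up for illustration):

```python
# Spark's substring(col, pos, len) is 1-based; Python's s[start:stop] is
# 0-based. Both job versions therefore build the same "od" value.
def spark_substring(s: str, pos: int, length: int) -> str:
    """Mimic Spark SQL substring(): 1-based start position."""
    return s[pos - 1 : pos - 1 + length]

s = "REQ-0001ABCDEF-tail"  # made-up request string

od_sql_style = spark_substring(s, 9, 3) + "_" + spark_substring(s, 12, 3)
od_py_style = s[8:11] + "_" + s[11:14]

print(od_sql_style, od_py_style)  # ABC_DEF ABC_DEF
assert od_sql_style == od_py_style
```

Getting this off by one would silently partition the data on the wrong key, so it is worth checking when porting between the two styles.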
    
