Glue PySpark saveAsTable vs write.parquet & crawl


Need advice on which approach is considered best practice and correct when writing data to S3 and populating metadata in the Glue Data Catalog. Are there any pros and cons?

  1. df.write.parquet("s3://path_to_my_database_and_table") and then run a crawler on the same S3 path.

  2. df.write.mode("overwrite").saveAsTable("my_database.my_table")

  3. df.write.mode("overwrite").saveAsTable("my_database.my_table") and then run a crawler on the already created table.

I would appreciate it if you would share your experience. Thank you.

asked a year ago · 1,968 views
2 Answers
Accepted Answer

I would always use option 2 if I can, meaning if I control the generation of the table data and control the schema changes.
When you saveAsTable, it creates the table using the schema of the DataFrame, instead of the crawler having to figure it out (which is more problematic with some formats).
Also, when you write that way, the table is published only after the data is generated; with a crawler you would have to coordinate the execution, or, if it runs on a schedule, risk the table being published while the job is still running.
Once the job finishes successfully, you know the table is ready; otherwise you would need to monitor the crawler for completion before doing further steps.
If you can do without a crawler, you have a simpler and leaner solution.
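
As a rough sketch of what option 2 could look like in a Glue PySpark job, assuming the job is configured to use the Glue Data Catalog as its Hive metastore (for example via the --enable-glue-datacatalog job parameter) and that my_database already exists; the bucket and source paths below are placeholders:

```python
from pyspark.sql import SparkSession

# Hive support so saveAsTable registers the table in the metastore
# (the Glue Data Catalog when the job enables it).
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

df = spark.read.json("s3://source-bucket/raw/")  # placeholder source

(
    df.write
    .mode("overwrite")
    .format("parquet")
    # An explicit path keeps this an external table at a location you control.
    .option("path", "s3://my-bucket/my_database/my_table/")
    .saveAsTable("my_database.my_table")
)
```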

The drawback is that, currently, if you have schema changes you have to handle them yourself (compare the table with the DataFrame and do an ALTER TABLE). A crawler can save you that work, but if it doesn't handle the change correctly, the problem is harder to detect and can cause a bigger disruption than when you handle it in the code.
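
If you do handle additive schema changes in code, a minimal sketch could look like the following; it only covers newly added columns (renames and type changes need more care), and the table and DataFrame names are taken from the question:

```python
# Compare the registered table schema with the DataFrame schema and add
# any missing columns before writing.
existing = {f.name.lower() for f in spark.table("my_database.my_table").schema.fields}
new_fields = [f for f in df.schema.fields if f.name.lower() not in existing]

if new_fields:
    cols_ddl = ", ".join(f"`{f.name}` {f.dataType.simpleString()}" for f in new_fields)
    spark.sql(f"ALTER TABLE my_database.my_table ADD COLUMNS ({cols_ddl})")
```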

I would avoid option 3: you get the downsides of both of the previous options.

AWS
EXPERT
answered a year ago

Hello Adas,

As a data architect, I have worked on architecting data lakes for multiple customers, and I would say the answer really depends. Both PySpark's saveAsTable and writing Parquet with write.parquet followed by a Glue Crawler are valid approaches for saving and cataloging data in Glue. The approach you choose depends on your specific requirements and the type of data you are working with.

I see options 1 and 2 as valid based on your requirements. Option 3 does not seem necessary, as saveAsTable takes care of adding partitions as well. I have been using df.write.parquet for large files (> 1 GB) and for smaller ones that need schema evolution.

**Option 1:** If you need to save your data as Parquet files, then writing the data with write.parquet and crawling it with a Glue Crawler is the way to go. This approach is best suited for larger datasets that need to be stored as files, and you can use the Glue Crawler to populate the Glue Data Catalog with metadata about your data.
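
A hedged sketch of option 1, assuming a crawler named my_table_crawler already exists and points at the same S3 prefix (names and paths are placeholders):

```python
import time
import boto3

# Write the Parquet files first, then ask the crawler to refresh the catalog.
df.write.mode("overwrite").parquet("s3://my-bucket/my_database/my_table/")

glue = boto3.client("glue")
glue.start_crawler(Name="my_table_crawler")

# Optionally wait for the crawler to finish before downstream steps.
while glue.get_crawler(Name="my_table_crawler")["Crawler"]["State"] != "READY":
    time.sleep(30)
```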

**Option 2:** If you want to save the data as a managed table in the Glue Data Catalog and take advantage of the optimizations that come with a managed catalog, then using PySpark's saveAsTable method is a good option. This method creates a table in the Glue Data Catalog and automatically handles the partitioning, serialization, and compression of your data. Also, prefer this method when the data is a manageable size and the table schema is not expected to change.
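
For a partitioned table, option 2 might look like this sketch (the partition columns, path, and table name are illustrative):

```python
(
    df.write
    .mode("overwrite")
    .format("parquet")
    .partitionBy("year", "month")  # partition columns are registered in the catalog
    .option("path", "s3://my-bucket/my_database/my_table/")
    .saveAsTable("my_database.my_table")
)
```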

One variation to option 1: I don't think there is a need to run crawlers after each and every run. Given that crawler runs are expensive on large files that accrue over time, I restrict crawler execution to runs where I detect that df.schema is different from the schema in the Glue catalog. For files that either don't need schema evolution or whose schema is unchanged, I add partitions by calling spark.sql("ALTER TABLE table_identifier ADD [IF NOT EXISTS] (partition_spec [partition_spec ...])").
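
That variation could be sketched roughly like this; the crawler name, table, and partition values are illustrative, and the type comparison is simplified (Spark and Glue type names don't always match exactly):

```python
import boto3

glue = boto3.client("glue")

# Schema currently registered in the Glue Data Catalog (partition keys are stored separately).
table = glue.get_table(DatabaseName="my_database", Name="my_table")
catalog_cols = {
    (c["Name"].lower(), c["Type"].lower())
    for c in table["Table"]["StorageDescriptor"]["Columns"]
}

# Schema of the incoming DataFrame, excluding the partition columns.
df_cols = {
    (f.name.lower(), f.dataType.simpleString().lower())
    for f in df.schema.fields
    if f.name.lower() not in ("year", "month")
}

if df_cols != catalog_cols:
    # Schema changed: let the crawler reconcile the catalog.
    glue.start_crawler(Name="my_table_crawler")
else:
    # Same schema: just register the new partition directly.
    spark.sql(
        "ALTER TABLE my_database.my_table "
        "ADD IF NOT EXISTS PARTITION (year = '2024', month = '01')"
    )
```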

If this helps, please upvote and accept the answer to help everyone on this forum. Let me know if you need anything else.

AWS
answered a year ago
