AWS Glue: reading a Glue Data Catalog table vs. reading files from S3

I am writing an AWS Glue ETL job and I have two options for constructing the Spark DataFrame.

The first is to use the AWS Glue Data Catalog as the metastore for Spark SQL:

df = spark.sql("select name from bronze_db.table_tbl")
df.write.save("s3://silver/...")

The other option is to read directly from the S3 location, like this:

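# In PySpark, multiple paths must be passed as a list; a second positional string would be taken as the format
df = spark.read.format("parquet").load(["s3://bronze/table_tbl/1.parquet", "s3://bronze/table_tbl/2.parquet"])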
df.write.save("s3://silver/...")

Should I read the files directly to save cost, to avoid any limit on the number of queries (select name from bronze_db.table_tbl), or to get better read performance?

I am not sure whether this query is run on Athena to return the results.

1 Answer

Hi,

The query will not be run by Athena, and there will not be any additional cost. When the AWS Glue Data Catalog is used with Spark, it takes the place of the Hive Metastore and tells Spark SQL how to access the data in S3.

The two methods are equivalent; the first is just a bit more concise and user-friendly, since you do not have to remember or know where the table's files are.
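
As a rough sketch of that equivalence (not an official AWS example), the two reads could be wrapped as below. The helper names `read_via_catalog` and `read_via_paths` are hypothetical, `spark` is assumed to be the job's SparkSession, and the catalog route assumes the job has the Glue Data Catalog enabled as its Hive metastore (for example through the `--enable-glue-datacatalog` job parameter). The path-based helper adds a `select` so that both routes return the same columns.

```python
def read_via_catalog(spark, database, table, columns):
    """Let Spark SQL resolve the table location through the metastore
    (the Glue Data Catalog, when the job is configured to use it)."""
    column_list = ", ".join(columns)
    return spark.sql(f"SELECT {column_list} FROM {database}.{table}")


def read_via_paths(spark, paths, columns):
    """Read the Parquet files directly; the caller has to know the S3 paths."""
    # PySpark's load() accepts a single path or a list of paths.
    return spark.read.format("parquet").load(list(paths)).select(*columns)


# Hypothetical usage inside the job, with the names from the question:
# df = read_via_catalog(spark, "bronze_db", "table_tbl", ["name"])
# df = read_via_paths(
#     spark,
#     ["s3://bronze/table_tbl/1.parquet", "s3://bronze/table_tbl/2.parquet"],
#     ["name"],
# )
```

Either way the files are read by the Spark executors of the Glue job itself; the catalog only supplies the table's schema and location.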

Hope this helps.

AWS
EXPERT
answered 2 years ago
