spark.read.parquet fails when job bookmark is enabled in AWS Glue


Hi,

I am experiencing very weird behavior in Glue; see the basic script below. I am basically reading Parquet data from a catalog table, and then reading another Parquet file directly with Spark.

          database = "lh-datalake",   
          table_name = "livraison",  
          transformation_ctx="lh-datalakelivraison"  
    )  
  
produit_livre_gdf.toDF().show(2)  # this line works  
data = spark.read.parquet("s3://datalake-datalake/staging")  # this line fails with permission error if transformation_ctx is passed  
data.show(2)  
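
For reference, job bookmarks only take effect when the script is wrapped in the standard Glue Job init/commit boilerplate; the snippet above omits that part. A minimal sketch of the surrounding setup (assumed boilerplate, not copied from the actual job):

    import sys

    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    # Standard Glue job setup; bookmarks are tracked per transformation_ctx
    # and are only persisted when job.commit() runs.
    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    sc = SparkContext()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session
    job = Job(glueContext)
    job.init(args["JOB_NAME"], args)

    # ... catalog read and spark.read.parquet calls from the snippet above ...

    job.commit()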

If I run this script with job bookmark enabled, it fails with an S3 access denied error. I am 100% sure that the permissions are properly set for this role. This is the line that throws the error: data = spark.read.parquet("s3://datalake-datalake/staging").

But when I remove the transformation_ctx, it executes successfully.

I switched regions, and even accounts, and still get the same issue.

Does anyone have an idea what the issue could be?

regards,

Edited by: Trust on Jan 5, 2021 9:48 AM

Edited by: Trust on Jan 6, 2021 4:27 AM

Trust
Asked 3 years ago · 497 views
1 Answer

I found the issue, which seems to be a bug in Glue.
The problem is that the temporary folder path is defined in the same bucket as the partition files, and that bucket is registered as a data lake location under Lake Formation.
I basically pointed the temporary folder to another S3 bucket.

As mentioned in the initial question, this issue does not occur when reading a CSV-based Glue catalog table, only when reading a Parquet-based catalog table.
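
In practice this just means pointing the job's --TempDir default argument at a bucket that is not registered with Lake Formation. A minimal sketch of doing that with boto3 (the job name and temporary bucket below are placeholders):

    import boto3

    glue = boto3.client("glue")

    # Placeholder names: use your actual job name and a bucket that is NOT
    # registered as a Lake Formation data lake location.
    job_name = "my-glue-job"
    job = glue.get_job(JobName=job_name)["Job"]

    glue.update_job(
        JobName=job_name,
        JobUpdate={
            # Role and Command are required fields of JobUpdate.
            "Role": job["Role"],
            "Command": job["Command"],
            "DefaultArguments": {
                **job.get("DefaultArguments", {}),
                "--TempDir": "s3://my-glue-temp-bucket/temporary/",
            },
        },
    )

The same change can also be made from the Glue console by editing the job's temporary directory setting.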

Edited by: Trust on Jan 6, 2021 6:52 AM

Trust
Answered 3 years ago
