spark.read.parquet fails when Job bookmark is enabled in AWS Glue


Hi,

I am experiencing very weird behavior in Glue. See the basic script below. I am basically reading Parquet data from a catalog table, and then
reading another Parquet file directly using the Spark context.

          database = "lh-datalake",   
          table_name = "livraison",  
          transformation_ctx="lh-datalakelivraison"  
    )  
  
produit_livre_gdf.toDF().show(2)  # this line works  
data = spark.read.parquet("s3://datalake-datalake/staging")  # this line fails with permission error if transformation_ctx is passed  
data.show(2)  

If I run this script with job bookmarks enabled, it fails with an S3 Access Denied error. I am 100% sure that the permissions are properly set for this role. This is the line that throws the error: data = spark.read.parquet("s3://datalake-datalake/staging")

But when I remove the transformation_ctx, it executes successfully.
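
For reference, the working variant simply omits the context parameter (a minimal sketch using the same names as above; note that without a transformation_ctx, job bookmarks will not track this source):

    # Same read as before, but with no transformation_ctx:
    # the job runs, at the cost of bookmark tracking for this source.
    produit_livre_gdf = glueContext.create_dynamic_frame.from_catalog(
        database="lh-datalake",
        table_name="livraison"
    )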

I switched regions, and even accounts, and still have the same issue.

Does anyone have an idea what the issue could be?

regards,


Trust
asked 3 years ago · 497 views
1 Answer

I found the issue, which seems to be a bug in Glue.
The problem is that the temporary folder path was defined in the same bucket as the partition files, and that bucket is registered as a data lake location under Lake Formation.
I basically pointed the temporary folder to another S3 bucket.
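
If it helps anyone, moving the temporary folder amounts to overriding the job's --TempDir default argument. Here is a hedged sketch using boto3; the job name, role, script location, and bucket names are placeholders, not values from my setup:

    import boto3

    glue = boto3.client("glue")

    # Point --TempDir at a bucket that is NOT registered as a
    # Lake Formation data lake location (all names are placeholders).
    glue.update_job(
        JobName="my-glue-job",
        JobUpdate={
            "Role": "MyGlueJobRole",
            "Command": {
                "Name": "glueetl",
                "ScriptLocation": "s3://my-scripts-bucket/job.py",
                "PythonVersion": "3",
            },
            "DefaultArguments": {
                "--TempDir": "s3://my-separate-temp-bucket/temporary/",
                "--job-bookmark-option": "job-bookmark-enable",
            },
        },
    )

The same change can be made in the console under the job's "Temporary directory" setting.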

As mentioned in the initial question, this issue does not occur when you are reading a CSV-based Glue catalog table, only when you are reading a Parquet-based catalog table.


Trust
answered 3 years ago
