spark.read.parquet fails when job bookmark is enabled in AWS Glue


Hi,

I am experiencing very weird behavior in Glue; see the basic script below. I read Parquet data from a catalog table, and then read another Parquet file directly with the Spark context.

    # Call reconstructed from the truncated snippet; the arguments match
    # GlueContext.create_dynamic_frame.from_catalog, with glueContext assumed
    # to be a standard GlueContext instance.
    produit_livre_gdf = glueContext.create_dynamic_frame.from_catalog(
          database = "lh-datalake",
          table_name = "livraison",
          transformation_ctx="lh-datalakelivraison"
    )
  
produit_livre_gdf.toDF().show(2)  # this line works  
data = spark.read.parquet("s3://datalake-datalake/staging")  # this line fails with permission error if transformation_ctx is passed  
data.show(2)  

If I run this script with job bookmarks enabled, it fails with an S3 access denied error. I am 100% sure the permissions are properly set for this role. The line that throws the error is `data = spark.read.parquet("s3://datalake-datalake/staging")`.

But when I remove the transformation_ctx, it executes successfully.

I switched regions, and even accounts, and still see the same issue.

Does anyone have an idea what the issue could be?

regards,

Edited by: Trust on Jan 5, 2021 9:48 AM

Edited by: Trust on Jan 6, 2021 4:27 AM

Trust
asked 3 years ago · 497 views
1 answer

I found the issue, which seems to be a bug in Glue.
The problem is that the temporary folder path was set in the same bucket as the partition files, and that bucket is registered as a data lake location under Lake Formation.
I simply pointed the temporary folder to another S3 bucket.
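For anyone hitting the same error, a minimal sketch of the fix via the AWS CLI: update the job's `--TempDir` default argument so it points at a bucket that is not registered with Lake Formation. The job name, role ARN, script location, and bucket names below are placeholders, not values from the original post.

```shell
# Repoint the Glue job's temporary directory (--TempDir) to a bucket that is
# NOT registered as a Lake Formation data lake location. Role and Command are
# required fields of the JobUpdate structure, so they must be restated here.
aws glue update-job \
  --job-name my-glue-job \
  --job-update '{
    "Role": "arn:aws:iam::123456789012:role/MyGlueJobRole",
    "Command": {
      "Name": "glueetl",
      "ScriptLocation": "s3://my-scripts-bucket/job.py"
    },
    "DefaultArguments": {
      "--TempDir": "s3://my-separate-temp-bucket/glue-temp/"
    }
  }'
```

The same change can be made in the Glue console under the job's "Job details" by editing the temporary path.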

As mentioned in the initial question, this issue does not occur when reading a CSV-based Glue catalog table, only when reading a Parquet-based catalog table.

Edited by: Trust on Jan 6, 2021 6:52 AM

Trust
answered 3 years ago
