spark.read.parquet fails when Job Bookmark is enabled in AWS Glue

Hi,

I am experiencing very weird behavior in Glue. See the basic script below: I read Parquet data from a catalog table, and then
read another Parquet file directly through the Spark session.

          database = "lh-datalake",   
          table_name = "livraison",  
          transformation_ctx="lh-datalakelivraison"  
    )  
  
produit_livre_gdf.toDF().show(2)  # this line works  
data = spark.read.parquet("s3://datalake-datalake/staging")  # this line fails with permission error if transformation_ctx is passed  
data.show(2)  

If I run this script with job bookmarks enabled, it fails with an S3 access denied error. I am 100% sure that the permissions are properly set for this role. The line that throws the error is data = spark.read.parquet("s3://datalake-datalake/staging").

But when I remove the transformation_ctx, it executes successfully.

I switched regions, and even accounts, and the issue persists.
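For context, job bookmarks are enabled per run through the --job-bookmark-option job argument; here is a minimal boto3 sketch of the setup under which the failure reproduces (the job name is a hypothetical placeholder; the same option can be set in the Glue console):

# Job bookmarks are turned on per run via the --job-bookmark-option argument.
# "my-glue-job" is a hypothetical placeholder.
import boto3

glue = boto3.client("glue")
glue.start_job_run(
    JobName="my-glue-job",
    Arguments={"--job-bookmark-option": "job-bookmark-enable"},
)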

Does anyone have an idea what the issue could be?

Regards,

Trust
asked 3 years ago · 497 views
1 Answer

I found the issue, which seems to be a bug in Glue.
The problem was that the temporary folder path was in the same bucket as the partition files, and that bucket is registered as a data lake location under Lake Formation.
I pointed the temporary folder to another S3 bucket, which fixed it.
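A minimal sketch of that workaround, assuming a separate scratch bucket that is not registered with Lake Formation (the bucket and job names are hypothetical placeholders; the same change can be made by editing the job's --TempDir parameter in the console):

# Move the Glue job's temporary directory (--TempDir) to a bucket that is
# NOT registered as a Lake Formation data lake location.
# "my-glue-job" and "my-glue-scratch-bucket" are hypothetical placeholders.
import boto3

glue = boto3.client("glue")
job = glue.get_job(JobName="my-glue-job")["Job"]

glue.update_job(
    JobName="my-glue-job",
    JobUpdate={
        "Role": job["Role"],
        "Command": job["Command"],
        "DefaultArguments": {
            **job.get("DefaultArguments", {}),
            "--TempDir": "s3://my-glue-scratch-bucket/temp/",
        },
    },
)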

Note that this issue does not occur when reading a CSV-based Glue Catalog table, only when reading a Parquet-based catalog table.

Trust
answered 3 years ago
