spark.read.parquet fails when job bookmark is enabled in AWS Glue


Hi,

I am experiencing very weird behavior in Glue. See the basic script below. I am basically reading parquet data from a catalog table, and then reading another parquet file directly using the Spark context.

          database = "lh-datalake",   
          table_name = "livraison",  
          transformation_ctx="lh-datalakelivraison"  
    )  
  
produit_livre_gdf.toDF().show(2)  # this line works  
data = spark.read.parquet("s3://datalake-datalake/staging")  # this line fails with permission error if transformation_ctx is passed  
data.show(2)  

If I run this script with job bookmark enabled, it fails with an S3 access denied error. I am 100% sure that the permissions are properly set for this role. This line throws the error:

    data = spark.read.parquet("s3://datalake-datalake/staging")

But when I remove the transformation_ctx, it executes successfully.
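
For reference, this is the variant that runs cleanly, with transformation_ctx omitted so no job bookmark state is tracked for that read. This is a minimal sketch; the GlueContext and Spark session setup is the standard Glue job boilerplate, which was not shown in my original snippet:

    from awsglue.context import GlueContext
    from pyspark.context import SparkContext

    # standard Glue job setup (assumed; not part of the original snippet)
    sc = SparkContext.getOrCreate()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session

    # same catalog read, but without transformation_ctx -- no bookmark state
    # is recorded for this source, and the direct parquet read then succeeds
    produit_livre_gdf = glueContext.create_dynamic_frame.from_catalog(
        database="lh-datalake",
        table_name="livraison"
    )

    produit_livre_gdf.toDF().show(2)
    data = spark.read.parquet("s3://datalake-datalake/staging")
    data.show(2)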

I switched regions, and even accounts, but the issue is still the same.

Does anyone have an idea what the issue could be?

Regards,

Edited by: Trust on Jan 5, 2021 9:48 AM

Edited by: Trust on Jan 6, 2021 4:27 AM

Trust
Asked 3 years ago · 497 views
1 Answer

I found the issue, which seems to be a bug in Glue.
The problem is that the temporary folder path was defined in the same bucket as the partition files, and that bucket is registered as a data lake location under Lake Formation.
I basically pointed the temporary folder to another S3 bucket.
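
For anyone hitting the same thing, here is a minimal sketch of applying that fix with boto3. The job name and bucket name are placeholders, not my actual values; the key point is that --TempDir ends up in a bucket that is not registered as a Lake Formation data lake location:

    import boto3

    glue = boto3.client("glue")

    # fetch the current job definition (job name is hypothetical)
    job = glue.get_job(JobName="my-parquet-job")["Job"]

    # point the temporary directory at a bucket that is NOT registered
    # as a Lake Formation data lake location (bucket name is hypothetical)
    args = dict(job.get("DefaultArguments", {}))
    args["--TempDir"] = "s3://my-glue-temp-bucket/temp/"

    # UpdateJob requires Role and Command to be sent along with the change
    glue.update_job(
        JobName="my-parquet-job",
        JobUpdate={
            "Role": job["Role"],
            "Command": job["Command"],
            "DefaultArguments": args,
        },
    )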

As mentioned in the initial question, this issue does not occur when reading a CSV-based Glue catalog table, only when reading a parquet-based catalog table.

Edited by: Trust on Jan 6, 2021 6:52 AM

Trust
Answered 3 years ago
