_$folder$ while writing to S3 from a Glue PySpark job

Hi

I have written a few Glue jobs without running into this, but it has suddenly started appearing in a new job that I wrote. I am using the code below to write data to S3. The S3 path is "s3://....".

unionData_df.repartition(1).write.mode("overwrite").parquet(test_path)

In my test environment, the first time I ran the Glue job it created an empty file with the suffix _$folder$. The same happened in Prod. My other jobs do not have this problem.

Why is it creating this file, and how can I avoid it? Any pointers on why it happens for this job but not for the others? What should I be checking? Note: I think the file gets created the first time the prefix/folder is created. Some blog posts suggest changing the S3 path to s3a, but I am not sure that is the right thing to do.

Asked a year ago, 1438 views
2 Answers
Accepted Answer

Hadoop does this when the folder does not exist. The _$folder$ object is just a placeholder, created by mkdir commands; the actual folder only comes into existence when the first file is placed in it. The other jobs where this does not happen are probably writing to folders that already exist. These files should not cause a problem.
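
If the placeholders are unwanted anyway, one option is to remove them after the job finishes. The following is only a minimal sketch, not something from the original answer: the helper names `find_folder_markers` and `delete_folder_markers` are hypothetical, and it assumes boto3 is available and the job role may list and delete objects in the bucket. It only touches zero-byte objects whose key ends in _$folder$.

```python
FOLDER_MARKER_SUFFIX = "_$folder$"


def find_folder_markers(objects):
    """Return keys of empty objects whose name ends in _$folder$.

    `objects` is a list of dicts with "Key" and "Size", the shape used by
    the "Contents" entries of an S3 list_objects_v2 response.
    """
    return [
        obj["Key"]
        for obj in objects
        if obj["Key"].endswith(FOLDER_MARKER_SUFFIX) and obj["Size"] == 0
    ]


def delete_folder_markers(bucket, prefix):
    """Delete placeholder objects under a prefix and return the deleted keys.

    Hypothetical helper: needs boto3, AWS credentials, and permission to
    list and delete objects in the bucket.
    """
    import boto3  # imported here so find_folder_markers works without boto3

    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    deleted = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for key in find_folder_markers(page.get("Contents", [])):
            s3.delete_object(Bucket=bucket, Key=key)
            deleted.append(key)
    return deleted
```

It could be called at the end of the job as `delete_folder_markers("my-bucket", "output/")`, where both names are placeholders for your own bucket and prefix. Since the marker is written when a missing folder is first created, it may reappear if a later run writes to a new prefix.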

venky81 (AWS), answered a year ago
Reviewed by an AWS Expert a year ago

This happens because of the S3 path scheme you use when writing.

s3:// vs s3a://

s3:// will create the _$folder$ placeholder; s3a:// will not.

Both have their upsides and downsides, and it is generally recommended to stick with s3://.

Answered a year ago
