Reading S3 objects whose basename begins with an underscore as a Glue DynamicFrame


I have JSON data stored on S3, over which I have created Glue tables. The data is partitioned, and I use Glue crawlers to update the table partitions. I then load this data as a Glue DynamicFrame within a Glue job. I am using Glue 3.0 and creating the DynamicFrame with the GlueContext.create_dynamic_frame.from_catalog method of the awsglue package.

This works nicely for most of my data, but for some reason the DynamicFrame does not load S3 objects whose basename begins with an underscore. I was able to reproduce this using two identical objects, one named something and the other named _something. The _something object was not read into the DynamicFrame, but after I renamed it to something_else (no leading underscore) it was read successfully.

Some cursory googling suggests that this may be a Presto "feature". Athena uses Presto under the hood, and Glue uses Athena under the hood(?). Starting with version 0.60, Presto ignores files whose names start with an underscore (_) or a dot (.). There is an org.apache.hadoop.hive.common.FileUtils.HIDDEN_FILES_PATH_FILTER property, but it's unclear to me whether this is something I can configure in Glue.

Asked 1 year ago, viewed 476 times
1 Answer
Accepted Answer

Hi,

Please be informed that, as of now, this is a known limitation of Glue DynamicFrames when dealing with S3 object names that start with an underscore ("_"). As a workaround, you can rename any object whose name starts with an underscore. Further, Glue does not use Athena under the hood, and you can't configure anything regarding this in Glue, as it is a limitation. Thank you!
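For anyone who wants to script the rename workaround, here is a rough sketch in Python using boto3. It is only an illustration under stated assumptions: the helper names (visible_key, rename_hidden_objects), the "data" prefix, and the bucket and prefix in the usage note are all made up for this example, and S3 has no real rename, so each object is copied to the new key and the old key is deleted. Try it on a test prefix first.

```python
import posixpath


def visible_key(key: str, prefix: str = "data") -> str:
    """Return `key` with a basename that no longer starts with '_' or '.'.

    Hypothetical helper: the default "data" prefix is an arbitrary choice.
    """
    directory, basename = posixpath.split(key)
    if not basename.startswith(("_", ".")):
        return key  # already visible, leave untouched
    return posixpath.join(directory, prefix + basename)


def rename_hidden_objects(bucket: str, key_prefix: str) -> list:
    """Copy each hidden object under `key_prefix` to a visible key, then delete the original.

    Returns the list of (old_key, new_key) pairs that were renamed.
    """
    import boto3  # imported lazily so visible_key can be used without boto3 installed

    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")

    # Collect the keys first so the listing is not affected by our own writes.
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=key_prefix):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))

    renamed = []
    for key in keys:
        new_key = visible_key(key)
        if new_key == key:
            continue
        # Managed copy, then delete the original object.
        s3.copy(CopySource={"Bucket": bucket, "Key": key}, Bucket=bucket, Key=new_key)
        s3.delete_object(Bucket=bucket, Key=key)
        renamed.append((key, new_key))
    return renamed
```

With placeholder names, a call would look like rename_hidden_objects("my-bucket", "events/year=2022/"). Because only the basename changes, the objects stay inside the same partition directories.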

Answered 1 year ago
  • It's really by design: there is a convention coming from Hadoop that files whose names start with "_" or "." are considered internal (e.g. metadata) and do not contain data.
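To check ahead of time which objects would be skipped under this convention, a small predicate like the one below can be run over a list of keys. This is a sketch of the naming rule as described in the comment above, not Glue's or Hadoop's actual code; the function names and the sample keys are invented for illustration.

```python
import posixpath


def is_hidden(key: str) -> bool:
    """True if the key's basename starts with '_' or '.' (the Hadoop hidden-file convention)."""
    return posixpath.basename(key).startswith(("_", "."))


def split_keys(keys):
    """Partition keys into (readable, skipped) according to is_hidden."""
    readable = [k for k in keys if not is_hidden(k)]
    skipped = [k for k in keys if is_hidden(k)]
    return readable, skipped


# Sample keys, made up for this example.
sample = [
    "events/year=2022/something",
    "events/year=2022/_something",
    "events/year=2022/.part-0000.crc",
    "events/year=2022/something_else",
]
readable, skipped = split_keys(sample)
```

Here readable holds something and something_else, while _something and the dot-prefixed file land in skipped, which matches the behaviour reported in the question.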
