Reading S3 objects whose basename begins with an underscore as a Glue DynamicFrame


I have JSON data stored on S3 which I have created Glue tables over. This data is partitioned and I use Glue crawlers to update the table partitions. I then load this data as a Glue DynamicFrame within a Glue job. I am using Glue 3.0 and creating the DynamicFrame with the GlueContext.create_dynamic_frame.from_catalog method of the awsglue package.

This works nicely for most of my data, but for some reason the DynamicFrame does not load S3 objects whose basename begins with an underscore. I was able to reproduce this using two identical objects, one named something and the other named _something. The _something object was not read into the DynamicFrame, but after I renamed it to something_else (no leading underscore) it was read successfully.

Some cursory googling tells me that this may be a Presto "feature". Athena uses Presto under the hood, and Glue uses Athena under the hood(?). Since version 0.60, Presto has ignored files whose names start with an underscore (_) or a dot. There is an org.apache.hadoop.hive.common.FileUtils.HIDDEN_FILES_PATH_FILTER property, but it's unclear to me whether this is something I can configure in Glue.

asked a year ago · 461 views
1 Answer
Accepted Answer

Hi,

Please note that this is currently a known limitation of Glue DynamicFrames when dealing with S3 object names that start with an underscore ("_"). As a workaround, you can rename any object whose name starts with an underscore. Also, Glue does not use Athena under the hood, and there is nothing you can configure in Glue to change this behavior, as it is a limitation. Thank you!
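The rename workaround could be scripted roughly as follows. This is only a sketch, not an official Glue or AWS utility: the helper names `visible_key` and `rename_hidden_objects` are hypothetical, and it assumes you pass in a boto3 S3 client (for example `boto3.client("s3")`). S3 has no rename operation, so each object is copied to its new key and the original is then deleted.

```python
import posixpath


def visible_key(key: str) -> str:
    """Return `key` with leading '_' or '.' characters stripped from its basename.

    Keys whose basename is already visible, or would become empty, are
    returned unchanged. Only the last path component is considered.
    """
    head, base = posixpath.split(key)
    stripped = base.lstrip("_.")
    if not stripped or stripped == base:
        return key
    return posixpath.join(head, stripped) if head else stripped


def rename_hidden_objects(s3_client, bucket: str, prefix: str = "") -> list:
    """Copy each hidden-looking object under `prefix` to a visible key, then delete it.

    `s3_client` is assumed to be a boto3 S3 client. Caveats: an existing
    object at the new key would be overwritten, and a single copy_object
    call only handles objects up to 5 GB.
    """
    renamed = []
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            old_key = obj["Key"]
            new_key = visible_key(old_key)
            if new_key == old_key:
                continue  # basename does not start with '_' or '.'
            s3_client.copy_object(
                Bucket=bucket,
                Key=new_key,
                CopySource={"Bucket": bucket, "Key": old_key},
            )
            s3_client.delete_object(Bucket=bucket, Key=old_key)
            renamed.append((old_key, new_key))
    return renamed
```

After renaming, the crawler or job would need to be re-run so that the new object names are picked up.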

answered a year ago
  • It's really by design: there is a convention coming from Hadoop that files whose names start with "_" or "." are considered internal (e.g. metadata) and do not contain data.
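To check ahead of time which objects would be skipped under this convention, the rule can be modeled in a few lines of Python. This is a simplified illustration of the convention as described in this thread, not the code Glue or Hadoop actually runs, and the names `is_hidden` and `split_visible` are hypothetical.

```python
import posixpath

# Prefixes that mark a file as internal under the Hadoop-style convention.
HIDDEN_PREFIXES = ("_", ".")


def is_hidden(path: str) -> bool:
    """True if the last path component starts with '_' or '.'."""
    name = posixpath.basename(path.rstrip("/"))
    return name.startswith(HIDDEN_PREFIXES)


def split_visible(paths):
    """Partition paths into (visible, hidden) lists, preserving order."""
    visible = [p for p in paths if not is_hidden(p)]
    hidden = [p for p in paths if is_hidden(p)]
    return visible, hidden


if __name__ == "__main__":
    keys = ["data/something", "data/_something", "data/.staging", "data/something_else"]
    visible, hidden = split_visible(keys)
    print(visible)  # ['data/something', 'data/something_else']
    print(hidden)   # ['data/_something', 'data/.staging']
```

This matches the behavior reported in the question: _something is skipped, while something and something_else are read.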
