I have JSON data stored on S3, over which I have created Glue tables. The data is partitioned, and I use Glue crawlers to update the table partitions. I then load this data as a Glue DynamicFrame within a Glue job. I am using Glue 3.0 and creating the DynamicFrame with the GlueContext.create_dynamic_frame.from_catalog method of the awsglue package.
This works nicely for most of my data, but for some reason the DynamicFrame is unable to load objects whose basename in S3 begins with an underscore. I was able to reproduce this using two identical objects, one named something and the other named _something. The _something object was not read into the DynamicFrame, but after renaming it to something_else (no underscore at the beginning) it was read successfully.
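The rename workaround described above could be planned in bulk along these lines. This is only a sketch: visible_key and plan_renames are hypothetical helper names, the keys are made up, and it only computes the old-to-new mapping. S3 has no rename operation, so applying the plan would still need a copy followed by a delete, which is left out here.

```python
import posixpath


def visible_key(key: str) -> str:
    """Return the key with leading '_' and '.' stripped from its basename."""
    prefix, name = posixpath.split(key)
    cleaned = name.lstrip("_.")
    if not cleaned:
        raise ValueError(f"nothing left of the basename in {key!r}")
    return posixpath.join(prefix, cleaned)


def plan_renames(keys):
    """Map each key with a hidden-looking basename to a visible replacement.

    Raises ValueError if a replacement would collide with an existing key
    or with another planned replacement.
    """
    existing = set(keys)
    plan = {}
    for key in keys:
        new_key = visible_key(key)
        if new_key == key:
            continue  # basename is already visible, nothing to do
        if new_key in existing or new_key in plan.values():
            raise ValueError(f"{key!r} would collide with {new_key!r}")
        plan[key] = new_key
    return plan


# Hypothetical keys, for illustration only.
print(plan_renames(["data/year=2022/a.json", "data/year=2022/_b.json"]))
```

Keys whose cleaned name already exists are rejected rather than overwritten, which is why the reproduction above needed a different name (something_else) for the renamed object.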
Some cursory googling tells me that this may be a Presto "feature". Athena uses Presto under the hood, and Glue uses Athena under the hood(?). Starting from Presto version 0.60, Presto ignores files whose names start with an underscore (_) or a dot (.). There is an org.apache.hadoop.hive.common.FileUtils.HIDDEN_FILES_PATH_FILTER property, but it's unclear to me whether this is something I can configure in Glue.
This is really by design: there is a convention coming from Hadoop that files whose names start with "_" or "." are considered internal (e.g. metadata) and are not treated as containing data.
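To see which of your objects would be affected, the convention can be mimicked in a few lines of Python. This is a minimal sketch of the rule as stated above, not the actual Hadoop or Glue implementation; is_hidden and visible_keys are hypothetical names, and the check looks only at the final path component.

```python
import posixpath

# Prefixes treated as "hidden" under the Hadoop convention described above.
HIDDEN_PREFIXES = ("_", ".")


def is_hidden(key: str) -> bool:
    """True if the last path component starts with '_' or '.'."""
    name = posixpath.basename(key.rstrip("/"))
    return name.startswith(HIDDEN_PREFIXES)


def visible_keys(keys):
    """Keep only the keys a reader following the convention would load."""
    return [k for k in keys if not is_hidden(k)]


# Hypothetical object keys, for illustration only.
keys = ["data/something", "data/_something", "data/.tmp", "data/something_else"]
print(visible_keys(keys))
```

Running a check like this over a listing of the table's S3 prefix would show which objects are silently skipped, such as the _something object in the question.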