Can't access data in subdirectories for partitioned Athena table


I have an Athena table with a partition for each day, where the actual files are in "sub-directories" by hour, as follows:

s3://my-bucket/data/2019/06/27/00/00001.json
s3://my-bucket/data/2019/06/27/00/00002.json
s3://my-bucket/data/2019/06/27/01/00001.json
s3://my-bucket/data/2019/06/27/01/00002.json

Athena can query this table and finds my data without issue, but AWS Glue does not appear to find the same data.

ALTER TABLE mytable ADD 
PARTITION (year=2019, month=06, day=27) LOCATION 's3://my-bucket/data/2019/06/27/01';

select day, count(*)
from mytable
group by day;

day    count
27     145431

Below are the partition properties in Glue. I was hoping that the storedAsSubDirectories setting would tell it to iterate the sub-directories, but this does not appear to be the case:

{
	"StorageDescriptor": {
		"cols": {
			"FieldSchema": [
				{
					"name": "userid",
					"type": "string",
					"comment": ""
				},
				{
					"name": "labels",
					"type": "array<string>",
					"comment": ""
				}
			]
		},
		"location": "s3://my-bucket/data/2019/06/27/01/",
		"inputFormat": "org.apache.hadoop.mapred.TextInputFormat",
		"outputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
		"compressed": "false",
		"numBuckets": "0",
		"SerDeInfo": {
			"name": "JsonSerDe",
			"serializationLib": "org.openx.data.jsonserde.JsonSerDe",
			"parameters": {
				"serialization.format": "1"
			}
		},
		"bucketCols": [],
		"sortCols": [],
		"parameters": {},
		"SkewedInfo": {
			"skewedColNames": [],
			"skewedColValues": [],
			"skewedColValueLocationMaps": {}
		},
		"storedAsSubDirectories": "true"
	},
	"parameters": {}
}

When Glue runs against this same partition/table, it finds 0 rows.

However, if all the data files appear in the root "directory" of the partition (i.e. s3://my-bucket/data/2019/06/27/00001.json), then both Athena and Glue can find the data.
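One way to picture the behaviour described above is the difference between listing only the objects directly under the partition location and listing everything beneath it. The sketch below is a simplified illustration in plain Python, not Glue's actual listing logic; the helper names `direct_children` and `all_descendants` are hypothetical, and it assumes the partition location is the day-level prefix.

```python
def direct_children(keys, prefix):
    """Return keys that sit directly under prefix, with no further '/' below it."""
    prefix = prefix.rstrip("/") + "/"
    return [k for k in keys if k.startswith(prefix) and "/" not in k[len(prefix):]]


def all_descendants(keys, prefix):
    """Return every key under prefix, at any depth."""
    prefix = prefix.rstrip("/") + "/"
    return [k for k in keys if k.startswith(prefix)]


# Object keys from the question, relative to the bucket.
keys = [
    "data/2019/06/27/00/00001.json",
    "data/2019/06/27/00/00002.json",
    "data/2019/06/27/01/00001.json",
    "data/2019/06/27/01/00002.json",
]
partition = "data/2019/06/27"  # assumed day-level partition location

print(len(direct_children(keys, partition)))  # 0: every file is one level deeper
print(len(all_descendants(keys, partition)))  # 4: a recursive listing sees them all
```

With the files moved up to the partition root (for example `data/2019/06/27/00001.json`), both functions would return them, which matches the observation that both services then find the data.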

Is there some reason why Glue is unable to find the data files? I'd prefer not to create a partition for each hour, since that would mean about 8,760 partitions per year (and Athena has a limit of 20,000 partitions per table).
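For reference, the partition arithmetic behind that concern can be worked out as follows. This is only a back-of-the-envelope sketch; the 20,000 figure is the per-table limit quoted in the question, and the variable names are illustrative.

```python
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365
PARTITION_LIMIT = 20_000  # per-table limit as stated in the question

hourly_per_year = HOURS_PER_DAY * DAYS_PER_YEAR  # one partition per hour
daily_per_year = DAYS_PER_YEAR                   # one partition per day

years_with_hourly = PARTITION_LIMIT / hourly_per_year
years_with_daily = PARTITION_LIMIT / daily_per_year

print(hourly_per_year)              # 8760
print(round(years_with_hourly, 2))  # 2.28
print(round(years_with_daily, 2))   # 54.79
```

Hourly partitions would reach the stated limit in a little over two years, whereas daily partitions would last for decades.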

ebuck
asked 5 years ago, 870 views
1 Answer

Apparently there's an undocumented additional option on create_dynamic_frame for "recurse":

additional_options = {"recurse": True}

Example:

athena_datasource = glueContext.create_dynamic_frame.from_catalog(
    database=target_database,
    table_name=target_table,
    push_down_predicate="(year=='2019' and month=='06' and day=='27')",
    transformation_ctx="athena_datasource",
    additional_options={"recurse": True},
)
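If the job processes a different day on each run, the predicate string in the example can be built rather than hard-coded. The helper below is a hypothetical sketch (`partition_predicate` is not part of the Glue API); it assumes the partition values are stored as zero-padded strings, as in the example above.

```python
def partition_predicate(year, month, day):
    """Build a push-down predicate for one daily partition.

    Assumes year/month/day partition values are zero-padded strings.
    """
    return f"(year=='{year:04d}' and month=='{month:02d}' and day=='{day:02d}')"


print(partition_predicate(2019, 6, 27))
# (year=='2019' and month=='06' and day=='27')
```

The resulting string can be passed as the push_down_predicate argument in the call shown above.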
ebuck
answered 5 years ago