Can't get Partitions to work with my Glue Data Catalog

1

I have S3 files that are uploaded to a single bucket. There is no folders or anything like that, its just 1 file per hour uploaded to this bucket. I run a Glue ETL job on these files, do some transformations, and insert the data into a Glue Data Catalog stored in a different bucket. I can then query that Glue Data Catalog with Athena, and that works.

What I would like to do is store the files in the S3 folder of the Data Catalog as YEAR/MONTH/DAY, using partitions. Even though the SOURCE data is just files uploaded every hour with no partitions, I want to store them in the Data Catalog WITH partitions. So I extracted the YEAR, MONTH, DAY from the files during Glue ETL, and created columns in my Data Catalog table accordingly and marked them as partitions:

Partition 1 YEAR Partition 2 MONTH Partition 3 DAY

The proper values are in these columns, and I have verified that.

After creating the partitions I rand MSCK REPAIR TABLE on the table, and it came back with "Query Ok."

I then ran my Glue ETL job. When I look in the S3 bucket I do not see folders created. I just see regular r-part files. When I click on the Table Schema it shows the columns YEAR, MONTH, DAY marked as partitions, but when I click on View Partitions it just shows:

year month day No partitions found

What do I need to do? These are just CSV files. I can't control the process that is uploading the raw data to S3, it is just going to store hourly files in a bucket. I can control the ETL job and the Data Catalog. When I try to query after creating the partitions and running MSCK REPAIR TABLE, there is no data returned. Yet I can go into the Data Catalog bucket and pull up one of the r-part files and the data is there.

bfeeny
gefragt vor 2 Jahren2306 Aufrufe
1 Antwort
1
Akzeptierte Antwort

I believe I got this figured out. I was missing a step, where I had to add additionalOptions to my sink. As explained here: https://medium.com/analytics-vidhya/add-new-partitions-in-aws-glue-data-catalog-from-aws-glue-job-79b0442b17af

# parameter "enableUpdateCatalog" tells the aws glue job to update #the glue data catalog  as the new partitions are created
additionalOptions = {"enableUpdateCatalog": True}
# define the partition keys 
additionalOptions["partitionKeys"] = ["year", "month","day"]

After doing this, I saw that the partitions (nested directories) were being created. Now I am having some other isuses with querying, but I will post that as a separate question on here.

bfeeny
beantwortet vor 2 Jahren

Du bist nicht angemeldet. Anmelden um eine Antwort zu veröffentlichen.

Eine gute Antwort beantwortet die Frage klar, gibt konstruktives Feedback und fördert die berufliche Weiterentwicklung des Fragenstellers.

Richtlinien für die Beantwortung von Fragen