Can't get Partitions to work with my Glue Data Catalog

1

I have S3 files that are uploaded to a single bucket. There is no folders or anything like that, its just 1 file per hour uploaded to this bucket. I run a Glue ETL job on these files, do some transformations, and insert the data into a Glue Data Catalog stored in a different bucket. I can then query that Glue Data Catalog with Athena, and that works.

What I would like to do is store the files in the S3 folder of the Data Catalog as YEAR/MONTH/DAY, using partitions. Even though the SOURCE data is just files uploaded every hour with no partitions, I want to store them in the Data Catalog WITH partitions. So I extracted the YEAR, MONTH, DAY from the files during Glue ETL, and created columns in my Data Catalog table accordingly and marked them as partitions:

Partition 1 YEAR Partition 2 MONTH Partition 3 DAY

The proper values are in these columns, and I have verified that.

After creating the partitions I rand MSCK REPAIR TABLE on the table, and it came back with "Query Ok."

I then ran my Glue ETL job. When I look in the S3 bucket I do not see folders created. I just see regular r-part files. When I click on the Table Schema it shows the columns YEAR, MONTH, DAY marked as partitions, but when I click on View Partitions it just shows:

year month day No partitions found

What do I need to do? These are just CSV files. I can't control the process that is uploading the raw data to S3, it is just going to store hourly files in a bucket. I can control the ETL job and the Data Catalog. When I try to query after creating the partitions and running MSCK REPAIR TABLE, there is no data returned. Yet I can go into the Data Catalog bucket and pull up one of the r-part files and the data is there.

bfeeny
asked 2 years ago2143 views
1 Answer
1
Accepted Answer

I believe I got this figured out. I was missing a step, where I had to add additionalOptions to my sink. As explained here: https://medium.com/analytics-vidhya/add-new-partitions-in-aws-glue-data-catalog-from-aws-glue-job-79b0442b17af

# parameter "enableUpdateCatalog" tells the aws glue job to update #the glue data catalog  as the new partitions are created
additionalOptions = {"enableUpdateCatalog": True}
# define the partition keys 
additionalOptions["partitionKeys"] = ["year", "month","day"]

After doing this, I saw that the partitions (nested directories) were being created. Now I am having some other isuses with querying, but I will post that as a separate question on here.

bfeeny
answered 2 years ago

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions