Can't get Partitions to work with my Glue Data Catalog
I have S3 files that are uploaded to a single bucket. There is no folders or anything like that, its just 1 file per hour uploaded to this bucket. I run a Glue ETL job on these files, do some transformations, and insert the data into a Glue Data Catalog stored in a different bucket. I can then query that Glue Data Catalog with Athena, and that works.
What I would like to do is store the files in the S3 folder of the Data Catalog as YEAR/MONTH/DAY, using partitions. Even though the SOURCE data is just files uploaded every hour with no partitions, I want to store them in the Data Catalog WITH partitions. So I extracted the YEAR, MONTH, DAY from the files during Glue ETL, and created columns in my Data Catalog table accordingly and marked them as partitions:
Partition 1 YEAR Partition 2 MONTH Partition 3 DAY
The proper values are in these columns, and I have verified that.
After creating the partitions I rand MSCK REPAIR TABLE on the table, and it came back with "Query Ok."
I then ran my Glue ETL job. When I look in the S3 bucket I do not see folders created. I just see regular r-part files. When I click on the Table Schema it shows the columns YEAR, MONTH, DAY marked as partitions, but when I click on View Partitions it just shows:
year month day No partitions found
What do I need to do? These are just CSV files. I can't control the process that is uploading the raw data to S3, it is just going to store hourly files in a bucket. I can control the ETL job and the Data Catalog. When I try to query after creating the partitions and running MSCK REPAIR TABLE, there is no data returned. Yet I can go into the Data Catalog bucket and pull up one of the r-part files and the data is there.
I believe I got this figured out. I was missing a step, where I had to add additionalOptions to my sink. As explained here: https://medium.com/analytics-vidhya/add-new-partitions-in-aws-glue-data-catalog-from-aws-glue-job-79b0442b17af
# parameter "enableUpdateCatalog" tells the aws glue job to update #the glue data catalog as the new partitions are created
additionalOptions = {"enableUpdateCatalog": True}
# define the partition keys
additionalOptions["partitionKeys"] = ["year", "month","day"]
After doing this, I saw that the partitions (nested directories) were being created. Now I am having some other isuses with querying, but I will post that as a separate question on here.
Relevant questions
Can't get Partitions to work with my Glue Data Catalog
Accepted Answerasked a month agoHow to merge aws data pipeline output files into a single file?
asked 3 months agoAccess denied when trying to GET objects uploaded to s3 bucket via aws sdk using cloudfront
asked 7 months agoObjects Not Visible Within S3 Bucket for GroundTruth Labeling Job
asked 3 months agoUploading a Dataframe to AWS S3 Bucket from SageMaker
Accepted Answerasked 3 years agobulk move files from root of an S3 bucket to a folder (but not include other folders)
Accepted Answerasked 6 months agoGlue ETL job write part-r-00 files to same bucket as my input. Any way to change this?
Accepted Answerasked 3 months agoCleaning a bucket with purge_s3_path with Glue console.
Accepted Answerasked 7 months agoNew to AWS, my lifecycle rule seems to not be applied (Versioned bucket)
asked a year agoI need to read S3 data, transform and put into Data Catalog. Should I be using a Crawler?
Accepted Answerasked 3 months ago