AWS Glue: JSON to Parquet, keep JSON partitioning


I plan on writing an ETL job that converts S3 JSON files such as: s3://sample_bucket/logs/year=2023/month=08/day=03/hour=00/example_log.log.gz

However, the JSON schema itself does not contain the year, month, day, and hour columns.

If I convert the DynamicFrame to Parquet, how can I keep this partitioning?

Into parquet: s3://sample_bucket/logs_parquet/year=2023/month=08/day=03/hour=00/example_log.snappy.parquet

When I try something like this:

outputDF = glueContext.write_dynamic_frame.from_options(
    frame=inputDyf,
    connection_type="s3",
    connection_options={"path": output_loc, "partitionKeys": ["year", "month"]},
    format="parquet")

It complains: AnalysisException: Partition column year not found in schema

How can I keep the same partitioning used in my source JSON bucket in the resulting Parquet bucket?

asked 8 months ago
1 Answer

Hi Kevin,

To partition the resulting dataset, the partition columns must be present in your DataFrame.

There are two easy ways to get them:

  1. Create them if you have a date or timestamp column present;
  2. Catalog your JSON data with a Glue Crawler and the partitions will automatically be added as columns.

To create these columns from a date or timestamp column you can leverage PySpark functions; it's as simple as this:

import pyspark.sql.functions as F

df = myDynamicFrame.toDF()

# If your date or timestamp strings are not in ISO format you will need to parse them first;
# the second argument of to_timestamp and to_date is the format of the input string
df = (
    df
    .withColumn(
        "mytimestampcolumn",
        F.to_timestamp(F.col("mytimestampcolumn"), "yyyy/MM/dd HH:mm:ss.SSSSSSSSS")
    )
    .withColumn(
        "mydatecolumn",
        F.to_date(F.col("mydatecolumn"), "yyyy/MM/dd")
    )
)

df = (
    df
    .withColumn("year", F.year(F.col("mydateortimestampcol")))
    .withColumn("month", F.month(F.col("mydateortimestampcol")))
)

If you need it, the Spark datetime pattern guide is here: https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
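
Once those columns exist, you can derive day and hour the same way, convert back to a DynamicFrame, and write partitioned Parquet. A minimal sketch, assuming the placeholder column name mydateortimestampcol from above and the output path from your question:

from awsglue.dynamicframe import DynamicFrame
import pyspark.sql.functions as F

# Derive the remaining partition columns (column name is a placeholder)
df = (
    df
    .withColumn("day", F.dayofmonth(F.col("mydateortimestampcol")))
    .withColumn("hour", F.hour(F.col("mydateortimestampcol")))
)

# Convert back to a DynamicFrame and write Parquet partitioned by the new columns
outputDyf = DynamicFrame.fromDF(df, glueContext, "outputDyf")
glueContext.write_dynamic_frame.from_options(
    frame=outputDyf,
    connection_type="s3",
    connection_options={
        "path": "s3://sample_bucket/logs_parquet/",
        "partitionKeys": ["year", "month", "day", "hour"],
    },
    format="parquet",
)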

The second option requires you to catalog your data first and read it from the catalog rather than from S3 directly. Go to "Crawlers" in the Glue console, create a crawler using the creation wizard, and then run it. If everything goes right, your data will be mapped into a Glue table; when you read from that table, all of your partitions become columns.
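
A minimal sketch of reading the crawled table, assuming the crawler created a database named sample_db and a table named logs (both names are placeholders):

# Read from the Glue Data Catalog; partition values become regular columns
inputDyf = glueContext.create_dynamic_frame.from_catalog(
    database="sample_db",
    table_name="logs",
)
# year, month, day, and hour can now be used as partitionKeys when writing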

Hope this helps you. Let me know if you have any further questions.

answered 8 months ago
  • Thanks for the answer; I'm curious about the performance of this approach.

    Ideally I want to read the JSON files in groups based on partition, convert them to parquet, and then write the parquet for that group.

    If I combine all the JSON, repartition the data all over again (even though the data was originally organized), and then write to parquet, I'm afraid that this won't be as efficient and will duplicate a lot of work.

    Any advice?
