
Generating Parquet files from Glue Data Catalog


I have a Glue job that writes to a Data Catalog table. I originally set the table up as CSV, and everything works fine. Now I would like to try using Parquet instead. I thought I would just have to re-create the table and select Parquet instead of CSV, so I did this:

CREATE EXTERNAL TABLE `gp550_load_database_beta.gp550_load_table_beta`(
  `vid` string,
  `altid` string,
  `vtype` string,
  `time` timestamp,
  `timegmt` timestamp,
  `value` float,
  `filename` string)
PARTITIONED BY (
  `year` int,
  `month` int,
  `day` int)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://ds905-load-forecast/data_store_beta/'
TBLPROPERTIES (
  'classification'='parquet')

I left my Glue job unchanged. It sends its output to the Data Catalog table like so:

    additionalOptions = {"enableUpdateCatalog": True, "updateBehavior": "LOG"}
    additionalOptions["partitionKeys"] = ["year", "month", "day"]

    # Data Catalog WRITE
    DataCatalogtable_node2 = glueContext.write_dynamic_frame.from_catalog(
        frame=dynamicDF,
        database=db_name,
        table_name=tbl_name,
        additional_options=additionalOptions,
        transformation_ctx="DataCatalogtable_node2",
    )

When I check the files being written to s3://ds905-load-forecast/data_store_beta/, they still look like plain CSV. What do I need to do to get Parquet output? Can I just change the sink to use glueContext.write_dynamic_frame.from_options()?
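For reference, this is roughly the from_options variant I was considering. It's an untested sketch using my existing path and partition keys; the helper function is just something I added to keep the options in one place, not part of the Glue API:

```python
def parquet_sink_options(path, partition_keys):
    """Build the connection_options dict for an S3 sink (helper of my own,
    not a Glue API call)."""
    return {
        "path": path,
        "partitionKeys": partition_keys,
    }

# Inside the Glue job (requires the awsglue runtime), the sink would become
# something like:
#
# glueContext.write_dynamic_frame.from_options(
#     frame=dynamicDF,
#     connection_type="s3",
#     connection_options=parquet_sink_options(
#         "s3://ds905-load-forecast/data_store_beta/",
#         ["year", "month", "day"],
#     ),
#     format="parquet",
#     transformation_ctx="DataCatalogtable_node2",
# )
```

My understanding is that with from_options the output format comes from the `format` argument rather than the table definition, but I'd like confirmation that this is the right approach, and whether the catalog table then still gets updated.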