I have a Glue job that writes to a Data Catalog table. I originally set the table up as CSV, and everything works fine. Now I would like to try Parquet for the Data Catalog table. I thought I would just have to re-create the table and select Parquet instead of CSV, so I did that like this:
CREATE EXTERNAL TABLE `gp550_load_database_beta.gp550_load_table_beta`(
  `vid` string,
  `altid` string,
  `vtype` string,
  `time` timestamp,
  `timegmt` timestamp,
  `value` float,
  `filename` string)
PARTITIONED BY (
  `year` int,
  `month` int,
  `day` int)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://ds905-load-forecast/data_store_beta/'
TBLPROPERTIES (
  'classification'='parquet')
I left my Glue job unchanged. It sends its output to the Data Catalog table like so:
additionalOptions = {"enableUpdateCatalog": True, "updateBehavior": "LOG"}
additionalOptions["partitionKeys"] = ["year", "month", "day"]
# Data Catalog WRITE
DataCatalogtable_node2 = glueContext.write_dynamic_frame.from_catalog(
    frame=dynamicDF,
    database=db_name,
    table_name=tbl_name,
    additional_options=additionalOptions,
    transformation_ctx="DataCatalogtable_node2",
)
When I checked the files being created in s3://ds905-load-forecast/data_store_beta/, they appear to just be CSV files. What do I need to do to get Parquet output? Can I just change the sink routine to use glueContext.write_dynamic_frame.from_options()?
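If so, would it be something like the untested sketch below? The node name S3ParquetSink_node is just a placeholder for my current node, and I'm not sure whether format should be "parquet" or "glueparquet":

# Untested sketch: write the DynamicFrame straight to S3 as Parquet via from_options,
# instead of picking up the format from the catalog table definition.
S3ParquetSink_node = glueContext.write_dynamic_frame.from_options(
    frame=dynamicDF,
    connection_type="s3",
    connection_options={
        "path": "s3://ds905-load-forecast/data_store_beta/",
        "partitionKeys": ["year", "month", "day"],
    },
    format="glueparquet",  # or format="parquet"? unsure which is right here
    transformation_ctx="S3ParquetSink_node",
)

And if I go that route, will the Data Catalog table and its partitions still get updated, or do I lose the enableUpdateCatalog behavior I had with from_catalog()?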