Athena error while running UNLOAD to PARQUET query using column names with spaces in them -- GENERIC_INTERNAL_ERROR: field ended by ';': expected ';'

Introduction

I have a table in Athena with the following column names: ["column space 1", "column space 2"]. I cannot run an UNLOAD to Parquet query on a table whose column names contain spaces.

The documentation states that one can use non-standard column names by wrapping them in double quotes ("") -- backticks are no longer supported. This works for standard SELECT queries, but not for UNLOAD or CTAS queries.

I cannot change the column names -- I must be able to support column names with spaces in them.

The table was created using Glue with the following parameters:

{
    'Name': 'dummy_table',
    'Parameters': {'classification': 'parquet'},
    'StorageDescriptor': {
        'Columns': [
            {'Name': 'column space 1', 'Type': 'string'},
            {'Name': 'column space 2', 'Type': 'string'}
        ],
        'InputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat',
        'Location': '<input_parquet_location>',
        'OutputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat',
        'SerdeInfo': {
            'Parameters': {'serialization.format': '1'},
            'SerializationLibrary': 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
        }
    },
    'TableType': 'EXTERNAL_TABLE'
}

The Problem

I can query this table fine using the Athena Query Editor. For example, SELECT "column space 1" FROM "AwsDataCatalog"."<database_name>"."dummy_table" LIMIT 10; returns the correct result.

However, I receive an error when I try to run the following UNLOAD query to dump the table into a PARQUET file:

UNLOAD (SELECT "column space 1" FROM "AwsDataCatalog"."<database_name>"."dummy_table") TO '<output_s3_location>' WITH (format = 'PARQUET', compression = 'SNAPPY')

Error: GENERIC_INTERNAL_ERROR: field ended by ';': expected ';' but got 'space' at line 1: optional binary column space

Things I've Tried

I can successfully run the same UNLOAD query using the ORC, CSV, and TEXTFILE formats, e.g.

UNLOAD (SELECT "column space 1" FROM "AwsDataCatalog"."<database_name>"."dummy_table") TO '<output_s3_location>' WITH (format = 'ORC', compression = 'SNAPPY')

I've tried a CREATE TABLE AS query, but I am faced with the same error: GENERIC_INTERNAL_ERROR: field ended by ';': expected ';' but got 'space' at line 1

CREATE TABLE "{database_name}"."{table_name}_unload"
WITH (
    format = 'Parquet',
    table_type = 'HIVE',
    external_location = '{output_parquet_location}',
    is_external = true,
    write_compression = 'SNAPPY'
) AS
SELECT "column space 1" FROM "{database_name}"."{table_name}"

My question

It seems there is a problem specifically with running an UNLOAD to Parquet query on column names with spaces in them.

Is this a bug or expected behaviour? I can see no reference to it in the documentation, and I would be surprised if non-standard column names were not supported for the specific case of unloading to Parquet but were supported with ORC, CSV, and TEXTFILE.

Is there any way that I can UNLOAD a table that contains non-standard column names to a PARQUET file?

I'd be very grateful for any help or guidance, Toby

toby
asked a month ago, 578 views
1 Answer

The Parquet file format has more strict requirements around column names compared to formats like ORC, CSV or text files.

You can try the following:

  • Rename the columns in your table to remove any spaces before unloading to Parquet.
  • Use a different file format like ORC, CSV or text files that may support non-standard column names.
  • You could unload the data to S3 in a non-Parquet format, then use another tool like Spark to read from S3 and write the data back out as a new Parquet file after renaming the columns appropriately (a rough sketch of this approach follows below).
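
For example, here is a rough sketch of that last option using pandas (pandas stands in here for Spark; the S3 paths are placeholders, and reading from and writing to S3 assumes s3fs and pyarrow are installed). Note that UNLOAD's CSV/TEXTFILE output typically does not include a header row, so the column names are passed explicitly:

import pandas as pd

# Placeholder paths -- substitute the actual UNLOAD output file and the desired Parquet destination
unload_output = "s3://<output_s3_location>/<unload_output_file>"
parquet_output = "s3://<parquet_output_location>/dummy_table.parquet"

# UNLOAD's CSV output typically has no header row, so supply the column names explicitly
df = pd.read_csv(unload_output, header=None, names=["column space 1"])

# Rename to Parquet/Athena-friendly names, e.g. "column space 1" -> "column_space_1"
df.columns = [c.replace(" ", "_") for c in df.columns]

df.to_parquet(parquet_output, index=False)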

Unfortunately there doesn't seem to be a way currently to directly unload a table with non-standard column names to Parquet format from Athena. The column names need to adhere to the stricter naming requirements of the Parquet schema. Renaming columns is likely required as a workaround.

EXPERT
answered a month ago
  • Many thanks for your answer.

    I am not sure it is correct that the Parquet file format has strict requirements around non-standard column names; I believe it is a limitation enforced by Athena rather than by the Parquet file format.

    As evidence to back this up, I can use pandas to create a Parquet file that contains column names with spaces, e.g. df.to_parquet("file.parquet"), and the resulting file can be read by other tools such as Dask and Glue.
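
    A minimal snippet illustrating this (assuming pandas with pyarrow installed; the file name is just an example):

    import pandas as pd

    df = pd.DataFrame({"column space 1": ["a", "b"], "column space 2": ["c", "d"]})
    df.to_parquet("file.parquet")  # writes without error despite the spaces in the names
    print(pd.read_parquet("file.parquet").columns.tolist())
    # -> ['column space 1', 'column space 2']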
