[Glue Crawler, Parquet] - Crawler infers timestamp[ns] as bigint


I'm using Pandas to create a DataFrame with a single column of dates, which I then save as a Parquet file to an S3 bucket. Here's the code snippet I'm using:

import pandas as pd

df = pd.DataFrame({'date': pd.date_range('2022-10-01', periods=10)})
df.to_parquet('s3://<bucket>/data/athena/test/test.parquet', version='2.6')

After uploading the Parquet file to S3, I run a Glue crawler to populate the schema in the Data Catalog. However, I've noticed that the data type of the date column is inferred as bigint rather than a timestamp type.

Interestingly, when I explicitly convert the column to timestamp[us] before saving the DataFrame, the crawler correctly identifies the column as a timestamp type, and everything works as expected.

However, I'm curious why this adjustment is necessary. Is there a known limitation with timestamp[ns] data types in Parquet files, or with how the crawler interprets them? How can I ensure that my timestamp[ns] columns are correctly identified as timestamp types without needing to convert them to timestamp[us]?
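
For reference, the explicit conversion I'm referring to is along these lines (same DataFrame and bucket path as above):

# Workaround: downcast from pandas' default nanosecond precision to microseconds
df['date'] = df['date'].astype('datetime64[us]')
df.to_parquet('s3://<bucket>/data/athena/test/test.parquet', version='2.6')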

Thank you for your help!

Asked 3 months ago · 418 views
1 Answer

There is a known limitation where the Glue crawler and Athena may not correctly recognize timestamp columns in Parquet files written with nanosecond precision.

When writing the DataFrame to Parquet, Pandas uses nanosecond-resolution timestamps by default; with format version 2.6 these are stored as int64 columns annotated with the TIMESTAMP(NANOS) logical type. However, some data catalogs and query engines do not recognize nanosecond resolution and fall back to the physical type, which is why the crawler reports bigint. Explicitly converting the timestamp column to microsecond resolution before writing to Parquet ensures the data type will be correctly identified.

# Downcast from nanosecond to microsecond precision so the column is
# written with the TIMESTAMP(MICROS) logical type
df['date'] = df['date'].astype('datetime64[us]')
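
Equivalently, if you write the file with pyarrow directly, the same downcast can be done at write time via the coerce_timestamps option of pyarrow.parquet.write_table. A minimal sketch (the bucket path is just the placeholder from the question):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.Table.from_pandas(df)

# Coerce nanosecond timestamps to microseconds at write time so the column
# is annotated as TIMESTAMP(MICROS) instead of TIMESTAMP(NANOS).
# allow_truncated_timestamps=True is only needed if sub-microsecond
# precision would otherwise be lost.
pq.write_table(
    table,
    's3://<bucket>/data/athena/test/test.parquet',
    version='2.6',
    coerce_timestamps='us',
    allow_truncated_timestamps=True,
)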

Another option is to set the appropriate timestamp metadata in the Parquet file itself at write time. Tools like Spark handle this automatically, but you may need to configure other systems, such as DMS, to do the same.

Answered 3 months ago
  • I used Parquet format version 2.6, which stores nanosecond-resolution timestamps as int64.

    import pyarrow.parquet as pq

    # Inspect the Parquet footer: physical type, logical type annotation,
    # and the pandas metadata stored by pyarrow
    metadata = pq.read_metadata('s3://<bucket>/data/athena/test/test.parquet')
    print(metadata.schema)
    print(metadata.schema.to_arrow_schema().pandas_metadata)

    The previous snippet prints the following output:

    <pyarrow._parquet.ParquetSchema object at 0x7f25a0bdd680>
    required group field_id=-1 schema {
      optional int64 field_id=-1 date (Timestamp(isAdjustedToUTC=false, timeUnit=nanoseconds, is_from_converted_type=false, force_set_converted_type=false));
    }
    
    {'index_columns': [{'kind': 'range',
       'name': None,
       'start': 0,
       'stop': 10,
       'step': 1}],
     'column_indexes': [{'name': None,
       'field_name': None,
       'pandas_type': 'unicode',
       'numpy_type': 'object',
       'metadata': {'encoding': 'UTF-8'}}],
     'columns': [{'name': 'date',
       'field_name': 'date',
       'pandas_type': 'datetime',
       'numpy_type': 'datetime64[ns]',
       'metadata': None}],
     'creator': {'library': 'pyarrow', 'version': '14.0.2'},
     'pandas_version': '2.1.4'}
    

    So it is still unclear to me why the crawler infers bigint for this column.
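
    To double-check the workaround, this is the kind of verification I would run (test_us.parquet is just a hypothetical second file name):

    import pandas as pd
    import pyarrow.parquet as pq

    # Rewrite the same data with microsecond precision and confirm the
    # logical type changes to Timestamp(..., timeUnit=microseconds)
    df = pd.DataFrame({'date': pd.date_range('2022-10-01', periods=10)})
    df['date'] = df['date'].astype('datetime64[us]')
    df.to_parquet('s3://<bucket>/data/athena/test/test_us.parquet', version='2.6')

    print(pq.read_metadata('s3://<bucket>/data/athena/test/test_us.parquet').schema)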
