HIVE_BAD_DATA: Error parsing field value '2021-10-31 22:00:00.0' for field 3: For input string: "2021-10-31 22:00:00.0"

0

I am reading files from S3 and using a Glue ETL job to populate a Data Catalog Table.

My S3 files look like this:

VID,ALTID,VTYPE,TIME,TIMEGMT,VALUE
ABC, ABC, DATAFEED,31-10-2021 22:00:00,01-11-2021 02:00:00,   11775.685

The scheme read in looks like this

root
|-- VID: string
|-- ALTID: string
|-- VTYPE: string
|-- TIME: string
|-- TIMEGMT: string
|-- VALUE: string

I am changing fields 3 and 4 from "strings" to timestamps, which matches the schema of my Data Catalog Table. I am also doing a few other transformations.

I am transforming them like so:

df = df.withColumn("time", to_timestamp("time", 'dd-MM-yyyy HH:mm:ss'))
df = df.withColumn("timegmt", to_timestamp("timegmt", 'dd-MM-yyyy HH:mm:ss'))

When I try to read the data with Athena, I get an error:

HIVE_BAD_DATA: Error parsing field value '2021-10-31 22:00:00.0' for field 3: For input string: "2021-10-31 22:00:00.0"

The Data Catalog Table Schema looks like this:

Column NameData Type
vidstring
altidstring
vtypestring
timetimestamp
timegmttimestamp
valueint
filenamestring

And the line in the run-1647806179090-part-r-00000 file it is choking on looks like this:

vid,altid,vtype,time,timegmt,value,filename
ABC,ABC, DATAFEED,"2021-10-31 22:00:00.0","2021-11-01 02:00:00.0",11775,"lf_buf_20211101_005904.csv"

Does anyone know why it would throw this error? I believe according to the documentation, this is the correct timestamp format.

bfeeny
asked 2 years ago3158 views
1 Answer
0
Accepted Answer

It turns out, the documentation for Athena is either incorrect or at best misleading. The excellent answer by Alexandre says it best here https://stackoverflow.com/questions/52564194/athena-unable-to-parse-date-using-opencsvserde

Basically you need to store the date or the timestamp in UNIX Epoch time. You wouldn't know that, because of all the emphasis on the format of the time. I tried as a timestamp, and that is why I got this error. As soon as I stored it as UNIX time I got somewhere. However, the unix_timestamp() function only returns time in seconds (long) and timestamp wants time in milliseconds (double). So I simply multiplied by 1000:

df = df.withColumn("time", f.unix_timestamp("time", 'dd-MM-yyyy HH:mm:ss') * 1000) After doing this, you will have a 13 digit double, and Athena will properly produce a timestamp from it if you have selected Timestamp as the Data Type.

bfeeny
answered 2 years ago

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions