I am trying to write an ETL job to the Data Catalog but its writing the Headers as Data

0

I am trying to read data from S3, transform it and insert it into a table I have created in the Data Catalog. My ETL job reads data with create_dynamic_frame, converts to a Spark DataFrame, does its transforms, converts back to a Dynamic DataFrame and then writes that frame to the Data Catalog.

I do the write to the Data Catalog table like so:

DataCatalogtable_node2 = glueContext.write_dynamic_frame.from_catalog(
    frame = dynamic_df,
    database = db_name,
    table_name = tbl_name,
    transformation_ctx = "DataCatalogtable_node2",
)

I manually defined my Data Catalog table

Column NameData Type
vidstring
altidstring
vtypestring
timetimestamp
timegmttimestamp
valueint
filenamestring

This matches the schema of my Dynamic DataFrame write before I do the write operation

root
|-- vid: string
|-- altid: string
|-- vtype: string
|-- time: timestamp
|-- timegmt: timestamp
|-- value: int
|-- filename: string

When I attempt to look at the Data in Athena after the ETL job runs, I get errors, because stored in "time" the first value is called "time". My timestamps look like and I think they are in the correct format.

vid,altid,vtype,time,timegmt,value,filename
ABC,ABC,DATAFEED,"2021-10-31 22:00:00.0","2021-11-01 02:00:00.0",11775,"s3://bpf-load-data/abc_data/lf_buf_20211101_005904.csv"

Originally (in s3) time is in the format 31-10-2021 22:00:00 and so in order to get it into the correct format that the Data Catalog/Athena use for timestamp I am doing:

df = df.withColumn("time", to_timestamp("time", 'dd-MM-yyyy HH:mm:ss'))

I am not sure if this is correct. I get 2021-10-31 22:00:00.0 instead of what I would suspect which is 2021-10-31 22:00:00

In any case, in Athena when I attempt to read the table, the actually first Data in time is the word time and therefore it does not match timestamp and it all errors out.

HIVE_BAD_DATA: Error parsing field value 'TIME' for field 3: For input string: "TIME"

It is almost as if it is trying to insert the "header" of the dataframe as actual data.

Does anyone know what could be going on?

bfeeny
gefragt vor 2 Jahren2683 Aufrufe
1 Antwort
1
Akzeptierte Antwort

I was able to figure this out. I was creating my Data Catalog table manually, and I needed to add the table property "skip.header.line.count = 1".

bfeeny
beantwortet vor 2 Jahren
profile picture
EXPERTE
überprüft vor einem Monat
AWS
EXPERTE
überprüft vor 2 Jahren

Du bist nicht angemeldet. Anmelden um eine Antwort zu veröffentlichen.

Eine gute Antwort beantwortet die Frage klar, gibt konstruktives Feedback und fördert die berufliche Weiterentwicklung des Fragenstellers.

Richtlinien für die Beantwortung von Fragen