I am trying to write an ETL job to the Data Catalog but it's writing the headers as data


I am trying to read data from S3, transform it, and insert it into a table I have created in the Data Catalog. My ETL job reads the data with create_dynamic_frame, converts it to a Spark DataFrame, does its transforms, converts back to a DynamicFrame, and then writes that frame to the Data Catalog.
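Roughly, the job flow looks like the sketch below. This is only an illustration of the shape of the job: the S3 path, the CSV format options, and the transform placeholder are assumptions, not the exact code.

import sys
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the source CSV files from S3 into a DynamicFrame (path is a placeholder)
source_dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://bpf-load-data/abc_data/"]},
    format="csv",
    format_options={"withHeader": True},
    transformation_ctx="source_dyf",
)

# Convert to a Spark DataFrame for the transforms, then back to a DynamicFrame
df = source_dyf.toDF()
# ... timestamp and other transforms happen here ...
dynamic_df = DynamicFrame.fromDF(df, glueContext, "dynamic_df")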

I do the write to the Data Catalog table like so:

DataCatalogtable_node2 = glueContext.write_dynamic_frame.from_catalog(
    frame = dynamic_df,
    database = db_name,
    table_name = tbl_name,
    transformation_ctx = "DataCatalogtable_node2",
)

I manually defined my Data Catalog table:

Column Name    Data Type
vid            string
altid          string
vtype          string
time           timestamp
timegmt        timestamp
value          int
filename       string

This matches the schema of my DynamicFrame before I do the write operation:

root
|-- vid: string
|-- altid: string
|-- vtype: string
|-- time: timestamp
|-- timegmt: timestamp
|-- value: int
|-- filename: string

When I attempt to look at the data in Athena after the ETL job runs, I get errors, because the first value stored in the "time" column is the word "time" itself. My timestamps look like this, and I think they are in the correct format:

vid,altid,vtype,time,timegmt,value,filename
ABC,ABC,DATAFEED,"2021-10-31 22:00:00.0","2021-11-01 02:00:00.0",11775,"s3://bpf-load-data/abc_data/lf_buf_20211101_005904.csv"

Originally (in S3), time is in the format 31-10-2021 22:00:00, so in order to get it into the correct format that the Data Catalog/Athena use for timestamp, I am doing:

from pyspark.sql.functions import to_timestamp
df = df.withColumn("time", to_timestamp("time", "dd-MM-yyyy HH:mm:ss"))

I am not sure if this is correct. I get 2021-10-31 22:00:00.0 instead of what I would expect, which is 2021-10-31 22:00:00.
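The trailing .0 appears to be just the fractional-seconds part of how the timestamp is rendered as a string; the column type itself is still timestamp. A quick standalone check of the parse, assuming only pyspark and using one sample value from the data above:

from pyspark.sql import SparkSession
from pyspark.sql.functions import to_timestamp

spark = SparkSession.builder.getOrCreate()

# One sample value in the original dd-MM-yyyy HH:mm:ss format
check_df = spark.createDataFrame([("31-10-2021 22:00:00",)], ["time"])
check_df = check_df.withColumn("time", to_timestamp("time", "dd-MM-yyyy HH:mm:ss"))

check_df.printSchema()           # time: timestamp
check_df.show(truncate=False)    # 2021-10-31 22:00:00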

In any case, when I attempt to read the table in Athena, the first actual value in time is the word "time", so it does not match the timestamp type and it all errors out.

HIVE_BAD_DATA: Error parsing field value 'TIME' for field 3: For input string: "TIME"

It is almost as if it is trying to insert the "header" of the DataFrame as actual data.

Does anyone know what could be going on?

bfeeny
Asked 2 years ago · 2,682 views
1 answer
Accepted Answer

I was able to figure this out. I was creating my Data Catalog table manually, and I needed to add the table property "skip.header.line.count = 1".
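The same property can also be set when creating the table through Athena DDL instead of the console. A sketch of that using boto3 follows; the database, table, location, delimiter, and results bucket are placeholders, and the row format would need to mirror however the table is actually defined.

import boto3

# Illustrative DDL only: names, location, and delimiter are placeholders and
# should match the existing manual table definition.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS my_db.my_table (
  vid string,
  altid string,
  vtype string,
  `time` timestamp,
  timegmt timestamp,
  value int,
  filename string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/my-table-prefix/'
TBLPROPERTIES ('skip.header.line.count' = '1')
"""

athena = boto3.client("athena")
athena.start_query_execution(
    QueryString=ddl,
    ResultConfiguration={"OutputLocation": "s3://my-athena-query-results/"},
)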

bfeeny
Answered 2 years ago
