AWS Glue Add Column Transformation Not being written to output data

Hello,

I am trying to add a column to a data set as part of a transformation in an AWS Glue job. I am developing my script locally using interactive sessions, and in my session I can see that the code I have written is adding the new column. However, when I write the data out and inspect the resulting files, the column is not there.

Here is a brief walkthrough of how I am implementing the add-column transformation:

  1. I am working with a dynamic frame assigned to a variable named ChangeSchema_node1703083178011.
  2. I convert the dynamic frame to a Spark DataFrame by calling the toDF method on the ChangeSchema_node1703083178011 variable and assign the result to the sparkDf variable.
  3. I then call the withColumn method on the sparkDf variable, passing in the new column name and a literal value, and assign the result to the dfNewColumn variable.
  4. I then convert back to a dynamic frame using the DynamicFrame.fromDF method and assign the result to the dyF variable, as shown below:
from pyspark.sql.functions import lit
from awsglue.dynamicframe import DynamicFrame

sparkDf = ChangeSchema_node1703083178011.toDF()  # DynamicFrame -> Spark DataFrame
dfNewColumn = sparkDf.withColumn("test_col", lit(None))  # add the new column as a null literal
dyF = DynamicFrame.fromDF(dfNewColumn, glueContext, "convert")  # back to a DynamicFrame
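
For reference, the check I run in the interactive session is essentially the following (a minimal sketch; show and printSchema are the standard DynamicFrame inspection methods):

dyF.printSchema()  # inspect what type, if any, was inferred for test_col
dyF.show(5)        # in my session, test_col shows up here with null values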

As I said, I can verify that the column has been added by calling the show method on the dyF variable and seeing the printed result in my interactive session. However, when I write the dynamic frame to produce my output data files with the following code:

AmazonS3_node1702921197069 = glueContext.write_dynamic_frame.from_options(
    frame=dyF,
    connection_type="s3",
    format="glueparquet",
    connection_options={
        "path": "s3://smedia-data-processing-dev/google/cron_name/",
        "partitionKeys": [],
    },
    format_options={"compression": "snappy"},
    transformation_ctx="AmazonS3_node1702921197069",
)

job.commit()

...the job runs successfully, but the resulting output data does not include the added column.

Interestingly, however, when I write an actual value to the added column instead of using None in this line

dfNewColumn = sparkDf.withColumn("test_col", lit(None))

such as a string value like "hello" as in the following:

dfNewColumn = sparkDf.withColumn("test_col", lit("hello"))

then the output data does include the added column.

Matt_J
Asked 5 months ago · 382 views
1 Answer
Accepted Answer

A DynamicFrame, as the name indicates, is dynamic and infers the schema on the fly from the data. If the rows only ever contain null in that column, it cannot infer the column's type, so the column ends up missing from the output.
Why don't you just do:

dfNewColumn.write.mode("append").parquet("s3://smedia-data-processing-dev/google/cron_name/")
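
If you do want to keep the DynamicFrame writer (for example to keep the transformation_ctx and job bookmark behavior), another option is to give the null literal an explicit type so the schema can carry the column. A minimal sketch of that idea, assuming test_col should be a string:

from pyspark.sql.functions import lit
from pyspark.sql.types import StringType
from awsglue.dynamicframe import DynamicFrame

# Cast the null literal to a concrete type so the column is StringType rather than NullType
dfNewColumn = sparkDf.withColumn("test_col", lit(None).cast(StringType()))
dyF = DynamicFrame.fromDF(dfNewColumn, glueContext, "convert")
# dyF can then be written with glueContext.write_dynamic_frame.from_options as in the question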
AWS
EXPERT
Answered 5 months ago
Reviewed 5 months ago
