Glue python repartition while retaining old partition column

I have data currently partitioned on a key (say 'cluster') and I'm repartitioning it on a new key, 'date'. So I do (in Python):

df = glueContext.create_dynamic_frame.from_options(...)
df = df.rename_field('cluster', 'old_cluster')
p_df = df.toDF().repartition(1, 'date')
p_d_df = DynamicFrame.fromDF(p_df, ...)

This works and I can get the 'date' value as a partition/column. However, I cannot see 'cluster' or 'old_cluster'. How can I retain the old cluster key? Thanks

AWS
Asked 4 months ago, 167 views
1 answer

When you rename a field, the old name no longer exists. To make a copy, declare a new column in the DataFrame that takes its value from the other column (while keeping the original):

df = df.withColumn("cluster", df['old_cluster'])  # notice df is a DataFrame

Also note that calling repartition there does not create a partition column; it only redistributes the rows in memory by that column (which is pointless with a single partition).
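
To make that distinction concrete, here is a minimal pure-Python model of a Hive-style partitioned layout. It is only a sketch and does not use the Spark or Glue API; `hive_layout` and the sample rows are hypothetical. The partition key moves into the directory name and is dropped from the stored rows, while every other column, including a regular 'cluster' column, stays in the data.

```python
from collections import defaultdict


def hive_layout(rows, partition_key):
    """Group rows into Hive-style directories such as 'date=2024-01-01'.

    Toy model only: the partition key goes into the directory name and is
    removed from the stored rows; all other columns are kept.
    """
    layout = defaultdict(list)
    for row in rows:
        directory = f"{partition_key}={row[partition_key]}"
        stored = {k: v for k, v in row.items() if k != partition_key}
        layout[directory].append(stored)
    return dict(layout)


# Hypothetical sample data: 'cluster' is an ordinary column here.
rows = [
    {"id": 1, "cluster": "a", "date": "2024-01-01"},
    {"id": 2, "cluster": "b", "date": "2024-01-01"},
    {"id": 3, "cluster": "a", "date": "2024-01-02"},
]

layout = hive_layout(rows, "date")
for directory, stored_rows in sorted(layout.items()):
    print(directory, stored_rows)
```

In PySpark this kind of on-disk layout comes from `DataFrameWriter.partitionBy`, and in Glue from the `partitionKeys` connection option when writing to S3, not from `repartition`. Which of the two applies depends on the write call that is elided in the question.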

AWS
Expert
Answered 4 months ago
  • Thanks, that did not work. Here is what I did:

    df = glueContext.create_dynamic_frame.from_options(...)
    d_df = df.rename_field('cluster', 'old_cluster').toDF()
    p_df = d_df.withColumn('cluster', d_df['old_cluster']).repartition(1, 'date')
    p_d_df = DynamicFrame.fromDF(p_df, ...)
    

    And I get the error "Error Category: QUERY_ERROR; AnalysisException: Cannot resolve column name "old_cluster" among (<all columns except cluster or old_cluster>)". Cluster is a partition column, so it is not stored explicitly in the Parquet object itself.

  • You are still doing the rename, so the old name is gone.

  • Sorry, I'm not following. Neither old_cluster nor the new cluster column appeared in the column list in the error message. I also tried:

    df = glueContext.create_dynamic_frame.from_options(...)
    d_df = df.toDF()
    p_df = d_df.withColumn('cluster', d_df['cluster']).repartition(1, 'date')
    p_d_df = DynamicFrame.fromDF(p_df, ...)
    

    and it said "Error Category: QUERY_ERROR; AnalysisException: Cannot resolve column name "cluster" among (<all columns except cluster>)"

  • You cannot create a column with the same name as an existing column, nor can you reference a column that doesn't exist. To make the copy, pass withColumn the name of the new column and, as the value, a reference to the column that you want to copy from.

  • That doesn't work either. I get "Error Category: QUERY_ERROR; AnalysisException: Cannot resolve column name "cluster" among (<all the other columns>)".
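
The errors in this thread are consistent with the earlier remark that cluster is a partition column: its value lives only in the object path, so the DataFrame has no such column to rename or copy unless the reader infers it from the path. A minimal sketch of where that value sits, assuming a Hive-style `key=value` layout (the path and the `partition_values` helper are hypothetical, not part of Glue or Spark):

```python
import re


def partition_values(path):
    """Return the Hive-style key=value segments of a path as a dict."""
    values = {}
    for segment in path.split("/"):
        match = re.fullmatch(r"([^=/]+)=(.*)", segment)
        if match:
            values[match.group(1)] = match.group(2)
    return values


# Hypothetical object path; the real bucket and prefix are not given above.
path = "s3://example-bucket/table/cluster=a/part-00000.parquet"
print(partition_values(path))
```

If the layout does look like this, reading from the table root (or, in plain Spark, passing the `basePath` option) lets partition discovery add cluster as a column, after which it can be kept as ordinary data when writing by 'date'. Whether that matches the read call elided in the question is an assumption.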
