Glue catalog table has columns numbered like "column#1", "column#2", ... "column21"


I am using AWS Glue to write an ETL pipeline that reads data from an S3 bucket, processes it, and writes it back to the bucket while also creating a Glue catalog table. The code I am using is the following:

DyF = DynamicFrame.fromDF(spark_df, glueContext, "etl_convert")

s3output = glueContext.getSink(
  path="s3://my_bucket/clean/",
  connection_type="s3",
  updateBehavior="LOG",
  partitionKeys=["year", "version"],
  compression="snappy",
  enableUpdateCatalog=True,
  transformation_ctx="s3output_final",
)
s3output.setCatalogInfo(
  catalogDatabase="my-database", catalogTableName="clean_data"
)
s3output.setFormat("glueparquet")
s3output.writeFrame(DyF)

job.commit()

The problem I am facing is that when the Glue table is created, the columns are numbered like "column#1", "column#2", ... "column21". I am not sure how to get around this. I followed the documentation here: https://docs.aws.amazon.com/glue/latest/dg/update-from-job.html and kept enableUpdateCatalog=True because I want to be able to add new partitions for year and version to the newly created table.

Any help would be very welcome.

asked a month ago · 43 views
1 Answer

The issue you're facing with column names being automatically generated as "column#1", "column#2", etc., in your Glue catalog table is likely due to the schema information not being properly passed or recognized when writing the DynamicFrame to S3.

To resolve this issue, you can try the following approach:

  1. Ensure that your spark_df (Spark DataFrame) has the correct column names before converting it to a DynamicFrame. You can verify this by printing the schema of your spark_df.
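
For example, a quick check on both sides of the conversion (using the spark_df and DyF names from your snippet):

spark_df.printSchema()  # the names should already be correct here
DyF.printSchema()       # and should survive the conversion unchanged
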

  2. Note that DynamicFrame.fromDF() does not accept a schema argument; it inherits whatever schema the DataFrame already has. So set the correct column names on the Spark side before converting, for example with toDF(), which renames all columns positionally in one call:

from awsglue.dynamicframe import DynamicFrame

# fromDF() carries over the DataFrame's schema as-is,
# so fix the column names before converting
column_names = ["column1_name", "column2_name"]  # ... list all of your real column names
spark_df = spark_df.toDF(*column_names)
DyF = DynamicFrame.fromDF(spark_df, glueContext, "etl_convert")
  3. Before converting, you can also rebuild the DataFrame with an explicit schema (note that StructType and friends come from pyspark.sql.types; there is no awsglue.types module):

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Create a schema that matches your data
schema = StructType([
    StructField("column1_name", StringType(), True),
    StructField("column2_name", IntegerType(), True),
    # ... add all your columns here
])

# spark is the SparkSession (glueContext.spark_session in a Glue job)
spark_df = spark.createDataFrame(spark_df.rdd, schema)
DyF = DynamicFrame.fromDF(spark_df, glueContext, "etl_convert")
  4. If the above steps don't work, you can rename the columns on the DynamicFrame itself before writing:

from awsglue.transforms import RenameField

# Assuming you know the correct column names, map each
# generated name ("column#1", "column#2", ...) back to the real one
column_names = ["column1_name", "column2_name"]  # ... through column21_name
for i, name in enumerate(column_names, start=1):
    DyF = RenameField.apply(frame=DyF, old_name=f"column#{i}", new_name=name)

By implementing these steps, you should be able to maintain the correct column names in your Glue catalog table while still being able to add new partitions for year and version. Remember to adjust the column names and data types according to your specific schema.
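
If you want to confirm the result, here is a minimal sketch (assuming the my-database/clean_data names from your snippet and credentials that can read the Data Catalog) that fetches the table definition after the job has run:

import boto3

# Print the column names the job registered in the Data Catalog;
# partition keys (year, version) are listed separately under "PartitionKeys"
glue = boto3.client("glue")
table = glue.get_table(DatabaseName="my-database", Name="clean_data")["Table"]
print([col["Name"] for col in table["StorageDescriptor"]["Columns"]])
print([key["Name"] for key in table["PartitionKeys"]])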
Sources
AWS Glue Scala DynamicFrame class - AWS Glue
DynamicFrameReader class - AWS Glue

answered a month ago
