Remove Non-ASCII characters from S3 file using Glue


I have a CSV file delivered by external vendor, to S3 and this file has some Non-ASCII/Junk characters. Before loading this to redahft table, I will need to remove these characters. I tried TRIMBLANKS in COPY command and that didn't help. When opening this CSV file in notepads++, it's showing as space. After selecting "show hidden characters" these space is shown as "." I tried calling this file in glue studio and using pyspark code, I am getting error. Can anyone please help me on this. Can it be rectified in COPY command or AWS glue or LAMBDA? I am new to lambda.

Sample data: 003800019779
**Code used by me: **
def ascii_ignore(x): return x.encode('ascii', 'ignore').decode('acsii') ascii_udf = udf(ascii_ignore) mapping_node1694802690218 = mapping_node1694802690218.withColumn("UPC",ascii_udf(col("UPC")))

**Getting error as: ** AttributeError: 'DynamicFrame' object has no attribute 'withColumn'

1 Answer

I have figured out the solution. "withColumn' is not a part of DynamicFrame. I had to convert the Dynamic Frame into Data Frame and then use 'withColumn' attribute to implement the regex function like below: from pyspark.sql.types import * from pyspark.sql.functions import regexp_replace from awsglue.dynamicframe import DynamicFrame data_frames = full_order_node.toDF() ##converting to DataFrame data_frames = data_frames.withColumn("featured",regexp_replace(col("featured")," ", "")) final_output = DynamicFrame.fromDF(data_frames, glueContext, "final_output") ##converting back to DynamicFrame

answered 16 days ago

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions