
PySpark trimming all string fields by default when writing to CSV in an AWS Glue Python job


I am trying to write a dataset to a CSV file using Spark 3.3 (Scala 2) in AWS Glue Python code, and by default it trims all string fields. For example, given the column values:

" Text123"," jacob "

the output in csv is:

"Text123","jacob"

I don't want any string fields to be trimmed.

Below is my Python code in AWS Glue:

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ['target_BucketName', 'JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Script generated for node AWS Glue Data Catalog AKS
AWSGlueDataCatalog_node = glueContext.create_dynamic_frame.from_catalog(database="test", table_name="_test_rw_omuc_baag__test1_app", additional_options = {"jobBookmarkKeys":["seps"],"jobBookmarkKeysSortOrder":"asc","enableUpdateCatalog": False}, transformation_ctx="AWSGlueDataCatalog_node")

# Convert DynamicFrame to DataFrame 
df_app = AWSGlueDataCatalog_node.toDF()

# Repartition the DataFrame to control output files APP
df_repartitioned_app = df_app.repartition(10)  

# Check for empty partitions and write only if data is present
if not df_repartitioned_app.rdd.isEmpty():
    df_repartitioned_app.write.format("csv") \
        .option("compression", "gzip") \
        .option("header", "true") \
        .option("delimiter", "|") \
        .save(output_path_app)

Asked a year ago · 275 views
2 answers

Accepted Answer

Hello,

I was able to replicate the issue at my end: the whitespace was getting trimmed in the output file produced by your script. To work around the issue, I wrote the output CSV using the DynamicFrame directly instead, which produced the file in the expected format. Below is the code I used at my end:

# Reading the data
AWSGlueDataCatalog_node = glueContext.create_dynamic_frame.from_catalog(database="test", table_name="test")

# Repartition the DynamicFrame to control output files APP
df_repartitioned_app = AWSGlueDataCatalog_node.repartition(10)

# Writing the output
glueContext.write_dynamic_frame.from_options(
    frame = df_repartitioned_app,
    connection_type = 's3',
    connection_options = {'path': 's3://output_bucket/', 'compressionType': 'gzip'},
    format = 'csv',
    format_options = {"separator": "|", "withHeader": "true"})

Please try the same at your end and let me know how it goes. Refer to [1][2][3] for more details.

References:
[1] https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-dynamic-frame-writer.html
[2] https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format-csv-home.html
[3] https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect-s3-home.html#aws-glue-programming-etl-connect-s3

AWS
answered a year ago
EXPERT
reviewed a year ago

The trimming you're seeing is in fact default behavior of Spark's CSV writer: on write, the "ignoreLeadingWhiteSpace" and "ignoreTrailingWhiteSpace" options default to true, so leading and trailing whitespace in string fields is stripped unless you explicitly disable them.

To ensure that your string fields are not trimmed, you can try the following modifications to your code:

  1. Set the "ignoreLeadingWhiteSpace" and "ignoreTrailingWhiteSpace" write options to false so the writer preserves leading and trailing spaces:

    df_repartitioned_app.write.format("csv") \
        .option("compression", "gzip") \
        .option("header", "true") \
        .option("delimiter", "|") \
        .option("ignoreLeadingWhiteSpace", "false") \
        .option("ignoreTrailingWhiteSpace", "false") \
        .save(output_path_app)
  2. If you're still experiencing issues, you can try explicitly setting the "escapeQuotes" option to false:

    df_repartitioned_app.write.format("csv") \
        .option("compression", "gzip") \
        .option("header", "true") \
        .option("delimiter", "|") \
        .option("quote", "\"") \
        .option("escapeQuotes", "false") \
        .save(output_path_app)
  3. Another approach is to use the "quoteAll" option, which forces all fields to be quoted:

    df_repartitioned_app.write.format("csv") \
        .option("compression", "gzip") \
        .option("header", "true") \
        .option("delimiter", "|") \
        .option("quoteAll", "true") \
        .save(output_path_app)
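To see why quoting alone may not restore the spaces: if the writer trims values before quoting (which is what Spark's "ignoreLeadingWhiteSpace"/"ignoreTrailingWhiteSpace" write options control), the spaces are gone no matter how the fields are quoted. A minimal pure-Python sketch of the two behaviors, using the stdlib csv module as a stand-in for Spark's writer (the trim flag here is hypothetical, only to mimic the two settings):

```python
import csv
import io

def write_row(values, trim):
    # Mimic a CSV writer that optionally trims values before quoting,
    # roughly how Spark's ignoreLeading/TrailingWhiteSpace write options behave.
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="|", quoting=csv.QUOTE_ALL)
    writer.writerow([v.strip() if trim else v for v in values])
    return buf.getvalue().rstrip("\r\n")

row = [" Text123", " jacob "]
print(write_row(row, trim=True))   # "Text123"|"jacob"    -- quoted, but spaces already lost
print(write_row(row, trim=False))  # " Text123"|" jacob " -- spaces preserved
```

In both cases every field is quoted; only the trim setting decides whether the whitespace survives.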

If the issue persists after trying these options, it's possible that the trimming is occurring earlier in your data processing pipeline. You might want to check if any transformations or operations applied to your data before the write step are inadvertently trimming the strings.

Additionally, ensure that your AWS Glue job has the latest version of Spark and related libraries, as older versions might have different default behaviors or bugs related to CSV writing.

If none of these solutions work, you may need to investigate further by examining your data at various stages of the pipeline to identify where the trimming is occurring.


answered a year ago
EXPERT
reviewed a year ago
  • Thanks for the feedback. The problem with using .option("quote", "\"") or .option("quoteAll", "true") is that some of my string fields themselves contain double-quote characters, for example """abcd". In that case the quote option does not preserve the original text value, which I would like to keep in the output CSV file.
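On the embedded-quotes concern: RFC 4180-style CSV handles a double quote inside a quoted field by doubling it, so quoted values round-trip intact; Spark's CSV writer, by contrast, defaults to backslash-escaping embedded quotes (its "escape" option), which may be why the quoted output looked mangled. A small illustration of the doubling convention using Python's stdlib csv module (not Spark):

```python
import csv
import io

# One value with leading whitespace, one containing a double quote.
row = [' Text123', 'he said "hi"']

buf = io.StringIO()
csv.writer(buf, delimiter="|", quoting=csv.QUOTE_ALL).writerow(row)
print(buf.getvalue().rstrip("\r\n"))  # " Text123"|"he said ""hi"""

# Reading it back restores the original values, spaces and quotes intact.
buf.seek(0)
restored = next(csv.reader(buf, delimiter="|"))
print(restored == row)  # True
```

With Spark you would get the equivalent round-trip behavior by reading the file back with the same quote/escape options it was written with.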
