Hello,
I was able to replicate the issue on my end, where the whitespace was getting trimmed in the output file when using your script. To work around the issue, I wrote the output CSV file with a DynamicFrame instead, which produced the file in the expected format. Please find below the code that I used on my end:
# Reading the data from the Glue Data Catalog
AWSGlueDataCatalog_node = glueContext.create_dynamic_frame.from_catalog(
    database="test",
    table_name="test"
)

# Repartition the DynamicFrame to control the number of output files
df_repartitioned_app = AWSGlueDataCatalog_node.repartition(10)

# Writing the output
glueContext.write_dynamic_frame.from_options(
    frame=df_repartitioned_app,
    connection_type='s3',
    connection_options={'path': 's3://output_bucket/', 'compressionType': 'gzip'},
    format='csv',
    format_options={"separator": "|", "withHeader": "true"}
)
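As a quick check after the job finishes, you can read the written files back and compare field lengths. This is a minimal sketch assuming the spark session from your Glue script; the column name "my_field" is a placeholder for one of your own string columns:

from pyspark.sql import functions as F

# Read the pipe-delimited output back; Spark's CSV reader does not
# trim unquoted whitespace by default on read, so the lengths reflect
# what is actually in the files.
check_df = spark.read \
    .option("header", "true") \
    .option("sep", "|") \
    .csv("s3://output_bucket/")

# Compare raw values against their lengths to spot any trimming.
check_df.select("my_field", F.length("my_field").alias("len")).show(5, truncate=False)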
I would request that you try the same on your end and let me know how it goes. Please refer to [1], [2], and [3] for more details.
References:
[1] https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-dynamic-frame-writer.html
[2] https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format-csv-home.html
[3] https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect-s3-home.html#aws-glue-programming-etl-connect-s3
The unintended trimming of string fields when writing to CSV that you're experiencing is not something AWS Glue adds on its own; it generally comes down to how PySpark's CSV writer is configured, and in some Spark versions the writer trims unquoted whitespace by default.
To ensure that your string fields are not trimmed, you can try the following modifications to your code:
- Add the "quote" option to your write operation. This sets the quote character used to enclose fields, which can help preserve leading and trailing spaces:

  df_repartitioned_app.write.format("csv") \
      .option("compression", "gzip") \
      .option("header", "true") \
      .option("delimiter", "|") \
      .option("quote", "\"") \
      .save(output_path_app)

- If you're still experiencing issues, you can try explicitly setting the "escapeQuotes" option to false:

  df_repartitioned_app.write.format("csv") \
      .option("compression", "gzip") \
      .option("header", "true") \
      .option("delimiter", "|") \
      .option("quote", "\"") \
      .option("escapeQuotes", "false") \
      .save(output_path_app)

- Another approach is to use the "quoteAll" option, which forces all fields to be quoted:

  df_repartitioned_app.write.format("csv") \
      .option("compression", "gzip") \
      .option("header", "true") \
      .option("delimiter", "|") \
      .option("quoteAll", "true") \
      .save(output_path_app)
If the issue persists after trying these options, it's possible that the trimming is occurring earlier in your data processing pipeline. You might want to check if any transformations or operations applied to your data before the write step are inadvertently trimming the strings.
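For example, one way to spot where the trimming happens is to compare string lengths at each intermediate stage; in this sketch, "df_stage" and "my_field" are placeholders for one of your intermediate DataFrames and string columns:

from pyspark.sql import functions as F

# Inspect string lengths at an intermediate stage; a shorter length
# after a given transformation points at the step that trims.
df_stage.select("my_field", F.length("my_field").alias("len")).show(5, truncate=False)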
Additionally, ensure that your AWS Glue job has the latest version of Spark and related libraries, as older versions might have different default behaviors or bugs related to CSV writing.
If none of these solutions work, you may need to investigate further by examining your data at various stages of the pipeline to identify where the trimming is occurring.
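One more writer-level setting worth testing: Spark's CSV data source also exposes ignoreLeadingWhiteSpace and ignoreTrailingWhiteSpace options, which in some Spark versions default to true on write and therefore trim unquoted values. A sketch that explicitly disables them, using the same variables as above:

# Explicitly disable the CSV writer's built-in whitespace trimming
df_repartitioned_app.write.format("csv") \
    .option("compression", "gzip") \
    .option("header", "true") \
    .option("delimiter", "|") \
    .option("ignoreLeadingWhiteSpace", "false") \
    .option("ignoreTrailingWhiteSpace", "false") \
    .save(output_path_app)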

Thanks for the feedback. The issue with using .option("quote", "\"") or .option("quoteAll", "true") is that I have some string fields that contain double-quote values, for example """abcd". In this case, the quote option does not preserve the original text value, which I would like to keep in the output CSV file.
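One thing I am planning to try next (a sketch I have not yet verified against my data) is keeping the quote option but also setting the escape character to a double quote, so embedded quotes are written doubled in the usual CSV convention:

# Quote fields and escape embedded double quotes by doubling them
df_repartitioned_app.write.format("csv") \
    .option("compression", "gzip") \
    .option("header", "true") \
    .option("delimiter", "|") \
    .option("quote", "\"") \
    .option("escape", "\"") \
    .save(output_path_app)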