The issue you're experiencing is related to the default behavior of PySpark when writing decimal values to CSV files. By default, PySpark uses the period ('.') as the decimal separator when writing numeric values, regardless of the original format in the source data.
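You can see this default with a minimal, self-contained sketch (the column name and output path here are illustrative):

```python
from decimal import Decimal
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A one-column DataFrame holding a decimal value
df = spark.createDataFrame([(Decimal("1234.56"),)], "amount decimal(10,2)")

# PySpark writes the value with a period: the CSV row reads "1234.56"
df.write.format("csv").option("header", "true").save("/tmp/decimal_demo")
```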
To maintain the comma (',') as the decimal separator in your output CSV file, you'll need to modify your code to format the decimal fields before writing. Here's an approach you can try:
- First, identify the columns that contain decimal values.
- Use the `withColumn` function to format these decimal columns, replacing the period with a comma.
- Then write the modified DataFrame to CSV.
Here's an example of how you could modify your code:
```python
from pyspark.sql.functions import col, format_string, regexp_replace

# Identify decimal columns (replace with your actual decimal column names)
decimal_columns = ["column1", "column2", "column3"]

# Format each decimal column to two decimal places, then swap the period
# for a comma; the result is a string column. The cast to double makes the
# value safe to pass to format_string.
for column in decimal_columns:
    df_repartitioned_app = df_repartitioned_app.withColumn(
        column,
        regexp_replace(
            format_string("%.2f", col(column).cast("double")),
            r"\.",
            ",",
        ),
    )

# Write to CSV
if not df_repartitioned_app.rdd.isEmpty():
    df_repartitioned_app.write.format("csv") \
        .option("compression", "gzip") \
        .option("header", "true") \
        .option("delimiter", "Ə") \
        .option("ignoreLeadingWhiteSpace", "false") \
        .option("ignoreTrailingWhiteSpace", "false") \
        .save(output_path)
```
This approach first formats the decimal columns to ensure a consistent number of decimal places, then replaces the period with a comma. The resulting columns will be strings, preserving the comma as the decimal separator when written to CSV.
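As a quick sanity check, here is the same transformation applied to a single made-up value (a standalone sketch, reusing the `spark` session from the earlier example):

```python
from decimal import Decimal
from pyspark.sql.functions import col, format_string, regexp_replace

df_check = spark.createDataFrame([(Decimal("3.14159"),)], "value decimal(10,5)")
df_check = df_check.withColumn(
    "value",
    regexp_replace(format_string("%.2f", col("value").cast("double")), r"\.", ","),
)
df_check.show()  # prints a single row containing the string "3,14"
```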
Remember to replace "column1", "column2", "column3" with your actual decimal column names. Also, adjust the "%.2f" format string if you need a different number of decimal places.
If you need to maintain the numeric data type for further processing before writing to CSV, you can apply this formatting as the final step before the write operation.
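For instance, here is a minimal sketch of that final step (assuming `df_repartitioned_app`, `decimal_columns`, and `output_path` from above), which keeps the DataFrame numeric throughout and converts the decimal columns only in a projection right before the write:

```python
from pyspark.sql.functions import col, format_string, regexp_replace

# Earlier numeric processing uses df_repartitioned_app unchanged; only this
# projection converts the decimal columns to comma-separated strings.
df_to_write = df_repartitioned_app.select(
    *[
        regexp_replace(format_string("%.2f", col(c).cast("double")), r"\.", ",").alias(c)
        if c in decimal_columns
        else col(c)
        for c in df_repartitioned_app.columns
    ]
)

df_to_write.write.format("csv").option("header", "true").save(output_path)
```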
