
pyspark decimal value issue while writing into csv in aws glue python


I am reading data from Oracle and writing the dataset to a CSV file using Spark 3.3 and Scala 2 in an AWS Glue Python job. By default, all the Oracle NUMBER fields that use ',' as the decimal separator are written to the CSV with '.' instead. For example, for the column value 2,6 the output in the CSV is 2.6.

Below is my Python code in AWS Glue:

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['target_BucketName', 'JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Script generated for node AWS Glue Data Catalog AKS
AWSGlueDataCatalog_node = glueContext.create_dynamic_frame.from_catalog(
    database="test",
    table_name="_test_rw__test1_app",
    additional_options={"jobBookmarkKeys": ["seps"], "jobBookmarkKeysSortOrder": "asc", "enableUpdateCatalog": False},
    transformation_ctx="AWSGlueDataCatalog_node"
)

# Convert DynamicFrame to DataFrame
df_app = AWSGlueDataCatalog_node.toDF()

# Repartition the DataFrame to control the number of output files
df_repartitioned_app = df_app.repartition(10)

# Check for empty partitions and write only if data is present
# (output_path is assumed to be built from args['target_BucketName'] elsewhere in the job)
if not df_repartitioned_app.rdd.isEmpty():
    df_repartitioned_app.write.format("csv") \
        .option("compression", "gzip") \
        .option("header", "true") \
        .option("delimiter", "Ə") \
        .option("ignoreLeadingWhiteSpace", "false") \
        .option("ignoreTrailingWhiteSpace", "false") \
        .save(output_path)
asked 10 months ago · 385 views
1 Answer

The issue you're experiencing comes from PySpark's default behavior when writing decimal values to CSV: it always uses the period ('.') as the decimal separator for numeric values, regardless of how they were formatted in the source data.

To maintain the comma (',') as the decimal separator in your output CSV file, you'll need to modify your code to format the decimal fields before writing. Here's an approach you can try:

  1. First, identify the columns that contain decimal values.
  2. Use the withColumn function to format these decimal columns, replacing the period with a comma.
  3. Then write the modified DataFrame to CSV.

Here's an example of how you could modify your code:

from pyspark.sql.functions import col, format_string, regexp_replace

# Identify decimal columns (replace with your actual decimal column names)
decimal_columns = ["column1", "column2", "column3"]

# Format each decimal column as a string with two decimal places,
# then swap the period for a comma as the decimal separator
for column in decimal_columns:
    df_repartitioned_app = df_repartitioned_app.withColumn(
        column,
        format_string("%.2f", col(column))
    )
    df_repartitioned_app = df_repartitioned_app.withColumn(
        column,
        regexp_replace(col(column), r"\.", ",")
    )

# Write to CSV
if not df_repartitioned_app.rdd.isEmpty():
    df_repartitioned_app.write.format("csv") \
        .option("compression", "gzip") \
        .option("header", "true") \
        .option("delimiter", "Ə") \
        .option("ignoreLeadingWhiteSpace", "false") \
        .option("ignoreTrailingWhiteSpace", "false") \
        .save(output_path)

This approach first formats the decimal columns to ensure a consistent number of decimal places, then replaces the period with a comma. The resulting columns will be strings, preserving the comma as the decimal separator when written to CSV.

Remember to replace "column1", "column2", "column3" with your actual decimal column names. Also, adjust the "%.2f" format string if you need a different number of decimal places.
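If you'd rather not hard-code the names, you could also derive them from the DataFrame schema. This is just a sketch, assuming your Oracle NUMBER fields arrive in the DataFrame as Spark decimal (or double/float) columns:

from pyspark.sql.types import DecimalType, DoubleType, FloatType

# Collect every column whose Spark type is decimal/double/float so it can be
# reformatted with a comma decimal separator before the write
decimal_columns = [
    field.name for field in df_repartitioned_app.schema.fields
    if isinstance(field.dataType, (DecimalType, DoubleType, FloatType))
]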

If you need to maintain the numeric data type for further processing before writing to CSV, you can apply this formatting as the final step before the write operation.
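As a rough sketch of that pattern (df_final is a placeholder for whatever DataFrame you end up with after your numeric processing):

from pyspark.sql.functions import col, format_string, regexp_replace

# Keep the columns numeric while doing any calculations...
df_final = df_repartitioned_app  # e.g. filters or aggregations go here

# ...and only convert them to comma-separated strings right before the write
for column in decimal_columns:
    df_final = df_final.withColumn(
        column,
        regexp_replace(format_string("%.2f", col(column)), r"\.", ",")
    )

You would then write df_final using the same CSV options shown above.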
Sources
pyspark trimming all fields bydefault while writing into csv in aws glue python | AWS re:Post
NUMBER_FORMAT - AWS Glue DataBrew

answered 10 months ago
AWS
SUPPORT ENGINEER
reviewed 10 months ago
