
pyspark decimal value issue while writing into csv in aws glue python


I am trying to read data from Oracle and write the dataset to a CSV file using Spark 3.3 / Scala 2 in an AWS Glue Python job. By default, every Oracle NUMBER field that uses ',' as the decimal separator is written to the CSV with '.' instead. For example, the column value 2,6 ends up in the CSV as 2.6.

Below is my Python code in AWS Glue:

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['target_BucketName', 'JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Output location (assumed: the S3 path passed in via the job argument)
output_path = args['target_BucketName']

# Script generated for node AWS Glue Data Catalog AKS
AWSGlueDataCatalog_node = glueContext.create_dynamic_frame.from_catalog(
    database="test",
    table_name="_test_rw__test1_app",
    additional_options={
        "jobBookmarkKeys": ["seps"],
        "jobBookmarkKeysSortOrder": "asc",
        "enableUpdateCatalog": False,
    },
    transformation_ctx="AWSGlueDataCatalog_node",
)

# Convert DynamicFrame to DataFrame
df_app = AWSGlueDataCatalog_node.toDF()

# Repartition the DataFrame to control output files APP
df_repartitioned_app = df_app.repartition(10)

# Check for empty partitions and write only if data is present
if not df_repartitioned_app.rdd.isEmpty():
    df_repartitioned_app.write.format("csv") \
        .option("compression", "gzip") \
        .option("header", "true") \
        .option("delimiter", "Ə") \
        .option("ignoreLeadingWhiteSpace", "false") \
        .option("ignoreTrailingWhiteSpace", "false") \
        .save(output_path)
Asked 10 months ago · 395 views
1 Answer

The issue you're experiencing is related to the default behavior of PySpark when writing decimal values to CSV files. By default, PySpark uses the period ('.') as the decimal separator when writing numeric values, regardless of the original format in the source data.
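As a quick illustration, here is a minimal sketch of that default behaviour (the column name and output path are just placeholders):

from decimal import Decimal
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, DecimalType

spark = SparkSession.builder.getOrCreate()

# A single DecimalType column holding the value 2.6
schema = StructType([StructField("amount", DecimalType(10, 2))])
df = spark.createDataFrame([(Decimal("2.6"),)], schema)

# Regardless of the locale of the source system, the CSV writer emits the
# canonical representation with a period, e.g. "2.60"
df.write.option("header", "true").mode("overwrite").csv("/tmp/decimal_demo")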

To maintain the comma (',') as the decimal separator in your output CSV file, you'll need to modify your code to format the decimal fields before writing. Here's an approach you can try:

  1. First, identify the columns that contain decimal values.
  2. Use the withColumn function to format these decimal columns, replacing the period with a comma.
  3. Then write the modified DataFrame to CSV.

Here's an example of how you could modify your code:

from pyspark.sql.functions import format_string, regexp_replace, col

# Identify decimal columns (replace with your actual decimal column names)
decimal_columns = ["column1", "column2", "column3"]

# Format decimal columns as strings, then swap the period for a comma
for column in decimal_columns:
    df_repartitioned_app = df_repartitioned_app.withColumn(
        column,
        format_string("%.2f", col(column))
    )
    df_repartitioned_app = df_repartitioned_app.withColumn(
        column,
        regexp_replace(col(column), r"\.", ",")
    )

# Write to CSV
if not df_repartitioned_app.rdd.isEmpty():
    df_repartitioned_app.write.format("csv") \
        .option("compression", "gzip") \
        .option("header", "true") \
        .option("delimiter", "Ə") \
        .option("ignoreLeadingWhiteSpace", "false") \
        .option("ignoreTrailingWhiteSpace", "false") \
        .save(output_path)

This approach first formats the decimal columns to ensure a consistent number of decimal places, then replaces the period with a comma. The resulting columns will be strings, preserving the comma as the decimal separator when written to CSV.
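A quick way to confirm the conversion before writing is to check the column types, for example:

# The formatted columns should now report as strings (illustrative check)
print([dtype for name, dtype in df_repartitioned_app.dtypes if name in decimal_columns])
# e.g. ['string', 'string', 'string']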

Remember to replace "column1", "column2", "column3" with your actual decimal column names. Also, adjust the "%.2f" format string if you need a different number of decimal places.
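If you prefer not to maintain that list by hand, one option is to derive it from the DataFrame schema. Here is a minimal sketch, assuming the Oracle NUMBER columns arrive as DecimalType (depending on the JDBC mapping they may also show up as double or float):

from pyspark.sql.types import DecimalType, DoubleType, FloatType

# Collect every column whose type is decimal-like
# (assumes df_repartitioned_app from above)
decimal_columns = [
    field.name
    for field in df_repartitioned_app.schema.fields
    if isinstance(field.dataType, (DecimalType, DoubleType, FloatType))
]
print(decimal_columns)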

If you need to maintain the numeric data type for further processing before writing to CSV, you can apply this formatting as the final step before the write operation.
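For example, you could leave df_repartitioned_app numeric and apply the comma formatting only to a throwaway copy used for the export; a sketch reusing the names from above:

from pyspark.sql.functions import format_string, regexp_replace, col

# Keep the original DataFrame numeric; format only the copy that gets written out
df_to_write = df_repartitioned_app
for column in decimal_columns:
    df_to_write = df_to_write.withColumn(
        column,
        regexp_replace(format_string("%.2f", col(column)), r"\.", ",")
    )

df_to_write.write.format("csv") \
    .option("compression", "gzip") \
    .option("header", "true") \
    .option("delimiter", "Ə") \
    .save(output_path)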
Sources
pyspark trimming all fields bydefault while writing into csv in aws glue python | AWS re:Post
NUMBER_FORMAT - AWS Glue DataBrew

Answered 10 months ago
AWS
Support Engineer
Reviewed 10 months ago

