inject non quoted csv file into RDS via glue

0

I have a pyspark script generated by my glue job that aims to read data from a CSV file in an S3 bucket and write it on my SQL RDS table. in my CSV file, I have string multi-lines. if the strings is quoted the job pass, but in my case, multi-line strings are not quoted so the job cannot insert data in my table; I tried :

spark.read.option("multiLine", "true").option("quoteChar", -1).option("header","true")

it doesn't work. I also tried :

datasink5 = glueContext.write_dynamic_frame.from_options(
    frame = dynamic_frame_write,
    connection_type = "s3", 
    connection_options = {
        "path": "s3://mycsvFile"
        }, 
    format = "csv", 
    format_options={
        "quoteChar": -1, 
        "separator": ","
        }, 
    transformation_ctx = "datasink5")

but this wrote the data back to s3 not to my RDS table.

this is my glue job :

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
import pyspark.sql.functions as f

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
## spark.read.option("multiLine", "true").option("quoteChar", -1).option("header","true").option("escape","\'")
    
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

def otherTreatment(dfa):
...
   return dfa

datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "db_rds", table_name = "tbl_csv_extract", transformation_ctx = "datasource0")

applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("id", "string", "id", "string"), ("created", "string", "created", "timestamp"), ("name", "string", "name", "string"), ("high", "string", "high", "decimal(22,7)")], transformation_ctx = "applymapping1")

selectfields2 = SelectFields.apply(frame = applymapping1, paths = ["created", "name", "high", "id"], transformation_ctx = "selectfields2")

resolvechoice3 = ResolveChoice.apply(frame = selectfields2, choice = "MATCH_CATALOG", database = "db_rds_sql", table_name = "tbl_teststring", transformation_ctx = "resolvechoice3")

resolvechoice4 = ResolveChoice.apply(frame = resolvechoice3, choice = "make_cols", transformation_ctx = "resolvechoice4")

data_frame = resolvechoice4.toDF()



data_frame = otherTreatment(data_frame)
dynamic_frame_write = DynamicFrame.fromDF(data_frame, glueContext, "dynamic_frame_write")
datasink5 = glueContext.write_dynamic_frame.from_catalog(frame = dynamic_frame_write, database = "db_rds_sql", table_name = "tbl_teststring", transformation_ctx = "datasink5")

## with the flowing script write output back to s3 not in my sql table
datasink5 = glueContext.write_dynamic_frame.from_options(
    frame = dynamic_frame_write,
    connection_type = "s3", 
    connection_options = {
        "path": "s3://mycsvFile"
        }, 
    format = "csv", 
    format_options={
        "quoteChar": -1, 
        "separator": ","
        }, 
    transformation_ctx = "datasink5")
    
job.commit()

does anyone have any idea how can I write My CSV file with non quoted multiline with glue pyspark?

1 Antwort
1
Akzeptierte Antwort

I think your likely solution will be along the lines of finding a way to quote your strings.

Think of it this way: when you wrote "in my case, multi-line strings are not quoted", this is really saying, "my CSV file is invalid." Because it's really not valid to have unquoted multi-line strings. The parser cannot in principle know what to do.

Probably you need to do one of the following:

  1. Regenerate your CSV from the original source again to get a valid CSV.
  2. If you know enough about your data to determine where the quotes ought to be, then do string parsing to add in the quotes.
beantwortet vor 2 Jahren
AWS
EXPERTE
überprüft vor 2 Jahren

Du bist nicht angemeldet. Anmelden um eine Antwort zu veröffentlichen.

Eine gute Antwort beantwortet die Frage klar, gibt konstruktives Feedback und fördert die berufliche Weiterentwicklung des Fragenstellers.

Richtlinien für die Beantwortung von Fragen