Inject non-quoted CSV file into RDS via Glue


I have a PySpark script, generated by my Glue job, that is meant to read data from a CSV file in an S3 bucket and write it to my SQL RDS table. My CSV file contains multi-line strings. If the strings are quoted, the job passes, but in my case the multi-line strings are not quoted, so the job cannot insert the data into my table. I tried:

spark.read.option("multiLine", "true").option("quoteChar", -1).option("header","true")

That didn't work. I also tried:

datasink5 = glueContext.write_dynamic_frame.from_options(
    frame = dynamic_frame_write,
    connection_type = "s3", 
    connection_options = {
        "path": "s3://mycsvFile"
        }, 
    format = "csv", 
    format_options={
        "quoteChar": -1, 
        "separator": ","
        }, 
    transformation_ctx = "datasink5")

but this wrote the data back to S3, not to my RDS table.
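As far as I understand, a sink with connection_type = "s3" by definition targets S3; loading into RDS needs a JDBC-backed sink, either through the catalog table or explicitly with from_jdbc_conf. A minimal sketch of the explicit form, where the connection name "my_rds_connection" is a placeholder (the table and database names just mirror my job below):

# Hypothetical sketch: write to RDS through a Glue JDBC connection instead of S3.
# "my_rds_connection" is a placeholder, not a real connection in my setup.
datasink_rds = glueContext.write_dynamic_frame.from_jdbc_conf(
    frame = dynamic_frame_write,
    catalog_connection = "my_rds_connection",
    connection_options = {
        "dbtable": "tbl_teststring",
        "database": "db_rds_sql"
        },
    transformation_ctx = "datasink_rds")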

This is my Glue job:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
import pyspark.sql.functions as f

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
## spark.read.option("multiLine", "true").option("quoteChar", -1).option("header","true").option("escape","\'")
    
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

def otherTreatment(dfa):
    ...
    return dfa

datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "db_rds", table_name = "tbl_csv_extract", transformation_ctx = "datasource0")

applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("id", "string", "id", "string"), ("created", "string", "created", "timestamp"), ("name", "string", "name", "string"), ("high", "string", "high", "decimal(22,7)")], transformation_ctx = "applymapping1")

selectfields2 = SelectFields.apply(frame = applymapping1, paths = ["created", "name", "high", "id"], transformation_ctx = "selectfields2")

resolvechoice3 = ResolveChoice.apply(frame = selectfields2, choice = "MATCH_CATALOG", database = "db_rds_sql", table_name = "tbl_teststring", transformation_ctx = "resolvechoice3")

resolvechoice4 = ResolveChoice.apply(frame = resolvechoice3, choice = "make_cols", transformation_ctx = "resolvechoice4")

data_frame = resolvechoice4.toDF()



data_frame = otherTreatment(data_frame)
dynamic_frame_write = DynamicFrame.fromDF(data_frame, glueContext, "dynamic_frame_write")
datasink5 = glueContext.write_dynamic_frame.from_catalog(frame = dynamic_frame_write, database = "db_rds_sql", table_name = "tbl_teststring", transformation_ctx = "datasink5")

## the following script writes the output back to S3, not to my SQL table
datasink5 = glueContext.write_dynamic_frame.from_options(
    frame = dynamic_frame_write,
    connection_type = "s3", 
    connection_options = {
        "path": "s3://mycsvFile"
        }, 
    format = "csv", 
    format_options={
        "quoteChar": -1, 
        "separator": ","
        }, 
    transformation_ctx = "datasink5")
    
job.commit()

Does anyone have any idea how I can load a CSV file with non-quoted multi-line strings using Glue PySpark?

1 Answer
Accepted Answer

I think your likely solution will be along the lines of finding a way to quote your strings.

Think of it this way: when you wrote "in my case, multi-line strings are not quoted", what that really says is "my CSV file is invalid." It is simply not valid CSV to have unquoted multi-line strings: the parser cannot, even in principle, know whether a line break ends a record or continues a field.
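For example, take this made-up two-column file (id, note) where the first note spans two lines:

id,note
1,first line of the note
second line of the note
2,another note

Without quotes, the parser has no choice but to read "second line of the note" as a brand-new record with a garbled id. Only quoting makes the intent recoverable:

1,"first line of the note
second line of the note"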

Probably you need to do one of the following:

  1. Regenerate your CSV from the original source so that it is valid this time.
  2. If you know enough about your data to determine where the quotes ought to be, then do string parsing to add the quotes yourself (see the sketch below this list).
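As an illustration of option 2, here is a minimal pre-processing sketch in Python. It rests on an assumption only you can verify against your data: that every real record starts with a numeric id followed by a comma, so any line that does not match must be the continuation of the previous record's last field. The regex, the file names, and the assumption that the multi-line field is the last column are all placeholders to adapt:

import csv
import re

# Placeholder assumption: a real record starts with a numeric id and a comma.
# Change this pattern to whatever reliably marks the start of YOUR records.
RECORD_START = re.compile(r"^\d+,")

def repair_csv(in_path, out_path):
    """Merge unquoted continuation lines into the previous record, then
    rewrite the file with every field properly quoted."""
    records = []
    with open(in_path, newline="") as f:
        header = f.readline().rstrip("\n").split(",")
        for line in f:
            line = line.rstrip("\n")
            if RECORD_START.match(line):
                records.append(line)
            elif records:
                # Not a record start: treat the line as the continuation of
                # the previous record's last (multi-line) field.
                records[-1] += "\n" + line
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f, quoting=csv.QUOTE_ALL)
        writer.writerow(header)
        for rec in records:
            # Hypothetical 4-column layout (id,created,name,high): split on
            # the first 3 commas so embedded newlines stay in the last field.
            writer.writerow(rec.split(",", 3))

repair_csv("raw.csv", "repaired.csv")

Once the file is repaired, the standard multi-line read should work, e.g. spark.read.option("multiLine", "true").option("header", "true").csv(...).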
answered 2 years ago
reviewed by AWS EXPERT 2 years ago
