Inject non-quoted CSV file into RDS via Glue


I have a PySpark script generated by my Glue job that is meant to read data from a CSV file in an S3 bucket and write it to my SQL RDS table. The CSV file contains multi-line strings. When the strings are quoted the job passes, but in my case the multi-line strings are not quoted, so the job cannot insert the data into my table. I tried:

spark.read.option("multiLine", "true").option("quoteChar", -1).option("header","true")

That doesn't work. I also tried:

datasink5 = glueContext.write_dynamic_frame.from_options(
    frame = dynamic_frame_write,
    connection_type = "s3", 
    connection_options = {
        "path": "s3://mycsvFile"
        }, 
    format = "csv", 
    format_options={
        "quoteChar": -1, 
        "separator": ","
        }, 
    transformation_ctx = "datasink5")

but this wrote the data back to S3, not to my RDS table.

This is my Glue job:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
import pyspark.sql.functions as f

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
## spark.read.option("multiLine", "true").option("quoteChar", -1).option("header","true").option("escape","\'")
    
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

def otherTreatment(dfa):
    # ... other transformations on the DataFrame ...
    return dfa

datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "db_rds", table_name = "tbl_csv_extract", transformation_ctx = "datasource0")

applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("id", "string", "id", "string"), ("created", "string", "created", "timestamp"), ("name", "string", "name", "string"), ("high", "string", "high", "decimal(22,7)")], transformation_ctx = "applymapping1")

selectfields2 = SelectFields.apply(frame = applymapping1, paths = ["created", "name", "high", "id"], transformation_ctx = "selectfields2")

resolvechoice3 = ResolveChoice.apply(frame = selectfields2, choice = "MATCH_CATALOG", database = "db_rds_sql", table_name = "tbl_teststring", transformation_ctx = "resolvechoice3")

resolvechoice4 = ResolveChoice.apply(frame = resolvechoice3, choice = "make_cols", transformation_ctx = "resolvechoice4")

data_frame = resolvechoice4.toDF()



data_frame = otherTreatment(data_frame)
dynamic_frame_write = DynamicFrame.fromDF(data_frame, glueContext, "dynamic_frame_write")
datasink5 = glueContext.write_dynamic_frame.from_catalog(frame = dynamic_frame_write, database = "db_rds_sql", table_name = "tbl_teststring", transformation_ctx = "datasink5")

## the following call writes the output back to S3, not to my SQL table
datasink5 = glueContext.write_dynamic_frame.from_options(
    frame = dynamic_frame_write,
    connection_type = "s3", 
    connection_options = {
        "path": "s3://mycsvFile"
        }, 
    format = "csv", 
    format_options={
        "quoteChar": -1, 
        "separator": ","
        }, 
    transformation_ctx = "datasink5")
    
job.commit()

Does anyone have any idea how I can load my CSV file with unquoted multi-line strings into RDS with Glue PySpark?

1 Answer

Accepted Answer

I think your likely solution will be along the lines of finding a way to quote your strings.

Think of it this way: when you wrote "in my case, multi-line strings are not quoted", you were really saying "my CSV file is invalid," because a CSV file with unquoted multi-line strings is not valid. The parser cannot, even in principle, know where one record ends and the next begins.
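You can see the ambiguity with Python's standard `csv` module (plain Python, not Glue-specific; the sample data here is made up for illustration):

```python
import csv
import io

# A multi-line value that IS quoted parses as a single record...
quoted = 'id,name\n1,"line one\nline two"\n'
rows = list(csv.reader(io.StringIO(quoted)))
assert rows == [['id', 'name'], ['1', 'line one\nline two']]

# ...but without quotes, the parser has no way to know the newline is
# part of the field, so it produces two separate (broken) records.
unquoted = 'id,name\n1,line one\nline two\n'
rows = list(csv.reader(io.StringIO(unquoted)))
assert rows == [['id', 'name'], ['1', 'line one'], ['line two']]
```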

Probably you need to do one of the following:

  1. Regenerate your CSV from the original source again to get a valid CSV.
  2. If you know enough about your data to determine where the quotes ought to be, then do string parsing to add in the quotes.
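Option 2 could be sketched roughly like this, as a heuristic pre-processing step in plain Python before Glue ever sees the file. It assumes the separator never appears inside the data fields themselves, and `repair_unquoted_multiline` is a hypothetical helper name, not part of any Glue API:

```python
import csv
import io

def repair_unquoted_multiline(text, n_fields, sep=","):
    """Heuristic repair: a physical line with fewer separators than a
    full record requires is treated as a continuation of the previous
    record. Assumes data fields contain no extra separators."""
    fixed = []
    for line in text.splitlines():
        if fixed and line.count(sep) < n_fields - 1:
            fixed[-1] += "\n" + line          # re-join the broken record
        else:
            fixed.append(line)
    # Re-emit with every field quoted so any CSV parser can handle it.
    out = io.StringIO()
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for record in fixed:
        writer.writerow(record.split(sep, n_fields - 1))
    return out.getvalue()
```

Once the file is rewritten this way, the quoted output is valid CSV, so Spark's `multiLine` option (as in the question's first attempt, but with default quoting) should be able to read it.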
Answered 2 years ago
AWS
EXPERT
Reviewed 2 years ago
