Read the input file name from S3 in AWS Glue and write it to Redshift


I am reading multiple files from S3 and writing the output to a Redshift database. Below is my code to read all the files from an S3 location (s3://abc/oms/YFS_CATEGORY_ITEM/):

yfs_category_item_df = glueContext.create_dynamic_frame.from_options(
    format_options={},
    connection_type="s3",
    format="parquet",
    connection_options={
        "path": [
            # "s3://" + args['s3_bucket'] + "/" + args['s3_key']
            "s3://abc/oms/YFS_CATEGORY_ITEM/"
        ],
        "recurse": True,
    },
    transformation_ctx="yfs_category_item_df",
)

Below is my code to write the data to Redshift:

RedshiftCluster_node3 = glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=yfs_category_item_df,
    catalog_connection="dodsprd_connection",
    connection_options={
        "database": "dodsprod",
        "dbtable": "dw_replatform_stage.rt_category_item",
        "preactions": pre_query,
        "postactions": post_query,
    },
    redshift_tmp_dir=tmp_dir,
    transformation_ctx="upsert_to_redshift",
)

All of this works as expected. My new requirement is to capture the input file name (the file I am reading from S3) and insert it into my target table as a new column. Can someone please help me with this? I am not using any crawler or Glue tables anywhere in my script. I also tried the input_file_name() function; it does not work with the JDBC option and only works when a crawler/Glue tables are used. I also tried format_options, but that is not supported with from_jdbc_conf either. Any help, please?

2 Answers

The sink being JDBC has no impact on what you are trying to do (by the way, for Redshift it would be better to use the Redshift connector). You could show() the DataFrame before writing: if the filename column is there, it will be written like any other column.
You could use the attachFilename option on the source (see https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format.html#aws-glue-programming-etl-format-shared-reference), or convert to a DataFrame, call the function, and then convert back to a DynamicFrame (input_file_name() is not an "option", it is a Spark SQL function).
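
A minimal sketch of that DataFrame round-trip (the column name source_file is an assumption; use any name you like):

from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import input_file_name

# Convert to a Spark DataFrame, attach the full S3 path of the source
# file as a new column, then convert back to a DynamicFrame for the sink.
df = yfs_category_item_df.toDF().withColumn("source_file", input_file_name())
df.show()  # verify the filename column is populated before writing
yfs_category_item_df = DynamicFrame.fromDF(df, glueContext, "yfs_category_item_df")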

AWS
EXPERT
answered 1 month ago
reviewed 1 month ago by an AWS Support Engineer
  • Thanks for your input. "attachFilename" cannot be added when we are using the JDBC method to write to Redshift; it threw an error. You are right, input_file_name() is a SQL function only; the way I phrased my statement was wrong. When I said "I tried using the input_file_name() option", I meant that I also tried this function and it did not work.


Just a small tweak is needed to get the filename:

yfs_category_item_df = glueContext.create_dynamic_frame.from_options(
    format_options={"attachFilename": "your_filename_column_name"},
    connection_type="s3",
    format="parquet",
    connection_options={
        "path": [
            # "s3://" + args['s3_bucket'] + "/" + args['s3_key']
            "s3://abc/oms/YFS_CATEGORY_ITEM/"
        ],
        "recurse": True,
    },
    transformation_ctx="yfs_category_item_df",
)

Change the value of attachFilename in format_options to the column name you want.
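
If the column shows up in the frame, the JDBC sink will write it like any other column. A quick sanity check, reusing the frame and placeholder column name from the snippet above:

# The attached filename column should be populated, not blank.
yfs_category_item_df.toDF().select("your_filename_column_name").show(5, truncate=False)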

AWS
Don_D
answered 1 month ago
  • I tried this also, but it did not work; I got a blank value. I read somewhere that this works only when a crawler/Glue table is used. Anyway, I found a workaround. In the code you mentioned above, I am using args['s3_key']; this value comes from Lambda for me. I pass it to a variable and use that variable in my post query when doing write_dynamic_frame.from_jdbc_conf (sketched below). Thanks for your inputs.
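
A rough sketch of that workaround, assuming one input file per job run and a hypothetical target column named input_file_name:

import sys
from awsglue.utils import getResolvedOptions

# The bucket and key arrive as job arguments (passed in by Lambda).
args = getResolvedOptions(sys.argv, ["s3_bucket", "s3_key"])
input_file = "s3://" + args["s3_bucket"] + "/" + args["s3_key"]

# Stamp the file name on the freshly loaded rows via the post-action
# SQL that runs on Redshift after the load completes.
post_query = (
    "UPDATE dw_replatform_stage.rt_category_item "
    f"SET input_file_name = '{input_file}' "
    "WHERE input_file_name IS NULL;"
)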
