Hi,
To read only the two specified columns, "EXECUTION_MARKET" and "TA_NUMBERSYS", from the database table "DSUSER00.SECTANR" in an AWS Glue Python script, you can do the following:
- First, create a DynamicFrame from your Data Catalog table and select only the columns you want:
# Create a DynamicFrame from your source table
source_dyf = glueContext.create_dynamic_frame.from_catalog(
    database="database_name",
    table_name="table_name"
)

# Select only the desired columns
selected_dyf = source_dyf.select_fields(['EXECUTION_MARKET', 'TA_NUMBERSYS'])
- Alternatively, convert the DynamicFrame to a DataFrame, do the selection there, and convert back:
# Requires: from awsglue.dynamicframe import DynamicFrame
source_df = source_dyf.toDF()
selected_df = source_df.select("EXECUTION_MARKET", "TA_NUMBERSYS")
selected_dyf = DynamicFrame.fromDF(selected_df, glueContext, "selected_dyf")
- Finally, write the data to the desired path for your use case:
# Write the selected columns to S3 (format shown as CSV; adjust as needed)
glueContext.write_dynamic_frame.from_options(
    frame=selected_dyf,
    connection_type="s3",
    connection_options={
        "path": "s3://bucket/folder/"
    },
    format="csv"
)
This way, only the specified columns are carried through and processed in your Glue job.
Note: The code snippets shared above are for reference only. Kindly modify them according to your use case.
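For completeness: the snippets above assume the usual AWS Glue job boilerplate that defines glueContext. A minimal sketch (JOB_NAME is the standard job-argument placeholder):
import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job

# Standard Glue job setup: Spark context, Glue context, job initialization
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# ... the reads, transforms, and writes from the snippets above go here ...

job.commit()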
Sources:
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python.html
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-dynamic-frame.html
- First, when creating the DynamicFrame, specify the columns you want to select:
RelationalDB_node_sectanr = glueContext.create_dynamic_frame.from_options(
    connection_type="oracle",
    connection_options={
        "useConnectionProperties": "true",
        "dbtable": "DSUSER00.SECTANR",
        "connectionName": "DS7PSEC",
        "hashfield": "AENDERUNGSDATLO",
        "hashpartitions": "152",
        "jobBookmarkKeys": ["AENDERUNGSDATLO"],
        "jobBookmarkKeysSortOrder": "asc",
        "customSql": "SELECT EXECUTION_MARKET, TA_NUMBERSYS, AENDERUNGSDATLO FROM DSUSER00.SECTANR"
    },
    transformation_ctx="RelationalDB_node_sectanr"
)
- Then, when you convert the DynamicFrame to a DataFrame, you'll only have these columns:
df_sectanr = RelationalDB_node_sectanr.toDF()
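As a quick sanity check (not in the original script), you can print the schema to confirm that the customSql projection took effect:
# Should list only the three columns named in the customSql query
df_sectanr.printSchema()
# root
#  |-- EXECUTION_MARKET: ...
#  |-- TA_NUMBERSYS: ...
#  |-- AENDERUNGSDATLO: ...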
- You can keep your filtering logic, but it will only apply to the AENDERUNGSDATLO column:
# Requires: from pyspark.sql.functions import col, from_unixtime, lit, to_date
# AENDERUNGSDATLO holds seconds since the Unix epoch, so convert it to a
# timestamp before comparing it against the cutoff date
df_sectanr_filtered = df_sectanr.filter(
    from_unixtime(col('AENDERUNGSDATLO')).cast('timestamp') >
    to_date(lit('20220101'), 'yyyyMMdd')
)
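Equivalently, since AENDERUNGSDATLO already stores epoch seconds, you could compare against a numeric cutoff directly; 1640995200 corresponds to 2022-01-01 00:00:00 UTC:
# Same filter expressed as a plain numeric comparison on epoch seconds
df_sectanr_filtered = df_sectanr.filter(col('AENDERUNGSDATLO') > 1640995200)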
- The rest of your code can remain the same. When you write the DataFrame to CSV, it will only contain the EXECUTION_MARKET and TA_NUMBERSYS columns, plus the AENDERUNGSDATLO column used for filtering.
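If you prefer the CSV to contain only EXECUTION_MARKET and TA_NUMBERSYS, a minimal sketch of the write step could drop the filter column first; the bucket path and frame name here are placeholders:
# Drop the column used only for filtering, then write the result as CSV
output_df = df_sectanr_filtered.drop('AENDERUNGSDATLO')
output_dyf = DynamicFrame.fromDF(output_df, glueContext, 'output_dyf')
glueContext.write_dynamic_frame.from_options(
    frame=output_dyf,
    connection_type='s3',
    connection_options={'path': 's3://bucket/folder/'},
    format='csv'
)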
This approach will ensure that only the specified columns are read from the database and processed in your Glue job, which can improve performance and reduce data transfer if these columns are significantly smaller than the full table.
Remember to adjust any other parts of your script that might depend on columns that are no longer present in the DataFrame after this change.
Sources:
Program AWS Glue ETL scripts in PySpark - AWS Glue
pyspark trimming all fields bydefault while writing into csv in aws glue python | AWS re:Post
