We are observing that writing to Redshift using a Glue DynamicFrame errors out when the input file is larger than 1 GB.
Setup:
Redshift cluster: 2-node DC2
Glue job:
from awsglue.dynamicframe import DynamicFrame

# Read the source CSV from S3 into a DynamicFrame, then convert to a Spark DataFrame
temp_df = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    format="csv",
    connection_options={"paths": [source]},
    format_options={"withHeader": True, "separator": ","},
    transformation_ctx="path={}".format(source),
).toDF()

# ... transformations producing output_df omitted ...

# Convert back to a DynamicFrame for the Redshift write
redshift_df = DynamicFrame.fromDF(output_df, glueContext, "redshift_df")

# Truncate the target table, then write through the Glue Redshift connection,
# staging the data under redshift_tmp_dir
datasink4 = glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=redshift_df,
    catalog_connection="pilot-rs",
    connection_options={
        "preactions": "truncate table tablename;",
        "dbtable": "tablename",
        "database": "dev",
    },
    redshift_tmp_dir="s3://bucket/path/",
    transformation_ctx="datasink4",
)
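For context, from_jdbc_conf with redshift_tmp_dir stages the frame as files in S3 and then loads them into Redshift with COPY, so a larger input means more partitions writing to S3 concurrently. One mitigation we are considering is capping the partition count before the write; this is only a sketch, and the value 16 is an untested guess:

# Hypothetical: reduce the number of partitions so fewer concurrent S3
# connections are opened while staging files under redshift_tmp_dir.
redshift_df = DynamicFrame.fromDF(output_df.repartition(16), glueContext, "redshift_df")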
Observation:
The code works when the input file is under 1 GB: it writes to the Redshift table successfully. It fails when the input file is larger than 1 GB, at a job run time of around 10 minutes.
Error:
An error occurred while calling o260.save. Timeout waiting for connection from pool
and sometimes
"An error occurred while calling o334.pyWriteDynamicFrame. Timeout waiting for connection from pool"
Portion of the Glue error log:
Caused by: com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.conn.PoolingHttpClientConnectionManager.leaseConnection(PoolingHttpClientConnectionManager.java:286)
at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.conn.PoolingHttpClientConnectionManager$1.get(PoolingHttpClientConnectionManager.java:263)
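The stack trace points at the shaded EMRFS HTTP client (com.amazon.ws.emr.hadoop.fs), which suggests the S3 connection pool is being exhausted rather than the Redshift JDBC connection itself. The other workaround we are considering is raising the pool limit; a minimal sketch, assuming fs.s3.maxConnections is the setting the Glue runtime honors (500 is a guess, not a tested value):

# Hypothetical: raise the EMRFS S3 connection pool limit before the write.
glueContext.spark_session.sparkContext._jsc.hadoopConfiguration().set(
    "fs.s3.maxConnections", "500"
)

Has anyone hit this with inputs over 1 GB, and is raising the connection pool size (or reducing write parallelism) the right direction?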