We are seeing failures when writing data to Redshift with a Glue dynamic frame whenever the input file is larger than 1 GB.
Setup:
Redshift cluster: 2 DC2 nodes
Glue job:
from awsglue.dynamicframe import DynamicFrame

# Read the CSV input from S3 and convert it to a Spark DataFrame
output_df = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    format="csv",
    connection_options={"paths": [source]},
    format_options={"withHeader": True, "separator": ","},
    transformation_ctx="path={}".format(source),
).toDF()

# Convert back to a DynamicFrame for the Redshift sink
redshift_df = DynamicFrame.fromDF(output_df, glueContext, "redshift_df")

# Truncate the target table, then load it via the Redshift temp dir in S3
datasink4 = glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=redshift_df,
    catalog_connection="pilot-rs",
    connection_options={
        "preactions": "truncate table tablename;",
        "dbtable": "tablename",
        "database": "dev",
    },
    redshift_tmp_dir="s3://bucket/path/",
    transformation_ctx="datasink4",
)
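As an aside, the write parallelism here follows the DataFrame's partitioning straight from the read. A variant that lowers the number of concurrent S3 writes to the temp dir, sketched with an arbitrary partition count that we have not tuned, would be:

# Variant: coalesce before the sink to cap concurrent S3 writes
# (20 partitions is an arbitrary example value, not a recommendation)
output_df = output_df.coalesce(20)
redshift_df = DynamicFrame.fromDF(output_df, glueContext, "redshift_df")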
Observations:
With input files smaller than 1 GB, the code works as expected and the data lands in the Redshift table.
With input files larger than 1 GB, the job fails after roughly 10 minutes of runtime.
Error:
An error occurred while calling o260.save. Timeout waiting for connection from pool
And sometimes this error instead:
"An error occurred while calling o334.pyWriteDynamicFrame. Timeout waiting for connection from pool"
Snippet from the Glue error log:
Caused by: com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.conn.PoolingHttpClientConnectionManager.leaseConnection(PoolingHttpClientConnectionManager.java:286)
at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.conn.PoolingHttpClientConnectionManager$1.get(PoolingHttpClientConnectionManager.java:263)
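For what it's worth, the pool in that stack trace is the EMRFS S3 client's HTTP connection pool (note the shaded com.amazon.ws.emr.hadoop.fs package), not a Redshift JDBC pool. A mitigation often suggested for this symptom, sketched here with an illustrative value and not something we have verified in this job, is to raise the pool size via the Hadoop configuration before the write:

# Raise the EMRFS S3 connection pool limit before writing
# ("1000" is an illustrative value; the appropriate setting depends
# on the worker type and the concurrency of the job)
glueContext.spark_session.sparkContext._jsc.hadoopConfiguration().set(
    "fs.s3.maxConnections", "1000"
)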