Glue ETL job fails to write to Redshift using a DynamicFrame: what is the reason?

We are observing that writing to Redshift from a Glue DynamicFrame fails when the input file is larger than 1 GB. Setup: Redshift cluster: 2-node DC2. Glue job:

temp_df = glueContext.create_dynamic_frame.from_options(connection_type="s3", format="csv", connection_options={"paths": [source]}, format_options={"withHeader": True, "separator": ","}, transformation_ctx="path={}".format(source)).toDF()

redshift_df = DynamicFrame.fromDF(output_df, glueContext, "redshift_df")
datasink4 = glueContext.write_dynamic_frame.from_jdbc_conf(frame=redshift_df, catalog_connection="pilot-rs", connection_options={"preactions": "truncate table tablename;", "dbtable": "tablename", "database": "dev"}, redshift_tmp_dir='s3://bucket/path/', transformation_ctx="datasink4")

Observation: The code works when the input file is under 1 GB and writes to the Redshift table successfully. It fails when the input file is larger than 1 GB; the job run time in that case is around 10 minutes.

Error:

An error occurred while calling o260.save. Timeout waiting for connection from pool

and sometimes

An error occurred while calling o334.pyWriteDynamicFrame. Timeout waiting for connection from pool

Portion of the Glue error log:

Caused by: com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
	at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.conn.PoolingHttpClientConnectionManager.leaseConnection(PoolingHttpClientConnectionManager.java:286)
	at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.conn.PoolingHttpClientConnectionManager$1.get(PoolingHttpClientConnectionManager.java:263)
AWS
EXPERT
asked 4 years ago · 2459 views
1 Answer
Accepted Answer

It seems you ran out of HTTP connections in the pool that is used either to read from the source (S3) or to write to the sink (the temporary S3 location). I have seen similar issues on EMR before, and I set fs.s3.maxConnections to a high value to increase the connection pool size. You can increase it as mentioned here [1]. You can set the value as follows:

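scala: sparkcontext.hadoopConfiguration.set("fs.s3.maxConnections", "1000")

python: sparkcontext._jsc.hadoopConfiguration().set('fs.s3.maxConnections', '1000')

Note that the spark.hadoop. prefix is only needed when the property is passed through the Spark configuration. On the Hadoop configuration object the key is plain fs.s3.maxConnections, and the value must be a string.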

The issue might be that large files are being fetched and written to the sink, so each HTTP connection is held longer than normal and the pool may simply not be large enough for your workload.
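As a rough sketch of how this could be wired into a Glue script, the snippet below wraps the setting in a small helper. The helper names (set_s3_connection_pool, spark_conf_key) and the default of 1000 are illustrative assumptions, not part of any AWS API. It also assumes the setting is applied before the job first touches S3, since Hadoop generally caches filesystem instances and a later change may not be picked up.

```python
# EMRFS property that controls the S3 HTTP connection pool size.
EMRFS_MAX_CONNECTIONS_KEY = "fs.s3.maxConnections"


def spark_conf_key(hadoop_key):
    """Return the key to use when passing a Hadoop property via Spark conf."""
    return "spark.hadoop." + hadoop_key


def set_s3_connection_pool(hadoop_conf, max_connections=1000):
    """Set the pool size on a Hadoop Configuration-like object.

    hadoop_conf is anything exposing set(key, value), for example
    sc._jsc.hadoopConfiguration(). Hadoop stores values as strings.
    """
    size = int(max_connections)
    if size <= 0:
        raise ValueError("max_connections must be positive")
    hadoop_conf.set(EMRFS_MAX_CONNECTIONS_KEY, str(size))
    return hadoop_conf


def main():
    # Imported lazily so the helpers above work without Spark installed.
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext

    sc = SparkContext.getOrCreate()
    # Assumption: apply the setting before the first S3 read or write.
    set_s3_connection_pool(sc._jsc.hadoopConfiguration(), 1000)
    glue_context = GlueContext(sc)
    # ... create and write the DynamicFrames as in the question ...
    return glue_context
```

In a Glue script, main() would be called at the top of the job. Whether 1000 is an appropriate value depends on the workload, so treat it as a starting point to tune rather than a recommendation.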

AWS
answered 4 years ago
