Exception in User Class: com.amazonaws.SdkClientException: Unable to execute HTTP request: readHandshakeRecord (AWS Glue)


Hey Guys!

I am trying to read a large amount of data (about 45 GB across 5,500,000 files) from S3 and rewrite it into a partitioned folder (another folder inside the same bucket), but I am facing this error: Exception in User Class: com.amazonaws.SdkClientException: Unable to execute HTTP request: readHandshakeRecord

When I tried with just one file in the same folder, it works. Do you have any idea what the problem could be?

Code (running with 60 DPUs, Glue 4.0):


import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.{DynamicFrame, GlueContext}
import org.apache.spark.SparkContext

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)

    // Read all Parquet files under the raw folder; group small files within
    // each partition and list S3 objects in batches to reduce driver pressure.
    val dynamicFrame = glueContext.getSourceWithFormat(
      connectionType = "s3",
      format = "parquet",
      options = JsonOptions("""{"paths": ["s3://bucket/raw-folder"], "recurse": true, "groupFiles": "inPartition", "useS3ListImplementation": true}""")
    ).getDynamicFrame()

    // Write the data back out as Snappy-compressed Parquet with the Glue
    // Parquet writer, repartitioned to 10 output partitions.
    glueContext.getSinkWithFormat(
      connectionType = "s3",
      options = JsonOptions("""{"path": "s3://bucket/partition-folder"}"""),
      format = "parquet",
      formatOptions = JsonOptions("""{"compression": "snappy", "blockSize": 268435456, "pageSize": 1048576, "useGlueParquetWriter": true}""")
    ).writeDynamicFrame(dynamicFrame.repartition(10))
  }
}

Best

lp_evan
asked a year ago · 379 views
1 Answer

The "Unable to execute HTTP request: readHandshakeRecord" exception is likely caused by insufficient memory issue. When reading large amount of small file,s spark driver keeps track of metadata for each file it reads and keeps this in memory, such may put a lot pressure on the driver memory can cause malfunctions with Http/API calls. I'd recommend you to check "glue.driver.jvm.heap.usage" and "glue.ALL.jvm.heap.usage" metrics in cloudwatch console [1] to see what the memory usage looks like.

To fix this issue, I suggest you consider the following optimizations:

  1. Use useS3ListImplementation. When AWS Glue lists files, it creates a file index in driver memory. If 'useS3ListImplementation' is set to true, AWS Glue does not cache the whole list of files in memory at once; instead, it caches the list in batches. This helps reduce out-of-memory errors on the Spark driver. [2]

  2. Change worker types. a. Use the G.2X worker type, which has more memory and will help alleviate memory pressure on the driver. b. Increase the number of workers. You can refer to "glue.driver.ExecutorAllocationManager.executors.numberMaxNeededExecutors" [3] to determine the number of workers that maximizes parallelism and performance.

  3. Use bounded execution. If you still face the same issue, you can try bounded execution [4] to split and distribute the large number of files across multiple job runs. With this setting, you can cap how many files are processed in a single Glue job run (see the sketch after this list for how suggestions 1 and 3 can be set on the S3 source).
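For illustration, here is a minimal sketch of how suggestions 1 and 3 could be expressed on your existing S3 source (the 'groupSize' of 128 MB and the 'boundedFiles' value of "200000" are placeholder numbers to tune for your data; bounded execution is designed to work together with job bookmarks, so the source also gets a transformation context):

    // Hypothetical tuning of the S3 source: batched listing, small-file
    // grouping, and bounded execution to cap the files read per job run.
    // Bounded execution relies on job bookmarks to track what was processed.
    val dynamicFrame = glueContext.getSourceWithFormat(
      connectionType = "s3",
      format = "parquet",
      options = JsonOptions(
        """{
          |  "paths": ["s3://bucket/raw-folder"],
          |  "recurse": true,
          |  "useS3ListImplementation": true,
          |  "groupFiles": "inPartition",
          |  "groupSize": "134217728",
          |  "boundedFiles": "200000"
          |}""".stripMargin),
      transformationContext = "read-raw"
    ).getDynamicFrame()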

I hope this information helps you resolve the issue. If the issue still persists after implementing the optimizations above, I'd recommend opening a support ticket with the Glue job details; our support team will be happy to help you further troubleshoot and resolve the issue.

Ref:
[1] https://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html#:~:text=all%20Spark%20executors.-,AWS%20Glue%20Metrics,-AWS%20Glue%20profiles
[2] https://aws.amazon.com/premiumsupport/knowledge-center/glue-oom-java-heap-space-error/
[3] https://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html#glue.driver.ExecutorAllocationManager.executors.numberMaxNeededExecutors
[4] https://docs.aws.amazon.com/glue/latest/dg/bounded-execution.html

AWS
Ethan_H
answered a year ago
