Slave Nodes Unable to Talk to Master in Spark Application

I'm attempting to run a basic word count program on an EMR cluster as a PoC using Spark and YARN. The step fails immediately, apparently because the slave nodes cannot contact the master: YARN fails to create any containers, and the slave nodes fail to connect to the master node.

I'm running the Scala code below as a spark-submit step on the EMR cluster, with the class name provided as an argument.

import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkConf, SparkContext}

object App {

  def main(args: Array[String]): Unit = {
    // Set the Spark logging level to ERROR
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)

    // Input path (an S3 URI here; an HDFS path would work the same way)
    val myInput = "s3://texfile.txt"
    val conf = new SparkConf().setAppName("Word Count")
    val sc = new SparkContext(conf)

    // Read the file with at least 2 partitions and cache it, since it is scanned twice
    val inputData = sc.textFile(myInput, 2).cache()

    // Count the lines containing 'islands' and the lines containing 'the'
    val wordA = inputData.filter(line => line.contains("islands")).count()
    val wordB = inputData.filter(line => line.contains("the")).count()
    println("Number of lines with word 'islands'  %s".format(wordA))
    println("Number of lines with word 'the'  %s".format(wordB))

    // Release cluster resources once the job is done
    sc.stop()
  }

}

LOGS:

YARN RESOURCE MANAGER

2018-07-17 12:40:21,188 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl (AsyncDispatcher event handler): Application application_1531831066208_0001 failed 2 times due to AM Container for appattempt_1531831066208_0001_000002 exited with  exitCode: 13
Failing this attempt.Diagnostics: Exception from container-launch.
Container id: container_1531831066208_0001_02_000001
Exit code: 13
Stack trace: ExitCodeException exitCode=13:
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:972)
        at org.apache.hadoop.util.Shell.run(Shell.java:869)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1170)
        at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:236)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:305)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:84)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)


Container exited with a non-zero exit code 13
For more detailed output, check the application tracking page: http://ip-10-162-1-222.ec2.internal:8088/cluster/app/application_1531831066208_0001 Then click on links to logs of each attempt.
. Failing the application.

SLAVE NODE

2018-07-17 12:37:29,827 INFO org.apache.hadoop.ipc.Client (main): Retrying connect to server: ip-10-162-1-222.ec2.internal/10.162.1.222:8025. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-07-17 12:37:30,828 INFO org.apache.hadoop.ipc.Client (main): Retrying connect to server: ip-10-162-1-222.ec2.internal/10.162.1.222:8025. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-07-17 12:37:31,829 INFO org.apache.hadoop.ipc.Client (main): Retrying connect to server: ip-10-162-1-222.ec2.internal/10.162.1.222:8025. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-07-17 12:37:32,830 INFO org.apache.hadoop.ipc.Client (main): Retrying connect to server: ip-10-162-1-222.ec2.internal/10.162.1.222:8025. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-07-17 12:37:33,831 INFO org.apache.hadoop.ipc.Client (main): Retrying connect to server: ip-10-162-1-222.ec2.internal/10.162.1.222:8025. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-07-17 12:37:34,832 INFO org.apache.hadoop.ipc.Client (main): Retrying connect to server: ip-10-162-1-222.ec2.internal/10.162.1.222:8025. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-07-17 12:37:35,832 INFO org.apache.hadoop.ipc.Client (main): Retrying connect to server: ip-10-162-1-222.ec2.internal/10.162.1.222:8025. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-07-17 12:37:36,833 INFO org.apache.hadoop.ipc.Client (main): Retrying connect to server: ip-10-162-1-222.ec2.internal/10.162.1.222:8025. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-07-17 12:37:37,834 INFO org.apache.hadoop.ipc.Client (main): Retrying connect to server: ip-10-162-1-222.ec2.internal/10.162.1.222:8025. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-07-17 12:37:38,835 INFO org.apache.hadoop.ipc.Client (main): Retrying connect to server: ip-10-162-1-222.ec2.internal/10.162.1.222:8025. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-07-17 12:37:38,836 WARN org.apache.hadoop.ipc.Client (main): Failed to connect to server: ip-10-162-1-222.ec2.internal/10.162.1.222:8025: retries get failed due to exceeded maximum allowed retries number: 10
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
        at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
        at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:685)
        at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:788)
        at org.apache.hadoop.ipc.Client$Connection.access$3500(Client.java:410)
        at org.apache.hadoop.ipc.Client.getConnection(Client.java:1550)
        at org.apache.hadoop.ipc.Client.call(Client.java:1381)
        at org.apache.hadoop.ipc.Client.call(Client.java:1345)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:227)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
        at com.sun.proxy.$Proxy75.registerNodeManager(Unknown Source)
        at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.registerNodeManager(ResourceTrackerPBClientImpl.java:73)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:409)
        at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:163)
        at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:155)
        at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:346)
        at com.sun.proxy.$Proxy76.registerNodeManager(Unknown Source)

I can ping the slave nodes from the master, and vice versa, when using IP addresses only. I suspect the issue has to do with DNS lookups failing, but I'm unsure how to debug this any further, and I can't find any relevant settings when creating the EMR cluster. Any help with this issue is greatly appreciated.
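
Since ping only shows ICMP reachability, one possible next step is a TCP-level check against the host and port from the slave log (ip-10-162-1-222.ec2.internal, port 8025). The following is a minimal sketch, not a definitive diagnostic: `PortCheck`, `canConnect`, and the 2-second default timeout are hypothetical names and values chosen for illustration. Passing the hostname rather than the IP also exercises DNS resolution, since an unresolvable name makes the connect attempt fail.

```scala
import java.net.{InetSocketAddress, Socket}
import scala.util.Try

object PortCheck {
  // Hypothetical helper: returns true if a TCP connection to host:port
  // succeeds within timeoutMs, false on refusal, timeout, or DNS failure.
  def canConnect(host: String, port: Int, timeoutMs: Int = 2000): Boolean = {
    val socket = new Socket()
    try Try(socket.connect(new InetSocketAddress(host, port), timeoutMs)).isSuccess
    finally socket.close()
  }
}
```

Run from a slave node, `PortCheck.canConnect("ip-10-162-1-222.ec2.internal", 8025)` would indicate whether anything is accepting connections on that port at that moment; comparing the result with the same call using the raw IP address separates a DNS problem from a closed port.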

Asked 6 years ago · 1527 views
1 Answer

It turns out the error message was misleading. I had a typo in the argument that specified the main class for my jar. Instead of reporting an error pointing in that direction, YARN just failed with exit code 13. If you run into this, it's worth checking the little things when nothing else makes sense.
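
One way to catch this kind of typo before submitting the step is to confirm that the jar actually contains the class named in the main-class argument. The following is a rough sketch under stated assumptions: `MainClassCheck`, `classEntryPath`, and `jarContainsClass` are hypothetical helper names, not part of Spark or EMR, and the check only looks for a matching `.class` entry in the jar (it does not verify that the class has a `main` method).

```scala
import java.util.jar.JarFile

object MainClassCheck {
  // Convert a fully qualified class name to its entry path inside a jar,
  // e.g. "com.example.App" becomes "com/example/App.class".
  def classEntryPath(className: String): String =
    className.replace('.', '/') + ".class"

  // Return true if the jar at jarPath has a class file entry for className.
  def jarContainsClass(jarPath: String, className: String): Boolean = {
    val jar = new JarFile(jarPath)
    try jar.getEntry(classEntryPath(className)) != null
    finally jar.close()
  }
}
```

For example, calling `MainClassCheck.jarContainsClass` with the path to the application jar and the exact string passed as the class argument returns false for a misspelled class name, which is cheaper to discover locally than through a failed YARN attempt.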

Answered 6 years ago
