I'm attempting to run a basic word count program on an EMR cluster as a PoC using Spark and YARN. The step fails immediately, apparently because the slave nodes cannot reach the master: YARN fails to create any containers, and the slave nodes fail to connect to the master node.
I'm running the Scala code below as a spark-submit step in the EMR cluster, with the class name provided as an argument.
import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkConf, SparkContext}

object App {
  def main(args: Array[String]): Unit = {
    // Set the Spark logging level to ERROR
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
    // Input text file (an S3 path here, not local HDFS)
    val myInput = "s3://texfile.txt"
    val conf = new SparkConf().setAppName("Word Count")
    val sc = new SparkContext(conf)
    // Read with at least 2 partitions and cache, since the data is scanned twice
    val inputData = sc.textFile(myInput, 2).cache()
    // Count the lines containing 'islands' and the lines containing 'the'
    val wordA = inputData.filter(line => line.contains("islands")).count()
    val wordB = inputData.filter(line => line.contains("the")).count()
    println("Number of lines with word 'islands' %s".format(wordA))
    println("Number of lines with word 'the' %s".format(wordB))
    // Release cluster resources once the counts are printed
    sc.stop()
  }
}
LOGS:
YARN RESOURCE MANAGER
2018-07-17 12:40:21,188 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl (AsyncDispatcher event handler): Application application_1531831066208_0001 failed 2 times due to AM Container for appattempt_1531831066208_0001_000002 exited with exitCode: 13
Failing this attempt.Diagnostics: Exception from container-launch.
Container id: container_1531831066208_0001_02_000001
Exit code: 13
Stack trace: ExitCodeException exitCode=13:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:972)
at org.apache.hadoop.util.Shell.run(Shell.java:869)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1170)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:236)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:305)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:84)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Container exited with a non-zero exit code 13
For more detailed output, check the application tracking page: http://ip-10-162-1-222.ec2.internal:8088/cluster/app/application_1531831066208_0001 Then click on links to logs of each attempt.
. Failing the application.
SLAVE NODE
2018-07-17 12:37:29,827 INFO org.apache.hadoop.ipc.Client (main): Retrying connect to server: ip-10-162-1-222.ec2.internal/10.162.1.222:8025. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-07-17 12:37:30,828 INFO org.apache.hadoop.ipc.Client (main): Retrying connect to server: ip-10-162-1-222.ec2.internal/10.162.1.222:8025. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-07-17 12:37:31,829 INFO org.apache.hadoop.ipc.Client (main): Retrying connect to server: ip-10-162-1-222.ec2.internal/10.162.1.222:8025. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-07-17 12:37:32,830 INFO org.apache.hadoop.ipc.Client (main): Retrying connect to server: ip-10-162-1-222.ec2.internal/10.162.1.222:8025. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-07-17 12:37:33,831 INFO org.apache.hadoop.ipc.Client (main): Retrying connect to server: ip-10-162-1-222.ec2.internal/10.162.1.222:8025. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-07-17 12:37:34,832 INFO org.apache.hadoop.ipc.Client (main): Retrying connect to server: ip-10-162-1-222.ec2.internal/10.162.1.222:8025. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-07-17 12:37:35,832 INFO org.apache.hadoop.ipc.Client (main): Retrying connect to server: ip-10-162-1-222.ec2.internal/10.162.1.222:8025. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-07-17 12:37:36,833 INFO org.apache.hadoop.ipc.Client (main): Retrying connect to server: ip-10-162-1-222.ec2.internal/10.162.1.222:8025. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-07-17 12:37:37,834 INFO org.apache.hadoop.ipc.Client (main): Retrying connect to server: ip-10-162-1-222.ec2.internal/10.162.1.222:8025. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-07-17 12:37:38,835 INFO org.apache.hadoop.ipc.Client (main): Retrying connect to server: ip-10-162-1-222.ec2.internal/10.162.1.222:8025. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-07-17 12:37:38,836 WARN org.apache.hadoop.ipc.Client (main): Failed to connect to server: ip-10-162-1-222.ec2.internal/10.162.1.222:8025: retries get failed due to exceeded maximum allowed retries number: 10
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:685)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:788)
at org.apache.hadoop.ipc.Client$Connection.access$3500(Client.java:410)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1550)
at org.apache.hadoop.ipc.Client.call(Client.java:1381)
at org.apache.hadoop.ipc.Client.call(Client.java:1345)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:227)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
at com.sun.proxy.$Proxy75.registerNodeManager(Unknown Source)
at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.registerNodeManager(ResourceTrackerPBClientImpl.java:73)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:409)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:163)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:155)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:346)
at com.sun.proxy.$Proxy76.registerNodeManager(Unknown Source)
I can ping the slave nodes from the master, and vice versa, when using the IP addresses directly. I suspect the issue has to do with DNS lookups, but I'm unsure how to debug this any further and can't find any relevant settings when creating the EMR cluster. Any help with this issue is greatly appreciated.
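Since ping only exercises ICMP, a check along the following lines might help separate name resolution from TCP reachability on the port the slave node keeps retrying in the log above (8025). This is only a minimal sketch, not a diagnosis: `ConnectivityCheck`, `resolve`, and `canConnect` are hypothetical names made up for illustration, and the default host and port are simply copied from the slave node log.

```scala
import java.net.{InetAddress, InetSocketAddress, Socket}
import scala.util.{Failure, Success, Try}

// Hypothetical helper (not part of Spark, Hadoop, or EMR) meant to be run from a slave node.
object ConnectivityCheck {

  // Resolve a hostname to its IP addresses; a Failure means the lookup itself failed.
  def resolve(host: String): Try[Seq[String]] =
    Try(InetAddress.getAllByName(host).toSeq.map(_.getHostAddress))

  // Attempt a plain TCP connection; a Failure here while resolve succeeds
  // points at the port or firewall rather than at DNS.
  def canConnect(host: String, port: Int, timeoutMs: Int = 2000): Try[Unit] =
    Try {
      val socket = new Socket()
      try socket.connect(new InetSocketAddress(host, port), timeoutMs)
      finally socket.close()
    }

  def main(args: Array[String]): Unit = {
    // Defaults are assumptions taken from the slave node log above
    val host = args.headOption.getOrElse("ip-10-162-1-222.ec2.internal")
    val port = if (args.length > 1) args(1).toInt else 8025

    resolve(host) match {
      case Success(addrs) => println(s"$host resolves to ${addrs.mkString(", ")}")
      case Failure(e)     => println(s"DNS lookup for $host failed: $e")
    }
    canConnect(host, port) match {
      case Success(_) => println(s"TCP connection to $host:$port succeeded")
      case Failure(e) => println(s"TCP connection to $host:$port failed: $e")
    }
  }
}
```

Running it once with the master's hostname and once with its raw IP address (passed as the first argument) would show whether the two behave differently.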