I want to view Apache Spark web interfaces that are hosted on Amazon EMR clusters.
The Spark History Server is a web UI where you can view the status of running and completed Spark jobs on your EMR cluster.
The following are common ways to access the Spark UI for clusters in public or private subnets:
- Persistent application user interfaces
- On-cluster application user interfaces
Persistent application user interfaces
In your EMR cluster, the apppusher daemon periodically sends Spark event logs to Amazon EMR production buckets. The persistent Spark UI uses the event logs to display Spark applications.
This feature works when the event log directory for the application is in HDFS. By default, Amazon EMR stores event logs in the /var/log/spark/apps directory of HDFS. If you change the default directory to a different file system, such as Amazon Simple Storage Service (Amazon S3), then this feature doesn't work. For more information, see Considerations and limitations.
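To confirm where a cluster writes its event logs, you can inspect the spark.eventLog.dir property on the primary node. The following is a minimal sketch, not an official tool. It assumes that the Spark defaults file is at /etc/spark/conf/spark-defaults.conf and that keys and values are separated by whitespace. The function name check_event_log_dir is hypothetical.

```shell
# Report whether spark.eventLog.dir in a Spark defaults file is an hdfs:// URI.
# Usage: check_event_log_dir [path-to-spark-defaults.conf]
# Assumption: the default path below is where the cluster keeps its Spark defaults.
check_event_log_dir() {
  conf="${1:-/etc/spark/conf/spark-defaults.conf}"
  if [ ! -r "$conf" ]; then
    echo "Cannot read $conf"
    return 1
  fi
  # Use the last uncommented spark.eventLog.dir entry; the value is the second field.
  dir="$(awk '$1 == "spark.eventLog.dir" { v = $2 } END { print v }' "$conf")"
  case "$dir" in
    "")       echo "spark.eventLog.dir is not set in $conf" ;;
    hdfs://*) echo "HDFS event log directory: $dir" ;;
    *)        echo "Event log directory is not an hdfs:// URI: $dir" ;;
  esac
}
```

Calling check_event_log_dir with no argument reads the assumed default path. A value that isn't an hdfs:// URI, such as an Amazon S3 location, suggests that the persistent UI can't display that application's logs.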
You can access the application history and relevant log files for active and terminated clusters. The logs are available for 30 days after the application ends. For more information, see View persistent application user interfaces.
On-cluster application user interfaces
On-cluster user interfaces are hosted on the primary node and require an SSH connection to the web server.
To access the on-cluster UI, do the following:
1. Connect to the primary node using SSH.
2. Configure SSH tunneling with dynamic port forwarding.
3. Configure your internet browser to use an add-on, such as FoxyProxy for Firefox or SwitchyOmega for Chrome, to manage your SOCKS proxy settings.
A proxy management add-on automatically filters URLs based on text patterns. It also limits the proxy settings to domains that match the form of the primary node's DNS name.
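The dynamic port forwarding in step 2 can be sketched as follows. This is an illustrative example under stated assumptions: the key file path and DNS name are placeholders, local port 8157 is an arbitrary unused port, and the helper name dynamic_tunnel_cmd is hypothetical. The script only prints the ssh command so that you can review it before you run it.

```shell
# Placeholder values: replace with your key pair file and the primary node's public DNS name.
KEY_FILE="$HOME/mykeypair.pem"
PRIMARY_DNS='ec2-###-##-##-###.compute-1.amazonaws.com'
SOCKS_PORT=8157   # assumed free local port for the SOCKS proxy

# Print the tunnel command instead of running it.
# -N: don't run a remote command; -D: open a local SOCKS proxy on SOCKS_PORT.
dynamic_tunnel_cmd() {
  printf 'ssh -i %s -N -D %s hadoop@%s\n' "$KEY_FILE" "$SOCKS_PORT" "$PRIMARY_DNS"
}

dynamic_tunnel_cmd
```

After you run the printed command, the terminal stays open without returning a prompt while the tunnel is active. Point the browser add-on's SOCKS proxy at localhost and the port that you chose.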
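Or, to use local port forwarding instead of dynamic port forwarding, run a command similar to the following example. The command forwards local port 8157 to port 18080 on the primary node, which is the default port for the Spark History Server. While the tunnel is open, the UI is available at http://localhost:8157.
ssh -i ~/mykeypair.pem -N -L 8157:ec2-###-##-##-###.compute-1.amazonaws.com:18080 hadoop@ec2-###-##-##-###.compute-1.amazonaws.com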
For more information, see Option 1: Set up an SSH tunnel to the primary node using local port forwarding.
An on-cluster UI in a private subnet isn't directly accessible unless you connect from a local network through a VPN connection or AWS Direct Connect. You must also configure routing so that traffic can pass between the AWS and local networks.
Or, you can connect to a private subnet through a bastion or jump server that's hosted in a public subnet. Then, set up an SSH tunnel with dynamic port forwarding.
For more information, see Securely access web interfaces on Amazon EMR launched in a private subnet.
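One way to combine the bastion host and dynamic port forwarding described above is the OpenSSH -J (ProxyJump) option. The following is a sketch under several assumptions: the OpenSSH client is version 7.3 or later, the bastion login ec2-user@bastion.example.com and the private DNS name are hypothetical placeholders, and the helper name bastion_tunnel_cmd is made up for illustration. The -i option applies to the destination host, so the key for the bastion must be available through ssh-agent or your ~/.ssh/config file. The script only prints the command.

```shell
# Placeholder values: replace with your own key file, bastion login, and primary node private DNS name.
KEY_FILE="$HOME/mykeypair.pem"
BASTION='ec2-user@bastion.example.com'               # hypothetical bastion user and host
PRIMARY_PRIVATE_DNS='ip-###-##-##-###.ec2.internal'  # placeholder private DNS name
SOCKS_PORT=8157                                      # assumed free local port

# Print the tunnel command instead of running it.
# -J: connect to the primary node by first jumping through the bastion host.
# -D: open a local SOCKS proxy on SOCKS_PORT; -N: don't run a remote command.
bastion_tunnel_cmd() {
  printf 'ssh -i %s -N -D %s -J %s hadoop@%s\n' \
    "$KEY_FILE" "$SOCKS_PORT" "$BASTION" "$PRIMARY_PRIVATE_DNS"
}

bastion_tunnel_cmd
```

With the tunnel open, the same browser proxy add-on configuration used for a public subnet applies, because the SOCKS proxy still listens on your local machine.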