Accessing Spark Web UI for Interactive Endpoints in EMR on EKS

Content level: Advanced

This article provides guidance on configuring and accessing the Spark application UI for interactive endpoints used from either self-hosted notebooks or EMR Studio managed notebooks.

At the time of writing (May 2024), there is no direct way to get an endpoint URL for the Spark UI in self-hosted notebooks deployed with EMR on EKS, unlike the URL that EMR Studio provides for notebooks based on EMR on EC2 or EMR Serverless.

To access the Spark UI from a notebook that runs on EMR on EKS, follow the steps outlined below.

After you create the notebook for interactive workloads, either through the EMR managed endpoint or with the self-hosted notebook approach, attach the interactive endpoint to connect the notebook to the EKS cluster.


When you launch the kernel and start the Spark session, the output does not include a direct URL for opening the Spark UI.


When you enter the spark command in a cell, it returns a Spark UI link. However, this link points to the driver pod, which cannot be reached directly.


Port-forwarding and SSH tunneling

Next, to reach the Spark UI, connect to an edge node, or to any node that can communicate with your EMR on EKS cluster, and run the following command to list the pods, including the running driver pod. In this example, I connected over SSH to a Cloud9 EC2 instance that serves as my edge node. The running pods include jupyter-notebook, which is the self-hosted notebook pod, and the driver pod kdeca319e-c99a-4a1f-bf41-a260cbe1f222-c4c9268f3dc00844-driver.

:~ $ kubectl get pods -n emr-eks-workshop-namespace -w
NAME                                                            READY   STATUS    RESTARTS   AGE
jeg-2rah4axbitmz0-76955cbffc-rgvrb                              1/1     Running   0          23m
jeg-uawxnb4543fr1-9bdd46cc6-ld47g                               2/2     Running   0          14h
jupyter-notebook                                                1/1     Running   0          14h
kdeca319e-c99a-4a1f-bf41-a260cbe1f222-9dd0368f3dc0275e-exec-1   1/1     Running   0          5m15s
kdeca319e-c99a-4a1f-bf41-a260cbe1f222-9dd0368f3dc0275e-exec-2   1/1     Running   0          5m15s
kdeca319e-c99a-4a1f-bf41-a260cbe1f222-c4c9268f3dc00844-driver   2/2     Running   0          5m23s
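
If you prefer not to copy the driver pod name by hand, the listing above can be filtered for it. The following is a minimal sketch and not part of the original steps: find_driver_pods is a hypothetical helper name, and it assumes that the driver pod name ends in -driver and that the pod is in the Running state, as in the output above.

```shell
# Hypothetical helper: print the names of Running pods whose name ends in "-driver".
# It reads `kubectl get pods` output on stdin; the header row never matches the pattern.
find_driver_pods() {
  awk '$1 ~ /-driver$/ && $3 == "Running" { print $1 }'
}

# Example usage against the namespace used in this article (requires cluster access):
#   kubectl get pods -n emr-eks-workshop-namespace | find_driver_pods
```

The -w (watch) flag is left out of the usage line so that the command returns instead of streaming updates.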

After you obtain the driver pod name from the output above, run the following port-forwarding command, which forwards port 4040 of the driver pod to the edge node. I have also included the optional --address parameter, set to 0.0.0.0, which makes port 4040 on the edge node reachable from all IP addresses. Make sure to include the namespace if the pods were created in a specific namespace for EMR. In my case, the namespace is emr-eks-workshop-namespace.

kubectl -n emr-eks-workshop-namespace port-forward kdeca319e-c99a-4a1f-bf41-a260cbe1f222-c4c9268f3dc00844-driver 4040:4040 --address 0.0.0.0
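
Because the SSH tunnel described in this article connects to localhost on the edge node, binding the forwarded port to the loopback address may be sufficient in that setup. The sketch below assembles the port-forward command with the namespace, pod name, port, and listen address as parameters. build_port_forward_cmd is a hypothetical helper, and the 127.0.0.1 default is an assumption chosen for illustration; pass 0.0.0.0 as the fourth argument to reproduce the command above.

```shell
# Hypothetical helper: print a kubectl port-forward command for a Spark driver pod.
# Arguments: namespace, pod name, optional port (default 4040),
# optional listen address (default 127.0.0.1, an assumption for this sketch).
build_port_forward_cmd() {
  pf_namespace="$1"
  pf_pod="$2"
  pf_port="${3:-4040}"
  pf_address="${4:-127.0.0.1}"
  printf 'kubectl -n %s port-forward %s %s:%s --address %s\n' \
    "$pf_namespace" "$pf_pod" "$pf_port" "$pf_port" "$pf_address"
}

# Print the command for review; run the printed line on the edge node to start forwarding.
build_port_forward_cmd emr-eks-workshop-namespace \
  kdeca319e-c99a-4a1f-bf41-a260cbe1f222-c4c9268f3dc00844-driver
```

Printing the command rather than running it keeps the sketch free of side effects and lets you check the pod name before forwarding starts.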

Because I connect to the edge node from my local machine over SSH, I also open an SSH tunnel for port 4040. With the tunnel in place, the Spark UI is directly accessible from my web browser at http://localhost:4040.

ssh -i testemr.pem -N -L 4040:localhost:4040 ec2-user@<Ip-address-of-Edge-node>
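
Once the tunnel is open, one way to confirm that the Spark UI is reachable before switching to the browser is to query it from the local machine. The sketch below assumes that curl is installed locally and that the driver serves Spark's monitoring REST API under /api/v1 on the same port as the UI. spark_ui_url and check_spark_ui are hypothetical helper names.

```shell
# Hypothetical helper: build the local URL of the tunneled Spark UI.
# Argument: optional local port (default 4040).
spark_ui_url() {
  printf 'http://localhost:%s\n' "${1:-4040}"
}

# Hypothetical helper: succeed if the Spark REST API answers through the tunnel.
# -s silences progress output, -f makes curl fail on HTTP errors,
# and --max-time bounds the wait if the tunnel is not up.
check_spark_ui() {
  curl -sf --max-time 5 "$(spark_ui_url "$1")/api/v1/applications" > /dev/null
}

# Example usage (requires the port-forward and the SSH tunnel to be running):
#   check_spark_ui 4040 && echo "Spark UI is available at $(spark_ui_url 4040)"
```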



In addition to the steps above, you can enable a self-hosted Spark history server on EMR on EKS, which provides comprehensive information and logs about completed Spark applications.

AWS
SUPPORT ENGINEER
published 3 months ago, 1329 views