How do I set up a Spark History Server locally to view AWS Glue job Spark UI logs without using Docker?
I want to view the Spark UI for my AWS Glue job runs, but I cannot use Docker on my local machine. I need an alternative way to run the Apache Spark History Server natively on macOS to read Spark event logs stored in Amazon Simple Storage Service (Amazon S3).
Short description
AWS Glue produces Spark UI event logs that help you monitor and debug Spark applications. The standard approach runs the Spark History Server in a Docker container, but Docker is not always available because of organizational restrictions or environment constraints. Instead, you can run the Spark History Server natively: install a Java Development Kit (JDK), download Apache Spark directly, and configure the history server to read event logs from Amazon S3 through the Hadoop S3A filesystem client.
Resolution
Prerequisites
Before you begin, confirm the following:
- You have the AWS Command Line Interface (AWS CLI) installed and configured with credentials that have read access to the S3 bucket containing your Spark event logs.
- You have Homebrew installed on macOS, or the ability to download files using curl.
- Your AWS Glue job has Spark UI enabled with event logs written to an S3 path.
Step 1: Enable Spark UI on your AWS Glue job
If you have not already enabled Spark UI logging, add the following job parameters to your AWS Glue job configuration:
--enable-spark-ui true
--spark-event-logs-path s3://your-bucket-name/sparkHistoryLogs/
Note: Replace your-bucket-name with the name of your S3 bucket and adjust the path as needed.
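If you want to confirm that these parameters are set on an existing job, you can inspect the job's DefaultArguments. The snippet below is a sketch that parses an embedded sample of the `aws glue get-job` response so it runs offline; in practice you would pipe in the real output of `aws glue get-job --job-name <your-job>` instead, and `my-glue-job` style values here are hypothetical:

```shell
# Parse a saved `aws glue get-job` response and print the Spark UI settings.
# SAMPLE below is a stand-in for the real API response.
SAMPLE='{"Job":{"DefaultArguments":{"--enable-spark-ui":"true","--spark-event-logs-path":"s3://your-bucket-name/sparkHistoryLogs/"}}}'
echo "$SAMPLE" | python3 -c '
import json, sys
args = json.load(sys.stdin)["Job"]["DefaultArguments"]
print("Spark UI enabled:", args.get("--enable-spark-ui"))
print("Event logs path: ", args.get("--spark-event-logs-path"))
'
```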
Step 2: Install a Java Development Kit
The Spark History Server requires a compatible Java runtime (Spark 3.5.x supports Java 8, 11, and 17). This walkthrough uses OpenJDK 17, downloaded and extracted to a local directory.
- Run the following command to download OpenJDK 17 for macOS (Apple Silicon):
curl -L -o /tmp/jdk17.tar.gz "https://download.java.net/java/GA/jdk17.0.2/dfd4a8d0985749f896bed50d7138ee7f/8/GPL/openjdk-17.0.2_macos-aarch64_bin.tar.gz"
Note: If you are using an Intel-based Mac, replace macos-aarch64 with macos-x64 in the URL.
- Extract the archive:
tar -xzf /tmp/jdk17.tar.gz -C /tmp/
- Verify the installation:
/tmp/jdk-17.0.2.jdk/Contents/Home/bin/java -version
You should see output similar to:
openjdk version "17.0.2" 2022-01-18
OpenJDK Runtime Environment (build 17.0.2+8-86)
OpenJDK 64-Bit Server VM (build 17.0.2+8-86, mixed mode, sharing)
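The note above about choosing the right archive for your CPU can be automated. This is an unofficial helper sketch that only builds the download URL from `uname -m`; it does not perform the download:

```shell
# Build the OpenJDK 17 download URL for this machine's CPU architecture.
# On Apple Silicon, `uname -m` reports arm64; on Intel Macs, x86_64.
ARCH=$(uname -m)
if [ "$ARCH" = "arm64" ] || [ "$ARCH" = "aarch64" ]; then
  SUFFIX="macos-aarch64"
else
  SUFFIX="macos-x64"
fi
JDK_URL="https://download.java.net/java/GA/jdk17.0.2/dfd4a8d0985749f896bed50d7138ee7f/8/GPL/openjdk-17.0.2_${SUFFIX}_bin.tar.gz"
echo "$JDK_URL"
```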
Step 3: Download Apache Spark
- Download a pre-built Apache Spark distribution with Hadoop support:
curl -L -o /tmp/spark.tgz "https://archive.apache.org/dist/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz"
- Extract the archive:
tar -xzf /tmp/spark.tgz -C /tmp/
Step 4: Add the Hadoop AWS and AWS SDK JARs
The default Spark distribution does not include the JARs required to access Amazon S3 through the Hadoop S3A filesystem. You must download these separately.
- Download the hadoop-aws JAR that matches the bundled Hadoop version (3.3.4):
curl -L -o /tmp/spark-3.5.1-bin-hadoop3/jars/hadoop-aws-3.3.4.jar \
  "https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar"
- Download the AWS Java SDK bundle:
curl -L -o /tmp/spark-3.5.1-bin-hadoop3/jars/aws-java-sdk-bundle-1.12.262.jar \
  "https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.262/aws-java-sdk-bundle-1.12.262.jar"
Important: The hadoop-aws JAR version must match the Hadoop version bundled with your Spark distribution. To check the Hadoop version, run:
ls /tmp/spark-3.5.1-bin-hadoop3/jars/ | grep hadoop-client
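If you want to extract just the version number from that listing, here is a small sketch. The JAR filename below is an example of what the command typically returns for Spark 3.5.1, which bundles Hadoop 3.3.4:

```shell
# Extract the Hadoop version number from a hadoop-client JAR filename.
JAR="hadoop-client-api-3.3.4.jar"   # example output from the ls | grep command above
HADOOP_VERSION=$(printf '%s\n' "$JAR" | sed -E 's/^.*-([0-9]+\.[0-9]+\.[0-9]+)\.jar$/\1/')
echo "$HADOOP_VERSION"   # prints 3.3.4 for this example
```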
Step 5: Configure the Spark History Server to read from Amazon S3
- Create the Spark defaults configuration file:
cat > /tmp/spark-3.5.1-bin-hadoop3/conf/spark-defaults.conf << 'EOF'
spark.history.fs.logDirectory=s3a://your-bucket-name/sparkHistoryLogs/
spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain
spark.hadoop.fs.s3a.endpoint=s3.amazonaws.com
spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
EOF
Note: Replace your-bucket-name/sparkHistoryLogs/ with the S3 path where your Glue job writes Spark event logs. This is the same path you specified in the --spark-event-logs-path job parameter.
The DefaultAWSCredentialsProviderChain setting allows the history server to use your existing AWS CLI credentials from ~/.aws/credentials without hardcoding access keys in the configuration file.
- If your S3 bucket is in a Region other than us-east-1, replace the endpoint line with that Region's endpoint:
spark.hadoop.fs.s3a.endpoint=s3.eu-west-1.amazonaws.com
Note: Replace eu-west-1 with the Region where your S3 bucket is located.
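As a sketch, the endpoint value can be derived from a Region name. This assumes the standard s3.&lt;region&gt;.amazonaws.com naming pattern for commercial AWS Regions:

```shell
# Build the fs.s3a.endpoint value for a given Region.
REGION="eu-west-1"   # replace with your bucket's Region
if [ "$REGION" = "us-east-1" ]; then
  ENDPOINT="s3.amazonaws.com"
else
  ENDPOINT="s3.${REGION}.amazonaws.com"
fi
echo "spark.hadoop.fs.s3a.endpoint=${ENDPOINT}"
```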
Step 6: Start the Spark History Server
- Run the following command to start the history server on port 18080:
JAVA_HOME=/tmp/jdk-17.0.2.jdk/Contents/Home \
SPARK_HISTORY_OPTS="-Dspark.history.ui.port=18080" \
/tmp/spark-3.5.1-bin-hadoop3/sbin/start-history-server.sh
- Verify the server is running by opening a browser and navigating to:
http://localhost:18080
You should see the Spark History Server UI listing your completed and in-progress Glue job runs.
- You can also verify using the REST API:
curl -s http://localhost:18080/api/v1/applications | python3 -m json.tool
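To sketch what that API response looks like and how you might summarize it, the snippet below parses an embedded sample instead of calling the live endpoint; the application ID and name are made up:

```shell
# Summarize a sample /api/v1/applications response from the history server.
SAMPLE='[{"id":"spark-application-123","name":"my-glue-job","attempts":[{"completed":true}]}]'
echo "$SAMPLE" | python3 -c '
import json, sys
for app in json.load(sys.stdin):
    state = "completed" if app["attempts"][0]["completed"] else "in progress"
    print(app["id"], "-", state)
'
```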
Step 7: Stop the Spark History Server
When you are finished reviewing your Spark UI logs, stop the history server:
JAVA_HOME=/tmp/jdk-17.0.2.jdk/Contents/Home \
/tmp/spark-3.5.1-bin-hadoop3/sbin/stop-history-server.sh
Alternative: Read from local event log files
If you prefer not to configure S3 access, you can download the event logs locally and point the history server at a local directory.
- Create a local directory for the event logs:
mkdir -p /tmp/spark-events
- Download the logs from S3:
aws s3 cp s3://your-bucket-name/sparkHistoryLogs/ /tmp/spark-events/ --recursive
- Update the configuration to use the local path:
cat > /tmp/spark-3.5.1-bin-hadoop3/conf/spark-defaults.conf << 'EOF'
spark.history.fs.logDirectory=file:///tmp/spark-events
EOF
- Start the history server using the same command from Step 6. This approach does not require the hadoop-aws or aws-java-sdk-bundle JARs.
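To sanity-check that the downloaded files are Spark event logs, you can look at the first line of a file; each event log starts with a SparkListenerLogStart event. The snippet below writes a stand-in file so it runs without an S3 download; point it at a real downloaded file to check your own logs:

```shell
# Verify an event-log file begins with a SparkListenerLogStart event.
mkdir -p /tmp/spark-events
LOG=/tmp/spark-events/sample-eventlog
# Stand-in for a file downloaded from S3:
printf '{"Event":"SparkListenerLogStart","Spark Version":"3.5.1"}\n' > "$LOG"
head -1 "$LOG" | grep -o '"Event":"[A-Za-z]*"'
```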
Troubleshooting
If the history server starts but shows no applications, verify the following:
- The S3 path in spark.history.fs.logDirectory matches the exact path where your Glue job writes event logs.
- Your AWS CLI credentials have s3:GetObject and s3:ListBucket permissions on the S3 bucket.
- The event log files exist in the S3 path. Run aws s3 ls s3://your-bucket-name/sparkHistoryLogs/ to confirm.
If you see java.lang.ClassNotFoundException: org.apache.hadoop.fs.s3a.S3AFileSystem, the hadoop-aws JAR is missing or the version does not match the Hadoop version in your Spark distribution.
If you see com.amazonaws.SdkClientException: Unable to load AWS credentials, verify that your AWS CLI is configured by running aws sts get-caller-identity.
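The JAR-related failures above can be caught before startup with a quick preflight check. The sketch below simulates the jars/ directory with a temporary folder so it runs anywhere; set SPARK_JARS to /tmp/spark-3.5.1-bin-hadoop3/jars to check your real installation:

```shell
# Preflight: confirm the S3A-related JARs are present in the Spark jars directory.
SPARK_JARS=$(mktemp -d)                     # stand-in for /tmp/spark-3.5.1-bin-hadoop3/jars
touch "$SPARK_JARS/hadoop-aws-3.3.4.jar"    # simulate Step 4's first download
for jar in hadoop-aws aws-java-sdk-bundle; do
  if ls "$SPARK_JARS/${jar}"-*.jar >/dev/null 2>&1; then
    echo "${jar}: present"
  else
    echo "${jar}: MISSING"
  fi
done
```

In this simulated run, hadoop-aws is reported present and aws-java-sdk-bundle is reported missing, because only the first JAR was created.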