Hi Joshua,
Thank you for posting your question on re:Post. There are a few potential causes of this behavior. The likely reasons are listed below, followed by ways to troubleshoot and fix each one.
- Different credential chains: Spark uses Hadoop's S3A filesystem (Java-based) which has a different credential provider chain than boto3 (Python-based)
- Windows path mounting issues: The ~/.aws volume mount on Windows may not correctly map to the container's expected location
- Missing Hadoop S3 configuration: Spark needs explicit Hadoop configuration to use the credentials properly
Option 1: Explicit Spark Configuration (Recommended for Windows)
Configure Spark to use the correct AWS credential provider when initializing your SparkContext:
from pyspark import SparkConf, SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Create Spark configuration with S3A settings
conf = SparkConf()
conf.set("spark.hadoop.fs.s3a.aws.credentials.provider", "com.amazonaws.auth.profile.ProfileCredentialsProvider")
conf.set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

# If using a specific profile (the AWS SDK's ProfileCredentialsProvider also honors the AWS_PROFILE environment variable)
conf.set("spark.hadoop.fs.s3a.aws.credentials.provider.profile.name", "your_profile_name")

# Initialize contexts with configuration
sc = SparkContext(conf=conf)
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
Option 2: Fix Windows Credential Mounting
On Windows, the ~/.aws path may not resolve correctly. Use explicit paths:
PowerShell:
docker run -it `
  -v $env:USERPROFILE\.aws:/home/glue_user/.aws:ro `
  -e AWS_PROFILE=your_profile_name `
  -e DISABLE_SSL=true `
  --rm `
  -p 4040:4040 -p 18080:18080 `
  --name glue_pyspark `
  amazon/aws-glue-libs:glue_libs_3.0.0_image_01 `
  pyspark
CMD:
docker run -it ^
  -v %USERPROFILE%\.aws:/home/glue_user/.aws:ro ^
  -e AWS_PROFILE=your_profile_name ^
  -e DISABLE_SSL=true ^
  --rm ^
  -p 4040:4040 -p 18080:18080 ^
  --name glue_pyspark ^
  amazon/aws-glue-libs:glue_libs_3.0.0_image_01 ^
  pyspark
Key changes:
- Use $env:USERPROFILE (PowerShell) or %USERPROFILE% (CMD) instead of ~
- Add the :ro (read-only) flag to the volume mount
- Ensure both the credentials and config files exist in your .aws directory
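As a cross-platform alternative to writing the mount path by hand, the same command can be assembled in Python, where Path.home() resolves the user directory on Windows, macOS, and Linux alike. This is only a sketch: the helper name build_glue_docker_cmd is hypothetical, and the image tag, ports, and flags are copied unchanged from the commands above.

```python
from pathlib import Path


def build_glue_docker_cmd(profile, aws_dir=None):
    """Return the docker argv for the Glue PySpark container (hypothetical helper)."""
    # Path.home() resolves the user's home directory on every platform,
    # so no shell-specific expansion of ~ is involved.
    aws_dir = Path(aws_dir) if aws_dir is not None else Path.home() / ".aws"
    return [
        "docker", "run", "-it",
        # Mount the credentials directory read-only at the container path used above.
        "-v", f"{aws_dir}:/home/glue_user/.aws:ro",
        "-e", f"AWS_PROFILE={profile}",
        "-e", "DISABLE_SSL=true",
        "--rm",
        "-p", "4040:4040", "-p", "18080:18080",
        "--name", "glue_pyspark",
        "amazon/aws-glue-libs:glue_libs_3.0.0_image_01",
        "pyspark",
    ]


if __name__ == "__main__":
    # Print the command for inspection before running it.
    print(" ".join(build_glue_docker_cmd("your_profile_name")))
```

To launch the container, the returned list could be passed to subprocess.run(), which avoids shell quoting differences between PowerShell and CMD.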
Option 3: Use Environment Variables
Pass credentials directly as environment variables (less secure, but useful for troubleshooting):
docker run -it `
  -e AWS_ACCESS_KEY_ID=your_access_key `
  -e AWS_SECRET_ACCESS_KEY=your_secret_key `
  -e AWS_DEFAULT_REGION=us-east-1 `
  -e DISABLE_SSL=true `
  --rm `
  -p 4040:4040 -p 18080:18080 `
  --name glue_pyspark `
  amazon/aws-glue-libs:glue_libs_3.0.0_image_01 `
  pyspark
Then configure Spark to use environment variable credentials:
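from pyspark import SparkConf, SparkContext
from awsglue.context import GlueContext

# Read credentials from AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY
conf = SparkConf()
conf.set("spark.hadoop.fs.s3a.aws.credentials.provider", "com.amazonaws.auth.EnvironmentVariableCredentialsProvider")
conf.set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

sc = SparkContext(conf=conf)
glueContext = GlueContext(sc)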
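Before creating the SparkContext, it can help to confirm that the variables passed with -e actually reached the Python process inside the container. The following is a minimal sketch using only the standard library; the helper name missing_credential_vars is hypothetical, and it checks only the two variable names used in the docker command above.

```python
import os

REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_credential_vars(env=None):
    """Return the names of required credential variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_credential_vars()
if missing:
    print("Not set in this process:", ", ".join(missing))
else:
    print("Both credential variables are set.")
```

If a variable is reported as missing, the problem is in how the container was started rather than in the Spark configuration.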
Option 4: Combined Credential Provider Chain
Use multiple credential providers as fallback:
from pyspark import SparkConf, SparkContext
from awsglue.context import GlueContext

conf = SparkConf()
conf.set("spark.hadoop.fs.s3a.aws.credentials.provider",
         "com.amazonaws.auth.EnvironmentVariableCredentialsProvider,"
         "com.amazonaws.auth.profile.ProfileCredentialsProvider,"
         "com.amazonaws.auth.InstanceProfileCredentialsProvider")
conf.set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

sc = SparkContext(conf=conf)
glueContext = GlueContext(sc)
spark = glueContext.spark_session
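Because the provider chain is a single comma-separated string, a stray space or a missing comma is easy to introduce. One way to reduce that risk is to build the value from a list. The sketch below uses only the standard library; the helper name s3a_settings is hypothetical, and the property names and provider classes are the ones shown above.

```python
S3A_IMPL = "org.apache.hadoop.fs.s3a.S3AFileSystem"


def s3a_settings(providers):
    """Build the Spark properties for an ordered S3A credential provider chain."""
    if not providers:
        raise ValueError("at least one credential provider is required")
    return {
        # The list order is preserved, so the first provider is the first fallback tried.
        "spark.hadoop.fs.s3a.aws.credentials.provider": ",".join(providers),
        "spark.hadoop.fs.s3a.impl": S3A_IMPL,
    }


settings = s3a_settings([
    "com.amazonaws.auth.EnvironmentVariableCredentialsProvider",
    "com.amazonaws.auth.profile.ProfileCredentialsProvider",
    "com.amazonaws.auth.InstanceProfileCredentialsProvider",
])
for key, value in settings.items():
    print(f"{key}={value}")
```

Inside the container, each pair can then be applied with conf.set(key, value) before the SparkContext is created.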
Verification Steps
1. Verify Credentials Are Mounted
Inside the container, check if credentials are accessible:
docker exec -it glue_pyspark bash
ls -la /home/glue_user/.aws/
cat /home/glue_user/.aws/credentials
cat /home/glue_user/.aws/config
2. Test boto3 Access (Python SDK)
import boto3

s3 = boto3.client('s3')
response = s3.list_objects_v2(Bucket='your-bucket', Prefix='your-prefix/')
print(response)
If this works but Spark doesn't, it confirms the credential provider issue.
3. Test Spark S3 Access
# After applying the Spark configuration above
df = spark.read.json("s3://bucket_name/path/to/folder/*.json")
df.show()
Common Windows-Specific Issues
Issue 1: Docker Desktop Linux Container Mode
Ensure Docker Desktop is running in Linux container mode (not Windows containers):
- Right-click Docker Desktop icon → "Switch to Linux containers"
Issue 2: File Permissions
Windows file permissions may not translate correctly to Linux containers:
- Ensure the .aws/credentials and .aws/config files are readable
- Try mounting as read-only with the :ro flag
Issue 3: Path Separators
Windows uses backslashes, but Docker expects forward slashes:
- Use forward slashes in S3 paths: s3://bucket/path/file.json
- Avoid backslashes in any path configuration
Issue 4: AWS Config File Missing
The Glue container expects both credentials and config files:
~/.aws/credentials:
[default]
aws_access_key_id = YOUR_ACCESS_KEY
aws_secret_access_key = YOUR_SECRET_KEY
~/.aws/config:
[default]
region = us-east-1
output = json
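The two files can also be checked programmatically from inside the container. The following is a rough sketch, assuming the files use the INI layout shown above; the helper name check_aws_dir is hypothetical, and it only verifies that the files exist and that the chosen profile has both keys, not that the keys are valid.

```python
import configparser
from pathlib import Path


def check_aws_dir(aws_dir, profile="default"):
    """Return a list of problems found in an AWS config directory (empty if none)."""
    aws_dir = Path(aws_dir)
    problems = []

    creds_path = aws_dir / "credentials"
    if not creds_path.is_file():
        problems.append(f"missing file: {creds_path}")
    else:
        # Disable interpolation so special characters in values are read literally.
        creds = configparser.ConfigParser(interpolation=None)
        creds.read(creds_path)
        if not creds.has_section(profile):
            problems.append(f"profile [{profile}] not found in {creds_path}")
        else:
            for key in ("aws_access_key_id", "aws_secret_access_key"):
                if not creds.get(profile, key, fallback=""):
                    problems.append(f"{key} not set for [{profile}]")

    config_path = aws_dir / "config"
    if not config_path.is_file():
        problems.append(f"missing file: {config_path}")

    return problems


if __name__ == "__main__":
    # Container path taken from the volume mount used in the docker commands.
    for problem in check_aws_dir("/home/glue_user/.aws"):
        print(problem)
```

An empty result means both files were found and the profile has both keys; it says nothing about whether Spark is configured to read them.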
Complete Working Example
import sys
from pyspark import SparkConf, SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

# Configure Spark with proper S3A credentials
conf = SparkConf()
conf.set("spark.hadoop.fs.s3a.aws.credentials.provider", "com.amazonaws.auth.profile.ProfileCredentialsProvider")
conf.set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

# Initialize Glue context
sc = SparkContext(conf=conf)
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Get job parameters
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Now S3 access should work
file_name = "s3://bucket_name/path/to/folder/*.json"
df = spark.read.json(file_name)
df.show()

# Or using Glue catalog
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="database_name",
    table_name="table_name"
)
dyf.printSchema()

job.commit()
Why boto3 Works But Spark Doesn't
- boto3 uses the AWS SDK for Python, which directly reads from ~/.aws/credentials
- Spark uses Hadoop's S3A filesystem (Java-based), which requires explicit configuration to use the same credentials
- The credential provider chain must be explicitly configured for Hadoop/Spark to recognize mounted AWS credentials
The AWS Glue Docker container uses Hadoop's S3 filesystem implementations, and there are three different S3 URI schemes that map to different filesystem implementations:
| URI scheme | Filesystem class | Notes |
|---|---|---|
| s3:// | EmrFileSystem | Used by AWS Glue in production, not fully functional in local Docker |
| s3a:// | S3AFileSystem | Hadoop-native, works in local Docker |
| s3n:// | NativeS3FileSystem | Legacy, deprecated |
When you use spark.read.json("s3://...") or the Glue Data Catalog returns an S3 location starting with s3://, the container tries to use EMRFS, which requires internal AWS infrastructure that doesn't exist in a local Docker container. The EMRFS implementation fails to parse the bucket name from the URI, resulting in the bucket is null/empty error.
boto3 works fine because it uses the AWS SDK directly and doesn't go through Hadoop's filesystem abstraction at all.
You can fix this in either of two ways. To map s3:// to s3a:// through the Hadoop configuration, add this line before the SparkContext() call:
conf.set("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
Alternatively, if you only care about direct reads like the one in your example, change the URI scheme in your code:
# Instead of:
df = spark.read.json("s3://bucket_name/path/to/folder/*.json")
# Use:
df = spark.read.json("s3a://bucket_name/path/to/folder/*.json")
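If many paths in a script need the same change, the scheme can be rewritten in one place. This is a small sketch using only the standard library; the helper name to_s3a is hypothetical. It only affects paths you pass in yourself, so locations returned by the Glue Data Catalog would still use s3:// and would still need the fs.s3.impl mapping.

```python
import re

# Matches a leading s3:// or s3n:// scheme, but not s3a://.
_S3_SCHEME = re.compile(r"^s3n?://", re.IGNORECASE)


def to_s3a(uri):
    """Rewrite an s3:// or s3n:// URI to s3a://; other URIs are returned unchanged."""
    return _S3_SCHEME.sub("s3a://", uri, count=1)


print(to_s3a("s3://bucket_name/path/to/folder/*.json"))
# s3a://bucket_name/path/to/folder/*.json
```

A read would then look like spark.read.json(to_s3a(file_name)), leaving the original s3:// strings untouched elsewhere in the script.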

Hi Jen,
Thank you for your answer. I tried the recommended solutions, but unfortunately they did not work for me, even after setting the Spark configurations and checking for mounting errors. boto3 listed the files successfully, but when I tried to access the Glue Data Catalog I received the same error as before: "WARN FileSystem: Failed to initialize filesystem s3://bucket_name/path/to/folder: java.lang.IllegalArgumentException: bucket is null/empty"