bucket null/empty for local glue development

WHAT I'VE TRIED
I followed the AWS documentation for setting up AWS Glue for local development on Windows. Multiple guides suggest enabling "Expose daemon on tcp://localhost:2375 without TLS" in Docker Desktop for Windows, but this did not solve my issue. The AWS profile I'm using has full S3 and Glue access and the necessary KMS key permissions. From inside the Docker container I can list the files in the bucket using boto3, and I can also see the database and table names. When I try either of the following:

file_name = "s3://bucket_name/path/to/folder/*.json"
df = spark.read.json(file_name)

or

dyf = glueContext.create_dynamic_frame.from_catalog(database=database_name,
                                                   table_name=table_name,
                                                   push_down_predicate=push_down_predicate_str
                                                   )

I receive the following output for both methods:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/share/aws/aws-java-sdk-v2/aws-sdk-java-bundle-2.29.52.jar!/software/amazon/awssdk/thirdparty/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/share/aws/glue-pds/jars/bundle-2.24.6.jar!/software/amazon/awssdk/thirdparty/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
ANTLR Tool version 4.3 used for code generation does not match the current runtime version 4.9.3
ANTLR Tool version 4.3 used for code generation does not match the current runtime version 4.9.3
26/03/12 15:58:55 WARN FileSystem: Failed to initialize filesystem s3://bucket_name/path/to/file/: java.lang.IllegalArgumentException: bucket is null/empty
Traceback (most recent call last):
  File "/home/hadoop/workspace/.vscode/test.py", line 375, in <module>
    dyf = glueContext.create_dynamic_frame.from_catalog(database=database_name,
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/share/aws/glue-pds/PyGlue.zip/awsglue/dynamicframe.py", line 629, in from_catalog
  File "/usr/share/aws/glue-pds/PyGlue.zip/awsglue/context.py", line 188, in create_dynamic_frame_from_catalog
  File "/usr/share/aws/glue-pds/PyGlue.zip/awsglue/data_source.py", line 36, in getFrame
  File "/usr/local/lib/python3.11/site-packages/py4j/java_gateway.py", line 1322, in __call__
    return_value = get_return_value(
                   ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/pyspark/errors/exceptions/captured.py", line 185, in deco
    raise converted from None
pyspark.errors.exceptions.captured.IllegalArgumentException: bucket is null/empty

THE ERROR
Looking at the output, the most important part seems to be: "WARN FileSystem: Failed to initialize filesystem s3://bucket_name/path/to/file/". I'm not sure how to resolve this. Do I need to set some configuration when creating my Spark job? The examples I've seen can access data from the Docker container "out of the box".

EDIT
My main goal is to use the AWS Glue Data Catalog. I can view the databases and tables that exist in the catalog; however, when I try to access a table (using the code below), it says that the bucket is null/empty.

conf = SparkConf()
conf.set("spark.hadoop.fs.s3a.aws.credentials.provider", "com.amazonaws.auth.profile.ProfileCredentialsProvider")
conf.set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
conf.set("spark.hadoop.fs.s3a.aws.credentials.provider.profile.name", "profileName")
conf.set("spark.hadoop.fs.s3a.path.style.access", "true")
sc = SparkContext(conf=conf)
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)

spark.sql("SHOW DATABASES").show()
spark.sql("SHOW TABLES IN database").show()
dyf = glueContext.create_dynamic_frame.from_catalog(database=database_name,
                                                   table_name=table_name,
                                                   push_down_predicate=push_down_predicate_str
                                                   )
asked 2 months ago, 69 views
2 Answers

Hi Joshua,

Thank you for your question on re:Post. There are a few potential causes of this. Here are the likely reasons, followed by ways you can troubleshoot and fix them.

  • Different credential chains: Spark uses Hadoop's S3A filesystem (Java-based) which has a different credential provider chain than boto3 (Python-based)
  • Windows path mounting issues: The ~/.aws volume mount on Windows may not correctly map to the container's expected location
  • Missing Hadoop S3 configuration: Spark needs explicit Hadoop configuration to use the credentials properly

Option 1: Explicit Spark Configuration (Recommended for Windows)

Configure Spark to use the correct AWS credential provider when initializing your SparkContext:

from pyspark import SparkConf, SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Create Spark configuration with S3A settings
conf = SparkConf()
conf.set("spark.hadoop.fs.s3a.aws.credentials.provider", 
         "com.amazonaws.auth.profile.ProfileCredentialsProvider")
conf.set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

# If using a specific profile
conf.set("spark.hadoop.fs.s3a.aws.credentials.provider.profile.name", "your_profile_name")

# Initialize contexts with configuration
sc = SparkContext(conf=conf)
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)

Option 2: Fix Windows Credential Mounting

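On Windows, the ~/.aws path may not resolve correctly, so use explicit paths. The examples below mount into /home/glue_user/.aws; the traceback in the question shows /home/hadoop/workspace, so if your image runs as a different user, adjust the container-side path to that user's home directory.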

PowerShell:

docker run -it `
  -v $env:USERPROFILE\.aws:/home/glue_user/.aws:ro `
  -e AWS_PROFILE=your_profile_name `
  -e DISABLE_SSL=true `
  --rm `
  -p 4040:4040 -p 18080:18080 `
  --name glue_pyspark `
  amazon/aws-glue-libs:glue_libs_3.0.0_image_01 `
  pyspark

CMD:

docker run -it ^
  -v %USERPROFILE%\.aws:/home/glue_user/.aws:ro ^
  -e AWS_PROFILE=your_profile_name ^
  -e DISABLE_SSL=true ^
  --rm ^
  -p 4040:4040 -p 18080:18080 ^
  --name glue_pyspark ^
  amazon/aws-glue-libs:glue_libs_3.0.0_image_01 ^
  pyspark

Key changes:

  • Use $env:USERPROFILE (PowerShell) or %USERPROFILE% (CMD) instead of ~
  • Add :ro (read-only) flag to the volume mount
  • Ensure both credentials and config files exist in your .aws directory
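
If you launch the container from a script rather than typing the command, the same arguments can be assembled in Python, where the home directory is resolved by pathlib instead of by the shell. This is only a sketch: glue_docker_command is a hypothetical helper name, and the container path and image tag are simply the ones used in the commands above, so adjust them to your image.

```python
from pathlib import Path


def glue_docker_command(aws_dir, profile,
                        image="amazon/aws-glue-libs:glue_libs_3.0.0_image_01"):
    """Build the `docker run` argument list used above (hypothetical helper)."""
    return [
        "docker", "run", "-it",
        # Host .aws directory mounted read-only; the container path is an
        # assumption taken from the commands above.
        "-v", f"{aws_dir}:/home/glue_user/.aws:ro",
        "-e", f"AWS_PROFILE={profile}",
        "-e", "DISABLE_SSL=true",
        "--rm",
        "-p", "4040:4040", "-p", "18080:18080",
        "--name", "glue_pyspark",
        image,
        "pyspark",
    ]


# Example: resolve the home directory explicitly instead of relying on "~".
command = glue_docker_command(Path.home() / ".aws", "your_profile_name")
```

The resulting list can be passed to subprocess.run(command) from a terminal session; printing it first is a quick way to confirm that the host path was expanded as expected.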

Option 3: Use Environment Variables

Pass credentials directly as environment variables (less secure, but useful for troubleshooting):

docker run -it `
  -e AWS_ACCESS_KEY_ID=your_access_key `
  -e AWS_SECRET_ACCESS_KEY=your_secret_key `
  -e AWS_DEFAULT_REGION=us-east-1 `
  -e DISABLE_SSL=true `
  --rm `
  -p 4040:4040 -p 18080:18080 `
  --name glue_pyspark `
  amazon/aws-glue-libs:glue_libs_3.0.0_image_01 `
  pyspark

Then configure Spark to use environment variable credentials:

conf = SparkConf()
conf.set("spark.hadoop.fs.s3a.aws.credentials.provider",
         "com.amazonaws.auth.EnvironmentVariableCredentialsProvider")
conf.set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

sc = SparkContext(conf=conf)
glueContext = GlueContext(sc)

Option 4: Combined Credential Provider Chain

List multiple credential providers so that each one acts as a fallback for the one before it:

from pyspark import SparkConf, SparkContext
from awsglue.context import GlueContext

conf = SparkConf()
conf.set("spark.hadoop.fs.s3a.aws.credentials.provider",
         "com.amazonaws.auth.EnvironmentVariableCredentialsProvider,"
         "com.amazonaws.auth.profile.ProfileCredentialsProvider,"
         "com.amazonaws.auth.InstanceProfileCredentialsProvider")
conf.set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

sc = SparkContext(conf=conf)
glueContext = GlueContext(sc)
spark = glueContext.spark_session
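
Because the provider chain is a single comma-separated string, it is easy to mistype. A small helper can build the settings from an ordered list instead. This is a sketch only: s3a_credential_settings is a hypothetical name, and the property keys and class names are the ones already used in the options above.

```python
def s3a_credential_settings(providers):
    """Return spark.hadoop.* settings for an ordered S3A credential provider chain."""
    if not providers:
        raise ValueError("at least one credential provider class is required")
    return {
        # Providers are joined in the order given.
        "spark.hadoop.fs.s3a.aws.credentials.provider": ",".join(providers),
        "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    }


settings = s3a_credential_settings([
    "com.amazonaws.auth.EnvironmentVariableCredentialsProvider",
    "com.amazonaws.auth.profile.ProfileCredentialsProvider",
])
```

Inside the container, the dictionary can be applied with SparkConf().setAll(list(settings.items())) before the SparkContext is created.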

Verification Steps

1. Verify Credentials Are Mounted

Inside the container, check if credentials are accessible:

docker exec -it glue_pyspark bash
ls -la /home/glue_user/.aws/
cat /home/glue_user/.aws/credentials
cat /home/glue_user/.aws/config
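
If you prefer not to print secrets to the terminal with cat, the same check can be done from Python inside the container by listing only the profile names. This is a minimal sketch using the standard library; list_aws_profiles is a hypothetical helper, and the directory you pass in is whatever path you mounted (for example /home/glue_user/.aws in the commands above).

```python
import configparser
from pathlib import Path


def list_aws_profiles(aws_dir):
    """Return the profile names in <aws_dir>/credentials, or [] if the file is missing."""
    credentials_file = Path(aws_dir) / "credentials"
    if not credentials_file.is_file():
        # A missing file usually means the volume mount did not work.
        return []
    parser = configparser.ConfigParser()
    parser.read(credentials_file)
    # Section names are the profile names, e.g. "default".
    return parser.sections()
```

An empty list means either the mount is missing or the file contains no profiles; a list that lacks the name set in AWS_PROFILE points to a profile mismatch rather than a mounting problem.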

2. Test boto3 Access (Python SDK)

import boto3
s3 = boto3.client('s3')
response = s3.list_objects_v2(Bucket='your-bucket', Prefix='your-prefix/')
print(response)

If this works but Spark doesn't, the credentials themselves are fine and the problem is on the Spark/Hadoop side, for example in the credential provider configuration.

3. Test Spark S3 Access

# After applying the Spark configuration above
df = spark.read.json("s3://bucket_name/path/to/folder/*.json")
df.show()

Common Windows-Specific Issues

Issue 1: Docker Desktop Linux Container Mode

Ensure Docker Desktop is running in Linux container mode (not Windows containers):

  • Right-click Docker Desktop icon → "Switch to Linux containers"

Issue 2: File Permissions

Windows file permissions may not translate correctly to Linux containers:

  • Ensure .aws/credentials and .aws/config files are readable
  • Try mounting as read-only with :ro flag

Issue 3: Path Separators

Windows uses backslashes, but S3 URIs and paths inside the Linux container use forward slashes:

  • Use forward slashes in S3 paths: s3://bucket/path/file.json
  • Avoid backslashes in any path configuration

Issue 4: AWS Config File Missing

The Glue container expects both credentials and config files:

~/.aws/credentials:

[default]
aws_access_key_id = YOUR_ACCESS_KEY
aws_secret_access_key = YOUR_SECRET_KEY

~/.aws/config:

[default]
region = us-east-1
output = json

Complete Working Example

import sys
from pyspark import SparkConf, SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

# Configure Spark with proper S3A credentials
conf = SparkConf()
conf.set("spark.hadoop.fs.s3a.aws.credentials.provider",
         "com.amazonaws.auth.profile.ProfileCredentialsProvider")
conf.set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

# Initialize Glue context
sc = SparkContext(conf=conf)
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Get job parameters
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Now S3 access should work
file_name = "s3://bucket_name/path/to/folder/*.json"
df = spark.read.json(file_name)
df.show()

# Or using Glue catalog
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="database_name",
    table_name="table_name"
)
dyf.printSchema()

job.commit()

Why boto3 Works But Spark Doesn't

  • boto3 is the AWS SDK for Python, and it reads ~/.aws/credentials directly
  • Spark uses Hadoop's S3A filesystem (Java-based), which requires explicit configuration to use the same credentials
  • The credential provider chain must be explicitly configured for Hadoop/Spark to recognize mounted AWS credentials

AWS
EXPERT
answered 2 months ago
  • Hi Jen,

    Thank you for your answer. I tried the recommended solutions, but unfortunately they were not successful for me even after setting the spark configurations and checking for mounting errors. Boto3 was successful in listing the files, but when trying to access the glue data catalog I received the same error as before: " WARN FileSystem: Failed to initialize filesystem s3://bucket_name/path/to/folder: java.lang.IllegalArgumentException: bucket is null/empty"

The AWS Glue Docker container uses Hadoop's S3 filesystem implementations, and there are three different S3 URI schemes that map to different filesystem implementations:

URI scheme | Filesystem class   | Notes
s3://      | EmrFileSystem      | Used by AWS Glue in production, not fully functional in local Docker
s3a://     | S3AFileSystem      | Hadoop-native, works in local Docker
s3n://     | NativeS3FileSystem | Legacy, deprecated

When you use spark.read.json("s3://...") or the Glue Data Catalog returns an S3 location starting with s3://, the container tries to use EMRFS, which requires internal AWS infrastructure that doesn't exist in a local Docker container. The EMRFS implementation fails to parse the bucket name from the URI, resulting in the bucket is null/empty error.

boto3 works fine because it uses the AWS SDK directly and doesn't go through Hadoop's filesystem abstraction at all.

You can fix this either by mapping s3:// to s3a:// via Hadoop configuration, adding the line

conf.set("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

before the SparkContext() call, or, if you only care about direct reads like in your example, by changing the URI scheme in your code:

# Instead of:
df = spark.read.json("s3://bucket_name/path/to/folder/*.json")

# Use:
df = spark.read.json("s3a://bucket_name/path/to/folder/*.json")
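
If the s3:// paths come from variables or configuration rather than literals, the scheme swap can be wrapped in a small function that also checks for a bucket name before the path reaches Spark. This is only a sketch for direct reads, and to_s3a is a hypothetical helper. It does not affect locations returned by the Glue Data Catalog, which is where the fs.s3.impl mapping above applies.

```python
from urllib.parse import urlsplit


def to_s3a(uri):
    """Rewrite an s3:// or s3n:// URI to s3a://, rejecting URIs without a bucket."""
    parts = urlsplit(uri)
    if parts.scheme not in ("s3", "s3n", "s3a"):
        raise ValueError(f"not an S3 URI: {uri!r}")
    if not parts.netloc:
        # Catch a missing bucket here rather than inside the filesystem layer.
        raise ValueError(f"bucket is missing in {uri!r}")
    # Keep everything after the scheme unchanged, including glob characters.
    return "s3a://" + uri.split("://", 1)[1]
```

Usage would look like df = spark.read.json(to_s3a(file_name)), with file_name being the s3:// path from the question.
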
AWS
answered 2 months ago
