I have a workflow in which Airflow spins up an EMR cluster to run a PySpark job. Inside the job I try to copy a string to S3 using the boto3 API. If I open a terminal on the cluster, start pyspark, and copy the string manually, it works with no issues. However, when the same code runs as part of the job, I get this error:
botocore.exceptions.SSLError: SSL validation failed for https://my-bucket-name.s3.amazonaws.com/path/to/file.txt [Errno 20] Not a directory
This exception stems from the following two chained tracebacks, which originate in the urllib3 package:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/urllib3/util/ssl_.py", line 322, in ssl_wrap_socket
    context.load_verify_locations(ca_certs, ca_cert_dir)
NotADirectoryError: [Errno 20] Not a directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/botocore/httpsession.py", line 262, in send
    chunked=self._chunked(request.headers),
  File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 641, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/usr/local/lib/python3.6/site-packages/urllib3/util/retry.py", line 344, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/local/lib/python3.6/site-packages/urllib3/packages/six.py", line 686, in reraise
    raise value
  File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 603, in urlopen
    chunked=chunked)
  File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 344, in _make_request
    self._validate_conn(conn)
  File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 843, in _validate_conn
    conn.connect()
  File "/usr/local/lib/python3.6/site-packages/urllib3/connection.py", line 370, in connect
    ssl_context=context)
  File "/usr/local/lib/python3.6/site-packages/urllib3/util/ssl_.py", line 324, in ssl_wrap_socket
    raise SSLError(e)
urllib3.exceptions.SSLError: [Errno 20] Not a directory
I am at a loss as to what is going on here. It appears to be some issue with the certificates, but I don't understand why it works manually and not when run through the job. Any insight would be greatly appreciated.
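
Since the first traceback fails in context.load_verify_locations(ca_certs, ca_cert_dir), one way to narrow this down might be to print, from inside the job and again from the interactive pyspark shell, which CA bundle locations are visible and whether each one resolves to a real file or directory. The following is only a diagnostic sketch and not part of the original job: the helper names describe_path and ca_report are hypothetical, and it assumes the relevant path comes from one of the usual sources (the AWS_CA_BUNDLE, REQUESTS_CA_BUNDLE, SSL_CERT_FILE, or SSL_CERT_DIR environment variables, the interpreter's default verify paths, or certifi if it is installed).

```python
import os
import ssl

# Environment variables that commonly override the CA bundle location.
# (Assumption: one of these, or a default below, feeds ca_certs / ca_cert_dir.)
CA_ENV_VARS = ("AWS_CA_BUNDLE", "REQUESTS_CA_BUNDLE", "SSL_CERT_FILE", "SSL_CERT_DIR")


def describe_path(path):
    """Classify a path as 'unset', 'file', 'dir', 'not-a-directory', or 'missing'."""
    if not path:
        return "unset"
    if os.path.isfile(path):
        return "file"
    if os.path.isdir(path):
        return "dir"
    # Walk up to the nearest existing ancestor. If that ancestor is a regular
    # file, opening the path fails with ENOTDIR (Errno 20).
    parent = os.path.dirname(path)
    while parent and parent != os.path.dirname(parent):
        if os.path.isfile(parent):
            return "not-a-directory"
        if os.path.isdir(parent):
            break
        parent = os.path.dirname(parent)
    return "missing"


def ca_report():
    """Collect candidate CA locations and how each one resolves on this node."""
    report = {}
    for name in CA_ENV_VARS:
        value = os.environ.get(name)
        report["env:" + name] = (value, describe_path(value))
    defaults = ssl.get_default_verify_paths()
    report["ssl:cafile"] = (defaults.cafile, describe_path(defaults.cafile))
    report["ssl:capath"] = (defaults.capath, describe_path(defaults.capath))
    try:
        import certifi  # optional; skipped if not installed
        bundle = certifi.where()
        report["certifi"] = (bundle, describe_path(bundle))
    except ImportError:
        pass
    return report


if __name__ == "__main__":
    for key, (path, kind) in sorted(ca_report().items()):
        print("{}: {!r} -> {}".format(key, path, kind))
```

Comparing the output from the job with the output from the interactive shell should show whether the two environments resolve the certificate paths differently. Any entry reported as not-a-directory would be consistent with the [Errno 20] in the traceback.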