AWS Docker container for Glue and Databricks JDBC connection


Hello, we are using the AWS Docker container for Glue (available here) and are trying to connect to Databricks over JDBC using DatabricksJDBC42.jar (available here). We placed the jar file both in the same folder as the Jupyter notebook and in the C:/.aws/ folder. When we try to connect we get the error "java.lang.ClassNotFoundException: com.databricks.client.jdbc.Driver".
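For reference, the connection attempt looks roughly like this (a minimal sketch; the URL and table are placeholders, not our real values):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholders; the real Databricks JDBC URL and table are omitted.
df = (spark.read.format("jdbc")
      .option("driver", "com.databricks.client.jdbc.Driver")
      .option("url", "<databricks-jdbc-url>")
      .option("dbtable", "<schema>.<table>")
      .load())
# Fails with: java.lang.ClassNotFoundException: com.databricks.client.jdbc.Driver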

We have used the DB2 driver without issue using the same setup. Also, when we upload the jar to AWS and attach it to the Glue job via the --extra-jars parameter, it works fine.
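For comparison, on the AWS side the working job is created along these lines (a boto3 sketch; the job name, role, and S3 paths are placeholders):

import boto3

glue = boto3.client("glue")

# All names and S3 paths below are placeholders; --extra-jars is the relevant part.
glue.create_job(
    Name="databricks-jdbc-job",
    Role="arn:aws:iam::123456789012:role/GlueJobRole",
    Command={"Name": "glueetl", "ScriptLocation": "s3://my-bucket/scripts/job.py"},
    GlueVersion="3.0",
    DefaultArguments={"--extra-jars": "s3://my-bucket/jars/DatabricksJDBC42.jar"},
)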

Has anyone gotten this to work successfully?

Asked 1 year ago · 691 views
3 Answers

Hello,

I understand that you are receiving the following error while trying to connect to your Databricks cluster while following the blog post “Develop and test AWS Glue version 3.0 and 4.0 jobs locally using a Docker container”:

java.lang.ClassNotFoundException: com.databricks.client.jdbc.Driver

Since you are using the updated DatabricksJDBC42.jar driver, please ensure that the JDBC URL follows the naming convention for DatabricksJDBC42.jar rather than the legacy SparkJDBC42.jar.

Refer to: https://docs.databricks.com/integrations/jdbc-odbc-bi.html#building-the-connection-url-for-the-databricks-driver

Modified parameters (see the sketch after this list):

  • Use the jdbc:databricks:// URL prefix
  • Use HttpPath
  • Supply the driver class name as 'com.databricks.client.jdbc.Driver'
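As an illustration, a connection built for the new driver might look like the following sketch; every workspace value below is a placeholder, not a value from this thread:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholders: substitute your own server hostname, HTTP path, and token.
url = ("jdbc:databricks://<server-hostname>:443/default;"
       "transportMode=http;ssl=1;"
       "httpPath=<http-path>;"
       "AuthMech=3;UID=token;PWD=<personal-access-token>")

df = (spark.read.format("jdbc")
      .option("driver", "com.databricks.client.jdbc.Driver")
      .option("url", url)
      .option("dbtable", "<schema>.<table>")
      .load())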

If the issue persists, please open a support case with AWS, providing the connection details and the code snippet used: https://docs.aws.amazon.com/awssupport/latest/user/case-management.html#creating-a-support-case

Thank you.

AWS
Support Engineer
Answered 1 year ago

If it works with --extra-jars, it means that Glue inside the Docker container cannot find the jar; placing it in the notebook folder or in .aws won't help.
The safest approach is to exec into the container and put the jar under /home/glue_user/spark/jars.
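Once the jar is in place, a quick way to confirm from the notebook that the JVM actually sees the class is a sketch like this (note that _jvm is an internal PySpark handle, so treat it as a debugging aid only):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Raises an error wrapping ClassNotFoundException if the jar is still not on the classpath.
spark.sparkContext._jvm.java.lang.Class.forName("com.databricks.client.jdbc.Driver")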

AWS
Expert
Answered 1 year ago

Gonzalo's answer worked, but I also found that adding the jars in the docker run command was the easiest approach; there was no need to commit a modified Docker container image. However, I am now facing a new error related to SSL PKIX path building failure, which I will post as a separate question. Thanks for your attention, team! Appreciate the inputs. :)

docker run -it \
  -v ~/.aws:/home/glue_user/.aws \
  -v $WORKSPACE_LOCATION:/home/glue_user/workspace/ \
  -e AWS_PROFILE=$PROFILE_NAME \
  -e DISABLE_SSL=true \
  -e PYSPARK_SUBMIT_ARGS="--jars /root/.aws/db2jcc4.jar,/root/.aws/DatabricksJDBC42.jar,/root/.aws/AthenaJDBC42-2.0.35.1000,/root/.aws/presto-jdbc-0.225-SNAPSHOT.jar pyspark-shell" \
  --rm -p 4040:4040 -p 18080:18080 \
  --name glue_spark_submit \
  amazon/aws-glue-libs:glue_libs_3.0.0_image_01 \
  spark-submit /home/glue_user/workspace/src/$SCRIPT_FILE_NAME

Answered 1 year ago
