Issue importing custom Python modules in Jupyter notebooks with SparkMagic and AWS Glue endpoint


I'm encountering an issue while attempting to run ETL scripts in Jupyter notebooks using SparkMagic, which is connected to an AWS Glue endpoint via SSH. I followed the tutorial provided in the AWS Glue documentation (link: https://docs.aws.amazon.com/glue/latest/dg/dev-endpoint-tutorial-local-jupyter.html) and successfully set up the connection. I can run Pyspark without any problems.

However, I'm facing difficulties when trying to import custom Python modules that I created. I made sure to upload my files to the Glue endpoint and place them in a directory that I appended to the Jupyter notebook's search path. When attempting to import the module using the following code:

from test.base.text import DataFields

It fails to import because the Python interpreter is set to use /usr/bin/python3 by default. Instead, I need to use the **/usr/bin/gluepython3** interpreter.

sys.executable

returns

/usr/bin/python3
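
For context, this is roughly the cell I run before the import; the directory below is just a placeholder for wherever I uploaded the files on the endpoint:

    import sys
    # placeholder path for the upload directory on the Glue endpoint
    sys.path.append("/home/glue/custom_modules")
    from test.base.text import DataFields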

I have tried several steps to make it use the correct Python interpreter, including:

  1. Configuring sparkmagic to use gluepython3: %%configure -f { "conf": { "spark.pyspark.python": "/usr/bin/gluepython3" } }

  2. Setting the PYSPARK_PYTHON environment variable in the notebook to /usr/bin/gluepython3 (see the sketch after this list for what I mean by this)

  3. Modifying the .bashrc file on the Glue endpoint and creating an alias for Python to point to gluepython3: alias python="/usr/bin/gluepython3"
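
About attempt 2: I'm not certain an alias was the right mechanism, so for clarity this is my best guess at what "setting the environment variable" should look like in a notebook cell (the os.environ approach and where it runs are assumptions on my part):

    # guess: set PYSPARK_PYTHON from a notebook cell; since SparkMagic sends
    # this to the remote Livy session, it may be too late to affect which
    # Python interpreter the session was launched with
    import os
    os.environ["PYSPARK_PYTHON"] = "/usr/bin/gluepython3"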

Despite trying these approaches, I have only been able to successfully import the module when running the code outside of Jupyter notebook using an SSH shell and manually invoking the Python file on the endpoint:

/usr/bin/gluepython3 /location/to/file.py

Any suggestions or guidance on how to resolve this issue and make the custom module import work within Jupyter notebooks using SparkMagic and the AWS Glue endpoint would be greatly appreciated.

Thank you in advance!

ftwGlue
Asked 1 year ago · Viewed 686 times
2 Answers

To add to the question above, I would recommend that you move to Glue Interactive Sessions.

Dev endpoints are no longer being developed and only support Glue version 1.

By switching to Interactive Sessions, you can dynamically choose the Glue version (2, 3, or 4), and hence the Spark version. Furthermore, it greatly simplifies adding extra Python modules, which can be done with a magic.
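
For example, extra Python modules can be pulled into an interactive session with a single magic at the top of the notebook; the S3 path and package names below are placeholders:

    %additional_python_modules s3://your-bucket/libs/test-0.1-py3-none-any.whl,pandas==1.5.3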

AWS
Expert
Answered 1 year ago

According to my understanding, you are including the custom module by uploading it to the local directory structure of your Glue dev endpoint over SSH and trying to reference it from there. However, according to the documentation, the way to import custom Python modules for your development endpoint is to add them as a dependency S3 path at the time the development endpoint is created.

Python library path

    Comma-separated Amazon Simple Storage Service (Amazon S3) paths to Python libraries that are required by your script. Multiple values must be complete paths separated by a comma (,). Only individual files are supported, not a directory path.

In order to use your custom libraries, generate a .whl file of your custom module, upload it to an S3 path, and pass that path as the "Python library path" parameter of the dev endpoint.
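
If you create the endpoint from the CLI rather than the console, the same setting should be available as the extra Python libraries path; a rough sketch (the endpoint name, role ARN, and S3 path below are placeholders):

    aws glue create-dev-endpoint \
        --endpoint-name my-dev-endpoint \
        --role-arn arn:aws:iam::123456789012:role/GlueDevEndpointRole \
        --extra-python-libs-s3-path s3://your-bucket/libs/test-0.1-py3-none-any.whl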

For more information on how to create a .whl file, refer to the official Python packaging documentation, or use the following pip command in the directory of the Python package (with up-to-date pip and wheel installed):

pip wheel .
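
A rough end-to-end sketch of building the wheel and pushing it to S3 (the module, bucket, and key names are placeholders):

    # run from the directory that contains setup.py / pyproject.toml
    pip wheel . -w dist/
    # upload the built wheel so it can be referenced as the Python library path
    aws s3 cp dist/test-0.1-py3-none-any.whl s3://your-bucket/libs/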

If this still doesn't resolve the issue, I would recommend trying different options such as Glue Interactive Sessions/Notebooks.

Unlike AWS Glue development endpoints, AWS Glue interactive sessions are serverless, with no infrastructure to manage. You can start interactive sessions very quickly, and they have a 1-minute billing minimum with cost-control features, which reduces the cost of developing data preparation applications.
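
If you prefer to keep working from a local Jupyter notebook, interactive sessions can also be installed as Jupyter kernels; roughly the following, though please check the Glue documentation for the exact, current commands and kernel names:

    pip3 install --upgrade jupyter boto3 aws-glue-sessions
    # registers the Glue PySpark and Glue Spark kernels with Jupyter
    install-glue-kernels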

AWS
Answered 1 year ago
AWS
Expert
Reviewed 1 year ago
