How to use external libraries in AWS Glue Python Shell

1

I am trying to use external libraries such as openpyxl in a Glue Python Shell job. I uploaded the wheel to S3 and referenced it in the Job details, but it does not seem to work. I also tried adding a job parameter with the required version, but nothing works. Can you suggest another way to do this, or another service where I can run my Python jobs? (They contain code for fetching data from different databases, transforming it, and creating reports with aggregated values.)

asked 2 years ago, 5602 views
2 Answers
1

Hi,

To successfully add an external library to a Glue Python Shell job, you should follow the documentation at this link.

UPDATE: as described in the link above, when using Python 3.9 the best option for installing external libraries is the following job parameter:

--additional-python-modules s3://aws-glue-native-spark/tests/j4.2/fbprophet-0.6-py3-none-any.whl,scikit-learn==0.21.3
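
As a rough illustration of how that parameter is put together, here is a minimal Python sketch that builds the job's default arguments as a dictionary. The helper name `build_default_arguments` is hypothetical (it is not part of any AWS library), and the sketch assumes the value is a single comma-separated string, as in the example above.

```python
import json


def build_default_arguments(modules):
    """Return Glue default arguments requesting extra Python modules.

    `modules` is a list of pip requirement strings (e.g. "openpyxl==3.0.9")
    or S3 URIs of wheel files. Hypothetical helper, for illustration only.
    """
    if not modules:
        raise ValueError("at least one module is required")
    # Assumption: the value is one comma-separated string without spaces,
    # matching the example parameter shown in the answer.
    return {"--additional-python-modules": ",".join(m.strip() for m in modules)}


args = build_default_arguments(["openpyxl==3.0.9"])
print(json.dumps(args))
```

The resulting dictionary could then be supplied as the job's default arguments, for example through the `DefaultArguments` field of boto3's `create_job`, or serialized to JSON for the CLI's `--default-arguments` option.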

For previous versions the following is still correct. Assuming you have already downloaded the wheel file and uploaded it to Amazon S3, if you are creating your job via the command line you need to add the parameter:

--default-arguments '{"--extra-py-files" : ["s3://MyBucket/python/library/openpyxl-3.0.9-py2.py3-none-any.whl"]}'

If you are creating or editing the Python shell job in the console:

  • For the new Glue Studio Job Editor: look under Job Details, Advanced properties.

  • For the legacy Job Editor: look under the "Security configuration, script libraries, and job parameters (optional)" section.

Once you locate the text box under Python library path, paste the full S3 URI for your wheel file.

I tested it with your library and it works in my environment.

Processing ./glue-python-libs-cr2dddvq/openpyxl-3.0.9-py2.py3-none-any.whl
Collecting et-xmlfile
  Downloading et_xmlfile-1.1.0-py3-none-any.whl (4.7 kB)
Installing collected packages: et-xmlfile, openpyxl
Successfully installed et-xmlfile-1.1.0 openpyxl-3.0.9
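
If you want to check a job's log output for the same confirmation, a small sketch like the one below could pull the installed packages out of pip's "Successfully installed" line. The function name `installed_packages` is hypothetical, and the sample text is simply the log shown above. How you fetch the log text from CloudWatch is left out.

```python
def installed_packages(log_text):
    """Map package name -> version from pip 'Successfully installed' lines."""
    prefix = "Successfully installed "
    found = {}
    for line in log_text.splitlines():
        line = line.strip()
        if not line.startswith(prefix):
            continue
        for token in line[len(prefix):].split():
            # pip prints "name-version"; the version follows the last hyphen.
            name, _, version = token.rpartition("-")
            if name:
                found[name] = version
    return found


sample_log = """Processing ./glue-python-libs-cr2dddvq/openpyxl-3.0.9-py2.py3-none-any.whl
Collecting et-xmlfile
  Downloading et_xmlfile-1.1.0-py3-none-any.whl (4.7 kB)
Installing collected packages: et-xmlfile, openpyxl
Successfully installed et-xmlfile-1.1.0 openpyxl-3.0.9
"""

packages = installed_packages(sample_log)
print("openpyxl" in packages, packages.get("openpyxl"))
```

An empty result for a job run would suggest that the library installation step did not complete, which is worth checking before debugging the script itself.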

Hope this helps,

AWS
EXPERT
answered 2 years ago
  • I did the same thing, but my problem was not solved. I edited the Python shell job in the console. In the new Jobs user interface (not the legacy one), I found the Python library path in the Libraries section of the Job details tab.

0

Thank you for your question. Without more context it's hard to say what the reason is, but in general I was able to make it work based on this article: https://aws.amazon.com/premiumsupport/knowledge-center/glue-version2-external-python-libraries/

AWS
Alex_T
answered 2 years ago
  • The article shared does not work either. Right now I am importing just one library, openpyxl, and it gives a "No module named openpyxl" error. I passed the wheel file downloaded from the internet, and also tried passing the key-value job parameter (key: --additional-python-modules, value: openpyxl==3.0.9).

  • Hi, the answer is actually incorrect: the link provided works for AWS Glue Spark jobs, not for Glue Python Shell jobs as requested in the question.

    It could also be improved by mentioning that you can tell whether an error occurs during the import of the external library by checking the CloudWatch logs for the job.

    If no errors are present, the logs under the job run will show the package installed.
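
To narrow down a "No module named openpyxl" error, it may help to run a short check at the top of the job script so that the result appears in the job's logs. The following is only a sketch: `check_module` is a hypothetical helper, and it relies solely on the standard library's `importlib.util.find_spec`.

```python
import importlib.util
import sys


def check_module(name):
    """Return True if the top-level module `name` can be found for import."""
    return importlib.util.find_spec(name) is not None


# Report the interpreter version and whether each module is visible to it.
print(f"Python {sys.version_info.major}.{sys.version_info.minor}")
for module_name in ("json", "openpyxl"):
    status = "available" if check_module(module_name) else "MISSING"
    print(f"{module_name}: {status}")
```

If openpyxl is reported as missing, the library was not installed for the interpreter running the script, which points at the job configuration rather than the script's own code.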
