PermissionError: [Errno 13] Permission denied: '/.cache' while installing python packages as part of AWS Glue job using --additional-python-modules parameter

I am trying to install the sentence_transformers Python package as part of my AWS Glue Python script job. I am doing this by setting the job parameter --additional-python-modules to sentence_transformers.

However, while loading a sentence_transformers model, I consistently get a Permission denied: '/.cache' error. The issue is caused by pip trying to write some package files to /.cache.... I tried to disable that using --no-cache-dir, but had no luck, and I am not sure where to pass this option correctly.

Could you please help me solve this? I would like to know either how to disable the cache when installing Python packages with the Glue job parameter --additional-python-modules, or how to give my AWS Glue job write access to the /.cache directory.

Further details: I am using Python 3.9 and AWS Glue 3.0, and the IAM roles added to my job include AWSGlueConsoleFullAccess.

Abri
asked 8 months ago, 963 views
2 Answers
Accepted Answer

For anyone who faces this issue in the future: I was able to make this work by passing an S3 path as the target location for caching.

Either one of these should work:

  1. Set the environment variable before importing the library:

     import os
     os.environ['SENTENCE_TRANSFORMERS_HOME'] = 's3-path'

  2. Pass cache_folder when loading the target model:

     EMBEDDING_MODEL = HuggingFaceEmbeddings(model_name=EMBEDDING_MODEL_NAME, cache_folder="s3-path")
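Both options above work by moving the cache away from /.cache. As a minimal sketch of the same idea, the snippet below points the cache at a local directory instead of the S3 path I used. It rests on assumptions I have not verified on Glue: that the system temp directory is writable inside the job, and that setting HF_HOME (the Hugging Face cache root) alongside SENTENCE_TRANSFORMERS_HOME is harmless. The helper name use_writable_cache and the folder name model_cache are made up for illustration.

```python
import os
import tempfile


def use_writable_cache(base_dir=None):
    """Point the model caches at a writable directory and return its path."""
    # Assumption: the system temp directory is writable in the job's runtime.
    # "model_cache" is a hypothetical folder name chosen for this example.
    base_dir = base_dir or os.path.join(tempfile.gettempdir(), "model_cache")
    os.makedirs(base_dir, exist_ok=True)
    # SENTENCE_TRANSFORMERS_HOME is the variable used in option 1 above.
    # HF_HOME is the Hugging Face cache root, set here as an extra precaution.
    os.environ["SENTENCE_TRANSFORMERS_HOME"] = base_dir
    os.environ["HF_HOME"] = base_dir
    return base_dir


# Run this before importing sentence_transformers or langchain.
cache_dir = use_writable_cache()
```

The returned cache_dir can also be passed as cache_folder, as in option 2, so that both routes agree on the same location.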

Abri
answered 8 months ago

You must ensure that the IAM role you attach to the Glue job has read permissions for that S3 bucket.


After reading the comments I tried the installation myself and had no problems. I have copied the details from my configuration below. At first glance, the only difference between my environment and the tutorial is that my Glue job role has admin access. For testing, you could try a role with admin access and then lower the permissions afterwards.

[Screenshots: job configuration and the job running]

I hope this helps. If you have more details about the error, let me know in the comments.

EXPERT
answered 8 months ago
  • My IAM role has full access to S3, and the /.cache directory is not an S3 directory, right? I assume it is a directory on the machine where the job runs. Also, my Glue job is a Python script job.

  • I am trying to replicate your problem. Meanwhile, I followed this tutorial and installed the library without problems, using --additional-python-modules pymysql==1.0.2: https://repost.aws/knowledge-center/glue-version2-external-python-libraries My Glue role has full admin access. Next I am going to try with sentence_transformers.

  • I tried the following configuration and had no problems. I am going to edit my answer to show you my results: --additional-python-modules sentence_transformers

  • Thank you for looking into this, MaxCloud. The caching problem happens when loading the model; installation seems to succeed normally. Can you try this code in your script and re-run the job?

    from langchain.embeddings import HuggingFaceEmbeddings
    EMBEDDING_MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"
    EMBEDDING_MODEL = HuggingFaceEmbeddings(model_name=EMBEDDING_MODEL_NAME)

    You also need to add a second Python package, langchain, to --additional-python-modules, so the parameter becomes --additional-python-modules=sentence_transformers,langchain (comma-separated, with no space).

  • EMBEDDING_MODEL = HuggingFaceEmbeddings(model_name=EMBEDDING_MODEL_NAME) is the line where the problem occurs. I think the caching happens when loading the model, not at installation.
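To check the theory in the comments above, a short diagnostic can be run at the top of the script, before the model is loaded. This is only a sketch and rests on an assumption: that the model libraries default to a cache under ~/.cache, so a home directory of / would produce exactly the /.cache path from the error. The helper names cache_root_for and is_writable are made up for illustration.

```python
import os


def cache_root_for(home):
    """Return the default '~/.cache' location for a given home directory."""
    return os.path.join(home, ".cache")


def is_writable(path):
    """True if path, or its nearest existing parent, is writable by this process."""
    probe = os.path.abspath(path)
    # Walk up until we reach a directory that actually exists.
    while not os.path.exists(probe):
        parent = os.path.dirname(probe)
        if parent == probe:
            break
        probe = parent
    return os.access(probe, os.W_OK)


# Print what the job's runtime considers the home directory and cache root.
home = os.path.expanduser("~")
root = cache_root_for(home)
print("home:", home)
print("cache root:", root, "writable:", is_writable(root))
```

If the printed cache root is /.cache and it is reported as not writable, that would be consistent with the error appearing at model load time rather than during pip installation.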
