using transformers module with sagemaker studio project: ModuleNotFoundError: No module named 'transformers'


So as mentioned in my other recent post, I'm trying to modify the SageMaker example abalone XGBoost template to use TensorFlow.

My current problem is that running the pipeline I get a failure and in the logs I see:

ModuleNotFoundError: No module named 'transformers'

NOTE: I am importing 'transformers' in preprocess.py not in pipeline.py

Now I have 'transformers' listed in various places as a dependency including:

  • setup.py - required_packages = ["sagemaker==2.93.0", "sklearn", "transformers", "openpyxl"]
  • pipelines.egg-info/requires.txt - transformers (auto-generated from setup.py?)

So I'm keen to understand: how can I ensure that additional dependencies are available in the pipeline itself?

Many thanks in advance




ADDITIONAL DETAILS ON HOW I ENCOUNTERED THE ERROR

From one particular notebook (see previous post for more details) I have successfully constructed the new topic/tensorflow pipeline and run the following steps:

pipeline.upsert(role_arn=role)
execution = pipeline.start()
execution.describe()

the describe() method gives this output:

{'PipelineArn': 'arn:aws:sagemaker:eu-west-1:398371982844:pipeline/topicpipeline-example',
 'PipelineExecutionArn': 'arn:aws:sagemaker:eu-west-1:398371982844:pipeline/topicpipeline-example/execution/0aiczulkjoaw',
 'PipelineExecutionDisplayName': 'execution-1664394415255',
 'PipelineExecutionStatus': 'Executing',
 'PipelineExperimentConfig': {'ExperimentName': 'topicpipeline-example',
  'TrialName': '0aiczulkjoaw'},
 'CreationTime': datetime.datetime(2022, 9, 28, 19, 46, 55, 147000, tzinfo=tzlocal()),
 'LastModifiedTime': datetime.datetime(2022, 9, 28, 19, 46, 55, 147000, tzinfo=tzlocal()),
 'CreatedBy': {'UserProfileArn': 'arn:aws:sagemaker:eu-west-1:398371982844:user-profile/d-5qgy6ubxlbdq/sjoseph-reg-genome-com-273',
  'UserProfileName': 'sjoseph-reg-genome-com-273',
  'DomainId': 'd-5qgy6ubxlbdq'},
 'LastModifiedBy': {'UserProfileArn': 'arn:aws:sagemaker:eu-west-1:398371982844:user-profile/d-5qgy6ubxlbdq/sjoseph-reg-genome-com-273',
  'UserProfileName': 'sjoseph-reg-genome-com-273',
  'DomainId': 'd-5qgy6ubxlbdq'},
 'ResponseMetadata': {'RequestId': 'f949d6f4-1865-4a01-b7a2-a96c42304071',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'f949d6f4-1865-4a01-b7a2-a96c42304071',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '882',
   'date': 'Wed, 28 Sep 2022 19:47:02 GMT'},
  'RetryAttempts': 0}}

Waiting for the execution I get:

---------------------------------------------------------------------------
WaiterError                               Traceback (most recent call last)
<ipython-input-14-72be0c8b7085> in <module>
----> 1 execution.wait()

/opt/conda/lib/python3.7/site-packages/sagemaker/workflow/pipeline.py in wait(self, delay, max_attempts)
    581             waiter_id, model, self.sagemaker_session.sagemaker_client
    582         )
--> 583         waiter.wait(PipelineExecutionArn=self.arn)
    584 
    585 

/opt/conda/lib/python3.7/site-packages/botocore/waiter.py in wait(self, **kwargs)
     53     # method.
     54     def wait(self, **kwargs):
---> 55         Waiter.wait(self, **kwargs)
     56 
     57     wait.__doc__ = WaiterDocstring(

/opt/conda/lib/python3.7/site-packages/botocore/waiter.py in wait(self, **kwargs)
    376                     name=self.name,
    377                     reason=reason,
--> 378                     last_response=response,
    379                 )
    380             if num_attempts >= max_attempts:

WaiterError: Waiter PipelineExecutionComplete failed: Waiter encountered a terminal failure state: For expression "PipelineExecutionStatus" we matched expected path: "Failed"

Which I assume is corresponding to the failure I see in the logs:

[screenshot: build pipeline error message on preprocessing step]

I did also run python setup.py build to ensure my build directory was up to date ... here's the terminal output of that command:

sagemaker-user@studio$ python setup.py build
/opt/conda/lib/python3.9/site-packages/setuptools/dist.py:771: UserWarning: Usage of dash-separated 'description-file' will not be supported in future versions. Please use the underscore name 'description_file' instead
  warnings.warn(
/opt/conda/lib/python3.9/site-packages/setuptools/config/setupcfg.py:508: SetuptoolsDeprecationWarning: The license_file parameter is deprecated, use license_files instead.
  warnings.warn(msg, warning_class)
running build
running build_py
copying pipelines/topic/pipeline.py -> build/lib/pipelines/topic
running egg_info
writing pipelines.egg-info/PKG-INFO
writing dependency_links to pipelines.egg-info/dependency_links.txt
writing entry points to pipelines.egg-info/entry_points.txt
writing requirements to pipelines.egg-info/requires.txt
writing top-level names to pipelines.egg-info/top_level.txt
reading manifest file 'pipelines.egg-info/SOURCES.txt'
adding license file 'LICENSE'
writing manifest file 'pipelines.egg-info/SOURCES.txt'

It seems like the dependencies are being written to pipelines.egg-info/requires.txt but are these not being picked up by the pipeline?

1 Answer
Accepted Answer

Hi! There are two places where you need to install the dependencies / requirements:

  1. In your environment where you execute pipeline.start() – this can be Amazon SageMaker Studio, your local machine, or a CI/CD pipeline executor, e.g., AWS CodeBuild. These dependencies are installed via setup.py.
  2. Inside the SageMaker processing and training jobs, as well as in inference endpoints. This is usually done via a requirements.txt file that you submit as part of your source_dir.
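For the second case, the submitted source_dir is just a directory holding the entry script plus a requirements.txt. A minimal sketch of preparing such a directory (the file names and contents are illustrative, not the asker's actual project):

```python
from pathlib import Path
import tempfile

# Assemble a throwaway source_dir containing the two files a SageMaker
# processing job needs: the entry script and a requirements.txt listing the
# extra dependencies to install at container start-up.
source_dir = Path(tempfile.mkdtemp()) / "scripts"
source_dir.mkdir()

(source_dir / "processing-script.py").write_text(
    "import transformers  # resolved from requirements.txt inside the job\n"
)
(source_dir / "requirements.txt").write_text("transformers\n")

print(sorted(p.name for p in source_dir.iterdir()))
# → ['processing-script.py', 'requirements.txt']
```

The whole directory is uploaded when the job starts, so everything the script imports beyond the container's preinstalled packages must appear in that requirements.txt.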

For your example, I recommend using the TensorFlowProcessor. How to install dependencies into it is described in the corresponding section of the documentation; in particular:

SageMaker Processing installs the dependencies in requirements.txt in the container for you.

Same applies to your model training and to the TensorFlow estimator. See the section Use third-party libraries in the TensorFlow documentation of the SageMaker Python SDK, in particular:

If there are other packages you want to use with your script, you can use a requirements.txt to install other dependencies at runtime.
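Since the same source_dir/requirements.txt convention applies to both processing and training, a cheap sanity check before submitting a job is to verify the directory contents locally. A small sketch of such a check (the check_source_dir helper and file names are hypothetical, not part of the SageMaker SDK):

```python
from pathlib import Path
import tempfile

def check_source_dir(source_dir, entry_script):
    """Hypothetical pre-flight check: confirm source_dir holds both the
    entry script and a requirements.txt before submitting the job."""
    d = Path(source_dir)
    missing = [name for name in (entry_script, "requirements.txt")
               if not (d / name).is_file()]
    if missing:
        raise FileNotFoundError(f"{d} is missing: {', '.join(missing)}")

# Example layout mirroring the one discussed in this thread.
demo = Path(tempfile.mkdtemp()) / "topic"
demo.mkdir()
(demo / "preprocess.py").write_text("import transformers\n")
(demo / "requirements.txt").write_text("transformers\n")
check_source_dir(demo, "preprocess.py")  # passes silently
```

Running a check like this before processor.run(...) catches a missing requirements.txt locally, rather than minutes later as a ModuleNotFoundError inside the job.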

Hope it helps!

Ivan
answered 2 months ago
  • Hi @ivan thanks so much for replying, it's very much appreciated :-)

    So you mention setup.py - as I described in my original question I have the dependencies added there already. Please see the full contents of my setup.py at the end of this comment.

    You mention "requirements.txt file that you submit as part of your source_dir". I'm using the SageMaker Studio default project template, and there's no requirements.txt file. Naturally I can make one, but I am not personally submitting any source_dir. There is a requires.txt file in the pipelines.egg-info folder, which contains the same dependencies as setup.py.

    Thanks for the various links you shared, which point to putting requirements.txt in the source_dir that one passes in like so:

    tp.run(
        code='processing-script.py',
        source_dir='scripts',
    )
    

    but that doesn't seem to help me much for SageMaker Studio, as the place transformers is first required is the preprocessing step, which isn't going to be calling that.

    setup.py (subsection)

    required_packages = ["sagemaker==2.93.0", "sklearn", "transformers", "openpyxl"]
    extras = {
        "test": [
            "black",
            "coverage",
            "flake8",
            "mock",
            "pydocstyle",
            "pytest",
            "pytest-cov",
            "sagemaker",
            "tox",
        ]
    }
    

    pipelines.egg-info/requires.txt

    sagemaker==2.93.0
    sklearn
    transformers
    openpyxl
    
    [test]
    black
    coverage
    flake8
    mock
    pydocstyle
    pytest
    pytest-cov
    sagemaker
    tox
    
  • Hi, @regulatansaku.

    See my comments below.

    I'm using the sagemaker studio default project template, and there's no requirements.txt file.

    You're right, you need to create this file and put it into source_dir='scripts'.

    Naturally I can make one, but I am not personally submitting any source_dir.

    That's the reason: you need to submit a directory with at least two files: processing-script.py and requirements.txt. Put transformers both into setup.py and into scripts/requirements.txt. In setup.py it's probably not even needed; the most important is scripts/requirements.txt.

    There is a requires.txt file that's in the pipelines.egg-info folder, which contains the same as the setup.py

    It's not what you need; you can ignore it. It looks like this file is built from setup.py upon package installation, and it's irrelevant to your processing or training jobs (neither setup.py nor this file is copied to them).

    Hope it now better clarifies my answer.

  • Hi @Ivan, many thanks for that update.

    It's great to hear that I should ignore the setup.py. I think I basically understand what you mean about putting requirements.txt in the source_dir when I get on to using the TensorFlowProcessor, but at the moment the error I get with 'transformers' is from trying to use it in the preprocess step, before I even try to do anything TF.

    Maybe that's where I'm going wrong. Maybe there's only a limited number of libraries available in the preprocess.py and actually that should be ignored for TF stuff, with the TF preprocessing going into some part of TensorFlowProcessor ...?

    Following your advice in the other post I've now been able to deploy via pushing to main, but I still get this error:

    1664464168973 | [ 2022-09-29T15:09:28.973Z ] Traceback (most recent call last): File "/opt/ml/processing/input/code/preprocess.py", line 12, in <module> import transformers

    You've provided so much help already, and I'm sure you're exhausted by all my follow-up questions, but I'm loath to even try to implement TensorFlowProcessor without understanding how I can adjust which libraries are available to the preprocessing step.

    Can I not import arbitrary python modules during preprocessing like import transformers in my topic/preprocess.py? Or is there some other secret technique for that?

  • Hi, @regulatansaku.

    The error that you posted indicates that you're still missing the requirements.txt file with transformers as a dependency. You should put it into the 'topic' dir, so the source dir that you send to TensorFlowProcessor looks like this:

    ./topic/
        preprocess.py
        requirements.txt
    

    And requirements.txt should look like this:

    transformers
    

    Then you create a processor like described in the section TensorFlow Framework Processor:

    tp = TensorFlowProcessor(
    ...
    )
    
    tp.run(
        code='preprocess.py',
        source_dir='topic',
    )
    

    My apologies that I've probably sent you a broken link to this doc earlier. Pay attention to the requirements.txt mentioned in the doc:

    If you have a requirements.txt file, it should be a list of libraries you want to install in the container.

  • Thanks so much for your further follow up @ivan - very very much appreciated.

    So I have now tried this approach in two separate projects - both my own topic one, and in a customer_churn one created following the lab you sent. In both cases I have put a requirements.txt file in the pipelines/<project-name> directory.

    I am very familiar with what the contents of a requirements.txt file should be. In both cases I have created the file by using the following command on the CLI:

    python -m pip freeze > pipelines/<project name>/requirements.txt
    

    I've committed the file to the main branch and pushed up. In both cases I get the same sort of error:

    1664541835894 | [ 2022-09-30T12:43:55.894Z ] ModuleNotFoundError: No module named 'transformers'
    
    1664541835894 | [ 2022-09-30T12:43:55.894Z ] Traceback (most recent call last): File "/opt/ml/processing/input/code/preprocess.py", line 22, in <module> import transformers
    

    The key missing info for me yesterday was that the source_dir is the pipelines/<project-name> directory - that's clear now, thank you so much.

    Is there some extra step required that we have overlooked? Is there something about the transformers library that makes it impossible to require? Do we have an example repo of someone getting this to work with some other lib?

    I've also tried with a requirements.txt file with just "transformers", same issue. Is there something else we need to do in sagemaker studio?
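One likely culprit in the pip freeze approach above: in Studio, freeze pins every package in the notebook kernel, including conda-managed and Studio-internal packages that pip may fail to install inside the processing container, so the dependency install can abort before transformers is ever reached. A hand-written minimal file avoids that (though, as noted above, a minimal file alone did not resolve this particular failure). A small sketch of trimming a freeze listing down to direct dependencies (the package names and versions are illustrative assumptions):

```python
# Keep only the packages the preprocessing script actually imports,
# instead of everything `pip freeze` pinned in the Studio kernel.
DIRECT_DEPS = {"transformers", "openpyxl", "scikit-learn"}

freeze_output = """\
transformers==4.22.1
openpyxl==3.0.10
scikit-learn==1.0.2
conda==22.9.0
ipykernel==6.15.2
"""

def trim_freeze(freeze_text, keep):
    # Retain only lines whose package name (the part before '==')
    # is in the keep set; drop kernel-only packages like conda.
    kept = []
    for line in freeze_text.splitlines():
        name = line.split("==")[0].strip().lower()
        if name in keep:
            kept.append(line)
    return "\n".join(kept) + "\n"

print(trim_freeze(freeze_output, DIRECT_DEPS))
# → transformers==4.22.1
#   openpyxl==3.0.10
#   scikit-learn==1.0.2
```

The trimmed text is what would go into pipelines/<project-name>/requirements.txt.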
