Unable to import sklearn.model_selection.StratifiedGroupKFold in Glue 4.0

I'm running a Glue 4.0 job that includes some algorithmic processing I developed locally. The following import works fine on my local instance:

from sklearn.model_selection import StratifiedGroupKFold, RandomizedSearchCV

But when I run it on Glue, it raises an exception:

ImportError: cannot import name 'StratifiedGroupKFold' from 'sklearn.model_selection' (/home/spark/.local/lib/python3.10/site-packages/sklearn/model_selection/__init__.py)

According to https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-libraries.html, Glue 4.0 ships with scikit-learn==1.1.3, which is compatible with the version on my local instance. Why does this happen?

Update I

This is a bit odd. I printed the sklearn version inside the Glue job, and it shows scikit-learn==0.24.2, which doesn't match the official documentation. Is there a mismatch?
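
For anyone who wants to repeat this check, here is a minimal sketch that reports both the installed version and the path Python would import the package from, which helps spot a second copy shadowing the expected one. The helper name `describe_package` is hypothetical, and it only uses the standard library, so it should run unchanged inside a job script.

```python
import importlib.metadata
import importlib.util


def describe_package(dist_name, import_name):
    """Return the installed version and import location of a package.

    dist_name is the name used by pip (e.g. "scikit-learn");
    import_name is the name used in import statements (e.g. "sklearn").
    """
    try:
        version = importlib.metadata.version(dist_name)
    except importlib.metadata.PackageNotFoundError:
        # The distribution is not installed in this environment.
        version = None

    # find_spec locates a top-level module without importing it.
    spec = importlib.util.find_spec(import_name)
    location = spec.origin if spec is not None else None
    return {"version": version, "location": location}


# Print the result so it appears in the job's output logs.
print(describe_package("scikit-learn", "sklearn"))
```

If the reported location is under a user-level site-packages directory (like the path in the traceback above), that copy is the one being imported.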

Update II

I added the configs below to force an upgrade of scikit-learn. It works, but it's not a perfect solution because of the library version mismatch.

--additional-python-modules: scikit-learn
--python-modules-installer-option: --upgrade
asked a year ago, 667 views
1 Answer

Hello,

I replicated your use case on Glue 4.0 using scikit-learn 1.1.3, which is the default version installed for Glue 4.0. With the same version also defined in the Glue job parameters as follows, the import works without any errors:

--additional-python-modules scikit-learn==1.1.3

I also printed the scikit-learn version in Glue 4.0, and it returned 1.1.3, which confirms that Glue 4.0 uses scikit-learn 1.1.3 by default.

I also checked the source code of scikit-learn 0.24.2, and there is no "StratifiedGroupKFold" class in sklearn/model_selection/__init__.py for that release (the class was added in scikit-learn 1.0). You can see the source code of scikit-learn 0.24.2 in the reference document [1].
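
To make that version dependency explicit, here is a rough sketch of a guard you could run before the import. The helper names `version_tuple` and `has_stratified_group_kfold` are hypothetical, and the parser is deliberately simple: it reads only the leading numeric components and ignores pre-release tags.

```python
import re


def version_tuple(version):
    """Parse leading numeric components, e.g. "1.1.3" -> (1, 1, 3)."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            # Stop at the first non-numeric component.
            break
        parts.append(int(match.group()))
    return tuple(parts)


def has_stratified_group_kfold(sklearn_version):
    """Return True if this scikit-learn version includes StratifiedGroupKFold."""
    # StratifiedGroupKFold first shipped in scikit-learn 1.0.
    return version_tuple(sklearn_version) >= (1, 0)


for candidate in ("0.24.2", "1.1.3"):
    print(candidate, has_stratified_group_kfold(candidate))
```

In a job script, you could pass `sklearn.__version__` to this check and fail early with a clear message instead of hitting the ImportError.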

Furthermore, when I ran the same Glue 4.0 job with scikit-learn==0.24.2 specified in the additional-python-modules parameter, the job failed with an error similar to the one in your question:

ImportError: cannot import name 'StratifiedGroupKFold' from 'sklearn.model_selection' (/home/spark/.local/lib/python3.10/site-packages/sklearn/model_selection/__init__.py)

I also tried adding the following parameters, and the Glue job ran without any errors:

--additional-python-modules: scikit-learn
--python-modules-installer-option: --upgrade

I suggest creating a new Glue 4.0 job and either setting the additional-python-modules parameter to scikit-learn==1.1.3 or simply running the job without specifying that parameter at all.
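
If you define your jobs from code rather than in the console, the pinned parameter can be built as a plain dictionary. This is only a sketch: the helper name `additional_modules_arguments` is hypothetical, and passing the result as the job's default arguments (for example through boto3) is an assumption about your setup.

```python
def additional_modules_arguments(pins):
    """Build Glue job arguments that pin Python module versions.

    pins maps a module name to the exact version to install,
    e.g. {"scikit-learn": "1.1.3"}.
    """
    # Multiple modules are joined into one comma-separated value.
    value = ",".join(f"{name}=={version}" for name, version in pins.items())
    return {"--additional-python-modules": value}


arguments = additional_modules_arguments({"scikit-learn": "1.1.3"})
print(arguments)
```

Pinning the exact version keeps the job environment aligned with the documented default and avoids picking up whatever release an unpinned upgrade would install.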

If the issue persists, please feel free to open a support case with AWS so we can troubleshoot further, and include the error message along with the job run ID. We will be happy to assist you.

Reference: [1] https://github.com/scikit-learn/scikit-learn/releases/tag/0.24.2

AWS
SUPPORT ENGINEER
answered a year ago
