Unable to import sklearn.model_selection.StratifiedGroupKFold in Glue 4.0


I'm running a Glue 4.0 job that includes some algorithmic processing. The following import works fine on my local instance:

from sklearn.model_selection import StratifiedGroupKFold, RandomizedSearchCV

But when I run it on Glue, it raises an exception:

ImportError: cannot import name 'StratifiedGroupKFold' from 'sklearn.model_selection' (/home/spark/.local/lib/python3.10/site-packages/sklearn/model_selection/__init__.py)

According to https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-libraries.html, Glue 4.0 ships with scikit-learn==1.1.3, which is compatible with the version on my local instance. I'm not sure why this happens.

Update I

This is a little weird. I printed the sklearn version inside the Glue job and it shows scikit-learn==0.24.2, which doesn't match the official documentation. Is there a mismatch?
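A minimal sketch of such a version check, placed at the top of the job script. It uses only the Python standard library; the helper names are illustrative and not part of Glue or scikit-learn. Printing the module location as well can show which copy of sklearn is actually imported, for example one under /home/spark/.local as in the traceback above.

```python
import importlib.metadata
import importlib.util
import sys


def installed_version(distribution):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return importlib.metadata.version(distribution)
    except importlib.metadata.PackageNotFoundError:
        return None


def module_location(module_name):
    """Return the file a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(module_name)
    return spec.origin if spec is not None else None


def describe_environment(distribution="scikit-learn", module_name="sklearn"):
    """Build a one-line summary suitable for the job log."""
    python = ".".join(str(part) for part in sys.version_info[:3])
    return (
        f"python={python} "
        f"{distribution}={installed_version(distribution)} "
        f"location={module_location(module_name)}"
    )


# Log this before any sklearn import so it appears even if the import fails.
print(describe_environment())
```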

Update II

I tried appending the job parameters below to force an upgrade of scikit-learn. It is not a perfect solution, though, since the library version still doesn't match the documentation.

--additional-python-modules: scikit-learn
--python-modules-installer-option: --upgrade
asked a year ago · 682 views
1 Answer

Hello,

I replicated this use case in Glue 4.0 with scikit-learn 1.1.3, which is the default version installed for Glue 4.0. With that version set explicitly in the Glue job parameters, the import works without any errors:

--additional-python-modules scikit-learn==1.1.3

I also printed the scikit-learn version in Glue 4.0. It returned 1.1.3, which confirms that Glue 4.0 uses scikit-learn 1.1.3 by default.

Checking the source code of scikit-learn 0.24.2, I can see that sklearn/model_selection/__init__.py does not define a StratifiedGroupKFold class (the class was introduced in scikit-learn 1.0). You can find the scikit-learn source code in the reference [1].
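The version dependency can be sketched as a small check on the version string. This is only an illustration: the function name is hypothetical, and the 1.0 threshold is the release in which scikit-learn added StratifiedGroupKFold.

```python
import re


def has_stratified_group_kfold(version):
    """Return True if this scikit-learn version string is 1.0 or newer."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) >= (1, 0)


# The two versions discussed in this thread.
for candidate in ("0.24.2", "1.1.3"):
    print(candidate, has_stratified_group_kfold(candidate))
# prints:
# 0.24.2 False
# 1.1.3 True
```

Inside a job, the string returned by sklearn.__version__ could be passed to this check before the import, so that the job fails with a clearer message than the ImportError.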

Furthermore, when I run the same Glue 4.0 job with scikit-learn==0.24.2 specified in the additional-python-modules parameter, the job fails with an error similar to the one in your question:


ImportError: cannot import name 'StratifiedGroupKFold' from 'sklearn.model_selection' (/home/spark/.local/lib/python3.10/site-packages/sklearn/model_selection/__init__.py)

I also tried adding the following parameters, and the Glue job ran without any errors:


--additional-python-modules: scikit-learn
--python-modules-installer-option: --upgrade

I suggest creating a new Glue 4.0 job and either setting the additional-python-modules parameter to scikit-learn==1.1.3 or simply running the job without any of these parameters.
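If the job is started programmatically, the same parameters can be passed as job arguments. The following is a minimal sketch, assuming boto3 is installed and AWS credentials are configured; the helper names and the job name in the usage comment are hypothetical.

```python
def build_job_arguments(pinned_version=None, upgrade=False):
    """Build the Glue job arguments discussed in this thread as a dict."""
    if pinned_version is None:
        module = "scikit-learn"
    else:
        module = f"scikit-learn=={pinned_version}"
    arguments = {"--additional-python-modules": module}
    if upgrade:
        arguments["--python-modules-installer-option"] = "--upgrade"
    return arguments


def start_job(job_name, arguments):
    """Start a Glue job run with the given arguments and return its run ID."""
    import boto3  # imported lazily so the helper above works without boto3

    glue = boto3.client("glue")
    response = glue.start_job_run(JobName=job_name, Arguments=arguments)
    return response["JobRunId"]


# Pin the version that the Glue 4.0 documentation lists.
print(build_job_arguments(pinned_version="1.1.3"))

# Hypothetical usage; "my-glue-job" is a placeholder job name:
# run_id = start_job("my-glue-job", build_job_arguments(pinned_version="1.1.3"))
```

Pinning an exact version keeps runs reproducible, whereas the --upgrade option installs whatever release is newest at run time.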

If the issue persists, please feel free to open a support case with AWS using the following link, and include the error along with the job run ID so that we can troubleshoot further. We will be happy to assist you.

Reference: [1] https://github.com/scikit-learn/scikit-learn/releases/tag/0.24.2

AWS
SUPPORT ENGINEER
answered a year ago
