I have been experimenting with Glue 4.0, which supports Python 3.10 and pandas. I am adding pandas as a zipped library through the --extra-py-files option for a glueetl job.
When I run my job, it fails on importing pandas (version 1.4.3) (import pandas as pd) with the following error, which I copy-pasted from the CloudWatch logs:
2022-12-06 16:49:09,450 ERROR [main] glue.ProcessLauncher (Logging.scala:logError(77)): Error from Python:Traceback (most recent call last):
File "/tmp/database_monitoring.py", line 2, in <module>
import pandas as pd
File "/home/spark/.local/lib/python3.10/site-packages/pandas/__init__.py", line 48, in <module>
from pandas.core.api import (
File "/home/spark/.local/lib/python3.10/site-packages/pandas/core/api.py", line 47, in <module>
from pandas.core.groupby import (
File "/home/spark/.local/lib/python3.10/site-packages/pandas/core/groupby/__init__.py", line 1, in <module>
from pandas.core.groupby.generic import (
File "/home/spark/.local/lib/python3.10/site-packages/pandas/core/groupby/generic.py", line 76, in <module>
from pandas.core.frame import DataFrame
File "/home/spark/.local/lib/python3.10/site-packages/pandas/core/frame.py", line 170, in <module>
from pandas.core.generic import NDFrame
File "/home/spark/.local/lib/python3.10/site-packages/pandas/core/generic.py", line 147, in <module>
from pandas.core.describe import describe_ndframe
File "/home/spark/.local/lib/python3.10/site-packages/pandas/core/describe.py", line 45, in <module>
from pandas.io.formats.format import format_percentiles
File "/home/spark/.local/lib/python3.10/site-packages/pandas/io/formats/format.py", line 105, in <module>
from pandas.io.common import (
File "/home/spark/.local/lib/python3.10/site-packages/pandas/io/common.py", line 8, in <module>
import bz2
File "/usr/local/lib/python3.10/bz2.py", line 17, in <module>
from _bz2 import BZ2Compressor, BZ2Decompressor
ModuleNotFoundError: No module named '_bz2'
I believe this is a bug in AWS Glue 4.0 as opposed to a user issue. Is anyone able to advise or confirm? And if so, is there a bug fix planned for this?
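For anyone trying to reproduce or narrow this down: the traceback ends in the standard library's bz2.py failing to import the compiled _bz2 module, so it may help to check whether the interpreter itself can locate that module, independent of pandas. Below is a minimal sketch, assuming it is run at the top of the job script; the helper name missing_stdlib_extensions and the list of modules checked are my own choices, not anything Glue provides.

```python
import importlib.util
import sys


def missing_stdlib_extensions(names=("_bz2", "_lzma", "zlib")):
    """Return the module names from `names` that this interpreter cannot locate.

    Hypothetical helper for diagnosis only; the default names are
    compression modules, chosen here as an assumption about what is
    worth checking.
    """
    missing = []
    for name in names:
        # find_spec returns None when no import finder can locate the module.
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Print the interpreter version alongside the result so the log
    # shows which Python build the check ran against.
    print(sys.version)
    print("missing:", missing_stdlib_extensions())
```

If _bz2 shows up as missing even before pandas is imported, that would point at how the Python runtime was built rather than at the zipped pandas library.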
I did attempt this but found it was more trouble than it was worth. I am using libraries that also depend on pandas, so I would need to add custom logic to ignore the pandas dependency when installing those libraries. And even then, that assumes the pandas version AWS offers is compatible.
This would add significant overhead to something which is supposed to be an out-of-the-box solution. Hence, I just won't use Glue 4.0 and will think of an alternative unless this is resolved.
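To illustrate the kind of custom logic meant here, this is a rough sketch, assuming the libraries are installed with pip and that skipping dependency resolution entirely is acceptable. It uses pip's --no-deps flag, which skips all dependencies and not just pandas, so every other dependency would have to be listed explicitly. The function name and the package name in the comment are hypothetical, and whether this can be run in a Glue job at all is not something I have verified.

```python
import sys


def build_no_deps_install(packages):
    """Build a pip command that installs `packages` without their dependencies.

    Hypothetical helper: it only constructs the command, it does not run it.
    """
    if not packages:
        raise ValueError("at least one package is required")
    # Use the current interpreter's pip so the install targets the same Python.
    return [sys.executable, "-m", "pip", "install", "--no-deps", *packages]


# To actually run it (package name is a placeholder):
#   import subprocess
#   subprocess.check_call(build_no_deps_install(["some_lib_using_pandas"]))
```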
I guess there was some context missing, and neither you nor I actually googled the error message ;)