Bug: Glue 4.0 | Python 3.10 | pandas library


I have been experimenting with AWS Glue 4.0, which supports Python 3.10 and pandas.

I am adding pandas as a zipped library through the --extra-py-files option for a glueetl job.

When I run the job, importing pandas (version 1.4.3) with "import pandas as pd" fails with the following error, copied from the CloudWatch logs:

2022-12-06 16:49:09,450 ERROR [main] glue.ProcessLauncher (Logging.scala:logError(77)): Error from Python:Traceback (most recent call last):
  File "/tmp/database_monitoring.py", line 2, in <module>
    import pandas as pd
  File "/home/spark/.local/lib/python3.10/site-packages/pandas/__init__.py", line 48, in <module>
    from pandas.core.api import (
  File "/home/spark/.local/lib/python3.10/site-packages/pandas/core/api.py", line 47, in <module>
    from pandas.core.groupby import (
  File "/home/spark/.local/lib/python3.10/site-packages/pandas/core/groupby/__init__.py", line 1, in <module>
    from pandas.core.groupby.generic import (
  File "/home/spark/.local/lib/python3.10/site-packages/pandas/core/groupby/generic.py", line 76, in <module>
    from pandas.core.frame import DataFrame
  File "/home/spark/.local/lib/python3.10/site-packages/pandas/core/frame.py", line 170, in <module>
    from pandas.core.generic import NDFrame
  File "/home/spark/.local/lib/python3.10/site-packages/pandas/core/generic.py", line 147, in <module>
    from pandas.core.describe import describe_ndframe
  File "/home/spark/.local/lib/python3.10/site-packages/pandas/core/describe.py", line 45, in <module>
    from pandas.io.formats.format import format_percentiles
  File "/home/spark/.local/lib/python3.10/site-packages/pandas/io/formats/format.py", line 105, in <module>
    from pandas.io.common import (
  File "/home/spark/.local/lib/python3.10/site-packages/pandas/io/common.py", line 8, in <module>
    import bz2
  File "/usr/local/lib/python3.10/bz2.py", line 17, in <module>
    from _bz2 import BZ2Compressor, BZ2Decompressor
ModuleNotFoundError: No module named '_bz2'

I believe this is a bug in AWS Glue 4.0 as opposed to a user issue. Is anyone able to advise or confirm? And if so, is there a bug fix planned for this?

JDay
asked 2 years ago, 1074 views
2 Answers
0

Do you need a specific or higher version than the included one? If not, there is no need to provide a zip at all. I can't and won't test Glue v4, since the documentation (e.g. https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-libraries.html) and other service functionality such as the aws_glue_interactive_sessions module have not been updated. So much for "general availability".

answered 2 years ago
  • I did attempt this but found it was more trouble than it was worth. I am using libraries which also depend on pandas, so I would need to add custom logic to ignore the pandas dependency when installing those libraries. And even then, that assumes the pandas version AWS offers is compatible.

    This would add significant overhead to something which is supposed to be an out-of-the-box solution. Hence, I just won't use Glue 4.0 and will think of an alternative unless this is resolved.

  • I guess some context was missing, and neither you nor I actually googled the error message ;)

0

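The integrated pandas version hits the same error. The traceback ends in the standard library's bz2 module, not in pandas itself: the Python 3.10 interpreter is missing the compiled _bz2 extension, so this has to be fixed at the system level, e.g. https://stackoverflow.com/questions/50335503/no-module-named-bz2-in-python3.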
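To confirm this from inside a job without involving pandas, a small check along these lines may help. It is only a sketch: the helper name missing_extensions and the list of modules to probe are my own choices, not anything Glue provides. It simply tries to import the compression extension modules that CPython builds only when the matching system headers are available at compile time.

```python
import importlib
import sys

# Compression-related modules that CPython only builds when the matching
# system development headers are present at compile time.
# The choice of names to probe is an assumption made for this sketch.
OPTIONAL_EXTENSIONS = ("_bz2", "_lzma", "zlib")


def missing_extensions(names=OPTIONAL_EXTENSIONS):
    """Return the names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Report which interpreter is running and which modules are absent.
    print("Python", sys.version.split()[0], "at", sys.executable)
    print("Missing extension modules:", missing_extensions() or "none")
```

If _bz2 shows up as missing when this runs as the first lines of the job script, the problem is in the interpreter build rather than in the zipped pandas library.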

answered 2 years ago
