EMR Serverless 6.6.0 Python SWIG Lib dependency


I'm trying to create an isolated Python virtual environment to package the Python libraries needed for a PySpark job.

I managed to make it work by simply following these steps: https://github.com/aws-samples/emr-serverless-samples/tree/main/examples/pyspark/dependencies

However, one Python library dependency (a SWIG-based wrapper) fails to install because it requires additional packages such as gcc, gcc-c++, and python3-devel. Library: https://github.com/51Degrees/Device-Detection/tree/master/python

So I added RUN yum install -y gcc gcc-c++ python3-devel to the Dockerfile (https://github.com/aws-samples/emr-serverless-samples/blob/main/examples/pyspark/dependencies/Dockerfile), the library installed successfully, and I packaged the virtual env.

However, the EMR job fails because that library's Python modules are not found, which makes me think that python3-devel is not present in EMR Serverless 6.6.0.

Since I don't have control over the serverless environment, is there any way around this? Or am I missing something?

stderr

An error occurred while calling o198.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 19 in stage 0.0 failed 4 times, most recent failure: Lost task 19.3 in stage 0.0 (TID 89) ([2600:1f18:153d:6601:bfcc:6ff:50bc:240e] executor 7): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/home/hadoop/environment/lib64/python3.7/site-packages/FiftyOneDegrees/fiftyone_degrees_mobile_detector_v3_wrapper.py", line 15, in swig_import_helper
    return importlib.import_module(mname)
  File "/usr/lib64/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 965, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'FiftyOneDegrees._fiftyone_degrees_mobile_detector_v3_wrapper'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 619, in main
    process()
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 611, in process
    serializer.dump_stream(out_iter, outfile)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 259, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "./jobs.zip/jobs/parsed_events_orc_processor/etl.py", line 360, in enrich_events
    event['device'] = calculate_device_data(event)
  File "./jobs.zip/jobs/parsed_events_orc_processor/etl.py", line 152, in calculate_device_data
    device_data = mobile_detector.match(user_agent)
  File "/home/hadoop/environment/lib64/python3.7/site-packages/fiftyone_degrees/mobile_detector/__init__.py", line 225, in match
    else settings.DETECTION_METHOD)
  File "/home/hadoop/environment/lib64/python3.7/site-packages/fiftyone_degrees/mobile_detector/__init__.py", line 63, in instance
    cls._INSTANCES[method] = cls._METHODS[method]()
  File "/home/hadoop/environment/lib64/python3.7/site-packages/fiftyone_degrees/mobile_detector/__init__.py", line 98, in __init__
    from FiftyOneDegrees import fiftyone_degrees_mobile_detector_v3_wrapper
  File "/home/hadoop/environment/lib64/python3.7/site-packages/FiftyOneDegrees/fiftyone_degrees_mobile_detector_v3_wrapper.py", line 18, in <module>
    _fiftyone_degrees_mobile_detector_v3_wrapper = swig_import_helper()
  File "/home/hadoop/environment/lib64/python3.7/site-packages/FiftyOneDegrees/fiftyone_degrees_mobile_detector_v3_wrapper.py", line 17, in swig_import_helper
    return importlib.import_module('_fiftyone_degrees_mobile_detector_v3_wrapper')
  File "/usr/lib64/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
ModuleNotFoundError: No module named '_fiftyone_degrees_mobile_detector_v3_wrapper'

        at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:545)
        at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:703)
        at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:685)
        at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:498)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithKeys_0$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:35)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hasNext(Unknown Source)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:954)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
        at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:142)
        at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
        at org.apache.spark.scheduler.Task.run(Task.scala:133)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1474)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)

Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2559)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2508)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2507)
        at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
        at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2507)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1149)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1149)
        at scala.Option.foreach(Option.scala:407)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1149)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2747)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2689)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2678)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
        at org.apache.spark.sql.execution.adaptive.AdaptiveExecutor.checkNoFailures(AdaptiveExecutor.scala:154)
        at org.apache.spark.sql.execution.adaptive.AdaptiveExecutor.doRun(AdaptiveExecutor.scala:88)
        at org.apache.spark.sql.execution.adaptive.AdaptiveExecutor.tryRunningAndGetFuture(AdaptiveExecutor.scala:66)
        at org.apache.spark.sql.execution.adaptive.AdaptiveExecutor.execute(AdaptiveExecutor.scala:57)
        at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$getFinalPhysicalPlan$1(AdaptiveSparkPlanExec.scala:241)
        at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
        at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.getFinalPhysicalPlan(AdaptiveSparkPlanExec.scala:240)
        at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.withFinalPlanUpdate(AdaptiveSparkPlanExec.scala:509)
        at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:471)
        at org.apache.spark.sql.Dataset.$anonfun$count$1(Dataset.scala:3053)
        at org.apache.spark.sql.Dataset.$anonfun$count$1$adapted(Dataset.scala:3052)
        at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3770)
        at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
        at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
        at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:110)
        at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:135)
        at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
        at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
        at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:135)
        at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:253)
        at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:134)
        at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68)
        at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3768)
        at org.apache.spark.sql.Dataset.count(Dataset.scala:3052)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
        at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
        at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/home/hadoop/environment/lib64/python3.7/site-packages/FiftyOneDegrees/fiftyone_degrees_mobile_detector_v3_wrapper.py", line 15, in swig_import_helper
    return importlib.import_module(mname)
  File "/usr/lib64/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 965, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'FiftyOneDegrees._fiftyone_degrees_mobile_detector_v3_wrapper'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 619, in main
    process()
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 611, in process
    serializer.dump_stream(out_iter, outfile)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 259, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "./jobs.zip/jobs/parsed_events_orc_processor/etl.py", line 360, in enrich_events
    event['device'] = calculate_device_data(event)
  File "./jobs.zip/jobs/parsed_events_orc_processor/etl.py", line 152, in calculate_device_data
    device_data = mobile_detector.match(user_agent)
  File "/home/hadoop/environment/lib64/python3.7/site-packages/fiftyone_degrees/mobile_detector/__init__.py", line 225, in match
    else settings.DETECTION_METHOD)
  File "/home/hadoop/environment/lib64/python3.7/site-packages/fiftyone_degrees/mobile_detector/__init__.py", line 63, in instance
    cls._INSTANCES[method] = cls._METHODS[method]()
  File "/home/hadoop/environment/lib64/python3.7/site-packages/fiftyone_degrees/mobile_detector/__init__.py", line 98, in __init__
    from FiftyOneDegrees import fiftyone_degrees_mobile_detector_v3_wrapper
  File "/home/hadoop/environment/lib64/python3.7/site-packages/FiftyOneDegrees/fiftyone_degrees_mobile_detector_v3_wrapper.py", line 18, in <module>
    _fiftyone_degrees_mobile_detector_v3_wrapper = swig_import_helper()
  File "/home/hadoop/environment/lib64/python3.7/site-packages/FiftyOneDegrees/fiftyone_degrees_mobile_detector_v3_wrapper.py", line 17, in swig_import_helper
    return importlib.import_module('_fiftyone_degrees_mobile_detector_v3_wrapper')
  File "/usr/lib64/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
ModuleNotFoundError: No module named '_fiftyone_degrees_mobile_detector_v3_wrapper'

        at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:545)
        at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:703)
        at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:685)
        at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:498)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithKeys_0$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution
  • Could you post the full error message and the script that you're using as your job?

    I tried this but got an error message about being unable to load the database file:

    The provided detection database file (/home/hadoop/51Degrees/51Degrees-LiteV3.2.dat) does not exist or is not readable

  • Just added stderr output, tell me if that helps!

    Thanks!

  • Can you confirm your virtualenv tar file has the module in it?

    This command (change pyspark_ge.tar.gz if you renamed it in the Dockerfile):

    tar tzvf pyspark_ge.tar.gz | grep _fiftyone_degrees_mobile_detector_v3_wrapper
    

    should show output like this:

    -rwxr-xr-x  0 root   root  2352224 Jun 28 14:31 lib/python3.7/site-packages/_fiftyone_degrees_mobile_detector_v3_wrapper.cpython-37m-x86_64-linux-gnu.so
    

    I was able to get it to work with a simple script, which I'll post below.

asked 2 years ago, 948 views
2 Answers

This should be possible but, as you mentioned, it requires some extra Linux packages. I was able to get it working with one additional change: the Device-Detection library tries to load a .dat file from a specific location in the user's home directory that won't exist in the virtualenv package, but luckily the path can be specified manually.

Here's my Dockerfile with the extra yum packages as well as a step where I manually download the .dat file.

FROM amazonlinux:2 AS base

# Build tools required to compile the SWIG extension for the 51Degrees wrapper
RUN yum install -y python3 gcc gcc-c++ python3-devel

ENV VIRTUAL_ENV=/opt/venv
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"

RUN python3 -m pip install --upgrade pip && \
    python3 -m pip install \
    51degrees-mobile-detector-v3-wrapper==3.2.18.4 \
    venv-pack==0.2.0

# Download the data file
RUN curl --output-dir ${VIRTUAL_ENV} -LO https://github.com/51Degrees/Device-Detection/raw/master/data/51Degrees-LiteV3.2.dat

RUN mkdir /output && venv-pack -o /output/pyspark_so.tar.gz

FROM scratch AS export
COPY --from=base /output/pyspark_so.tar.gz /
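
If it helps, this is roughly how the tarball gets built from the multi-stage Dockerfile and uploaded to S3 (a minimal sketch, assuming BuildKit is available and the same S3 layout used in the job command below):

# Build the image; BuildKit writes the contents of the "export" stage
# (pyspark_so.tar.gz) to the current directory
DOCKER_BUILDKIT=1 docker build --output . .

# Upload the packaged environment so the job can pick it up via spark.archives
aws s3 cp pyspark_so.tar.gz s3://${S3_BUCKET}/artifacts/pyspark/pyspark_so.tar.gz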

Here's my sample script. Note that I have to create the provider manually by specifying the .dat file path in /home/hadoop/environment (where the virtualenv tar gets unpacked), and I try a detection both with and without Spark loaded.

# Default mobile_detector looks in /home/hadoop/51Degrees/51Degrees-LiteV3.2.dat
# So instead we load settings and the wrapper and create the Provider manually
# from fiftyone_degrees import mobile_detector
from fiftyone_degrees.mobile_detector.conf import settings
from FiftyOneDegrees import fiftyone_degrees_mobile_detector_v3_wrapper
from pyspark.sql import SparkSession

provider = fiftyone_degrees_mobile_detector_v3_wrapper.Provider(
    "/home/hadoop/environment/51Degrees-LiteV3.2.dat",
    settings.PROPERTIES,
    settings.CACHE_SIZE,
    settings.POOL_SIZE,
)
device = provider.getMatch(
    "Mozilla/5.0 (iPad; CPU OS 5_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Mobile/9B176"
)

print(device.getValue("BrowserName"))


if __name__ == "__main__":
    spark = SparkSession.builder.appName("NativeModules").getOrCreate()

    device = provider.getMatch(
        "Mozilla/5.0 (iPad; CPU OS 5_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Mobile/9B176"
    )
    print(device.getValue("BrowserName"))

And then this is the command used to run the job. Note the different sparkSubmitParameters; they are all required for this to work properly.

aws emr-serverless start-job-run \
    --application-id $APPLICATION_ID \
    --execution-role-arn $JOB_ROLE_ARN \
    --job-driver '{
        "sparkSubmit": {
            "entryPoint": "s3://'${S3_BUCKET}'/code/pyspark/native_mod.py",
            "sparkSubmitParameters": "--conf spark.archives=s3://'${S3_BUCKET}'/artifacts/pyspark/pyspark_so.tar.gz#environment --conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python --conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python --conf spark.emr-serverless.executorEnv.PYSPARK_PYTHON=./environment/bin/python"
        }
    }' \
    --configuration-overrides '{
        "monitoringConfiguration": {
            "s3MonitoringConfiguration": {
                "logUri": "s3://'${S3_BUCKET}'/logs/"
            }
        }
    }'
AWS
dacort
answered 2 years ago
  • Thanks for your detailed answer, I finally got it to work!

    I am packaging it without the .dat file inside the environment and passing the file with --conf spark.files=s3://<bucket>/51Degrees-PremiumV3_2.dat, which also works fine :)

    The problem I was having, which your comment made me notice, is that instead of my venv containing:

    -rwxr-xr-x  0 root   root  2352224 Jun 28 14:31 lib/python3.7/site-packages/_fiftyone_degrees_mobile_detector_v3_wrapper.cpython-37m-x86_64-linux-gnu.so
    

    I had

    -rwxr-xr-x  0 root   root  2477856 29 Jun 21:33 lib/python3.7/site-packages/_fiftyone_degrees_mobile_detector_v3_wrapper.cpython-37m-aarch64-linux-gnu.so
    

    That instantly made me realize that I was building the image targeting arm64 (the default on Apple Silicon). I just had to pass --platform amd64 to the docker build command (full command sketched at the end of this comment) and voilà, it worked!

    Thanks for your help!
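
    For reference, the full build command on Apple Silicon ends up looking roughly like this (a sketch, assuming the Dockerfile from the answer above and BuildKit):

    # Force an x86_64 (amd64) build so the compiled .so matches the EMR Serverless workers
    DOCKER_BUILDKIT=1 docker build --platform linux/amd64 --output . .

    # Sanity check: the wrapper .so should now show x86_64 in its name
    tar tzvf pyspark_so.tar.gz | grep _fiftyone_degrees_mobile_detector_v3_wrapper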

  • Can you add that to the readme so it's clear that we should target amd64 chipsets?

  • Ah, great point! Thanks @fvcpinheiro - I'll get that added.


Hi,

Thank you for posting your query to re:Post. I have passed this information on to @dacort, the author of the GitHub sample.

Please share the errors from the stderr and stdout files, and let us know if you are using the ge_profile.py script to run the application. Please make sure there is no sensitive information in the logs you share.

AWS
Sujay_M
answered 2 years ago
  • I'm not using ge_profile.py, but a custom script.

    I updated my question with the stderr output. There you can see that mobile_detector imports fine, which means it is packaged correctly, but the low-level calls to the C lib raise that module-not-found error.

    @dacort
