
AWS EMR : PySpark Job failing with “JavaPackage object is not Callable” error during mapPartition stage


My PySpark job is failing in the mapPartitions stage with a 'JavaPackage' object is not callable error. I have verified that the function I am passing to mapPartitions is callable and that the objects passed to it are serializable. The same code works when I run it locally on my PC, but it fails on EMR. What am I missing? Has anyone faced this error before who can help?

Below is the error log:

Sending data to segment: An error occurred while sending data to Segment: 'JavaPackage' object is not callable
Traceback (most recent call last):
  File "/mnt/tmp/spark-8210288/application.zip/project/jobs/ingestion_job.py", line 329, in send_data_to_segment
    lambda partition: IngestionJob.batch_process(
  File "/mnt/tmp/spark-8210288/application.zip/pyspark/rdd.py", line 2316, in count
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "/mnt/tmp/spark-8210288/application.zip/pyspark/rdd.py", line 2292, in sum
    0, operator.add
  File "/mnt/tmp/spark-8210288/application.zip/pyspark/rdd.py", line 2044, in fold
    vals = self.mapPartitions(func).collect()
  File "/mnt/tmp/spark-8210288/application.zip/pyspark/rdd.py", line 1833, in collect
    sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
  File "/mnt/tmp/spark-8210288/application.zip/pyspark/rdd.py", line 5471, in _jrdd
    self.ctx, self.func, self._prev_jrdd_deserializer, self._jrdd_deserializer, profiler
  File "/mnt/tmp/spark-8210288/application.zip/pyspark/rdd.py", line 5277, in _wrap_function
    sc._javaAccumulator,
TypeError: 'JavaPackage' object is not callable

asked 10 months ago · 561 views
1 Answer

Greeting

Hi Saloni,

Thank you for reaching out! It looks like you’re facing an issue with your PySpark job on Amazon EMR related to the error: "JavaPackage object is not Callable." I'm here to help troubleshoot this and get your job running smoothly. Let’s dig into this together! 🚀


Clarifying the Issue

From your description, the error occurs during the mapPartitions stage on EMR, but the same code works fine when run locally on your PC. This suggests the problem is tied to how a distributed environment like EMR handles dependencies, serialization, and Java-Python interactions.

Based on the traceback, the root of the issue is that a JavaPackage object is being called as if it were a class constructor or function. In Py4J (the bridge PySpark uses to talk to the JVM), an attribute lookup such as sc._jvm.com.example.MyJavaClass returns a JavaPackage placeholder whenever the name cannot be resolved to an actual class on the JVM classpath, and calling that placeholder raises exactly this TypeError. That usually points to a missing JAR or a version mismatch on the cluster rather than a problem in your Python logic. To solve this, we'll ensure dependencies are properly configured, and we'll create a Python wrapper to bridge the gap. A quick way to confirm the diagnosis is shown just below.
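As a quick sanity check (a minimal sketch; com.example.MyJavaClass is a placeholder for whichever class your job actually references), run this on the driver and see what type Py4J resolves the name to:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    # sc._jvm is PySpark's internal Py4J handle. Attribute access never
    # raises; unresolved names silently come back as JavaPackage objects.
    ref = sc._jvm.com.example.MyJavaClass  # placeholder class name
    print(type(ref))
    # py4j.java_gateway.JavaClass   -> the class is on the JVM classpath
    # py4j.java_gateway.JavaPackage -> the class was NOT found, and
    #                                  calling ref() raises this exact TypeError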


Key Terms

  • JavaPackage Object: Py4J's placeholder for a Java package name. PySpark exposes Java code to Python through the Py4J bridge; when a dotted name cannot be resolved to an actual class, Py4J hands back a JavaPackage, which cannot be called or executed directly.
  • EMR Cluster: Amazon Elastic MapReduce, a managed service for big data processing using frameworks like Apache Spark and Hadoop.
  • Serialization: The process of converting an object into a format that can be transmitted across the network. In distributed computing, everything PySpark ships to worker nodes, including the functions you pass to mapPartitions, must be serializable (see the quick check after this list).
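PySpark pickles shipped functions with cloudpickle before sending them to executors. Here is a minimal sketch (using the standard pickle module as a conservative stand-in; cloudpickle accepts strictly more, e.g. lambdas) for checking an object up front on the driver:

    import pickle

    def is_serializable(obj):
        """Return True if obj survives a pickle round trip."""
        try:
            pickle.loads(pickle.dumps(obj))
            return True
        except Exception as exc:  # pickling failures vary by object type
            print(f"Not serializable: {exc}")
            return False

    print(is_serializable([1, 2, 3]))  # True: plain Python data is fine
    # A live Py4J handle such as sc._jvm.<anything> fails this check,
    # so avoid capturing raw Java objects in functions you ship to workers.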

The Solution (Our Recipe)

Steps at a Glance:

  1. Verify that the library or dependency containing the JavaPackage object is available on EMR.
  2. Ensure the function passed to mapPartitions is serializable by wrapping the Java object in a Python callable.
  3. Submit the updated job to EMR and validate the output.

Step-by-Step Guide:

  1. Verify that the library or dependency containing the JavaPackage object is available on EMR:
    Check that the Java library or package you're using is included in the EMR cluster configuration. Add it via the --jars option when submitting your Spark job:
    spark-submit --jars s3://your-bucket/path-to-your-library.jar your_script.py
    If you use custom EMR steps, include the dependency during cluster bootstrapping, or pick an EMR release that already ships the required library. A sketch of submitting the step programmatically follows below.
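If you submit work as EMR steps rather than calling spark-submit directly, the same --jars flag can be passed through command-runner.jar. A minimal sketch with boto3 (the region, bucket paths, and cluster ID are placeholders you would replace):

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")  # assumed region

    response = emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXXX",  # your cluster ID
        Steps=[{
            "Name": "ingestion-job",
            "ActionOnFailure": "CONTINUE",
            # command-runner.jar lets an EMR step invoke spark-submit
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit",
                    "--jars", "s3://your-bucket/path-to-your-library.jar",
                    "s3://your-bucket/your_script.py",
                ],
            },
        }],
    )
    print(response["StepIds"])  # keep these to poll step status later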

  2. Ensure the function passed to mapPartitions is serializable by wrapping the Java object in a Python callable:
    Functions passed to mapPartitions must be serializable for PySpark to execute them in a distributed context. Instead of handing the Java object to mapPartitions directly, wrap it in a Python callable class:
    from pyspark import SparkContext

    # Callable Python wrapper for a Java object. Defining __call__ lets an
    # instance of this class be passed wherever mapPartitions expects a
    # function of one argument (an iterator over a partition).
    class JavaFunctionWrapper:
        def __init__(self, java_object):
            self.java_object = java_object

        def __call__(self, partition):
            # someMethod is a placeholder for the method your class exposes
            return [self.java_object.someMethod(x) for x in partition]

    sc = SparkContext()
    rdd = sc.parallelize([1, 2, 3, 4], 2)
    # If the JAR from step 1 is missing, the next line yields a JavaPackage
    # and calling it reproduces the original error.
    java_obj = sc._jvm.com.example.MyJavaClass()  # Replace with your Java class
    wrapper = JavaFunctionWrapper(java_obj)
    result = rdd.mapPartitions(wrapper).collect()
    print(result)
    This wraps the Java interaction in a callable Python object, which is the shape mapPartitions expects.

  3. Submit the updated job to EMR and validate the output:
    Deploy your updated code to EMR. Use the appropriate spark-submit configuration and verify the job runs successfully by checking the EMR step logs and CloudWatch for errors; a sketch of checking step status programmatically follows below.
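To verify the step programmatically rather than through the console, you can poll the EMR API with boto3 (a minimal sketch; the cluster and step IDs are placeholders). Failure details also land in the cluster's S3 log URI and CloudWatch:

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")  # assumed region

    step = emr.describe_step(
        ClusterId="j-XXXXXXXXXXXXX",  # your cluster ID
        StepId="s-XXXXXXXXXXXXX",     # e.g. from add_job_flow_steps above
    )["Step"]

    state = step["Status"]["State"]  # PENDING, RUNNING, COMPLETED, FAILED, ...
    print(f"Step {step['Name']} is {state}")
    if state == "FAILED":
        # FailureDetails includes the log path with the Python traceback
        print(step["Status"].get("FailureDetails", {}))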

Closing Thoughts

This issue highlights the importance of ensuring compatibility between Java objects and Python functions in distributed environments like EMR. By wrapping the JavaPackage object in a callable and verifying dependency configurations, you should be able to resolve the issue and get your job running successfully.

Here are some helpful documentation links:

  • Amazon EMR documentation: https://docs.aws.amazon.com/emr/
  • PySpark API reference: https://spark.apache.org/docs/latest/api/python/


Farewell

I hope this helps you resolve the issue, Saloni! If you run into further trouble or need clarification, feel free to ask. Best of luck with your PySpark job on EMR—I know you’ll nail it! 🚀😊


Cheers,

Aaron 😊

answered 10 months ago
  • Hi Aaron

    Thank you so much for the response. I will try the steps you have suggested and will reply back here how it goes.
