Issues running PySpark on AWS Lambda

0

I know the recommended approach is to use EMR or EMR Serverless. However, I have a use case where I only need to run a fairly small PySpark job and need results quickly. I already have the job working on EMR Serverless, and I want to configure a Lambda function to do the same thing. I've used the Spark on AWS Lambda example code as a guide, and I'm trying to make my Lambda compatible with EMR/EMR Serverless 7.1.0 by using the same library versions as specified here.

I've been able to create a container-image Lambda and have resolved most of the runtime issues, but I'm now hitting a "too many open files" error:

2024-06-07T17:47:36.185Z	ERROR StatusLogger Error creating converter for d
2024-06-07T17:47:36.185Z	java.lang.reflect.InvocationTargetException
2024-06-07T17:47:36.185Z	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
2024-06-07T17:47:36.185Z	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
2024-06-07T17:47:36.185Z	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
2024-06-07T17:47:36.185Z	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
2024-06-07T17:47:36.185Z	at org.apache.logging.log4j.core.pattern.PatternParser.createConverter(PatternParser.java:590)
2024-06-07T17:47:36.185Z	at org.apache.logging.log4j.core.pattern.PatternParser.finalizeConverter(PatternParser.java:657)
2024-06-07T17:47:36.185Z	at org.apache.logging.log4j.core.pattern.PatternParser.parse(PatternParser.java:420)
2024-06-07T17:47:36.185Z	at org.apache.logging.log4j.core.pattern.PatternParser.parse(PatternParser.java:177)
2024-06-07T17:47:36.185Z	at org.apache.logging.log4j.core.layout.PatternLayout$SerializerBuilder.build(PatternLayout.java:473)
2024-06-07T17:47:36.185Z	at org.apache.logging.log4j.core.layout.PatternLayout.<init>(PatternLayout.java:139)
2024-06-07T17:47:36.185Z	at org.apache.logging.log4j.core.layout.PatternLayout.<init>(PatternLayout.java:60)
2024-06-07T17:47:36.185Z	at org.apache.logging.log4j.core.layout.PatternLayout$Builder.build(PatternLayout.java:766)
2024-06-07T17:47:36.185Z	at org.apache.logging.log4j.core.config.AbstractConfiguration.setToDefault(AbstractConfiguration.java:745)
2024-06-07T17:47:36.185Z	at org.apache.logging.log4j.core.config.DefaultConfiguration.<init>(DefaultConfiguration.java:47)
2024-06-07T17:47:36.185Z	at org.apache.logging.log4j.core.LoggerContext.<init>(LoggerContext.java:84)
2024-06-07T17:47:36.185Z	at org.apache.logging.log4j.core.selector.ClassLoaderContextSelector.createContext(ClassLoaderContextSelector.java:254)
2024-06-07T17:47:36.185Z	at org.apache.logging.log4j.core.selector.ClassLoaderContextSelector.locateContext(ClassLoaderContextSelector.java:218)
2024-06-07T17:47:36.185Z	at org.apache.logging.log4j.core.selector.ClassLoaderContextSelector.getContext(ClassLoaderContextSelector.java:140)
2024-06-07T17:47:36.185Z	at org.apache.logging.log4j.core.selector.ClassLoaderContextSelector.getContext(ClassLoaderContextSelector.java:123)
2024-06-07T17:47:36.185Z	at org.apache.logging.log4j.core.impl.Log4jContextFactory.getContext(Log4jContextFactory.java:230)
2024-06-07T17:47:36.185Z	at org.apache.logging.log4j.core.impl.Log4jContextFactory.getContext(Log4jContextFactory.java:47)
2024-06-07T17:47:36.185Z	at org.apache.logging.log4j.LogManager.getContext(LogManager.java:176)
2024-06-07T17:47:36.185Z	at org.apache.logging.log4j.LogManager.getLogger(LogManager.java:666)
2024-06-07T17:47:36.185Z	at org.apache.logging.log4j.LogManager.getRootLogger(LogManager.java:700)
2024-06-07T17:47:36.185Z	at org.apache.spark.internal.Logging.initializeLogging(Logging.scala:129)
2024-06-07T17:47:36.185Z	at org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:114)
2024-06-07T17:47:36.185Z	at org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:108)
2024-06-07T17:47:36.185Z	at org.apache.spark.deploy.SparkSubmit.initializeLogIfNecessary(SparkSubmit.scala:76)
2024-06-07T17:47:36.185Z	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:84)
2024-06-07T17:47:36.185Z	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1120)
2024-06-07T17:47:36.185Z	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1129)
2024-06-07T17:47:36.185Z	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
2024-06-07T17:47:36.185Z	Caused by: java.lang.Error: java.io.FileNotFoundException: /usr/lib/jvm/java-17-amazon-corretto.x86_64/lib/tzdb.dat (Too many open files)

I think the problem is caused by my including both version 2 and version 1 of the AWS SDK for Java, as listed in the documentation above:

2.23.18, 1.12.656

This results in over a thousand jar files in my PySpark jar directory. I would like to cull the list, but working out which jar files are actually needed would be tedious. Unfortunately, the EMR/EMR Serverless code is not public, so I don't know exactly which libraries are required. Does anyone know which libraries are needed, or how I can limit the jar files I include? I cannot raise the open file descriptor limit because Lambda caps it at 1024.

Or is there another issue I should know about?
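To put numbers on the jar count and the descriptor limit described above, a small diagnostic can be logged from the handler before Spark starts. This is only a sketch: `count_jars` and `fd_usage` are hypothetical helper names, the jar directory path is whatever your image uses, and the descriptor count covers the Python process only (the JVM that Spark launches is a separate child process with its own open files).

```python
import os
import resource
from pathlib import Path


def count_jars(jar_dir):
    """Return the number of .jar files directly inside jar_dir."""
    return sum(1 for p in Path(jar_dir).glob("*.jar") if p.is_file())


def fd_usage():
    """Return (open_fds, soft_limit, hard_limit) for the current process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    try:
        # /proc/self/fd is Linux-specific; each entry is one open descriptor.
        open_fds = len(os.listdir("/proc/self/fd"))
    except OSError:
        # Not on Linux (or /proc unavailable): report -1 instead of failing.
        open_fds = -1
    return open_fds, soft, hard
```

Logging `count_jars(...)` for the PySpark jar directory next to the soft limit from `fd_usage()` shows how close the classpath alone comes to the limit, which helps decide how aggressively the jar list has to be trimmed.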

asked 2 years ago, 1.4K views
2 Answers
1

Hello,

AFAIK, you cannot reproduce EMR Spark functionality in Lambda, because EMR Spark has its own customizations that are compatible only with EMR-flavored services. The spark-on-aws-lambda example is not meant for EMR Spark. However, PySpark code that runs in Lambda will also work in EMR. An example is mentioned here.

AWS
SUPPORT ENGINEER
answered 2 years ago
  • Hi Yokesh, I understand that I cannot reproduce EMR Spark functionality on Lambda. My original aim was simply to get PySpark running on Lambda. My problem is the "too many open files" error mentioned above, which could be caused by the Delta Lake and other libraries I added. Is there any way to overcome it?

0

I am trying to run PySpark on Lambda and I am getting the following error: "[JAVA_GATEWAY_EXITED] Java gateway process exited before sending its port number." It works fine locally with Docker when I invoke it with curl "http://localhost:9000/2015-03-31/functions/function/invocations" -d '{}', as described in the documentation: https://docs.aws.amazon.com/lambda/latest/dg/python-image.html

I didn't follow the SoAL (Spark on AWS Lambda) solution. My Dockerfile is:

FROM public.ecr.aws/lambda/python:3.13

# Install java 21
RUN dnf install -y java-21-amazon-corretto-headless && rm -rf /var/cache/dnf

# set JAVA_HOME and update PATH
ENV JAVA_HOME=/usr/lib/jvm/java-21-amazon-corretto.x86_64
ENV PATH="$JAVA_HOME/bin:$PATH"

# Install pyspark
RUN pip install pyspark

# Copy function code
COPY lambda_function.py ${LAMBDA_TASK_ROOT}

# Set the CMD to your handler (could also be done as a parameter override outside of the Dockerfile)
CMD [ "lambda_function.handler" ]

In lambda_function.handler I create a Spark session using SparkSession.builder.getOrCreate().

Any idea why I am getting this error? As mentioned, when I run the code locally in a Docker container and invoke it with curl "http://localhost:9000/2015-03-31/functions/function/invocations" -d '{}', it works fine.
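One difference between a local Docker run and the deployed function is that the Lambda filesystem is read-only outside /tmp, so a JVM or Spark component that tries to write elsewhere can fail at startup. The sketch below assumes that this, or hostname resolution for the driver, is the cause, which the error message alone does not confirm. It pins Spark's scratch directories to /tmp and binds the driver to loopback. `lambda_spark_conf` and `build_session` are hypothetical helper names; the configuration keys themselves are standard Spark settings.

```python
import os


def lambda_spark_conf(tmp_dir="/tmp"):
    """Spark settings that keep writes under tmp_dir and bind the driver to loopback."""
    return {
        "spark.master": "local[1]",
        "spark.driver.bindAddress": "127.0.0.1",
        "spark.driver.host": "127.0.0.1",
        "spark.ui.enabled": "false",
        "spark.local.dir": os.path.join(tmp_dir, "spark-local"),
        "spark.sql.warehouse.dir": os.path.join(tmp_dir, "spark-warehouse"),
    }


def build_session(conf=None):
    """Create a SparkSession from the settings above (needs pyspark and a JVM)."""
    # Imported lazily so this module can be loaded where pyspark is not installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName("lambda-pyspark")
    for key, value in (conf or lambda_spark_conf()).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

In the handler, `build_session()` would take the place of the bare `SparkSession.builder.getOrCreate()` call. If the error persists, the JVM's own stderr in the CloudWatch logs usually contains the actual startup failure that the gateway message hides.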

answered 10 months ago
