
Questions tagged with Amazon EMR

Sort by most recent
  • 1
  • 90 / page

Browse through the questions and answers listed below or filter and sort to narrow down your results.

PySpark job fails on EMR on EKS virtual cluster: java.lang.ClassCastException

Hi, we are in the process of migrating our PySpark jobs from EMR classic (EC2-based) to an EMR on EKS virtual cluster. We have come across a strange failure in one job where we read some Avro data from S3 and save it straight back in Parquet format. Example code:

```
df = spark.read.format("avro").load(input_path)
df \
    .withColumnRenamed("my_col", "my_new_col") \
    .repartition(60) \
    .write \
    .mode("append") \
    .partitionBy("my_new_col", "date") \
    .format("parquet") \
    .option("compression", "gzip") \
    .save(output_path)
```

This fails at the `.save()` call with the following message (we can tell from the Python traceback, not included here for brevity):

> Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 17) (10.0.3.174 executor 4): java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.dataReader$1 of type scala.Function1 in instance of org.apache.spark.sql.execution.datasources.FileFormat$$anon$1

We are running this with `--packages org.apache.spark:spark-avro_2.12:3.1.1` in sparkSubmitParameters. The exact same code ran fine on a normal EMR cluster. Comparing the environments, both have Spark 3.1.1 and Scala 2.12.10; only the Java version differs: 1.8.0_332 (EMR classic) vs. 1.8.0_302 (EMR on EKS).

We should also mention that we were able to run another job successfully on EMR on EKS; that job doesn't have this Avro-to-Parquet step (its input is already in Parquet format). So we suspect it has something to do with the extra org.apache.spark:spark-avro_2.12:3.1.1 package we are importing.

We searched the web for the java.lang.ClassCastException and found a couple of issues [here](https://issues.apache.org/jira/browse/SPARK-29497) and [here](https://issues.apache.org/jira/browse/SPARK-25047), but they are not particularly helpful to us since our code is in Python.

Any hints as to what might be the cause?

Thanks and regards,
Nikos
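For what it's worth, one variant we still want to try, since this kind of SerializedLambda ClassCastException can indicate that the executors load a different connector jar than the one the driver resolved: pinning a single pre-staged jar on both sides via `spark.jars` instead of `--packages`. A minimal sketch, assuming the jar has been copied to S3 beforehand (the S3 path is a placeholder, not our real bucket):

```
# Sketch of a workaround, not a confirmed fix: distribute one known spark-avro jar
# to driver and executors explicitly. The S3 path is a hypothetical placeholder;
# the version must match the cluster build (Spark 3.1.1 / Scala 2.12 in our case).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("avro-to-parquet")
    .config("spark.jars", "s3://my-bucket/jars/spark-avro_2.12-3.1.1.jar")
    .getOrCreate()
)
```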
1 answer · 0 votes · 8 views · asked 19 hours ago

EMR Studio PySpark kernel uses an older version of pip

I am using a Jupyter notebook provided by the AWS managed service EMR Studio. My understanding of how these notebooks work is that they are hosted on EC2 instances that I provision as part of my EMR cluster, specifically on the task nodes for the PySpark kernel.

Currently, when I run the command `sc.list_packages()` I see that pip is at version 9.0.1, whereas if I SSH onto the main node and run `pip list` I see that pip is at version 20.2.2. I have issues running the command `sc.install_pypi_package()` due to the older pip version in the notebook.

In a notebook cell, if I run `import pip` and then `pip`, I see that the module is located at:

```
<module 'pip' from '/mnt1/yarn/usercache/<LIVY_IMPERSONATION_ROLE>/appcache/application_1652110228490_0001/container_1652110228490_0001_01_000001/tmp/1652113783466-0/lib/python3.7/site-packages/pip/__init__.py'>
```

I am assuming this is most likely within a virtualenv of some sort running as an application on the task node? I am unsure of this and have no concrete evidence of how the virtualenv is provisioned, if there is one.

If I run `sc.uninstall_package('pip')` and then `sc.list_packages()`, I see pip at version 20.2.2, which is what I am looking to start off with; the module path is the same as previously mentioned. How can I get pip 20.2.2 in the virtualenv instead of pip 9.0.1?

If I import a package like numpy, I see that the module is located somewhere different from where pip is. Any reason for this?

```
<module 'numpy' from '/usr/local/lib64/python3.7/site-packages/numpy/__init__.py'>
```

As for pip 9.0.1, the only reference I can find at the moment is in `/lib/python2.7/site-packages/virtualenv_support/pip-9.0.1-py2.py3-none-any.whl`. One directory outside of this I see a file called `virtualenv-15.1.0-py2.7.egg-info` which, if I `cat` it, states that it upgrades to pip 9.0.1. I have tried removing the pip 9.0.1 wheel file and replacing it with a pip 20.2.2 wheel, which caused issues with the PySpark kernel being able to provision properly. There is also a `virtualenv.py` file which references `__version__ = "15.1.0"`.

Lastly, I have noticed in this AWS blog post a picture which shows pip at version 19.2.3 (below the console output for the command `sc.list_packages()`), but I am not sure how that was achieved: https://aws.amazon.com/blogs/big-data/install-python-libraries-on-a-running-cluster-with-emr-notebooks/
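For reference, this is the diagnostic cell I run on the PySpark kernel to see which environment each module resolves from (nothing EMR-specific, just standard introspection):

```
# Compare where the interpreter, pip, and numpy resolve from: pip comes from the
# per-application virtualenv, while numpy comes from system site-packages.
import sys
import pip
import numpy

print(sys.executable)                      # interpreter backing this kernel
print(pip.__version__, pip.__file__)       # 9.0.1, inside the YARN appcache virtualenv
print(numpy.__version__, numpy.__file__)   # /usr/local/lib64 system site-packages
```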
0 answers · 0 votes · 4 views · asked 11 days ago

mount_workspace_dir notebook magic not working in EMR Studio

In an EMR Studio Python3 notebook, I execute the following:

```
%mount_workspace_dir .
```

and receive the following error:

```
UsageError: Line magic function `%mount_workspace_dir` not found.
```

I set up the EMR cluster for Studio using a CloudFormation template that is accessible to Studio via Service Catalog. The CloudFormation template specifies a bootstrap script that installs s3fs-fuse. The template also specifies a step, executed when the cluster launches, that installs emr-notebooks-magics using pip. When the cluster launches, I execute the above `%mount_workspace_dir` command and receive the indicated error. I also tried restarting the kernel using the Kernel -> Restart Kernel option from the menu.

Here is the CloudFormation template (with substitutions for subnet and bucket names):

```
---
AWSTemplateFormatVersion: 2010-09-09
Parameters:
  SubnetId:
    Type: "String"
Resources:
  EmrCluster:
    Type: AWS::EMR::Cluster
    Properties:
      Applications:
        - Name: Spark
        - Name: Livy
        - Name: JupyterEnterpriseGateway
        - Name: Hive
        - Name: Presto
      EbsRootVolumeSize: '50'
      Name: !Join ['-', ['emr-studio-', !Select [4, !Split ['-', !Select [2, !Split ['/', !Ref AWS::StackId]]]]]]
      JobFlowRole: emr-studio-instance-role
      ServiceRole: EMR_DefaultRole
      ReleaseLabel: "emr-6.3.0"
      VisibleToAllUsers: true
      LogUri:
        Fn::Sub: 's3://<my-bucket>/'
      Instances:
        TerminationProtected: false
        Ec2SubnetId: '<my-subnet>'
        MasterInstanceGroup:
          InstanceCount: 1
          InstanceType: "m5.xlarge"
      BootstrapActions:
        - Name: Auto-Termination
          ScriptBootstrapAction:
            Path: "s3://<my-bucket>/scripts/bootstrap-actions/install-s3fs-fuse.sh"
      Steps:
        - Name: Enable-Notebooks-Magics
          ActionOnFailure: CONTINUE
          HadoopJarStep:
            Jar: command-runner.jar
            Args:
              - "sudo"
              - "/mnt/notebook-env/bin/pip"
              - "install"
              - "emr-notebooks-magics"
Outputs:
  ClusterId:
    Value:
      Ref: EmrCluster
    Description: The ID of the EMR Cluster
```

Here is the content of the install-s3fs-fuse.sh script:

```
sudo amazon-linux-extras install epel -y
sudo yum install s3fs-fuse -y
```

I also tried with EMR 6.5.0. Is there a step that I'm missing?
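In case it helps, here is the diagnostic cell I run to check whether the magic ever gets registered with the kernel (the module name `emr_notebooks_magics` is my assumption, derived from the pip package name):

```
# Check whether the emr-notebooks-magics package is importable from this kernel and
# whether the line magic is registered with IPython. The module name is assumed
# from the pip package name (emr-notebooks-magics -> emr_notebooks_magics).
import importlib.util
from IPython import get_ipython

print(importlib.util.find_spec("emr_notebooks_magics"))  # None -> not on this kernel's path
ip = get_ipython()
print("mount_workspace_dir" in ip.magics_manager.magics["line"])  # False -> magic not registered
```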
0 answers · 0 votes · 1 view · asked 2 months ago

Benefits of S3 cross-region access with VPC-peered interface endpoints vs. public internet using NAT gateways?

My team is looking to set up EMR clusters in private VPCs in all regions while keeping our main storage in S3 buckets in us-east-1. We will need cross-region access to S3 and have been looking at different ways of accomplishing it. We have considered two approaches:

1. Setting up isolated VPCs with no internet access: one in us-east-1 for the S3 bucket access and one in every region to launch our EMR clusters in. We will peer each of the VPCs with the one in us-east-1 and then set up an interface endpoint in the us-east-1 VPC to allow S3 access through the interface endpoint over VPC peering. This utilizes AWS PrivateLink.
2. Setting up a private VPC with an internet gateway and NAT gateways in public subnets while launching EMR clusters in the private subnets. We will access S3 across regions through the public internet.

For both solutions, we will utilize gateway endpoints when the compute and storage are in the same region, as we found this should yield the same benefits as interface endpoints with no additional cost.

Through my research, I have found that AWS PrivateLink is more secure due to no public internet usage and has a significant latency advantage of up to 70% according to this experiment: https://blogs.vmware.com/security/2020/03/performance-testing-justifying-cost-and-performance-improvements-part-2.html

I am wondering if we will still see this latency benefit if we are using VPC peering, or if it would be better to go with the internet route.
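For concreteness, this is roughly how our clients would address the central bucket under option 1, pinning the S3 client to the us-east-1 interface endpoint reached over the peering connection (a minimal sketch; the endpoint DNS name and bucket name are placeholders):

```
# Sketch for option 1: point boto3 at the us-east-1 S3 interface endpoint.
# The vpce DNS name and bucket below are hypothetical placeholders.
import boto3

s3 = boto3.client(
    "s3",
    region_name="us-east-1",
    endpoint_url="https://bucket.vpce-0123456789abcdef0-abcdefgh.s3.us-east-1.vpce.amazonaws.com",
)
resp = s3.list_objects_v2(Bucket="my-central-bucket", MaxKeys=1)  # sanity check
print(resp["KeyCount"])
```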
2 answers · 1 vote · 12 views · asked 2 months ago

EMR bootstrap script with pip numpy installation fails on r6+ instances

I recently tested moving from r5 to r6 instance fleets for our PySpark script. It has dependencies on numpy and pandas that are installed via pip in a bootstrap script, along with a few other dependencies for communicating with S3:

```
#!/bin/bash -xe
echo "---------------------------------------------------------"
echo "using python version:"
python3 --version
echo "initial python packages (sudo python3 -m pip list):"
sudo python3 -m pip list
echo "---------------------------------------------------------"
echo "install python3-dev development tools"
sudo yum -y install python3-devel
echo "---------------------------------------------------------"
echo "installing python dependencies"
sudo python3 -m pip install -U pip
echo "pip installed/updated"
sudo python3 -m pip install -U setuptools
echo "setuptools installed"
sudo python3 -m pip install \
    cloudpickle==1.6.0 \
    boto3==1.21.7 \
    fsspec==2022.2.0 \
    s3fs==0.4.2
echo "aws dependencies installed (boto3, cloudpickle, fsspec, s3fs)"
# sudo python3 -m pip install \
#     pandas==1.1.5 \
#     numpy==1.16.5
# echo "pandas + numpy installed"
sudo python3 -m pip install pandas==1.2.5
echo "pandas installed"
echo "final python packages (sudo python3 -m pip list):"
sudo python3 -m pip list
```

This runs without failure on r5 instances, and numpy is available in the Python environment as expected. When allowing (r6, r6g) instance types, the bootstrap script fails with the following message:

```
_configtest.c:1:10: fatal error: Python.h: No such file or directory
 #include <Python.h>
          ^~~~~~~~~~
compilation terminated.
failure.
removing: _configtest.c _configtest.o
Traceback (most recent call last):
  File "<string>", line 36, in <module>
  File "<pip-setuptools-caller>", line 34, in <module>
  File "/mnt/tmp/pip-install-np4ypx_v/numpy_833906a79d1d4dfeb4789de3524fd268/setup.py", line 419, in <module>
    setup_package()
  File "/mnt/tmp/pip-install-np4ypx_v/numpy_833906a79d1d4dfeb4789de3524fd268/setup.py", line 411, in setup_package
    setup(**metadata)
  File "/mnt/tmp/pip-install-np4ypx_v/numpy_833906a79d1d4dfeb4789de3524fd268/numpy/distutils/core.py", line 171, in setup
    return old_setup(**new_attr)
  File "/usr/local/lib/python3.7/site-packages/setuptools/__init__.py", line 155, in setup
    return distutils.core.setup(**attrs)
  File "/usr/local/lib/python3.7/site-packages/setuptools/_distutils/core.py", line 148, in setup
    return run_commands(dist)
  File "/usr/local/lib/python3.7/site-packages/setuptools/_distutils/core.py", line 163, in run_commands
    dist.run_commands()
  File "/usr/local/lib/python3.7/site-packages/setuptools/_distutils/dist.py", line 967, in run_commands
    self.run_command(cmd)
  File "/usr/local/lib/python3.7/site-packages/setuptools/_distutils/dist.py", line 986, in run_command
    cmd_obj.run()
  File "/mnt/tmp/pip-install-np4ypx_v/numpy_833906a79d1d4dfeb4789de3524fd268/numpy/distutils/command/install.py", line 62, in run
    r = self.setuptools_run()
  File "/mnt/tmp/pip-install-np4ypx_v/numpy_833906a79d1d4dfeb4789de3524fd268/numpy/distutils/command/install.py", line 36, in setuptools_run
    return distutils_install.run(self)
  File "/usr/local/lib/python3.7/site-packages/setuptools/_distutils/command/install.py", line 670, in run
    self.run_command('build')
  File "/usr/local/lib/python3.7/site-packages/setuptools/_distutils/cmd.py", line 313, in run_command
    self.distribution.run_command(command)
  File "/usr/local/lib/python3.7/site-packages/setuptools/_distutils/dist.py", line 986, in run_command
    cmd_obj.run()
  File "/mnt/tmp/pip-install-np4ypx_v/numpy_833906a79d1d4dfeb4789de3524fd268/numpy/distutils/command/build.py", line 47, in run
    old_build.run(self)
  File "/usr/local/lib/python3.7/site-packages/setuptools/_distutils/command/build.py", line 135, in run
    self.run_command(cmd_name)
  File "/usr/local/lib/python3.7/site-packages/setuptools/_distutils/cmd.py", line 313, in run_command
    self.distribution.run_command(command)
  File "/usr/local/lib/python3.7/site-packages/setuptools/_distutils/dist.py", line 986, in run_command
    cmd_obj.run()
  File "/mnt/tmp/pip-install-np4ypx_v/numpy_833906a79d1d4dfeb4789de3524fd268/numpy/distutils/command/build_src.py", line 148, in run
    self.build_sources()
  File "/mnt/tmp/pip-install-np4ypx_v/numpy_833906a79d1d4dfeb4789de3524fd268/numpy/distutils/command/build_src.py", line 165, in build_sources
    self.build_extension_sources(ext)
  File "/mnt/tmp/pip-install-np4ypx_v/numpy_833906a79d1d4dfeb4789de3524fd268/numpy/distutils/command/build_src.py", line 322, in build_extension_sources
    sources = self.generate_sources(sources, ext)
  File "/mnt/tmp/pip-install-np4ypx_v/numpy_833906a79d1d4dfeb4789de3524fd268/numpy/distutils/command/build_src.py", line 375, in generate_sources
    source = func(extension, build_dir)
  File "numpy/core/setup.py", line 423, in generate_config_h
    moredefs, ignored = cocache.check_types(config_cmd, ext, build_dir)
  File "numpy/core/setup.py", line 47, in check_types
    out = check_types(*a, **kw)
  File "numpy/core/setup.py", line 281, in check_types
    "install {0}-dev|{0}-devel.".format(python))
SystemError: Cannot compile 'Python.h'. Perhaps you need to install python-dev|python-devel.
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: legacy-install-failure

× Encountered error while trying to install package.
╰─> numpy

note: This is an issue with the package mentioned above, not pip.
hint: See above for output from the failure.
```

Note: this bootstrap script already attempts to address the problem from the error message by installing python3-devel via yum before running the pip install.
1 answer · 0 votes · 9 views · asked 3 months ago

Authentication error with SAML + EMR + Lake Formation

I get an error when I try to log in with an IdP (Auth0) and EMR integrated with Lake Formation. I'm following the workshop [Lake Formation & EMR integration](https://catalog.us-east-1.prod.workshops.aws/workshops/78572df7-d2ee-4f78-b698-7cafdb55135d/en-US/emr-integration). I have configured an Auth0 account, an AWS IdP, an EMR cluster (AWS service), and data lake permissions (IdP users). But I get an error when I log in through [EMR Zeppelin](https://EMRMasterNodeDNS:8442/gateway/default/zeppelin/): the login with Auth0 and EMR succeeds, but I can't log in to Lake Formation. This is the error from the EMR proxy agent:

```
Caused by: java.lang.NullPointerException
    at org.apache.knox.gateway.util.SamlUtils.getSamlAwsRoleAttributeValues(SamlUtils.java:149)
    at org.apache.knox.gateway.pac4j.aws.AwsLakeFormationSamlImpl.getAwsCredentials(AwsLakeFormationSamlImpl.java:106)
    at org.apache.knox.gateway.pac4j.aws.AwsSamlHandler.processSamlResponse(AwsSamlHandler.java:78)
    at org.apache.knox.gateway.pac4j.filter.Pac4jDispatcherFilter.doFilter(Pac4jDispatcherFilter.java:234)
    at org.apache.knox.gateway.GatewayFilter$Holder.doFilter(GatewayFilter.java:372)
    at org.apache.knox.gateway.GatewayFilter$Chain.doFilter(GatewayFilter.java:272)
    at org.apache.knox.gateway.filter.XForwardedHeaderFilter.doFilter(XForwardedHeaderFilter.java:30)
    at org.apache.knox.gateway.filter.AbstractGatewayFilter.doFilter(AbstractGatewayFilter.java:61)
```

I think I need to do step 6 in the [Amazon EMR documentation](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-lf-federation.html), but I don't know where I have to do this configuration. Any help? Thank you.
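In case it helps narrow things down, this is a check I put together to inspect the SAML assertion Auth0 sends (the file name is a placeholder; the base64 SAMLResponse can be captured from the browser's POST to the Knox gateway):

```
# Hypothetical check: decode a captured SAMLResponse and list the assertion's
# attribute names. Since the NullPointerException comes from
# SamlUtils.getSamlAwsRoleAttributeValues, a missing AWS role attribute
# (https://aws.amazon.com/SAML/Attributes/Role) would be consistent with it.
import base64
import xml.etree.ElementTree as ET

with open("saml_response.b64") as f:   # placeholder: captured base64 payload
    saml_xml = base64.b64decode(f.read())

root = ET.fromstring(saml_xml)
for attr in root.iter("{urn:oasis:names:tc:SAML:2.0:assertion}Attribute"):
    print(attr.get("Name"))            # expect the .../SAML/Attributes/Role entry
```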
0 answers · 0 votes · 1 view · asked 3 months ago