
Questions tagged with Amazon Elastic MapReduce


Can Data Pipelines be used for running Spark jobs on EMR 6.5.0?

Hi, I have a problem. I make heavy use of EMR, and I orchestrate it with Data Pipeline: multiple daily runs are automated, and EMR clusters are launched and terminated on conclusion. However, I'd now like to use EMR 6.X.X releases via Data Pipeline rather than the EMR 5.X.X releases I'm currently using. This is for two main reasons:

* Security compliance: the latest EMR 6.X.X releases have fewer vulnerabilities than the latest EMR 5.X.X releases.
* Performance/functionality: EMR 6.X.X releases perform much better than EMR 5.X.X releases for what I'm doing, and have functionality I prefer to use.

However, the current documentation for Data Pipeline says the following regarding EMR versions:

> AWS Data Pipeline only supports release version 6.1.0 (emr-6.1.0).

Version 6.1.0 of EMR was last updated on Oct 15, 2020; it's pretty old. Now, if I try to use an EMR version > 6.1.0 with Data Pipeline, I hit the issue that has already been raised [here](https://forums.aws.amazon.com/thread.jspa?threadID=346359): during initial EMR cluster bring-up via the Data Pipeline, there's a failure that renders the cluster unusable. It looks like a malformed attempt by one of the AWS scripts to create a symbolic link to a jar:

```
++ find /usr/lib/hive/lib/ -name 'opencsv*jar'
+ open_csv_jar='/usr/lib/hive/lib/opencsv-2.3.jar /usr/lib/hive/lib/opencsv-3.9.jar'
+ sudo ln -s /usr/share/aws/emr/goodies/lib/emr-hadoop-goodies-3.2.0.jar /mnt/taskRunner/oncluster-emr-hadoop-goodies.jar
+ sudo ln -s /usr/share/aws/emr/goodies/lib/emr-hive-goodies-3.2.0.jar /mnt/taskRunner/oncluster-emr-hive-goodies.jar
+ sudo ln -s /usr/lib/hive/lib/opencsv-2.3.jar /usr/lib/hive/lib/opencsv-3.9.jar /mnt/taskRunner/open-csv.jar
ln: target ‘/mnt/taskRunner/open-csv.jar’ is not a directory
Command exiting with ret '1'
```

So, I guess my questions are:

1. Is there a way to work around the above so that Data Pipeline can be used to launch EMR 6.5.0 clusters for Spark jobs?
2. If there isn't, is there a different way of automating runs of EMR 6.5.0 clusters, other than writing my own script and scheduling *that* to bring up the EMR cluster and add the required jobs/steps?

Thanks.
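The `ln` failure in the log above happens because `find` matched two opencsv jars, so the unquoted variable expands into two source paths and `ln` then treats the final path as a target directory. Any workaround has to make the jar selection unambiguous; a minimal sketch of that selection logic (the helper name is mine, paths taken from the log):

```python
import re

def pick_newest_jar(candidates):
    """Given several opencsv-<version>.jar paths, return the one with the
    highest version number, so a single unambiguous path can be linked."""
    def version_key(path):
        m = re.search(r"opencsv-([\d.]+)\.jar$", path)
        return tuple(int(p) for p in m.group(1).split(".")) if m else ()
    return max(candidates, key=version_key)

# Example with the two jars from the failing log:
jars = ["/usr/lib/hive/lib/opencsv-2.3.jar", "/usr/lib/hive/lib/opencsv-3.9.jar"]
print(pick_newest_jar(jars))
```

Whether this can be wired in before the AWS-managed script runs (e.g. removing the extra jar in a bootstrap action) is exactly the open question, so treat this only as an illustration of the root cause.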
0 answers · 0 votes · 19 views · asked 3 months ago

metadata service is unstable: connection timeout, Failed to connect to service endpoint etc

Starting recently, our long-running jobs have been hitting instance metadata issues frequently. The exceptions vary, but they all point to the EC2 metadata service: either a failure to connect to the endpoint, a timeout connecting to the service, or a complaint that I need to specify the region while building the client. The job runs on EMR 6.0.0 in Tokyo with the correct role set, and it had been running fine for months before it became unstable. So my question is: how can we monitor the health of the metadata service (request QPS, success rate, etc.)? A few call stacks:

```
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SdkClientException: Unable to load AWS credentials from any provider in the chain: [com.amazon.ws.emr.hadoop.fs.guice.UserGroupMappingAWSSessionCredentialsProvider@4a27ee0d: null, com.amazon.ws.emr.hadoop.fs.HadoopConfigurationAWSCredentialsProvider@76659c17: null, com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.auth.InstanceProfileCredentialsProvider@5c05c23d: Failed to connect to service endpoint: ]
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:136)
```

```
com.amazonaws.SdkClientException: Unable to find a region via the region provider chain. Must provide an explicit region in the builder or setup environment to supply a region.
    at com.amazonaws.client.builder.AwsClientBuilder.setRegion(AwsClientBuilder.java:462)
    at com.amazonaws.client.builder.AwsClientBuilder.configureMutableProperties(AwsClientBuilder.java:424)
    at com.amazonaws.client.builder.AwsSyncClientBuilder.build(AwsSyncClientBuilder.java:46)
```

```
com.amazonaws.SdkClientException: Unable to execute HTTP request: mybucket.s3.ap-northeast-1.amazonaws.com
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleRetryableException(AmazonHttpClient.java:1189) ~[aws-java-sdk-bundle-1.11.711.jar:?]
Caused by: java.net.UnknownHostException: mybucket.s3.ap-northeast-1.amazonaws.com
    at java.net.InetAddress.getAllByName0(InetAddress.java:1281) ~[?:1.8.0_242]
    at java.net.InetAddress.getAllByName(InetAddress.java:1193) ~[?:1.8.0_242]
    at java.net.InetAddress.getAllByName(InetAddress.java:1127) ~[?:1.8.0_242]
```

```
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SdkClientException: Failed to connect to service endpoint:
Caused by: java.net.SocketTimeoutException: connect timed out
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
```
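There is no built-in per-instance metric for metadata-service success rate, so one way to approximate "request QPS, success rate, etc." is a small sidecar probe that hits the IMDS endpoint periodically and tracks a rolling success rate. A sketch, with the actual HTTP probe injected as a callable so the bookkeeping is shown on its own (in production the prober might GET `http://169.254.169.254/latest/meta-data/` with a short timeout; the class name and window size are my own choices):

```python
from collections import deque

class ImdsMonitor:
    """Track a rolling success rate for a metadata-service prober.

    `prober` is any callable returning truthy on success; exceptions
    count as failures, so a requests/urllib call can be dropped in.
    """
    def __init__(self, prober, window=100):
        self.prober = prober
        self.results = deque(maxlen=window)  # rolling window of bools

    def probe_once(self):
        try:
            ok = bool(self.prober())
        except Exception:
            ok = False
        self.results.append(ok)
        return ok

    def success_rate(self):
        """Fraction of successful probes in the window, or None if empty."""
        return sum(self.results) / len(self.results) if self.results else None
```

Emitting `success_rate()` as a custom CloudWatch metric would then give an alarmable signal for when IMDS starts degrading.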
1 answer · 0 votes · 12 views · asked 4 months ago

Hive Sync fails: AWSGlueDataCatalogHiveClientFactory not found

I'm syncing data written to S3 using Apache Hudi with Hive & Glue. Hudi options:

```
hudi_options:
    'hoodie.table.name': mytable
    'hoodie.datasource.write.recordkey.field': Id
    'hoodie.datasource.write.partitionpath.field': date
    'hoodie.datasource.write.table.name': mytable
    'hoodie.datasource.write.operation': upsert
    'hoodie.datasource.write.precombine.field': LastModifiedDate
    'hoodie.datasource.hive_sync.enable': true
    'hoodie.datasource.hive_sync.partition_fields': date
    'hoodie.datasource.hive_sync.database': hudi_lake_dev
    'hoodie.datasource.hive_sync.table': mytable
```

EMR Configurations:

```
...
{
    "Classification": "yarn-site",
    "Properties": {
        "yarn.nodemanager.vmem-check-enabled": "false",
        "yarn.log-aggregation-enable": "true",
        "yarn.log-aggregation.retain-seconds": "-1",
        "yarn.nodemanager.remote-app-log-dir": config["yarn_agg_log_uri_s3_path"].format(current_date),
    },
    "Configurations": [],
},
{
    "Classification": "spark-hive-site",
    "Properties": {
        "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"  # noqa
    },
},
{
    "Classification": "presto-connector-hive",
    "Properties": {"hive.metastore.glue.datacatalog.enabled": "true"},
},
{
    "Classification": "hive-site",
    "Properties": {
        "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory",  # noqa
        "hive.metastore.schema.verification": "false",
    },
},
...
```

Getting the following error:

```
Traceback (most recent call last):
File "/home/hadoop/my_pipeline.py", line 31, in <module>
...
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 1109, in save File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__ File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 111, in deco File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o84.save. : org.apache.hudi.hive.HoodieHiveSyncException: Got runtime exception when hive syncing at org.apache.hudi.hive.HiveSyncTool.<init>(HiveSyncTool.java:74) at org.apache.hudi.HoodieSparkSqlWriter$.syncHive(HoodieSparkSqlWriter.scala:391) at org.apache.hudi.HoodieSparkSqlWriter$.$anonfun$metaSync$4(HoodieSparkSqlWriter.scala:440) at org.apache.hudi.HoodieSparkSqlWriter$.$anonfun$metaSync$4$adapted(HoodieSparkSqlWriter.scala:436) at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) at org.apache.hudi.HoodieSparkSqlWriter$.metaSync(HoodieSparkSqlWriter.scala:436) at org.apache.hudi.HoodieSparkSqlWriter$.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:497) at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:223) at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:145) at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:46) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:90) at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:194) at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:232) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:229) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:190) at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:134) at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:133) at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:989) at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107) at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232) at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:110) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:135) at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107) at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:135) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:253) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:134) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68) at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:989) at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:438) at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:415) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:293) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at 
java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:282) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:238) at java.lang.Thread.run(Thread.java:748) Caused by: org.apache.hudi.hive.HoodieHiveSyncException: Failed to create HiveMetaStoreClient at org.apache.hudi.hive.HoodieHiveClient.<init>(HoodieHiveClient.java:93) at org.apache.hudi.hive.HiveSyncTool.<init>(HiveSyncTool.java:69) ... 46 more Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Unable to instantiate a metastore client factory com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory due to: java.lang.ClassNotFoundException: Class com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory not found) at org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:239) at org.apache.hadoop.hive.ql.metadata.Hive.<init>(Hive.java:402) at org.apache.hadoop.hive.ql.metadata.Hive.create(Hive.java:335) at org.apache.hadoop.hive.ql.metadata.Hive.getInternal(Hive.java:315) at org.apache.hadoop.hive.ql.metadata.Hive.get(Hive.java:291) at org.apache.hudi.hive.HoodieHiveClient.<init>(HoodieHiveClient.java:91) ... 
47 more Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Unable to instantiate a metastore client factory com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory due to: java.lang.ClassNotFoundException: Class com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory not found) at org.apache.hadoop.hive.ql.metadata.Hive.getAllFunctions(Hive.java:3991) at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:251) at org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:234) ... 52 more Caused by: MetaException(message:Unable to instantiate a metastore client factory com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory due to: java.lang.ClassNotFoundException: Class com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory not found) at org.apache.hadoop.hive.ql.metadata.HiveUtils.createMetaStoreClientFactory(HiveUtils.java:525) at org.apache.hadoop.hive.ql.metadata.HiveUtils.createMetaStoreClient(HiveUtils.java:506) at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3746) at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3726) at org.apache.hadoop.hive.ql.metadata.Hive.getAllFunctions(Hive.java:3988) ... 54 more 22/01/12 16:37:53 INFO SparkContext: Invoking stop() from shutdown hook ```
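Since the `ClassNotFoundException` comes from the Hive client that Hudi's sync tool constructs, one thing worth ruling out is the factory class string differing between classifications (a stray character or trailing comment in any one copy is enough to break class loading). A small helper that assembles the classifications from a single constant makes that impossible; this is only a sketch built from the classifications already shown in the question:

```python
# Single source of truth for the Glue metastore factory class, reused in
# every EMR configuration classification that needs it.
GLUE_FACTORY = (
    "com.amazonaws.glue.catalog.metastore."
    "AWSGlueDataCatalogHiveClientFactory"
)

def glue_classifications():
    """Build hive-site and spark-hive-site classifications with an
    identical factory-class string in each."""
    props = {"hive.metastore.client.factory.class": GLUE_FACTORY}
    return [
        {"Classification": "hive-site", "Properties": dict(props)},
        {"Classification": "spark-hive-site", "Properties": dict(props)},
    ]
```

If the strings are already identical, the usual remaining suspect is the Glue client jar not being on the classpath of the process doing the Hive sync, which is a classpath question rather than a configuration one.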
1 answer · 0 votes · 90 views · asked 4 months ago

EMRFS and S3 503 slow down responses

I've been seeing the following exception from emrfs using emr 6.5.0 ``` Caused by: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Slow Down (Service: Amazon S3; Status Code: 503; Error Code: 503 Slow Down; Request ID: XXX; S3 Extended Request ID: XXX; Proxy: null), S3 Extended Request ID: XXX at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1862) at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleServiceErrorResponse(AmazonHttpClient.java:1415) at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1384) at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1154) at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:811) at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:779) at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:753) at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:713) at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:695) at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:559) at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:539) at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5445) at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5392) at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1368) at com.amazon.ws.emr.hadoop.fs.s3.lite.call.GetObjectMetadataCall.perform(GetObjectMetadataCall.java:26) at com.amazon.ws.emr.hadoop.fs.s3.lite.call.GetObjectMetadataCall.perform(GetObjectMetadataCall.java:12) at com.amazon.ws.emr.hadoop.fs.s3.lite.executor.GlobalS3Executor$CallPerformer.call(GlobalS3Executor.java:108) at com.amazon.ws.emr.hadoop.fs.s3.lite.executor.GlobalS3Executor.execute(GlobalS3Executor.java:135) at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.invoke(AmazonS3LiteClient.java:191) at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.invoke(AmazonS3LiteClient.java:186) at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.getObjectMetadata(AmazonS3LiteClient.java:96) at com.amazon.ws.emr.hadoop.fs.s3.lite.AbstractAmazonS3Lite.getObjectMetadata(AbstractAmazonS3Lite.java:43) at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.getFileMetadataFromCacheOrS3(Jets3tNativeFileSystemStore.java:592) at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:318) at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:509) at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.getFileStatus(EmrFileSystem.java:694) at org.apache.spark.sql.execution.datasources.BasicWriteTaskStatsTracker.getFileSize(BasicWriteStatsTracker.scala:70) at org.apache.spark.sql.execution.datasources.BasicWriteTaskStatsTracker.$anonfun$statCurrentFile$1(BasicWriteStatsTracker.scala:96) at org.apache.spark.sql.execution.datasources.BasicWriteTaskStatsTracker.$anonfun$statCurrentFile$1$adapted(BasicWriteStatsTracker.scala:95) ``` My spark application sets the following for fs.s3.: ``` hadoopConfiguration.set("fs.s3.maxConnections", "2") hadoopConfiguration.set("fs.s3.maxRetries", "120") hadoopConfiguration.set("fs.s3.retryPeriodSeconds", "1") 
hadoopConfiguration.set("fs.s3.buckets.create.enabled", "false") ``` The maxRetries is obviously crazy large just to see if it had any impact on this issue (answer: no). Is there an additional setting that needs to be added or changed to get this codepath to honour the retry configuration?
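For what it's worth, 503 Slow Down is S3 throttling, and whether the `getFileStatus` path in `BasicWriteTaskStatsTracker` consults `fs.s3.maxRetries` at all is exactly what's in question here. As a general pattern, throttling is handled at the application level with capped exponential backoff plus full jitter; a minimal sketch (the wrapped call, timings, and parameter names are my own, not an EMRFS API):

```python
import random
import time

def with_backoff(call, max_attempts=8, base=0.5, cap=30.0,
                 sleep=time.sleep, retryable=(Exception,)):
    """Retry `call` with capped exponential backoff and full jitter.

    Assumes throttling surfaces as an exception from `call`; `sleep`
    is injectable so the logic can be tested without waiting.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except retryable:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the error
            # full jitter: sleep a random amount up to the capped window
            sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

This obviously doesn't answer whether EMRFS can be configured to do this on that code path, but it shows the behavior the retry settings are meant to buy.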
1 answer · 0 votes · 36 views · asked 5 months ago

Adding Hive and beeline client on AWS MWAA

Hi all, I am using HiveOperator to execute a query on an EMR cluster using the beeline client from AWS MWAA. Rather than adding a step on EMR, I want to run the Hive query directly using HiveOperator from AWS MWAA, but due to missing binaries on MWAA it says there is no such beeline directory. Please see the DAG code and stack trace below.

Sample code (**the below code works fine with our Airflow installed on EKS**):

```
hive_direct_task = HiveOperator(
    task_id='hive_direct_task',
    hive_cli_conn_id='hive_emr_dag_connection',
    hql='CREATE TABLE XXXX.XXXX STORED AS ORC AS SELECT DISTINCT * from XXXX.XXXX limit 2'
)
```

Log:

```
{{logging_mixin.py:112}} INFO - Running %s on host %s <TaskInstance: spark_hive_ssh_dag.hive_direct_task > ip-XXXX.ec2.internal
{{hive_operator.py:121}} INFO - Executing: CREATE TABLE XXXX.XXXX STORED AS ORC AS SELECT DISTINCT * from XXXX.XXXX limit 2
{{hive_operator.py:136}} INFO - Passing HiveConf: {'airflow.ctx.dag_email': 'XXXX@XXXX.com', 'airflow.ctx.dag_owner': 'airflow', 'airflow.ctx.dag_id': 'spark_hive_ssh_dag', 'airflow.ctx.task_id': 'hive_direct_task', 'airflow.ctx.execution_date': '2020-12-09T18:03:28.344312+00:00', 'airflow.ctx.dag_run_id': 'manual__'}
{{taskinstance.py:1150}} ERROR - No such file or directory: 'beeline': 'beeline'
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 984, in _run_raw_task
    result = task_copy.execute(context=context)
  File "/usr/local/lib/python3.7/site-packages/airflow/operators/hive_operator.py", line 137, in execute
    self.hook.run_cli(hql=self.hql, schema=self.schema, hive_conf=self.hiveconfs)
  File "/usr/local/lib/python3.7/site-packages/airflow/hooks/hive_hooks.py", line 258, in run_cli
    close_fds=True)
  File "/usr/lib64/python3.7/subprocess.py", line 800, in __init__
    restore_signals, start_new_session)
  File "/usr/lib64/python3.7/subprocess.py", line 1551, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: No such file or directory: 'beeline': 'beeline'
```

Similarly, if I want to use HDFSSensor, it will require the hadoop/hdfs client, and many more like this. In our on-premises Airflow setup on EKS, we added all the binaries (hive, hadoop, hdfs) on top of the Apache Airflow base image, and the same code works fine whenever we query EMR using HiveOperator.

**Does MWAA support integrations to AWS services only?** Like EMR ClusterLaunch, EMR AddStep, AthenaOperators, etc. **Can I achieve the above use case with AWS MWAA?** So far I have explored everything and am unable to find a way to add the binaries on MWAA. If the above use case is possible, please let us know how I can add the binaries in MWAA.

Thanks,
Neeraj Vyas (Data Engineer)
Neerajvyas615@gmail.com

Edited by: NeerajVyas on Dec 13, 2020 1:37 AM
Edited by: NeerajVyas on Dec 13, 2020 1:39 AM
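Since MWAA workers don't ship Hive client binaries, one common workaround is to run beeline where it already exists, i.e. on the EMR master node, for example via Airflow's SSHOperator. A sketch of assembling that command; the host, port, and quoting convention are placeholder assumptions, not a tested MWAA recipe:

```python
def beeline_command(hiveserver2_host, hql, port=10000):
    """Build a beeline invocation intended to run on the EMR master node
    (e.g. as the `command` of an SSHOperator), since that is where the
    Hive client binaries actually live."""
    jdbc_url = f"jdbc:hive2://{hiveserver2_host}:{port}/default"
    return f'beeline -u "{jdbc_url}" -e "{hql}"'

# Hypothetical usage inside a DAG:
#   ssh_task = SSHOperator(
#       task_id="hive_via_ssh",
#       ssh_conn_id="emr_master_ssh",   # assumed connection id
#       command=beeline_command("ip-XXXX.ec2.internal", "SELECT 1"),
#   )
```

The design trade-off versus HiveOperator is that the query then executes entirely on the cluster, so MWAA only needs SSH connectivity rather than local Hive binaries.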
1 answer · 0 votes · 2 views · asked a year ago

NoClassDefFoundError when trying to use Hive/JDBC from Zeppelin

Using EMR 6.1.0 (Hadoop 3.2.1, Hive 3.1.2, Spark 3.0.0, and Zeppelin 0.9.0), I installed the jdbc interpreter using:

```
sudo bash bin/install-interpreter.sh -n jdbc --artifact org.apache.zeppelin:zeppelin-jdbc:0.9.0-preview2
```

(The --artifact flag was required; otherwise the interpreter would not install.) Using the Zeppelin web interface (accessed as localhost:8890 using ssh forwarding), the hive/jdbc interpreter and a simple SELECT statement throw the error copied below. I'm not sure what is required for the setup to recognize PropertiesUtil. Any steps to get the jdbc hive interpreter working in Zeppelin would be appreciated. I have posted to the Zeppelin forums as well, but this may be specific to EMR.

```
java.lang.NoClassDefFoundError: org/apache/zeppelin/util/PropertiesUtil
    at org.apache.zeppelin.jdbc.JDBCInterpreter.createConnectionPool(JDBCInterpreter.java:469)
    at org.apache.zeppelin.jdbc.JDBCInterpreter.getConnectionFromPool(JDBCInterpreter.java:485)
    at org.apache.zeppelin.jdbc.JDBCInterpreter.getConnection(JDBCInterpreter.java:505)
    at org.apache.zeppelin.jdbc.JDBCInterpreter.executeSql(JDBCInterpreter.java:706)
    at org.apache.zeppelin.jdbc.JDBCInterpreter.internalInterpret(JDBCInterpreter.java:881)
    at org.apache.zeppelin.interpreter.AbstractInterpreter.interpret(AbstractInterpreter.java:47)
    at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:110)
    at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:684)
    at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:577)
    at org.apache.zeppelin.scheduler.Job.run(Job.java:172)
    at org.apache.zeppelin.scheduler.AbstractScheduler.runJob(AbstractScheduler.java:130)
    at org.apache.zeppelin.scheduler.ParallelScheduler.lambda$runJobInScheduler$0(ParallelScheduler.java:39)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: org.apache.zeppelin.util.PropertiesUtil
    at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
    ... 15 more
```

Edited by: AADC on Sep 15, 2020 9:39 PM
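One plausible (unconfirmed) cause: the jdbc interpreter jar installed from Maven is `0.9.0-preview2`, while EMR 6.1.0 ships its own 0.9.0 preview build of the Zeppelin server, and classes such as `PropertiesUtil` can differ between preview builds, producing exactly this kind of `NoClassDefFoundError`. A trivial sanity check on the artifact coordinate (the version strings here are illustrative):

```python
def artifact_matches_server(server_version, artifact_coordinate):
    """Return True when a Maven coordinate such as
    'org.apache.zeppelin:zeppelin-jdbc:0.9.0-preview2' pins exactly the
    version string of the running Zeppelin server."""
    return artifact_coordinate.rsplit(":", 1)[-1] == server_version
```

If the versions don't match, reinstalling the interpreter with an artifact version matching the server build would be the first thing to try.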
1 answer · 0 votes · 4 views · asked 2 years ago

AWS Step Function does not recognize job flow ID

I have created a trivial Step Functions state machine to add a step to an EMR cluster, but the job flow ID is not being recognised. Does anyone have a working solution, or a pointer to what setup may be missing?

The state machine:

```
{
  "StartAt": "add_emr_step",
  "States": {
    "add_emr_step": {
      "End": true,
      "InputPath": "$",
      "Parameters": {
        "ClusterId": "j-INTERESTINGID",
        "Step": {
          "Name": "$$.Execution.Id",
          "ActionOnFailure": "CONTINUE",
          "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
              "spark-submit --deploy-mode cluster s3://myfavouritebucketname86fda855-nhjwrxp55888/scripts/csv_to_parquet.py"
            ]
          }
        }
      },
      "Type": "Task",
      "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
      "ResultPath": "$.guid"
    }
  },
  "TimeoutSeconds": 30
}
```

Even with admin permissions, the following error is returned. The EMR cluster is marked as visible, and if I copy and paste the job flow ID into the console, it shows up as expected.

> Error
>
> EMR.AmazonElasticMapReduceException
>
> Cause
>
> Specified job flow ID does not exist. (Service: AmazonElasticMapReduce; Status Code: 400; Error Code: ValidationException; Request ID: be70a470-ef6d-4356-98cd-0fc36d2cd132)

The IAM role:

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "elasticmapreduce:AddJobFlowSteps",
        "elasticmapreduce:DescribeStep",
        "elasticmapreduce:CancelSteps"
      ],
      "Resource": "arn:aws:elasticmapreduce:us-east-1:accountid:cluster/*",
      "Effect": "Allow"
    },
    {
      "Action": [
        "events:PutTargets",
        "events:PutRule",
        "events:DescribeRule"
      ],
      "Resource": "arn:aws:events:us-east-1:accountid:rule/StepFunctionsGetEventForEMRAddJobFlowStepsRule",
      "Effect": "Allow"
    }
  ]
}
```

The API call in CloudTrail:

```
{
  "eventVersion": "1.05",
  "userIdentity": {
    "type": "AssumedRole",
    "principalId": "AROA36BAFPT62RGX2RMVQ:mYIsrbWoGnzXqWuTAXdcCriqSbTPXkdb",
    "arn": "arn:aws:sts::accountid:assumed-role/emr-step-statematchineRoleC75C4884-1FDZWJ1GBIVUE/mYIsrbWoGnzXqWuTAXdcCriqSbTPXkdb",
    "accountId": "accountid",
    "accessKeyId": "ASIA36BAFPT6UG3XKKHI",
    "sessionContext": {
      "sessionIssuer": {
        "type": "Role",
        "principalId": "AROA36BAFPT62RGX2RMVQ",
        "arn": "arn:aws:iam::accountid:role/emr-step-statematchineRoleC75C4884-1FDZWJ1GBIVUE",
        "accountId": "accountid",
        "userName": "emr-step-statematchineRoleC75C4884-1FDZWJ1GBIVUE"
      },
      "webIdFederationData": {},
      "attributes": {
        "mfaAuthenticated": "false",
        "creationDate": "2020-06-08T23:29:29Z"
      }
    },
    "invokedBy": "states.amazonaws.com"
  },
  "eventTime": "2020-06-08T23:29:29Z",
  "eventSource": "elasticmapreduce.amazonaws.com",
  "eventName": "AddJobFlowSteps",
  "awsRegion": "us-east-1",
  "sourceIPAddress": "states.amazonaws.com",
  "userAgent": "states.amazonaws.com",
  "errorCode": "ValidationException",
  "errorMessage": "Specified job flow ID does not exist.",
  "requestParameters": {
    "jobFlowId": "j-288AS5TZY5CY7",
    "steps": [
      {
        "name": "$$.Execution.Id",
        "actionOnFailure": "CONTINUE",
        "hadoopJarStep": {
          "jar": "command-runner.jar",
          "args": []
        }
      }
    ]
  },
  "responseElements": null,
  "requestID": "00c685e0-0d22-43ac-8fce-29bd808462bf",
  "eventID": "2a51a9ef-20a5-4fd7-bd33-be42cdf3088c",
  "eventType": "AwsApiCall",
  "recipientAccountId": "accountid"
}
```
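Two things stand out in the CloudTrail record and are worth checking. First, Step Functions only evaluates a value as a path expression when the key ends in `.$`, so `"Name": "$$.Execution.Id"` passes the literal string through (visible as `"name": "$$.Execution.Id"` in the event). Second, "Specified job flow ID does not exist" is also what you get when the state machine runs in a different region than the cluster, so it's worth confirming both are in us-east-1. A sketch of the corrected `Parameters` block (also splitting the spark-submit line into separate args, since command-runner takes an argument list rather than one string):

```python
import json

# Corrected Parameters block for the addStep.sync task. ClusterId is the
# placeholder from the question; note "Name.$" (with the ".$" suffix) so
# the execution id is resolved instead of passed literally.
parameters = {
    "ClusterId": "j-INTERESTINGID",
    "Step": {
        "Name.$": "$$.Execution.Id",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit", "--deploy-mode", "cluster",
                "s3://myfavouritebucketname86fda855-nhjwrxp55888/scripts/csv_to_parquet.py",
            ],
        },
    },
}
print(json.dumps(parameters, indent=2))
```

Neither fix is guaranteed to be the cause of the ValidationException on its own, but the literal `$$.Execution.Id` in CloudTrail shows the `.$` issue is real in this definition.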
1 answer · 0 votes · 14 views · asked 2 years ago

No Such Method Error with Spark and Hudi

I have been colliding with this problem for several days now, and am at my wits' end. We have an EMR cluster that we launch to process data into a Hudi data set. We start the cluster with an API call, specifying Spark, Hive, Tez, and EMR release label emr-5.29.0. We set a few configurations, notably "hive.metastore.client.factory.class" for Glue, with "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory". We then run our little script, which calls a spark.sql query, then writes to the Hudi set. We follow the configuration steps laid out here: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-work-with-dataset.html. The script works in Scala in an EMR Notebook, as well as from the spark-shell. But when we try to spark-submit it, we get the following error:

```
java.lang.NoSuchMethodError: org.apache.http.conn.ssl.SSLConnectionSocketFactory.<init>(Ljavax/net/ssl/SSLContext;Ljavax/net/ssl/HostnameVerifier;)V
```

To compound the frustration, we had originally written the script in Python, but we encountered this same error within the EMR Notebook, the pyspark shell, and with spark-submit. Some preliminary reading suggests that this is caused by incompatible jars; by process of elimination we think it is something between the Hudi jars and the AWS Glue functionality. Does anyone have any thoughts?
The whole stack trace, if it helps any:

```
at com.amazonaws.http.conn.ssl.SdkTLSSocketFactory.<init>(SdkTLSSocketFactory.java:58)
at com.amazonaws.http.apache.client.impl.ApacheConnectionManagerFactory.getPreferredSocketFactory(ApacheConnectionManagerFactory.java:93)
at com.amazonaws.http.apache.client.impl.ApacheConnectionManagerFactory.create(ApacheConnectionManagerFactory.java:66)
at com.amazonaws.http.apache.client.impl.ApacheConnectionManagerFactory.create(ApacheConnectionManagerFactory.java:59)
at com.amazonaws.http.apache.client.impl.ApacheHttpClientFactory.create(ApacheHttpClientFactory.java:50)
at com.amazonaws.http.apache.client.impl.ApacheHttpClientFactory.create(ApacheHttpClientFactory.java:38)
at com.amazonaws.http.AmazonHttpClient.<init>(AmazonHttpClient.java:324)
at com.amazonaws.http.AmazonHttpClient.<init>(AmazonHttpClient.java:308)
at com.amazonaws.AmazonWebServiceClient.<init>(AmazonWebServiceClient.java:237)
at com.amazonaws.AmazonWebServiceClient.<init>(AmazonWebServiceClient.java:223)
at com.amazonaws.services.glue.AWSGlueClient.<init>(AWSGlueClient.java:177)
at com.amazonaws.services.glue.AWSGlueClient.<init>(AWSGlueClient.java:163)
at com.amazonaws.services.glue.AWSGlueClientBuilder.build(AWSGlueClientBuilder.java:61)
at com.amazonaws.services.glue.AWSGlueClientBuilder.build(AWSGlueClientBuilder.java:27)
at com.amazonaws.client.builder.AwsSyncClientBuilder.build(AwsSyncClientBuilder.java:46)
at com.amazonaws.glue.catalog.metastore.AWSGlueClientFactory.newClient(AWSGlueClientFactory.java:72)
at com.amazonaws.glue.catalog.metastore.AWSCatalogMetastoreClient.<init>(AWSCatalogMetastoreClient.java:146)
at com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory.createMetaStoreClient(AWSGlueDataCatalogHiveClientFactory.java:16)
at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3007)
at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3042)
at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1235)
at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:175)
at org.apache.hadoop.hive.ql.metadata.Hive.<clinit>(Hive.java:167)
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)
at org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:183)
at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:117)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:271)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:384)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:286)
at org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:66)
at org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:65)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply$mcZ$sp(HiveExternalCatalog.scala:215)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:215)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:215)
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
... 77 more
```
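This particular `NoSuchMethodError` signature usually means an older Apache httpclient on the classpath is shadowing the version the AWS SDK was built against (the two-argument `SSLConnectionSocketFactory` constructor only exists in newer httpclient releases). One experiment is to let user-supplied jars win classloading via Spark's `spark.driver.userClassPathFirst` / `spark.executor.userClassPathFirst` settings, or to pin a matching httpclient jar via `--jars`. A sketch of assembling such a submission; the jar names and paths are placeholders, not known-good versions:

```python
def spark_submit_args(app, extra_jars=()):
    """Build a spark-submit argument list that prefers user jars on the
    classpath -- one way to test the incompatible-httpclient hypothesis."""
    args = ["spark-submit", "--deploy-mode", "cluster"]
    for conf in ("spark.driver.userClassPathFirst=true",
                 "spark.executor.userClassPathFirst=true"):
        args += ["--conf", conf]
    if extra_jars:
        args += ["--jars", ",".join(extra_jars)]
    return args + [app]
```

`userClassPathFirst` can itself introduce new conflicts, so treat it as a diagnostic, not a fix; if it changes the error, the httpclient/httpcore pairing is the thing to pin down.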
2 answers · 0 votes · 25 views · asked 2 years ago

Setting the Step Concurrency Level is doing nothing

https://aws.amazon.com/about-aws/whats-new/2019/11/amazon-emr-now-allows-you-to-run-multiple-steps-in-parallel-cancel-running-steps-and-integrate-with-aws-step-functions/ EMR&#39;s latest release allows you to run steps in parallel. This is a great feature. Problem is, when I set this value from it&#39;s default of 1 to 10, there is no noticeable change. The Steps still execute in sequential order where only one runs and the others are in pending. I&#39;m setting this value after the cluster is made with the modify-cluster command in the cli **Environment:** **Version:** emr-5.28.0 **Application:** Spark **Instance types:** Master(1) ->m4.2xlarge, Cores(2)->m4.xlarge I&#39;ve tried experimenting with the yarn schedulers, but it seems to make no difference. For the sake of simplicity lets just say it&#39;s the default Capacity Scheduler that EMR boots with. Even with the default Yarn capacity scheduler I can successfully do the old hack of tricking EMR into adding more applications to Yarn which will cause them to execute in parallel(desired) as outlined here:<https://stackoverflow.com/a/54069971> This works because of the **spark.yarn.submit.waitAppCompletion=false** property which makes an EMR step complete right away and move on to the next one, even before the Spark application finishes. This is making me think the problem is ultimately EMR not obeying the StepConcurrencyLevel for whatever reason unknown to me. This hacky solution is not ideal, because it basically creates a disconnect from EMR and the Spark job, which means if you wanted to use another aws service like the Serverless Step functions communicating with EMR, certain features may not work correctly (https://aws.amazon.com/blogs/aws/new-using-step-functions-to-orchestrate-amazon-emr-workloads/) Does it matter if the spark app is submitted in cluster-mode or client-mode? Any tips, tricks, or pointers would be much appreciated. 
I would imagine that if it were working, you would see two or more "RUNNING" steps in the Step tab of the AWS EMR console, with that also reflected in YARN (`yarn application -list`).
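One thing worth verifying first is whether the setting was ever applied: `aws emr describe-cluster` reports the level the cluster actually ended up with. A minimal sketch, assuming the `aws` CLI is configured and `j-XXXXXXXXXXXXX` stands in for a real cluster ID:

```shell
# Placeholder cluster ID -- substitute a real one (e.g. from `aws emr list-clusters`).
CLUSTER_ID="j-XXXXXXXXXXXXX"

if command -v aws >/dev/null 2>&1; then
    # Raise the concurrency level on the running cluster (supported from emr-5.28.0).
    aws emr modify-cluster --cluster-id "$CLUSTER_ID" \
        --step-concurrency-level 10 || echo "modify-cluster failed"
    # Read the level back; if this still prints 1, the modify call never took effect.
    aws emr describe-cluster --cluster-id "$CLUSTER_ID" \
        --query 'Cluster.StepConcurrencyLevel' || echo "describe-cluster failed"
else
    echo "aws CLI not installed; skipping"
fi
```

If the level reads back as 10 and steps still queue, YARN capacity is the next thing to check: a second application stays in ACCEPTED until the first releases resources, which can look like sequential steps from the console.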
1
answers
0
votes
36
views
asked 2 years ago

Usage of StreamingFileSink is throwing NoClassDefFoundError

I know this could be my problem, but I have been trying to grapple with it for a while. I am trying to run Flink in an AWS EMR cluster. My setup is: time-series events from Kinesis -> Flink job -> save to S3.

```
DataStream<Event> kinesis = env.addSource(new FlinkKinesisConsumer<>(
        this.streamName, new EventSchema(), kinesisConsumerConfig)).name("source");

final StreamingFileSink<Event> streamingFileSink = StreamingFileSink.<Event>forRowFormat(
        new org.apache.flink.core.fs.Path("s3a://" + this.bucketName + "/" + this.objectPrefix),
        new SimpleStringEncoder<>("UTF-8"))
    .withBucketAssignerAndPolicy(new OrgIdBucketAssigner(), DefaultRollingPolicy.create().build())
    .build();

DataStream<Event> eventDataStream = kinesis
    .rebalance()
    .keyBy(createKeySelectorByChoosingOrgIdFromTheEvent())
    .process(new KeyedProcessFunction<String, Event, Event>() {
        @Override
        public void processElement(Event value, Context ctx, Collector<Event> out) throws Exception {
            out.collect(value);
        }
    });

eventDataStream.addSink(streamingFileSink).name("streamingFileSink");
```

From one of the sites, https://www.mail-archive.com/user@flink.apache.org/msg25039.html, I came to know that in order to make StreamingFileSink work, one has to drop the jar **flink-s3-fs-hadoop-1.7.1.jar** into the **/usr/lib/flink/lib** folder. My **/usr/lib/flink/lib** folder on the EMR master node looks as below:

```
-rw-r--r-- 1 root root     9924 Mar 20 01:06 slf4j-log4j12-1.7.15.jar
-rw-r--r-- 1 root root 42655628 Mar 20 01:06 flink-shaded-hadoop2-uber-1.7.1.jar
-rw-r--r-- 1 root root   483665 Mar 20 01:06 log4j-1.2.17.jar
-rw-r--r-- 1 root root   140172 Mar 20 01:06 flink-python_2.11-1.7.1.jar
-rw-r--r-- 1 root root 92070994 Mar 20 01:08 flink-dist_2.11-1.7.1.jar
-rw-r--r-- 1 root root 23451686 May  5 23:04 flink-s3-fs-hadoop-1.7.1.jar
```

When I try to run the Flink job, it throws the below exception on the EMR slaves.
``` 2019-05-06 01:43:49,589 INFO org.apache.flink.runtime.taskmanager.Task - KeyedProcess -> Sink: streamingFileSink (3/4) (31000a186f6ab11f0066556116c669ba) switched from RUNNING to FAILED. java.lang.NoClassDefFoundError: Could not initialize class org.apache.flink.fs.s3base.shaded.com.amazonaws.services.s3.internal.S3ErrorResponseHandler at org.apache.flink.fs.s3base.shaded.com.amazonaws.services.s3.AmazonS3Client.<init>(AmazonS3Client.java:374) at org.apache.flink.fs.s3base.shaded.com.amazonaws.services.s3.AmazonS3Client.<init>(AmazonS3Client.java:553) at org.apache.flink.fs.s3base.shaded.com.amazonaws.services.s3.AmazonS3Client.<init>(AmazonS3Client.java:531) at org.apache.flink.fs.shaded.hadoop3.org.apache.hadoop.fs.s3a.DefaultS3ClientFactory.newAmazonS3Client(DefaultS3ClientFactory.java:80) at org.apache.flink.fs.shaded.hadoop3.org.apache.hadoop.fs.s3a.DefaultS3ClientFactory.createS3Client(DefaultS3ClientFactory.java:54) at org.apache.flink.fs.shaded.hadoop3.org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:256) at org.apache.flink.fs.s3.common.AbstractS3FileSystemFactory.create(AbstractS3FileSystemFactory.java:125) at org.apache.flink.core.fs.FileSystem.getUnguardedFileSystem(FileSystem.java:395) at org.apache.flink.core.fs.FileSystem.get(FileSystem.java:318) at org.apache.flink.streaming.api.functions.sink.filesystem.Buckets.<init>(Buckets.java:112) at org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink$RowFormatBuilder.createBuckets(StreamingFileSink.java:242) at org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink.initializeState(StreamingFileSink.java:327) at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.tryRestoreFunction(StreamingFunctionUtils.java:178) at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.restoreFunctionState(StreamingFunctionUtils.java:160) at 
org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.initializeState(AbstractUdfStreamOperator.java:96) at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:278) at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:738) at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:289) at org.apache.flink.runtime.taskmanager.Task.run(Task.java:704) at java.lang.Thread.run(Thread.java:748)
```

Can you please let me know what basic thing I am missing?

Edited by: nischit on May 6, 2019 7:09 AM
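"Could not initialize class" means the class's static initializer already failed once, and on a classpath assembled from several fat jars a common trigger is two jars shipping conflicting copies of the same (shaded) classes. A quick way to see which jars under the lib directory actually carry the failing class is to scan them; below is a small self-contained sketch of that check (the helper name `jars_containing` is mine; the path and class name come from the stack trace above):

```python
# Sketch: list the jars under a Flink lib directory that contain a given
# class file. If more than one jar matches (e.g. the shaded-hadoop uber jar
# and flink-s3-fs-hadoop both bundle it), you likely have a classpath clash;
# if exactly one matches, the static initializer is failing for another
# reason, such as a missing transitive dependency.
import pathlib
import zipfile

def jars_containing(jar_dir, class_file):
    """Return names of jars under jar_dir with an entry ending in class_file."""
    hits = []
    for jar in sorted(pathlib.Path(jar_dir).glob("*.jar")):
        with zipfile.ZipFile(jar) as zf:
            if any(name.endswith(class_file) for name in zf.namelist()):
                hits.append(jar.name)
    return hits

if __name__ == "__main__":
    # Directory and class taken from the question; run this on the EMR node.
    print(jars_containing("/usr/lib/flink/lib", "S3ErrorResponseHandler.class"))
```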
1
answers
0
votes
0
views
asked 3 years ago

EMR: Terminated with errors Bootstrap failure

I am just trying to build a cluster with Hadoop, Hive, and Spark preinstalled on emr-5.23.0. Until this morning it was working fine, and suddenly it started failing with the below error, with the cluster getting terminated: "Terminated with errors Bootstrap failure". I tried to capture the errors before the cluster died off:

*****************************************************************************
master.log
*****************************************************************************

\[ec2-user@ip-00-000-000-00 ~]$ cat /mnt/var/log/bootstrap-actions/master.log
2019-04-16 19:08:42,443 INFO i-078302fec7edd117b: new instance started
2019-04-16 19:08:42,451 ERROR i-078302fec7edd117b: failed to start. bootstrap action 1 failed with non-zero exit code.
2019-04-16 19:09:25,805 INFO i-02fc7b0d9e6815b8d: new instance started
2019-04-16 19:09:27,826 INFO i-02fc7b0d9e6815b8d: all bootstrap actions complete and instance ready
2019-04-16 19:10:00,812 INFO i-0261bda9ec7074d13: new instance started
2019-04-16 19:10:03,482 INFO i-0261bda9ec7074d13: all bootstrap actions complete and instance ready

*****************************************************************************
cloud-init
*****************************************************************************

Apr 16 19:04:23 cloud-init\[4713]: util.py\[WARNING]: Running module yum-configure (<module 'cloudinit.config.cc_yum_configure' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_yum_configure.pyc'>) failed
Apr 16 19:04:23 cloud-init\[4713]: util.py\[DEBUG]: Running module yum-configure (<module 'cloudinit.config.cc_yum_configure' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_yum_configure.pyc'>) failed Traceback (most recent call last): File "/usr/lib/python2.7/dist-packages/cloudinit/stages.py", line 660, in _run_modules cc.run(run_name, mod.handle, func_args, freq=freq) File "/usr/lib/python2.7/dist-packages/cloudinit/cloud.py", line 63, in run return self._runners.run(name, functor, args, freq,
clear_on_fail) File "/usr/lib/python2.7/dist-packages/cloudinit/helpers.py", line 197, in run results = functor(*args) File "/usr/lib/python2.7/dist-packages/cloudinit/config/cc_yum_configure.py", line 53, in handle raise errors\[-1] KeyError: 'regional' Apr 16 19:04:23 cloud-init\[4713]: stages.py\[DEBUG]: Running module yum-add-repo (<module 'cloudinit.config.cc_yum_add_repo' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_yum_add_repo.pyc'>) with frequency once-per-instance Apr 16 19:04:23 cloud-init\[4713]: util.py\[DEBUG]: Writing to /var/lib/cloud/instances/i-078302fec7edd117b/sem/config_yum_add_repo - wb: \[644] 20 bytes Apr 16 19:04:23 cloud-init\[4713]: helpers.py\[DEBUG]: Running config-yum-add-repo using lock (<FileLock using file '/var/lib/cloud/instances/i-078302fec7edd117b/sem/config_yum_add_repo'>) Apr 16 19:04:23 cloud-init\[4713]: cc_yum_add_repo.py\[INFO]: Skipping repo emr-platform, file /etc/yum.repos.d/emr-platform.repo already exists! Apr 16 19:04:23 cloud-init\[4713]: cc_yum_add_repo.py\[INFO]: Skipping repo emr-apps, file /etc/yum.repos.d/emr-apps.repo already exists! 
Apr 16 19:04:23 cloud-init\[4713]: stages.py\[DEBUG]: Running module package-update-upgrade-install (<module 'cloudinit.config.cc_package_update_upgrade_install' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_package_update_upgrade_install.pyc'>) with frequency once-per-instance Apr 16 19:04:23 cloud-init\[4713]: util.py\[DEBUG]: Writing to /var/lib/cloud/instances/i-078302fec7edd117b/sem/config_package_update_upgrade_install - wb: \[644] 20 bytes Apr 16 19:04:23 cloud-init\[4713]: helpers.py\[DEBUG]: Running config-package-update-upgrade-install using lock (<FileLock using file '/var/lib/cloud/instances/i-078302fec7edd117b/sem/config_package_update_upgrade_install'>) Apr 16 19:04:23 cloud-init\[4713]: amazon.py\[DEBUG]: Upgrade level: security Apr 16 19:04:23 cloud-init\[4713]: util.py\[DEBUG]: Running command \['yum', '-t', '-y', '--exclude=kernel', '--exclude=nvidia*', '--exclude=cudatoolkit', '--security', '--sec-severity=critical', '--sec-severity=important', 'upgrade'] with allowed return codes \[0] (shell=False, capture=False) Apr 16 19:05:24 cloud-init\[4713]: util.py\[WARNING]: Package upgrade failed Apr 16 19:05:24 cloud-init\[4713]: util.py\[DEBUG]: Package upgrade failed Traceback (most recent call last): File "/usr/lib/python2.7/dist-packages/cloudinit/config/cc_package_update_upgrade_install.py", line 81, in handle cloud.distro.upgrade_packages(upgrade_level, upgrade_exclude) File "/usr/lib/python2.7/dist-packages/cloudinit/distros/amazon.py", line 50, in upgrade_packages return self.package_command('upgrade', args=args) File "/usr/lib/python2.7/dist-packages/cloudinit/distros/rhel.py", line 211, in package_command util.subp(cmd, capture=False, pipe_cat=True, close_stdin=True) File "/usr/lib/python2.7/dist-packages/cloudinit/util.py", line 1626, in subp cmd=args) ProcessExecutionError: Unexpected error while running command. 
Command: \['yum', '-t', '-y', '--exclude=kernel', '--exclude=nvidia*', '--exclude=cudatoolkit', '--security', '--sec-severity=critical', '--sec-severity=important', 'upgrade'] Exit code: 1 Reason: - Stdout: '' Stderr: '' Apr 16 19:05:24 cloud-init\[4713]: cc_package_update_upgrade_install.py\[WARNING]: 1 failed with exceptions, re-raising the last one Apr 16 19:05:24 cloud-init\[4713]: util.py\[WARNING]: Running module package-update-upgrade-install (<module 'cloudinit.config.cc_package_update_upgrade_install' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_package_update_upgrade_install.pyc'>) failed Apr 16 19:05:24 cloud-init\[4713]: util.py\[DEBUG]: Running module package-update-upgrade-install (<module 'cloudinit.config.cc_package_update_upgrade_install' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_package_update_upgrade_install.pyc'>) failed Traceback (most recent call last): File "/usr/lib/python2.7/dist-packages/cloudinit/stages.py", line 660, in _run_modules cc.run(run_name, mod.handle, func_args, freq=freq) File "/usr/lib/python2.7/dist-packages/cloudinit/cloud.py", line 63, in run return self._runners.run(name, functor, args, freq, clear_on_fail) File "/usr/lib/python2.7/dist-packages/cloudinit/helpers.py", line 197, in run results = functor(*args) File "/usr/lib/python2.7/dist-packages/cloudinit/config/cc_package_update_upgrade_install.py", line 111, in handle raise errors\[-1] ProcessExecutionError: Unexpected error while running command. 
Command: \['yum', '-t', '-y', '--exclude=kernel', '--exclude=nvidia*', '--exclude=cudatoolkit', '--security', '--sec-severity=critical', '--sec-severity=important', 'upgrade'] Exit code: 1 Reason: - Stdout: '' Stderr: '' Apr 16 19:05:24 cloud-init\[4713]: stages.py\[DEBUG]: Running module timezone (<module 'cloudinit.config.cc_timezone' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_timezone.pyc'>) with frequency once-per-instance Apr 16 19:05:24 cloud-init\[4713]: util.py\[DEBUG]: Writing to /var/lib/cloud/instances/i-078302fec7edd117b/sem/config_timezone - wb: \[644] 20 bytes Apr 16 19:05:24 cloud-init\[4713]: helpers.py\[DEBUG]: Running config-timezone using lock (<FileLock using file '/var/lib/cloud/instances/i-078302fec7edd117b/sem/config_timezone'>) Apr 16 19:05:24 cloud-init\[4713]: cc_timezone.py\[DEBUG]: Skipping module named timezone, no 'timezone' specified Apr 16 19:05:24 cloud-init\[4713]: stages.py\[DEBUG]: Running module puppet (<module 'cloudinit.config.cc_puppet' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_puppet.pyc'>) with frequency once-per-instance Apr 16 19:05:24 cloud-init\[4713]: util.py\[DEBUG]: Writing to /var/lib/cloud/instances/i-078302fec7edd117b/sem/config_puppet - wb: \[644] 20 bytes Apr 16 19:05:24 cloud-init\[4713]: helpers.py\[DEBUG]: Running config-puppet using lock (<FileLock using file '/var/lib/cloud/instances/i-078302fec7edd117b/sem/config_puppet'>) Apr 16 19:05:24 cloud-init\[4713]: cc_puppet.py\[DEBUG]: Skipping module named puppet, no 'puppet' configuration found Apr 16 19:05:24 cloud-init\[4713]: stages.py\[DEBUG]: Running module disable-ec2-metadata (<module 'cloudinit.config.cc_disable_ec2_metadata' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_disable_ec2_metadata.pyc'>) with frequency always Apr 16 19:05:24 cloud-init\[4713]: helpers.py\[DEBUG]: Running config-disable-ec2-metadata using lock (<cloudinit.helpers.DummyLock object at 0x7fb9c0cad810>) Apr 16 19:05:24 
cloud-init\[4713]: cc_disable_ec2_metadata.py\[DEBUG]: Skipping module named disable-ec2-metadata, disabling the ec2 route not enabled Apr 16 19:05:24 cloud-init\[4713]: stages.py\[DEBUG]: Running module runcmd (<module 'cloudinit.config.cc_runcmd' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_runcmd.pyc'>) with frequency once-per-instance Apr 16 19:05:24 cloud-init\[4713]: util.py\[DEBUG]: Writing to /var/lib/cloud/instances/i-078302fec7edd117b/sem/config_runcmd - wb: \[644] 20 bytes Apr 16 19:05:24 cloud-init\[4713]: helpers.py\[DEBUG]: Running config-runcmd using lock (<FileLock using file '/var/lib/cloud/instances/i-078302fec7edd117b/sem/config_runcmd'>) Apr 16 19:05:24 cloud-init\[4713]: cc_runcmd.py\[DEBUG]: Skipping module named runcmd, no 'runcmd' key in configuration Apr 16 19:05:24 cloud-init\[4713]: cloud-init\[DEBUG]: Ran 10 modules with 2 failures Apr 16 19:05:24 cloud-init\[4713]: util.py\[DEBUG]: Reading from /proc/uptime (quiet=False) Apr 16 19:05:24 cloud-init\[4713]: util.py\[DEBUG]: Read 15 bytes from /proc/uptime Apr 16 19:05:24 cloud-init\[4713]: util.py\[DEBUG]: cloud-init mode 'modules' took 140.880 seconds (140.88) Apr 16 19:08:10 cloud-init\[5562]: util.py\[DEBUG]: Cloud-init v. 0.7.6 running 'modules:final' at Tue, 16 Apr 2019 19:08:10 +0000. Up 483.41 seconds. 
Apr 16 19:08:10 cloud-init\[5562]: stages.py\[DEBUG]: Using distro class <class 'cloudinit.distros.amazon.Distro'> Apr 16 19:08:10 cloud-init\[5562]: stages.py\[DEBUG]: Running module scripts-per-once (<module 'cloudinit.config.cc_scripts_per_once' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_scripts_per_once.pyc'>) with frequency once Apr 16 19:08:10 cloud-init\[5562]: util.py\[DEBUG]: Writing to /var/lib/cloud/sem/config_scripts_per_once.once - wb: \[644] 20 bytes Apr 16 19:08:10 cloud-init\[5562]: helpers.py\[DEBUG]: Running config-scripts-per-once using lock (<FileLock using file '/var/lib/cloud/sem/config_scripts_per_once.once'>) Apr 16 19:08:10 cloud-init\[5562]: stages.py\[DEBUG]: Running module scripts-per-boot (<module 'cloudinit.config.cc_scripts_per_boot' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_scripts_per_boot.pyc'>) with frequency always Apr 16 19:08:10 cloud-init\[5562]: helpers.py\[DEBUG]: Running config-scripts-per-boot using lock (<cloudinit.helpers.DummyLock object at 0x7f35a40e08d0>) Apr 16 19:08:10 cloud-init\[5562]: stages.py\[DEBUG]: Running module scripts-per-instance (<module 'cloudinit.config.cc_scripts_per_instance' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_scripts_per_instance.pyc'>) with frequency once-per-instance Apr 16 19:08:10 cloud-init\[5562]: util.py\[DEBUG]: Writing to /var/lib/cloud/instances/i-078302fec7edd117b/sem/config_scripts_per_instance - wb: \[644] 20 bytes Apr 16 19:08:10 cloud-init\[5562]: helpers.py\[DEBUG]: Running config-scripts-per-instance using lock (<FileLock using file '/var/lib/cloud/instances/i-078302fec7edd117b/sem/config_scripts_per_instance'>) Apr 16 19:08:10 cloud-init\[5562]: stages.py\[DEBUG]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_scripts_user.pyc'>) with frequency once-per-instance Apr 16 19:08:10 cloud-init\[5562]: util.py\[DEBUG]: Writing to 
/var/lib/cloud/instances/i-078302fec7edd117b/sem/config_scripts_user - wb: \[644] 20 bytes Apr 16 19:08:10 cloud-init\[5562]: helpers.py\[DEBUG]: Running config-scripts-user using lock (<FileLock using file '/var/lib/cloud/instances/i-078302fec7edd117b/sem/config_scripts_user'>) Apr 16 19:08:10 cloud-init\[5562]: stages.py\[DEBUG]: Running module ssh-authkey-fingerprints (<module 'cloudinit.config.cc_ssh_authkey_fingerprints' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_ssh_authkey_fingerprints.pyc'>) with frequency once-per-instance Apr 16 19:08:10 cloud-init\[5562]: util.py\[DEBUG]: Writing to /var/lib/cloud/instances/i-078302fec7edd117b/sem/config_ssh_authkey_fingerprints - wb: \[644] 20 bytes Apr 16 19:08:10 cloud-init\[5562]: helpers.py\[DEBUG]: Running config-ssh-authkey-fingerprints using lock (<FileLock using file '/var/lib/cloud/instances/i-078302fec7edd117b/sem/config_ssh_authkey_fingerprints'>) Apr 16 19:08:10 cloud-init\[5562]: util.py\[DEBUG]: Reading from /etc/ssh/sshd_config (quiet=False) Apr 16 19:08:10 cloud-init\[5562]: util.py\[DEBUG]: Read 3935 bytes from /etc/ssh/sshd_config Apr 16 19:08:10 cloud-init\[5562]: util.py\[DEBUG]: Reading from /home/ec2-user/.ssh/authorized_keys (quiet=False) Apr 16 19:08:10 cloud-init\[5562]: util.py\[DEBUG]: Read 389 bytes from /home/ec2-user/.ssh/authorized_keys Apr 16 19:08:10 cloud-init\[5562]: stages.py\[DEBUG]: Running module keys-to-console (<module 'cloudinit.config.cc_keys_to_console' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_keys_to_console.pyc'>) with frequency once-per-instance Apr 16 19:08:10 cloud-init\[5562]: util.py\[DEBUG]: Writing to /var/lib/cloud/instances/i-078302fec7edd117b/sem/config_keys_to_console - wb: \[644] 20 bytes Apr 16 19:08:10 cloud-init\[5562]: helpers.py\[DEBUG]: Running config-keys-to-console using lock (<FileLock using file '/var/lib/cloud/instances/i-078302fec7edd117b/sem/config_keys_to_console'>) Apr 16 19:08:10 cloud-init\[5562]: 
util.py\[DEBUG]: Running command \['/usr/libexec/cloud-init/write-ssh-key-fingerprints', '', 'ssh-dss'] with allowed return codes \[0] (shell=False, capture=True) Apr 16 19:08:10 cloud-init\[5562]: stages.py\[DEBUG]: Running module phone-home (<module 'cloudinit.config.cc_phone_home' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_phone_home.pyc'>) with frequency once-per-instance Apr 16 19:08:10 cloud-init\[5562]: util.py\[DEBUG]: Writing to /var/lib/cloud/instances/i-078302fec7edd117b/sem/config_phone_home - wb: \[644] 20 bytes Apr 16 19:08:10 cloud-init\[5562]: helpers.py\[DEBUG]: Running config-phone-home using lock (<FileLock using file '/var/lib/cloud/instances/i-078302fec7edd117b/sem/config_phone_home'>) Apr 16 19:08:10 cloud-init\[5562]: cc_phone_home.py\[DEBUG]: Skipping module named phone-home, no 'phone_home' configuration found Apr 16 19:08:10 cloud-init\[5562]: stages.py\[DEBUG]: Running module final-message (<module 'cloudinit.config.cc_final_message' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_final_message.pyc'>) with frequency always Apr 16 19:08:10 cloud-init\[5562]: helpers.py\[DEBUG]: Running config-final-message using lock (<cloudinit.helpers.DummyLock object at 0x7f35a3d848d0>) Apr 16 19:08:10 cloud-init\[5562]: util.py\[DEBUG]: Reading from /proc/uptime (quiet=False) Apr 16 19:08:10 cloud-init\[5562]: util.py\[DEBUG]: Read 15 bytes from /proc/uptime Apr 16 19:08:10 cloud-init\[5562]: util.py\[DEBUG]: Cloud-init v. 0.7.6 finished at Tue, 16 Apr 2019 19:08:10 +0000. Datasource DataSourceEc2. 
Up 483.59 seconds
Apr 16 19:08:10 cloud-init\[5562]: util.py\[DEBUG]: Writing to /var/lib/cloud/instance/boot-finished - wb: \[644] 52 bytes
Apr 16 19:08:10 cloud-init\[5562]: stages.py\[DEBUG]: Running module power-state-change (<module 'cloudinit.config.cc_power_state_change' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_power_state_change.pyc'>) with frequency once-per-instance
Apr 16 19:08:10 cloud-init\[5562]: util.py\[DEBUG]: Writing to /var/lib/cloud/instances/i-078302fec7edd117b/sem/config_power_state_change - wb: \[644] 20 bytes
Apr 16 19:08:10 cloud-init\[5562]: helpers.py\[DEBUG]: Running config-power-state-change using lock (<FileLock using file '/var/lib/cloud/instances/i-078302fec7edd117b/sem/config_power_state_change'>)
Apr 16 19:08:10 cloud-init\[5562]: cc_power_state_change.py\[DEBUG]: no power_state provided. doing nothing
Apr 16 19:08:10 cloud-init\[5562]: cloud-init\[DEBUG]: Ran 9 modules with 0 failures
Apr 16 19:08:10 cloud-init\[5562]: util.py\[DEBUG]: Creating symbolic link from '/run/cloud-init/result.json' => '../../var/lib/cloud/data/result.json'
Apr 16 19:08:10 cloud-init\[5562]: util.py\[DEBUG]: Reading from /proc/uptime (quiet=False)
Apr 16 19:08:10 cloud-init\[5562]: util.py\[DEBUG]: Read 15 bytes from /proc/uptime
Apr 16 19:08:10 cloud-init\[5562]: util.py\[DEBUG]: cloud-init mode 'modules' took 0.182 seconds (0.18)

*****************************************************************************

There is no additional bootstrap action added by me.

Edited by: BurnswickNYC on Apr 16, 2019 1:53 PM
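The log above shows the security upgrade that cloud-init ran exiting 1 with both Stdout and Stderr empty, so the first step is usually to re-run the exact same command interactively on a kept-alive node and read the repository errors directly (use termination protection or "keep alive after step failure" to get the chance). A hedged sketch, inert unless you opt in with `RUN_UPGRADE=1` on the EMR node itself; the yum flags are copied verbatim from the log:

```shell
# Re-run the security upgrade cloud-init attempted, with output visible.
# Set RUN_UPGRADE=1 on the EMR node itself; the sketch is a no-op otherwise.
if [ "${RUN_UPGRADE:-0}" = "1" ] && command -v yum >/dev/null 2>&1; then
    sudo yum -t -y --exclude=kernel --exclude='nvidia*' --exclude=cudatoolkit \
        --security --sec-severity=critical --sec-severity=important upgrade
    # Refreshing repo metadata often surfaces the underlying problem directly:
    sudo yum clean all && sudo yum repolist
else
    echo "skipping: set RUN_UPGRADE=1 and run this on the EMR node itself"
fi
```

If the metadata fetch is what fails here, the bootstrap failure is about the instance's network path to the package repositories (subnet routing, NAT, or S3/VPC endpoints) rather than anything in the cluster configuration.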
3
answers
0
votes
30
views
asked 3 years ago

pyspark gets stuck in Running due to import issue

**In short:** I run a pySpark application on AWS's EMR. When I map an rdd using a custom function that resides in an external module in an external package (shipped inside a .zip file as --py-files), the cluster gets stuck: the Running status is kept while no more log lines appear, until I manually terminate it.

**What it is not:** It is not a proper import exception, as that would have terminated the application upon executing the import lines, raising the appropriate exception, which does not happen. Also, as seen below, calling a function that maps with a similar function as a lambda, when the called function resides in the "problematic" module, works.

**What it is:** The bug occurs only when the program tries to use a function from that module as a mapping function in a transformation written in the main program. Additionally, if I remove the import line highlighted in the external file (the "problematic" module) - an import that is never used in this minimal bug-reproduction context (but is used in the actual context) - the bug ceases to exist.

Below is the code for the minimal example of the bug, including comments on two important lines, plus some technical info. Any help would be appreciated.
Here is the _main program_:

```
import spark_context_holder
from reproducing_bugs_external_package import reproducing_bugs_external_file

sc = spark_context_holder.sc
log = spark_context_holder.log

def make_nums_rdd():
    return sc.parallelize([1, 2, 3] * 300).map(lambda x: x * x / 1.45)

log.warn("Starting my code!")

sum = sc.parallelize([1,2,3]*300).map(lambda x: x*x/1.45).sum()
log.warn("The calculated sum using in-line expression, which doesn't mean anything more than 'succeeded in carrying out the calculation on the cluster', is {}!".format(sum))

simple_sum_rdd = make_nums_rdd()
log.warn("The calculated sum using the in-file function, which doesn't mean anything more than 'succeeded in carrying out the calculation on the cluster', is {}!".format(simple_sum_rdd.sum()))

simple_sum_rdd = reproducing_bugs_external_file.make_nums_rdd(sc)
log.warn("The calculated sum using the external file's function, which doesn't mean anything more than 'succeeded in carrying out the calculation on the cluster', is {}!".format(simple_sum_rdd.sum()))

simple_sum_rdd = sc.parallelize([1,2,3]*300).map(reproducing_bugs_external_file.calc_func)
log.warn("The calculated sum using the external file's mapping function, which doesn't mean anything more than 'succeeded in carrying out the calculation on the cluster', is {}!".format(simple_sum_rdd.sum()))
# This last line does not get logged, while the others up until this one do.
Here the cluster gets stuck on Running status without outputting any more log lines
```

In the zip file shipped as --py-files I have the following structure:

- `spark_context_holder.py`
- `reproducing_bugs_external_package`
  - `__init__.py`
  - `reproducing_bugs_external_file.py`

And here are their respective contents:

`spark_context_holder.py`
```
from pyspark.sql import SparkSession
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("kac_walk_experiment")
sc = SparkContext(conf=conf)
spark = SparkSession(sc)
log4jLogger = sc._jvm.org.apache.log4j
log = log4jLogger.LogManager.getLogger("dbg_et")
# sc.setLogLevel("ALL")

def getParallelismAlternative():
    return int(sc.getConf().get('spark.cores.max'))
```

`__init__.py`
```
from . import reproducing_bugs_external_file
__all__ = [reproducing_bugs_external_file]
```

`reproducing_bugs_external_file.py`
```
import numpy
import spark_context_holder  # If this is removed - the bug stops!

def make_nums_rdd(sc):
    return sc.parallelize([1, 2, 3] * 300).map(lambda x: x * x / 1.45)

def calc_func(x):
    return x*x/1.45
```

More technical details:

- Release label: emr-5.17.0
- Hadoop distribution: Amazon 2.8.4
- Applications: Spark 2.3.1, using Python 3.4, which is the Python 3 version installed on AWS's machines to date

All this can also be seen on SO: https://stackoverflow.com/questions/54011119/an-import-issue-gets-pyspark-on-aws-stuck-in-running-status
1
answers
0
votes
0
views
asked 3 years ago

Slave Nodes Unable to Talk to Master in Spark Application

I'm attempting to run a basic word count program on an EMR cluster as a PoC, using Spark and YARN. The step immediately fails, seemingly because the slave nodes cannot contact the master in some form: YARN fails to create any containers, and the slave nodes fail to connect to the master node. I'm running the below Scala code as a spark-submit step in the EMR cluster, with the class name provided as an argument.

```
import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkConf, SparkContext}

object App {
  def main(args: Array[String]): Unit = {
    // Set logging level to ERROR
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)

    // Read the input file from S3
    val myInput = "s3://texfile.txt"

    val conf = new SparkConf().setAppName("Word Count")
    val sc = new SparkContext(conf)
    val inputData = sc.textFile(myInput, 2).cache()

    // Count lines containing the words 'islands' and 'the'
    val wordA = inputData.filter(line => line.contains("islands")).count()
    val wordB = inputData.filter(line => line.contains("the")).count()

    println("Number of lines with word 'islands' %s".format(wordA))
    println("Number of lines with word 'the' %s".format(wordB))
  }
}
```

LOGS:

**YARN RESOURCE MANAGER**
========================

```
2018-07-17 12:40:21,188 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl (AsyncDispatcher event handler): Application application_1531831066208_0001 failed 2 times due to AM Container for appattempt_1531831066208_0001_000002 exited with exitCode: 13 Failing this attempt.Diagnostics: Exception from container-launch.
Container id: container_1531831066208_0001_02_000001 Exit code: 13 Stack trace: ExitCodeException exitCode=13: at org.apache.hadoop.util.Shell.runCommand(Shell.java:972) at org.apache.hadoop.util.Shell.run(Shell.java:869) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1170) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:236) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:305) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:84) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Container exited with a non-zero exit code 13 For more detailed output, check the application tracking page: http://ip-10-162-1-222.ec2.internal:8088/cluster/app/application_1531831066208_0001 Then click on links to logs of each attempt. . Failing the application. ``` **SLAVE NODE** ====================== ``` 2018-07-17 12:37:29,827 INFO org.apache.hadoop.ipc.Client (main): Retrying connect to server: ip-10-162-1-222.ec2.internal/10.162.1.222:8025. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sl eepTime=1000 MILLISECONDS) 2018-07-17 12:37:30,828 INFO org.apache.hadoop.ipc.Client (main): Retrying connect to server: ip-10-162-1-222.ec2.internal/10.162.1.222:8025. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sl eepTime=1000 MILLISECONDS) 2018-07-17 12:37:31,829 INFO org.apache.hadoop.ipc.Client (main): Retrying connect to server: ip-10-162-1-222.ec2.internal/10.162.1.222:8025. 
Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sl eepTime=1000 MILLISECONDS) 2018-07-17 12:37:32,830 INFO org.apache.hadoop.ipc.Client (main): Retrying connect to server: ip-10-162-1-222.ec2.internal/10.162.1.222:8025. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sl eepTime=1000 MILLISECONDS) 2018-07-17 12:37:33,831 INFO org.apache.hadoop.ipc.Client (main): Retrying connect to server: ip-10-162-1-222.ec2.internal/10.162.1.222:8025. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sl eepTime=1000 MILLISECONDS) 2018-07-17 12:37:34,832 INFO org.apache.hadoop.ipc.Client (main): Retrying connect to server: ip-10-162-1-222.ec2.internal/10.162.1.222:8025. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sl eepTime=1000 MILLISECONDS) 2018-07-17 12:37:35,832 INFO org.apache.hadoop.ipc.Client (main): Retrying connect to server: ip-10-162-1-222.ec2.internal/10.162.1.222:8025. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sl eepTime=1000 MILLISECONDS) 2018-07-17 12:37:36,833 INFO org.apache.hadoop.ipc.Client (main): Retrying connect to server: ip-10-162-1-222.ec2.internal/10.162.1.222:8025. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sl eepTime=1000 MILLISECONDS) 2018-07-17 12:37:37,834 INFO org.apache.hadoop.ipc.Client (main): Retrying connect to server: ip-10-162-1-222.ec2.internal/10.162.1.222:8025. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sl eepTime=1000 MILLISECONDS) 2018-07-17 12:37:38,835 INFO org.apache.hadoop.ipc.Client (main): Retrying connect to server: ip-10-162-1-222.ec2.internal/10.162.1.222:8025. 
Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sl eepTime=1000 MILLISECONDS) 2018-07-17 12:37:38,836 WARN org.apache.hadoop.ipc.Client (main): Failed to connect to server: ip-10-162-1-222.ec2.internal/10.162.1.222:8025: retries get failed due to exceeded maximum allowed retries number: 10 java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531) at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:685) at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:788) at org.apache.hadoop.ipc.Client$Connection.access$3500(Client.java:410) at org.apache.hadoop.ipc.Client.getConnection(Client.java:1550) at org.apache.hadoop.ipc.Client.call(Client.java:1381) at org.apache.hadoop.ipc.Client.call(Client.java:1345) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:227) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116) at com.sun.proxy.$Proxy75.registerNodeManager(Unknown Source) at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.registerNodeManager(ResourceTrackerPBClientImpl.java:73) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:409) at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:163) at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:155) at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:346) at com.sun.proxy.$Proxy76.registerNodeManager(Unknown Source)
```

I can ping the slave nodes fine from the master, and vice versa, when using strictly the IP addresses. I suspect the issue might have to do with DNS lookups going wonky, but I am unsure how to debug this any further, and can't seem to find any relevant settings when creating the EMR cluster. Any time spent helping me with this issue is greatly appreciated.
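Since ICMP works but the NodeManager's TCP registration is refused, the useful check is per-port rather than ping. A sketch using bash's `/dev/tcp`, to be run from a core node (the hostname is the one from the logs above and only resolves inside that cluster; 8025 is the ResourceManager's resource-tracker port from the log, with 8030 and 8032 being the usual scheduler and client ports on EMR):

```shell
# Probe the ResourceManager ports a NodeManager needs, from a core node.
MASTER="ip-10-162-1-222.ec2.internal"   # hostname taken from the logs above

for port in 8025 8030 8032; do
    if timeout 2 bash -c "exec 3<>/dev/tcp/$MASTER/$port" 2>/dev/null; then
        echo "port $port open"
    else
        echo "port $port closed or unreachable"
    fi
done
```

If the ports are closed while ping works, the usual culprits on EMR are the EMR-managed master/slave security groups missing their intra-cluster rules, or a VPC whose DNS settings (enableDnsSupport/enableDnsHostnames, or a custom resolver) break the internal `.ec2.internal` hostnames, which would fit the DNS suspicion above.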
1
answers
0
votes
0
views
asked 4 years ago