EMR Spark instance connectivity to Cassandra instance

1. Spun up an EMR instance: emr-6.10.0 with Spark 3.3.1, HBase 2.4.15, Hive 3.1.3, JupyterHub 1.5.0, Hadoop 3.3.3, ZooKeeper 3.5.10, Zeppelin 0.10.1, Phoenix 5.1.2, Presto 0.278, TensorFlow 2.11.0, and JupyterEnterpriseGateway 2.6.0. This instance also has access to S3.

2. Spun up an Amazon Keyspaces instance:

  • Replication strategy: Single-Region
  • arn:aws:cassandra:us-east-1*******
  • Created a simple table in Cassandra.

root
 |-- rowkey: string (nullable = false)
 |-- amount: string (nullable = true)
 |-- source: string (nullable = true)

3. SSH'd into EMR master node.

  • Locally, generated an AWS access key pair (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY).
  • In the SSH terminal, ran 'aws configure' and provided the necessary info as prompted.

4. Start pyspark session

pyspark --files https://blackline-ic-us-dev-****.s3.amazonaws.com/JoesTest/application.conf \
  --conf spark.cassandra.connection.config.profile.path=https://blackline-ic-us-dev-****.s3.amazonaws.com/JoesTest/application.conf \
  --packages com.crealytics:spark-excel_2.12:3.1.3_0.19.0,software.aws.mcs:aws-sigv4-auth-cassandra-java-driver-plugin:4.0.9,com.datastax.spark:spark-cassandra-connector_2.12:3.4.1 \
  --conf spark.sql.extensions=com.datastax.spark.connector.CassandraSparkExtensions \
  --conf spark.dynamicAllocation.enabled=false \
  --master yarn --driver-memory 3g --executor-memory 3g --num-executors 2 --executor-cores 2 \
  --repositories https://repo1.maven.org/maven2/
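The --packages value in the command above has to reach Spark as one comma-separated argument with no spaces or line breaks inside it, which is easy to get wrong when the command is pasted across several lines. As a minimal sketch (the helper name build_pyspark_command is hypothetical, and the redacted S3 paths are left out on purpose), the argument list could be assembled in Python and printed before running it:

```python
import shlex


def build_pyspark_command(packages, conf, extra_args=()):
    """Assemble a pyspark argument list (hypothetical helper, for illustration only)."""
    cmd = ["pyspark"]
    # Maven coordinates must be joined with commas and no whitespace.
    cmd += ["--packages", ",".join(packages)]
    # Each Spark setting becomes its own --conf key=value pair.
    for key, value in conf.items():
        cmd += ["--conf", f"{key}={value}"]
    cmd += list(extra_args)
    return cmd


# Coordinates and settings copied from the command in the post.
packages = [
    "com.crealytics:spark-excel_2.12:3.1.3_0.19.0",
    "software.aws.mcs:aws-sigv4-auth-cassandra-java-driver-plugin:4.0.9",
    "com.datastax.spark:spark-cassandra-connector_2.12:3.4.1",
]
conf = {
    "spark.sql.extensions": "com.datastax.spark.connector.CassandraSparkExtensions",
    "spark.dynamicAllocation.enabled": "false",
}
command = build_pyspark_command(
    packages,
    conf,
    extra_args=["--master", "yarn", "--num-executors", "2", "--executor-cores", "2"],
)

# Print a shell-safe, single-line version to inspect or paste.
print(shlex.join(command))
```

The --files and profile-path arguments would be added the same way; they are omitted here only because the bucket name is redacted in the post.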

5. Prepare to connect to Cassandra

import org.apache.spark.sql.cassandra 
spark.conf.set("spark.sql.catalog.myCatalog", "com.datastax.spark.connector.datasource.CassandraCatalog") 
df = spark.read.table("myCatalog.*****.dummy") 
df.printSchema() 

returns: root |-- rowkey: string (nullable = false) |-- amount:

6. Read data in the table

df.show()
  • results in an exception.
  Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 607, in show
    print(self._jdf.showString(n, 20, vertical))
  File "/usr/lib/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1322, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 190, in deco
    return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o84.showString.
: java.lang.UnsupportedOperationException: empty.reduceLeft
        at scala.collection.mutable.ArrayBuffer.scala$collection$IndexedSeqOptimized$$super$reduceLeft(ArrayBuffer.scala:49)

...

Question: I'm not sure what this exception is telling us, and I need help troubleshooting it. Is it our network setup? Is it the way I am trying to use EMR?

More info: application.conf

datastax-java-driver {
	basic.contact-points = ["cassandra.us-east-1.amazonaws.com:9142"]
	basic.load-balancing-policy {
		class = DefaultLoadBalancingPolicy
		local-datacenter = us-east-1
		slow-replica-avoidance = false
	}
	advanced {
		auth-provider = {
			class = software.aws.mcs.auth.SigV4AuthProvider
			aws-region = us-east-1
		}
		ssl-engine-factory {
			class = DefaultSslEngineFactory
			truststore-path = "/home/hadoop/cassandra_truststore.jks"
			truststore-password = "*****"
			hostname-validation=false
		}
	}
}
Asked 7 months ago, viewed 281 times
1 Answer

Hello There,

Thank you for the query.

I understand that you are trying to connect Spark on EMR to an Amazon Keyspaces (for Apache Cassandra) table. The connection succeeds, but when you read data from the table with df.show(), you get the following error: "py4j.protocol.Py4JJavaError: An error occurred while calling o84.showString. : java.lang.UnsupportedOperationException: empty.reduceLeft". You would like to understand the root cause of the error and how to fix it.
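On its own, the message text only says that Scala's reduceLeft was called on an empty collection; it does not say which collection was empty or why. A rough Python analogue of that failure mode (an illustration only, not the connector's code; reduce_left is a hypothetical helper) looks like this:

```python
from functools import reduce


def reduce_left(items, combine):
    """Fold items from the left with no initial value, similar to Scala's reduceLeft."""
    return reduce(combine, items)


# With at least one element there is something to start the fold from.
print(reduce_left([3, 1, 2], max))

# With no elements there is no starting value, so the call fails.
try:
    reduce_left([], max)
except TypeError as exc:
    print("reduce on an empty sequence failed:", exc)
```

So the stack trace points at some list inside the read path that came back empty, which is why the logs are needed to say more.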

I had a look at the steps you followed. From what I can see so far, this needs detailed analysis and access to logs, resources, and account information, which are not public. On checking internally, I found that you have already opened a ticket with AWS Premium Support. I will take ownership of the case and help you investigate the issue there. I will also post the solution here once we get past the issue.

Hope you have a great day ahead.

Rajiv_M, AWS Support Engineer
Answered 7 months ago
