EMR Spark instance connectivity to Cassandra instance

1. Spun up an EMR instance on release emr-6.10.0 with Spark 3.3.1, HBase 2.4.15, Hive 3.1.3, JupyterHub 1.5.0, Hadoop 3.3.3, ZooKeeper 3.5.10, Zeppelin 0.10.1, Phoenix 5.1.2, Presto 0.278, TensorFlow 2.11.0, and JupyterEnterpriseGateway 2.6.0. This instance also has access to S3.

2. Spun up an Amazon Keyspaces instance:

  • Replication strategy: Single-Region
  • arn:aws:cassandra:us-east-1*******
  • Created a simple table in Cassandra.

root
 |-- rowkey: string (nullable = false)
 |-- amount: string (nullable = true)
 |-- source: string (nullable = true)

3. SSH'd into EMR master node.

  • Locally, generate an access key pair (AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY).
  • In the SSH terminal, run 'aws configure' and provide the information as prompted.
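
As a quick sanity check after this step, the following is a minimal sketch that confirms 'aws configure' wrote a usable profile. It assumes the default credentials file location (~/.aws/credentials) and the "default" profile; the helper name profile_has_keys is hypothetical and only for illustration. It only inspects the file on the node where it runs and does not validate the keys against AWS.

```python
import configparser
from pathlib import Path


def profile_has_keys(credentials_path, profile="default"):
    """Return True if the profile defines both an access key ID and a secret key."""
    parser = configparser.ConfigParser()
    # read() silently skips missing files, so a missing file yields False below.
    parser.read(credentials_path)
    if not parser.has_section(profile):
        return False
    section = parser[profile]
    return bool(section.get("aws_access_key_id")) and bool(
        section.get("aws_secret_access_key")
    )


if __name__ == "__main__":
    # Assumption: 'aws configure' wrote to the default location.
    path = Path.home() / ".aws" / "credentials"
    print(f"{path}: default profile has keys = {profile_has_keys(path)}")
```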

4. Start pyspark session

pyspark --files https://blackline-ic-us-dev-****.s3.amazonaws.com/JoesTest/application.conf \
  --conf spark.cassandra.connection.config.profile.path=https://blackline-ic-us-dev-****.s3.amazonaws.com/JoesTest/application.conf \
  --packages com.crealytics:spark-excel_2.12:3.1.3_0.19.0,software.aws.mcs:aws-sigv4-auth-cassandra-java-driver-plugin:4.0.9,com.datastax.spark:spark-cassandra-connector_2.12:3.4.1 \
  --conf spark.sql.extensions=com.datastax.spark.connector.CassandraSparkExtensions \
  --conf spark.dynamicAllocation.enabled=false \
  --master yarn --driver-memory 3g --executor-memory 3g --num-executors 2 --executor-cores 2 \
  --repositories https://repo1.maven.org/maven2/
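
Because the command above is long and easy to mistype (the package list in particular must not contain spaces), here is a small sketch that assembles the same arguments from structured data. The helper build_pyspark_command and the CONF_URL placeholder are assumptions for illustration; the real S3 URL is redacted in this post and is not filled in here.

```python
import shlex


def build_pyspark_command(files, packages, conf, options):
    """Assemble a pyspark command line as a list of arguments."""
    # Packages are joined with commas only; a stray space would split the list.
    args = ["pyspark", "--files", files, "--packages", ",".join(packages)]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    for flag, value in options.items():
        args += [f"--{flag}", str(value)]
    return args


# Placeholder for the redacted application.conf URL used in step 4.
CONF_URL = "https://example.invalid/application.conf"

command = build_pyspark_command(
    files=CONF_URL,
    packages=[
        "com.crealytics:spark-excel_2.12:3.1.3_0.19.0",
        "software.aws.mcs:aws-sigv4-auth-cassandra-java-driver-plugin:4.0.9",
        "com.datastax.spark:spark-cassandra-connector_2.12:3.4.1",
    ],
    conf={
        "spark.cassandra.connection.config.profile.path": CONF_URL,
        "spark.sql.extensions": "com.datastax.spark.connector.CassandraSparkExtensions",
        "spark.dynamicAllocation.enabled": "false",
    },
    options={
        "master": "yarn",
        "driver-memory": "3g",
        "executor-memory": "3g",
        "num-executors": 2,
        "executor-cores": 2,
        "repositories": "https://repo1.maven.org/maven2/",
    },
)

# Print a shell-safe version that can be pasted into the SSH terminal.
print(shlex.join(command))
```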

5. Prepare to connect to Cassandra

import org.apache.spark.sql.cassandra 
spark.conf.set("spark.sql.catalog.myCatalog", "com.datastax.spark.connector.datasource.CassandraCatalog") 
df = spark.read.table("myCatalog.*****.dummy") 
df.printSchema() 

returns: root |-- rowkey: string (nullable = false) |-- amount:

6. Read data in the table

df.show()
  • This results in an exception:
  Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 607, in show
    print(self._jdf.showString(n, 20, vertical))
  File "/usr/lib/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1322, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 190, in deco
    return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o84.showString.
: java.lang.UnsupportedOperationException: empty.reduceLeft
        at scala.collection.mutable.ArrayBuffer.scala$collection$IndexedSeqOptimized$$super$reduceLeft(ArrayBuffer.scala:49)

...
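
To make the trace easier to read, here is a minimal sketch that pulls the underlying Java exception, its message, and the first Java stack frame out of a Py4J traceback. The function name summarize_py4j_error and the regular expressions are my own assumptions about the text layout shown above; this only summarizes the trace and does not diagnose the cause.

```python
import re

# Matches the ": java.lang.SomeException: message" line that Py4J prints.
JAVA_EXCEPTION = re.compile(
    r"^: (?P<cls>[\w.$]+(?:Exception|Error))(?:: (?P<msg>.*))?$", re.MULTILINE
)
# Matches a Java stack frame line such as "    at pkg.Class.method(File.scala:49)".
FRAME = re.compile(r"^\s+at (?P<frame>\S+)\(", re.MULTILINE)

SAMPLE = """py4j.protocol.Py4JJavaError: An error occurred while calling o84.showString.
: java.lang.UnsupportedOperationException: empty.reduceLeft
        at scala.collection.mutable.ArrayBuffer.scala$collection$IndexedSeqOptimized$$super$reduceLeft(ArrayBuffer.scala:49)
"""


def summarize_py4j_error(text):
    """Return the Java exception class, message, and first frame, or None."""
    match = JAVA_EXCEPTION.search(text)
    if match is None:
        return None
    # Look for the first stack frame after the exception line.
    frame = FRAME.search(text, match.end())
    return {
        "exception": match.group("cls"),
        "message": match.group("msg"),
        "first_frame": frame.group("frame") if frame else None,
    }


if __name__ == "__main__":
    print(summarize_py4j_error(SAMPLE))
```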

Question: I'm not sure what this exception is telling us, and I need help troubleshooting. Is it our network setup? Is it the way I am trying to use EMR?

More info: application.conf

datastax-java-driver {
	basic.contact-points = ["cassandra.us-east-1.amazonaws.com:9142"]
	basic.load-balancing-policy {
		class = DefaultLoadBalancingPolicy
		local-datacenter = us-east-1
		slow-replica-avoidance = false
	}
	advanced {
		auth-provider = {
			class = software.aws.mcs.auth.SigV4AuthProvider
			aws-region = us-east-1
		}
		ssl-engine-factory {
			class = DefaultSslEngineFactory
			truststore-path = "/home/hadoop/cassandra_truststore.jks"
			truststore-password = "*****"
			hostname-validation=false
		}
	}
}
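
For anyone checking a similar setup, here is a rough sketch that extracts the contact point and truststore path from the config text so they can be verified on each node. It is a simple regex reader written for this layout, not a real HOCON parser, and the helper names contact_points and truststore_path are hypothetical. SAMPLE_CONF is a trimmed copy of the config above.

```python
import re
from pathlib import Path

SAMPLE_CONF = """
datastax-java-driver {
    basic.contact-points = ["cassandra.us-east-1.amazonaws.com:9142"]
    advanced {
        ssl-engine-factory {
            class = DefaultSslEngineFactory
            truststore-path = "/home/hadoop/cassandra_truststore.jks"
        }
    }
}
"""


def contact_points(conf_text):
    """Return (host, port) pairs listed under contact-points."""
    match = re.search(r"contact-points\s*=\s*\[(.*?)\]", conf_text, re.DOTALL)
    if match is None:
        return []
    points = []
    for entry in re.findall(r'"([^"]+)"', match.group(1)):
        host, sep, port = entry.rpartition(":")
        if not sep:
            raise ValueError(f"expected host:port, got {entry!r}")
        points.append((host, int(port)))
    return points


def truststore_path(conf_text):
    """Return the configured truststore path, or None if it is not set."""
    match = re.search(r'truststore-path\s*=\s*"([^"]+)"', conf_text)
    return Path(match.group(1)) if match else None


if __name__ == "__main__":
    print("contact points:", contact_points(SAMPLE_CONF))
    store = truststore_path(SAMPLE_CONF)
    # The path is local, so this only checks the node the script runs on.
    print("truststore:", store, "exists =", store.exists())
```

Since the truststore path is a local filesystem path, running the check on the master node says nothing about whether the file is present on other nodes.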
asked 7 months ago, 269 views
1 Answer

Hello There,

Thank you for the query.

I understand that you are trying to establish a connection between EMR Spark and an Amazon Keyspaces instance. After connecting, reading data from the table with df.show() fails with the error "py4j.protocol.Py4JJavaError: An error occurred while calling o84.showString. : java.lang.UnsupportedOperationException: empty.reduceLeft". You would like to understand the root cause of the error and how to fix it.

I looked into the steps you followed. From what I can see so far, this needs detailed analysis and access to logs, resources, and account information that are not public. On checking internally, I found that you have already opened a ticket with AWS Premium Support. I will take ownership of the case and assist you with investigating the issue there. I will also post the solution here once we resolve the issue.

Hope you have a great day ahead.

Rajiv_M (AWS Support Engineer)
answered 7 months ago
