1. Spun up an EMR cluster:
- Release: emr-6.10.0
- Applications: Spark 3.3.1, HBase 2.4.15, Hive 3.1.3, JupyterHub 1.5.0, Hadoop 3.3.3, ZooKeeper 3.5.10, Zeppelin 0.10.1, Phoenix 5.1.2, Presto 0.278, TensorFlow 2.11.0, JupyterEnterpriseGateway 2.6.0
- The cluster also has access to S3.
2. Set up an Amazon Keyspaces (for Apache Cassandra) keyspace:
- Replication strategy: Single-Region
- arn:aws:cassandra:us-east-1*******
- Created a simple table in Cassandra with the following schema:
root
|-- rowkey: string (nullable = false)
|-- amount: string (nullable = true)
|-- source: string (nullable = true)
3. SSH'd into the EMR master node.
- Locally, generated an AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.
- In the SSH terminal, ran 'aws configure' and provided the necessary info as prompted.
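To confirm that 'aws configure' actually left usable credentials on the node, a small check like the one below can help. It is only a sketch: it assumes the default file location (~/.aws/credentials) and the [default] profile, and the function name has_static_credentials is made up for illustration.

```python
import configparser
from pathlib import Path


def has_static_credentials(path=None, profile="default"):
    """Return True if the credentials file defines both keys for the profile."""
    # Assumption: `aws configure` wrote to the default location.
    path = Path(path) if path else Path.home() / ".aws" / "credentials"
    parser = configparser.ConfigParser()
    parser.read(path)  # a missing file is silently skipped
    if not parser.has_section(profile):
        return False
    section = parser[profile]
    return bool(section.get("aws_access_key_id")) and bool(
        section.get("aws_secret_access_key")
    )


if __name__ == "__main__":
    print("credentials found:", has_static_credentials())
```

This only inspects the node and user it runs as, so it says nothing about what the other nodes in the cluster can see.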
4. Start pyspark session
pyspark \
  --files https://blackline-ic-us-dev-****.s3.amazonaws.com/JoesTest/application.conf \
  --conf spark.cassandra.connection.config.profile.path=https://blackline-ic-us-dev-****.s3.amazonaws.com/JoesTest/application.conf \
  --packages com.crealytics:spark-excel_2.12:3.1.3_0.19.0,software.aws.mcs:aws-sigv4-auth-cassandra-java-driver-plugin:4.0.9,com.datastax.spark:spark-cassandra-connector_2.12:3.4.1 \
  --conf spark.sql.extensions=com.datastax.spark.connector.CassandraSparkExtensions \
  --conf spark.dynamicAllocation.enabled=false \
  --master yarn --driver-memory 3g --executor-memory 3g --num-executors 2 --executor-cores 2 \
  --repositories https://repo1.maven.org/maven2/
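Because the command is long, it may be easier to audit when each option is assembled from named pieces. The sketch below rebuilds the same argument list in Python using only the values from the command above. The masked bucket name is kept exactly as posted, and the names CONF_URL, PACKAGES, SPARK_CONF and build_pyspark_args are illustrative, not part of any tool.

```python
import shlex

# Masked exactly as in the post; not a usable URL as written.
CONF_URL = "https://blackline-ic-us-dev-****.s3.amazonaws.com/JoesTest/application.conf"

PACKAGES = [
    "com.crealytics:spark-excel_2.12:3.1.3_0.19.0",
    "software.aws.mcs:aws-sigv4-auth-cassandra-java-driver-plugin:4.0.9",
    "com.datastax.spark:spark-cassandra-connector_2.12:3.4.1",
]

SPARK_CONF = {
    "spark.cassandra.connection.config.profile.path": CONF_URL,
    "spark.sql.extensions": "com.datastax.spark.connector.CassandraSparkExtensions",
    "spark.dynamicAllocation.enabled": "false",
}


def build_pyspark_args():
    """Return the pyspark command as a list of arguments."""
    args = ["pyspark", "--files", CONF_URL]
    for key, value in SPARK_CONF.items():
        args += ["--conf", f"{key}={value}"]
    args += ["--packages", ",".join(PACKAGES)]
    args += ["--master", "yarn", "--driver-memory", "3g", "--executor-memory", "3g"]
    args += ["--num-executors", "2", "--executor-cores", "2"]
    args += ["--repositories", "https://repo1.maven.org/maven2/"]
    return args


if __name__ == "__main__":
    # Quote each argument so the printed line can be pasted into a shell.
    print(" ".join(shlex.quote(arg) for arg in build_pyspark_args()))
```

The options come out in a slightly different order than in the original command, which should not matter because they are independent flags.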
5. Prepare to connect to Cassandra:
import org.apache.spark.sql.cassandra
spark.conf.set("spark.sql.catalog.myCatalog", "com.datastax.spark.connector.datasource.CassandraCatalog")
df = spark.read.table("myCatalog.*****.dummy")
df.printSchema()
returns: root |-- rowkey: string (nullable = false) |-- amount:
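For anyone trying to reproduce this, the catalog setup in step 5 can be wrapped in a few small helpers. This is a sketch under assumptions: it expects an existing SparkSession (the `spark` object in the pyspark shell) with the connector packages already loaded, the keyspace name is masked in the post so it is left as a parameter, and the helper names are invented for illustration.

```python
CATALOG_NAME = "myCatalog"
CATALOG_CLASS = "com.datastax.spark.connector.datasource.CassandraCatalog"


def catalog_conf(name=CATALOG_NAME):
    """Spark SQL setting that registers the Cassandra catalog under `name`."""
    return {f"spark.sql.catalog.{name}": CATALOG_CLASS}


def table_identifier(keyspace, table, catalog=CATALOG_NAME):
    """Build the three-part name passed to spark.read.table()."""
    return f"{catalog}.{keyspace}.{table}"


def read_table(spark, keyspace, table):
    """Register the catalog on an existing SparkSession and load the table."""
    for key, value in catalog_conf().items():
        spark.conf.set(key, value)
    return spark.read.table(table_identifier(keyspace, table))


# Usage inside the pyspark shell (keyspace name is a placeholder):
# df = read_table(spark, "<keyspace>", "dummy")
# df.printSchema()
```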
6. Read the data in the table:
df.show()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 607, in show
    print(self._jdf.showString(n, 20, vertical))
  File "/usr/lib/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1322, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 190, in deco
    return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o84.showString.
: java.lang.UnsupportedOperationException: empty.reduceLeft
    at scala.collection.mutable.ArrayBuffer.scala$collection$IndexedSeqOptimized$$super$reduceLeft(ArrayBuffer.scala:49)
    ...
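The trace above is cut off, and the remaining Java frames would normally show which class made the reduceLeft call. A minimal sketch for collecting them in the same pyspark session is below. It relies on the java_exception attribute that py4j attaches to Py4JJavaError; the helper name java_stack_lines and the frame limit are my own choices.

```python
def java_stack_lines(exc, limit=30):
    """Return the Java exception summary and its first `limit` stack frames."""
    java_exc = getattr(exc, "java_exception", None)
    if java_exc is None:
        # Not a Py4J error (or no JVM object attached): fall back to str().
        return [str(exc)]
    lines = [java_exc.toString()]
    for frame in list(java_exc.getStackTrace())[:limit]:
        lines.append("    at " + frame.toString())
    return lines


# Usage inside the pyspark shell (df is the DataFrame from step 5):
# try:
#     df.show()
# except Exception as exc:
#     print("\n".join(java_stack_lines(exc)))
```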
Question
I'm not sure what this exception is telling us, and I need help troubleshooting. Is it our network setup? Is it the way I am trying to use EMR?
More info:
application.conf
datastax-java-driver {
    basic.contact-points = ["cassandra.us-east-1.amazonaws.com:9142"]
    basic.load-balancing-policy {
        class = DefaultLoadBalancingPolicy
        local-datacenter = us-east-1
        slow-replica-avoidance = false
    }
    advanced {
        auth-provider = {
            class = software.aws.mcs.auth.SigV4AuthProvider
            aws-region = us-east-1
        }
        ssl-engine-factory {
            class = DefaultSslEngineFactory
            truststore-path = "/home/hadoop/cassandra_truststore.jks"
            truststore-password = "*****"
            hostname-validation = false
        }
    }
}
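Since one of the open questions is whether the network setup is at fault, a quick TLS reachability check against the contact point could be run from each node. This is a sketch using only the Python standard library. It assumes the system CA bundle can validate the endpoint's certificate (if it cannot, the handshake fails even when the port is reachable), and the helper names are hypothetical.

```python
import socket
import ssl


def split_contact_point(contact_point):
    """Split 'host:port' into (host, port)."""
    host, _, port = contact_point.rpartition(":")
    if not host or not port.isdigit():
        raise ValueError(f"expected host:port, got {contact_point!r}")
    return host, int(port)


def can_open_tls(contact_point, timeout=5.0):
    """Return True if a TCP connection and TLS handshake both succeed."""
    host, port = split_contact_point(contact_point)
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with context.wrap_socket(raw, server_hostname=host):
                return True
    except OSError:
        # Covers DNS failures, timeouts, refused connections and TLS errors.
        return False


# Usage on a cluster node:
# print(can_open_tls("cassandra.us-east-1.amazonaws.com:9142"))
```

A False result would point toward security groups, routing or certificates. A True result only shows that the node can reach the endpoint, not that authentication or the connector settings are correct.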