Skip to content

spark.sql not working on EMR (Serverless)

0

The following script does not create the table in the S3 location indicated by the query. I tested it locally and the Delta Json file is created and contains the information about the created table.

from pyspark.sql import SparkSession

spark = (SparkSession
    .builder
    .enableHiveSupport()
    .appName('omop_ddl')
    .getOrCreate()
    )


spark.sql(f"""
CREATE
OR REPLACE TABLE CONCEPT (
  CONCEPT_ID LONG,
  CONCEPT_NAME STRING,
  DOMAIN_ID STRING,
  VOCABULARY_ID STRING,
  CONCEPT_CLASS_ID STRING,
  STANDARD_CONCEPT STRING,
  CONCEPT_CODE STRING,
  VALID_START_DATE DATE,
  VALID_END_DATE DATE,
  INVALID_REASON STRING
) USING DELTA
LOCATION 's3a://ls-dl-mvp-s3deltalake/health_lakehouse/silver/concept';
""")

The configuration parameters are the following ones:

--conf spark.jars=s3a://ls-dl-mvp-s3development/spark_jars/delta-core_2.12-2.1.0.jar,s3a://ls-dl-mvp-s3development/spark_jars/delta-storage-2.1.0.jar 
--conf spark.executor.cores=1 
--conf spark.executor.memory=4g 
--conf spark.driver.cores=1 
--conf spark.driver.memory=4g 
--conf spark.executor.instances=1 

I tried to modify the location in the query by inserting a non-existent bucket and the script did not go into error. Am I forgetting something? Thank you very much for your help

asked 4 years ago259 views
1 Answer
0

Hi anselboro,

For the version you are using, additional configuration is required for EMR Serverless with Delta Lake according to the EMR Serverless Delta Lake documentation.

If you can provide which EMR version you are using, I can provide more specific configuration steps for that version.

For EMR 6.9.0 and higher (Delta Lake included):

--conf spark.jars=/usr/share/aws/delta/lib/delta-core.jar,/usr/share/aws/delta/lib/delta-storage.jar
--conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog

For EMR 7.0.0 and higher (Delta Lake 3.0.0+):

--conf spark.jars=/usr/share/aws/delta/lib/delta-spark.jar,/usr/share/aws/delta/lib/delta-storage.jar
--conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog

What You're Missing

Your current configuration:

--conf spark.jars=s3a://ls-dl-mvp-s3development/spark_jars/delta-core_2.12-2.1.0.jar,s3a://ls-dl-mvp-s3development/spark_jars/delta-storage-2.1.0.jar

Missing configurations:

--conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog

Without these, Spark doesn't register the Delta catalog plugin, so USING DELTA is not recognized.

Thank you.

AWS
EXPERT
answered 2 months ago

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.