スキップしてコンテンツを表示

spark.sql not working on EMR (Serverless)

0

The following script does not create the table in the S3 location indicated by the query. I tested it locally and the Delta Json file is created and contains the information about the created table.

from pyspark.sql import SparkSession

spark = (SparkSession
    .builder
    .enableHiveSupport()
    .appName('omop_ddl')
    .getOrCreate()
    )


spark.sql(f"""
CREATE
OR REPLACE TABLE CONCEPT (
  CONCEPT_ID LONG,
  CONCEPT_NAME STRING,
  DOMAIN_ID STRING,
  VOCABULARY_ID STRING,
  CONCEPT_CLASS_ID STRING,
  STANDARD_CONCEPT STRING,
  CONCEPT_CODE STRING,
  VALID_START_DATE DATE,
  VALID_END_DATE DATE,
  INVALID_REASON STRING
) USING DELTA
LOCATION 's3a://ls-dl-mvp-s3deltalake/health_lakehouse/silver/concept';
""")

The configuration parameters are the following ones:

--conf spark.jars=s3a://ls-dl-mvp-s3development/spark_jars/delta-core_2.12-2.1.0.jar,s3a://ls-dl-mvp-s3development/spark_jars/delta-storage-2.1.0.jar 
--conf spark.executor.cores=1 
--conf spark.executor.memory=4g 
--conf spark.driver.cores=1 
--conf spark.driver.memory=4g 
--conf spark.executor.instances=1 

I tried to modify the location in the query by inserting a non-existent bucket and the script did not go into error. Am I forgetting something? Thank you very much for your help

質問済み 4年前268ビュー
1回答
0

Hi anselboro,

For the version you are using, additional configuration is required for EMR Serverless with Delta Lake according to the EMR Serverless Delta Lake documentation.

If you can provide which EMR version you are using, I can provide more specific configuration steps for that version.

For EMR 6.9.0 and higher (Delta Lake included):

--conf spark.jars=/usr/share/aws/delta/lib/delta-core.jar,/usr/share/aws/delta/lib/delta-storage.jar
--conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog

For EMR 7.0.0 and higher (Delta Lake 3.0.0+):

--conf spark.jars=/usr/share/aws/delta/lib/delta-spark.jar,/usr/share/aws/delta/lib/delta-storage.jar
--conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog

What You're Missing

Your current configuration:

--conf spark.jars=s3a://ls-dl-mvp-s3development/spark_jars/delta-core_2.12-2.1.0.jar,s3a://ls-dl-mvp-s3development/spark_jars/delta-storage-2.1.0.jar

Missing configurations:

--conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog

Without these, Spark doesn't register the Delta catalog plugin, so USING DELTA is not recognized.

Thank you.

AWS
エキスパート
回答済み 3ヶ月前

ログインしていません。 ログイン 回答を投稿する。

優れた回答とは、質問に明確に答え、建設的なフィードバックを提供し、質問者の専門分野におけるスキルの向上を促すものです。

関連するコンテンツ