Duplicate Rows When Loading Delta Native tables In Athena Engine V3

0

Hello,

We use Spark jobs running on EMR to create and update tables which we manage in Glue (they are Parquet files in S3).

We have recently switched to Athena engine v3 and are working to implement our tables as delta native tables.

The goal is to create a new native delta table on top of our S3 address and to populate that table properly. We are running the following SQL commands in Spark (from EMR):

DROP TABLE IF EXISTS <aws_ID>/orders<tablename>;

CREATE EXTERNAL TABLE <aws_ID>/orders.<tablename>; (<schema>) PARTITIONED BY ( year STRING, month STRING, day STRING, dt STRING) LOCATION 's3://REDACTED' ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' TBLPROPERTIES('table_type' = 'DELTA');

At this point the table is created with the proper schema and zero rows. We run the following command to populate it: MSCK REPAIR TABLE <tablename>

The table contains many rows that are exact duplicates on every field. How can we avoid this? Should we approach it differently?

Vlad
질문됨 일 년 전384회 조회
1개 답변
1

Sounds you are creating a table that is not really Delta but just reading the parquet files.
For native format you just specify the location and type: https://docs.aws.amazon.com/athena/latest/ug/delta-lake-tables.html
and I don't think you can use the REPAIR, you would have to use a crawler or synchronize using the tool provided by Delta.

profile pictureAWS
전문가
답변함 일 년 전
  • Thanks Gonzalo! Makes sense. The DDL you linked works from within Athena, but not from within Spark.

    Do you know how to create a native delta table for Athena engine v3 in PySpark?

  • I think it's doable, in Spark write the data with the delta library and then create the table in the catalog using CREATE EXTERNAL TABLE specifying the columns, location and 'table_type'='DELTA'

로그인하지 않았습니다. 로그인해야 답변을 게시할 수 있습니다.

좋은 답변은 질문에 명확하게 답하고 건설적인 피드백을 제공하며 질문자의 전문적인 성장을 장려합니다.

질문 답변하기에 대한 가이드라인

관련 콘텐츠