Querying Latest Available Partition

0

I am building an ETL pipeline using primarily state machines, Athena, and the Glue catalog. In general things work in the following way:

  1. A table, partitioned by "version", exists in the Glue Catalog. The table represents the output destination of some ETL process.
  2. A step function (managed by some other process) executes "INSERT INTO" athena queries. The step function supplies a "version" that is used as part of the "INSERT INTO" query so that new data can be appended into the table defined in (1). The table contains all "versions" - it's a historical table that grows over time.

My question is: What is a good way of exposing a view/table that allows someone (or something) to query only the latest "version" partition for a given historically partitioned table?

I've looked into other table types AWS offers, including Governed tables and Iceberg tables. Each seems to have some incompatibility with our existing or planned future architecture:

  1. Governed tables do not support writes via athena insert queries. Only Glue ETL/Spark seems to be supported at the moment.
  2. Iceberg tables do not support Lake Formation data filters (which we'd like to use in the future to control data access)
  3. Iceberg tables also seem to have poor performance. Anecdotally, it can take several seconds to insert a very small handful of rows to a given iceberg table. I'd worry about future performance when we want to insert a million rows.

Any guidance would be appreciated!

1개 답변
0

I believe CREATING a VIEW with the latest version and exposing that view to the user would be appropraite.

CREATE OR REPLACE  VIEW latest_version_only AS 
select * from "datalake"."table1" 
where version_no = (SELECT MAX(version_no) FROM "datalake"."table1");
profile pictureAWS
답변함 2년 전

로그인하지 않았습니다. 로그인해야 답변을 게시할 수 있습니다.

좋은 답변은 질문에 명확하게 답하고 건설적인 피드백을 제공하며 질문자의 전문적인 성장을 장려합니다.

질문 답변하기에 대한 가이드라인

관련 콘텐츠