
Questions tagged with AWS Lake Formation


Querying Latest Available Partition

I am building an ETL pipeline that relies mainly on state machines, Athena, and the Glue Catalog. In general, things work as follows:

1. A table, partitioned by "version", exists in the Glue Catalog. The table represents the output destination of some ETL process.
2. A step function (managed by some other process) executes "INSERT INTO" Athena queries. The step function supplies a "version" that is used as part of the "INSERT INTO" query so that new data can be appended to the table defined in (1).

The table contains all "versions": it is a historical table that grows over time.

My question is: what is a good way of exposing a view/table that allows someone (or something) to query only the latest "version" partition of a given historically partitioned table?

I've looked into other table types AWS offers, including Governed tables and Iceberg tables. Each seems to have some incompatibility with our existing or planned future architecture:

1. Governed tables do not support writes via Athena insert queries. Only Glue ETL/Spark seems to be supported at the moment.
2. Iceberg tables do not support Lake Formation data filters (which we'd like to use in the future to control data access).
3. Iceberg tables also seem to have poor performance. Anecdotally, it can take several seconds to insert a small handful of rows into a given Iceberg table. I'd worry about future performance when we want to insert a million rows.

Any guidance would be appreciated!
1 answer, 0 votes, 51 views, asked a month ago
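
One possible approach to the question above, sketched under assumptions rather than offered as a confirmed answer: define an Athena view that filters the historical table to its highest "version" value. The helper below only builds the `CREATE OR REPLACE VIEW` statement. The database name `etl`, the table name `orders`, and the `_latest` view suffix are hypothetical, and the sketch assumes that `max(version)` really identifies the newest partition (string-typed versions compare lexicographically, so "10" sorts before "9"). Whether Athena prunes partitions efficiently for this subquery is not verified here.

```python
import re

# Plain SQL identifiers only, so the generated statement cannot be
# altered by unexpected characters in the supplied names.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def latest_partition_view_sql(database, table, partition_col="version",
                              view_suffix="_latest"):
    """Build a CREATE OR REPLACE VIEW statement that exposes only the
    rows whose partition column equals the maximum value in the table."""
    for name in (database, table, partition_col, view_suffix):
        if not _IDENT.match(name):
            raise ValueError(f"unsupported identifier: {name!r}")
    source = f'"{database}"."{table}"'
    view = f'"{database}"."{table}{view_suffix}"'
    return (
        f"CREATE OR REPLACE VIEW {view} AS\n"
        f"SELECT * FROM {source}\n"
        f'WHERE "{partition_col}" = '
        f'(SELECT max("{partition_col}") FROM {source})'
    )


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    print(latest_partition_view_sql("etl", "orders"))
```

The resulting statement could be run once in Athena (for example from the same step function that performs the inserts). Because a view is evaluated at query time, it would not need to be recreated after each new "version" is written.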

Data Mesh on AWS Lake Formation

Hi,

I'm building a data mesh in AWS Lake Formation. The idea is to have 4 accounts:

account 0: main account
account 1: central data governance
account 2: data producer
account 3: data consumer

I have been looking for information about how to implement the mesh in AWS, and I'm following some tutorials that are very similar to what I'm doing:

https://catalog.us-east-1.prod.workshops.aws/workshops/78572df7-d2ee-4f78-b698-7cafdb55135d/en-US/lakeformation-basics/cross-account-data-mesh
https://aws.amazon.com/blogs/big-data/design-a-data-mesh-architecture-using-aws-lake-formation-and-aws-glue/
https://aws.amazon.com/blogs/big-data/build-a-data-sharing-workflow-with-aws-lake-formation-for-your-data-mesh/

However, after creating the bucket and uploading some CSV data to it (in the producer account), I don't know whether I first have to register the data in the Glue catalog in the producer account, or whether I can do it directly in Lake Formation as described here: https://catalog.us-east-1.prod.workshops.aws/workshops/78572df7-d2ee-4f78-b698-7cafdb55135d/en-US/lakeformation-basics/databases (Does this depend on whether one uses Glue permissions or Lake Formation permissions in the Lake Formation configuration?)

In fact, I created the database and the table in Glue first, and when I then open the database and table sections in Lake Formation, the database and table created from Glue appear there without my doing anything. Even if I disable the options

"Use only IAM access control for new databases"
"Use only IAM access control for new tables in new databases"

both the database and the table still appear there.

Do Glue and Lake Formation share the Data Catalog? And am I doing this correctly?

Thanks,
John
1 answer, 0 votes, 52 views, asked 2 months ago
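
For context on the question above: Lake Formation uses the AWS Glue Data Catalog as its metadata store, which is why a database created in Glue also shows up in the Lake Formation console. What the two "Use only IAM access control" options influence is the default permissions on newly created resources, not whether a resource is visible. The sketch below is one way to check whether a given database still carries a grant to the `IAM_ALLOWED_PRINCIPALS` group, using the Lake Formation `list_permissions` call. The function name and the database name `producer_db` are hypothetical, and the client is passed in so the logic can be exercised without AWS credentials.

```python
IAM_ALLOWED = "IAM_ALLOWED_PRINCIPALS"


def database_has_iam_allowed_grant(lf_client, database_name):
    """Return True if any Lake Formation permission on the database is
    granted to the IAM_ALLOWED_PRINCIPALS group.

    lf_client is expected to behave like boto3.client("lakeformation").
    """
    kwargs = {"Resource": {"Database": {"Name": database_name}}}
    while True:
        page = lf_client.list_permissions(**kwargs)
        for entry in page.get("PrincipalResourcePermissions", []):
            principal = entry.get("Principal", {})
            if principal.get("DataLakePrincipalIdentifier") == IAM_ALLOWED:
                return True
        token = page.get("NextToken")
        if not token:
            return False
        # Follow pagination until every page has been inspected.
        kwargs["NextToken"] = token


# Usage against a real account (requires boto3 and credentials):
#   import boto3
#   lf = boto3.client("lakeformation")
#   print(database_has_iam_allowed_grant(lf, "producer_db"))
```

If the function returns True, access to that database is presumably still governed by IAM policies alone. Revoking that grant is the usual step before Lake Formation permissions take effect, though the exact procedure should be confirmed against the Lake Formation documentation.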