
Questions tagged with Data Lakes


AWS Lake Formation: (AccessDeniedException) when calling the GetTable operation: Insufficient Lake Formation permission(s) on table

I have implemented Lake Formation on my data bucket. I have a Step Function in which one step runs a Glue job that reads from and writes to the Data Catalog. I have upgraded my data lake permissions as described [here][1].

The service role that runs my Step Function has a root-type policy (granted just for debugging this issue):

```yaml
Statement:
  - Effect: "Allow"
    Action:
      - "*"
    Resource:
      - "*"
```

On Lake Formation the service role has:

- Administrator rights
- Database creation rights (and Grantable)
- Data Location access to the entire bucket (and Grantable)
- Super rights on the read and write databases (and Grantable)
- Super rights on ALL tables within the above databases (and Grantable)

The bucket is not encrypted. But, somehow, the role's access to the tables is denied with the error:

```
(AccessDeniedException) when calling the GetTable operation: Insufficient Lake Formation permission(s) on table
```

What's really strange is that the Glue job succeeds when writing to some tables and fails on others, and there is no substantial difference across the tables: all of them are under the same S3 prefix, Parquet files, partitioned on the same key. Given the abundance of permissions granted, I am really clueless about what is causing the error. Please, send help.

[1]: https://docs.aws.amazon.com/lake-formation/latest/dg/upgrade-glue-lake-formation.html
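For reference, one way to narrow a case like this down is to list what Lake Formation has actually recorded for a failing table and diff it against a table the job writes successfully. A hedged sketch using the boto3 Lake Formation client (the database and table names below are placeholders, not taken from the question):

```python
# Diagnostic sketch: compare recorded Lake Formation permissions on a
# failing table vs. a succeeding one. Names are placeholders.

# import boto3  # AWS SDK; assumed available in the job environment

def table_resource(database, table):
    # Resource shape accepted by Lake Formation's ListPermissions API
    return {"Table": {"DatabaseName": database, "Name": table}}

# client = boto3.client("lakeformation")
# resp = client.list_permissions(
#     Resource=table_resource("my_db", "failing_table")
# )
# for entry in resp["PrincipalResourcePermissions"]:
#     print(entry["Principal"], entry["Permissions"])
```

Running this for both a failing and a succeeding table makes any per-table difference in recorded grants visible directly.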
1 answer · 0 votes · 142 views · asked 6 months ago

Grouping of partitioned dataframes

I have a large dataset (table) with >1e9 records (rows) in Glue. The table is partitioned by column A, which is an n-letter substring of column B. For example:

| A (partition key) | B | ... |
| --- | --- | --- |
| abc | abc123... | ... |
| abc | abc123... | ... |
| abc | abc456... | ... |
| abc | abc456... | ... |
| abc | abc456... | ... |
| abc | abc789... | ... |
| abc | abc789... | ... |
| ... | ... | ... |
| xyz | xyz123... | ... |
| xyz | xyz123... | ... |
| xyz | xyz123... | ... |
| xyz | xyz456... | ... |
| xyz | xyz456... | ... |
| xyz | xyz456... | ... |
| xyz | xyz789... | ... |
| xyz | xyz789... | ... |

There are >1e6 possible distinct values of column B and correspondingly far fewer for column A (maybe 1e3). Now I need to group records/rows by column B, and my assumption is that the partitioning by column A could be advantageous here: it would be sufficient to load dataframes from single partitions for grouping instead of running the operation on the entire table. (Partitioning by column B directly would lead to an unreasonably large number of partitions.)

Is my assumption right? How would I tell my Glue job about the link between columns A and B so that it profits from the partitioning? Alternatively, I could handle the 1e3 dataframes (one per partition) separately in my Glue job and merge them later on, but that looks a bit complicated to me.

This question is a follow-up to https://repost.aws/questions/QUwxdl4EwTQcKBuL8MKCU0EQ/are-partitions-advantageous-for-groupby-operations-in-glue-jobs.
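The assumption itself can be checked in isolation: since A is always a prefix of B, two rows with equal B necessarily land in the same partition, so grouping within each partition and merging the results equals a global group-by on B. A minimal pure-Python sketch of that argument (not Glue/Spark code; the prefix length n = 3 and the sample rows are made up):

```python
from collections import defaultdict

n = 3  # assumed length of the prefix linking A to B

def partition_key(b):
    # A is the first n letters of B
    return b[:n]

rows = [("abc123", 1), ("abc123", 2), ("abc456", 3), ("xyz789", 4)]

# Bucket rows by partition key A, mirroring the table layout.
partitions = defaultdict(list)
for b, value in rows:
    partitions[partition_key(b)].append((b, value))

# Group by B inside each partition, then merge. Because equal B implies
# equal A, no group is split across partitions and the merge is a union.
grouped = {}
for part_rows in partitions.values():
    for b, value in part_rows:
        grouped.setdefault(b, []).append(value)

# Same result as grouping the whole table at once.
global_grouped = {}
for b, value in rows:
    global_grouped.setdefault(b, []).append(value)
assert grouped == global_grouped
```

In Spark terms this is why a per-partition (or partition-filtered) group-by on B is safe here, whereas it would not be if B values could straddle partitions.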
1 answer · 0 votes · 25 views · asked 7 months ago