Questions tagged with Data Lakes


Hudi Clustering

I am using EMR 6.6.0, which ships Hudi 0.10.1. I am trying to bulk_insert and do inline clustering with Hudi, but it does not seem to cluster the files up to the target file size I configured; it still produces files that are only a few KB in size. I tried the configuration below:

```python
hudi_clusteringopt = {
    'hoodie.table.name': 'myhudidataset_upsert_legacy_new7',
    'hoodie.datasource.write.recordkey.field': 'id',
    'hoodie.datasource.write.partitionpath.field': 'creation_date',
    'hoodie.datasource.write.precombine.field': 'last_update_time',
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.database': 'my_hudi_db',
    'hoodie.datasource.hive_sync.table': 'myhudidataset_upsert_legacy_new7',
    'hoodie.datasource.hive_sync.partition_fields': 'creation_date',
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
    'hoodie.datasource.write.hive_style_partitioning': 'true',
    'hoodie.datasource.write.operation': 'bulk_insert',
}

try:
    inputDF.write.format("org.apache.hudi") \
        .options(**hudi_clusteringopt) \
        .option("hoodie.parquet.small.file.limit", "0") \
        .option("hoodie.clustering.inline", "true") \
        .option("hoodie.clustering.inline.max.commits", "0") \
        .option("hoodie.clustering.plan.strategy.target.file.max.bytes", "1073741824") \
        .option("hoodie.clustering.plan.strategy.small.file.limit", "629145600") \
        .option("hoodie.clustering.plan.strategy.sort.columns", "pk_col") \
        .mode('append') \
        .save("s3://xxxxxxxxxxxxxx")
except Exception as e:
    print(e)
```

Here is the data set if someone wants to reproduce it:

```python
inputDF = spark.createDataFrame(
    [
        ("1001", 1001, "2015-01-01", "2015-01-01T13:51:39.340396Z"),
        ("1011", 1011, "2015-01-01", "2015-01-01T12:14:58.597216Z"),
        ("1021", 1021, "2015-01-01", "2015-01-01T13:51:40.417052Z"),
        ("1031", 1031, "2015-01-01", "2015-01-01T13:51:40.519832Z"),
        ("1041", 1041, "2015-01-02", "2015-01-01T12:15:00.512679Z"),
        ("1051", 1051, "2015-01-02", "2015-01-01T13:51:42.248818Z"),
    ],
    ["id", "id_val", "creation_date", "last_update_time"],
)
```
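One way to check whether inline clustering actually ran is to look at the Hudi timeline under the table's `.hoodie/` folder: a completed clustering typically appears as a `replacecommit` instant. Below is a minimal sketch using boto3; the bucket and prefix are placeholders, not values from the question.

```python
import boto3

# Hypothetical bucket/prefix; replace with the Hudi table's actual base path.
BUCKET = "my-bucket"
TABLE_PREFIX = "path/to/myhudidataset_upsert_legacy_new7/"

s3 = boto3.client("s3")

# Hudi keeps its timeline (commit metadata) under <table base path>/.hoodie/.
# A completed inline clustering shows up as an instant ending in ".replacecommit".
resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=TABLE_PREFIX + ".hoodie/")
instants = [obj["Key"].rsplit("/", 1)[-1] for obj in resp.get("Contents", [])]

replacecommits = [name for name in instants if name.endswith(".replacecommit")]
print("clustering (replacecommit) instants found:", replacecommits or "none")
```

If no `replacecommit` instants show up, clustering never ran and the small files are simply the raw bulk_insert output.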
1 answer, 0 votes, 29 views, asked 3 months ago

Describe table in Athena fails with insufficient lake formation permissions

When I try to run the following query via the Athena JDBC driver:

```sql
describe gitlab.issues
```

I get the following error:

> [Simba][AthenaJDBC](100071) An error has been thrown from the AWS Athena client. FAILED: SemanticException Unable to fetch table gitlab. Insufficient Lake Formation permission(s) on gitlab (Service: AmazonDataCatalog; Status Code: 400; Error Code: AccessDeniedException; Request ID: be6aeb1b-fc06-410d-9723-2df066307b35; Proxy: null) [Execution ID: a2534d22-c4df-49e9-8515-80224779bf01]

The following query works:

```sql
select * from gitlab.issues limit 10
```

The role that is used has the `DESCRIBE` permission on the `gitlab` database and `DESCRIBE, SELECT` permissions on the table `issues`. It also has the following IAM permissions:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "athena:BatchGetNamedQuery",
        "athena:BatchGetQueryExecution",
        "athena:CreatePreparedStatement",
        "athena:DeletePreparedStatement",
        "athena:GetDataCatalog",
        "athena:GetDatabase",
        "athena:GetNamedQuery",
        "athena:GetPreparedStatement",
        "athena:GetQueryExecution",
        "athena:GetQueryResults",
        "athena:GetQueryResultsStream",
        "athena:GetTableMetadata",
        "athena:GetWorkGroup",
        "athena:ListDatabases",
        "athena:ListNamedQueries",
        "athena:ListPreparedStatements",
        "athena:ListDataCatalogs",
        "athena:ListEngineVersions",
        "athena:ListQueryExecutions",
        "athena:ListTableMetadata",
        "athena:ListTagsForResource",
        "athena:ListWorkGroups",
        "athena:StartQueryExecution",
        "athena:StopQueryExecution",
        "athena:UpdatePreparedStatement"
      ],
      "Resource": "*",
      "Effect": "Allow"
    },
    {
      "Action": [
        "glue:BatchGetCustomEntityTypes",
        "glue:BatchGetPartition",
        "glue:GetCatalogImportStatus",
        "glue:GetColumnStatisticsForPartition",
        "glue:GetColumnStatisticsForTable",
        "glue:GetCustomEntityType",
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:GetPartition",
        "glue:GetPartitionIndexes",
        "glue:GetPartitions",
        "glue:GetSchema",
        "glue:GetSchemaByDefinition",
        "glue:GetSchemaVersion",
        "glue:GetSchemaVersionsDiff",
        "glue:GetTable",
        "glue:GetTableVersion",
        "glue:GetTableVersions",
        "glue:GetTables",
        "glue:GetUserDefinedFunction",
        "glue:GetUserDefinedFunctions",
        "glue:ListCustomEntityTypes",
        "glue:ListSchemaVersions",
        "glue:ListSchemas",
        "glue:QuerySchemaVersionMetadata",
        "glue:SearchTables"
      ],
      "Resource": "*",
      "Effect": "Allow"
    },
    {
      "Condition": {
        "ForAnyValue:StringEquals": {
          "aws:CalledVia": "athena.amazonaws.com"
        }
      },
      "Action": [
        "s3:GetBucketLocation",
        "s3:GetObject",
        "s3:ListBucket",
        "s3:ListBucketMultipartUploads",
        "s3:ListMultipartUploadParts",
        "s3:AbortMultipartUpload",
        "s3:PutObject"
      ],
      "Resource": [
        "arn:aws:s3:::aws-athena-query-results-123456789012-eu-west-1",
        "arn:aws:s3:::aws-athena-query-results-123456789012-eu-west-1/*",
        "arn:aws:s3:::aws-athena-federation-spill-123456789012-eu-west-1",
        "arn:aws:s3:::aws-athena-federation-spill-123456789012-eu-west-1/*"
      ],
      "Effect": "Allow"
    },
    {
      "Action": [
        "lakeformation:CancelTransaction",
        "lakeformation:CommitTransaction",
        "lakeformation:DescribeResource",
        "lakeformation:DescribeTransaction",
        "lakeformation:ExtendTransaction",
        "lakeformation:GetDataAccess",
        "lakeformation:GetQueryState",
        "lakeformation:GetQueryStatistics",
        "lakeformation:GetTableObjects",
        "lakeformation:GetWorkUnitResults",
        "lakeformation:GetWorkUnits",
        "lakeformation:StartQueryPlanning",
        "lakeformation:StartTransaction"
      ],
      "Resource": "*",
      "Effect": "Allow"
    },
    {
      "Condition": {
        "ForAnyValue:StringEquals": {
          "aws:CalledVia": "athena.amazonaws.com"
        }
      },
      "Action": "lambda:InvokeFunction",
      "Resource": "arn:aws:lambda:*:*:function:athena-federation-*",
      "Effect": "Allow"
    },
    {
      "Condition": {
        "ForAnyValue:StringEquals": {
          "aws:CalledVia": "athena.amazonaws.com"
        }
      },
      "Action": [
        "s3:GetBucketLocation",
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": "*",
      "Effect": "Allow"
    }
  ]
}
```

Even if I make the role a Lake Formation admin and database creator, assign Super permissions on the table and database, and attach the AdministratorAccess IAM policy to the role, it still fails.
0 answers, 0 votes, 78 views, asked 3 months ago

_temp AWS Lake Formation blueprint pipeline tables appear to an IAM user in the Athena editor although I didn't give this user permission on them

The _temp Lake Formation blueprint pipeline tables appear to an IAM user in the Athena editor, although I didn't give this user permission on them. Below is the policy granted to this IAM user; in the Lake Formation permissions I also did not give this user any permissions on the _temp tables:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Stmt1652364721496",
      "Action": [
        "athena:BatchGetNamedQuery",
        "athena:BatchGetQueryExecution",
        "athena:GetDataCatalog",
        "athena:GetDatabase",
        "athena:GetNamedQuery",
        "athena:GetPreparedStatement",
        "athena:GetQueryExecution",
        "athena:GetQueryResults",
        "athena:GetQueryResultsStream",
        "athena:GetTableMetadata",
        "athena:GetWorkGroup",
        "athena:ListDataCatalogs",
        "athena:ListDatabases",
        "athena:ListEngineVersions",
        "athena:ListNamedQueries",
        "athena:ListPreparedStatements",
        "athena:ListQueryExecutions",
        "athena:ListTableMetadata",
        "athena:ListTagsForResource",
        "athena:ListWorkGroups",
        "athena:StartQueryExecution",
        "athena:StopQueryExecution"
      ],
      "Effect": "Allow",
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:BatchDeleteTable",
        "glue:GetTable",
        "glue:GetTables",
        "glue:GetPartition",
        "glue:GetPartitions",
        "glue:BatchGetPartition"
      ],
      "Resource": [
        "*"
      ]
    },
    {
      "Sid": "Stmt1652365282568",
      "Action": "s3:*",
      "Effect": "Allow",
      "Resource": [
        "arn:aws:s3:::queryresults-all",
        "arn:aws:s3:::queryresults-all/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "lakeformation:GetDataAccess"
      ],
      "Resource": [
        "*"
      ]
    }
  ]
}
```
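To audit which Lake Formation grants actually exist on one of the _temp tables, the permissions can be listed with boto3 as sketched below; the database and table names are placeholders, not values from the question.

```python
import boto3

lf = boto3.client("lakeformation")

# Hypothetical names; substitute the blueprint's target database and one of the _temp tables.
resp = lf.list_permissions(
    Resource={"Table": {"DatabaseName": "my_blueprint_db", "Name": "example_table_temp"}}
)

# Each entry shows which principal holds which permissions on that resource,
# which helps reveal broad grants (e.g. to IAMAllowedPrincipals) that expose the table.
for entry in resp.get("PrincipalResourcePermissions", []):
    principal = entry["Principal"]["DataLakePrincipalIdentifier"]
    print(principal, entry["Permissions"])
```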
1 answer, 0 votes, 30 views, asked 5 months ago