- Grant permissions: Ensure that the IAM roles associated with your Glue or EMR services have the necessary Lake Formation permissions to access the data. You can do this through the Lake Formation console by granting the necessary permissions (SELECT, INSERT, DELETE, ALTER, DROP) to the IAM roles for the specific databases and tables.
- Use Glue Catalog: Your Glue or EMR jobs should be configured to use the AWS Glue Data Catalog as the metastore. This way, the jobs will be able to access the table metadata that's managed by Lake Formation.
- Use the Lake Formation API: When reading data in your Spark jobs, Lake Formation can vend temporary, table-scoped credentials (credential vending, e.g. via the GetTemporaryGlueTableCredentials API) that your job uses to read the underlying data. With those credentials configured, a plain spark.read.format("parquet").load(path) works.
- Access Control: It's worth noting that Lake Formation enforces fine-grained access control. This means that even though your Glue or EMR job might be able to see the table metadata, the job might not be able to access the underlying data if the IAM role associated with the job doesn't have the necessary permissions.
import boto3

# Initialize a session using the AWS SDK (adjust the region as needed)
session = boto3.Session(region_name="us-west-2")

# Initialize the Lake Formation client
lakeformation = session.client("lakeformation")

# Request temporary, table-scoped credentials via credential vending
# (the table ARN below is a placeholder)
response = lakeformation.get_temporary_glue_table_credentials(
    TableArn="arn:aws:glue:us-west-2:123456789012:table/my_database/my_table",
    Permissions=["SELECT"],
    SupportedPermissionTypes=["COLUMN_PERMISSION"],
)

# Hand the vended credentials to Spark's S3 connector, then read as usual
spark.conf.set("spark.hadoop.fs.s3a.access.key", response["AccessKeyId"])
spark.conf.set("spark.hadoop.fs.s3a.secret.key", response["SecretAccessKey"])
spark.conf.set("spark.hadoop.fs.s3a.session.token", response["SessionToken"])
df = spark.read.format("parquet").load("s3a://my-bucket/path/to/table/")  # placeholder path
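The permission-granting step from the first bullet can be sketched with boto3 as well. This is only a sketch: the role ARN, database, and table names below are placeholders, not values from this thread. The helper assembles the arguments for the Lake Formation GrantPermissions call so they can be inspected before applying them.

```python
def build_grant_request(role_arn, database, table):
    """Assemble the arguments for the Lake Formation GrantPermissions call."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": role_arn},
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        "Permissions": ["SELECT"],
    }

# Placeholder role, database, and table names
request = build_grant_request(
    "arn:aws:iam::123456789012:role/MyGlueJobRole",
    "my_database",
    "my_table",
)

# To apply for real (requires boto3 and AWS credentials):
# import boto3
# boto3.client("lakeformation").grant_permissions(**request)
```

Separating request construction from the API call makes the intended grant easy to review (or log) before it is executed against a real account.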
Thanks for the answer. Sorry for the inaccurate description of the problem. The question was asked in the context of the new Amazon DataZone service and its data asset subscription model. I think we've already found a solution.
For each project, DataZone creates three roles (project owner, contributor, viewer):
datazone-usr-o-proj-MyProjectId
datazone-usr-c-proj-MyProjectId
datazone-usr-u-proj-MyProjectId
Those roles are granted SELECT permission on the Lake Formation tables that the project is subscribed to. For a project with the DataLake Producer capability, the roles are also assigned a policy that enables them to submit and run Glue jobs: https://docs.aws.amazon.com/datazone/latest/userguide/Identitybasedroles.html
The policy is defined with the condition:
"Condition": {
    "ForAnyValue:StringEquals": {
        "aws:ResourceTag/datazone:projectId": "proj-MyProjectId"
    }
}
So, to process data with Glue, we need to configure the Glue job as follows:
- select the IAM role created by the DataZone project, e.g. datazone-usr-c-proj-MyProjectId
- put our job script in the bucket created by the DataZone project, e.g. s3://datazone-proj-MyProjectId-...
- add a resource tag with our project ID to the job: "datazone:projectId" = "proj-MyProjectId"
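Putting those three items together, here is a hedged boto3 sketch of creating such a Glue job. The job name and script path are placeholders (in practice the script must live in the project's own bucket, per the bullet above); only the role name pattern and the tag key come from this thread.

```python
def build_glue_job_params(project_id):
    """Assemble Glue CreateJob arguments that satisfy the DataZone tag condition."""
    return {
        "Name": "my-datazone-job",  # placeholder job name
        "Role": f"datazone-usr-c-{project_id}",  # contributor role created by the project
        "Command": {
            "Name": "glueetl",
            # Placeholder path; use the bucket created by the DataZone project
            "ScriptLocation": "s3://example-placeholder-bucket/scripts/job.py",
        },
        # The resource tag that the role policy's condition checks
        "Tags": {"datazone:projectId": project_id},
    }

params = build_glue_job_params("proj-MyProjectId")

# To create the job for real (requires boto3 and AWS credentials):
# import boto3
# boto3.client("glue").create_job(**params)
```

Without the `Tags` entry the role's policy condition (`aws:ResourceTag/datazone:projectId`) would deny the job, so it is the piece most easily forgotten.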
Looking at the project role policies, as of now it's possible to process data with Glue, but unfortunately not with EMR.
What would you say about extending the policies to include EMR-related permissions?