How can I consume subscribed Lake Formation data assets using Glue or EMR?

Problem: I'd like to process my data using Spark. How can I consume Lake Formation data assets that I'm subscribed to, using Glue ETL jobs or EMR?

Context: I created the following domains: Marketing, Sales, and Reporting. A project in the Reporting domain is supposed to consume data from the Marketing and Sales domains in order to create a report. The report will then be published as a data product in the Reporting domain.

In the Reporting domain I created a "reporting project" with the Data Lake Producer capability, and then subscribed it to the Sales and Marketing assets (Lake Formation tables). As an owner of the "reporting project" I can see the assets I subscribed to in my SUB database, and I can query them with Athena run from the DataZone portal in the context of my project. I can also create a reporting table in my PUB database that I can publish later in the Data Catalog. After a successful subscription, authorization to the Sales and Marketing Lake Formation tables is done via the DataZone project roles (owner, contributor, viewer).

irme
asked a year ago · 1030 views

2 Answers
  • Grant permissions: Ensure that the IAM roles associated with your Glue or EMR services have the necessary Lake Formation permissions to access the data. You can do this through the Lake Formation console by granting the necessary permissions (SELECT, INSERT, DELETE, ALTER, DROP) to the IAM roles for the specific databases and tables.
  • Use Glue Catalog: Your Glue or EMR jobs should be configured to use the AWS Glue Data Catalog as the metastore. This way, the jobs will be able to access the table metadata that's managed by Lake Formation.
  • Use Lake Formation API: when a query engine that is not natively integrated needs to read Lake Formation–governed data, it can call Lake Formation's credential vending API (GetTemporaryGlueTableCredentials) to obtain temporary, table-scoped credentials. Note that Lake Formation vends temporary credentials, not pre-signed URLs; when a job runs on Glue or EMR with Lake Formation integration enabled, this happens automatically and no explicit call is needed.
  • Access Control: It's worth noting that Lake Formation enforces fine-grained access control. This means that even though your Glue or EMR job might be able to see the table metadata, the job might not be able to access the underlying data if the IAM role associated with the job doesn't have the necessary permissions.
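The permission grant described in the first bullet can be sketched with boto3. The role ARN, database, and table names below are hypothetical placeholders, not values from this thread:

```python
# Sketch: grant Lake Formation SELECT on a table to a job's IAM role.
def build_grant_request(role_arn, database, table):
    """Build the arguments for lakeformation.grant_permissions."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": role_arn},
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        "Permissions": ["SELECT"],
    }

# With AWS credentials configured, the grant would be issued like this:
# import boto3
# lakeformation = boto3.client("lakeformation")
# lakeformation.grant_permissions(
#     **build_grant_request("arn:aws:iam::111122223333:role/MyGlueJobRole",
#                           "sales_db", "orders"))
```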
import boto3

# Sketch of the underlying credential-vending call. The table ARN, bucket,
# and S3 path below are placeholders -- replace them with your own. On Glue
# or EMR with Lake Formation integration, the engine does this for you.

session = boto3.Session(region_name="us-west-2")  # use your region
lakeformation = session.client("lakeformation")

# Request temporary, table-scoped credentials from Lake Formation
response = lakeformation.get_temporary_glue_table_credentials(
    TableArn="arn:aws:glue:us-west-2:111122223333:table/sales_db/orders",
    Permissions=["SELECT"],
    DurationSeconds=3600,
)

# Pass the vended credentials to Spark's S3 connector...
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", response["AccessKeyId"])
hadoop_conf.set("fs.s3a.secret.key", response["SecretAccessKey"])
hadoop_conf.set("fs.s3a.session.token", response["SessionToken"])
hadoop_conf.set("fs.s3a.aws.credentials.provider",
                "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")

# ...then read the table's underlying data
df = spark.read.format("parquet").load("s3a://my-datalake-bucket/sales/orders/")
EXPERT
answered a year ago

Thanks for the answer. Sorry for the inaccurate description of the problem. The question was asked in the context of the new Amazon DataZone service and its data asset subscription model. I think we've already found a solution.

For each project, DataZone creates three roles (owner, contributor, viewer):

datazone-usr-o-proj-MyProjectId
datazone-usr-c-proj-MyProjectId
datazone-usr-u-proj-MyProjectId

These roles are granted SELECT permission on the Lake Formation tables that the project is subscribed to. For a project with the Data Lake Producer capability, the roles are also assigned a policy that allows them to submit and run Glue jobs => https://docs.aws.amazon.com/datazone/latest/userguide/Identitybasedroles.html

The policy is defined with the condition:

            "Condition": {
                "ForAnyValue:StringEquals": {
                    "aws:ResourceTag/datazone:projectId": "proj-MyProjectId"
                }
            }
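Because of this condition, the project roles can only act on Glue resources that carry the matching project tag. Tagging an existing job could be sketched like this with boto3 (the job ARN below is a placeholder):

```python
# Sketch: tag an existing Glue job so the DataZone policy condition matches.
def build_tag_request(job_arn, project_id):
    """Build the arguments for glue.tag_resource."""
    return {
        "ResourceArn": job_arn,
        "TagsToAdd": {"datazone:projectId": project_id},
    }

# import boto3
# glue = boto3.client("glue")
# glue.tag_resource(**build_tag_request(
#     "arn:aws:glue:us-west-2:111122223333:job/my-reporting-job",
#     "proj-MyProjectId"))
```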

So, to be able to process data with Glue, we need to do the following for the Glue job:

  1. select an IAM role created by the DataZone project, e.g. datazone-usr-c-proj-MyProjectId
  2. put our job script in the bucket created by the DataZone project, e.g. s3://datazone-proj-MyProjectId-...
  3. add a resource tag to the job with our project id: "datazone:projectId" = "proj-MyProjectId"

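The three steps above can be sketched as a single glue.create_job call. The job name, Glue version, and script filename are assumptions for illustration, not values from the project:

```python
# Sketch: create a Glue job under the DataZone project role with the
# required project tag (job name, Glue version, and paths are placeholders).
def build_create_job_request(project_id, script_s3_path):
    """Build glue.create_job arguments that satisfy the DataZone policy."""
    return {
        "Name": f"reporting-job-{project_id}",       # any name works
        "Role": f"datazone-usr-c-{project_id}",      # step 1: project role
        "Command": {
            "Name": "glueetl",
            "ScriptLocation": script_s3_path,        # step 2: project bucket
            "PythonVersion": "3",
        },
        "GlueVersion": "4.0",
        "Tags": {"datazone:projectId": project_id},  # step 3: project tag
    }

# import boto3
# glue = boto3.client("glue")
# glue.create_job(**build_create_job_request("proj-MyProjectId",
#                                            "s3://datazone-proj-MyProjectId-.../my_script.py"))
```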
Looking at the project roles' policies, as of now it's possible to process data with Glue, but unfortunately not with EMR.

What would you say if we extended the policies to include EMR-related permissions?

irme
answered a year ago
