DataZone lineage using the Airflow OpenLineage integration


I have created custom extractors for the OpenLineage Airflow integration, and everything works well when I use the local Marquez setup described in the OpenLineage docs by including the following in airflow.cfg:

[openlineage]
transport = {"type": "http", "url": "http://localhost:5000", "endpoint": "api/v1/lineage"}

However, I cannot get it to work when sending events to DataZone. From what I can gather from the docs, the config should be something like this:

[openlineage]
transport = {"type": "http", "url": "https://datazone.eu-west-1.api.aws", "endpoint": "v2/domains/{datazone_id}/lineage"}

Airflow does not give any error other than: WARNING - Failed to emit OpenLineage event of id {id-hash}

I am running Airflow locally and it triggers the tasks fine using my local credentials, so this is not an access issue.

What is the correct transport config to use here? I run my data transformation logic outside of Airflow in ECS tasks, so I am not using Glue, which has its own way of connecting to DataZone.

Thank you in advance,

Pieter Coremans

asked 2 months ago · 93 views
2 Answers

Hello,

I understand that you are using custom extractors with the OpenLineage Airflow integration and that they work fine, but you are unable to send events to DataZone because you get the error below in Airflow. When you run Airflow locally using your own credentials the tasks trigger correctly, however when you use ECS tasks you get the error:

WARNING - Failed to emit OpenLineage event of id {id-hash}

  1. Install the openlineage-airflow package in your Airflow environment by adding the following to your requirements.txt file [1]: openlineage-airflow==1.4.1

  2. Configure OpenLineage as a lineage backend in your Airflow environment. Please find below an example of the variables:

  • AIRFLOW__LINEAGE__BACKEND=openlineage.lineage_backend.OpenLineageBackend
  • OPENLINEAGE_URL=http://
  • OPENLINEAGE_NAMESPACE=your-namespace

  3. Then set up the transport configuration for OpenLineage. For tasks running on ECS, use the ConsoleTransport to log OpenLineage events to CloudWatch Logs.

  4. Create a helper function that captures lineage information from CloudWatch Logs and publishes it to Amazon DataZone (see the sketch after this list). This function should:

  • Filter the OpenLineage events from the CloudWatch log group
  • Parse the OpenLineage events
  • Use the Amazon DataZone PostLineageEvent API to send the events to DataZone

  5. Implement the helper function in your Airflow DAG, or as a separate task that runs after your ECS tasks complete.

  6. In your Airflow DAG, use the appropriate operators for your ECS tasks and ensure they are configured to emit OpenLineage events.

  7. Run your DAG and verify that lineage information is being captured in Amazon DataZone.
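
As a rough, illustrative sketch of steps 3 and 4 (not an official implementation): it assumes the ECS tasks log OpenLineage events with the console transport, i.e. transport = {"type": "console"}, that the log group name and domain ID below are placeholders you replace with your own, and that each event is written as single-line JSON.

import json
import boto3

LOG_GROUP = "/ecs/my-transform-task"   # placeholder log group name
DOMAIN_ID = "dzd_xxxxxxxx"             # placeholder DataZone domain ID

logs = boto3.client("logs")
datazone = boto3.client("datazone")

def forward_lineage_events(start_ms: int, end_ms: int) -> int:
    """Copy OpenLineage events from CloudWatch Logs to Amazon DataZone."""
    sent = 0
    paginator = logs.get_paginator("filter_log_events")
    for page in paginator.paginate(
        logGroupName=LOG_GROUP,
        startTime=start_ms,
        endTime=end_ms,
        filterPattern='"eventTime"',   # assumes events contain this key
    ):
        for record in page["events"]:
            message = record["message"]
            try:
                json.loads(message)    # skip log lines that are not valid JSON
            except json.JSONDecodeError:
                continue
            # PostLineageEvent accepts the raw OpenLineage event payload.
            datazone.post_lineage_event(
                domainIdentifier=DOMAIN_ID,
                event=message.encode("utf-8"),
            )
            sent += 1
    return sent

You could call this from a PythonOperator that runs after the ECS task, passing the task's start and end times so the CloudWatch query stays bounded to that run.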

That being said, if the above doesn't work for you, please raise a support case with AWS so that we can do resource-based troubleshooting and get back to you there.

References:


[1] https://aws.amazon.com/blogs/big-data/amazon-datazone-introduces-openlineage-compatible-data-lineage-visualization-in-preview/
answered 2 months ago by AWS
reviewed 2 months ago by an AWS expert

Thank you for your question about integrating OpenLineage with Amazon DataZone using Apache Airflow. It appears there might be a misunderstanding in how to configure the OpenLineage transport for Amazon DataZone.

For Amazon DataZone, you don't use the HTTP transport type. Instead, you should use a specific Amazon DataZone API transport. The correct configuration should look more like this:

[openlineage]
transport = {"type": "amazon_datazone_api", "domainId": "<Your-DataZone-Domain-ID>"}

You need to replace <Your-DataZone-Domain-ID> with your actual Amazon DataZone domain ID.

Additionally, when using Amazon DataZone, you don't need to specify a URL or endpoint. The Amazon DataZone API transport will handle the communication with the appropriate DataZone API endpoint.

To capture lineage from Airflow and send it to Amazon DataZone, you need to configure the OpenLineage Airflow plugin. This involves installing the openlineage-airflow package and setting it up as a plugin in your Airflow environment.

Once configured, the plugin will convert Airflow metadata to OpenLineage events, which can then be consumed by Amazon DataZone's PostLineageEvent API.

It's worth noting that you may need to ensure your Airflow environment has the necessary permissions to interact with Amazon DataZone. This typically involves configuring the appropriate IAM roles and permissions.
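
For illustration only, a minimal identity policy could look like the snippet below. The action name is datazone:PostLineageEvent; the account ID and domain ARN shown are placeholders, so check the DataZone documentation for the exact resource format in your setup:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "datazone:PostLineageEvent",
      "Resource": "arn:aws:datazone:eu-west-1:111122223333:domain/dzd_xxxxxxxx"
    }
  ]
}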

If you're still encountering issues after making these changes, you may want to check your Airflow logs for more detailed error messages, or consult the Amazon DataZone documentation for any additional configuration steps specific to your setup.

Remember, since you're running data transformation logic outside of Airflow using ECS tasks, you'll need to ensure that these tasks are also configured to emit OpenLineage events that can be captured and sent to Amazon DataZone.
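
As a hedged sketch of what emitting an OpenLineage event from such a task could look like with the openlineage-python client (the namespace, job name, producer URI, and local endpoint below are placeholder assumptions, not DataZone-specific values):

from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Job, Run, RunEvent, RunState
from openlineage.client.transport.http import HttpConfig, HttpTransport

# Placeholder endpoint; swap in whatever transport you settle on for DataZone.
client = OpenLineageClient(
    transport=HttpTransport(HttpConfig(url="http://localhost:5000"))
)

# Minimal START event; a real task would also emit COMPLETE with inputs/outputs.
client.emit(
    RunEvent(
        eventType=RunState.START,
        eventTime=datetime.now(timezone.utc).isoformat(),
        run=Run(runId=str(uuid4())),
        job=Job(namespace="ecs-transforms", name="my_transform_task"),
        producer="https://example.com/my-ecs-task",
    )
)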
Sources:

  • Amazon DataZone introduces OpenLineage-compatible data lineage visualization in preview | AWS Big Data Blog
  • Data lineage in Amazon DataZone - Amazon DataZone

answered 2 months ago
  • This autogenerated answer is not correct. Using this, I get the following warning: WARNING - not enough values to unpack (expected 2, got 1), followed by the error TypeError: Transport None has to be class, and subclass of Transport in the _create_transport method of the OpenLineage client.
