Hello,
I understand you are using custom extractors with the OpenLineage Airflow integration and they work fine, but you are unable to send lineage to DataZone because you get the error below in Airflow. When you run Airflow locally using your own credentials the task runs successfully; however, when you run it via ECS tasks, you get the error.
WARNING - Failed to emit OpenLineage event of id {id-hash}
- Install the openlineage-airflow package in your Airflow environment. Add the following to your requirements.txt file: [1] openlineage-airflow==1.4.1
- Configure OpenLineage as a lineage backend in your Airflow environment. Please find below an example of the variables:
  - AIRFLOW__LINEAGE__BACKEND=openlineage.lineage_backend.OpenLineageBackend
  - OPENLINEAGE_URL=http://
  - OPENLINEAGE_NAMESPACE=your-namespace
- Then set up the transport configuration for OpenLineage. For tasks running on ECS, use the ConsoleTransport to log OpenLineage events to CloudWatch Logs (a sketch of emitting events this way follows this list).
- Create a helper function to capture lineage information from CloudWatch Logs and publish it to Amazon DataZone (a sketch of such a helper also follows this list). This function should:
  - Filter the OpenLineage events from the CloudWatch log group
  - Parse the OpenLineage events
  - Use the Amazon DataZone PostLineageEvent API to send the events to DataZone
- Implement the helper function in your Airflow DAG or as a separate task that runs after your ECS tasks complete.
- In your Airflow DAG, use the appropriate operators for your ECS tasks and ensure they are configured to emit OpenLineage events.
- Run your DAG and verify that lineage information is being captured in Amazon DataZone.
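For the ConsoleTransport step above, here is a minimal sketch of how an ECS task written in Python might emit OpenLineage events through the openlineage-python client's ConsoleTransport so that they land in the task's CloudWatch log stream. The job namespace, job name, and producer URI are placeholders, and the exact imports may vary slightly between openlineage-python versions:

import logging
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Job, Run, RunEvent, RunState
from openlineage.client.transport.console import ConsoleConfig, ConsoleTransport

# ConsoleTransport writes each event as JSON through Python logging, so make
# sure logging is configured; the awslogs driver on ECS then ships this output
# to CloudWatch Logs.
logging.basicConfig(level=logging.INFO)

client = OpenLineageClient(transport=ConsoleTransport(ConsoleConfig()))

run = Run(runId=str(uuid4()))
job = Job(namespace="your-namespace", name="your-ecs-task")  # placeholder names
producer = "https://example.com/your-ecs-task"  # placeholder producer URI

# Emit a START event before the transformation work...
client.emit(RunEvent(
    eventType=RunState.START,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=run,
    job=job,
    producer=producer,
))

# ... your data transformation logic here ...

# ...and a COMPLETE event once it has finished.
client.emit(RunEvent(
    eventType=RunState.COMPLETE,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=run,
    job=job,
    producer=producer,
))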
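And here is a minimal sketch of the helper that reads those events back out of CloudWatch Logs and forwards them to Amazon DataZone with the PostLineageEvent API (via the boto3 datazone client). The log group name, domain ID, filter pattern, and function name are placeholders to adapt to your setup, and it assumes each matching log line ends with the event's JSON document:

import json

import boto3

LOG_GROUP = "/ecs/your-task"  # placeholder: the ECS task's log group
DATAZONE_DOMAIN_ID = "dzd_xxxxxxxx"  # placeholder: your DataZone domain ID

logs = boto3.client("logs")
datazone = boto3.client("datazone")


def publish_lineage_from_cloudwatch(start_time_ms, end_time_ms):
    """Read OpenLineage events from CloudWatch Logs and post them to DataZone."""
    paginator = logs.get_paginator("filter_log_events")
    pages = paginator.paginate(
        logGroupName=LOG_GROUP,
        startTime=start_time_ms,
        endTime=end_time_ms,
        # Every OpenLineage run event carries an eventTime field, so use it
        # to narrow the log lines down to lineage events.
        filterPattern='"eventTime"',
    )
    for page in pages:
        for log_event in page["events"]:
            message = log_event["message"]
            # A logging prefix may precede the JSON payload; keep only the JSON.
            json_start = message.find("{")
            if json_start == -1:
                continue
            try:
                lineage_event = json.loads(message[json_start:])
            except json.JSONDecodeError:
                continue
            # PostLineageEvent accepts the raw OpenLineage event document.
            datazone.post_lineage_event(
                domainIdentifier=DATAZONE_DOMAIN_ID,
                event=json.dumps(lineage_event),
            )

You could call this from a PythonOperator (or a @task-decorated function) that runs after your ECS task in the DAG, passing the task run's start and end times in epoch milliseconds.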
That being said, if the above doesn't work for you, please raise a support case with AWS so that we can do resource-based troubleshooting and follow up with you there.
References:
[1] https://aws.amazon.com/blogs/big-data/amazon-datazone-introduces-openlineage-compatible-data-lineage-visualization-in-preview/
Thank you for your question about integrating OpenLineage with Amazon DataZone using Apache Airflow. It appears there might be a misunderstanding in how to configure the OpenLineage transport for Amazon DataZone.
For Amazon DataZone, you don't use the HTTP transport type. Instead, you should use a specific Amazon DataZone API transport. The correct configuration should look more like this:
[openlineage]
transport = {"type": "amazon_datazone_api", "domainId": "<Your-DataZone-Domain-ID>"}
You need to replace <Your-DataZone-Domain-ID> with your actual Amazon DataZone domain ID.
Additionally, when using Amazon DataZone, you don't need to specify a URL or endpoint. The Amazon DataZone API transport will handle the communication with the appropriate DataZone API endpoint.
To capture lineage from Airflow and send it to Amazon DataZone, you need to configure the OpenLineage Airflow plugin. This involves installing the openlineage-airflow package and setting it up as a plugin in your Airflow environment.
Once configured, the plugin will convert Airflow metadata to OpenLineage events, which can then be consumed by Amazon DataZone's PostLineageEvent API.
It's worth noting that you may need to ensure your Airflow environment has the necessary permissions to interact with Amazon DataZone. This typically involves configuring the appropriate IAM roles and permissions.
If you're still encountering issues after making these changes, you may want to check your Airflow logs for more detailed error messages, or consult the Amazon DataZone documentation for any additional configuration steps specific to your setup.
Remember, since you're running data transformation logic outside of Airflow using ECS tasks, you'll need to ensure that these tasks are also configured to emit OpenLineage events that can be captured and sent to Amazon DataZone.
Sources
Amazon DataZone introduces OpenLineage-compatible data lineage visualization in preview | AWS Big Data Blog
Data lineage in Amazon DataZone - Amazon DataZone
This autogenerated answer is not correct. Using this, I get the following warning:
WARNING - not enough values to unpack (expected 2, got 1)
Followed by the error: TypeError: Transport None has to be class, and subclass of Transport, in the _create_transport method of the openlineage client.