Create lineage in DataHub from a Glue job

I am running a PoC on integrating Glue lineage into DataHub. I based my research on this set of AWS blog posts, 1 and 2, with a certain level of success. Several crucial elements are completely missing from them, such as EBS provisioning for the EKS cluster, which took me a while to figure out.

Long story short, I provisioned DataHub in my AWS environment with all the necessary components running. The second blog post describes how to run a Glue job that, on completion, generates lineage in DataHub. I could not replicate this behavior. After my Glue job finishes, I can see that the GMS instance in DataHub receives the call and everything looks fine. However, when I open the frontend I see only a Spark task without any upstream or downstream tables attached, even though the job succeeded and all the data was populated as expected. I tried different versions of the datahub-spark-lineage jar. I provisioned the default latest version of the DataHub platform.
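
For readers trying to reproduce this, here is a minimal sketch of how the Glue job parameters for such a setup could be assembled. It is not the exact configuration used here. It assumes the listener class `datahub.spark.DatahubSparkListener` and the `spark.datahub.rest.server` key from the datahub-spark-lineage documentation, and the common Glue workaround of chaining several Spark settings inside a single `--conf` parameter. The function name, GMS URL, and S3 path are hypothetical placeholders.

```python
def build_glue_lineage_args(gms_url, jar_s3_path, extra_conf=None):
    """Build Glue job parameters that attach the DataHub Spark listener.

    gms_url and jar_s3_path are placeholders for your own environment.
    """
    # Spark settings assumed from the datahub-spark-lineage docs.
    conf = {
        "spark.extraListeners": "datahub.spark.DatahubSparkListener",
        "spark.datahub.rest.server": gms_url,
    }
    conf.update(extra_conf or {})

    # Glue takes one --conf parameter, so additional settings are
    # chained inside its value, separated by " --conf ".
    pairs = [f"{key}={value}" for key, value in conf.items()]
    return {
        "--extra-jars": jar_s3_path,
        "--conf": " --conf ".join(pairs),
    }


if __name__ == "__main__":
    args = build_glue_lineage_args(
        "http://gms.example.internal:8080",            # hypothetical GMS URL
        "s3://my-bucket/jars/datahub-spark-lineage.jar",  # hypothetical path
    )
    for name, value in args.items():
        print(name, value)
```

The resulting dictionary can be pasted into the job parameters in the Glue console or passed as default arguments when the job is defined.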

Are there gotchas that I am missing? Would be keen to hear if anyone managed to get it running recently. I also raised the bug here for more context.

Denys
asked 2 months ago, 511 views
1 Answer

There could be a few reasons why the lineage is not showing up as expected in DataHub after running your AWS Glue ETL job:

The Glue job is not configured properly to generate lineage data. Make sure the --enable-data-lineage argument is passed to the job run with a value of true.

There may be permissions issues preventing the lineage data from being written to AWS Lake Formation, which DataHub reads from. Check that the IAM role used by Glue has sufficient permissions.

The version of the datahub-spark-lineage jar being used may not be compatible with your versions of Glue and DataHub. Try using the latest version of the jar available.

There could be networking or connectivity issues between the components. Ensure Glue, Lake Formation and DataHub can all reach each other.

The job may be failing or erroring out before completing, so no lineage is produced. Check the Glue job logs for errors or failures.

EXPERT
answered 2 months ago
  • Thank you, Giovanni. I don't think there is an official AWS doc about the --enable-data-lineage option. Has anything changed since this - https://repost.aws/questions/QUCmeXSLNuShacYAx9B8Jg4g/aws-glue-job-parameter-enable-data-lineage? I also haven't seen any documentation mentioning DataHub reading from Lake Formation. Are you able to point me to it, please? From my understanding, the instrumented Spark code in the Glue job calls the DataHub GMS API endpoint and pushes the relevant data, not the other way around. As I mentioned, the Spark DataTask is being recorded in DataHub but nothing else. The job completes successfully with no errors in the logs.
