There could be a few reasons why the lineage is not showing up as expected in DataHub after running your AWS Glue ETL job:
- The Glue job is not configured properly to generate lineage data. Make sure the --enable-data-lineage argument is passed to the job run with a value of true.
- There may be permissions issues preventing the lineage data from being written to AWS Lake Formation, which DataHub reads from. Check that the IAM role used by Glue has sufficient permissions.
- The version of the datahub-spark-lineage jar being used may not be compatible with your versions of Glue and DataHub. Try using the latest version of the jar available.
- There could be networking or connectivity issues between the components. Ensure Glue, Lake Formation and DataHub can all reach each other.
- The job may be failing or erroring out before completing, so no lineage is produced. Check the Glue job logs for errors or failures.
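For the first point, here is a minimal sketch of how a job argument could be passed at run time with boto3. The --enable-data-lineage flag is the one named above and is not confirmed against official AWS documentation here; the helper names and the job name are hypothetical placeholders.

```python
from typing import Dict, Optional


def build_glue_arguments(enable_lineage: bool = True,
                         extra: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Build the Arguments mapping for a Glue job run.

    Glue job arguments are string-to-string pairs, so the boolean is
    rendered as "true" / "false".
    """
    # Assumption: this flag name comes from the answer above, not from AWS docs.
    args = {"--enable-data-lineage": "true" if enable_lineage else "false"}
    if extra:
        args.update(extra)
    return args


def start_run(job_name: str, arguments: Dict[str, str]) -> str:
    """Start the Glue job with the given arguments and return the run id."""
    import boto3  # imported lazily so the helper above works without boto3

    glue = boto3.client("glue")
    response = glue.start_job_run(JobName=job_name, Arguments=arguments)
    return response["JobRunId"]


if __name__ == "__main__":
    # Only prints the arguments; call start_run("my-glue-job", ...) to launch.
    print(build_glue_arguments())
```

After a run, the arguments that were actually applied can be compared against this mapping in the job run details in the Glue console.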
Thank you, Giovanni. I don't think there is an official AWS doc about the --enable-data-lineage option. Has anything changed since this - https://repost.aws/questions/QUCmeXSLNuShacYAx9B8Jg4g/aws-glue-job-parameter-enable-data-lineage? I also haven't seen any documentation mentioning DataHub reading from Lake Formation. Are you able to point me at that, please? From my understanding, the instrumented Spark code in the Glue job calls the DataHub GMS API endpoint and pushes the relevant data, not the other way around. As I mentioned, the Spark DataTask is being recorded in DataHub but nothing else. The job completes successfully with no errors in the logs.
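To make the push model concrete, here is a minimal sketch of the Spark settings the datahub-spark-lineage listener is usually given, folded into a single Glue --conf job parameter. The two property names are my understanding of the DataHub Spark agent configuration; the GMS URL and the helper names are hypothetical, and chaining several settings through one --conf value is a commonly used workaround rather than something I can point to in official AWS docs.

```python
from typing import Dict


def datahub_listener_conf(gms_url: str) -> Dict[str, str]:
    """Spark properties that register the DataHub listener and point it at GMS."""
    return {
        # Listener class shipped in the datahub-spark-lineage jar.
        "spark.extraListeners": "datahub.spark.DatahubSparkListener",
        # REST endpoint of DataHub GMS that the listener pushes to.
        "spark.datahub.rest.server": gms_url,
    }


def to_glue_conf_argument(conf: Dict[str, str]) -> Dict[str, str]:
    """Fold several Spark properties into one Glue '--conf' job parameter.

    The first key=value pair becomes the parameter value, and each further
    pair is appended after a literal ' --conf ' separator.
    """
    pairs = [f"{key}={value}" for key, value in conf.items()]
    return {"--conf": " --conf ".join(pairs)}


if __name__ == "__main__":
    # Hypothetical GMS address; replace with the endpoint reachable from Glue.
    conf = datahub_listener_conf("http://gms.example.internal:8080")
    print(to_glue_conf_argument(conf))
```

If the listener is registered but only the task shows up, the Glue driver logs are the place to look for what the listener emitted for each dataset.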