How to create a Glue Workflow programmatically?

Is there a way to create a Glue workflow programmatically?

I looked at CloudFormation, but the only resource I found creates an empty Workflow (just the Workflow name, description and properties): https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-glue-workflow.html

I also looked at the APIs (https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-workflow.html). Although the data types for all the structures are documented, the only create API again just creates the empty shell.

Am I missing something? How is the workflow created from a blueprint in Lake Formation? Is it some sort of pre-assembled JSON file that is simply linked to the workflow in Glue?

Can we do something similar, or do we need to wait for the customizable blueprints? Thank you.

UPDATE:

As the snippet in the Accepted Answer shows, the key point is that the

AWS::Glue::Trigger

resource is what actually builds the Workflow.

Specifically, you need to:

  1. Create the Workflow with AWS::Glue::Workflow.
  2. If needed, create the Database and Connection as well (AWS::Glue::Database, AWS::Glue::Connection).
  3. Create every Crawler and Job you want to add to the Workflow using AWS::Glue::Crawler or AWS::Glue::Job.
  4. Create a first Trigger (AWS::Glue::Trigger) with Type: ON_DEMAND, Actions set to the first Crawler or Job your Workflow needs to launch, and WorkflowName referencing the Workflow created in step 1.
  5. Create every other Trigger with Type: CONDITIONAL, each with a Predicate on the Crawler or Job it waits for and the same WorkflowName.
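
For anyone using the API rather than CloudFormation, the same steps can be sketched with boto3. This is a minimal sketch, not a definitive implementation: it assumes the crawler and job from step 3 already exist, the helper names (build_workflow_requests, apply_requests) and the trigger names are made up for illustration, and the only Glue calls involved are create_workflow and create_trigger.

```python
def build_workflow_requests(workflow_name, crawler_name, job_name):
    """Return the Glue API calls, in order, as (method_name, kwargs) pairs.

    Assumption: the crawler and the job already exist; only the workflow
    and its triggers are created here. Trigger names are hypothetical.
    """
    return [
        # Step 1: the empty workflow.
        ("create_workflow", {
            "Name": workflow_name,
            "Description": "Workflow assembled from triggers",
        }),
        # Step 4: the entry trigger that starts the crawler.
        ("create_trigger", {
            "Name": f"{workflow_name}-start",
            "WorkflowName": workflow_name,
            "Type": "ON_DEMAND",
            "Actions": [{"CrawlerName": crawler_name}],
        }),
        # Step 5: run the job once the crawler has succeeded.
        ("create_trigger", {
            "Name": f"{workflow_name}-run-job",
            "WorkflowName": workflow_name,
            "Type": "CONDITIONAL",
            "StartOnCreation": True,
            "Predicate": {
                "Logical": "ANY",
                "Conditions": [{
                    "LogicalOperator": "EQUALS",
                    "CrawlerName": crawler_name,
                    "CrawlState": "SUCCEEDED",
                }],
            },
            "Actions": [{"JobName": job_name}],
        }),
    ]


def apply_requests(glue_client, requests):
    """Send each prepared request through the given Glue client."""
    for method_name, kwargs in requests:
        getattr(glue_client, method_name)(**kwargs)
```

With credentials configured, this could be used as apply_requests(boto3.client("glue"), build_workflow_requests("my_workflow", "my_crawler", "my_job")), where all three names are placeholders. The point is that passing WorkflowName to create_trigger is what attaches the crawler and the job to the workflow.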

Below is an example that creates a Workflow which launches a Crawler on an S3 bucket (CloudTrail logs) and, if the crawl succeeds, launches a Python script that changes the table and partition schema so they work with Athena.

Hope this helps.

---

AWSTemplateFormatVersion: '2010-09-09'
Description: Creates cloudtrail crawler and catalog for Athena and a job to transform to Parquet

Parameters: 
  CloudtrailS3: 
    Type: String
    Description: Enter the unique bucket name where the CloudTrail logs are stored

  CloudtrailS3Path: 
    Type: String
    Description: Enter the path/prefix that you want to crawl 

  CloudtrailDataLakeS3: 
    Type: String
    Description: Enter the unique bucket name for the data lake in which to store the logs in Parquet Format

Resources:
    
  CloudTrailGlueExecutionRole:
    Type: 'AWS::IAM::Role'
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - glue.amazonaws.com
            Action:
              - 'sts:AssumeRole'
      Path: /
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole
  
  GluePolicy:
    Properties:
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
        - Action:
          - s3:GetBucketLocation
          - s3:GetObject
          - s3:PutObject
          - s3:ListBucket
          Effect: Allow
          Resource:
          - !Join ['', ['arn:aws:s3:::', !Ref CloudtrailS3] ]
          - !Join ['', ['arn:aws:s3:::', !Ref CloudtrailS3, '/*'] ]
          - !Join ['', ['arn:aws:s3:::', !Ref CloudtrailDataLakeS3] ]
          - !Join ['', ['arn:aws:s3:::', !Ref CloudtrailDataLakeS3, '/*'] ]
        - Action:
          - s3:DeleteObject
          Effect: Allow
          Resource:
          - !Join ['', ['arn:aws:s3:::', !Ref CloudtrailDataLakeS3] ]
          - !Join ['', ['arn:aws:s3:::', !Ref CloudtrailDataLakeS3, '/*'] ]
      PolicyName: glue_cloudtrail_S3_policy
      Roles:
      - Ref: CloudTrailGlueExecutionRole
    Type: AWS::IAM::Policy

  GlueWorkflow:
    Type: AWS::Glue::Workflow
    Properties: 
      Description: Workflow to crawl the cloudtrail logs
      Name: cloudtrail_discovery_workflow

  GlueDatabaseCloudTrail:
    Type: AWS::Glue::Database
    Properties:
      # The database is created in the Data Catalog for your account
      CatalogId: !Ref AWS::AccountId   
      DatabaseInput:
        # The name of the database is hard-coded here
        Name: cloudtrail_db
        Description: Database to hold tables for CloudTrail data
        LocationUri: !Sub 's3://${CloudtrailDataLakeS3}/'
  
  GlueCrawlerCTSource:
    Type: AWS::Glue::Crawler
    Properties:
      Name: cloudtrail_source_crawler
      Role: !GetAtt CloudTrailGlueExecutionRole.Arn
      #Classifiers: none, use the default classifier
      Description: AWS Glue crawler to crawl cloudtrail logs
      Schedule: 
        ScheduleExpression: 'cron(0 9 * * ? *)'
      DatabaseName: !Ref GlueDatabaseCloudTrail
      Targets:
        S3Targets:
          - Path: !Sub
            - s3://${bucket}/${path}
            - {
              bucket: !Ref CloudtrailS3,
              path : !Ref CloudtrailS3Path
              }
            Exclusions: 
              - '*/CloudTrail-Digest/**'
              - '*/Config/**'
            
      #TablePrefix: ''
      SchemaChangePolicy:
        UpdateBehavior: "UPDATE_IN_DATABASE"
        DeleteBehavior: "LOG"
      Configuration: "{\"Version\":1.0,\"CrawlerOutput\":{\"Partitions\":{\"AddOrUpdateBehavior\":\"InheritFromTable\"},\"Tables\":{\"AddOrUpdateBehavior\":\"MergeNewColumns\"}}}"

  GlueJobConvertTable:
    Type: AWS::Glue::Job
    Properties:
      Name: ct_change_table_schema
      Role:
        Fn::GetAtt: [CloudTrailGlueExecutionRole, Arn]
      ExecutionProperty:
        MaxConcurrentRuns: 1
      GlueVersion: '1.0'
      Command:
        Name: pythonshell
        PythonVersion: '3'
        ScriptLocation: !Sub
          - s3://${bucket}/python/ct_change_table_schema.py
          - {bucket: !Ref CloudtrailDataLakeS3}
      DefaultArguments:
        '--TempDir': !Sub
          - s3://${bucket}/glue_tmp/
          - {bucket: !Ref CloudtrailDataLakeS3}
        "--job-bookmark-option" : "job-bookmark-disable"
        "--enable-metrics" : ""
    DependsOn:
      - CloudTrailGlueExecutionRole

  GlueSourceCrawlerTrigger:
    Type: AWS::Glue::Trigger
    Properties:
      Name: ct_start_source_crawl_Trigger
      Type: ON_DEMAND
      Description: Source Crawler trigger
      WorkflowName: !Ref GlueWorkflow
      Actions:
      - CrawlerName:
          Ref: GlueCrawlerCTSource
    DependsOn:
      - GlueCrawlerCTSource

  GlueJobTrigger:
    Type: AWS::Glue::Trigger
    Properties:
      Name: ct_change_schema_Job_Trigger
      Type: CONDITIONAL
      Description: Job trigger
      WorkflowName: !Ref GlueWorkflow
      StartOnCreation: 'true'
      Actions:
      - JobName: !Ref GlueJobConvertTable
      Predicate:
        Conditions:
        - LogicalOperator: EQUALS
          CrawlerName: !Ref GlueCrawlerCTSource
          CrawlState: SUCCEEDED
        Logical: ANY
    DependsOn:
    - GlueJobConvertTable
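
A template like the one above is deployed as a CloudFormation stack, from the console, the CLI, or the API. As a rough sketch with boto3: the helper below only builds the arguments for the CloudFormation create_stack call. The helper name, the stack name and the parameter values are placeholders, and CAPABILITY_IAM is included on the assumption that it is sufficient because the template creates an IAM role and policy (a template with custom-named IAM resources may need CAPABILITY_NAMED_IAM instead).

```python
def build_create_stack_kwargs(stack_name, template_body, parameters):
    """Build keyword arguments for the CloudFormation create_stack call.

    `parameters` maps template parameter names (CloudtrailS3,
    CloudtrailS3Path, CloudtrailDataLakeS3) to their string values.
    """
    return {
        "StackName": stack_name,
        "TemplateBody": template_body,
        "Parameters": [
            {"ParameterKey": key, "ParameterValue": value}
            for key, value in sorted(parameters.items())
        ],
        # Assumption: needed because the template creates IAM resources.
        "Capabilities": ["CAPABILITY_IAM"],
    }
```

With credentials configured and the template saved to a local file, the result could be passed on as boto3.client("cloudformation").create_stack(**kwargs).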
AWS
EXPERT
asked 4 years ago, 2686 views
1 Answer
Accepted Answer

For a workflow, you'll want a mixture of triggers, crawlers and jobs. CloudFormation covers these fairly well, but you might still need some Custom Resources and/or to kick off something like a Step Function.

Example (from the internet, just trying to minimize traffic to the post):

---
Parameters:
  OutputPathLocation:
    Description: Output path of the transformation file
    Type: String
  WorkFlowName:
    Description: Name of the workflow
    Type: String
    Default: test-workflow
  MyScriptLocation:
    Description: Location of ETL script
    Type: String
Resources:
  MyGlueWorkFlow:
    Type: AWS::Glue::Workflow
    Properties:
      Description: test cfn workflow
      Name:
        Ref: WorkFlowName
  MyGlueCrawler:
    Type: AWS::Glue::Crawler
    Properties:
      DatabaseName: cfndb
      Description: My crawler
      Name: MyGlueCrawler
      Role: AWSGlueServiceRole
      TablePrefix: cfn_
      Targets:
        S3Targets:
        - Path: s3://crawler-public-us-east-1/flight/2016/csv
    DependsOn:
    - MyGlueWorkFlow
  MyGlueCrawlerTrigger:
    Type: AWS::Glue::Trigger
    Properties:
      Name: MyCrawlerTrigger
      Type: ON_DEMAND
      Description: Crawler trigger
      WorkflowName:
        Ref: MyGlueWorkFlow
      Actions:
      - CrawlerName:
          Ref: MyGlueCrawler
    DependsOn:
    - MyGlueCrawler
  MyGlueJob:
    Type: AWS::Glue::Job
    Properties:
      Command:
        Name: glueetl
        ScriptLocation:
          Ref: MyScriptLocation
      Description: My workflow job
      GlueVersion: '1.0'
      Name: MyGlueJob
      Role: AWSGlueServiceRole
      DefaultArguments:
        "--outputpath":
          Ref: OutputPathLocation
    DependsOn:
    - MyGlueCrawler
  MyGlueJobTrigger:
    Type: AWS::Glue::Trigger
    Properties:
      Name: MyGlueJobTrigger
      Type: CONDITIONAL
      Description: Job trigger
      WorkflowName:
        Ref: MyGlueWorkFlow
      StartOnCreation: 'true'
      Actions:
      - JobName:
          Ref: MyGlueJob
      Predicate:
        Conditions:
        - LogicalOperator: EQUALS
          CrawlerName:
            Ref: MyGlueCrawler
          CrawlState: SUCCEEDED
        Logical: ANY
    DependsOn:
    - MyGlueJob


AWS
EXPERT
Raphael
answered 4 years ago
  • Where do we run this code? Which script should I use to run it? I tried with boto3, but I was only able to create the workflow; I wasn't able to add triggers or jobs.
