How to create a Glue Workflow programmatically?

Is there a way to create a Glue workflow programmatically?

I looked at CloudFormation, but the only resource I found creates an empty Workflow (just the Workflow name, description and properties): https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-glue-workflow.html

I also looked at the APIs (https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-workflow.html). The data types for all the structures are there, but the only create API again just creates the blank box.

Am I missing something? How does Lake Formation create the workflow from a blueprint? Is it some sort of pre-assembled JSON file that gets linked to the workflow in Glue?

Can we do something similar, or do we need to wait for the customizable blueprints? Thank you.

UPDATE:

As can be seen from the code snippet in the Accepted Answer, the key is that the

AWS::Glue::Trigger

resource is what actually builds the Workflow: each trigger names the workflow it belongs to and the crawlers or jobs it starts.

Specifically, you need to:

  1. Create the Workflow with AWS::Glue::Workflow.
  2. If needed, create the Database and Connection as well (AWS::Glue::Database, AWS::Glue::Connection).
  3. Create every Crawler and Job you want to add to the workflow, using AWS::Glue::Crawler or AWS::Glue::Job.
  4. Create a first Trigger (AWS::Glue::Trigger) with Type: ON_DEMAND, Actions set to the first Crawler or Job your Workflow needs to launch, and WorkflowName referencing the Workflow created in step 1.
  5. Create every other Trigger with Type: CONDITIONAL, again setting WorkflowName, plus a Predicate naming the Crawler or Job state it waits for.

Below is an example that creates a Workflow which launches a Crawler on an S3 bucket (CloudTrail logs) and, if the crawl succeeds, launches a Python script that changes the table and partition schema so they work with Athena.

Hope this helps.

---

AWSTemplateFormatVersion: '2010-09-09'
Description: Creates cloudtrail crawler and catalog for Athena and a job to transform to Parquet

Parameters: 
  CloudtrailS3: 
    Type: String
    Description: Enter the unique bucket name where the CloudTrail logs are stored

  CloudtrailS3Path: 
    Type: String
    Description: Enter the path/prefix that you want to crawl 

  CloudtrailDataLakeS3: 
    Type: String
    Description: Enter the unique bucket name for the data lake in which to store the logs in Parquet Format

Resources:
    
  CloudTrailGlueExecutionRole:
    Type: 'AWS::IAM::Role'
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - glue.amazonaws.com
            Action:
              - 'sts:AssumeRole'
      Path: /
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole
  
  GluePolicy:
    Properties:
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
        - Action:
          - s3:GetBucketLocation
          - s3:GetObject
          - s3:PutObject
          - s3:ListBucket
          Effect: Allow
          Resource:
          - !Join ['', ['arn:aws:s3:::', !Ref CloudtrailS3] ]
          - !Join ['', ['arn:aws:s3:::', !Ref CloudtrailS3, '/*'] ]
          - !Join ['', ['arn:aws:s3:::', !Ref CloudtrailDataLakeS3] ]
          - !Join ['', ['arn:aws:s3:::', !Ref CloudtrailDataLakeS3, '/*'] ]
        - Action:
          - s3:DeleteObject
          Effect: Allow
          Resource:
          - !Join ['', ['arn:aws:s3:::', !Ref CloudtrailDataLakeS3] ]
          - !Join ['', ['arn:aws:s3:::', !Ref CloudtrailDataLakeS3, '/*'] ]
      PolicyName: glue_cloudtrail_S3_policy
      Roles:
      - Ref: CloudTrailGlueExecutionRole
    Type: AWS::IAM::Policy

  GlueWorkflow:
    Type: AWS::Glue::Workflow
    Properties: 
      Description: Workflow to crawl the cloudtrail logs
      Name: cloudtrail_discovery_workflow

  GlueDatabaseCloudTrail:
    Type: AWS::Glue::Database
    Properties:
      # The database is created in the Data Catalog for your account
      CatalogId: !Ref AWS::AccountId   
      DatabaseInput:
        # The database name is hard-coded here rather than taken from a parameter
        Name: cloudtrail_db
        Description: Database to hold tables for CloudTrail data
        LocationUri: !Ref CloudtrailDataLakeS3
  
  GlueCrawlerCTSource:
    Type: AWS::Glue::Crawler
    Properties:
      Name: cloudtrail_source_crawler
      Role: !GetAtt CloudTrailGlueExecutionRole.Arn
      #Classifiers: none, use the default classifier
      Description: AWS Glue crawler to crawl cloudtrail logs
      Schedule: 
        ScheduleExpression: 'cron(0 9 * * ? *)'
      DatabaseName: !Ref GlueDatabaseCloudTrail
      Targets:
        S3Targets:
          - Path: !Sub
            - s3://${bucket}/${path}
            - {
              bucket: !Ref CloudtrailS3,
              path : !Ref CloudtrailS3Path
              }
            Exclusions: 
              - '*/CloudTrail-Digest/**'
              - '*/Config/**'
            
      #TablePrefix: ''
      SchemaChangePolicy:
        UpdateBehavior: "UPDATE_IN_DATABASE"
        DeleteBehavior: "LOG"
      Configuration: "{\"Version\":1.0,\"CrawlerOutput\":{\"Partitions\":{\"AddOrUpdateBehavior\":\"InheritFromTable\"},\"Tables\":{\"AddOrUpdateBehavior\":\"MergeNewColumns\"}}}"

  GlueJobConvertTable:
    Type: AWS::Glue::Job
    Properties:
      Name: ct_change_table_schema
      Role:
        Fn::GetAtt: [CloudTrailGlueExecutionRole, Arn]
      ExecutionProperty:
        MaxConcurrentRuns: 1
      GlueVersion: '1.0'
      Command:
        Name: pythonshell
        PythonVersion: '3'
        ScriptLocation: !Sub
          - s3://${bucket}/python/ct_change_table_schema.py
          - {bucket: !Ref CloudtrailDataLakeS3}
      DefaultArguments:
        '--TempDir': !Sub
          - s3://${bucket}/glue_tmp/
          - {bucket: !Ref CloudtrailDataLakeS3}
        "--job-bookmark-option" : "job-bookmark-disable"
        "--enable-metrics" : ""
    DependsOn:
      - CloudTrailGlueExecutionRole

  GlueSourceCrawlerTrigger:
    Type: AWS::Glue::Trigger
    Properties:
      Name: ct_start_source_crawl_Trigger
      Type: ON_DEMAND
      Description: Source Crawler trigger
      WorkflowName: !Ref GlueWorkflow
      Actions:
      - CrawlerName:
          Ref: GlueCrawlerCTSource
    DependsOn:
      - GlueCrawlerCTSource

  GlueJobTrigger:
    Type: AWS::Glue::Trigger
    Properties:
      Name: ct_change_schema_Job_Trigger
      Type: CONDITIONAL
      Description: Job trigger
      WorkflowName: !Ref GlueWorkflow
      StartOnCreation: 'true'
      Actions:
      - JobName: !Ref GlueJobConvertTable
      Predicate:
        Conditions:
        - LogicalOperator: EQUALS
          CrawlerName: !Ref GlueCrawlerCTSource
          CrawlState: SUCCEEDED
        Logical: ANY
    DependsOn:
    - GlueJobConvertTable
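
For anyone driving this from the API rather than CloudFormation, the same steps can be sketched with boto3. The point is the same as in the template: `create_workflow` only makes the empty box, and each `create_trigger` call attaches itself to it through `WorkflowName`. This is a minimal sketch, not a tested deployment script. It assumes the crawler and job already exist, and the helper names and the trigger naming scheme are hypothetical.

```python
"""Sketch: build a Glue workflow with boto3 by attaching triggers to it."""


def build_workflow_requests(workflow_name, crawler_name, job_name):
    """Return kwargs for create_workflow and for each create_trigger call."""
    workflow = {
        "Name": workflow_name,
        "Description": "Workflow to crawl the cloudtrail logs",
    }
    # Step 4: an ON_DEMAND trigger is the entry point and starts the crawler.
    start_trigger = {
        "Name": f"{workflow_name}_start",  # hypothetical naming scheme
        "WorkflowName": workflow_name,
        "Type": "ON_DEMAND",
        "Actions": [{"CrawlerName": crawler_name}],
    }
    # Step 5: a CONDITIONAL trigger starts the job once the crawl succeeds.
    job_trigger = {
        "Name": f"{workflow_name}_run_job",  # hypothetical naming scheme
        "WorkflowName": workflow_name,
        "Type": "CONDITIONAL",
        "StartOnCreation": True,
        "Predicate": {
            "Logical": "ANY",
            "Conditions": [
                {
                    "LogicalOperator": "EQUALS",
                    "CrawlerName": crawler_name,
                    "CrawlState": "SUCCEEDED",
                }
            ],
        },
        "Actions": [{"JobName": job_name}],
    }
    return workflow, [start_trigger, job_trigger]


def create_workflow(glue, workflow_name, crawler_name, job_name):
    """Issue the calls against a boto3 Glue client (or any stand-in)."""
    workflow, triggers = build_workflow_requests(
        workflow_name, crawler_name, job_name
    )
    glue.create_workflow(**workflow)
    for trigger in triggers:
        glue.create_trigger(**trigger)
    return [trigger["Name"] for trigger in triggers]


# Usage against a real account (assumes boto3 is installed, credentials are
# configured, and the crawler and job already exist):
#
#   import boto3
#   glue = boto3.client("glue")
#   create_workflow(glue, "cloudtrail_discovery_workflow",
#                   "cloudtrail_source_crawler", "ct_change_table_schema")
#   glue.start_workflow_run(Name="cloudtrail_discovery_workflow")
```

Keeping the request dictionaries separate from the client calls makes it easy to inspect what will be sent before touching an account.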

AWS
EXPERT
asked 4 years ago · 2710 views

1 Answer

Accepted Answer

For a workflow, you'll want a mixture of triggers, crawlers and jobs. CloudFormation covers these fairly well, but you might still need some Custom Resources and/or something like a Step Function to kick things off.

Example (from the internet, just trying to minimize traffic to the post):

---
Parameters:
  OutputPathLocation:
    Description: Output path of the transformation file
    Type: String
  WorkFlowName:
    Description: Name of the workflow
    Type: String
    Default: test-workflow
  MyScriptLocation:
    Description: Location of ETL script
    Type: String
Resources:
  MyGlueWorkFlow:
    Type: AWS::Glue::Workflow
    Properties:
      Description: test cfn workflow
      Name:
        Ref: WorkFlowName
  MyGlueCrawler:
    Type: AWS::Glue::Crawler
    Properties:
      DatabaseName: cfndb
      Description: My crawler
      Name: MyGlueCrawler
      Role: AWSGlueServiceRole
      TablePrefix: cfn_
      Targets:
        S3Targets:
        - Path: s3://crawler-public-us-east-1/flight/2016/csv
    DependsOn:
    - MyGlueWorkFlow
  MyGlueCrawlerTrigger:
    Type: AWS::Glue::Trigger
    Properties:
      Name: MyCrawlerTrigger
      Type: ON_DEMAND
      Description: Crawler trigger
      WorkflowName:
        Ref: MyGlueWorkFlow
      Actions:
      - CrawlerName:
          Ref: MyGlueCrawler
    DependsOn:
    - MyGlueCrawler
  MyGlueJob:
    Type: AWS::Glue::Job
    Properties:
      Command:
        Name: glueetl
        ScriptLocation:
          Ref: MyScriptLocation
      Description: My workflow job
      GlueVersion: '1.0'
      Name: MyGlueJob
      Role: AWSGlueServiceRole
      DefaultArguments:
        "--outputpath":
          Ref: OutputPathLocation
    DependsOn:
    - MyGlueCrawler
  MyGlueJobTrigger:
    Type: AWS::Glue::Trigger
    Properties:
      Name: MyGlueJobTrigger
      Type: CONDITIONAL
      Description: Job trigger
      WorkflowName:
        Ref: MyGlueWorkFlow
      StartOnCreation: 'true'
      Actions:
      - JobName:
          Ref: MyGlueJob
      Predicate:
        Conditions:
        - LogicalOperator: EQUALS
          CrawlerName:
            Ref: MyGlueCrawler
          CrawlState: SUCCEEDED
        Logical: ANY
    DependsOn:
    - MyGlueJob


AWS
EXPERT
Raphael
answered 4 years ago
  • Where do we run this code? Which script should I use to run it? I tried with boto3, but I was only able to create the workflow; I wasn't able to add triggers or jobs.
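
On where to run it: the YAML above is a CloudFormation template, so it is deployed as a stack rather than executed as a script. That can be done from the console, with `aws cloudformation deploy`, or with boto3 as in the rough sketch below. The function names, stack name and file path are hypothetical, and `CAPABILITY_IAM` is included on the assumption that the template may create IAM resources (templates with custom-named IAM resources may need `CAPABILITY_NAMED_IAM` instead).

```python
"""Sketch: deploy a CloudFormation template file as a stack with boto3."""


def build_create_stack_request(stack_name, template_body, parameters):
    """Translate a plain dict of parameters into create_stack kwargs."""
    return {
        "StackName": stack_name,
        "TemplateBody": template_body,
        "Parameters": [
            {"ParameterKey": key, "ParameterValue": value}
            for key, value in parameters.items()
        ],
        # Assumption: acknowledges that the template may create IAM resources.
        "Capabilities": ["CAPABILITY_IAM"],
    }


def deploy(cloudformation, stack_name, template_path, parameters):
    """Create the stack and block until CloudFormation reports completion."""
    with open(template_path, encoding="utf-8") as handle:
        template_body = handle.read()
    request = build_create_stack_request(stack_name, template_body, parameters)
    cloudformation.create_stack(**request)
    waiter = cloudformation.get_waiter("stack_create_complete")
    waiter.wait(StackName=stack_name)
    return request


# Usage (assumes boto3 is installed and credentials are configured; the
# stack name, file name and S3 locations are placeholders):
#
#   import boto3
#   deploy(
#       boto3.client("cloudformation"),
#       "glue-workflow-stack",
#       "workflow.yaml",
#       {
#           "WorkFlowName": "test-workflow",
#           "MyScriptLocation": "s3://example-bucket/scripts/etl.py",
#           "OutputPathLocation": "s3://example-bucket/output/",
#       },
#   )
```

With boto3 against Glue directly, triggers are added to a workflow by passing `WorkflowName` to `create_trigger`; `create_workflow` on its own only creates the empty workflow.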
