Is there a way to create a Glue workflow programmatically?
I looked at CloudFormation, but the only resource I found creates an empty Workflow (just the workflow name, description, and properties).
https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-glue-workflow.html
I looked at the APIs as well (https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-workflow.html): even though the data types for all the structures are there, the only create API again just adds the blank box.
Am I missing something? How is the workflow created from a blueprint in Lake Formation?
Is it some sort of pre-assembled JSON file that is simply linked to the workflow in Glue?
Can you do something similar yourself, or do you need to wait for customizable blueprints?
Thank you
UPDATE:
As can be derived from the snippet of code in the accepted answer, the key is that it is actually the
AWS::Glue::Trigger
resource that lets you build out the Workflow.
Specifically, you need to:
- Create the Workflow with AWS::Glue::Workflow
- If needed, create the Database and Connection as well (AWS::Glue::Database, AWS::Glue::Connection)
- Create every Crawler and Job you want to add to the workflow, using AWS::Glue::Crawler or AWS::Glue::Job
- Create a first Trigger (AWS::Glue::Trigger) with Type: ON_DEMAND, with Actions set to the first crawler or job your workflow needs to launch, and with WorkflowName referencing the Workflow created in step 1
- Create every other Trigger with Type: CONDITIONAL
Below is an example. It creates a Workflow that launches a Crawler on an S3 bucket of CloudTrail logs (cloudtraillogs) and, if the crawl succeeds, launches a Python shell script that changes the table and partition schemas so that they work with Athena.
Hope this helps.
---
AWSTemplateFormatVersion: '2010-09-09'
Description: Creates a CloudTrail crawler and catalog for Athena, and a job to transform the logs to Parquet
Parameters:
  CloudtrailS3:
    Type: String
    Description: Enter the unique bucket name where the CloudTrail logs are stored
  CloudtrailS3Path:
    Type: String
    Description: Enter the path/prefix that you want to crawl
  CloudtrailDataLakeS3:
    Type: String
    Description: Enter the unique bucket name for the data lake in which to store the logs in Parquet format
Resources:
  CloudTrailGlueExecutionRole:
    Type: 'AWS::IAM::Role'
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - glue.amazonaws.com
            Action:
              - 'sts:AssumeRole'
      Path: /
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole
  GluePolicy:
    Type: AWS::IAM::Policy
    Properties:
      PolicyName: glue_cloudtrail_S3_policy
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Action:
              - s3:GetBucketLocation
              - s3:GetObject
              - s3:PutObject
              - s3:ListBucket
            Effect: Allow
            Resource:
              - !Join ['', ['arn:aws:s3:::', !Ref CloudtrailS3]]
              - !Join ['', ['arn:aws:s3:::', !Ref CloudtrailS3, '/*']]
              - !Join ['', ['arn:aws:s3:::', !Ref CloudtrailDataLakeS3]]
              - !Join ['', ['arn:aws:s3:::', !Ref CloudtrailDataLakeS3, '/*']]
          - Action:
              - s3:DeleteObject
            Effect: Allow
            Resource:
              - !Join ['', ['arn:aws:s3:::', !Ref CloudtrailDataLakeS3]]
              - !Join ['', ['arn:aws:s3:::', !Ref CloudtrailDataLakeS3, '/*']]
      Roles:
        - Ref: CloudTrailGlueExecutionRole
  GlueWorkflow:
    Type: AWS::Glue::Workflow
    Properties:
      Description: Workflow to crawl the cloudtrail logs
      Name: cloudtrail_discovery_workflow
  GlueDatabaseCloudTrail:
    Type: AWS::Glue::Database
    Properties:
      # The database is created in the Data Catalog for your account
      CatalogId: !Ref AWS::AccountId
      DatabaseInput:
        Name: cloudtrail_db
        Description: Database to hold tables for the CloudTrail data
        LocationUri: !Ref CloudtrailDataLakeS3
  GlueCrawlerCTSource:
    Type: AWS::Glue::Crawler
    Properties:
      Name: cloudtrail_source_crawler
      Role: !GetAtt CloudTrailGlueExecutionRole.Arn
      # Classifiers: none, use the default classifier
      Description: AWS Glue crawler to crawl cloudtrail logs
      Schedule:
        ScheduleExpression: 'cron(0 9 * * ? *)'
      DatabaseName: !Ref GlueDatabaseCloudTrail
      Targets:
        S3Targets:
          - Path: !Sub
              - s3://${bucket}/${path}
              - {
                  bucket: !Ref CloudtrailS3,
                  path: !Ref CloudtrailS3Path
                }
            Exclusions:
              - '*/CloudTrail-Digest/**'
              - '*/Config/**'
      # TablePrefix: ''
      SchemaChangePolicy:
        UpdateBehavior: "UPDATE_IN_DATABASE"
        DeleteBehavior: "LOG"
      Configuration: "{\"Version\":1.0,\"CrawlerOutput\":{\"Partitions\":{\"AddOrUpdateBehavior\":\"InheritFromTable\"},\"Tables\":{\"AddOrUpdateBehavior\":\"MergeNewColumns\"}}}"
  GlueJobConvertTable:
    Type: AWS::Glue::Job
    Properties:
      Name: ct_change_table_schema
      Role:
        Fn::GetAtt: [CloudTrailGlueExecutionRole, Arn]
      ExecutionProperty:
        MaxConcurrentRuns: 1
      GlueVersion: '1.0'
      Command:
        Name: pythonshell
        PythonVersion: '3'
        ScriptLocation: !Sub
          - s3://${bucket}/python/ct_change_table_schema.py
          - { bucket: !Ref CloudtrailDataLakeS3 }
      DefaultArguments:
        '--TempDir': !Sub
          - s3://${bucket}/glue_tmp/
          - { bucket: !Ref CloudtrailDataLakeS3 }
        '--job-bookmark-option': 'job-bookmark-disable'
        '--enable-metrics': ''
    DependsOn:
      - CloudTrailGlueExecutionRole
  GlueSourceCrawlerTrigger:
    Type: AWS::Glue::Trigger
    Properties:
      Name: ct_start_source_crawl_Trigger
      Type: ON_DEMAND
      Description: Source Crawler trigger
      WorkflowName: !Ref GlueWorkflow
      Actions:
        - CrawlerName:
            Ref: GlueCrawlerCTSource
    DependsOn:
      - GlueCrawlerCTSource
  GlueJobTrigger:
    Type: AWS::Glue::Trigger
    Properties:
      Name: ct_change_schema_Job_Trigger
      Type: CONDITIONAL
      Description: Job trigger
      WorkflowName: !Ref GlueWorkflow
      StartOnCreation: true
      Actions:
        - JobName: !Ref GlueJobConvertTable
      Predicate:
        Conditions:
          - LogicalOperator: EQUALS
            CrawlerName: !Ref GlueCrawlerCTSource
            CrawlState: SUCCEEDED
        Logical: ANY
    DependsOn:
      - GlueJobConvertTable
Where do we run this code? Which tool should I use to deploy this template? I tried with boto3: I was only able to create the workflow, and wasn't able to add the triggers/jobs.
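Regarding the boto3 follow-up: the same pattern from the template (workflow shell, ON_DEMAND trigger, CONDITIONAL trigger) maps onto the Glue API's `create_workflow` and `create_trigger` calls; it is the trigger's `WorkflowName` parameter that attaches the crawler/job to the workflow. Below is a minimal sketch under that assumption; the crawler `cloudtrail_source_crawler` and job `ct_change_table_schema` are assumed to already exist (created separately with `create_crawler`/`create_job`), and all names are illustrative. The call payloads are built separately from the API calls so they can be inspected without AWS credentials.

```python
def build_workflow_calls(workflow, crawler, job):
    """Return (method_name, kwargs) pairs for the Glue API calls that
    assemble a workflow: the empty workflow, an ON_DEMAND trigger that
    starts the first crawler, and a CONDITIONAL trigger that starts the
    job once the crawl succeeds."""
    return [
        ("create_workflow", {
            "Name": workflow,
            "Description": "Workflow to crawl the cloudtrail logs",
        }),
        ("create_trigger", {
            "Name": f"{workflow}_start_trigger",
            # WorkflowName is what places the trigger (and the crawler it
            # starts) inside the workflow graph.
            "WorkflowName": workflow,
            "Type": "ON_DEMAND",
            "Actions": [{"CrawlerName": crawler}],
        }),
        ("create_trigger", {
            "Name": f"{workflow}_job_trigger",
            "WorkflowName": workflow,
            "Type": "CONDITIONAL",
            "StartOnCreation": True,
            "Predicate": {
                "Logical": "ANY",
                "Conditions": [{
                    "LogicalOperator": "EQUALS",
                    "CrawlerName": crawler,
                    "CrawlState": "SUCCEEDED",
                }],
            },
            "Actions": [{"JobName": job}],
        }),
    ]


def assemble_workflow(glue_client, workflow, crawler, job):
    """Issue the calls in order against a boto3 Glue client."""
    for method, kwargs in build_workflow_calls(workflow, crawler, job):
        getattr(glue_client, method)(**kwargs)
```

Usage would be something like `assemble_workflow(boto3.client("glue"), "cloudtrail_discovery_workflow", "cloudtrail_source_crawler", "ct_change_table_schema")`. For the CloudFormation template itself, no script is needed: you deploy it as a stack via the CloudFormation console or the AWS CLI.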