
How to load an ETL script to an S3 bucket using a YAML CloudFormation stack


I have been writing CloudFormation stacks in YAML and deploying them to AWS infrastructure (for legacy reasons, I unfortunately cannot switch to the CDK).

The following YAML is part of the CloudFormation stack. It creates a Glue job that loads the ETL script (transform_json_to_parquet.py) from an S3 bucket (see the ScriptLocation line below).

A major limitation of this approach:

  • It expects transform_json_to_parquet.py to already be present in S3-bucket-name-1, so I have to upload the file there manually.

Is there any way to upload the transform_json_to_parquet.py file automatically when I deploy the CloudFormation stack to AWS?

  TransformJsonDataJob:
    Type: "AWS::Glue::Job"
    Properties:
      Role: !Ref AWSGlueETLJobRole  
      Name: "TransformJsonToParquet"
      Description: "Transform JSON to Parquet"
      Timeout: 5
      WorkerType: G.1X
      NumberOfWorkers: 2
      MaxRetries: 0
      Command:
        "Name": "glueetl"
        "ScriptLocation" : !Sub s3://<S3-bucket-name-1>/transform_json_to_parquet.py
      DefaultArguments: 
        "--s3_json_path" : !Sub s3://<S3-bucket-name-2>/
        "--s3_parquet_path" : !Sub s3://<S3-bucket-name-3>/
2 Answers
Accepted Answer

There's no built-in mechanism in CloudFormation to upload objects to an S3 bucket. Technically, you could accomplish that with a custom CloudFormation resource (details in https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/template-custom-resources-lambda.html). The Lambda function could copy or upload the content to the S3 location.

However, given the filename, I assume the script will be the same for every account. If you're deploying it within a single organisation, or offering it as a service to the clients of your service provider organisation, you could avoid the whole issue by creating a central S3 bucket, uploading the file there once, and authorising your entire AWS Organizations organisation, your customers' organisations, or simply a list of authorised AWS account IDs to read the object from that central bucket.

This way, you wouldn't need to create the S3 bucket to host the code in every account separately, instead loading it from a single, central bucket.
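As a rough sketch of the central-bucket approach, a bucket policy can grant read access to every account in an AWS Organization via the aws:PrincipalOrgID condition key. The resource name and organisation ID below are placeholders, not values from the question:

```yaml
# Sketch only: CentralScriptBucket and o-exampleorgid are placeholders.
CentralScriptBucketPolicy:
  Type: AWS::S3::BucketPolicy
  Properties:
    Bucket: !Ref CentralScriptBucket
    PolicyDocument:
      Version: '2012-10-17'
      Statement:
        - Sid: AllowOrgRead
          Effect: Allow
          Principal: "*"                 # restricted by the condition below
          Action: s3:GetObject
          Resource: !Sub "arn:aws:s3:::${CentralScriptBucket}/*"
          Condition:
            StringEquals:
              aws:PrincipalOrgID: o-exampleorgid
```

With a policy like this in place, the Glue job's ScriptLocation in every member account can simply point at the central bucket.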

EXPERT
answered a year ago
EXPERT
reviewed a year ago

Hello!

As an alternative to creating a central S3 bucket, you can add a Lambda function to the CloudFormation template. That way you would not need to upload the file manually.

For example:

  LambdaExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole
      Policies:
        - PolicyName: LambdaS3Policy
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - s3:PutObject
                Resource: !Sub "arn:aws:s3:::${ScriptBucket}/*"

  UploadScriptFunction:
    Type: AWS::Lambda::Function
    Properties:
      Handler: index.handler
      Role: !GetAtt LambdaExecutionRole.Arn
      Runtime: python3.8
      Code:
        ZipFile: |
          import boto3
          import os

          def handler(event, context):
              s3 = boto3.client('s3')
              bucket_name = os.environ['BUCKET_NAME']
              script_content = """
              # Your ETL script content goes here
              """
              s3.put_object(Bucket=bucket_name, Key='transform_json_to_parquet.py', Body=script_content)

      Environment:
        Variables:
          BUCKET_NAME: !Ref ScriptBucket
AWS
answered a year ago
  • That would deploy, and leave behind in every account the stack gets deployed to, a Lambda function using the oldest version of Python that Lambda still supports today (it will be deprecated on October 14, 2024), along with an IAM role that is used for nothing else but can be assumed by all Lambda functions in the account. The declaration above also doesn't contain anything to invoke the function, so the bucket would remain empty.
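For completeness, the gaps the comment points out could be filled in roughly as follows: wire the function up as a custom resource (which is what actually invokes it during deployment), use a currently supported runtime, and have the handler signal completion back to CloudFormation with the cfn-response module that AWS injects into inline (ZipFile) Python functions. This is only a sketch; the resource names are illustrative and not taken from the original answer:

```yaml
# Sketch: invoking the upload function via a custom resource.
# Without the cfnresponse.send() call the stack would hang until timeout.
  UploadScriptFunction:
    Type: AWS::Lambda::Function
    Properties:
      Handler: index.handler
      Role: !GetAtt LambdaExecutionRole.Arn
      Runtime: python3.12          # a currently supported runtime
      Timeout: 30
      Code:
        ZipFile: |
          import os
          import boto3
          import cfnresponse  # provided by CloudFormation for ZipFile code

          def handler(event, context):
              try:
                  if event['RequestType'] in ('Create', 'Update'):
                      s3 = boto3.client('s3')
                      s3.put_object(
                          Bucket=os.environ['BUCKET_NAME'],
                          Key='transform_json_to_parquet.py',
                          Body='# Your ETL script content goes here\n')
                  cfnresponse.send(event, context, cfnresponse.SUCCESS, {})
              except Exception:
                  cfnresponse.send(event, context, cfnresponse.FAILED, {})
      Environment:
        Variables:
          BUCKET_NAME: !Ref ScriptBucket

  # The custom resource is what triggers the function on stack create/update.
  UploadScriptResource:
    Type: Custom::UploadScript
    Properties:
      ServiceToken: !GetAtt UploadScriptFunction.Arn
```

Embedding the whole ETL script inline like this is still awkward for anything non-trivial, which is why the accepted answer's central-bucket approach is usually the simpler option.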
