Invoking Lambda or an External API from within an AWS Glue Job

We would like to read CSV data from S3 and keep the benefits of AWS Glue (periodic nightly batches, bulk loads, and so on), but instead of writing a Python ETL script we would like to call our API or invoke a Lambda function so that we can ingest the data programmatically. I know we could do this outside of Glue, probably by wiring an S3 trigger to Lambda, but I feel Glue adds value here because we could do ETL plus some pre-processing and post-processing before calling our API.

Some more context: we use the Neptune graph database. Converting customers' CSV files to the Neptune CSV bulk loader format is hard and somewhat manual, so we would like to try one of the following: 1) call our REST API directly, 2) invoke a Lambda function inside a VPC, or 3) (worst case) call the bulk loader API. For any of these to work, we need a way to call an external API from a Glue job. Is that possible, and how? Are there any examples using boto3 that you can point us to?

1 Answer

I am not sure you would get the core benefits of Glue if you are calling an API. The driver would have to handle the API requests, and the executors would not be able to use their compute power to call the API in parallel. Given that, I believe you would not be using the power of Glue unless you use PySpark/DynamicFrames to process the data. It may be more efficient (and less expensive) to orchestrate a Lambda function that reads from S3, calls the API, does the transformation, and writes back to S3 before a Glue job processes/transforms the data for your ETL.

That said, there may be a use case for what you want to implement. If you want to try calling an API from Glue using Python code, you could try the following code.

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
import boto3  ## Library for invoking Lambda

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

## your ETL logic prior to invoking Lambda

## Once the ETL completes
lambda_client = boto3.client('lambda')  
response = lambda_client.invoke(FunctionName='LambdaName')  

## Your ETL code after invoking Lambda

## Mark the job run as complete
job.commit()
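
If the Lambda function needs input from the job and the job needs its result, the invoke call above can be extended with a payload. The following is only a minimal sketch: the helper name invoke_lambda_with_payload and the payload contents are hypothetical, and it assumes the function returns JSON. The client is passed in so the helper can be used with boto3.client('lambda') as created above.

```python
import json


def invoke_lambda_with_payload(lambda_client, function_name, payload):
    """Invoke a Lambda function synchronously and return its decoded JSON result.

    `lambda_client` is expected to be a boto3 Lambda client; this helper
    name and the payload shape are illustrative assumptions.
    """
    response = lambda_client.invoke(
        FunctionName=function_name,
        InvocationType="RequestResponse",  # wait for the function to finish
        Payload=json.dumps(payload).encode("utf-8"),
    )
    # The response payload is a stream, so it has to be read explicitly
    body = response["Payload"].read().decode("utf-8")
    # Lambda sets FunctionError when the function code itself raised an error
    if "FunctionError" in response:
        raise RuntimeError(f"Lambda {function_name} failed: {body}")
    return json.loads(body) if body else None
```

In the job this might be called as invoke_lambda_with_payload(lambda_client, 'LambdaName', {'s3_path': '...'}), where the payload keys are whatever your function expects. Raising on FunctionError makes the Glue job fail instead of silently continuing after a failed invocation.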

If you want to call an external API, you need to install the requests module using the --additional-python-modules job parameter and then use the code below:

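import requests

url = "https://example.com/api/jobs/test"
response = requests.post(url, timeout=30)  # timeout so a stalled API cannot hang the job
print(response.text)  # response body (text/HTML)
print(response.status_code, response.reason)  # HTTP status code and reason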
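
For ingesting actual data rather than a single test call, the rows would probably need to be sent in batches. The sketch below is one possible way to do that, not a definitive implementation: the helper names chunked and post_in_batches, the batch size, and the {"records": [...]} request body are all assumptions, so adapt them to whatever your API expects. The session argument is meant to be a requests.Session().

```python
def chunked(rows, size):
    """Yield successive lists of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def post_in_batches(session, url, rows, batch_size=100, timeout=30):
    """POST rows to `url` as JSON batches and return the number of rows sent.

    `session` is expected to behave like requests.Session(); the
    {"records": [...]} body shape is a hypothetical example.
    """
    sent = 0
    for batch in chunked(rows, batch_size):
        response = session.post(url, json={"records": batch}, timeout=timeout)
        response.raise_for_status()  # raise on 4xx/5xx so the job fails loudly
        sent += len(batch)
    return sent
```

A call might look like post_in_batches(requests.Session(), url, rows), where rows is a list of dicts collected on the driver. Note that this still runs on the driver only, so it suits modest data volumes.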
answered 2 months ago
