Optimizing Levenshtein User-Defined Function in Amazon Redshift

5 minute read

Content level: Intermediate

Implement and optimize user defined function using Python and AWS Lambda function

Introduction

Bringing your code closer to your data will typically improve performance. Writing user-defined database functions to encapsulate and cache the execution of your code is a proven solution to optimize, and simplify implementing your code with your data.

For example, let's say you want to implement Levenshtein Distance in Amazon Redshift using a user-defined function that provides the best performance and is easy to implement. This article will discuss how to implement this function in Amazon Redshift and how you can optimize the execution.

Code

Note: The code used in the article is inspired from the Wikipedia article on Levenshtein Distance with a small change to help improve performance.

Amazon Redshift supports CREATE FUNCTION to create scalar functions in SQL and Python languages. The following PL/Python function example of the Levenshtein Distance function is simple to implement without any external dependencies. The function takes advantage of the parallel architecture of Amazon Redshift and performs very well. However, beyond setting the volatility (VOLATILE | STABLE | IMMUTABLE), there aren't ways to configure the function to run faster without increasing your Amazon Redshift data warehouse size.

Python UDF

Amazon Redshift also supports CREATE EXTERAL FUNCTION which enables you to invoke an AWS Lambda function from Amazon Redshift. The following AWS Lambda function example is exactly the same as the PL/Python example above of the Levenshtein Distance function and can be deployed with AWS CloudFormation.

Lambda UDF

The external function itself is pretty simple as it just points to the above AWS Lambda function.

Create External Function

Note: Be sure to use an iam_role that has "AWSLambda_FullAccess". More information on how to configure AWS Lambda function execution can be found here.

Cost Considerations

Executing a PL/Python function in Amazon Redshift does not incur an additional cost while invoking an AWS Lambda function from Amazon Redshift can. At the time of this article, the AWS Lambda free tier includes one million free requests per month and 400,000 GB-seconds of compute time per month. Billing is usage based so as you use more, invoke AWS Lambda from Amazon Redshift, the higher the cost. More details on AWS Lambda pricing can be found here.

Tests

For these tests, a 2 node ra3.4xlarge Amazon Redshift cluster was provisioned. Next, the customer table from the popular TPC-DS benchmark was loaded using the 3TB dataset. The Levenshtein distance of the c_first_name and c_last_name columns were calculated and stored in a new table using the Create Table As Select (CTAS) method. The new tables for each test used the same distribution key as the source customer table to eliminate data movement across the network. Lastly, the execution times for each test were captured and then compared.

Test Setup

create table customer
(
  c_customer_sk int4 not null ,                 
  c_customer_id char(16) not null ,             
  c_current_cdemo_sk int4 ,   
  c_current_hdemo_sk int4 ,   
  c_current_addr_sk int4 ,    
  c_first_shipto_date_sk int4 ,                 
  c_first_sales_date_sk int4 ,
  c_salutation char(10) ,     
  c_first_name char(20) ,     
  c_last_name char(30) ,      
  c_preferred_cust_flag char(1) ,               
  c_birth_day int4 ,          
  c_birth_month int4 ,        
  c_birth_year int4 ,         
  c_birth_country varchar(20) ,                 
  c_login char(13) ,          
  c_email_address char(50) ,  
  c_last_review_date_sk int4 ,
  primary key (c_customer_sk)
) distkey(c_customer_sk);

copy customer from 's3://redshift-downloads/TPC-DS/2.13/3TB/customer/' 
iam_role default gzip delimiter '|' EMPTYASNULL region 'us-east-1';

The full TPC-DS benchmark can be found in Github.

PL/Python

Step 1 - Create the PL/Python Function

Refer to the PL/Python Source code

Step 2 - Create new table using PL/Python function

drop table if exists levenshtein_plpython; 
  
create table levenshtein_plpython distkey(c_customer_sk) as 
select c_customer_sk, fn_levenshtein_distance(c_first_name, c_last_name) as distance
from customer;

AWS Lambda

Step 1 - Launch CloudFormation Template

Use the following CloudFormation Template that deploys AWS Lambda Function.

Step 2 - Create the external function in Amazon Redshift

Create the following external function in Amazon Redshift. Be sure to set the iam_role correctly so that the AWS Lambda function can execute from Redshift.

Step 3 - Create new table using AWS Lambda function

drop table if exists levenshtein_lambda;

create table levenshtein_lambda distkey(c_customer_sk) as 
select c_customer_sk, fn_lambda_levenshtein_distance(c_first_name, c_last_name) as distance
from customer;

Optimizing AWS Lambda

If you are executing these tests yourself, you may have noticed that the PL/Python function executed faster than the AWS Lambda function. However, we are using the default memory and ephemeral storage in AWS Lambda.

Next, go to the AWS console and find the AWS Lambda service and find the fn_lambda_levenshtein_distance function.

Lambda Configuration

Click on the Configuration tab and adjust the memory and storage settings as shown in the table below. For each test, execute step 3 again and capture the execution time.

Optimizing Tests

Test	Memory (MB)	Ephemeral Storage (MB)
1	256	1024
2	512	2048
3	1024	4096
4	2048	8192
5	4096	10240*

*10240 is the max storage in AWS Lambda

Results

The graph above depicts the execution times for each of the tests in seconds. As you can see, the default memory and storage configuration for AWS Lambda yields worse results than PL/Python but with some very simple tuning, you can achieve much better results with AWS Lambda. Additionally, there is a point of diminishing returns with using more memory as tests 4 and 5 yield virtually the same results despite having 2x more memory than test 5.

Summary

Amazon Redshift supports built-in language support as well as external language support with AWS Lambda. The built-in support of Python is easy to use and performs very well. AWS Lambda however, can also be used and when configured, can outperform the built-in language support.

Authors

Jon Roberts (AWS) and Harshida Patel (AWS)

Topics

Serverless Compute Analytics

Relevant content

Use generic logic to manage data warehouse change data capture (CDC) in Amazon Redshift
EXPERT
Sean Beath
published 3 months ago
Combining AWS IoT Core with Step Functions to optimize IoT Data ingestion and consumption
EXPERT
Charlyscott237
published 4 months ago
Bringing Generative AI to the data warehouse with Amazon Bedrock and Amazon Redshift
EXPERT
Rick Fraser
published 3 months ago
Update existing MySQL database to Optimized Writes?
AWS-User-8949122
asked a year ago
How to load data to Redshift serverless from step function
Accepted Answer
kimar
asked a month ago
Step Functions with call to Redshift Data API:ExecuteStatement does not wait for Redshift Function/Stored Procedure completion.
Accepted Answer
rePost-User-2779526
asked 2 years ago
How do I create an AWS Lambda function that runs queries in Amazon Redshift?
AWS OFFICIALUpdated a year ago
How do I use logs to track activity in my Amazon Redshift database cluster?
AWS OFFICIALUpdated 2 months ago
How do I scale my request rate to Amazon S3 and improve my request rate performance?
AWS OFFICIALUpdated 2 months ago
How do I create a cross-account Lambda function to access Amazon Redshift in a different account?
AWS OFFICIALUpdated 5 months ago