Optimizing Levenshtein User-Defined Function in Amazon Redshift
5 minute read
Implement and optimize user defined function using Python and AWS Lambda function
Bringing your code closer to your data will typically improve performance. Writing user-defined database functions to encapsulate and cache the execution of your code is a proven solution to optimize, and simplify implementing your code with your data.
For example, let's say you want to implement Levenshtein Distance in Amazon Redshift using a user-defined function that provides the best performance and is easy to implement. This article will discuss how to implement this function in Amazon Redshift and how you can optimize the execution.
Note: The code used in the article is inspired from the Wikipedia article on Levenshtein Distance with a small change to help improve performance.
Amazon Redshift supports CREATE FUNCTION to create scalar functions in SQL and Python languages. The following PL/Python function example of the Levenshtein Distance function is simple to implement without any external dependencies. The function takes advantage of the parallel architecture of Amazon Redshift and performs very well. However, beyond setting the volatility (VOLATILE | STABLE | IMMUTABLE), there aren't ways to configure the function to run faster without increasing your Amazon Redshift data warehouse size.
Amazon Redshift also supports CREATE EXTERAL FUNCTION which enables you to invoke an AWS Lambda function from Amazon Redshift. The following AWS Lambda function example is exactly the same as the PL/Python example above of the Levenshtein Distance function and can be deployed with AWS CloudFormation.
The external function itself is pretty simple as it just points to the above AWS Lambda function.
Note: Be sure to use an iam_role that has "AWSLambda_FullAccess". More information on how to configure AWS Lambda function execution can be found here.
Executing a PL/Python function in Amazon Redshift does not incur an additional cost while invoking an AWS Lambda function from Amazon Redshift can. At the time of this article, the AWS Lambda free tier includes one million free requests per month and 400,000 GB-seconds of compute time per month. Billing is usage based so as you use more, invoke AWS Lambda from Amazon Redshift, the higher the cost. More details on AWS Lambda pricing can be found here.
For these tests, a 2 node ra3.4xlarge Amazon Redshift cluster was provisioned. Next, the customer table from the popular TPC-DS benchmark was loaded using the 3TB dataset. The Levenshtein distance of the c_first_name and c_last_name columns were calculated and stored in a new table using the Create Table As Select (CTAS) method. The new tables for each test used the same distribution key as the source customer table to eliminate data movement across the network. Lastly, the execution times for each test were captured and then compared.
Step 2 - Create new table using PL/Python function
drop table if exists levenshtein_plpython;
create table levenshtein_plpython distkey(c_customer_sk) as
select c_customer_sk, fn_levenshtein_distance(c_first_name, c_last_name) as distance
Step 2 - Create the external function in Amazon Redshift
Create the following external function in Amazon Redshift. Be sure to set the iam_role correctly so that the AWS Lambda function can execute from Redshift.
Step 3 - Create new table using AWS Lambda function
drop table if exists levenshtein_lambda;
create table levenshtein_lambda distkey(c_customer_sk) as
select c_customer_sk, fn_lambda_levenshtein_distance(c_first_name, c_last_name) as distance
Optimizing AWS Lambda
If you are executing these tests yourself, you may have noticed that the PL/Python function executed faster than the AWS Lambda function. However, we are using the default memory and ephemeral storage in AWS Lambda.
Next, go to the AWS console and find the AWS Lambda service and find the fn_lambda_levenshtein_distance function.
Click on the Configuration tab and adjust the memory and storage settings as shown in the table below. For each test, execute step 3 again and capture the execution time.
Ephemeral Storage (MB)
*10240 is the max storage in AWS Lambda
The graph above depicts the execution times for each of the tests in seconds. As you can see, the default memory and storage configuration for AWS Lambda yields worse results than PL/Python but with some very simple tuning, you can achieve much better results with AWS Lambda. Additionally, there is a point of diminishing returns with using more memory as tests 4 and 5 yield virtually the same results despite having 2x more memory than test 5.
Amazon Redshift supports built-in language support as well as external language support with AWS Lambda. The built-in support of Python is easy to use and performs very well. AWS Lambda however, can also be used and when configured, can outperform the built-in language support.