Integrating Amazon FSx with Redshift Using S3 Access Points

5 minute read
Content level: Expert

COPY, Spectrum, and Data API

Amazon FSx and Redshift Integration Guide

Overview

This document provides a comprehensive guide for integrating Amazon FSx for NetApp ONTAP with Amazon Redshift using S3 access points. This integration enables Redshift to access data stored on FSx file systems, providing a unified data access layer.

Architecture

The integration leverages Amazon S3 Access Points for FSx for NetApp ONTAP, which let Redshift read FSx-resident data as if it were in an S3 bucket. With this approach, organizations can:

  • Access data stored on FSx file systems from Redshift without data duplication
  • Maintain a single source of truth for data across multiple services

Setup Steps

Step 1: Create Amazon FSx for NetApp ONTAP file system, storage virtual machine, and volume (if they don't already exist)

Create an Amazon FSx for NetApp ONTAP file system in your AWS account. Ensure it's configured with:

  • Appropriate storage capacity for your data needs
  • Network configuration that allows access from Redshift
  • Security groups that permit NFS traffic

Figure 1: Amazon FSx for NetApp ONTAP File System Configuration

Step 2: Create S3 Access Point for FSx Volume

Create an S3 access point for the volume on your Amazon FSx file system. You will use the S3 access point as the source or destination with Redshift.

Access Point ARN Format:

arn:aws:s3:<region>:<account-id>:accesspoint/<access-point-name>
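As a quick sanity check before wiring the ARN into IAM policies or Redshift roles, the format above can be assembled and validated programmatically. This is an illustrative helper, not part of any AWS SDK; the function names and the validation pattern are assumptions based on the ARN format shown above.

```python
import re

# Illustrative pattern for the access point ARN format shown above
# (region, 12-digit account ID, then the access point name).
ARN_PATTERN = re.compile(
    r"^arn:aws:s3:(?P<region>[a-z0-9-]+):(?P<account>\d{12}):accesspoint/(?P<name>[a-z0-9][a-z0-9-]+)$"
)

def access_point_arn(region, account_id, name):
    """Assemble an S3 access point ARN from its parts."""
    return f"arn:aws:s3:{region}:{account_id}:accesspoint/{name}"

def is_valid_access_point_arn(arn):
    """Return True if the ARN matches the expected access point format."""
    return ARN_PATTERN.match(arn) is not None
```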

Figure 2: FSx Volume Configuration with S3 Access Point Attachment

Figure 3: S3 Access Point Details and Settings

Step 3: Mount FSx and Prepare Data

You can mount the FSx volume on an EC2 instance over NFS, or upload sample data through the S3 access point from your local environment.

Mount Command:

sudo mkdir -p /fsx
sudo mount -t nfs <svm-id>.<file-system-id>.fsx.<region>.amazonaws.com:/<volume-name> /fsx

Figure 4: FSx Volume Mount Instructions for Linux and Windows Instances
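As an alternative to the NFS mount, sample data can be written through the access point with any S3 client, since S3 clients accept an access point alias wherever a bucket name is expected. A minimal sketch (the function and all names are illustrative; in practice the client would be `boto3.client('s3')`):

```python
def upload_sample_object(s3_client, access_point_alias, key, body):
    """Upload one object through the access point alias, passed in place of a bucket name."""
    s3_client.put_object(Bucket=access_point_alias, Key=key, Body=body)
    return f"s3://{access_point_alias}/{key}"
```

The returned `s3://<alias>/<key>` path is the same form used in the COPY commands later in this guide.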

Testing and Validation

Test Plan Overview

The integration was validated through the following test scenarios:

  1. COPY Command Validation - Load data from FSx S3 alias into Redshift tables
  2. Redshift Spectrum External Tables - Create external schema and query data via Spectrum
  3. Programmatic Access via Data API - Automate data loading workflows using Lambda and Redshift Data API

COPY Command - Successful ✅

The COPY command works successfully with FSx S3 access points.

Example Table Schema:

CREATE TABLE lineitem_v2 (
  L_ORDERKEY bigint NOT NULL,
  L_PARTKEY bigint,
  L_SUPPKEY bigint,
  L_LINENUMBER integer NOT NULL,
  L_QUANTITY decimal(18,4),
  L_EXTENDEDPRICE decimal(18,4),
  L_DISCOUNT decimal(18,4),
  L_TAX decimal(18,4),
  L_RETURNFLAG varchar(1),
  L_LINESTATUS varchar(1),
  L_SHIPDATE date,
  L_COMMITDATE date,
  L_RECEIPTDATE date,
  L_SHIPINSTRUCT varchar(25),
  L_SHIPMODE varchar(10),
  L_COMMENT varchar(44)
)
DISTKEY (L_ORDERKEY)
SORTKEY (L_RECEIPTDATE);

COPY Command:

COPY lineitem_v2 
FROM 's3://<fsx-s3-alias>/lineitem-part/'
IAM_ROLE 'arn:aws:iam::<account-id>:role/<redshift-role>'
REGION '<region>' 
GZIP DELIMITER '|' COMPUPDATE PRESET;
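When the same COPY pattern is reused across tables or environments, it can help to render the statement from parameters rather than hand-editing placeholders. A hypothetical helper (not from the original guide; parameter names are illustrative):

```python
def build_copy_sql(table, fsx_s3_alias, prefix, iam_role_arn, region):
    """Render the COPY statement shown above from its parameters."""
    return (
        f"COPY {table} "
        f"FROM 's3://{fsx_s3_alias}/{prefix}' "
        f"IAM_ROLE '{iam_role_arn}' "
        f"REGION '{region}' "
        "GZIP DELIMITER '|' COMPUPDATE PRESET;"
    )
```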

Figure 5: Successful COPY Command Execution - 303 Million Records Loaded from FSx

Redshift Spectrum - Successful ✅

Redshift Spectrum works with FSx S3 access points, allowing you to query data in place on FSx without loading it into Redshift tables.

External Schema Creation:

CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'default'
IAM_ROLE 'arn:aws:iam::<account-id>:role/<redshift-role>'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
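After the external schema exists, Spectrum needs an external table whose LOCATION points at the FSx access point alias before the data can be queried. The original guide does not show this step, so the sketch below is an assumption: the table name, column subset, and delimited-text format are illustrative and would need to match the actual files.

```python
def build_external_table_sql(schema, table, fsx_s3_alias, prefix):
    """Render a CREATE EXTERNAL TABLE statement pointing at the FSx S3 alias."""
    return (
        f"CREATE EXTERNAL TABLE {schema}.{table} (\n"
        "    l_orderkey BIGINT,\n"
        "    l_quantity DECIMAL(15,2),\n"
        "    l_shipdate DATE\n"
        ")\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'\n"
        "STORED AS TEXTFILE\n"
        f"LOCATION 's3://{fsx_s3_alias}/{prefix}';"
    )
```

The rendered DDL can be run in the query editor or submitted through the Data API, after which `SELECT ... FROM spectrum_schema.<table>` reads the FSx-resident files directly.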

Figure 6: Redshift Spectrum Query Results from FSx Data via External Schema

Programmatic Access via Data API

The integration works successfully with Redshift Data API, enabling serverless and programmatic data loading.

Lambda Function Example

import json
import boto3
import logging
import os

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def lambda_handler(event, context):
    # Initialize Redshift Data API client (uses the Lambda function's region)
    redshift_data = boto3.client('redshift-data')
    
    # Redshift Serverless configuration
    workgroup_name = os.environ.get('WORKGROUP_NAME', 'test-wg')
    database = os.environ.get('DATABASE', 'dev')
    
    # Drop and create table
    drop_table_sql = "DROP TABLE IF EXISTS lineitem_v4;"
    
    create_table_sql = """
    CREATE TABLE lineitem_v4 (
        l_orderkey BIGINT,
        l_partkey BIGINT,
        l_suppkey BIGINT,
        l_linenumber INTEGER,
        l_quantity DECIMAL(15,2),
        l_extendedprice DECIMAL(15,2),
        l_discount DECIMAL(15,2),
        l_tax DECIMAL(15,2),
        l_returnflag CHAR(1),
        l_linestatus CHAR(1),
        l_shipdate DATE,
        l_commitdate DATE,
        l_receiptdate DATE,
        l_shipinstruct CHAR(25),
        l_shipmode CHAR(10),
        l_comment VARCHAR(44)
    );
    """
    
    # COPY command
    copy_sql = """
    COPY lineitem_v4 
    FROM 's3://<fsx-s3-alias>/lineitem-part/'
    IAM_ROLE 'arn:aws:iam::<account-id>:role/<redshift-role>'
    REGION '<region>' 
    GZIP DELIMITER '|' COMPUPDATE PRESET;
    """
    
    try:
        # Submit DROP, CREATE, and COPY together so the table is
        # rebuilt before loading
        response = redshift_data.batch_execute_statement(
            WorkgroupName=workgroup_name,
            Database=database,
            Sqls=[drop_table_sql, create_table_sql, copy_sql]
        )
        
        query_id = response['Id']
        logger.info(f"SQL batch submitted with query ID: {query_id}")
        
        # Check initial status (the Data API is asynchronous, so this is
        # typically SUBMITTED or STARTED rather than FINISHED)
        status_response = redshift_data.describe_statement(Id=query_id)
        
        return {
            'statusCode': 200,
            'body': json.dumps({
                'message': 'COPY command submitted successfully',
                'queryId': query_id,
                'status': status_response['Status'],
                'workgroup': workgroup_name,
                'database': database
            })
        }
        
    except Exception as e:
        logger.error(f"Error executing COPY command: {str(e)}")
        return {
            'statusCode': 500,
            'body': json.dumps({
                'error': str(e)
            })
        }

Environment Variables:

  • WORKGROUP_NAME: Your Redshift Serverless workgroup name
  • DATABASE: Target database name (default: 'dev')
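Because the Data API is asynchronous, the Lambda above returns before the COPY finishes. A caller that needs the final outcome can poll `describe_statement` until the query reaches a terminal state. A minimal sketch (the helper name, poll interval, and timeout are illustrative, not part of the original guide):

```python
import time

# Terminal states reported by the Redshift Data API.
TERMINAL_STATES = {"FINISHED", "FAILED", "ABORTED"}

def wait_for_statement(client, query_id, poll_seconds=2, timeout_seconds=300):
    """Poll describe_statement until the query reaches a terminal state."""
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        status = client.describe_statement(Id=query_id)["Status"]
        if status in TERMINAL_STATES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"Statement {query_id} did not finish within {timeout_seconds}s")
```

In real use, `client` would be `boto3.client('redshift-data')` and `query_id` the `Id` returned by `execute_statement` or `batch_execute_statement`.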

Key Findings

  1. COPY Operations: Successfully load data from FSx to Redshift using S3 access point aliases
  2. Large Dataset Support: Tested with 303+ million records without issues
  3. Data API Integration: Programmatic access via Lambda and Redshift Data API works seamlessly
  4. Standard S3 Syntax: Use familiar S3 paths with FSx access point aliases
  5. Redshift Spectrum: Query FSx data directly via external tables without loading into Redshift

Conclusion

The Amazon FSx and Redshift integration via S3 access points provides a robust solution for accessing FSx-hosted data from Redshift. Both COPY operations and Redshift Spectrum queries work seamlessly, enabling flexible data access patterns.

This integration is particularly valuable for organizations that:

  • Need to maintain data on FSx for multi-protocol access (NFS, SMB, S3)
  • Want to query FSx data directly via Spectrum without loading into Redshift
  • Require high-performance file system capabilities alongside analytics workloads
  • Need to load large datasets from FSx into Redshift for complex analytics