Daily NAS-to-S3 file transfer over Direct Connect: AWS DataSync vs. Glue/Lambda direct access...

0

Business Requirement

I need to ingest CSV files from on-premises NAS storage to AWS S3 for downstream processing.

Current Setup

Source: On-premises NAS storage
File characteristics: 4-5 CSV files daily, each under 10 MB (~20-50 MB total daily)
Connectivity: AWS Direct Connect established between on-premises and AWS
Target: AWS S3, followed by AWS Glue ETL processing

Question

While AWS DataSync is the commonly recommended solution for transferring files from on-premises NAS to S3, I'm exploring whether compute services can directly read from the NAS and write to S3:

Can the following AWS services directly access on-premises NAS shares over Direct Connect to copy files to S3?

AWS Glue Python Shell jobs - using custom Python libraries (e.g., pysmb, NFS clients)
AWS Glue Spark jobs - with custom connectors or mount logic
AWS Lambda functions - with VPC configuration and file system libraries

Context for the Question

Why I'm exploring this approach:

Small daily data volume (under 50 MB) may not justify DataSync infrastructure/costs
Desire to consolidate data ingestion and processing in a single workflow
Potential to reduce architectural complexity (fewer moving parts)

What I'm trying to understand:

Is direct NAS access from these compute services technically feasible and supported?
What are the performance, reliability, and security implications?
How does this compare to DataSync in terms of cost, complexity, and maintainability?
Are there documented patterns or best practices for this approach?
What are the potential pitfalls or limitations I should be aware of?

Desired Guidance

AWS-recommended architecture for this use case
Trade-off analysis between DataSync vs. compute-based file transfer
Reference architectures or documentation for hybrid data ingestion patterns
Real-world implementation experiences with similar requirements
Security and networking best practices for Direct Connect-based data access

2 Answers
0

AWS DataSync is specifically designed for transferring data between on-premises storage systems and AWS storage services like Amazon S3. The service uses a DataSync agent that connects to your existing storage systems using industry-standard protocols like NFS and SMB. When deployed on-premises, the agent can route traffic through your Direct Connect connection to transfer data to S3.

For your use case with Direct Connect connectivity, DataSync would work by having the agent route traffic from your on-premises NAS to the Direct Connect connection, then through a Direct Connect gateway associated with your VPC, and finally to your S3 bucket. The service automatically converts each file to a single S3 object while preserving file permissions and metadata, which can be important for downstream processing.

Regarding direct access from compute services like AWS Glue or Lambda to on-premises NAS over Direct Connect: these services can be configured to run within a VPC and can, in principle, reach on-premises resources over Direct Connect, but this is not the recommended pattern for transferring files from on-premises storage to S3. DataSync is purpose-built for this specific use case.

Key considerations for your scenario:

  1. Transfer optimization: DataSync uses an AWS-designed transfer protocol to optimize data movement and performs automatic integrity checks to ensure data arrives intact.

  2. Metadata preservation: When copying data to S3, DataSync preserves POSIX metadata from NFS shares as S3 object metadata, which is restored if you later copy the objects back to a file system.

  3. Scheduling and monitoring: DataSync supports task scheduling and generates detailed reports for verification and auditing, which aligns well with your daily transfer requirement.

  4. Bandwidth management: The service includes bandwidth optimization and control features, which is valuable when sharing Direct Connect capacity with other workloads.
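
To make the integrity-check point concrete: a custom Glue or Lambda copy would have to verify each file itself. The following is only a minimal sketch of that idea, not how DataSync does it internally. The function names and the use of SHA-256 are assumptions, and the in-memory streams stand in for whatever file handles an SMB/NFS client library and an S3 upload would provide.

```python
import hashlib
import io


def sha256_of(stream, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a binary stream, read in chunks."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def copy_with_verification(source, destination):
    """Copy source to destination, then confirm both sides hash identically.

    Both arguments are hypothetical stand-ins: any seekable binary file
    objects (for example, a file opened from the NAS and a staging buffer).
    """
    source.seek(0)
    expected = sha256_of(source)

    # Copy the whole payload; files here are small (under 10 MB each).
    source.seek(0)
    destination.write(source.read())

    # Re-read what was written and compare digests.
    destination.seek(0)
    actual = sha256_of(destination)
    if actual != expected:
        raise IOError("checksum mismatch after copy")
    return expected


if __name__ == "__main__":
    src = io.BytesIO(b"id,value\n1,42\n")
    dst = io.BytesIO()
    print(copy_with_verification(src, dst))
```

In a custom pipeline the same digest could also be stored alongside the object so that the downstream Glue job can re-check it before processing.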

For your small daily data volume (20-50 MB), DataSync's pay-as-you-go pricing model means you only pay for the data you transfer, making it cost-effective even for modest transfer volumes. The service also simplifies your architecture by handling the transfer mechanics, allowing your Glue ETL jobs to focus purely on data processing rather than data acquisition.
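
As a rough way to put numbers on the per-GB charge, here is a small back-of-the-envelope sketch. The rate of 0.0125 USD per GB is an assumed placeholder, not a quoted price, so check the current DataSync pricing page for your Region and task mode. The sketch also ignores other costs such as running the agent VM and S3 requests.

```python
def monthly_transfer_cost(mb_per_day, days_per_month=30, usd_per_gb=0.0125):
    """Estimate the monthly per-GB transfer charge.

    usd_per_gb is an ASSUMED rate for illustration only; replace it with
    the figure from the current pricing page. 1 GB is treated as 1000 MB.
    """
    gb_per_month = mb_per_day * days_per_month / 1000
    return gb_per_month * usd_per_gb


if __name__ == "__main__":
    # Upper end of the stated volume: 50 MB per day.
    print(f"{monthly_transfer_cost(50):.4f} USD per month at the assumed rate")
```

At the upper end of the stated volume (50 MB per day, about 1.5 GB per month), the per-GB component comes to a couple of cents per month under that assumed rate.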

The recommended approach would be to use DataSync for the file transfer from your NAS to S3, then trigger your Glue ETL processing once the transfer completes. This separation of concerns improves reliability and maintainability and follows AWS best practices for hybrid data ingestion patterns.
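
One possible way to wire "transfer, then ETL" together is a small orchestration function. This is a sketch under assumptions rather than a reference implementation: the task ARN and job name are hypothetical, and the function takes the clients as parameters so it can be exercised without AWS access. With boto3 the clients would be `boto3.client("datasync")` and `boto3.client("glue")`; the calls used are `start_task_execution`, `describe_task_execution`, and `start_job_run`.

```python
import time

# DataSync task execution statuses that end the polling loop.
TERMINAL_STATES = {"SUCCESS", "ERROR"}


def run_transfer_then_etl(datasync, glue, task_arn, job_name,
                          poll_seconds=30, sleep=time.sleep):
    """Start a DataSync task, wait for it to finish, then start a Glue job.

    datasync and glue are client objects (boto3 clients in real use).
    task_arn and job_name are hypothetical values supplied by the caller.
    Returns the Glue JobRunId; raises if the transfer did not succeed.
    """
    execution_arn = datasync.start_task_execution(
        TaskArn=task_arn
    )["TaskExecutionArn"]

    # Poll until the execution reaches a terminal status.
    while True:
        status = datasync.describe_task_execution(
            TaskExecutionArn=execution_arn
        )["Status"]
        if status in TERMINAL_STATES:
            break
        sleep(poll_seconds)

    if status != "SUCCESS":
        raise RuntimeError(f"DataSync execution ended with status {status}")

    # Only start ETL once the files are known to be in S3.
    return glue.start_job_run(JobName=job_name)["JobRunId"]
```

Polling keeps the example short. An event-driven trigger or a workflow service would avoid holding a process open while the transfer runs.
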
Sources
DataSync architecture and routing examples with Direct Connect - AWS DataSync
Online Data Transfer and Migration – AWS DataSync – Amazon Web Services
Transferring your data with AWS DataSync - AWS DataSync

answered 2 months ago
EXPERT
reviewed 2 months ago
  • Can you please elaborate on the reasons it's not recommended? I'm looking for a data-driven answer on the potential issues of going with the custom solutions mentioned in the question. "Regarding direct access from compute services like AWS Glue or Lambda to on-premises NAS over Direct Connect: While these services can be configured to run within a VPC and theoretically access resources over Direct Connect, this approach is not the recommended pattern for transferring files from on-premises storage to S3."

-2

Gzipping the files before sending them to S3 will reduce storage and transfer costs.
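
A minimal sketch of the compression step, using only the Python standard library. The function name is hypothetical, and how the bytes are read from the NAS and written to S3 is left out. How much space is saved depends on the data.

```python
import gzip


def gzip_csv_bytes(raw):
    """Return the gzip-compressed form of a CSV payload held in memory."""
    return gzip.compress(raw)


if __name__ == "__main__":
    sample = b"id,value\n" + b"1,42\n" * 1000
    packed = gzip_csv_bytes(sample)
    print(len(sample), "bytes before,", len(packed), "bytes after")
```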

EXPERT
answered 2 months ago
