AWS DataSync is specifically designed for transferring data between on-premises storage systems and AWS storage services like Amazon S3. The service uses a DataSync agent that connects to your existing storage systems using industry-standard protocols like NFS and SMB. When deployed on-premises, the agent can route traffic through your Direct Connect connection to transfer data to S3.
For your use case with Direct Connect connectivity, DataSync would work by having the agent route traffic from your on-premises NAS to the Direct Connect connection, then through a Direct Connect gateway associated with your VPC, and finally to your S3 bucket. The service automatically converts each file to a single S3 object while preserving file permissions and metadata, which can be important for downstream processing.
Regarding direct access from compute services like AWS Glue or Lambda to on-premises NAS over Direct Connect: While these services can be configured to run within a VPC and theoretically access resources over Direct Connect, this approach is not the recommended pattern for transferring files from on-premises storage to S3. DataSync is purpose-built for this specific use case and provides several advantages in your scenario:
- Transfer optimization: DataSync uses an AWS-designed transfer protocol to optimize data movement and performs automatic integrity checks to ensure data arrives intact.
- Metadata preservation: When copying data to S3, DataSync preserves POSIX metadata from NFS shares as S3 object metadata, which is restored if you later copy objects back to file formats.
- Scheduling and monitoring: DataSync supports task scheduling and generates detailed reports for verification and auditing, which aligns well with your daily transfer requirement.
- Bandwidth management: The service includes bandwidth optimization and control features, which is valuable when sharing Direct Connect capacity with other workloads.
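As a rough illustration of the scheduling, verification, and bandwidth points above, here is a minimal sketch of how a DataSync task request could be assembled for boto3's `create_task` call. The task name, the cron expression, the 80 Mbps cap, and the location ARNs are placeholders assumed for illustration, not values from the question; the source (NFS) and destination (S3) locations are assumed to exist already.

```python
def build_task_request(source_location_arn, destination_location_arn,
                       max_mbps=None, schedule="cron(0 2 * * ? *)"):
    """Build keyword arguments for the DataSync create_task API.

    max_mbps is a hypothetical convenience parameter: a cap in megabits
    per second, converted to the bytes-per-second value DataSync expects.
    None means no cap, which DataSync represents as -1.
    """
    if max_mbps is None:
        bytes_per_second = -1
    else:
        bytes_per_second = int(max_mbps * 1_000_000 / 8)

    return {
        "SourceLocationArn": source_location_arn,
        "DestinationLocationArn": destination_location_arn,
        "Name": "nas-to-s3-daily",  # placeholder task name
        "Options": {
            # Verify only the files transferred in each run.
            "VerifyMode": "ONLY_FILES_TRANSFERRED",
            # Leave Direct Connect headroom for other workloads.
            "BytesPerSecond": bytes_per_second,
        },
        # Assumed schedule: once a day at 02:00 UTC.
        "Schedule": {"ScheduleExpression": schedule},
    }


def create_task(request):
    """Submit the request; needs boto3 and AWS credentials at call time."""
    import boto3  # imported lazily so the builder works without boto3

    return boto3.client("datasync").create_task(**request)["TaskArn"]


request = build_task_request(
    "arn:aws:datasync:REGION:ACCOUNT:location/loc-SOURCE",       # placeholder
    "arn:aws:datasync:REGION:ACCOUNT:location/loc-DESTINATION",  # placeholder
    max_mbps=80,
)
```

Passing `request` to `create_task` would register the scheduled task. The cap is optional; omit `max_mbps` if the link is not shared.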
For your small daily data volume (20-50 MB), DataSync's pay-as-you-go pricing model means you only pay for the data you transfer, making it cost-effective even for modest transfer volumes. The service also simplifies your architecture by handling the transfer mechanics, allowing your Glue ETL jobs to focus purely on data processing rather than data acquisition.
The recommended approach would be to use DataSync for the file transfer from your NAS to S3, then trigger your Glue ETL processing once the transfer completes. This separation of concerns improves reliability and maintainability, and it follows AWS best practices for hybrid data ingestion patterns.
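One way to wire up the "trigger Glue once the transfer completes" step is a small Lambda function invoked by an EventBridge rule that matches DataSync task execution state changes. The sketch below assumes the event carries `source` set to `aws.datasync` and the final state in `detail.State`; check the event shape in your account before relying on it. The job name `nas-files-etl` and the `glue_client` parameter are hypothetical, the latter added only so the handler can be exercised without AWS access.

```python
GLUE_JOB_NAME = "nas-files-etl"  # hypothetical Glue job name


def handler(event, context=None, glue_client=None):
    """Start the Glue job when a DataSync task execution succeeds.

    Returns the Glue JobRunId, or None when the event is ignored.
    """
    detail = event.get("detail", {})
    # Ignore anything that is not a successful DataSync execution
    # (assumed event shape, see the note above).
    if event.get("source") != "aws.datasync" or detail.get("State") != "SUCCESS":
        return None

    if glue_client is None:
        import boto3  # only needed when running inside Lambda

        glue_client = boto3.client("glue")

    response = glue_client.start_job_run(JobName=GLUE_JOB_NAME)
    return response["JobRunId"]
```

Ignoring non-success states means a failed transfer never starts ETL on partial data; failures can be routed to a separate alerting rule.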
Sources
DataSync architecture and routing examples with Direct Connect - AWS DataSync
Online Data Transfer and Migration – AWS DataSync – Amazon Web Services
Transferring your data with AWS DataSync - AWS DataSync
Gzipping the files before sending them to S3 will reduce storage and transfer costs.
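A minimal sketch of that compression step, run on the NAS side before the transfer and using only the Python standard library. The function name `gzip_file` is illustrative. How much space is saved depends on the data: text-like formats such as CSV or JSON usually compress well, while already-compressed files may not shrink at all.

```python
import gzip
import shutil
from pathlib import Path


def gzip_file(src: Path) -> Path:
    """Write a gzip-compressed copy of src next to it and return its path."""
    dst = src.with_name(src.name + ".gz")
    with src.open("rb") as fin, gzip.open(dst, "wb") as fout:
        # Stream in chunks so large files are never loaded fully into memory.
        shutil.copyfileobj(fin, fout)
    return dst
```

The original file is left in place, so the transfer job should be pointed at the `.gz` copies (for example via an include filter) to avoid sending both.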

Can you please elaborate on the reasons it's not recommended? I'm looking for a data-driven answer on the potential issues of going with the custom solutions mentioned in the question. "Regarding direct access from compute services like AWS Glue or Lambda to on-premises NAS over Direct Connect: While these services can be configured to run within a VPC and theoretically access resources over Direct Connect, this approach is not the recommended pattern for transferring files from on-premises storage to S3."