To transfer large volumes of PostgreSQL data and disk snapshots from on-premises to AWS S3, you can consider the following approaches:
- AWS Database Migration Service (DMS): For the PostgreSQL data, AWS DMS can migrate the database to S3 using S3 as a target endpoint. DMS supports both full load and change data capture (CDC) for ongoing replication, and it writes data from source tables into multiple files in S3, which can be partitioned by commit date for CDC or by partition columns for full load (a sample endpoint definition is sketched after this list).
- AWS CLI or SDK: For large-scale transfers over the network, you can use AWS CLI commands such as "aws s3 cp" or "aws s3 sync". These support multipart uploads, which is essential for large files, and you can enable the AWS Common Runtime (CRT) transfer client to improve transfer speed (see the configuration sketch after this list).
- AWS DataSync: DataSync is a viable option if you have network connectivity from on-premises to AWS through Direct Connect or a site-to-site VPN. It is designed for large-scale data transfers and can efficiently move data from on-premises NFS or SMB shares to S3 (a sample task setup follows this list).
- S3 File Gateway: For the disk snapshots, you might consider deploying Amazon S3 File Gateway on premises. It provides a file server interface supporting the NFS and SMB protocols, storing files as objects in S3 while providing low-latency access to recently used data through local caching (a file-share example follows this list).
- AWS Snowball or Snowmobile: Given the extremely large volume of data (500 TB of PostgreSQL data and 20 PB of disk snapshots), consider AWS Snowball devices or AWS Snowmobile for the initial bulk transfer. These physical devices are shipped to your location, loaded with data, and returned to AWS, where the data is imported directly into S3 (an example of ordering an import job follows this list).
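For DMS, here is a minimal sketch of an S3 target endpoint created with the AWS CLI, assuming a hypothetical bucket (my-dms-target-bucket) and IAM role (dms-s3-access-role); Parquet output, compression, and date-based partitioning are optional settings you can tune:

```
# Create a DMS target endpoint that writes Parquet files to S3 and,
# during CDC, partitions output into date-based folders.
# Bucket name, role ARN, and account ID are placeholders.
aws dms create-endpoint \
  --endpoint-identifier pg-to-s3-target \
  --endpoint-type target \
  --engine-name s3 \
  --s3-settings '{
    "ServiceAccessRoleArn": "arn:aws:iam::123456789012:role/dms-s3-access-role",
    "BucketName": "my-dms-target-bucket",
    "BucketFolder": "postgres",
    "DataFormat": "parquet",
    "CompressionType": "gzip",
    "DatePartitionEnabled": true
  }'
```

You would still need a PostgreSQL source endpoint, a replication instance, and a migration task (full load plus CDC) to tie everything together; see the walkthrough linked in the sources.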
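On the CLI route, the transfer client and multipart behavior are controlled through the CLI configuration. A sketch with placeholder paths and bucket name (the option names are AWS CLI v2 settings, so double-check them against the version you run):

```
# Option A: switch to the AWS CRT-based S3 transfer client (AWS CLI v2),
# which manages parallelism and multipart sizing on its own.
aws configure set default.s3.preferred_transfer_client crt

# Option B: tune the classic transfer client instead
# (these settings are ignored when the CRT client is active).
aws configure set default.s3.max_concurrent_requests 32
aws configure set default.s3.multipart_threshold 64MB
aws configure set default.s3.multipart_chunksize 64MB

# Sync a local export directory to S3; only new or changed files are
# uploaded, so the command can be re-run safely after interruptions.
aws s3 sync /data/pg-exports s3://my-transfer-bucket/pg-exports/
```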
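For DataSync, the flow is: deploy and activate an agent on-premises, register the source and destination as locations, then run a task between them. A rough outline with placeholder ARNs, hostnames, and role names:

```
# Source: an on-premises NFS export, reached through the DataSync agent.
aws datasync create-location-nfs \
  --server-hostname nfs.example.internal \
  --subdirectory /exports/snapshots \
  --on-prem-config AgentArns=arn:aws:datasync:us-east-1:123456789012:agent/agent-0123456789abcdef0

# Destination: an S3 bucket, accessed through a role DataSync can assume.
aws datasync create-location-s3 \
  --s3-bucket-arn arn:aws:s3:::my-transfer-bucket \
  --subdirectory /snapshots \
  --s3-config BucketAccessRoleArn=arn:aws:iam::123456789012:role/datasync-s3-access-role

# Tie the two locations together and start a transfer; the location ARNs
# come from the output of the two commands above.
aws datasync create-task \
  --source-location-arn <nfs-location-arn> \
  --destination-location-arn <s3-location-arn> \
  --name snapshots-to-s3
aws datasync start-task-execution --task-arn <task-arn>
```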
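For S3 File Gateway, once the gateway appliance is deployed and activated, you create a file share backed by an S3 bucket and mount it from your on-premises hosts. A sketch with placeholder gateway, role, and bucket ARNs:

```
# Create an NFS file share on an existing, activated S3 File Gateway.
# Files written to the share are stored as objects in the target bucket.
aws storagegateway create-nfs-file-share \
  --client-token snapshots-share-1 \
  --gateway-arn arn:aws:storagegateway:us-east-1:123456789012:gateway/sgw-0123ABCD \
  --role arn:aws:iam::123456789012:role/file-gateway-s3-access-role \
  --location-arn arn:aws:s3:::my-snapshot-bucket

# Then mount the share from an on-premises host and copy snapshots onto it:
#   sudo mount -t nfs -o nolock,hard <gateway-ip>:/my-snapshot-bucket /mnt/s3-share
```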
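Ordering Snowball import jobs can also be scripted. A rough sketch with placeholder address ID, role ARN, and bucket ARN (the address ID comes from aws snowball create-address, and at 20 PB you would be ordering devices in batches or discussing Snowmobile with AWS):

```
# Order a Snowball Edge import job targeting an existing S3 bucket.
aws snowball create-job \
  --job-type IMPORT \
  --snowball-capacity-preference T80 \
  --resources 'S3Resources=[{BucketArn=arn:aws:s3:::my-snapshot-bucket}]' \
  --address-id ADID1234ab-1234-1234-1234-123456789012 \
  --role-arn arn:aws:iam::123456789012:role/snowball-import-role \
  --description "disk-snapshots-batch-01"
```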
For optimal performance and cost-effectiveness:
- Ensure even distribution of partitions and bucket key space.
- Use hashing methods for key prefixes to distribute data evenly.
- Adjust the default multipart upload configuration to optimize costs, and add a lifecycle rule that aborts incomplete multipart uploads (see the sketch after this list).
- If using SSE-KMS, keep the S3 Bucket Key enabled to reduce KMS API calls (also shown in the sketch after this list).
- Consider running multiple parallel transfer jobs to maximize throughput.
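Two of those cost levers can be set directly on the bucket. A sketch with a placeholder bucket name and KMS key ARN, showing the S3 Bucket Key for SSE-KMS and a lifecycle rule that cleans up incomplete multipart uploads:

```
# Default encryption with SSE-KMS and the S3 Bucket Key enabled, which
# reduces the number of KMS API calls generated by heavy upload traffic.
aws s3api put-bucket-encryption \
  --bucket my-transfer-bucket \
  --server-side-encryption-configuration '{
    "Rules": [{
      "ApplyServerSideEncryptionByDefault": {
        "SSEAlgorithm": "aws:kms",
        "KMSMasterKeyID": "arn:aws:kms:us-east-1:123456789012:key/11111111-2222-3333-4444-555555555555"
      },
      "BucketKeyEnabled": true
    }]
  }'

# Lifecycle rule that aborts incomplete multipart uploads after 7 days,
# so interrupted large transfers do not keep accumulating storage charges.
aws s3api put-bucket-lifecycle-configuration \
  --bucket my-transfer-bucket \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "abort-incomplete-mpu",
      "Status": "Enabled",
      "Filter": {},
      "AbortIncompleteMultipartUpload": { "DaysAfterInitiation": 7 }
    }]
  }'
```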
Given the massive scale of your data, a combination of these methods might be necessary. For instance, you could use Snowmobile for the initial bulk transfer of disk snapshots, while simultaneously using DMS for the PostgreSQL data transfer and ongoing replication.
Sources
what is the best way to transfer data from on-prem to s3 over internet | AWS re:Post
S3 Data Copy\Migration Best Practices | AWS re:Post
Step-by-step an Amazon RDS PostgreSQL database to an Amazon S3 data lake migration walkthrough - Database Migration Guide
Oracle external tables and PostgreSQL integration with Amazon S3 - Oracle to Aurora PostgreSQL Migration Playbook