How can we transfer TBs of PostgreSQL data from on-prem to S3?


We have 500 TB of PostgreSQL data and 20 PB of disk snapshots.

How can we transfer this data from on-premises to Amazon S3?

asked 2 months ago · 105 views
1 Answer

To transfer large volumes of PostgreSQL data and disk snapshots from on-premises to AWS S3, you can consider the following approaches:

  1. AWS Database Migration Service (DMS): For the PostgreSQL data, AWS DMS can migrate the database to S3. DMS supports both a full load and change data capture (CDC) for ongoing replication, and it writes data from source tables into multiple files in S3, partitioned by commit date for CDC or by partition columns for a full load (a hedged CLI sketch of an S3 target endpoint follows this list).

  2. AWS CLI or SDK: For large-scale transfers over the network, you can use AWS CLI commands such as "aws s3 cp" or "aws s3 sync". Both support multipart uploads, which helps with large files, and you can enable the AWS Common Runtime (CRT) transfer client to improve throughput (see the tuning sketch after the checklist below).

  3. AWS DataSync: If you have network connectivity from on-premises to AWS through Direct Connect or a site-to-site VPN, DataSync is designed for large-scale transfers and can move data to S3 efficiently (a minimal task sketch follows this list).

  4. Amazon S3 File Gateway: For the disk snapshots, you might deploy S3 File Gateway on-premises. It exposes a file server interface over NFS or SMB, so you can copy files into S3 while keeping low-latency access through a local cache (a sample NFS mount follows this list).

  5. AWS Snowball or Snowmobile: Given the volume involved (500 TB of PostgreSQL data and 20 PB of disk snapshots), offline transfer with AWS Snowball devices, or AWS Snowmobile for the petabyte-scale portion, is worth considering for the initial bulk copy. The devices are shipped to your site, loaded locally, and returned to AWS, where the data is imported into S3 (a sketch of copying through a Snowball Edge's local S3 endpoint follows this list).

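For option 1, a minimal sketch of an S3 target endpoint and a full-load-plus-CDC task with the AWS CLI. The identifiers, ARNs, bucket name, and mappings.json are placeholders for your own values, and the IAM role must allow DMS to write to the bucket:

```bash
# Placeholder names and ARNs -- substitute your own account, region, and resources.
aws dms create-endpoint \
  --endpoint-identifier pg-to-s3-target \
  --endpoint-type target \
  --engine-name s3 \
  --s3-settings 'ServiceAccessRoleArn=arn:aws:iam::123456789012:role/dms-s3-role,BucketName=my-migration-bucket,DataFormat=parquet,CompressionType=gzip,DatePartitionEnabled=true'

# Full load plus ongoing CDC from the on-prem PostgreSQL source endpoint.
aws dms create-replication-task \
  --replication-task-identifier pg-full-load-cdc \
  --source-endpoint-arn arn:aws:dms:us-east-1:123456789012:endpoint/SOURCE-PG-EXAMPLE \
  --target-endpoint-arn arn:aws:dms:us-east-1:123456789012:endpoint/TARGET-S3-EXAMPLE \
  --replication-instance-arn arn:aws:dms:us-east-1:123456789012:rep/EXAMPLE-INSTANCE \
  --migration-type full-load-and-cdc \
  --table-mappings file://mappings.json
```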
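For option 3, a rough DataSync sketch, assuming an on-premises DataSync agent is already activated and the data is reachable over NFS. Hostnames, paths, ARNs, and the IAM role are placeholders:

```bash
# On-prem NFS export that holds the data to move (placeholder hostname and path).
aws datasync create-location-nfs \
  --server-hostname nfs.onprem.example.com \
  --subdirectory /exports/pg-backups \
  --on-prem-config AgentArns=arn:aws:datasync:us-east-1:123456789012:agent/agent-EXAMPLE

# Destination S3 bucket, accessed through a role DataSync can assume.
aws datasync create-location-s3 \
  --s3-bucket-arn arn:aws:s3:::my-migration-bucket \
  --s3-config BucketAccessRoleArn=arn:aws:iam::123456789012:role/datasync-s3-role

# Create the task from the two location ARNs returned above, then run it.
aws datasync create-task \
  --source-location-arn <nfs-location-arn> \
  --destination-location-arn <s3-location-arn> \
  --name pg-backups-to-s3
aws datasync start-task-execution --task-arn <task-arn>
```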
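For option 4, once an S3 File Gateway file share has been created, copying snapshot files from a Linux host looks roughly like this; the gateway IP, share name, and local paths are placeholders:

```bash
# 10.0.0.10 and my-snapshot-bucket are placeholders for your gateway and file share.
sudo mkdir -p /mnt/s3share
sudo mount -t nfs -o nolock,hard 10.0.0.10:/my-snapshot-bucket /mnt/s3share

# Copy snapshot files; the gateway uploads them to S3 asynchronously from its cache.
cp -r /snapshots/batch-01/ /mnt/s3share/
```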
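For option 5, after a Snowball Edge is unlocked on your network, data is typically staged onto it through the device's local S3 interface and imported into your bucket when the device is returned. A sketch under those assumptions; the device IP, port, profile name, and bucket are placeholders, so check the endpoint details reported by your device:

```bash
# Credentials come from the Snowball Edge client (list-access-keys /
# get-secret-access-key); store them in a local CLI profile, e.g. "snowballEdge".
aws s3 cp /snapshots/batch-01/ s3://my-migration-bucket/snapshots/batch-01/ \
  --recursive \
  --profile snowballEdge \
  --endpoint-url https://10.0.0.20:8443
```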
For optimal performance and cost-effectiveness:

  • Spread objects evenly across partitions and the bucket key space.
  • Use hash-based key prefixes so the load is not concentrated on a single hot prefix.
  • Adjust the default multipart upload configuration to optimize costs.
  • If using SSE-KMS, keep the S3 Bucket Key enabled to reduce KMS API calls.
  • Consider running multiple parallel transfer jobs to maximize throughput (see the sketch below).

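For the AWS CLI path (option 2) and the parallelism points above, a minimal tuning sketch, assuming AWS CLI v2; the bucket, directory layout, and numbers are placeholders to size against your own network:

```bash
# Enable the CRT-based transfer client (AWS CLI v2); it manages concurrency itself
# and uses target_bandwidth as its throughput hint.
aws configure set default.s3.preferred_transfer_client crt
aws configure set default.s3.target_bandwidth 100MB/s

# If you stay on the classic transfer client instead, tune multipart behaviour:
# aws configure set default.s3.max_concurrent_requests 50
# aws configure set default.s3.multipart_chunksize 64MB

# Run several sync jobs in parallel, one per export directory, each writing to its
# own key prefix so the load spreads across the bucket key space.
for dir in /data/pg-export/part-*; do
  aws s3 sync "$dir" "s3://my-migration-bucket/pg/$(basename "$dir")/" &
done
wait
```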
Given the massive scale of your data, a combination of these methods might be necessary. For instance, you could use Snowmobile for the initial bulk transfer of disk snapshots, while simultaneously using DMS for the PostgreSQL data transfer and ongoing replication.
Sources
what is the best way to transfer data from on-prem to s3 over internet | AWS re:Post
S3 Data Copy\Migration Best Practices | AWS re:Post
Step-by-step an Amazon RDS PostgreSQL database to an Amazon S3 data lake migration walkthrough - Database Migration Guide
Oracle external tables and PostgreSQL integration with Amazon S3 - Oracle to Aurora PostgreSQL Migration Playbook

answered 2 months ago
AWS EXPERT
reviewed 2 months ago
