
How to copy a large file on EC2 from stdin to S3?

0

I have a case where copying a large file from stdin to S3 is failing. I'm on RHEL 7.9; here's the command I'm using:

tar hcO * | aws s3 cp - s3://.../test_beta_backup.tar --region ap-southeast-2 --sse --expected-size 237104268

and I get the following error:

upload failed: - to s3://.../test_beta_backup.tar An error occurred (InvalidArgument) when calling the UploadPart operation: Part number must be an integer between 1 and 10000, inclusive

I am aware of the need to use the --expected-size parameter but it doesn't seem to make a difference. The size I specified above is the exact size of the data returned by du.

Here's what I tried:

  1. If I use the command above for a smaller data set (2-3GB) it works fine.
  2. If I create the large tar file first (same data, same size), then upload the file (instead of stdin) to s3 using the same parameters it works just fine:

tar hcO * > ../cass.tar

aws s3 cp ../cass.tar s3://.../test_beta_backup.tar --region ap-southeast-2 --sse --expected-size 237104268

  3. I tried increasing the expected-size parameter from the calculated size to 300000000, but it still fails with the same error.
  4. I found some references online to another parameter, --multipart-chunk-size-mb=xx, but it doesn't appear to be supported by my version of the AWS CLI.

Does aws s3 cp ignore the expected-size when copying from stdin? Is there any way around this other than creating a temporary tar file first?

Thanks in advance.

2 Answers
3

The issue you're experiencing is likely related to how Amazon S3 handles large file uploads, especially when streaming from stdin. Here are some insights and potential solutions to your problem:

  1. Multipart Upload: When uploading large objects, the AWS CLI automatically switches to multipart upload once the object exceeds its multipart_threshold (8 MB by default). The upload is split into parts, and S3 allows at most 10,000 parts per object.

  2. Stdin Limitations: When streaming from stdin, the AWS CLI cannot determine the total size in advance, so it cannot choose a part size large enough to stay under the 10,000-part limit on its own.

  3. Expected-Size Parameter: --expected-size exists precisely so the CLI can pick a part size that keeps the part count at or below 10,000; if the value passed is much smaller than the real stream size, the CLI picks parts that are too small and runs out of part numbers mid-upload.
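A quick back-of-the-envelope check shows why the part cap is hit here (a sketch, assuming the figure in the question came from du reporting 1K blocks, and the CLI's default 8 MB part size):

```shell
size_bytes=$((237104268 * 1024))           # du without -b reports 1K blocks, so the true size is ~1024x larger
max_parts=10000
min_chunk=$(( (size_bytes + max_parts - 1) / max_parts ))        # smallest part size that fits in 10,000 parts
default_chunk=$((8 * 1024 * 1024))                               # AWS CLI default part size: 8 MB
parts_at_default=$(( (size_bytes + default_chunk - 1) / default_chunk ))
echo "min part size: $min_chunk bytes; parts at 8 MB: $parts_at_default"
```

At ~242 GB of actual data, 8 MB parts would need roughly 29,000 part numbers, well past the 10,000 cap, which is exactly the InvalidArgument error seen above.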

To address your specific situation:

  1. Increase the Multipart Chunk Size: --multipart-chunk-size-mb is actually an s3cmd option, not an AWS CLI one. With the AWS CLI, the part size is controlled by the s3.multipart_chunksize configuration value (for example, aws configure set default.s3.multipart_chunksize 64MB); larger parts keep very large uploads under the 10,000-part cap.

  2. Use S3 Transfer Acceleration: If available in your region, you could try enabling S3 Transfer Acceleration, which can improve upload speeds for large files.

  3. Split the Upload: Instead of streaming the entire tar file, you could split it into smaller chunks and upload each separately. This approach gives you more control over the upload process.

  4. Use a Different Tool: Consider using other AWS-supported tools like s3cmd or the AWS SDK, which might handle large stdin uploads more effectively.

  5. Temporary File Approach: While not ideal, your method of creating a temporary tar file and then uploading it seems to work reliably. If other solutions fail, this might be the most practical approach, especially if you can clean up the temporary file immediately after the upload.

  6. Parallel Upload: If possible, use a transfer application capable of performing parallel requests across multiple connections. This can significantly improve transfer speeds and reliability for large files.
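Option 3 above can be sketched as follows. This is a hedged illustration using self-generated sample data; the bucket and file names are placeholders, and note that each piece becomes a separate S3 object, so the pieces must be re-joined with cat after download:

```shell
mkdir -p demo && head -c 3145728 /dev/zero > demo/data.bin   # 3 MB of sample data
tar cf - demo | split -b 1048576 - piece_                    # cut the stream into 1 MB pieces
for f in piece_*; do
  # in real use: aws s3 cp "$f" "s3://my-bucket/backup/$f" --region ap-southeast-2 --sse
  echo "uploading $f (placeholder)"
done
cat piece_* > rejoined.tar                                   # local round-trip check
```

Each piece stays well under the multipart threshold, so the 10,000-part limit never comes into play, at the cost of managing reassembly yourself.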

To maximize S3 transfer performance, ensure that your EC2 instance and S3 bucket are in the same AWS Region, and that your EC2 instance has sufficient network throughput to handle the large file transfer.

If you continue to face issues, it might be worth reaching out to AWS support for more specific guidance tailored to your setup and AWS CLI version.

Sources
Maximizing Amazon S3 transfer throughput to Amazon EC2 instances | AWS re:Post
Upload large files to S3 | AWS re:Post

answered a year ago
EXPERT
reviewed a year ago
1

It turns out the error was mine: I failed to realise that du reports sizes in 1K blocks (kilobytes), not bytes, so my calculated expected-size was off by a factor of 1024. Once I added the -b switch to du so that it reports bytes, the corrected expected-size made the s3 cp work as expected.
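The fix can be sketched like this (assuming GNU du, as on RHEL, where -b reports the apparent size in bytes):

```shell
head -c 5000 /dev/zero > sample.bin            # 5000-byte test file
du sample.bin | cut -f1                        # default unit is 1K blocks (value depends on the filesystem)
du -b sample.bin | cut -f1                     # exact byte count: 5000
expected=$(du -sb sample.bin | cut -f1)        # the value --expected-size actually needs
echo "$expected"
```

With the corrected value, the original pipeline becomes: tar hcO * | aws s3 cp - s3://.../test_beta_backup.tar --region ap-southeast-2 --sse --expected-size "$expected"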

aws s3 cp apparently ignores the expected-size parameter when copying from a file, since it can read the size from the file itself. But when reading from stdin it has no way of working out the size and must rely on the expected-size provided. This explains the difference in behaviour I saw between copying from stdin and copying from a file.

answered a year ago
