Do recursive downloads using awscli cp share TCP connections?


I want to bulk-download a large number (millions) of small files. While uploading, I noticed that the copy process was quite slow, and TCP slow start and other connection-setup overheads appeared to be one cause (I was using rclone for the upload, not the S3 CLI directly). When downloading the same data, if I were to use the S3 CLI, would it reuse the same TCP connection(s) across files, or would it establish a new connection for each file?

PS: I am not asking about parallelizing TCP connections themselves. I know that happens, and that's good. I specifically want multiple files transferred over the same TCP connection, with several such connections in parallel, to make full use of the available bandwidth.
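
For concreteness, the kind of download I have in mind would look something like this (bucket and prefix are placeholders):

% aws s3 cp s3://my-bucket/small-files/ ./downloads --recursive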

Thanks!

asked 5 months ago · 205 views
2 Answers

The CLI is multi-threaded and will reuse TCP connections (which I think is what you are asking).

There are a number of tunables if you want to speed up uploads to use all available bandwidth. The most effective are tuning the CLI to use as many concurrent requests as possible, and enabling multipart uploads (if your individual files are large enough).

Internally the CLI creates a queue and dispatches work from that queue to one of the sessions it has open, up to whatever you have specified for max_concurrent_requests. Refer to the documentation on configuring the CLI for maximum performance in your particular environment: https://docs.aws.amazon.com/cli/latest/topic/s3-config.html
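
As a sketch of the kind of tuning meant here (the values are illustrative, not recommendations), these settings from that documentation page can be written into ~/.aws/config with aws configure set:

% aws configure set default.s3.max_concurrent_requests 100
% aws configure set default.s3.max_queue_size 10000
% aws configure set default.s3.multipart_threshold 64MB
% aws configure set default.s3.multipart_chunksize 16MB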

Please note that many operating systems place a limit on the number of open file descriptors for a given user. This will obviously prevent the CLI from opening more connections than the limit enforced by the operating system. As an example, I am on macOS right now, and my user's open file-descriptor limit is:

% ulimit -n
256

That is quite low, considering it is shared by all the processes running as my user.
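
If you hit that ceiling, you can usually raise the soft limit for the current shell session, up to the hard limit that ulimit -Hn reports (the exact limits and how to raise them persistently vary by operating system):

% ulimit -n 4096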

AWS
EXPERT
answered 5 months ago
reviewed 2 months ago

Amazon S3 only supports HTTP 1.1, which has a well-known limitation: it cannot send multiple HTTP requests at once over a single TCP connection. When the client opens a TCP connection, it has to send the request and wait for the response. For this reason, it is not possible to download multiple files in parallel over a single TCP connection; instead, a new connection is established for each file.

That being said, the AWS S3 transfer commands are multithreaded. At any given time multiple requests to Amazon S3 are in flight, and TCP connections can also be reused. Reusing TCP connections across multiple files, combined with multiple concurrent connections, is therefore best practice for S3 in general. Please refer to the documentation below for more details:

https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance-guidelines.html#optimizing-performance-guidelines-scale

Lastly, you can also customize the AWS CLI configuration for Amazon S3 in the AWS config file (default location ~/.aws/config) to optimize performance. Please refer to the link below:

https://docs.aws.amazon.com/cli/latest/topic/s3-config.html
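
One way to sanity-check the connection behavior yourself is to run a small recursive copy with --debug and count how often the underlying HTTP library reports opening a fresh connection. The bucket and prefix below are placeholders, and the grep pattern assumes the "Starting new HTTPS connection" log line emitted by urllib3, which the CLI uses under the hood; a count far lower than the number of files indicates connections are being reused:

% aws s3 cp s3://my-bucket/prefix/ ./out --recursive --debug 2>&1 | grep -c "Starting new HTTPS connection"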

AWS
SUPPORT ENGINEER
answered 5 months ago
  • Guarav, what you say about HTTP 1.1 is not true: by default it allows connection reuse and pipelining of requests (which the CLI does not use). If no explicit "Connection: close" header is sent on a request made over an HTTP 1.1 session, the connection is left open, ready to accept another HTTP request. Both the client (the CLI) and the server can alter this behavior by including a "Connection" header. What you describe is the default behavior of the HTTP 1.0 protocol. Please read https://datatracker.ietf.org/doc/html/rfc7230#page-50, specifically "Connection Management" on page 50.
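
    A quick way to see HTTP 1.1 keep-alive in action: fetch two objects from the same endpoint in one curl invocation and watch the verbose output, where curl reports reusing the connection for the second request (bucket and keys below are placeholders):

    % curl -v https://my-bucket.s3.amazonaws.com/key1 https://my-bucket.s3.amazonaws.com/key2 2>&1 | grep -i "re-using"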
