Do recursive downloads using awscli cp share TCP connections?

0

I want to bulk-download a large number (millions) of small files. While uploading, I noticed that copying was quite slow, and TCP slow start and other connection-setup overheads appear to be one cause (I was using rclone for this, not the S3 CLI directly). When downloading the same data with the S3 CLI, would it share the same TCP connection(s) across files, or would it establish a new connection for each file?

PS: I am not asking about parallel TCP connections themselves. I know that happens, and that's good. I specifically want multiple files transferred over the same TCP connection, with several such connections in parallel, to make full use of the available bandwidth.

Thanks!

Asked 6 months ago · 216 views
2 Answers
0

The CLI is multi-threaded and will re-use TCP connections (which I think is what you are asking).
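
One way to sanity-check this yourself (a sketch, assuming the CLI's underlying HTTP library still logs a "Starting new HTTPS connection" message at debug level; the bucket name is a placeholder) is to count how many connections a recursive copy actually opens:

% aws s3 cp s3://my-bucket/prefix/ ./local/ --recursive --debug 2>&1 | grep -c "Starting new HTTPS connection"

If the count stays far below the number of files transferred, connections are being re-used.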

There are a number of tunables if you want to speed up transfers to use all available bandwidth. The most effective are to tune the CLI to use as many threads as possible and to use multipart uploads (if your individual files are large enough).

Internally the CLI will create a queue and dispatch work from that queue across the connections it has open, until it reaches whatever you have specified for max_concurrent_requests. Refer to the documentation on configuring the CLI for maximum performance in your particular environment: https://docs.aws.amazon.com/cli/latest/topic/s3-config.html
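
For example, the most relevant settings can be changed with aws configure set (a sketch; the values below are illustrative, not tuned recommendations for your environment):

% aws configure set default.s3.max_concurrent_requests 50
% aws configure set default.s3.multipart_threshold 64MB
% aws configure set default.s3.multipart_chunksize 16MB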

Please note that many operating systems place a limit on the number of open file descriptors for a given user. This will prevent the CLI from opening more connections than the limit enforced by the operating system. As an example, I am currently on macOS, and my user's open file-descriptor limit is:

% ulimit -n
256

That is quite low, considering it is shared by all the processes running as my user.
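
On most shells you can raise the soft limit for the current session, up to the hard limit reported by ulimit -Hn:

% ulimit -n 4096
% ulimit -n
4096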

AWS
Expert
Answered 5 months ago
Expert
Reviewed 2 months ago
-1

Amazon S3 only supports HTTP 1.1, which has a well-known limitation: it cannot send multiple HTTP requests over a single TCP connection. When the client opens a TCP connection, it has to send the request and wait for the response. For this reason, it is not possible to download multiple files in parallel over a single TCP connection; instead, a new connection is established for each file.

That being said, the AWS S3 transfer commands are multithreaded. At any given time, multiple requests to Amazon S3 are in flight, and TCP connections can also be re-used. Reusing TCP connections across files, combined with multiple concurrent connections, is best practice for S3 in general. Please refer to the documentation below for more details:

https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance-guidelines.html#optimizing-performance-guidelines-scale

Lastly, you can also customize the AWS CLI configuration for Amazon S3 in the AWS config file (default location: ~/.aws/config) to optimize performance. See:

https://docs.aws.amazon.com/cli/latest/topic/s3-config.html
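
As a sketch (values are illustrative, not recommendations), a tuned profile in that file might look like:

[default]
s3 =
  max_concurrent_requests = 100
  max_queue_size = 10000
  multipart_threshold = 64MB
  multipart_chunksize = 16MB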

AWS
Support Engineer
Answered 5 months ago
  • Guarav, what you say about HTTP 1.1 is not true: by default it allows connection re-use, as well as pipelining of requests (which the CLI does not use). If no explicit "Connection: close" header is sent on a request made over an HTTP 1.1 session, the connection is left open, ready to accept another HTTP request. Both the client (CLI) and the server can alter this behavior by including a "Connection" header. What you describe is the default behavior of the HTTP 1.0 protocol. Please read https://datatracker.ietf.org/doc/html/rfc7230#page-50, specifically "Connection Management" on page 50. A quick way to observe this re-use is shown below.
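
The keep-alive re-use is easy to observe outside the CLI: with curl, fetching two objects from the same host in one invocation (the bucket and key names below are placeholders) should print a "Re-using existing connection" line for the second request, though the exact wording varies by curl version:

% curl -sv -o /dev/null -o /dev/null https://my-bucket.s3.amazonaws.com/key1 https://my-bucket.s3.amazonaws.com/key2 2>&1 | grep -i "re-using"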
