Somewhat related: I did something very similar (quite) a few years ago. It was on EC2 rather than Lambda, but the idea may be portable or helpful to you.
http://gumbyadventures.s3-website-ap-southeast-2.amazonaws.com/parallelism.html
Hi,
Did you validate (via CloudWatch metrics) that you reach the theoretical S3 limits? See https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance.html
The section below explains how to leverage object prefixes to increase S3 performance:
Your applications can easily achieve thousands of transactions per second in request performance when uploading and retrieving storage from Amazon S3. Amazon S3 automatically scales to high request rates. For example, your application can achieve at least 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second per partitioned Amazon S3 prefix. There are no limits to the number of prefixes in a bucket. You can increase your read or write performance by using parallelization. For example, if you create 10 prefixes in an Amazon S3 bucket to parallelize reads, you could scale your read performance to 55,000 read requests per second. Similarly, you can scale write operations by writing to multiple prefixes. The scaling, in the case of both read and write operations, happens gradually and is not instantaneous. While Amazon S3 is scaling to your new higher request rate, you may see some 503 (Slow Down) errors. These errors will dissipate when the scaling is complete. For more information about creating and using prefixes, see Organizing objects using prefixes.
If you do not reach these limits, you can still increase parallelism at the Lambda level until you hit S3 throttling.
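As a rough sketch of the prefix idea, you can spread object keys across N prefixes by hashing the key (the MD5 choice, function name, and two-digit prefix format here are illustrative, not anything prescribed by S3):

```python
import hashlib

def prefixed_key(key: str, num_prefixes: int = 10) -> str:
    """Spread objects across num_prefixes S3 prefixes by hashing the key.

    Each distinct prefix scales independently, so reads spread across
    10 prefixes could approach ~10x the per-prefix GET/HEAD limit.
    """
    # Hash the key and map it deterministically to one of N buckets.
    bucket = int(hashlib.md5(key.encode()).hexdigest(), 16) % num_prefixes
    return f"{bucket:02d}/{key}"
```

Because the mapping is deterministic, both the writer and the reader can compute the same prefixed key without any lookup table.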
Best,
Didier
I get 10,000 objects in about 3 seconds, so I don't think I reach the limit (and thanks to your quote, I now know that I can hash objects across multiple prefixes). However, each search request ends up using about 300 GB-seconds of AWS Lambda, and I am looking for a more performant approach, since better performance translates directly into lower costs.
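For scale, the Lambda compute bill per request works out roughly as follows (assuming the common x86 on-demand rate of about $0.0000166667 per GB-second; actual prices vary by region and architecture):

```python
# Rough per-request Lambda cost at 300 GB-seconds of compute.
GB_SECONDS_PER_REQUEST = 300
PRICE_PER_GB_SECOND = 0.0000166667  # assumed x86 on-demand rate

cost = GB_SECONDS_PER_REQUEST * PRICE_PER_GB_SECOND
print(f"${cost:.4f} per request")  # about half a cent per search
```

At that rate, even a modest 2,000 searches per day is on the order of $10/day, which is why the GB-seconds matter more than the 3-second latency.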
Thank you for sharing! Your code was CPU-bound and Python's GIL makes threading terribly inefficient in that case, so you solved it by using multi-processing instead. Mine is I/O-bound. In any case, I am going to try calling S3 asynchronously and see what happens. Also, you only needed to run for an hour, so you could afford a really strong EC2 instance. I need to accept search requests all the time, so a $1/hour instance would cost me $730/month, which is a lot when you don't yet have many users.
Actually, I had a Lambda function for the search component - the code in the link ran periodically to index the content, and then the EC2 instance was discarded. Intermediate changes/updates to the index were handled by another Lambda function. The cost per month was therefore less than $10 in EC2 charges. The output from the index process was (in my case) a single text file about 800 MB in size. Reading that into the search Lambda took about seven seconds on a cold start, and searches through it took about 100 ms per keyword. Clearly there was a lot of optimisation that could have been done (file compression; smart indexing), but for my purposes it was highly performant.
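The shape of that design is essentially an inverted index held in memory, which is why per-keyword lookups are so cheap once the file is loaded. A toy sketch of the idea (the original file format isn't described in the thread, so this structure is purely illustrative):

```python
from collections import defaultdict

def build_index(docs):
    """Build a keyword -> set-of-doc-ids inverted index.

    docs maps a document id to its text. This runs in the periodic
    indexing job; the search side only ever reads the result.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, keywords):
    """Return ids of docs containing ALL keywords (one dict lookup each)."""
    sets = [index.get(k.lower(), set()) for k in keywords]
    return set.intersection(*sets) if sets else set()
```

Each keyword costs a single hash lookup plus a set intersection, which is consistent with per-keyword search time being orders of magnitude smaller than the load time.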
I see. So I guess you could afford >7 seconds per search request (7 seconds for reading the file plus the search itself). Did you not try splitting that large 800 MB file into smaller files that could be read in parallel?
Also, it is strange that you got a bandwidth of only ~120 MB/s (800 MB in about 7 seconds)...
Bear in mind this was many years ago; at the time, the maximum Lambda memory was 1.5 GB, and that would have had an impact on network and CPU performance. And seven seconds was fine for what I needed: it was replacing a system that took more than 20 seconds to search the same dataset (in a different format).