How to quickly process 10,000 small objects stored in S3


The problem: Given a list of 10,000 paths to objects stored in S3, I need to process the corresponding objects quickly (in under 1 second). Each object is 40 KB.

The background: Each object represents a document. Each document is associated with one or more users (possibly a thousand users or more, which is why documents are not duplicated). We need to process all documents associated with a given user. The processing is a search of the documents' contents based on a given query. The exact details of the search are unimportant here, except for the fact that the search results are small (~1 KB). In the problem statement above, the user has 10,000 documents associated with them.

An approach I have already tested is parallelization: multiple instances of a Lambda function each download and process a part of the list of objects.

The computation part of this approach is fast; it is the downloading of the objects that is the bottleneck. I would like to find a smarter (cheaper and more performant) approach.
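To make the bottleneck concrete, here is a minimal sketch of the download step, assuming boto3 and a thread pool inside each Lambda instance (the bucket name and worker count are placeholders):

```python
import concurrent.futures

import boto3

s3 = boto3.client("s3")  # boto3 low-level clients are thread-safe
BUCKET = "my-documents-bucket"  # placeholder bucket name


def fetch(key: str) -> bytes:
    # One GET per 40 KB object; per-request latency, not bandwidth, dominates.
    return s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()


def fetch_all(keys: list[str], workers: int = 64) -> list[bytes]:
    # Threads work well here because the work is I/O-bound
    # (the GIL is released while waiting on the network).
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, keys))
```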

One idea is to merge all 10,000 objects into a single temporary object and then download that one large object with a single GET request. However, the only way of doing this that I am aware of is multipart upload, which requires that each part (except the last) be at least 5 MB; in my case, the parts are 40 KB each. I cannot merge the parts in advance, since an object can be associated with many different users and thus appear in combination with different sets of objects.

Is there a workaround or a different approach I can use?

P.S. The question at SO

asked 6 months ago · 406 views
2 Answers

Somewhat related: I did something very similar (quite) a few years ago. It was on EC2 rather than Lambda but the idea may be portable or helpful to you.

http://gumbyadventures.s3-website-ap-southeast-2.amazonaws.com/parallelism.html

AWS
EXPERT
answered 6 months ago
  • Thank you for sharing! Your code was CPU-bound, and Python's GIL makes threading terribly inefficient in that case, so you solved it by using multiprocessing instead. Mine is I/O-bound. In any case, I am going to try calling S3 asynchronously and see what happens (see the sketch after this thread). Also, you only needed to run for an hour, so you could afford a really powerful EC2 instance. I need to accept search requests all the time, so a $1/hour instance would cost me $730/month, which is a lot when you don't yet have many users.

  • Actually, I had a Lambda function for the search component - the code in the link ran periodically to index the content and then the EC2 instance was discarded. Intermediate changes/updates to the index were handled by another Lambda function. The cost per month was therefore less than $10 in EC2 charges. The output from the index process was (in my case) a single text file about 800MB in size. Reading that into the search Lambda took about seven seconds on a cold start; and searches through it took about 100ms per keyword. Clearly there was a lot of optimisation that could have been done (file compression; smart indexing) but for my purposes it was highly performant.

  • I see. So I guess you could afford >7 seconds per search request (7 seconds for reading the file plus the search itself). Did you not try splitting that large 800 MB file into smaller files that could be read in parallel?

  • Also, it is strange that you got a bandwidth of only ~120 MB/s...

  • Bear in mind this was many years ago - at the time, max Lambda memory was 1.5GB - this would have had an impact on network and CPU performance. And seven seconds was fine for what I needed - it was replacing a system that took > 20 seconds to search the same dataset (in a different format).
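A minimal sketch of the asynchronous S3 calls mentioned above, assuming aiobotocore (the bucket name, keys, and concurrency limit are placeholders):

```python
import asyncio

from aiobotocore.session import get_session


async def fetch_all(bucket: str, keys: list[str], concurrency: int = 200) -> list[bytes]:
    # Issue many GETs concurrently from a single process; a semaphore caps
    # the number of requests in flight at any one time.
    session = get_session()
    sem = asyncio.Semaphore(concurrency)
    async with session.create_client("s3") as s3:

        async def fetch(key: str) -> bytes:
            async with sem:
                resp = await s3.get_object(Bucket=bucket, Key=key)
                async with resp["Body"] as stream:
                    return await stream.read()

        return await asyncio.gather(*(fetch(k) for k in keys))


# Example usage (placeholder bucket and keys):
# docs = asyncio.run(fetch_all("my-documents-bucket", ["docs/0001", "docs/0002"]))
```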


Hi,

Did you validate (via CloudWatch metrics) that you reach the theoretical S3 limits? See https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance.html

The section below explains how to leverage object prefixes to increase S3 performance:

Your applications can easily achieve thousands of transactions per second in request performance when uploading and retrieving storage from Amazon S3. Amazon S3 automatically scales to high request rates. For example, your application can achieve at least 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second per partitioned Amazon S3 prefix. There are no limits to the number of prefixes in a bucket. You can increase your read or write performance by using parallelization. For example, if you create 10 prefixes in an Amazon S3 bucket to parallelize reads, you could scale your read performance to 55,000 read requests per second. Similarly, you can scale write operations by writing to multiple prefixes. The scaling, in the case of both read and write operations, happens gradually and is not instantaneous. While Amazon S3 is scaling to your new higher request rate, you may see some 503 (Slow Down) errors. These errors will dissipate when the scaling is complete. For more information about creating and using prefixes, see Organizing objects using prefixes.

If you do not reach such limits, you can still increase parallelism at the Lambda level until you hit S3 throttling.
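For illustration, a minimal sketch of hashing object keys across multiple prefixes (the prefix count and key layout are assumptions, not a prescribed scheme):

```python
import hashlib

NUM_PREFIXES = 16  # assumed number of prefixes to spread requests over


def prefixed_key(doc_id: str) -> str:
    # Derive a stable prefix from the document id, e.g. "0a/doc-1234",
    # so GETs for a user's documents fan out over many partitioned prefixes.
    bucket_index = int(hashlib.md5(doc_id.encode()).hexdigest(), 16) % NUM_PREFIXES
    return f"{bucket_index:02x}/{doc_id}"
```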

Best,

Didier

AWS
EXPERT
answered 6 months ago
  • I get 10,000 objects in about 3 seconds, so I don't think I reach the limit (and thanks to your quote, I now know that I can hash objects to multiple prefixes). However, I end up using 300 GB-seconds of AWS Lambda per search request, and I am looking for a more performant approach in the sense that performance will translate into lower costs.
