Most efficient (cost/speed) way to check if multiple files are in an S3 "folder" with Python/Boto3


I have an S3 "folder" (i.e. "s3_bucket/data/") that contains millions of JSON files that are saved as "file_id.json".

We then have a list of file_ids, between 1 and 100 in size (["file_id1", "file_id2", "file_id3", ...]).

Currently I loop through each file_id and request the file inside a try/except. If the request fails, the file doesn't exist and the id is passed to a separate process for creation. This feels costly and inefficient.

What would you recommend as the quickest way of finding the file_ids that are NOT currently in the bucket/folder using Python/Boto3?

I considered listing all objects under the prefix ("s3_bucket/data") and then doing a local compare of the two lists, but AWS currently restricts the listing to 1,000 objects per request, so this seems equally inefficient given the size of the folder.

Thanks!

Jack J
asked a year ago · 2945 views
2 Answers
Accepted Answer

To check for the existence of multiple files in an S3 "folder" using Python and Boto3, the most efficient method is to take advantage of S3's prefix and delimiter options in the list_objects_v2 operation. However, as you rightly point out, list_objects_v2 returns at most 1,000 keys (objects) per request.

Here are two possible solutions:

Pagination: AWS SDKs return paginated results when a response is too large to return in one call. Boto3's built-in paginators use this interface to retrieve all the objects under a prefix, regardless of their number, which works around the 1,000-key limit. This approach would involve listing every object under the prefix and comparing the keys against your list of file_ids. However, please note that with millions of files in the bucket this operation can be time-consuming and can cost more, as you are charged for each LIST request (and each request returns at most 1,000 keys).
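
A minimal sketch of this approach, assuming the bucket name and "data/" prefix from the question (the function name is just for illustration):

import boto3

def find_missing_ids(file_ids, bucket='s3_bucket', prefix='data/'):
    s3 = boto3.client('s3')
    existing_keys = set()
    # The paginator issues as many LIST requests as needed, 1,000 keys per page
    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get('Contents', []):
            existing_keys.add(obj['Key'])
    # Local compare: an id is missing if its expected key was never listed
    return [fid for fid in file_ids if f"{prefix}{fid}.json" not in existing_keys]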

Parallelized checking: Instead of checking the files one by one, you could use a parallelized approach where you check the existence of multiple files concurrently. This could be done by using multi-threading or multi-processing in Python. However, you still pay for one GET (or HEAD) request per file checked, and you would need to handle exceptions for rate limiting (too many requests in a short period).
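
A minimal sketch of the concurrent approach, using HEAD requests (head_object) rather than full GETs since only existence matters; the bucket name, prefix, and worker count are assumptions:

import boto3
from botocore.exceptions import ClientError
from concurrent.futures import ThreadPoolExecutor

s3 = boto3.client('s3')

def object_exists(bucket, key):
    try:
        s3.head_object(Bucket=bucket, Key=key)
        return True
    except ClientError as e:
        if e.response['Error']['Code'] == '404':
            return False
        raise  # surface throttling / permission errors instead of hiding them

def find_missing(file_ids, bucket='s3_bucket', prefix='data/'):
    keys = [f"{prefix}{fid}.json" for fid in file_ids]
    with ThreadPoolExecutor(max_workers=10) as pool:
        exists_flags = pool.map(lambda k: object_exists(bucket, k), keys)
    return [fid for fid, found in zip(file_ids, exists_flags) if not found]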

An important point to consider is that the cost-efficiency and speed of these methods can depend on the specific use case and circumstances, such as the size of the S3 bucket and the number of file_ids you are checking.

As an alternative to using Python and Boto3, you might consider using AWS Glue. AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development. AWS Glue consists of a central data catalog, ETL (extract, transform, and load) engine, and flexible scheduler. AWS Glue can catalog your data, clean it, enrich it, and move it reliably between various data stores. Glue also provides various interfaces for different types of users including data scientists, data analysts, and ETL developers.

answered a year ago

If you already have the file names and the prefix, you can build each object key directly instead of listing and scanning the bucket contents. Listing millions of objects just to verify a handful of keys is an expensive operation; instead you can issue a HEAD request per key to check whether the object exists, which is better and cheaper than a GET because only the object's metadata is returned and no data is transferred. For example, you can create a function like this:

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client('s3')  # create the client once and reuse it

def s3_exists(s3_bucket, s3_key):
    try:
        s3.head_object(Bucket=s3_bucket, Key=s3_key)
        return True
    except ClientError as e:
        # 404 means the object does not exist; re-raise anything else
        if e.response['Error']['Code'] == '404':
            return False
        raise

Then loop through the list of keys calling this function; you can later add parallelism (for example a thread pool) to improve execution time.
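
A short usage sketch, assuming the bucket and prefix naming from the question:

missing_ids = [fid for fid in file_ids
               if not s3_exists('s3_bucket', f"data/{fid}.json")]

With only 1-100 ids per batch the request count stays small either way; parallelism mainly shortens the wall-clock time.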

luneo7
answered 9 months ago
