Hello,
I have an S3 bucket with over 12 TB of data. The S3 metrics show it has over 10 million objects. I needed to list all the objects in this bucket, so I wrote a Python script for it.
I ran the code 3 times, but it stopped returning a NextContinuationToken after about 3 million records, so I cannot list all the objects in my bucket; I am only able to list 3 million records. I even tried running the code with the StartAfter parameter set to the Key of the last record it fetched previously, and still got no NextContinuationToken. Here's my code:
============= Python Code =============
import boto3
import pandas as pd
client = boto3.client("s3")
continuation_token = None
record_count = 0
payload = dict(
    Bucket='...',
    MaxKeys=1000
)
while True:
    print("Fetching records...")
    if continuation_token:
        payload.update(ContinuationToken=continuation_token)
    response = client.list_objects_v2(**payload)
    contents = response.get("Contents")
    if not contents:
        exit("Process Finished")
    # Append this page of results to the CSV
    pd.DataFrame(contents).to_csv("./s3_objects.csv", index=False, mode="a", header=False)
    # Accumulate the record count
    record_count += len(contents)
    print("Total records fetched", record_count, "\n")
    # Update the continuation token; S3 omits it on the last page
    continuation_token = response.get("NextContinuationToken")
    print(continuation_token)
    if not continuation_token:
        exit("Process Finished")
============= Python Code =============
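As a sanity check on the pagination pattern itself, here is the same ContinuationToken loop run against a tiny in-memory stand-in for `list_objects_v2`. Note that `FakeS3Client`, `list_all_keys`, and the object keys are invented for illustration; this only demonstrates that the loop structure drains every page, not anything about S3's actual behavior on your bucket:

```python
# A minimal stand-in for S3's list_objects_v2 pagination contract:
# each call returns up to MaxKeys items and, when more remain,
# a NextContinuationToken (here simply the next start index).
class FakeS3Client:
    def __init__(self, keys):
        self.keys = keys

    def list_objects_v2(self, Bucket, MaxKeys, ContinuationToken=None):
        start = int(ContinuationToken) if ContinuationToken else 0
        page = self.keys[start:start + MaxKeys]
        response = {"Contents": [{"Key": k} for k in page]}
        if start + MaxKeys < len(self.keys):
            response["NextContinuationToken"] = str(start + MaxKeys)
        return response

def list_all_keys(client, bucket, page_size=1000):
    """Fetch every key, page by page, until S3 stops returning a token."""
    keys = []
    token = None
    while True:
        kwargs = {"Bucket": bucket, "MaxKeys": page_size}
        if token:
            kwargs["ContinuationToken"] = token
        response = client.list_objects_v2(**kwargs)
        keys.extend(obj["Key"] for obj in response.get("Contents", []))
        token = response.get("NextContinuationToken")
        if not token:  # the final page carries no token
            return keys

client = FakeS3Client([f"obj-{i}" for i in range(2500)])
print(len(list_all_keys(client, "demo-bucket")))  # 2500
```

Against a real bucket, boto3's built-in paginator (`client.get_paginator("list_objects_v2")`) does this token bookkeeping for you and is worth trying to rule out a bug in hand-rolled loop logic.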
Please help me understand and fix this problem.
Thank you so much!
Update: I tried using the AWS CLI on Ubuntu 18.04 LTS.
Command: aws s3 ls s3://... --recursive --summarize --human-readable > total_objects.txt
Result (at the end of the file):
Total Objects: 3036799
Total Size: 2.2 TiB
Note: I've obviously stripped out the original bucket name and replaced it with "...".
Edited by: ISanV on May 5, 2021 8:06 AM