Python cloudpathlib for S3 - my Python code is very slow to download my data files (total 15 GB). How to fix?


Hi,

I want to download the .mkv files in my S3 bucket. I tried with a sample S3 bucket with subfolders, and it works very smoothly.

But on my production side, my .mkv files are fairly large, and the download via the Python code takes a very long time to complete.

Whereas if I use a tool like Cyberduck, the entire 15 GB is downloaded in 15 minutes. So there is nothing wrong with the settings of the AWS S3 bucket or its policies.

My code is as below.

import os
import shutil
import time
from cloudpathlib import CloudPath, S3Client


downloadFolder = input('Enter the folder path for saving Downloaded Files:')
convertedFolder = input('Enter the folder path for saving Converted Files:')

# read the access variables from file
credential = os.path.join(os.getcwd(), 'credentials.txt')
myvars = {}
with open(credential, "r") as myfile:
    for line in myfile:
        line = line.strip()
        name, var = line.partition("=")[::2]
        myvars[name.strip()] = var.strip()

# access variables
access_key = myvars['access_key']
secret_access_key = myvars['secret_access_key']
bucket_name = myvars['bucket_name']
folder_path = myvars['folder_path']

# Connect to s3 service
client = S3Client(aws_access_key_id=access_key, aws_secret_access_key=secret_access_key)
# s3 = boto3.client('s3', region, aws_access_key_id=access_key, aws_secret_access_key=secret_access_key)


root_dir = CloudPath("s3://"+bucket_name+"/", client=client)

# find the number of files in the S3 folder
totalFileCount = sum(1 for _ in root_dir.glob(folder_path + '/*'))
print('Total no. of files:', totalFileCount)

# for every two seconds, print the status of download/converted
measure1 = time.time()
measure2 = time.time()

filesCompleted = 0
# initialised here so the final while loop cannot hit a NameError
curDownloadCount = 0
curConvertedCount = 0

for f in root_dir.glob(folder_path+'/*'):
    filename = f.name
    print("file= "+filename)
    curFileName = 'store1_' + filename
    f.download_to(downloadFolder+os.sep+curFileName)

    # "convert" .mkv to .mp4 (currently just a copy with a new
    # extension -- no real transcoding happens here)

    newName, ext = os.path.splitext(curFileName)
    outFileName = newName + '.mp4'
    src_path = downloadFolder + os.sep + curFileName
    dst_path = convertedFolder + os.sep + outFileName
    shutil.copy(src_path, dst_path)

    # For every two seconds print the status
    if measure2 - measure1 >= 2:

        # Find total no. of files in Downloaded folder
        currentlyDownloadedFiles = os.listdir(downloadFolder)

        curDownloadCount = len(currentlyDownloadedFiles)

        curConvertedFiles = os.listdir(convertedFolder)
        curConvertedCount = len(curConvertedFiles)

        print("Status ==> Downloaded: " + str(curDownloadCount) + "/" + str(totalFileCount) + "  Converted: " + str(
            curConvertedCount) + "/" + str(totalFileCount))
        measure1 = measure2
        measure2 = time.time()
    else:
        measure2 = time.time()

# client.set_as_default_client()
# S3Client.get_default_client()
#continue printing status until the Downloads and Converting files are fully complete
while (curDownloadCount < totalFileCount or  curConvertedCount < totalFileCount):
    currentlyDownloadedFiles = os.listdir(downloadFolder)
    curDownloadCount = len(currentlyDownloadedFiles)
    curConvertedFiles = os.listdir(convertedFolder)
    curConvertedCount=len(curConvertedFiles)

    # For every two seconds print the status
    if measure2 - measure1 >= 2:

        # Find total no. of files in Downloaded folder
        currentlyDownloadedFiles = os.listdir(downloadFolder)
        curDownloadCount = len(currentlyDownloadedFiles)

        curConvertedFiles = os.listdir(convertedFolder)
        curConvertedCount = len(curConvertedFiles)

        print("Status ==> Downloaded: " + str(curDownloadCount) + "/" + str(totalFileCount) + "  Converted: " + str(
            curConvertedCount) + "/" + str(totalFileCount))
        measure1 = measure2
        measure2 = time.time()
    else:
        measure2 = time.time()


Please let me know where I am going wrong.

Thanks, Sabarisri

Asked 2 years ago · 234 views
1 Answer

Are you trying to download to your PC or a local drive? You can use the S3 client's download_fileobj (or download_file, which takes a file name instead of a file object) and pair it with a TransferConfig object to transfer multiple parts at the same time. I work with large files, so my chunk size is large, but you get the idea. You can make the chunks much smaller, but the maximum number of parts per object is 10,000, so be sure you don't exceed that. Hope this helps. source is the bucket, key is the path and file name in S3. (I'm downloading to my PC, so I use half the processes I could.)

    # needs: pip install boto3 psutil
    import psutil
    from boto3.s3.transfer import TransferConfig

    mb = 1024 ** 2
    config = TransferConfig(multipart_threshold=1000 * mb,
                            max_concurrency=int(psutil.cpu_count() / 2),
                            multipart_chunksize=1000 * mb, use_threads=True)
    try:
        with open(destination, 'wb') as data:
            s3.download_fileobj(source, key, data, Config=config)
    except Exception as exc:
        print('Download failed:', exc)
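
Since you have many .mkv files, you can also overlap the per-file transfers themselves. A minimal sketch (assuming a configured boto3 `s3` client, a `keys` list, and the `config` above; `local_path` and `download_all` are hypothetical helper names, with `local_path` mimicking the "store1_" prefix from your code):

```python
import os
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical helper: map an S3 key to a local file path, prefixing
# the file name the same way the question's code does ("store1_").
def local_path(download_folder, key):
    return os.path.join(download_folder, "store1_" + key.rsplit("/", 1)[-1])

def download_all(s3, bucket, keys, download_folder, config, workers=4):
    # Run several single-file downloads at once; each download_file
    # call additionally splits its object into parts per TransferConfig.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {
            pool.submit(s3.download_file, bucket, key,
                        local_path(download_folder, key), Config=config): key
            for key in keys
        }
        for fut in as_completed(futures):
            fut.result()  # re-raise any download error
            print("done:", futures[fut])
```

Keep `workers * max_concurrency` modest, or the threads will just contend for the same bandwidth.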
answered 2 years ago
