
Questions tagged with S3 Transfer Acceleration



Python cloudpathlib for S3 - My Python code is very slow to download my data files (total 15 GB). How to fix?

Hi, I want to download the .mkv files in my S3 bucket. I tried with a sample S3 bucket with subfolders and it works very smoothly. But on my production side the .mkv files are a bit large, and downloading them via my Python code takes a very long time to complete. Whereas if I use a tool like Cyberduck, the entire 15 GB is downloaded in 15 minutes. So it means there is nothing wrong with the settings of the AWS S3 bucket and its policies. My code is below.

```
import os
import shutil
import time
from os import path, makedirs
from cloudpathlib import CloudPath
from cloudpathlib import S3Client

downloadFolder = input('Enter the folder path for saving Downloaded Files:')
convertedFolder = input('Enter the folder path for saving Converted Files:')

# read the access variables from file
credential = os.path.expanduser(os.getcwd() + os.sep + 'credentials.txt')
myvars = {}
with open(credential, "r") as myfile:
    for line in myfile:
        line = line.strip()
        name, var = line.partition("=")[::2]
        myvars[name.strip()] = str(var)

# access variables
access_key = "{}".format(myvars['access_key'])
secret_access_key = "{}".format(myvars['secret_access_key'])
bucket_name = "{}".format(myvars['bucket_name'])
folder_path = "{}".format(myvars['folder_path'])

# Connect to the S3 service
client = S3Client(aws_access_key_id=access_key, aws_secret_access_key=secret_access_key)
# s3 = boto3.client('s3', region, aws_access_key_id=access_key, aws_secret_access_key=secret_access_key)
root_dir = CloudPath("s3://" + bucket_name + "/", client=client)

# find the number of files in the S3 bucket
totalFileCount = 0
for f in root_dir.glob(folder_path + '/*'):
    totalFileCount = totalFileCount + 1
print('Total no. of files')
print(totalFileCount)

# every two seconds, print the status of downloaded/converted files
measure1 = time.time()
measure2 = time.time()
filesCompleted = 0
for f in root_dir.glob(folder_path + '/*'):
    filename = f.name
    print("file= " + filename)
    curFileName = 'store1_' + filename
    f.download_to(downloadFolder + os.sep + curFileName)

    # convert .mkv to .mp4
    newName, ext = os.path.splitext(curFileName)
    outFileName = newName + '.mp4'
    src_path = downloadFolder + os.sep + curFileName
    dst_path = convertedFolder + os.sep + outFileName
    shutil.copy(src_path, dst_path)

    # every two seconds, print the status
    if measure2 - measure1 >= 2:
        # find the total no. of files in the Downloaded folder
        currentlyDownloadedFiles = os.listdir(downloadFolder)
        curDownloadCount = len(currentlyDownloadedFiles)
        curConvertedFiles = os.listdir(convertedFolder)
        curConvertedCount = len(curConvertedFiles)
        print("Status ==> Downloaded: " + str(curDownloadCount) + "/" + str(totalFileCount) +
              " Converted: " + str(curConvertedCount) + "/" + str(totalFileCount))
        measure1 = measure2
        measure2 = time.time()
    else:
        measure2 = time.time()

# client.set_as_default_client()
# S3Client.get_default_client()

# continue printing the status until downloading and converting are fully complete
while (curDownloadCount < totalFileCount or curConvertedCount < totalFileCount):
    currentlyDownloadedFiles = os.listdir(downloadFolder)
    curDownloadCount = len(currentlyDownloadedFiles)
    curConvertedFiles = os.listdir(convertedFolder)
    curConvertedCount = len(curConvertedFiles)

    # every two seconds, print the status
    if measure2 - measure1 >= 2:
        # find the total no. of files in the Downloaded folder
        currentlyDownloadedFiles = os.listdir(downloadFolder)
        curDownloadCount = len(currentlyDownloadedFiles)
        curConvertedFiles = os.listdir(convertedFolder)
        curConvertedCount = len(curConvertedFiles)
        print("Status ==> Downloaded: " + str(curDownloadCount) + "/" + str(totalFileCount) +
              " Converted: " + str(curConvertedCount) + "/" + str(totalFileCount))
        measure1 = measure2
        measure2 = time.time()
    else:
        measure2 = time.time()
```

Please let me know where I am wrong. Thanks, Sabarisri
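A likely explanation for the gap is that the loop above downloads one object at a time over a single connection, while GUI clients such as Cyberduck run many transfers in parallel. Below is a minimal sketch of a more concurrent approach using boto3 managed transfers plus a thread pool; the bucket name, prefix, local folder, and worker counts are placeholders to tune, not values taken from the question.

```
# Hypothetical sketch: download all objects under a prefix concurrently with boto3.
# Bucket name, prefix, and local folder are placeholders, not values from the question.
import os
from concurrent.futures import ThreadPoolExecutor

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")  # credentials resolved from the environment or instance profile

bucket = "my-bucket"          # placeholder
prefix = "videos/"            # placeholder
download_folder = "/tmp/dl"   # placeholder
os.makedirs(download_folder, exist_ok=True)

# Multipart/parallel settings applied to each individual object.
config = TransferConfig(multipart_threshold=8 * 1024 * 1024, max_concurrency=10)

def download_one(key):
    local_path = os.path.join(download_folder, os.path.basename(key))
    s3.download_file(bucket, key, local_path, Config=config)
    return local_path

# List the .mkv keys under the prefix, then download several objects at once.
paginator = s3.get_paginator("list_objects_v2")
keys = [obj["Key"]
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix)
        for obj in page.get("Contents", [])
        if obj["Key"].endswith(".mkv")]

with ThreadPoolExecutor(max_workers=8) as pool:
    for path in pool.map(download_one, keys):
        print("downloaded", path)
```

Separately, note that `shutil.copy` only copies the bytes under an `.mp4` file name; it does not actually transcode the video.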
1 answer · 0 votes · 9 views · asked 23 days ago

Data transfer speeds from S3 bucket -> EC2 SLURM cluster are slower than S3 bucket -> Google SLURM cluster

Hello, I am currently benchmarking big-data multi-cloud transfer speeds at a range of parallel read counts using a cluster of EC2 instances and similar Google machines. I first detected an issue when using a `c5n.2xlarge` EC2 instance for my worker nodes reading a 7 GB dataset in multiple formats from an S3 bucket. I have verified that the bucket is in the same region as the EC2 nodes, but the data transfer was far slower to the EC2 instances than it was to GCP. The data is not going to EBS; it is read into memory, and the data chunks are removed from memory when the process completes.

Here is a list of things I have tried to diagnose the problem:

1. Upgrading to a bigger instance type. I am aware that there is a network bandwidth limit for each instance type, and I saw a read-speed increase when I changed to a `c5n.9xlarge` (from your documentation, there should be 50 Gbps of bandwidth), but it was still slower than reading from S3 to a Google VM with much greater network distance. I also upgraded the instance type again, but there was little to no speed increase. Note that hyperthreading is turned off for each EC2 instance.
2. Changing the S3 parameter `max_concurrent_requests` to `100`. I am using Python to benchmark these speeds, so this parameter was passed into a `storage_options` dictionary that is used by the different remote data access APIs (see the [Dask documentation](https://docs.dask.org/en/stable/how-to/connect-to-remote-data.html#:~:text=%22config_kwargs%22%3A%20%7B%22s3%22%3A%20%7B%22addressing_style%22%3A%20%22virtual%22%7D%7D%2C) for more info). Editing this parameter had no effect on the transfer speeds.
3. Verified that enhanced networking is active on all worker nodes and the controller node.
4. Performed the data transfer directly from a worker node's command line for both the AWS and GCP machines. This was done to rule out my testing code being at fault, and the results were the same: S3 -> EC2 was slower than S3 -> GCP.
5. Varying how many cores of each EC2 instance are used in each SLURM job. For the Google machines, each worker node has 4 cores and 16 GB of memory, so each job I submit there takes up an entire node. However, once I upgraded my EC2 worker node instances, there are clearly more than 4 cores per node. To keep the comparison fair, I configured each SLURM job to use only 8 cores per node in my EC2 cluster (I am performing 40 parallel reads at maximum, so if my understanding is correct each node will have 8 separate data stream connections, with 5 nodes active at a time on `c5n.9xlarge` instances). I also tried allocating all of a node's resources to the 40 parallel reads (2 instances with all 18 cores active, and a third worker instance with only 4 cores active), but there was no effect.

I'm fairly confident there is a solution to this, but I am having an extremely difficult time figuring out what it is. I know that setting an endpoint shouldn't be the problem, because GCP is faster than EC2 even though egress is occurring there. Any help would be appreciated, because I want to make sure I get an accurate picture of S3 -> EC2 before presenting my work. Please let me know if more information is needed!
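One way to narrow this down might be to separate the network path from the Python data-access stack. The sketch below (bucket and key names are placeholders, not from the question) measures raw S3 read throughput from a single node using plain boto3 ranged GETs in parallel; if this approaches the instance's advertised bandwidth while the Dask/`storage_options` path does not, the bottleneck is likely client-side configuration rather than S3 or EC2 networking.

```
# Hypothetical sketch: measure raw S3 read throughput with parallel ranged GETs,
# independent of Dask/s3fs. Bucket, key, and sizes below are placeholders.
import time
from concurrent.futures import ThreadPoolExecutor

import boto3

s3 = boto3.client("s3")
bucket = "my-benchmark-bucket"       # placeholder
key = "dataset/part-0000.parquet"    # placeholder
workers = 40                         # number of parallel streams to test
chunk = 64 * 1024 * 1024             # 64 MiB per ranged GET

size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]
ranges = [(start, min(start + chunk, size) - 1) for start in range(0, size, chunk)]

def fetch(rng):
    start, end = rng
    body = s3.get_object(Bucket=bucket, Key=key, Range=f"bytes={start}-{end}")["Body"]
    return len(body.read())  # read into memory and discard, as in the benchmark

t0 = time.time()
with ThreadPoolExecutor(max_workers=workers) as pool:
    total = sum(pool.map(fetch, ranges))
elapsed = time.time() - t0
print(f"{total / 1e9:.2f} GB in {elapsed:.1f} s -> {total * 8 / elapsed / 1e9:.2f} Gbit/s")
```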
1 answer · 0 votes · 36 views · asked a month ago

How to properly and completely terminate a multipart upload?

In our Java app we have what is basically boilerplate S3 V2 code for creating a multipart upload of a file to S3. We absolutely need the ability to cancel the upload and recover all resources used by the upload process, INCLUDING the CPU and network bandwidth.

Initially we tried simply cancelling the completionFuture on the FileUpload, but that doesn't work: I can watch the network traffic continue to send data to S3 until the entire file is uploaded. Cancelling the completionFuture seems to stop S3 from reconstructing the file, but that's not sufficient. In most cases we need to cancel the upload because we need the network bandwidth for other things, like streaming video. I found the function shutdownNow() in the TransferManager class, and that seemed promising, but it looks like it's not available in the V2 SDK (I found it in the V1 sources). I've seen a function getSubTransfers() in the V1 MultipleFileUpload class that returns a list of Uploads, and the Upload class has an abort() function, but again, we need to use V2 for other reasons. I've also found and implemented code that calls listMultipartUploads, looks for the upload we want to cancel, creates an abortMultipartUploadRequest, and issues it, and the threads keep on rolling, and rolling, and rolling....

Is there a "correct" way of terminating a multipart upload, including the threads processing the upload?
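For reference, the S3-side cleanup the poster describes (listing in-progress multipart uploads and aborting the one you want to cancel) looks roughly like the sketch below. It is shown in Python/boto3 rather than the Java V2 SDK, and the bucket and key names are placeholders; as the question notes, this only releases the parts already stored in S3 and does not by itself stop the client-side threads that are still pushing data.

```
# Hypothetical sketch: abort an in-progress multipart upload on the S3 side with boto3.
# Bucket and key are placeholders; this frees stored parts but does not stop
# client-side upload threads, which must be cancelled in the SDK doing the transfer.
import boto3

s3 = boto3.client("s3")
bucket = "my-bucket"                   # placeholder
key_to_cancel = "videos/big-file.mkv"  # placeholder

resp = s3.list_multipart_uploads(Bucket=bucket)
for upload in resp.get("Uploads", []):
    if upload["Key"] == key_to_cancel:
        s3.abort_multipart_upload(
            Bucket=bucket,
            Key=upload["Key"],
            UploadId=upload["UploadId"],
        )
        print("aborted", upload["Key"], upload["UploadId"])
```

A bucket lifecycle rule with `AbortIncompleteMultipartUpload` can also clean up parts left behind by cancelled or failed uploads.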
0 answers · 0 votes · 50 views · asked 3 months ago