Best way to read multiple files from S3 in parallel with .NET

1

I have a lot of XML files in S3 (more than 1.2 million). They are in 12 folders (one per month) with about 100,000 files in each folder. Each file is small, about 6-7 KB. I have to read each file, parse the fields in the XML, and call an external REST API with that information. (I don't need to download the files locally, only read them.) What is the best way to do it? I tried it sequentially, but it is very slow.

I also tried putting the files into an array and using Parallel.For, but that is slow too. I don't count the time to call the API. I see that reading and parsing the files runs at about 10 files per second. (arrayficheros has 1,000 elements at most and only takes about 6-7 MB of RAM.)

using (var stream = await client.GetObjectStreamAsync(parametros.bucketName, responseMetadatos.S3Objects[x].Key, null))
{
    arrayficheros[x] = new StreamReader(stream).ReadToEnd();
}
-------
Parallel.For(0, arrayficheros.Length,
    new ParallelOptions { MaxDegreeOfParallelism = 6 },
    x =>
    {
        DoWork(arrayficheros[x]);
    });

Thanks

  • Is this supposed to be a one-time thing, or do you need a reusable solution? Where are you running this code? On your machine or on some cloud service? Where do you see the bottleneck? I.e. is your CPU at 100%, or is your bandwidth maxed out? How many requests per second can the target API handle?

    You can approach this from different angles. You can either try to optimize your code (there's room for improvement) or you can scale out (e.g. by leveraging Lambda).

asked 2 years ago · 5,883 views
3 Answers
1

This is a good scenario for partitioning to achieve horizontal scale and parallel processing. Consider partitioning per folder, each corresponding to a month, or, if possible, partitioning per month and day range, for example: Jan 01-10, Jan 11-20, Jan 21-31, Feb 01-10, and so on. Each partition would then be processed by one instance; you can run these on Amazon ECS as Fargate tasks. In your C# logic, consider implementing error retries and exponential backoff, since there will be a lot of external API calls; read the post Exponential Backoff And Jitter to learn more about the benefits of exponential backoff.
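As a minimal illustration of that advice, here is a C# sketch of retrying with exponential backoff and full jitter. The Retry class name, the 200 ms base delay, the 20-second cap, and the attempt count are illustrative assumptions, not values from the post:

using System;
using System.Threading.Tasks;

public static class Retry
{
    // Retries an async operation, waiting a random ("full jitter") delay that
    // grows exponentially with each failed attempt, up to a fixed cap.
    public static async Task<T> WithBackoffAsync<T>(Func<Task<T>> operation, int maxAttempts = 5)
    {
        for (var attempt = 0; ; attempt++)
        {
            try
            {
                return await operation();
            }
            catch (Exception) when (attempt < maxAttempts - 1)
            {
                // full jitter: sleep between 0 and min(cap, baseDelay * 2^attempt) milliseconds
                var capMs = Math.Min(20_000, 200 * (1 << attempt));
                await Task.Delay(Random.Shared.Next(0, capMs));
            }
        }
    }
}

You would then wrap each external API call (and, if needed, each S3 read) in something like Retry.WithBackoffAsync(() => httpClient.PostAsync(url, content)).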

AWS
answered 2 years ago
1

Note that you can send up to 5,500 GET requests per second per prefix in an Amazon S3 bucket. Considering you have approx. 100,000 files in each of the 12 folders and you can read from those folders in parallel, you should be able to send all those GET requests in approx. 20 seconds. However, the network latency as well as the time needed to transfer the actual file content will increase the overall time needed. This will also be affected by the actual processing duration of your files (calling external APIs etc.)

To process the content of an object (file) stored in Amazon S3, you need to download that object. You could also read only a portion of the object data if you know the exact byte offsets of interest. However, for text-based file formats like XML, it is probably better to download the whole object, because the XML parsers could fail processing partial content.

To improve overall performance, you could parallelize the file downloading and the data processing. Consider the producer-consumer pattern here. You would have multiple producers (downloaders) and multiple consumers (data processors). I assume you can parallelize the data processing, and the files are independent of each other and don't have any particular processing order requirements.

If you are not limited on storage, you could use the Amazon.S3.Transfer.TransferUtility class to download your files in parallel. This utility will handle all the connectivity for you (e.g. retrying on failure). You can specify the number of parallel requests for downloading:

using Amazon.S3;
using Amazon.S3.Transfer;

using var client = new AmazonS3Client();

// 100 requests in parallel - adjust as needed
var config = new TransferUtilityConfig { ConcurrentServiceRequests = 100 };
using var utility = new TransferUtility(client, config);
var request = new TransferUtilityDownloadDirectoryRequest
{
    DownloadFilesConcurrently = true,
    BucketName = "YourBucket", // your bucket name here
    LocalDirectory = "/your/path/", // the path where to store the downloaded files
    S3Directory = "prefix-by-month" // the bucket prefix (folder) for a month
};

await utility.DownloadDirectoryAsync(request);

You can construct 12 requests for your 12 prefixes and download them all in parallel (12 * 100 parallel requests). Something like this:

var prefixes = new List<string>(); // 12 prefixes (folders) here
var requests = prefixes.Select(
    p => new TransferUtilityDownloadDirectoryRequest { S3Directory = p, /* ... */});
var downloadTasks = requests.Select(r => utility.DownloadDirectoryAsync(r));

// wait until all 12 tasks complete in parallel
await Task.WhenAll(downloadTasks);

Now you can either wait until all the files have been downloaded, or subscribe to the request.DownloadedDirectoryProgressEvent event and process the files as they arrive.
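As a rough sketch of the second option (the handler body here only logs progress; in practice you would enqueue completed files for parsing, and the handler must be attached before calling DownloadDirectoryAsync):

// attach to the request built above
request.DownloadedDirectoryProgressEvent += (sender, args) =>
{
    // reports how many files are done and which file is currently transferring
    Console.WriteLine(
        $"{args.NumberOfFilesDownloaded}/{args.TotalNumberOfFiles} files, current: {args.CurrentFile}");
};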

Note that you are limited in the number of thread pool threads, so you can only execute a limited number of requests in parallel. You can obtain these limits by calling ThreadPool.GetMaxThreads(out var workerThreads, out var portThreads).

If you don't want to or cannot use storage as a cache for the files, you can implement your producer-consumer data flow (e.g. via System.Threading.Tasks.Dataflow) using the AWS SDK GetObjectStreamAsync method; a rough sketch follows the list below. However, you would also need to implement all the necessary details yourself:

  • retrying when the call fails
  • retrying when the request throttling occurs
  • chunk-wise reading from the stream obtained by GetObjectStreamAsync (the stream might contain only a partial file, so you would need to read the stream multiple times as new data gets transferred)
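Here is a minimal TPL Dataflow sketch as a starting point. It assumes objectKeys is the list of keys you already obtain from listing the bucket, DoWork is your existing parse-and-call-API method, the System.Threading.Tasks.Dataflow package is referenced, and the degrees of parallelism are illustrative; the retry and chunked-reading details from the list above are not implemented:

using System.IO;
using System.Threading.Tasks.Dataflow;
using Amazon.S3;

var client = new AmazonS3Client();

// producer: download each object's content into memory
var downloadBlock = new TransformBlock<string, string>(
    async key =>
    {
        using var stream = await client.GetObjectStreamAsync("YourBucket", key, null);
        using var reader = new StreamReader(stream);
        return await reader.ReadToEndAsync();
    },
    new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 32 });

// consumer: parse the XML and call the external API
var processBlock = new ActionBlock<string>(
    xml => DoWork(xml),
    new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 6 });

downloadBlock.LinkTo(processBlock, new DataflowLinkOptions { PropagateCompletion = true });

foreach (var key in objectKeys)   // keys listed via ListObjectsV2
    await downloadBlock.SendAsync(key);

downloadBlock.Complete();
await processBlock.Completion;

The bounded MaxDegreeOfParallelism values keep the downloaders from outrunning the processors; you can tune them independently once you see where the bottleneck is.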

You can also use the source code of the Amazon.S3.Transfer.TransferUtility class as inspiration and create your own utility that gets objects from an S3 bucket in parallel but, instead of storing them to files, pushes them into an in-memory queue for further processing. The source code can be found on GitHub.

AWS
answered 2 years ago
0

What you have with Parallel.For seems like the optimal way to do this to me. Could it simply be a bandwidth issue? One way to test this would be to spin up an EC2 instance in the same region as the bucket, add some Console.WriteLine calls, and quickly try the same code there.

StevenS
answered 2 years ago
