Hello,
Thank you for writing on re:Post.
I understand that you would like to increase the speed at which your Spark application writes to S3.
I would recommend experimenting with increasing the size of the executors and/or the number of executors, since it looks like the executors are finishing quite quickly.
Other things to consider include:
- How the data is partitioned: partitioning the data can often improve write efficiency, and can be done with partitionBy on the write.
- The data format being written: this can have a significant performance impact.
- Using Redshift's UNLOAD command, if the data isn't being heavily modified in Spark; UNLOAD performance is typically quite good. For more information on UNLOAD: https://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html#unload-parameters
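The partitioning and file-sizing suggestions above can be sketched as follows. This is a minimal illustration, not the asker's actual job: `df`, the S3 path, the partition column `event_date`, and the 100 MB target file size are all assumptions.

```python
def target_partitions(total_bytes, target_file_mb=100):
    """Number of output files needed so each lands near target_file_mb."""
    target = target_file_mb * 1024 * 1024
    return max(1, (total_bytes + target - 1) // target)  # ceiling division


def write_partitioned(df, path, total_bytes):
    """Hedged sketch: repartition controls the output file count,
    partitionBy controls the S3 directory layout."""
    (df.repartition(target_partitions(total_bytes))
       .write
       .partitionBy("event_date")   # hypothetical partition column
       .mode("overwrite")
       .parquet(path))              # columnar format, COPY-friendly
```

The key idea is that `repartition` (file count) and `partitionBy` (directory layout) address different problems, and both affect downstream COPY performance.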
Please try the above suggestions. Hope they improve your experience.
Thanks and have a great day!
Without analyzing in detail, I know that using Redshift Spectrum to read the files may, in certain cases, be faster than COPY; for example, it can automatically split the files. If you can catalog this data set as an external table and then do an insert/select, you may get better performance.
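The Spectrum route described above amounts to two SQL statements, sketched here as strings; the schema, table, column, and bucket names are hypothetical, and the external schema is assumed to already exist (created with CREATE EXTERNAL SCHEMA against a Glue catalog).

```python
# Hypothetical names throughout; adapt to the real table definition.
create_external_table = """
CREATE EXTERNAL TABLE spectrum_schema.events_ext (
    event_id   BIGINT,
    event_date DATE
)
STORED AS PARQUET
LOCATION 's3://my-bucket/events/';
"""

# Spectrum reads the Parquet files in parallel during the SELECT,
# so the load becomes a plain insert/select instead of a COPY.
insert_select = """
INSERT INTO events
SELECT event_id, event_date
FROM spectrum_schema.events_ext;
"""
```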
Hi,
I understand you have an application in EMR writing to S3 in Parquet format, and you are trying to load the generated data into Redshift using the COPY command.
You have built your table using best practices; at the same time, the size of your cluster is quite small: 2 ra3.xlplus nodes (8 cores and 64 GB of memory in total).
As mentioned in the documentation, you should load from multiple files (which is surely your situation), ideally a multiple of the number of slices in your cluster.
The nodes you are utilizing have 2 slices each; this means you will load 4 files at a time.
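The slice arithmetic above is simple enough to sketch directly (node and slice counts from this answer: 2 nodes, 2 slices per ra3.xlplus node):

```python
def parallel_loads(nodes, slices_per_node=2):
    """COPY loads up to one file per slice in parallel,
    so parallelism is nodes * slices per node."""
    return nodes * slices_per_node

# 2 ra3.xlplus nodes, 2 slices each -> 4 files loaded at a time
```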
Load time will depend on 3 factors: the total size of the dataset to load (how many GB the Parquet table is), the number and size of the files, and the parallelism at which you can load.
With 13 billion rows I would expect the table to be large and have thousands of files.
If your files are already quite large (closer to the upper range, at least 125 MB, mentioned in the documentation above; for ease of reading, an excerpt: the files should be "about equal size, from 1 MB to 1 GB after compression. For optimum parallelism, the ideal file size is 1–125 MB after compression"), you could speed up the COPY command by scaling out the cluster using Fast Classic Resize, so that you can increase the number of parallel loads.
If your files are small (less than 100 MB, or even worse less than 10 or 1 MB), then I would first recommend compacting the files into larger ones and retrying; if it is still too slow, you should consider resizing the cluster anyway to increase the load parallelism.
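To make the compaction advice concrete, here is a small sketch that estimates how many output files to target for a given dataset size. The actual size of the 13-billion-row table is not stated in the question, so the `total_gb` value and the 100 MB per-file target are assumptions.

```python
def recommended_file_count(total_gb, slices, target_file_mb=100):
    """Target file count so that each file is ~target_file_mb
    (inside the 1-125 MB sweet spot) and the count is rounded up
    to a multiple of the slice count for even load balance."""
    total_mb = total_gb * 1024
    files = max(1, round(total_mb / target_file_mb))
    return ((files + slices - 1) // slices) * slices  # round up to multiple

# e.g. a hypothetical 500 GB Parquet table on a 4-slice cluster:
# recommended_file_count(500, 4) -> 5120 files of ~100 MB each
```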
Hope this helps.
Hi, while the answer is correct and thorough, the user is asking about COPY command performance.