
COPY from multiple files: does DISTKEY matter?


I've read that when copying multiple files from S3 using a single COPY command, Redshift takes advantage of parallel processing and sends data to different slices. But if my table has a DISTKEY set up, does it matter whether the files themselves are split by DISTKEY value? Or will it not affect performance, so the data can be mixed across the files?

asked 6 years ago · 369 views
2 Answers
Accepted Answer

Toebs2 is correct: you can't control which slice will process a file containing a single DISTKEY value, because Redshift doesn't expose its hashing algorithm or the mapping algorithm that takes some function of the hash value (probably some form of MOD()) onto a slice in the ring of slices.

Other analytic column-stores like Vertica expose this information and therefore support data loading optimization with a direct node (or slice) assignment of files to COPY commands.
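To make the hash-then-map pattern above concrete, here is a minimal sketch in Python. Redshift's actual hash function and slice mapping are not public, so the CRC32-plus-MOD combination and the slice count here are illustrative assumptions, not Redshift internals:

```python
# Illustrative only: Redshift does NOT expose its hashing or slice-mapping
# algorithm. This sketches the general "slice = hash(distkey) % n_slices"
# pattern; CRC32 and the slice count are assumptions for demonstration.
import zlib

N_SLICES = 8  # hypothetical cluster with 8 slices

def slice_for(distkey_value: str) -> int:
    """Map a DISTKEY value onto a slice in the ring (illustrative)."""
    return zlib.crc32(distkey_value.encode()) % N_SLICES

# Equal DISTKEY values always hash to the same slice, which is the
# property that co-locates rows sharing a key:
assert slice_for("cust_1") == slice_for("cust_1")
print({v: slice_for(v) for v in ["cust_1", "cust_2", "cust_3"]})
```

The point is only that the mapping is deterministic but opaque: without knowing the real hash, you can't pre-assign a file of one DISTKEY value to "its" slice.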

However, pre-splitting the files to load will have an impact on performance. I just can't say whether the impact is positive or negative, or whether it varies due to other factors; there are a lot of variables that influence the outcome for any given workload.

Consider a general case where there is one input file to the COPY command containing a large set of DISTKEY values that will distribute rows to all slices. That file gets assigned to one slice to process. That slice, let's call it the "initiator slice", must perform the following steps by itself:

  1. Perform all the file reading of the input.
  2. Parse all the input from the input format, probably something like delimited text lines, into records and validate field data types using the information in the COPY command.
  3. Compute the DISTKEY of each record, a hash in this case.
  4. Compute the mapping of the DISTKEY hash onto the slice ring to determine which slice will store that row.
  5. Buffer that record for the target slice.
  6. Send buffer(s) of records over the cluster interconnect network to all the target slices.
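The six initiator-slice steps above can be sketched as a small Python loop. Everything here is hypothetical (the parsing format, the hash function, the slice count); it only shows the shape of the work one slice does alone before any other slice participates:

```python
# Sketch of the "initiator slice" steps above. The pipe-delimited format,
# CRC32 hash, and 4-slice cluster are assumptions for illustration, not
# Redshift internals.
import zlib
from collections import defaultdict

N_SLICES = 4

def parse(line: str) -> tuple:
    # Step 2: parse a delimited text line into a record and
    # validate the field types (here, the second column must be an int).
    fields = line.rstrip("\n").split("|")
    return (fields[0], int(fields[1]))

def initiator_copy(lines):
    buffers = defaultdict(list)              # step 5: per-target-slice buffers
    for line in lines:                       # step 1: read all the input
        record = parse(line)                 # step 2: parse and validate
        h = zlib.crc32(record[0].encode())   # step 3: hash the DISTKEY
        target = h % N_SLICES                # step 4: map hash onto the ring
        buffers[target].append(record)       # step 5: buffer for that slice
    return buffers                           # step 6: these buffers get sent

bufs = initiator_copy(["a|1\n", "b|2\n", "a|3\n"])
```

Note that every step runs on the one initiator slice; the other slices only see the finished buffers.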

All slices must perform the following steps independently, including the "initiator slice" that does all the work above (hopefully the "initiator slice" takes a memory-mapped network-buffer shortcut to mitigate some of the cost of the asymmetric workload it performs for the COPY command):

  1. Receive buffers of records to store locally.
  2. Convert that data to Redshift columnar blocks.
  3. Write columnar blocks for all records it receives.

The above shows how a COPY command over one file with a large set of DISTKEY values results in an asymmetric workload, where the "initiator slice" does the lion's share of the work compared to the target slices.

Now it should be rather obvious that the work pattern for a COPY command that processes a file containing only one DISTKEY value degenerates into the "initiator slice" sending network buffers to just one target slice. Overall, this reduces the workload on the "initiator slice" but increases the workload on the single target slice. So this will likely reduce throughput with respect to this one file, because only one network path is exercised and only one slice is doing all the block formatting and writing.

However, there are other cluster-wide considerations. The ultimate general case is a many-file COPY command where every slice gets one or more files to process: every slice is processing an input file and sending record buffers to every other slice (the hard work), PLUS every slice is receiving record buffers from every other slice. In effect, this is a full-mesh cluster communication task, which tends to put a high load on, if not saturate, the cluster interconnect network.
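A back-of-envelope count makes the full-mesh pressure concrete. If every slice sends to every other slice, the number of active interconnect paths grows quadratically; if each initiator talks to only one target (the pre-split, one-DISTKEY-per-file case), it grows linearly. This is simple counting, not a Redshift API:

```python
# Count active interconnect paths (back-of-envelope, not a Redshift API).

def mesh_paths(n_slices: int) -> int:
    """Full mesh: every slice sends record buffers to every other slice."""
    return n_slices * (n_slices - 1)

def presplit_paths(n_slices: int) -> int:
    """One DISTKEY value per file: each initiator talks to one target,
    so at most n_slices sender-to-target paths are active."""
    return n_slices

# A modest 16-slice cluster: 240 paths in the full mesh vs 16 pre-split.
print(mesh_paths(16), presplit_paths(16))
```

The quadratic-versus-linear gap is why pre-splitting can relieve the interconnect even while it concentrates work on individual target slices.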

Now, if we pre-split all the input data down to one DISTKEY value per file, it's easy to see that while the load on each "initiator slice" is reduced (it only has to communicate with one target slice) and the load on each target slice is increased, the overall load on the cluster interconnect network has gone down. Maybe this brings the interconnect load below the saturation point; and if it isn't saturated, capacity is freed on the interconnect to handle inter-slice communication from other concurrent statements, even other concurrent COPY commands. The reduced "initiator slice" workload also frees up node-local resources, like CPU, memory, and interconnect bandwidth, for the slice to process work from other concurrent statements. So maybe the overall cluster throughput can increase.

Only a real-world benchmark with real data and a real SQL workload will tell whether pre-splitting input files for COPY command(s) improves or degrades performance, either at the statement level or at the overall cluster-throughput level.

As always "your mileage may vary". ;-)

I hope this helps to break down and explain where the work of the COPY command is distributed across the cluster and how it impacts the cluster and slice level resources.

Regards,
-Kurt

answered 6 years ago

It could matter, but you have no control over it, so it doesn't help you.

So, imagine you have ten slices and your distkey column has 10 values. Let's also imagine one distkey value goes to one slice.

So if we have a single file with all data, that's loaded by a single slice, which then distributes the rows.

Now imagine we have ten files, each with only one distkey value.

This would be great if you could ensure the correct file was loaded by the correct slice.

However, you can't. You don't know which slice will get which distkey value, and you don't know which slice will load which file.

So each slice will load a file, but there's only a 1 in 10 chance that a given slice loads the file containing its own distkey value; on average, 9 of the 10 slices will have to redistribute their rows to other slices.
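The 1-in-10 intuition above can be checked with a quick simulation, assuming (as in the scenario) one distkey value per slice and a random pairing of files to slices:

```python
# Rough simulation of the 1-in-10 intuition: ten files, ten slices,
# paired at random. A "local hit" means a slice happened to load the
# file holding its own distkey value, so no redistribution is needed.
import random

def local_hits(n_slices: int, trials: int, seed: int = 0) -> float:
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        assignment = list(range(n_slices))
        rng.shuffle(assignment)  # assignment[s] = file loaded by slice s
        hits += sum(1 for s, f in enumerate(assignment) if s == f)
    return hits / trials

# The expected number of local hits per COPY is 1, regardless of n_slices
# (the expected number of fixed points of a random permutation is 1).
print(local_hits(10, 100_000))  # ~1.0 on average
```

So with 10 slices, about one slice per load gets lucky and the other nine redistribute, matching the argument above.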

answered 6 years ago
