Redshift temp INSERTs are slow and expensive


Signed up for a Redshift Serverless trial.

I run a proc that selects 4 columns from 110 records into a temp table, then stages replacement rows in a second temp table as it walks through the first one. When it's done, it does a bulk UPDATE from the staging table back to the real table. Previously, the UPDATEs were applied directly to the temp records rather than going through the insert stage, but that was slower.

The whole process takes ~30 seconds at best, with the vast majority of the time spent on the temp-table inserts. If I run all the logic except the inserts to the second temp table, it takes ~3-4 seconds.

So far, I've burned through $400 worth of Redshift in 2 days, doing basically nothing. How is such a small number of small inserts to small temp tables so expensive? Where are the cycles going?

Edit: By building an insert string and periodically inserting it in blocks, the runtime is cut down to <5 seconds, but that still doesn't explain much.

  • Can you provide some code? The "walks through" statement concerns me, as I believe you are running a loop and issuing 110 individual UPDATE statements.

  • No, I explained that. In plpgsql:

    1. bulk SELECT of 4 short columns into a temp table
    2. a loop selects from the temp table, modifies one value in logic, then inserts the modified row into a second temp table
    3. at the very end, a single very fast update-everything-where-the-key-matches back to the real table

    Literally only 110 rows, even in the real table. The slowness is definitely, positively, 100% in the inserts to the temp table, which aren't even "committed" until all 110 are done. Even if I just do them 10 at a time, it takes a fraction of the time. A rough sketch of the shape follows.
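
    (Placeholder names below, with '-x' standing in for the actual per-row logic; simplified from the real proc.)

        CREATE OR REPLACE PROCEDURE restage_rows()
        AS $$
        DECLARE
            rec RECORD;
        BEGIN
            -- 1. bulk select of 4 short columns into a temp table
            CREATE TEMP TABLE temp1 AS
                SELECT key_col, col_a, col_b, col_c FROM real_table;

            CREATE TEMP TABLE temp2 (LIKE temp1);

            -- 2. walk temp1, modify one value, stage the replacement row in temp2
            FOR rec IN SELECT key_col, col_a, col_b, col_c FROM temp1 LOOP
                INSERT INTO temp2
                VALUES (rec.key_col, rec.col_a, rec.col_b, rec.col_c || '-x');
            END LOOP;

            -- 3. single bulk update-where-key-matches back to the real table
            UPDATE real_table
               SET col_c = t.col_c
              FROM temp2 t
             WHERE real_table.key_col = t.key_col;
        END;
        $$ LANGUAGE plpgsql;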

  • So in the loop, you are inserting one row at a time. That is an anti-pattern for Redshift. If you share some example code, I can help you refactor it to a more appropriate pattern for a data warehouse solution like Redshift.

  • No, I already stated that I changed the code to do blocks of inserts, which sped up the second temp-table load 10x (though it didn't actually reduce the cost of the whole task).

  • This isn't the best way to debug this. If you could provide an example piece of code, it would be a lot easier. But a "block of inserts" is still concerning. One INSERT statement should insert all of the rows by selecting from another table. If you are doing change data capture, the process should be a single delete followed by a single insert. You can also use the MERGE command to apply new records.
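
    For example, something along these lines (illustrative table and column names, not taken from your code):

        -- stage every replacement row with one statement instead of a loop
        INSERT INTO temp2
        SELECT key_col, col_a, col_b, col_c || '-x'   -- expression stands in for the per-row logic
          FROM temp1;

        -- then one set-based UPDATE back to the real table
        UPDATE real_table
           SET col_c = s.col_c
          FROM temp2 s
         WHERE real_table.key_col = s.key_col;

        -- or a single MERGE that updates matches and inserts new keys
        MERGE INTO real_table
        USING temp2 s
           ON real_table.key_col = s.key_col
        WHEN MATCHED THEN UPDATE SET col_c = s.col_c
        WHEN NOT MATCHED THEN INSERT VALUES (s.key_col, s.col_a, s.col_b, s.col_c);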

asked 7 months ago · 642 views
1 Answer

Since I can't fit this in a comment:

Again, I'm asking for some explanation/justification as to why it costs so much money to do practically nothing. Everywhere I look, when people have asked adjacent questions about the performance viability of updating existing data, the response is always "you're doing Redshift wrong, give us your code." That is inherently inappropriate, particularly when a simple explanation of a simple process has already been provided.

As it stands now, the whole task is:

  • 1 INSERT INTO ... SELECT of 110 rows (all records from the real table into a temp table),
  • 110 SELECTs into plpgsql vars (with 1 WHERE condition, from a 110-row temp table),
  • 110 simple value changes in logic, each appended to a VARCHAR string of the form "INSERT INTO temp2 VALUES (1,2,3,'test'),(2,3,4,'test'),...",
  • 1 single-statement, single-execution bulk INSERT of 110 rows (execute the VARCHAR to load temp2), and
  • 1 single-statement, single-execution bulk UPDATE of 110 rows (with 2 WHERE conditions, temp2-to-real),

and that grand total of <5 seconds somehow costs 3-4 dollars.
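
In sketch form (placeholder names, types, and value logic; simplified from the actual proc, with '-x' standing in for the real per-row change):

    CREATE OR REPLACE PROCEDURE restage_rows_batched()
    AS $$
    DECLARE
        rec RECORD;
        v_a VARCHAR(256); v_b VARCHAR(256); v_c VARCHAR(256);
        ins VARCHAR(MAX) := '';
    BEGIN
        -- 1 INSERT INTO ... SELECT: all 110 real rows into temp1
        CREATE TEMP TABLE temp1 AS
            SELECT key_col, col_a, col_b, col_c FROM real_table;

        CREATE TEMP TABLE temp2 (LIKE temp1);

        -- 110 SELECTs into plpgsql vars, each modified row appended to one VALUES string
        FOR rec IN SELECT key_col FROM temp1 LOOP
            SELECT col_a, col_b, col_c INTO v_a, v_b, v_c
              FROM temp1 WHERE key_col = rec.key_col;
            IF ins <> '' THEN ins := ins || ','; END IF;
            ins := ins || '(' || rec.key_col::varchar || ','
                       || QUOTE_LITERAL(v_a) || ',' || QUOTE_LITERAL(v_b) || ','
                       || QUOTE_LITERAL(v_c || '-x') || ')';
        END LOOP;

        -- 1 bulk INSERT: execute the built string to load temp2
        EXECUTE 'INSERT INTO temp2 VALUES ' || ins;

        -- 1 bulk UPDATE back to the real table (second condition illustrative)
        UPDATE real_table
           SET col_c = t.col_c
          FROM temp2 t
         WHERE real_table.key_col = t.key_col
           AND real_table.col_c <> t.col_c;
    END;
    $$ LANGUAGE plpgsql;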

  • Redshift markets the ability to update existing data with SQL; this would be cost-prohibitive at any actual volume or regularity. "As little as $3 a day" is the Serverless pitch. It seems that if you use AWS's own example methods on actual data, you'll have no money left in a week.
  • Manipulating 110 records, using a stored proc and temp tables, maybe 150 times total over 3 days, not even doing "analytics," shouldn't have ended up costing $600. For a system that is supposed to work on petabytes of data, this doesn't make any sense even if my proc is "wrong."
  • The amount of provisioned horsepower (minimal or maximal) makes no difference to speed, presumably because there is so little actual data and work, and it seemingly makes no difference to cost either, so the system isn't being crushed.
  • Everything aside, INSERTs should never be this intensive (even to temp tables) at such a tiny volume, even if they are issued one at a time (and to that point, they were still committed as a single block).

So the only answer seems to be: don't bother updating your data.

answered 7 months ago
