Spurious row counts in query plan?


I'm investigating query plans.

I have the following tables, query, and query plan:

create table customer
(
  age          smallint      encode zstd,
  gender       char(1)       encode zstd,
  customer_id  bigint        encode zstd primary key distkey,
  name         varchar(512)  encode zstd
)
diststyle key
compound sortkey( customer_id );

create table sale
(
  customer_id  bigint  encode zstd references customer( customer_id ) distkey,
  sale_id      bigint  encode zstd
)
diststyle key
compound sortkey( customer_id );

explain select
  age,
  gender,
  name,
  sale_id
from
  customer,
  sale
where
  customer.customer_id = sale.customer_id;

 XN Merge Join DS_DIST_NONE  (cost=0.00..23.66 rows=754 width=533)
   Merge Cond: ("outer".customer_id = "inner".customer_id)
   ->  XN Seq Scan on sale  (cost=0.00..11.60 rows=1160 width=16)
   ->  XN Seq Scan on customer  (cost=0.00..1.30 rows=130 width=533)

Note also that I've issued:

set analyze_threshold_percent to 0;
VACUUM FULL customer TO 100 PERCENT;
VACUUM FULL sale TO 100 PERCENT;
ANALYZE customer;
ANALYZE sale;

ANALYZE does not SKIP.

Now, both these tables are new and empty. Never seen a record.

So - now - examine the query plan.

See the row counts?

Why are empty tables producing non-zero record counts?
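
For reference, the stored statistics don't appear to explain the numbers either. Assuming Redshift exposes the Postgres-style pg_class catalog (an assumption on my part), the stored row and page counts can be inspected directly:

-- Assumption: Redshift exposes Postgres-style pg_class statistics columns.
-- For freshly created, empty tables I would expect these to be zero.
select relname, reltuples, relpages
from   pg_class
where  relname in ( 'customer', 'sale' );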

Toebs2
asked 5 years ago · 208 views
4 Answers
Accepted Answer

Hi Toebs2,

Those aren't actual row counts. They are plan operator output estimates, based partly on statistics gathered by the ANALYZE command, partly on cardinality estimates derived from prior filters and joins, and partly on rules and overriding constant values built into the query planner/optimizer itself.

I don't know the Redshift code, nor, as a customer, do I have access to it. However, from having worked on and with other database engines for a long time, I can tell you that SQL planner/optimizer code quite often contains overriding constants that guard against absolute edge values, such as zeros in the statistics, which don't play nicely with the other operators when computing plan costs. Small overrides at the access-path layer of a complex query plan can therefore surface as unexpected numbers as they propagate up the estimation formulae of the plan tree.

In your case the plan still doesn't have much work to do, so I wouldn't focus on it too much. On the other hand, I would be thankful that there probably are overriding constants in the query planner/optimizer, put there to avoid edge conditions where misleading estimates might lead to the choice of a query plan that performs horribly.

I've seen this happen before in databases I won't name. Most cases were fixed with, you guessed it, an overriding constant for an edge-condition statistic in the query planner/optimizer code. If you ever trip over one, you'll know it, because you'll be looking at a query plan with something like a Cartesian product or a nested loop join over tables with some ungodly number of rows.
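
As a rough illustration only, and in plain community PostgreSQL rather than Redshift (the Redshift planner descends from it, but its internals may well differ): an empty table that has never been analyzed still gets a non-zero row estimate, because the planner substitutes a small default relation size rather than trusting a zero.

-- Plain PostgreSQL, not Redshift; treat the lineage as an assumption.
-- The table below is empty and has never been vacuumed or analyzed.
create table t
(
  id    bigint,
  name  varchar(512)
);

explain select * from t;

-- The Seq Scan line still reports a non-zero rows estimate: with no
-- statistics available, the planner falls back to a default relation
-- size rather than assuming the table holds zero rows.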

I'll leave it to the Redshift team to comment on the Redshift specifics and details of the question.

Regards,
-Kurt

klarson
answered 5 years ago

Hi Toebs2,

Those aren't actual row counts. They are plan operator output estimates, based partly on statistics gathered by the ANALYZE command, partly on cardinality estimates derived from prior filters and joins, and partly on rules and overriding constant values built into the query planner/optimizer itself.

I don't know the Redshift code, nor, as a customer, do I have access to it. However, from having worked on and with other database engines for a long time, I can tell you that SQL planner/optimizer code quite often contains overriding constants that guard against absolute edge values, such as zeros in the statistics, which don't play nicely with the other operators when computing plan costs. Small overrides at the access-path layer of a complex query plan can therefore surface as unexpected numbers as they propagate up the estimation formulae of the plan tree.

In your case the plan still doesn't have much work to do, so I wouldn't focus on it too much. On the other hand, I would be thankful that there probably are overriding constants in the query planner/optimizer, put there to avoid edge conditions where misleading estimates might lead to the choice of a query plan that performs horribly.

I've seen this happen before in databases I won't name. Most cases were fixed with, you guessed it, an overriding constant for an edge-condition statistic in the query planner/optimizer code. If you ever trip over one, you'll know it, because you'll be looking at a query plan with something like a Cartesian product or a nested loop join over tables with some ungodly number of rows.

I'll leave it to the Redshift team to comment on the Redshift specifics and details of the question.

Regards,
-Kurt

klarson
answered 5 years ago

Thank you, klarson.

I will look at the Postgres source code and see what it does.

+5 points for you! :-)

Toebs2
answered 5 years ago

Hmm, interesting.

I think I may have stumbled across a forum bug.

You posted the same reply twice, somehow - and marking one of them as correct has given you ten points, not five.

Reminds me to issue multiple replies in the future :-)

Toebs2
answered 5 years ago
