How are NULLs stored in RedShift and why are NOT NULLable columns with DEFAULT 0 larger than columns with NULL?

0

I have for example two tables with the same data.

CREATE TABLE version1 (
  col1 INTEGER ENCODE AZ64,
  col2 INTEGER ENCODE ZSTD,
);

CREATE TABLE version2 (
  col1 INTEGER NOT NULL DEFAULT 0 ENCODE AZ64,
  col2 INTEGER NOT NULL DEFAULT 0 ENCODE ZSTD,
);

In every case where I apply NOT NULL and use a default, the columns are 2x larger. How does RedShift store NULLs? It doesn't seem to matter what the TYPE or the ENCODING is in any situation, NULL columns appear cheaper.

For example with 500m rows, version1.col1 is 104mb and version2.col1 is 208mb on the same data.

gefragt vor 9 Monaten680 Aufrufe
2 Antworten
1
Akzeptierte Antwort

So the answer is that each column value that's NULL takes up 1 bit independently of the data type, and doesn't take up the size of the data type.

Additionally doing something like NOT NULL DEFAULT 0 to completely remove NULLs will be significantly worse on performance because it causes 0 to both take the full size of the type, and will cause the block min value to be 0 which causes RedShift to have to read the entire block when something something like SELECT COUNT(*) WHERE lots_of_zero > 0 instead of simply taking the block count.

The size and query difference can be significant under every type of ENCODE with a NULL heavy column being > 2x smaller over DEFAULT 0 and scan 2-10x less rows depending on the query.

beantwortet vor 8 Monaten
1

As there is nothing about it in Redshift's documentation, I would say that it behaves just like the database it was built over: PostgreSQL.

Check this SO question about how null is stored in PostgreSQL.

You could check if creating 9 columns will indeed take more space, to confirm that Redshift indeed works like Postgre in this case.

Hope this clarify things a bit. If you cross anything different please share here, I would love to know!

beantwortet vor 9 Monaten
  • Looked into it, they take up 1 bit per value and interestingly swapping 0 for NULL will let you set your block ranges to min 1 which can considerably impact query perf since column level blocks can be skipped and the block metadata can be directly used.

Du bist nicht angemeldet. Anmelden um eine Antwort zu veröffentlichen.

Eine gute Antwort beantwortet die Frage klar, gibt konstruktives Feedback und fördert die berufliche Weiterentwicklung des Fragenstellers.

Richtlinien für die Beantwortung von Fragen