How are NULLs stored in RedShift and why are NOT NULLable columns with DEFAULT 0 larger than columns with NULL?

0

I have for example two tables with the same data.

CREATE TABLE version1 (
  col1 INTEGER ENCODE AZ64,
  col2 INTEGER ENCODE ZSTD,
);

CREATE TABLE version2 (
  col1 INTEGER NOT NULL DEFAULT 0 ENCODE AZ64,
  col2 INTEGER NOT NULL DEFAULT 0 ENCODE ZSTD,
);

In every case where I apply NOT NULL and use a default, the columns are 2x larger. How does RedShift store NULLs? It doesn't seem to matter what the TYPE or the ENCODING is in any situation, NULL columns appear cheaper.

For example with 500m rows, version1.col1 is 104mb and version2.col1 is 208mb on the same data.

已提問 9 個月前檢視次數 678 次
2 個答案
1
已接受的答案

So the answer is that each column value that's NULL takes up 1 bit independently of the data type, and doesn't take up the size of the data type.

Additionally doing something like NOT NULL DEFAULT 0 to completely remove NULLs will be significantly worse on performance because it causes 0 to both take the full size of the type, and will cause the block min value to be 0 which causes RedShift to have to read the entire block when something something like SELECT COUNT(*) WHERE lots_of_zero > 0 instead of simply taking the block count.

The size and query difference can be significant under every type of ENCODE with a NULL heavy column being > 2x smaller over DEFAULT 0 and scan 2-10x less rows depending on the query.

已回答 8 個月前
1

As there is nothing about it in Redshift's documentation, I would say that it behaves just like the database it was built over: PostgreSQL.

Check this SO question about how null is stored in PostgreSQL.

You could check if creating 9 columns will indeed take more space, to confirm that Redshift indeed works like Postgre in this case.

Hope this clarify things a bit. If you cross anything different please share here, I would love to know!

已回答 9 個月前
  • Looked into it, they take up 1 bit per value and interestingly swapping 0 for NULL will let you set your block ranges to min 1 which can considerably impact query perf since column level blocks can be skipped and the block metadata can be directly used.

您尚未登入。 登入 去張貼答案。

一個好的回答可以清楚地回答問題並提供建設性的意見回饋,同時有助於提問者的專業成長。

回答問題指南