How are NULLs stored in Redshift, and why are NOT NULL columns with DEFAULT 0 larger than nullable columns?


I have for example two tables with the same data.

CREATE TABLE version1 (
  col1 INTEGER ENCODE AZ64,
  col2 INTEGER ENCODE ZSTD
);

CREATE TABLE version2 (
  col1 INTEGER NOT NULL DEFAULT 0 ENCODE AZ64,
  col2 INTEGER NOT NULL DEFAULT 0 ENCODE ZSTD
);

In every case where I apply NOT NULL with a default, the columns are about 2x larger. How does Redshift store NULLs? It doesn't seem to matter what the type or the encoding is: nullable columns always appear cheaper.

For example, with 500M rows of the same data, version1.col1 is 104 MB and version2.col1 is 208 MB.
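
One way to measure this is to count the 1 MB blocks per column in the system tables; a minimal sketch, assuming access to STV_BLOCKLIST and STV_TBL_PERM (the highest col numbers per table are Redshift's hidden metadata columns):

SELECT p.name AS table_name,
       b.col,
       COUNT(*) AS mb            -- each STV_BLOCKLIST row is a 1 MB block
FROM stv_blocklist b
JOIN stv_tbl_perm p ON b.tbl = p.id AND b.slice = p.slice
WHERE p.name IN ('version1', 'version2')
GROUP BY p.name, b.col
ORDER BY p.name, b.col;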

asked 8 months ago · 609 views
2 Answers
Accepted Answer

So the answer is that each NULL column value takes up 1 bit, regardless of the data type, rather than the full size of the data type.

Additionally, doing something like NOT NULL DEFAULT 0 to completely remove NULLs is significantly worse for performance, because each 0 takes the full size of the type and forces the block's minimum value to 0. That means Redshift has to read the entire block for a query like SELECT COUNT(*) WHERE lots_of_zero > 0, instead of simply taking the value count from the block metadata.

The size and query differences are significant with every ENCODE type: a NULL-heavy column is more than 2x smaller than its DEFAULT 0 counterpart and scans 2-10x fewer rows, depending on the query.
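
You can see this directly in the per-block min/max metadata (the zone maps); a sketch, assuming access to STV_BLOCKLIST and STV_TBL_PERM and that col 0 is the first user column. The DEFAULT 0 column's blocks should all show minvalue = 0, while the nullable column's blocks only track the non-NULL range and can be skipped for a predicate like > 0:

SELECT p.name, b.col, b.blocknum, b.minvalue, b.maxvalue, b.num_values
FROM stv_blocklist b
JOIN stv_tbl_perm p ON b.tbl = p.id AND b.slice = p.slice
WHERE p.name IN ('version1', 'version2')
  AND b.col = 0                 -- col1 of each table
ORDER BY p.name, b.blocknum;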

answered 8 months ago

As there is nothing about it in Redshift's documentation, I would say that it behaves just like the database it was built on: PostgreSQL.

Check this SO question about how NULL is stored in PostgreSQL.

You could check whether a table with 9 nullable columns indeed takes more space than one with 8 (PostgreSQL's per-row NULL bitmap needs a second byte once you go past 8 columns), to confirm that Redshift indeed works like Postgres in this case.
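
A rough sketch of that check on the PostgreSQL side (plain PostgreSQL syntax, hypothetical table names):

-- 8 nullable columns vs 9: the NULL bitmap grows past 8 columns
CREATE TABLE null_test_8 (c1 int, c2 int, c3 int, c4 int,
                          c5 int, c6 int, c7 int, c8 int);
CREATE TABLE null_test_9 (c1 int, c2 int, c3 int, c4 int,
                          c5 int, c6 int, c7 int, c8 int, c9 int);

-- load the same number of all-NULL rows into both
INSERT INTO null_test_8
SELECT NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL
FROM generate_series(1, 1000000);
INSERT INTO null_test_9
SELECT NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL
FROM generate_series(1, 1000000);

-- compare on-disk size
SELECT relname, pg_size_pretty(pg_relation_size(oid))
FROM pg_class
WHERE relname IN ('null_test_8', 'null_test_9');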

Hope this clarifies things a bit. If you come across anything different, please share it here; I would love to know!

answered 8 months ago
  • Looked into it: NULLs take up 1 bit per value, and interestingly, swapping 0 for NULL lets the block ranges start at a minimum of 1, which can considerably improve query performance, since column-level blocks can be skipped and the block metadata can be used directly.
