How are NULLs stored in RedShift and why are NOT NULLable columns with DEFAULT 0 larger than columns with NULL?

0

I have for example two tables with the same data.

CREATE TABLE version1 (
  col1 INTEGER ENCODE AZ64,
  col2 INTEGER ENCODE ZSTD,
);

CREATE TABLE version2 (
  col1 INTEGER NOT NULL DEFAULT 0 ENCODE AZ64,
  col2 INTEGER NOT NULL DEFAULT 0 ENCODE ZSTD,
);

In every case where I apply NOT NULL and use a default, the columns are 2x larger. How does RedShift store NULLs? It doesn't seem to matter what the TYPE or the ENCODING is in any situation, NULL columns appear cheaper.

For example with 500m rows, version1.col1 is 104mb and version2.col1 is 208mb on the same data.

질문됨 9달 전679회 조회
2개 답변
1
수락된 답변

So the answer is that each column value that's NULL takes up 1 bit independently of the data type, and doesn't take up the size of the data type.

Additionally doing something like NOT NULL DEFAULT 0 to completely remove NULLs will be significantly worse on performance because it causes 0 to both take the full size of the type, and will cause the block min value to be 0 which causes RedShift to have to read the entire block when something something like SELECT COUNT(*) WHERE lots_of_zero > 0 instead of simply taking the block count.

The size and query difference can be significant under every type of ENCODE with a NULL heavy column being > 2x smaller over DEFAULT 0 and scan 2-10x less rows depending on the query.

답변함 8달 전
1

As there is nothing about it in Redshift's documentation, I would say that it behaves just like the database it was built over: PostgreSQL.

Check this SO question about how null is stored in PostgreSQL.

You could check if creating 9 columns will indeed take more space, to confirm that Redshift indeed works like Postgre in this case.

Hope this clarify things a bit. If you cross anything different please share here, I would love to know!

답변함 9달 전
  • Looked into it, they take up 1 bit per value and interestingly swapping 0 for NULL will let you set your block ranges to min 1 which can considerably impact query perf since column level blocks can be skipped and the block metadata can be directly used.

로그인하지 않았습니다. 로그인해야 답변을 게시할 수 있습니다.

좋은 답변은 질문에 명확하게 답하고 건설적인 피드백을 제공하며 질문자의 전문적인 성장을 장려합니다.

질문 답변하기에 대한 가이드라인

관련 콘텐츠