https://arxiv.org/pdf/2304.05028.pdf

Modern read-heavy data analytics rely on columnar data storage. Parquet and ORC, which grew out of the Hadoop ecosystem, are dominant, but hardware and workloads have changed a lot in the last decade.

Intro

Why is columnar storage used?

  1. You can skip irrelevant attributes
  2. You can compress data better
  3. You can do parallel query processing
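
A minimal sketch of point 1, assuming Python with pyarrow and a hypothetical events.parquet file (names are illustrative, not from the paper): a columnar reader only fetches the column chunks for the attributes the query names.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Write a small table with three columns.
table = pa.table({
    "user_id": [1, 2, 3, 4],
    "country": ["US", "DE", "US", "JP"],
    "clicks": [10, 3, 7, 1],
})
pq.write_table(table, "events.parquet")

# Only the 'clicks' column chunks are read; 'user_id' and 'country'
# are skipped entirely because each column is stored separately.
clicks_only = pq.read_table("events.parquet", columns=["clicks"])
print(clicks_only.schema)
```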

What has changed in the last decade?

  1. SSD performance
  2. Cloud storage usage: high bandwidth, but also high latency
  3. New and better compression, indexing, and filtering techniques

What does the paper do?

  • Creates a realistic benchmark
  • Runs experiments varying: encoding algorithms, block compression, metadata organization, indexes and filters, and nested data modelling.

What were the general learnings?

  • Parquet has faster column decoding, due to its choice of encoding algorithms
  • ORC does better selection pruning due to its zone maps
  • Cheaper storage means workloads are becoming more compute bound, so we should use less aggressive compression
  • We should avoid enabling block compression by default, as the bandwidth savings don’t outweigh the compute overhead
  • Parquet/ORC only offer basic auxiliary data structures (zone maps and Bloom filters). We should do more here to precompute results.
  • Parquet/ORC are not sufficient for ML workloads, which project large numbers of features at once.
  • Parquet/ORC don’t utilise GPUs effectively. If we are using GPUs we want aggressive compression, as I/O overhead dominates file scan time.

What is selection pruning?

  • Selection pruning means skipping chunks of data (row groups, pages) that cannot satisfy a query’s predicate, using per-chunk metadata such as min/max zone maps. It is distinct from projection, where the reader fetches only the selected columns; columnar storage makes that cheap too, since each column lives in its own chunks.
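
A minimal sketch of predicate pushdown, assuming Python with pyarrow and the hypothetical events.parquet file from above: the filters argument lets the reader consult per-row-group min/max statistics and skip row groups that cannot match (this tiny file has a single row group, so nothing is actually pruned, but the mechanism is the same).

```python
import pyarrow.parquet as pq

matching = pq.read_table(
    "events.parquet",
    columns=["user_id", "clicks"],
    filters=[("clicks", ">", 5)],  # row groups with max(clicks) <= 5 would be skipped
)
print(matching.num_rows)
```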

What are dictionary codes? What are zone maps? What are bloom filters?

Background

Hadoop introduced two row-oriented formats

  • SequenceFile: key-value pairs
  • Avro: row-oriented binary records with a JSON-defined schema

What is vectorised processing? What is late materialisation?

The history is that

  1. Meta (then Facebook) released RCFile
  2. Hortonworks and Meta released ORC as an evolution of RCFile
  3. Twitter and Cloudera released Parquet

What is record shredding? What is the assembly algorithm from Dremel? What is Google’s Capacitor format? What is YouTube’s Artus format? What makes it have constant seek time for nested schemas? What is Meta’s DWRF format? What is Meta’s Alpha format? What is Arrow?

The lakehouse trend is to support better metadata management, e.g. ACID transactions. These systems add a metadata layer and do not directly modify the underlying columnar file formats.

  • Delta Lake, Iceberg, Hudi

Feature Taxonomy

Format Layout

Both Parquet and ORC use PAX format.

What is PAX format? What is vectorised query processing? What is the tuple reconstruction overhead in a row group? What allows DuckDB and Arrow to perform parallel reads?

Both formats apply lightweight encoding schemes to each column chunk, then optionally a general-purpose block compression algorithm on top.

The entry point is the footer, which contains the table schema and the tuple count.
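
A minimal sketch, assuming Python with pyarrow and the hypothetical events.parquet file from above: the footer is parsed first and already answers schema and tuple-count questions without touching any data pages.

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("events.parquet")
print(pf.schema_arrow)             # table schema, read from the footer
print(pf.metadata.num_rows)        # tuple count
print(pf.metadata.num_row_groups)  # row group layout
```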

Encoding

Parquet and ORC support standard OLAP compression techniques

  1. Dictionary encoding
  2. Run-length encoding
  3. Bit-packing

What is dictionary encoding? What is run-length encoding? What is bit-packing?
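
A toy illustration of the three schemes in plain Python (not the actual Parquet/ORC encoders), applied to one column of values:

```python
values = ["US", "US", "US", "DE", "DE", "JP", "JP", "JP", "JP"]

# 1. Dictionary encoding: replace each value with a small integer code.
dictionary = {"US": 0, "DE": 1, "JP": 2}
codes = [dictionary[v] for v in values]      # [0, 0, 0, 1, 1, 2, 2, 2, 2]

# 2. Run-length encoding: store (code, run length) pairs for repeated codes.
runs = []
for c in codes:
    if runs and runs[-1][0] == c:
        runs[-1][1] += 1
    else:
        runs.append([c, 1])                  # [[0, 3], [1, 2], [2, 4]]

# 3. Bit-packing: 3 distinct codes need only 2 bits each instead of 32/64.
bits_per_code = max(1, max(codes).bit_length())
print(codes, runs, bits_per_code)
```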

Compression

Both Parquet and ORC enable block compression by default, yet block compression often does not help query speed: the decompression cost tends to outweigh the I/O savings.
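
A minimal sketch, assuming Python with pyarrow and hypothetical file names: write the same table with and without block compression and compare file sizes; whether the smaller file is worth the extra decompression work at scan time depends on how slow the storage is.

```python
import os
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"x": list(range(1_000_000))})

pq.write_table(table, "plain.parquet", compression="NONE")  # encoding only
pq.write_table(table, "zstd.parquet", compression="ZSTD")   # encoding + block compression

print(os.path.getsize("plain.parquet"), os.path.getsize("zstd.parquet"))
```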

Index and Filter

Parquet and ORC include zone maps and optional Bloom Filters to enable selection pruning.

What is a zone map? What are bloom filters?

It looks like Spark enables the Parquet page index and Bloom filters.
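
A minimal sketch, assuming Python with pyarrow and the hypothetical events.parquet file from above: the per-row-group, per-column min/max statistics stored in the footer are the zone maps a reader checks before fetching a row group.

```python
import pyarrow.parquet as pq

md = pq.ParquetFile("events.parquet").metadata
for rg in range(md.num_row_groups):
    col = md.row_group(rg).column(2)  # the 'clicks' column chunk
    stats = col.statistics            # written by default
    print(rg, col.path_in_schema, stats.min, stats.max)
```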

Nested Data Model

Open formats need to support nested data to support JSON and Protocol Buffers.

Parquet takes its nested data model directly from Dremel.

How does Dremel handle nested data?
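
A minimal sketch, assuming Python with pyarrow: a nested struct + list column round-trips through Parquet. The writer shreds the nested values into flat column chunks annotated with Dremel repetition/definition levels, and the reader reassembles the records; the levels stay internal to the format.

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "doc_id": [1, 2],
    "name": [
        {"first": "Ada", "langs": ["en", "fr"]},
        {"first": "Bo",  "langs": []},
    ],
})
pq.write_table(table, "nested.parquet")
print(pq.read_table("nested.parquet").to_pylist())
```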

Benchmarking

Standard OLAP benchmarks like SSB, TPC-H and TPC-DS generate datasets with uniform distributions. Some benchmarks (YCSB, DSB, BigDataBench) do allow data skew, but are often not sufficient. Real data is ideal, but no single real dataset is comprehensive enough.

What values does the benchmark care about?

  • Sortedness
  • Skew pattern

Column properties

What things matter about a column?

  1. Number of distinct values: fewer distinct values make the column more compressible (dictionary encoding)
  2. Null ratio: how sparse the data is
  3. Value range: the magnitude and variance of values affect bit-packing
  4. Sortedness: affects encoding efficiency (RLE, delta)
  5. Skew pattern: frequency of values in a column; are there hotspots?

Parameter distribution in real world data

In most real-world columns, the number of distinct values is small, on the order of 1% of the total number of values or fewer.

Columnar formats need to handle both gentle and heavy hotspot skews.

Most integer values are small and therefore benefit from bit-packing.

Experiments

Block compression

Enabling block compression by default does make scans from S3 marginally faster, but not by much; ultimately the (de)compression compute cost dominates over the storage I/O cost.

S3 specific optimisations we could do

  • Store metadata contiguously in the format to avoid multiple round trips
  • Appropriately size the row groups or files to hide the access latency
  • Coalescing small range requests to better utilise the cloud storage bandwidth
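
A minimal sketch of the last two points, assuming Python with a recent pyarrow and hypothetical file names: size row groups explicitly when writing, and let the reader coalesce nearby byte-range requests, which matters most on high-latency object stores.

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"x": list(range(10_000_000))})

# Larger row groups -> fewer, bigger range requests per column chunk.
pq.write_table(table, "big_row_groups.parquet", row_group_size=1_000_000)

# pre_buffer merges nearby column-chunk reads into coalesced range requests
# (on by default in recent pyarrow versions).
t = pq.read_table("big_row_groups.parquet", columns=["x"], pre_buffer=True)
print(t.num_rows)
```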

Scan with predicates

Zone maps are only useful on columns where values are well clustered.

Bloom filters are only useful for point queries.

  • Bloom filters tell you if something does not exist in the data.
  • Great for highly selective queries (e.g. point lookups) that match few or no rows.
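
A toy Bloom filter in plain Python (not Parquet/ORC’s actual implementation; the class and parameters are made up for illustration), showing the key property: a “no” answer is definite, a “yes” answer may be a false positive, so it can only rule values out.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, value):
        # Derive several bit positions from independent-ish hashes.
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(h[:8], "little") % self.num_bits

    def add(self, value):
        for p in self._positions(value):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, value):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(value))

bf = BloomFilter()
for v in ["US", "DE", "JP"]:
    bf.add(v)
print(bf.might_contain("US"))  # True
print(bf.might_contain("BR"))  # almost certainly False -> safe to skip that row group
```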

We should consider more advanced techniques like

  • Column indexes
  • Range filters