OLAP and Columnar File Formats

In earlier posts, we explored the Lakehouse architecture, which decouples storage from compute. By leveraging S3 object storage and open file formats, we can store an effectively unlimited volume of data.

But which file format is most efficient for this architecture? 🤔

This paper, “A Deep Dive into Common Open Formats for Analytical DBMSs,” written by researchers at Microsoft, compares Parquet, Arrow, and ORC to determine which is most suitable for OLAP workloads. The verdict? There is no single winner. The “best” format depends entirely on the specific use case.

However, the real value of this paper is in its rigorous comparison methodology. It evaluates the formats across these critical dimensions:

  • Encoding Efficiency: How different data types (integers vs. strings) impact file size, and the effectiveness of encodings like dictionary or run-length encoding.

  • Decoding & Transcoding: The often-overlooked CPU cost of converting data from a disk-optimized format (like Parquet) to an in-memory format (like Arrow).

  • I/O Optimization: The effectiveness of Projection Pushdown (reading only necessary columns) and Predicate Pushdown (filtering rows before reading them).

  • Compression Trade-offs: The balance between storage savings and decompression speed (comparing Zlib, Gzip, Snappy, etc.).

  • Complex Data & Vectorization: How well the formats handle nested structures (lists/maps) and whether they support SIMD instructions for faster processing.
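To make the first dimension concrete, here is a toy pure-Python sketch of the two encodings mentioned above (dictionary and run-length). This is an illustration of the general idea, not the actual implementation in Parquet, Arrow, or ORC:

```python
def dictionary_encode(values):
    """Map each distinct value to a small integer code."""
    dictionary = {}
    codes = []
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary)
        codes.append(dictionary[v])
    return dictionary, codes

def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [(v, n) for v, n in runs]

column = ["DE", "DE", "DE", "US", "US", "DE"]
dictionary, codes = dictionary_encode(column)
# codes: [0, 0, 0, 1, 1, 0] -- six strings become six small integers
runs = run_length_encode(codes)
# runs: [(0, 3), (1, 2), (0, 1)] -- six values stored as three pairs
```

Notice how the two encodings compose: dictionary encoding turns low-cardinality strings into small integers, and run-length encoding then collapses the repeats. This is also why sorted or low-cardinality columns compress so much better, a point the paper's encoding-efficiency experiments quantify.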

Recommendation: I suggest reading the paper up to the fourth section (Methodology) to get the “low-hanging fruit.” This section covers the essential vocabulary and keywords you’ll need for further research. Even if you don’t read the whole thing, it’s worth keeping this paper in mind as a reference for future decision-making.
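As a concrete illustration of the predicate-pushdown idea from the list above, here is a minimal pure-Python sketch. The structures are hypothetical, but the mechanism is the general one used by Parquet and ORC readers: min/max statistics are kept per row group, so a reader can skip entire row groups whose value range cannot satisfy the filter:

```python
def build_row_groups(values, group_size):
    """Split a column into row groups and record min/max statistics."""
    groups = []
    for i in range(0, len(values), group_size):
        chunk = values[i:i + group_size]
        groups.append({"min": min(chunk), "max": max(chunk), "rows": chunk})
    return groups

def scan_greater_than(groups, threshold):
    """Return rows > threshold, skipping groups ruled out by their stats."""
    matches, groups_read = [], 0
    for g in groups:
        if g["max"] <= threshold:
            continue  # the whole group is eliminated without decoding it
        groups_read += 1
        matches.extend(v for v in g["rows"] if v > threshold)
    return matches, groups_read

groups = build_row_groups([1, 2, 3, 10, 11, 12, 20, 21, 22], 3)
matches, groups_read = scan_greater_than(groups, 15)
# Only the last row group is decoded: matches == [20, 21, 22], groups_read == 1
```

Real readers apply this to encoded, compressed pages rather than Python lists, but the payoff is the same: for selective filters, most of the file is never read or decoded at all.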

