https://blog.twitter.com/2013/dremel-made-simple-with-parquet
https://blog.twitter.com/2013/announcing-parquet-10-columnar-storage-for-hadoop
There are several advantages to columnar formats:
- Organizing data by column allows for better compression, because the values within a column are more homogeneous. The space savings are very noticeable at the scale of a Hadoop cluster.
- I/O will be reduced as we can efficiently scan only a subset of the columns while reading the data. Better compression also reduces the bandwidth required to read the input.
- Because each column stores data of a single type, we can use encodings better suited to modern processor pipelines, since they make instruction branching more predictable.
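
To make these points concrete, here is a minimal Python sketch of the idea, not Parquet's actual implementation. The records, the column names, and the helpers `to_columns` and `run_length_encode` are all hypothetical and exist only for illustration. It pivots a few rows into columns, reads a single column to answer a query, and run-length encodes a column of repeated values.

```python
from itertools import groupby

# Hypothetical records, invented for this illustration.
rows = [
    {"user": "alice", "country": "US", "clicks": 3},
    {"user": "bob", "country": "US", "clicks": 5},
    {"user": "carol", "country": "US", "clicks": 2},
    {"user": "dave", "country": "FR", "clicks": 7},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    return {name: [r[name] for r in records] for name in records[0]}


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


columns = to_columns(rows)

# Reduced I/O: a query over "clicks" touches one column, not whole rows.
total_clicks = sum(columns["clicks"])

# Better compression: a homogeneous column with repeats shrinks to a few pairs.
encoded_country = run_length_encode(columns["country"])
```

Run-length encoding is just one simple encoding that benefits from same-typed, adjacent values; a real columnar format would choose among several encodings per column.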