Long ago data storage was simple — heapfiles and b-trees and that’s it. Today the options are overwhelming — ORC, Parquet, Avro on HDFS or S3 or a RDBMS solution like Postgresql, MariaDB, or commercial ones like Oracle and DB2. Even within RDBMS engines and Cloud services there are many options!

This guide is a “random walk” into the broad realm of storage.


Sample DataSet

I pulled the Chicago Public workers salary info. Did you know public salaries are published w/ the **REAL **names of each person? Look up your favorite police officer or commissioner’s salary for fun (it won’t include overtime, perks nor bribes, unfortunately).

Image for post

I added a few fake people — Foo & Bar families. Otherwise its real data.

Column Oriented vs Row Oriented

First the basic storage mechanics of data. Row and column orientation.

**Row oriented **stores each individual record together, Doug Foo’s full record then Jane Foo’s, and so on in sequence.

[Doug,Foo,Foostack,Programmer,M,120000] [Jane, Foo,Foostack,Programmer,F,110000] [Doug,Bar,…]…

**Column **or **_columnar _**stores all column data together — so all First Names (Doug,Jane, etc), then all the Last Names, Titles, … , finally all Salaries.

[Doug,Jane,Doug,Jane,…][Foo,Foo,Bar,Bar…][Programmer,Programmer…]…[M,F,M,F…][120000,110000,100000,130000…]…

Data is generally read off disk by blocks of KB (or MB/GB), so a single read for 1 record brings in a lot more than what you may want.

Typical Block Sizes[*1]:

  • HDFS: 128MB (Chunks)
  • Google FS: 64MB (Chunks)
  • Amazon AWS S3: 128KB
  • Oracle Data Warehouse: 32KB
  • Postgres DB: 8KB
  • Linux and Windows Filesystems: 4KB

Salary records are pretty small ~ 50chars (<50 bytes). An 8KB block with metadata could easily store 100+ row oriented records [8192 / 50 = 163]. So a single read of the block containing “Doug Foo”’s record includes a lot more.

For **columnar **block reads — the read varies on column size. For example with 8KB blocks:

  • First Name may avg 10 bytes, meaning upwards of ~800 names.
  • Salary could fit in a 32bit Int (4bytes) yielding ~2000 entries!
  • Gender is M/F/? — could be stored as 1 byte or even a half nibble ~ 8000!

This will be important to keep in mind as we do more analysis later.

#big-data-analytics #columnar-databases #storage-containers #big-data #database

Columnar Stores — When/How/Why?
1.25 GEEK