Data Warehousing
Read snowflake docs
Read only data separate from operational data - not updated in place.
Historically longer than operational data - 5 to 10 years or whatever
- OLTP: Online Transactional Processing (transaction oriented tasks for daily CRUD)
- OLAP: Online Analytical Processing (complex queries on historical data)
Fact vs Dimension
Fact is something concrete like sales, dimension is something adjacent like customer or time.
Star schema: one fact table, multiple dimension tables
Snowflake schema: one fact table, multiple levels of dimension tables
Constellation schema: multiple fact tables linked together by multiple levels of dimension tables
Data Cube Operations
- drill up
- roll up
- slice
- dice
- drill down
- pivot
Make n-d lattice of cuboids to analyze data at a certain granularity.
ex: 3d
year, item, color
a = (2010, bikes, red)
b = (210, bikes, *) // one dimension up via aggregation (roll up)
Data cube materialization: pre-computing cuboids is expensive, on-demand is slow when you need data. Partial materialization is ideal but have to chose what cuboids and cells to pre-compute (heuristics for decision).
Iceberg cube: minimum support for frequent queries
- Multi way Array Aggregation (base cuboid, bottom-up, simultaneously aggregate multiple dimensions)
- Bottom Up Computation (start from 1 all and divide dimensions into partitions while > threshold - iceberg pruning)
- Star Cubing (bottom up and top down expansion of shared dimensions)