Monday, 4 June 2012
Delays and Rates
From a recent post on news.ycombinator.com, vertically aligned for ease of comparison, with corresponding rates to better understand the implications:
(Edit: Expanded with numbers from a recent Ars Technica article on SSDs)
Register << 1 ns
L1 cache reference (lower bound) < 1 ns 2,000,000,000 Hz
L1 cache reference (upper bound) 3 ns 333,333,333 Hz
Branch mispredict 5 ns 200,000,000 Hz
L2 cache reference 7 ns 142,857,143 Hz
L3 cache reference 20 ns 50,000,000 Hz
Mutex lock/unlock 25 ns 40,000,000 Hz
Main memory reference 100 ns 10,000,000 Hz
Compress 1K bytes with Zippy 3,000 ns 333,333 Hz
Send 2K bytes over 1 Gbps network 20,000 ns 50,000 Hz
Read 1 MB sequentially from memory 250,000 ns 4,000 Hz
Round trip within same datacenter 500,000 ns 2,000 Hz
Disk seek (lower bound) 3,000,000 ns 333 Hz
Disk seek (upper bound) 10,000,000 ns 100 Hz
Read 1 MB sequentially from disk 20,000,000 ns 50 Hz
Send packet CA->Netherlands->CA 150,000,000 ns < 7 Hz
Surprises and take home lesson(s):
1. Data intensive (I/O bound) systems are REALLY slow compared to the raw CPU grunt that is available.
2. Within-datacenter Network I/O is faster than disk I/O.
3. It makes sense to think about network I/O in the same way as we used to think about the SIMD/AltiVec/CUDA tradeoff. The payoff has to be worth while, because the packaging/transfer operations are expensive.
4. Branch mis-prediction is actually pretty expensive compared to L1 cache. For CPU bound inner-loop code, it makes sense to spend a bit more time trying to avoid branching.
Here is the table from Ars Technica:
Level Access time Typical size
Registers "instantaneous" under 1 KB
Level 1 Cache 1-3 ns 64 KB per core
Level 2 Cache 3-10 ns 256 KB per core
Level 3 Cache 10-20 ns 2-20 MB per chip
Main Memory 30-60 ns 4-32 GB per system
Hard Disk 3,000,000-10,000,000 ns