Razor: Dynamic Voltage Scaling Based on Timing Speculation
Todd Austin, David Blaauw, Trevor Mudge
Advanced Computer Architecture Laboratory, University of Michigan
Lead PhD Students: Dan Ernst, Nam Sung Kim
Prototype Design Team: Shidhartha Das, Sanjay Pant, Toan Pham, Rajeev Rao
Razor Demo Team: Chris Drake, Seokwoo Lee
Krisztián Flautner, ARM Ltd.
Advanced Computer Architecture Lab, University of Michigan
September 2003
Razor DVS Austin / Blaauw / Mudge
Voltage Scaling under Dynamic Workloads
• Adapt frequency/voltage to performance demands of workload
  – Software controlled processor speed
  – Lower processor voltage during periods of low operating frequency
• Quadratic reduction in dynamic power and energy
• Super-quadratic reduction in leakage
[Figure: Vdd and frequency tracking utilization over time]
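The quadratic claim can be illustrated with a minimal sketch. It assumes the textbook CMOS switching-energy model (energy per operation proportional to C·Vdd²) and takes 1.8 V as the nominal supply, the value quoted for the prototype later in the deck. The function name and the sample voltages are hypothetical, and leakage is not modeled.

```python
def relative_dynamic_energy(vdd: float, vdd_nominal: float = 1.8) -> float:
    """Dynamic energy per operation relative to nominal, assuming E ~ C * Vdd^2."""
    if vdd <= 0 or vdd_nominal <= 0:
        raise ValueError("voltages must be positive")
    return (vdd / vdd_nominal) ** 2


if __name__ == "__main__":
    # Illustrative sweep; the voltages are arbitrary sample points.
    for vdd in (1.8, 1.5, 1.2, 0.9):
        saving = 1.0 - relative_dynamic_energy(vdd)
        print(f"Vdd = {vdd:.1f} V -> dynamic energy saving = {saving:.0%}")
```

Under this model, halving the supply voltage cuts the dynamic energy per operation to a quarter.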
Impact of Process Scaling on Power
• Increasing uncertainty with process scaling
  – Inter- and intra-die process variations
  – Temperature variation
  – Power supply drop
  – Capacitive and inductive noise
• Impact on traditional design:
  – Addressing worst-case variation in design requires large safety margins
  – Higher energy / lower performance
  – Reduced yield
  – Difficulty in design closure
• Key observation: worst-case conditions are also highly improbable
  – Significant gain for circuits optimized for the common case
  – Efficient mechanisms needed to tolerate infrequent worst-case scenarios
[Figure: Intra-die variations in ILD thickness]
Shaving Voltage Margins with Razor
• Goal: reduce voltage margins with in-situ error detection and correction for delay failures
• Proposed approach:
  – Tune processor voltage based on error rate
  – Eliminate safety margins, purposely run below critical voltage
    • Data-dependent latency margins
    • Trade-off: voltage power savings vs. overhead of correction
• Analogous to wireless power modulation
[Figure: percentage errors (0 to 60) vs. supply voltage (0.8 to 2.0 V), with the traditional DVS, zero-margin, and sub-critical regions marked]
Razor Flip-Flop Implementation
• Compare latched data with shadow-latch on delayed clock
• Upon failure: place data from shadow-latch in main latch
• Key design issues:
  – Ensure shadow latch always correct using conservative design techniques
  – Maintaining pipeline forward progress
  – Short path impact on shadow-latch
  – Power overhead of error detection and correction
  – Recovering pipeline state after errors
  – Meta-stable results in main flip-flop
[Figure: Razor FF between logic stages L1 and L2: main flip-flop (D, Q) clocked by clk, shadow latch clocked by clk_del, 0/1 input mux controlled by Error_L, comparator producing Error]
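The compare-and-restore behavior described above can be sketched as a small behavioral model. This is an illustration under stated assumptions, not the circuit itself: the class name, the notion of a single data arrival time per cycle, and the time units are all hypothetical, and meta-stability is not modeled.

```python
from dataclasses import dataclass


@dataclass
class RazorFlipFlop:
    """Behavioral model: main flip-flop plus shadow latch on a delayed clock."""
    period: float      # main flip-flop samples at t = period
    clk_delay: float   # shadow latch samples at t = period + clk_delay
    q: int = 0         # value currently held by the main flip-flop

    def clock(self, new_value: int, arrival_time: float) -> bool:
        """Apply one cycle; return True if a timing error was detected."""
        if arrival_time > self.period + self.clk_delay:
            # Outside the model: the shadow latch is assumed to be always correct.
            raise ValueError("shadow latch would also miss the data")
        stale = self.q
        # The main flip-flop captures the new value only if it arrived in time.
        main = new_value if arrival_time <= self.period else stale
        # The shadow latch samples later, so it holds the settled value.
        shadow = new_value
        error = main != shadow
        # On error (and trivially otherwise) the shadow value ends up in the main flip-flop.
        self.q = shadow
        return error
```

A late arrival that happens to equal the previously held value raises no error in this model, since the comparator only sees matching data.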
Razor Flip-Flop Implementation
[Figure: circuit schematic of the Razor flip-flop: main flip-flop (D, Q; clk, clk_b), shadow latch (clk_del, clk_del_b), and meta-stability detector (Inv_n, Inv_p) producing Error_L]
Pipeline Recovery
[Figure: pipeline timing diagram for stages IF, ID, EX, MEM, WB with clk, clk_d, ID.d, EX.d, MEM.d, and error waveforms, shown for the no-error and error cases; on an error the instruction is redone in MEM]
Centralized Pipeline Recovery Control
• One cycle penalty for timing failure
• Global synchronization may be difficult for fast, complex designs
[Figure: pipeline (PC, IF, ID, EX, MEM, WB (reg/mem)) with Razor FFs between stages; per-stage error signals and the recover signal; stage occupancy for inst1 to inst6 over cycles 0 to 6]
Distributed Pipeline Recovery Control
• Builds on existing branch / data speculation recovery framework
• Multiple cycle penalty for timing failure
• Scalable design since all recovery communication is local
[Figure: pipeline (PC, IF, ID, EX, MEM (read-only), WB (reg/mem)) with Razor FFs between stages, a stabilizer FF, and flush control; error, bubble, recover, and flushID signals pass between neighboring stages; stage occupancy for inst1 to inst8 over cycles 0 to 9]
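The throughput cost of the two recovery schemes can be compared with a rough model. The sketch below assumes an otherwise ideal pipeline that retires one instruction per cycle and charges a fixed number of extra cycles per timing error. The function name is hypothetical, and the 5-cycle figure used for the distributed case is only a placeholder, since the slides say "multiple" without giving a number.

```python
def effective_ipc(error_rate: float, penalty_cycles: int) -> float:
    """IPC of an ideal 1-IPC pipeline when each timing error costs extra cycles."""
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be in [0, 1]")
    if penalty_cycles < 0:
        raise ValueError("penalty_cycles must be non-negative")
    # Average cycles per instruction = 1 + (errors per instruction * penalty).
    return 1.0 / (1.0 + error_rate * penalty_cycles)


if __name__ == "__main__":
    for rate in (0.001, 0.01, 0.1):
        centralized = effective_ipc(rate, penalty_cycles=1)
        distributed = effective_ipc(rate, penalty_cycles=5)  # placeholder penalty
        print(f"error rate {rate:.1%}: centralized {centralized:.3f}, "
              f"distributed {distributed:.3f}")
```

At low error rates the difference between the two penalties is small, which is why a longer but local recovery path can still be attractive.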
Short Path Constraints
• Delayed clock imposes a short-path constraint
  Min. path delay > tdelay + thold
  – Razor necessary only for latches on slow paths
  – Pad fast paths for latches with mixed path delays
  – Trade-off between DVS headroom and short path constraints
[Figure: clock and clock_del waveforms showing tdelay, thold, and the min. path delay for the intended path vs. a short path; a Razor FF fed by long and short paths, with the short paths padded with extra delay]
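The short-path constraint, min. path delay > tdelay + thold, lends itself to a direct check. In the sketch below the function names and the numeric values are hypothetical; delays are in arbitrary but consistent time units.

```python
def meets_short_path_constraint(min_path_delay: float,
                                t_delay: float,
                                t_hold: float) -> bool:
    """True if the shortest path into a Razor FF satisfies min delay > tdelay + thold."""
    return min_path_delay > t_delay + t_hold


def padding_shortfall(min_path_delay: float,
                      t_delay: float,
                      t_hold: float) -> float:
    """Delay the shortest path is missing relative to tdelay + thold (0.0 if none).

    A path that fails the check needs buffer delay strictly greater than this amount.
    """
    return max(0.0, (t_delay + t_hold) - min_path_delay)


if __name__ == "__main__":
    # Hypothetical numbers: 0.5 clock delay, 0.25 hold time.
    for min_delay in (0.5, 1.0):
        ok = meets_short_path_constraint(min_delay, 0.5, 0.25)
        gap = padding_shortfall(min_delay, 0.5, 0.25)
        print(f"min path delay {min_delay}: ok={ok}, shortfall={gap}")
```

A larger tdelay gives more room for voltage scaling but raises the bar every short path has to clear, which is the trade-off named in the last bullet.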
Razor I - Prototype Razor Implementation
• 4 stage 64-bit Alpha pipeline
  – 200MHz expected operation in 0.18μm technology, 1.8V, ~500mW
  – Tunable via software from 200-50MHz, 1.8-1.1V
• Razor overhead:
  – Total of 192 Razor flip-flops out of 2408 total (9%)
  – Error-free power overhead:
    • Razor flip-flops: < 1%
    • Short path buffer: 2.1%
  – Recovery power overhead:
    • Razor latch power overhead: 2% at 10% error rate
    • Additional power overhead due to re-execution of instructions
[Figure: die floorplan with dimensions 3 mm and 3.3 mm: I-Cache, D-Cache, register file, and the IF, ID, EX, MEM, WB stages]
Error-Rate Studies – Hardware Measurement
Hardware Measurement Setup
[Figure: measurement setup built from 48-bit LFSRs, 18x18 multipliers (X) in a fast pipeline clocked by clk and in slow pipelines A and B clocked by clk/2, a stabilize stage, a != comparator, and a 40-bit error counter; buses are 18 and 36 bits wide]
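A software analogue of this measurement loop is sketched below, assuming that an LFSR supplies 18-bit operand pairs, a known-good product is compared against the fast result, and mismatches are counted. Several details are assumptions rather than facts from the slides: the tap positions (48, 47, 21, 20), the seed, the use of one LFSR instead of two, and the fault injection (a random single-bit flip) that stands in for real timing failures.

```python
import random
from typing import Tuple

MASK48 = (1 << 48) - 1
MASK18 = (1 << 18) - 1
TAPS = (48, 47, 21, 20)  # assumed tap positions; the slides do not give the polynomial


def lfsr48_step(state: int) -> int:
    """Advance a 48-bit Fibonacci LFSR by one bit (XOR of tapped bits is shifted in)."""
    feedback = 0
    for tap in TAPS:
        feedback ^= (state >> (tap - 1)) & 1
    return ((state << 1) | feedback) & MASK48


def operands(state: int) -> Tuple[int, int]:
    """Split the low 36 bits of the LFSR state into two 18-bit operands."""
    return state & MASK18, (state >> 18) & MASK18


def count_errors(cycles: int, fault_probability: float, seed: int = 1) -> int:
    """Count mismatches between a golden product and a fault-injected fast product."""
    rng = random.Random(seed)
    state = 0xACE1ACE1ACE1 & MASK48  # arbitrary non-zero seed (assumption)
    errors = 0
    for _ in range(cycles):
        a, b = operands(state)
        golden = a * b                 # stands in for the slow, always-correct result
        fast = golden
        if rng.random() < fault_probability:
            fast ^= 1 << rng.randrange(36)  # hypothetical fault: flip one product bit
        if fast != golden:
            errors += 1
        state = lfsr48_step(state)
    return errors
```

Dividing the returned count by `cycles` gives an error rate comparable in spirit to the one plotted against supply voltage, although here the fault probability is an input rather than a measured effect.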
Razor Demo
Error Rate Studies – Empirical Results
18x18-bit Multiplier Block at 90 MHz and 27 C
[Figure: error rate (log scale, 0.0000001% to 100%) vs. supply voltage (1.78 V down to 1.14 V) for random inputs. Marked operating points: environmental-margin @ 1.69 V, safety-margin @ 1.63 V, zero-margin @ 1.54 V. Annotations: "22% saving", "30% energy saving", "35% energy savings with 1.3% error", "once every 20 seconds!"]
Error Rate Studies – SPICE-Level Simulations
• Based on SPICE-level simulations of a Kogge-Stone adder
[Figure: Kogge-Stone adder at 870 MHz and 27 C: error rate (log scale, up to 100%) vs. supply voltage (2 V down to 0.6 V) for random, bzip, and ammp inputs]
Effects of Razor DVS
[Figure: energy and pipeline throughput (IPC) vs. decreasing supply voltage. Curves: energy of processor operations (Eproc), energy of pipeline recovery (Erecovery), total energy Etotal = Eproc + Erecovery with its optimal point marked, and energy of processor w/o Razor support]
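The relation Etotal = Eproc + Erecovery implies an energy-optimal voltage, which a toy model can locate. Everything numeric in this sketch is an assumption: Eproc is taken as proportional to Vdd² and normalized to 1.8 V, the error-rate curve is an invented exponential, and the recovery cost of 18 is borrowed from the "18x an add" note on the BZIP sweep slide. The function names are hypothetical.

```python
import math


def error_rate(vdd: float, v_zero_margin: float = 1.5, steepness: float = 25.0) -> float:
    """Invented error-rate curve: negligible above v_zero_margin, rising sharply below."""
    return min(1.0, 1e-6 * math.exp(steepness * (v_zero_margin - vdd)))


def total_energy(vdd: float, v_nominal: float = 1.8, recovery_cost: float = 18.0) -> float:
    """Etotal = Eproc + Erecovery per instruction, relative to Eproc at v_nominal."""
    e_proc = (vdd / v_nominal) ** 2
    e_recovery = error_rate(vdd) * recovery_cost * e_proc
    return e_proc + e_recovery


def optimal_voltage(v_min: float = 0.8, v_max: float = 1.8, step: float = 0.01) -> float:
    """Sweep the supply voltage and return the point with the lowest total energy."""
    n = int(round((v_max - v_min) / step))
    candidates = [v_min + i * step for i in range(n + 1)]
    return min(candidates, key=total_energy)


if __name__ == "__main__":
    v_opt = optimal_voltage()
    print(f"optimal Vdd ~ {v_opt:.2f} V, relative energy {total_energy(v_opt):.2f}")
```

The minimum sits below the zero-margin voltage: a small error rate is worth paying for as long as the Vdd² saving outweighs the recovery energy.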
Simulation Methodology
• Challenge: instruction latency depends on circuit evaluation latency
  – May vary with changes in stage inputs, stage logic, voltage, temperature…
• Dynamic timing simulation combines architectural/circuit simulation
  – SimpleScalar/Alpha architectural-level simulation
  – Gate-level simulation of per-stage logic blocks
    • Logic block model describes cells, local and global interconnect
    • Cells characterized with SPICE at varied slew/cap-load/voltage
    • Each cycle, circuit simulator evaluates delay of each stage’s logic block
      – Based on actual instruction inputs from architectural simulator
• Initial implementation utilized a hand-generated EX-stage circuit model
  – Effort ongoing to automate extraction/decomposition/integration into SimpleScalar
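The idea of evaluating a data-dependent delay every cycle can be shown with a much simpler stand-in than the gate-level model described above. The sketch assumes a ripple-carry adder whose delay grows with the longest carry chain of the actual operands. The adder choice, the linear delay model, and all names and units are assumptions made for illustration only.

```python
from typing import Iterable, Tuple


def longest_carry_chain(a: int, b: int, width: int = 64) -> int:
    """Length in bit positions of the longest carry ripple when adding a and b."""
    longest = run = carry = 0
    for i in range(width):
        x, y = (a >> i) & 1, (b >> i) & 1
        if x & y:                  # generate: a new carry starts at this bit
            run, carry = 1, 1
        elif (x ^ y) and carry:    # propagate: the incoming carry ripples one bit further
            run += 1
        else:                      # no carry leaves this bit: the chain ends
            run, carry = 0, 0
        longest = max(longest, run)
    return longest


def count_timing_errors(operand_pairs: Iterable[Tuple[int, int]],
                        clock_period: float,
                        base_delay: float,
                        delay_per_bit: float,
                        width: int = 64) -> int:
    """Count cycles whose data-dependent delay exceeds the clock period."""
    errors = 0
    for a, b in operand_pairs:
        delay = base_delay + delay_per_bit * longest_carry_chain(a, b, width)
        if delay > clock_period:
            errors += 1
    return errors
```

Raising `delay_per_bit` plays the role of lowering the supply voltage: only operand pairs with long carry chains start to fail first, which is the data-dependent margin the approach tries to exploit.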
EX-Stage Analysis – Optimal Voltage Sweep
BZIP
[Figure: relative energy and relative performance (IPC), y-axis 0.3 to 1.5, vs. supply voltage (0.6 V to 1.8 V). Marked point: 0.31% error rate, 58% energy savings. Note: recovery cost includes energy to recover entire pipeline (18x an add)]
EX-Stage Analysis – Optimal Voltage Sweep
GCC
[Figure: relative energy and relative performance (IPC), y-axis 0.3 to 1.5, vs. supply voltage (0.6 V to 1.8 V). Marked point: 1.62% error rate, 24% energy savings]
Simulation Analysis – Energy-Optimal Voltage
• Simulator only models ALU in EX stage of pipeline
Supply Voltage Control System
• Current design utilizes a very simple proportional control function
  – Control algorithm implemented in software
[Figure: control loop: error signals from the pipeline are summed (Σ) into Esample; Ediff = Eref - Esample feeds the voltage control function, which drives the voltage regulator supplying Vdd to the pipeline; a reset input is also shown]
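One step of such a proportional controller might look like the sketch below. The definition Ediff = Eref - Esample comes from the slide. The sign convention (lower the voltage when fewer errors are seen than the target), the gain, and the function name are assumptions, and the 1.1 V to 1.8 V clamp reuses the tuning range quoted for the prototype.

```python
def next_voltage(vdd: float,
                 e_sample: float,
                 e_ref: float,
                 gain: float,
                 v_min: float = 1.1,
                 v_max: float = 1.8) -> float:
    """One proportional-control step: move Vdd by gain * Ediff, clamped to [v_min, v_max]."""
    e_diff = e_ref - e_sample
    # Assumed sign convention: positive Ediff means fewer errors than the target,
    # so there is headroom to lower the voltage; negative Ediff raises it.
    vdd_new = vdd - gain * e_diff
    return max(v_min, min(v_max, vdd_new))


if __name__ == "__main__":
    vdd = 1.8
    # Hypothetical sampled error rates over successive control intervals.
    for e_sample in (0.0, 0.0, 0.002, 0.03, 0.01):
        vdd = next_voltage(vdd, e_sample, e_ref=0.01, gain=2.0)
        print(f"sampled error rate {e_sample:.3f} -> Vdd = {vdd:.3f} V")
```

A purely proportional rule like this reacts only to the current sample, so the gain trades settling speed against overshoot into high error rates.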
Simulation Analysis – Razor DVS Execution
Gap
[Figure: supply voltage (0.6 to 2 V) and error rate (0% to 30%) over time]
Simulation Analysis – Razor DVS Execution
GCC
[Figure: supply voltage (0 to 2 V) and error rate (0% to 40%) over time]
Simulation Analysis – Razor DVS Performance
[Figure: chart with series DVS and Fixed]
Other Approaches to Dynamic Voltage Scaling
• Traditional DVS
  – Valid voltage / delay combinations “blessed” at design time
  – Approach leaves a significant amount of energy “on the table”
  – Temperature, process, data, and safety margins placed on voltage
• Slack detector – automatic tuning
  – National/ARM’s Intelligent Energy Management (IEM)
  – Processor voltage automatically tuned to external ambient conditions
  – Inverter chain designed to track most restrictive critical path, margin still required
• DSP power / reliability / QoS trade-offs
  – Shanbhag @ UIUC
  – Voltage overscaling approach that limits noise impacts
[Figure: processor floorplan: memory control, data cache, floating point and graphics, Ex unit, control unit, L2 cache, I/O unit, cache control, L2 tags]
Conclusions
• In-situ detection/correction of timing errors
  – Tune processor voltage based on error rate
  – Eliminate process, temperature, and safety margins (tune for near-zero error rate)
  – Purposely run below critical voltage to capture data-dependent latency margins
• Implemented with architecture/circuit support
  – Double-sampling metastability-tolerant Razor flip-flops validate pipeline results
  – Pipeline initiates recovery after circuit timing errors, no voltage/clock re-tuning needed
• Trade-off: supply voltage power savings vs. overhead of correction
[Figures: Razor flip-flop (main flip-flop, shadow latch, comparator; clk, clk_del, Error_L, Error) and distributed pipeline recovery (PC, IF, ID, EX, MEM, WB with Razor FFs, stabilizer FF, flush control; error, bubble, recover, flushID signals)]
Future Directions
• Research opportunities
  – Razor for caches/memory and control logic
  – Voltage control algorithms, especially per-stage tuning
  – Typical-case energy optimized designs (instead of worst-case latency optimized)
  – Turnkey application of Razor technology
• Prototype design, fabrication, evaluation
  – Razor I – Q4 2003 – Razor’ized combinational logic, global tuning
  – Razor II – Q3 2004 – Razor’ized caches and control logic, per-stage tuning
• Other applications
  – Single-event upset (SEU) protection using Razor error detection/re-execution
  – Over-clocking for performance improvement (2x shown among hobbyists)
More Details on Meta-Stability
• Sub-critical operation invites meta-stability
  – Meta-stability detector itself can become meta-stable
  – Double latch error signal to obtain sufficiently small probability
• On fail:
  – Flush entire pipe
  – No forward progress
  – Reduce frequency
[Figure: Razor flip-flop (D, Q; clk, clk_b, clk_del, clk_del_b) with pos and neg latches on the error signal, a dynamic OR / latch, and the error, fail, restore, and bubble flush signals]
Short Path Failure
[Figure: pipeline timing diagram for stages IF, ID, EX, MEM, WB with inst1 and inst2; clk, clk_d, ID.d, EX.d, MEM.d, and error waveforms, with the short path marked]