Razor - EECS @ Michigan - University of Michigan

Razor: Dynamic Voltage Scaling Based on Timing Speculation
Todd Austin, David Blaauw, Trevor Mudge
Advanced Computer Architecture Laboratory, University of Michigan
Lead PhD Students: Dan Ernst, Nam Sung Kim
Prototype Design Team: Shidhartha Das, Sanjay Pant, Toan Pham, Rajeev Rao
Razor Demo Team: Chris Drake, Seokwoo Lee
Krisztián Flautner, ARM Ltd.

September 2003

Razor DVS Austin / Blaauw / Mudge

Voltage Scaling under Dynamic Workloads
• Adapt frequency/voltage to performance demands of workload
  – Software controlled processor speed
  – Lower processor voltage during periods of low operating frequency
• Quadratic reduction in dynamic power and energy
• Super-quadratic reduction in leakage
[Figure: utilization, frequency, and Vdd plotted over time]
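As a rough illustration of the quadratic claim, the sketch below applies the first-order CMOS model in which dynamic power is proportional to V^2 * f. Pairing 1.8 V with 200 MHz and 1.1 V with 50 MHz is an assumption drawn from the prototype's tunable range quoted later in this deck. The function names are illustrative, and leakage is ignored.

```python
def energy_per_cycle_ratio(v_new: float, v_ref: float) -> float:
    """Relative dynamic energy per cycle, assuming E is proportional to V^2."""
    return (v_new / v_ref) ** 2


def dynamic_power_ratio(v_new: float, v_ref: float,
                        f_new: float, f_ref: float) -> float:
    """Relative dynamic power, assuming P is proportional to V^2 * f."""
    return energy_per_cycle_ratio(v_new, v_ref) * (f_new / f_ref)


# Assumed operating points: 1.8 V / 200 MHz (full speed), 1.1 V / 50 MHz (slowest).
energy_ratio = energy_per_cycle_ratio(1.1, 1.8)
power_ratio = dynamic_power_ratio(1.1, 1.8, 50e6, 200e6)
print(f"energy per cycle: {energy_ratio:.2f}x, dynamic power: {power_ratio:.2f}x")
```

Under these assumptions, energy per cycle falls to roughly 0.37 of its full-speed value and dynamic power to roughly 0.09.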


Impact of Process Scaling on Power
• Increasing uncertainty with process scaling
  – Inter- and intra-die process variations
  – Temperature variation
  – Power supply drop
  – Capacitive and inductive noise
• Impact on traditional design:
  – Addressing worst-case variation in design requires large safety margins
  – Higher energy / lower performance
  – Reduced yield
  – Difficulty in design closure
• Key Observation: worst-case conditions also highly improbable
  – Significant gain for circuits optimized for common case
  – Efficient mechanisms needed to tolerate infrequent worst-case scenarios
[Figure: intra-die variations in ILD thickness]


Shaving Voltage Margins with Razor
• Goal: reduce voltage margins with in-situ error detection and correction for delay failures
• Proposed Approach:
  – Tune processor voltage based on error rate
  – Eliminate safety margins, purposely run below critical voltage
    • Data-dependent latency margins
    • Trade-off: voltage power savings vs. overhead of correction
• Analogous to wireless power modulation
[Figure: percentage errors (0 to 60) vs. supply voltage (0.8 V to 2.0 V), with the traditional DVS, zero-margin, and sub-critical operating points marked]


Razor Flip-Flop Implementation
• Compare latched data with shadow-latch on delayed clock
• Upon failure: place data from shadow-latch in main latch
• Key design issues:
  – Ensure shadow latch always correct using conservative design techniques
  – Maintaining pipeline forward progress
  – Short path impact on shadow-latch
  – Power overhead of error detection and correction
  – Recovering pipeline state after errors
  – Meta-stable results in main flip-flop
[Figure: Razor FF block diagram between Logic Stage L1 and Logic Stage L2, with labels clk, clk_del, D, Q, 0/1 mux, Main Flip-Flop, Shadow Latch, comparator, Error_L, and Error]
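The compare-and-restore behaviour described on this slide can be sketched as a small behavioural model. This is a minimal sketch, not the circuit: the class and method names are hypothetical, and it assumes the logic output has always settled by the delayed clock edge, so the shadow latch holds the correct value.

```python
from dataclasses import dataclass


@dataclass
class RazorFlipFlop:
    """Behavioural model: main flip-flop plus shadow latch on a delayed clock."""
    q: int = 0        # value held by the main flip-flop
    shadow: int = 0   # value held by the shadow latch

    def clock(self, d_at_clk: int, d_at_clk_del: int) -> bool:
        """Sample D at clk (main) and at clk_del (shadow); return True on error."""
        self.q = d_at_clk
        self.shadow = d_at_clk_del   # assumed correct: logic has settled by clk_del
        error = self.q != self.shadow
        if error:
            self.q = self.shadow     # restore the main flip-flop from the shadow latch
        return error


ff = RazorFlipFlop()
print(ff.clock(5, 5))  # data was stable at both edges: no error
print(ff.clock(4, 7))  # main flip-flop sampled too early: error, q restored to 7
```

The model hides meta-stability and short-path effects, which the slide lists as separate design issues.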


Razor Flip-Flop Implementation
[Figure: Razor flip-flop circuit schematic with labels D, Q, clk, clk_b, clk_del, clk_del_b, Error_L, Shadow Latch, and a meta-stability detector (Inv_n, Inv_p)]


Pipeline Recovery
[Figure: instructions moving through IF, ID, EX, MEM, and WB, with timing waveforms for clk, clk_d, ID.d, EX.d, MEM.d, and error; contrasts the no-error case with the error case, in which the instruction is redone in MEM]


Centralized Pipeline Recovery Control
• One cycle penalty for timing failure
• Global synchronization may be difficult for fast, complex designs
[Figure: pipeline stages PC, IF, ID, EX, MEM, and WB (reg/mem) separated by Razor FFs, with per-stage error signals, recover signals, and the clock; cycles 0 to 6 with inst1 to inst6 in flight]


Distributed Pipeline Recovery Control
• Builds on existing branch / data speculation recovery framework
• Multiple cycle penalty for timing failure
• Scalable design since all recovery communication is local
[Figure: pipeline stages PC, IF, ID, EX, MEM (read-only), and WB (reg/mem) separated by Razor FFs, with a Stabilizer FF and a Flush Control unit; per-stage error, bubble, recover, and flushID signals; cycles 0 to 9 with inst1 to inst8 in flight]


Short Path Constraints
• Delayed clock imposes a short-path constraint
  Min. Path Delay > tdelay + thold
  – Razor necessary only for latches on slow paths
  – Pad fast path for latches with mixed path delays
  – Trade-off between DVS headroom and short path constraints
[Figure: waveforms for clock and clock_del marking tdelay, thold, and min. path delay, with the intended path and the short path]
[Figure: ff and Razor_ff on a common clock, fed by long paths and short paths, with a pad of extra delay]
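The constraint Min. Path Delay > tdelay + thold can be checked per path, and the padding for a fast path computed, along the lines of the sketch below. The path names, delay values, and the 1 ps margin are hypothetical; delays are integer picoseconds to keep the arithmetic exact.

```python
def short_path_slack(min_path_delay_ps: int, t_delay_ps: int, t_hold_ps: int) -> int:
    """Slack against the constraint; it holds only when the slack is positive."""
    return min_path_delay_ps - (t_delay_ps + t_hold_ps)


def padding_needed(min_path_delay_ps: int, t_delay_ps: int, t_hold_ps: int,
                   margin_ps: int = 1) -> int:
    """Extra delay (ps) to add so the path clears the constraint by margin_ps."""
    slack = short_path_slack(min_path_delay_ps, t_delay_ps, t_hold_ps)
    return max(0, margin_ps - slack)


# Hypothetical minimum path delays into two Razor flip-flops, in picoseconds.
paths = {"bypass": 300, "carry_chain": 900}
T_DELAY_PS, T_HOLD_PS = 500, 50  # assumed clock delay and hold time

for name, delay in paths.items():
    pad = padding_needed(delay, T_DELAY_PS, T_HOLD_PS)
    print(f"{name}: min delay {delay} ps, pad with {pad} ps")
```

A larger tdelay gives more room to scale voltage but forces more padding on fast paths, which is the trade-off the slide names.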


Razor I - Prototype Razor Implementation
• 4 stage 64-bit Alpha pipeline
  – 200MHz expected operation in 0.18μm technology, 1.8V, ~500mW
  – Tunable via software from 200-50MHz, 1.8-1.1V
• Razor overhead:
  – Total of 192 Razor flip-flops out of 2408 total (9%)
  – Error-free power overhead:
    • Razor flip-flops: < 1%
    • Short path buffer: 2.1%
  – Recovery power overhead:
    • Razor latch power overhead: 2% at 10% error rate
    • Additional power overhead due to re-execution of instructions
[Figure: die layout, 3.3 mm by 3 mm, showing I-Cache, D-Cache, Register File, and the IF, ID, EX, MEM, and WB stages]


Error-Rate Studies – Hardware Measurement


Hardware Measurement Setup
[Figure: test setup with two 48-bit LFSRs supplying 18-bit operands to 18x18 multipliers; a Fast Pipeline clocked at clk and Slow Pipelines A and B clocked at clk/2 produce 36-bit results, which pass through a stabilize stage and a != comparator feeding a 40-bit Error Counter]

Razor Demo


Error Rate Studies – Empirical Results
18x18-bit Multiplier Block at 90 MHz and 27 °C
• Environmental-margin @ 1.69 V
• Safety-margin @ 1.63 V
• Zero-margin @ 1.54 V
[Figure: error rate (log scale, 0.0000001% to 100%) vs. supply voltage (1.78 V down to 1.14 V) for random inputs; annotations: 22% saving, 30% energy saving, 35% energy savings with 1.3% error, and an error rate of once every 20 seconds]


Error Rate Studies – SPICE-Level Simulations
• Based on SPICE-level simulations of a Kogge-Stone adder
Kogge-Stone Adder at 870 MHz and 27 °C
[Figure: error rate (log scale, 0.00% to 100.00%) vs. supply voltage (2 V down to 0.6 V) for random, bzip, and ammp inputs]


Effects of Razor DVS
• Total Energy, Etotal = Eproc + Erecovery
[Figure: energy and pipeline throughput (IPC) vs. decreasing supply voltage, with curves for energy of processor operations (Eproc), energy of pipeline recovery (Erecovery), total energy (Etotal) with its optimal point marked, and energy of processor w/o Razor support]
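The shape of Etotal = Eproc + Erecovery can be explored with a toy model such as the one below. The error-rate curve is entirely hypothetical (100% at 1.0 V, falling one decade per 0.1 V), Eproc is assumed proportional to V^2, and the recovery cost of 18 operations borrows the "18x an add" figure quoted on the EX-stage analysis slide. It is a sketch of the trade-off, not the authors' model.

```python
RECOVERY_COST = 18.0  # energy of one recovery, in units of one operation (assumed)


def error_rate(v: float) -> float:
    """Hypothetical error-rate curve: 100% at 1.0 V, one decade lower per 0.1 V."""
    return min(1.0, 10.0 ** (-(v - 1.0) / 0.1))


def total_energy(v: float) -> float:
    """Relative Etotal = Eproc + Erecovery, with Eproc proportional to V^2."""
    e_proc = v ** 2
    e_recovery = error_rate(v) * RECOVERY_COST * e_proc
    return e_proc + e_recovery


# Sweep 1.80 V down to 1.00 V in 10 mV steps (integer millivolts avoid float drift).
voltages = [mv / 1000 for mv in range(1800, 999, -10)]
v_opt = min(voltages, key=total_energy)
print(f"energy-optimal voltage under these assumptions: {v_opt:.2f} V")
```

Below the optimum, recovery energy grows faster than the V^2 savings; above it, the unused margin costs more than the rare recoveries.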


Simulation Methodology
• Challenge: instruction latency depends on circuit evaluation latency
  – May vary with changes in stage inputs, stage logic, voltage, temperature…
• Dynamic timing simulation combines architectural/circuit simulation
  – SimpleScalar/Alpha architectural-level simulation
  – Gate-level simulation of per-stage logic blocks
    • Logic block model describes cells, local and global interconnect
    • Cells characterized with SPICE at varied slew/cap-load/voltage
    • Each cycle, circuit simulator evaluates delay of each stage's logic block
  – Based on actual instruction inputs from architectural simulator
• Initial implementation utilized a hand-generated EX-stage circuit model
  – Effort ongoing to automate extraction/decomposition/integration into SimpleScalar


EX-Stage Analysis – Optimal Voltage Sweep
BZIP
• Recovery cost includes energy to recover entire pipeline (18x an add)
• 0.31% Error Rate, 58% Energy Savings
[Figure: relative energy and relative performance (IPC), on a 0.3 to 1.5 scale, vs. supply voltage from 0.6 V to 1.8 V]


EX-Stage Analysis – Optimal Voltage Sweep
GCC
• 1.62% Error Rate, 24% Energy Savings
[Figure: relative energy and relative performance (IPC), on a 0.3 to 1.5 scale, vs. supply voltage from 0.6 V to 1.8 V]


Simulation Analysis – Energy-Optimal Voltage

• Simulator only models ALU in EX stage of pipeline


Supply Voltage Control System
• Ediff = Eref - Esample
• Current design utilizes a very simple proportional control function
  – Control algorithm implemented in software
[Figure: feedback loop in which pipeline error signals are summed (Σ) into Esample, Esample is subtracted from Eref to give Ediff, and Ediff drives the Voltage Control Function and Voltage Regulator that supply Vdd to the Pipeline; a reset input is also shown]
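A proportional control step built on Ediff = Eref - Esample might look like the sketch below. The gain, the target error rate, and the function name are hypothetical. The 1.1 V and 1.8 V limits are taken from the prototype's tunable range, and the sign convention (lower Vdd when the sampled error rate is under the target) is an assumption about how the loop is wired.

```python
def next_voltage(vdd: float, e_ref: float, e_sample: float, gain: float,
                 v_min: float = 1.1, v_max: float = 1.8) -> float:
    """One proportional control step: move Vdd against the error-rate difference."""
    e_diff = e_ref - e_sample       # positive when errors are below the target
    vdd_new = vdd - gain * e_diff   # fewer errors than target: lower the voltage
    return min(v_max, max(v_min, vdd_new))  # clamp to the regulator's range


# Hypothetical settings: 1.5% target error rate, gain of 2 V per unit error rate.
E_REF, GAIN = 0.015, 2.0

print(next_voltage(1.80, E_REF, 0.00, GAIN))  # no errors observed: step Vdd down
print(next_voltage(1.50, E_REF, 0.10, GAIN))  # 10% errors observed: step Vdd up
```

Because the slide says the control algorithm runs in software, the gain and target can be retuned without touching the hardware loop.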


Simulation Analysis – Razor DVS Execution
Gap
[Figure: supply voltage (0.6 V to 2 V) and error rate (0% to 30%) plotted over time]


Simulation Analysis – Razor DVS Execution
GCC
[Figure: supply voltage (0 V to 2 V) and error rate (0% to 40%) plotted over time]


Simulation Analysis – Razor DVS Performance
[Figure: chart with series labeled DVS and Fixed]


Other Approaches to Dynamic Voltage Scaling
• Traditional DVS
  – Valid voltage / delay combinations “blessed” at design time
  – Approach leaves a significant amount of energy “on the table”
  – Temperature, process, data, and safety margins placed on voltage
• Slack detector – automatic tuning
  – National/ARM’s Intelligent Energy Management (IEM)
  – Processor voltage automatically tuned to external ambient conditions
  – Inverter chain designed to track most restrictive critical path, margin still required
• DSP power / reliability / QoS trade-offs – Shanbhag @ UIUC
  – Voltage overscaling approach that limits noise impacts
[Figure: chip floorplan with blocks labeled Mem Control, Data cache, Floating point and graphics, Ex Unit, Control Unit, L2 Cache, IO Unit, Cache control, and L2 tags]


Conclusions
• In-situ detection/correction of timing errors
  – Tune processor voltage based on error rate
  – Eliminate process, temperature, and safety margins (tune for near-zero error rate)
  – Purposely run below critical voltage to capture data-dependent latency margins
• Implemented with architecture/circuit support
  – Double-sampling metastability-tolerant Razor flip-flops validate pipeline results
  – Pipeline initiates recovery after circuit timing errors, no voltage/clock re-tuning needed
• Trade-off: supply voltage power savings vs. overhead of correction
[Figure: Razor FF block diagram (Main Flip-Flop, Shadow Latch, comparator, clk, clk_del, Error_L, Error) alongside the distributed recovery pipeline (PC, IF, ID, EX, MEM (read-only), WB (reg/mem), Razor FFs, Stabilizer FF, Flush Control, with error, bubble, recover, and flushID signals)]


Future Directions
• Research opportunities
  – Razor for caches/memory and control logic
  – Voltage control algorithms, especially per-stage tuning
  – Typical-case energy optimized designs (instead of worst-case latency optimized)
  – Turnkey application of Razor technology
• Prototype design, fabrication, evaluation
  – Razor I – Q4 2003 – Razor’ized combinational logic, global tuning
  – Razor II – Q3 2004 – Razor’ized caches and control logic, per-stage tuning
• Other applications
  – Single-event upset (SEU) protection using Razor error detection/re-execution
  – Over-clocking for performance improvement (2x shown among hobbyists)


More Details on Meta-Stability
• Sub-critical operation invites meta-stability
  – Meta-stability detector itself can become meta-stable
  – Double-latch the error signal to obtain a sufficiently small failure probability
• fail:
  – Flush entire pipe
  – No forward progress
  – Reduce frequency
[Figure: Razor flip-flop schematic with labels D, Q, clk, clk_b, clk_del, clk_del_b, pos, neg, error, fail, and restore, plus a Dynamic Or / Latch stage generating bubble and flush]


Short Path Failure
[Figure: inst1 and inst2 moving through IF, ID, EX, MEM, and WB, with timing waveforms for clk, clk_d, ID.d, EX.d, MEM.d (carrying I1 and I2), and error; the short path is marked]
