Advanced Computer Architecture
30-01-2018 simplified version

A) A processor embeds two cores that have private L1 caches; the L2 and L3 caches are shared. The caches obey the MESI protocol and have the following structure: 64KB, 8-way L1 I-cache and D-cache, each 32-byte block; 1024KB, 8-way associative, 128-byte block L2 cache; 16MB, 8-way associative, 128-byte block L3 cache. The latencies (disregarding virtual memory TLB) expressed in clock cycles are: 2 in L1, 6 in L2, 12 in L3. Addresses are 48-bit long. Write operations are managed with a write-back, allocate policy.

Each core of the processor fetches, decodes issues and retires instructions according to the Tomasulo’s scheme. The front-end in-order section (fetch and decode) consists of 2 stages. The issue logic takes 1 clock cycle. The architecture supports dynamic speculative execution, and control dependencies from branches are solved when the branch evaluates the condition, even if it is not at commit. The execution model obeys the attached state transition diagram. There is a functional unit (FUs) Int1 for integer arithmetics (arithmetic and local instructions, branches and jumps, no multiplication), 1 FUs FAdd1 for floating point addition/subtraction, 1 FU FMolt1 for floating point multiplication, and a FU for division, FDiv1.

There are 12 integer (R0-R11) and 12 floating point (F0-F11) registers. Speculation is handled through a 8-entry ROB, a pool of 4 Reservation Stations (RS) Rs1-4 shared among all FUs, 1 load buffers Load1, 1 store buffer Store1 (see the attached execution model): an instruction is first placed in the ROB (if one entry is available), then up to 2 instructions are dispatched to the shared RS (if available) when they are ready for execution and then executed in the proper FU. FUs are *pipelined* (except for the float division unit, which is blocking) and have the latencies quoted in the following table:

|  |  |
| --- | --- |
| Int - 2 | Fadd – 3 |
| Fmolt – 4 | Fdiv – 8 |

Further assumption

* The code is assumed to be already in the I-cache; data caches are described in point A) and are assumed empty and invalidated; the cost of a miss is 60.

c1) assuming a write-back protocol for cache management, show state transitions for all instructions in the first iteration and instructions up to PC04 included in the second iteration :

PC01 ADDI R1,R0, 0000FFFFA0A0hex

PC02 LD F1,0(R1)

PC03 MULTF F2,F1,F0

PC04 ST F2,0(R1)

PC05 MULTF F2,F2,F2

PC06 LD F3,8(R1)

PC07 MULTF F6,F3,F1

PC08 MULTF F5,F2,F6

PC09 ST F5,8(R1)

PC10 ADD R1,R1,16

PC11 BLI R1, 00010000A0A0hex,PC02

c2) show ROB, RS and buffer status at the issue of PC03 in the second iteration.

Dynamic speculative execution

Decoupled ROB RS execution model

|  |  |  |  |
| --- | --- | --- | --- |
| **ISTRUCTION** |  |  | **INSTRUCTION STATE**  |
|  | **n.ite** | **ROBpos** | **WO** | **RE** | **DI** | **EX** | **WB** | **RR** | **CO** |
| PC01 ADDI R1,R0, 0000FFFFA0A0hex |  | **0** |  | **1** | **2** | **3-4** | **5** | **6** | **7** |
| PC02 LD F1,0(R1) |  | **1** | **2-4** | **5** | **6** | **7-86** | **87** | **88** | **89** |
| PC03 MULTF F2,F1,F0 |  | **2** | **3-86** | **87** | **88** | **89-92** | **93** | **94** | **95** |
| PC04 ST F2,0(R1) |  | **3** | **4-92** | **93** | **94** | **95-96** | **-** | **97** | **98** |
| PC05 MULTF F2,F2,F2 |  | **4** | **5-92** | **93** | **94** | **95-98** | **99** | **100** | **101** |
| PC06 LD F3,8(R1) |  | **5** |  | **6-87** | **88** | **89-90** | **91** | **92-101** | **102** |
| PC07 MULTF F6,F3,F1 |  | **6** | **7-90** | **91** | **92** | **93-96** | **97** | **98-102** | **103** |
| PC08 MULTF F5,F2,F6 |  | **7** | **8-98** | **99** | **100** | **101-104** | **105** | **106** | **107** |
| PC09 ST F5,8(R1) |  | **0** | **9-104** | **105** | **106** | **107-108** | **-** | **109** | **110** |
| PC10 ADD R1,R1,16 |  | **1** |  | **90** | **91** | **92-93** | **94** | **95-110** | **111** |
| PC11 BLI R1, 00010000A0A0hex,PC02  |  | **2** |  | **96** | **97** | **98-99** | **-** | **100-112** |  |
| PC12 whatever |  | **3** |  |  |  | **Canc. at 100** |  |  |  |
| PC02 LD F1,0(R1) |  | **3** |  | **104** |  |  |  |  |  |
|  |  |  |  |  |  |  |  |  |  |
|  |  |  |  |  |  |  |  |  |  |
|  |  |  |  |  |  |  |  |  |  |
|  |  |  |  |  |  |  |  |  |  |
|  |  |  |  |  |  |  |  |  |  |
|  |  |  |  |  |  |  |  |  |  |
|  |  |  |  |  |  |  |  |  |  |
|  |  |  |  |  |  |  |  |  |  |
|  |  |  |  |  |  |  |  |  |  |
|  |  |  |  |  |  |  |  |  |  |

|  |  |
| --- | --- |
|  | Reservation station and load/store buffers |
| Busy | Op | Vj | Vk | ROBj | ROBk | ROB pos | Address |
| Rs1 |  |  |  |  |  |  |  |  |
| Rs2 |  |  |  |  |  |  |  |  |
| Rs3 |  |  |  |  |  |  |  |  |
| Rs4 |  |  |  |  |  |  |  |  |
| Load1 |  |  |  |  |  |  |  |  |
|  |  |  |  |  |  |  |  |  |
| Store1 |  |  |  |  |  |  |  |  |
|  |  |  |  |  |  |  |  |  |

ROBj ROBk: sources not yet available

ROB pos: ROB entry number where instruction is located

|  |  |
| --- | --- |
|  | Result Register status |
| Integer | R0 | R1 | R2 | R3 | R4 | R5 | R6 | R7 | R8 | R9 | R10 | R11 |  |  |
| ROB pos |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| state |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Float. | F0 | F1 | F2 | F3 | F4 | F5 | F6 | F7 | F8 | F9 | F10 | F11 |  |  |
| ROB pos |  | 1 | 4 | 5 |  |  |  |  |  |  |  |  |  |  |
| state |  | B | B | B |  |  |  |  |  |  |  |  |  |  |

|  |
| --- |
| Reorder Buffer (ROB) |
|  ROB Entry#  | Busy | Op | Status | Destination | Value |
| 0 | Yes | PC09 ST F5,8(R1) |  |  |  |
| 1 | Yes  | PC10 ADD R1,R1,16 |  |  |  |
| 2 | Yes | PC10 ADD R1,R1,16 |  |  |  |
| 3 | yes HT | PC04 ST F2,0(R1) |  |  |  |
| 4 | Yes | PC05 MULTF F2,F2,F2 |  |  |  |
| 5 | Yes | PC06 LD F3,8(R1) |  |  |  |
| 6 | Yes | PC07 MULTF F6,F3,F1 |  |  |  |
| 7 | Yes | PC08 MULTF F5,F2,F6 |  |  |  |

DI

EX

WB

RR

RE

CO

wo

QUEUE

**Decoupled execution model for bundled (paired) instructions**

The state diagram depicts the model for a dynamically scheduled, speculative execution microarchitecture equipped with a Reorder Buffer (ROB) and a set of Reservation Stations (RS). The ROB and RSs are allocated during the ISSUE phase, denoted as RAT (Register Alias Allocation Table) in INTEL microarchitectures, as follows: an instruction is fetched from the QUEUE of decoded instructions and ISSUED if there is an empty slot in the ROB ( head and tail of the ROB queue do not match); a maximum of two instructions are moved into the RS (if available) when all of their operands are available. Access memory instructions are allocated in the ROB and then moved to a load/store buffer (if available) when operands (address and data, if proper) are available .

**States** are labelled as follows:

WO: Waiting for Operands (at least one of the operands is not available)

RE: Ready for Execution (all operands are available)

DI: Dispatched (posted to a free RS or load/store buffer)

EX: Execution (moved to a load/store buffer or to a matching and free UF)

WB: Write Back (result is ready and is returned to the Rob by using in exclusive mode the Common Data Bus CDB)

RR: Ready to Retire (result available or STORE has completed)

CO: Commit (result is copied to the final ISA register)

**State transitions** happen at the following events:

 *from* QUEUE *to* WO: ROB entry available, operand missing

*from* QUEUE *to* RE: ROB entry available, all operands available

*loop at* WO: waiting for operand(s)

*from* WO *to* RE: all operands available

*loop at* RE: waiting for a free RS or load/store buffer

*from* RE *to* DI: RS or load/store buffer available

*loop on* DI: waiting for a free UF

*from* DI *to* EX: UF available

*loop at* EX: multi-cycle execution in a UF, or waiting for CDB

*from* EX *to* WB: result written to the ROB with exclusive use of CDB

*from* EX *to* RR: STORE completed, branch evaluted

*loop at* RR: instruction completed, not at the head of the ROB, or bundled with a not RR instruction

*from* RR *to* CO: bundle of RR instructions at the head of the ROB, no exception raised

**Resources***Register-to-Register* instructions hold resources as follows:

ROB: from state WO (or RE) up to CO, inclusive;

RS: state DI

UF: EX and WB

*Load/Store* instructions hold resources as follows:

ROB: from state WO (or RE) up to CO, inclusive;

Load buffer: from state WO (or RE) up to WB

Store buffer: from state WO (or RE) up to EX (do not use WB)

**Forwarding**: a write on the CDB (WB) makes the operand available to the consumer in the same clock cycle. If the consumer is doing a state transition from QUEUE to WO or RE, that operand is made available; if the consumer is in WO, it goes to RE in the same clock cycle of WB for the producer.

**Branches**: they compute Next-PC and the branch condition in EX and optionally forward Next-PC to the “in-order” section of the pipeline (Fetch states) in the next clock cycle. They do not enter WB and go to RR instead.