A) A processor has the following cache hierarchy: L1 I-cache and L1 D-cache each 16 KB, $2-$ way associative, 16-byte block; unified L2, 256 KB, 4-way associative, 32 -byte block; unified L3 512KB, 4-way associative, 32-byte block. The latencies (disregarding virtual memory TLB) expressed in clock cycles are: 2 in L1, 3 in L2, 6 in L3.
The processor executes the IA-32 ISA (but with 32 -bit addressing mode).
a1) Assuming initially empty and invalidated cache lines throughout the hierarchy, show the content of each D-cache after the execution of the instruction $L D R 1,\left(0000 A F 00_{\text {hex }}\right.$ ) (further assumption: the instruction is in the I-cache).
a2) determine hexadecimal values $X$ and $Y$ (other than $0000 \mathrm{AF} 00_{\text {hex }}$ ) such that the instructions LD F1, (X) and LD F2, O(Y) load data in the same L3 D-cache block, in the same L2 D-cache block and in different L1 D-Cache blocks.
B) The processor runs at 2.33 GHz and is connected to RAM memory through a 64 -bit bus interfaced to a DDR2-800 2-way interleaved memory, 8-byte block length. The bus/DDR2 interface guarantees the maximum transfer rate allowed by the DDR2-800 chips. Compute the cost of a miss measured in processor clocks, assuming that addressing the memory requires a single bus clock cycle, and that each row activation in the memory banks requires 2 bus clock cycles.
C) The processor has a superscalar, 2-way pipeline, that fetches, decodes issues and retires (commits) bundles containing each 2 instructions. The front-end in-order section (fetch and decode) consists of 4 stages. The issue logic takes 1 clock cycle, if the instructions in the bundle are independent, otherwise it takes 2 clock cycles. The architecture supports dynamic speculative execution, and control dependencies from branches are solved only at commit time, even if the outcome of the branch logic is available earlier. The execution model obeys the attached state transition diagram.
There are 2 functional units (FUs) Int1-INT2 for integer arithmetics (arithmetic and local instructions, branches and jumps, no multiplication ), 2 FUS FAdd1-Fadd2 for floating point addition/subtraction, 2 FU FMolt1 for floating point multiplication, and a FU for division, FDiv1. There are 12 integer ( $\mathrm{R} 0-\mathrm{R} 11$ ) and 12 floating point (F0-F11) registers.
Speculation is handled through a 6-entry ROB, a pool of 4 Reservation Stations (RS) Rs1-4 shared among all FUs, 2 load buffers Load1-2, 2 store buffers Store1-2 (see the attached execution model): an instruction bundle is first placed in the ROB (if two entries are available), then each instruction is dispatched to one of the shared RS (if available) and then executed in the proper FU. FUs are pipelined (not the Fdiv one) and have the latencies quoted in the following table:

| Int -1 | Fadd -2 |
| :--- | :--- |
| Fmolt -4 | Fdiv -6 |

Further assumption

- caches are described in point A) and are assumed empty and invalidated.
c1) show state transitions for the instructions of the first iteration of the following code fragment (assume a conventional L3 miss time of 6 clock cycles), highlighting conflicts, if any:

PC01 MOVI R1,0X0000AFAF
PC02 MOVI R2,0X0000000A
PC03 LD F1,0(R1)
PC04 LD F2,8(R1)
PC05 LD F3,16(R1)
PC06 MULTF F4,F3,F0
-- FO HAS BEEN PRE-LOADED
PC07 MULTF F5,F2,F0
PC08 ADDF F7,F4,F5
PC09 MULTF F6,F1,F0
PC10 ADDF F7,F7,F6
PC11 SD 0(R1),F7
PC12 ADDI R1,R1,8
PC13 SUBI R2,R2,1
PC14 BNEZ R2,PC03
c2) show ROB, RS and buffer status at the issue of the BNEZ of the first iteration.
D) The code fragment is executed on a statically scheduled pipeline having the following structure (A1-A2 FP addition/subtraction, M1-M4 FP multiplication):

IF1A, ID1A, EXA,
,MEM1A, MEM2A, WBA

$$
\mathrm{A} 1, \mathrm{~A} 2
$$

M1, M2, M3, M4
d1) Assuming BNEZ decides on the condition in EXA, schedule the branch delay slot;
d2) Using b1), produce a schedule for the first iteration, assuming caches always hit with a
latency of 1 clock cycle;
d3) compute the CPI of the algorithm;
d4) evaluate maximum unrolling (if possible), with 32 registers (both INT e FP);
d5) unroll once and schedule the unrolled code, and compute the new CPI.

Dynamic speculative execution
Decoupled ROB RS execution model

| ISTRUCTION |  |  | INSTRUCTION STATE |  |  |  |  |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  | $\begin{array}{\|l\|} \hline \text { n. } \\ \text { ite } \\ \hline \end{array}$ | $\begin{aligned} & \text { ROB } \\ & \text { pos } \\ & \hline \end{aligned}$ | WO | RE | DI | EX | WB | RR | CO |
| PC01 MOVI R1,0X0000AFAF | - |  |  |  |  |  |  |  |  |
| PC02 MOVI R2,0X0000000A | - |  |  |  |  |  |  |  |  |
| PC03 LD F1, 0 (R1) | - |  |  |  |  |  |  |  |  |
| PC04 LD F2,8(R1) | 1 |  |  |  |  |  |  |  |  |
| PC05 LD F3,16(R1) | 1 |  |  |  |  |  |  |  |  |
| PC06 MULTF F4,F3,F0 | 1 |  |  |  |  |  |  |  |  |
| PC07 MULTF F5,F2,F0 | 1 |  |  |  |  |  |  |  |  |
| PC08 ADDF F7,F4,F5 | 1 |  |  |  |  |  |  |  |  |
| PC09 MULTF F6,F1,F0 | 1 |  |  |  |  |  |  |  |  |
| PC10 ADDF F7,F7,F6 | 1 |  |  |  |  |  |  |  |  |
| PC11 SD 0(R1), F7 | 1 |  |  |  |  |  |  |  |  |
| PC12 ADDI R1,R1,8 | 1 |  |  |  |  |  |  |  |  |
| PC13 SUBI R2,R2,1 | 1 |  |  |  |  |  |  |  |  |
| PC14 BNEZ R2, PC03 | 1 |  |  |  |  |  |  |  |  |
|  |  |  |  |  |  |  |  |  |  |
|  |  |  |  |  |  |  |  |  |  |
|  |  |  |  |  |  |  |  |  |  |
|  |  |  |  |  |  |  |  |  |  |
|  |  |  |  |  |  |  |  |  |  |
|  |  |  |  |  |  |  |  |  |  |
|  |  |  |  |  |  |  |  |  |  |
|  |  |  |  |  |  |  |  |  |  |
|  |  |  |  |  |  |  |  |  |  |
|  |  |  |  |  |  |  |  |  |  |
|  |  |  |  |  |  |  |  |  |  |


|  | Reservation station and load/store buffers |  |  |  |  |  |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  | Busy | Op | Vj | Vk | $\mathrm{ROB}_{j}$ | $\mathrm{ROB}_{\mathrm{k}}$ | ROB <br> pos | Address |
| Rs 1 |  |  |  |  |  |  |  |  |
| Rs 2 |  |  |  |  |  |  |  |  |
| Rs3 |  |  |  |  |  |  |  |  |
| Rs 4 |  |  |  |  |  |  |  |  |
| Load1 |  |  |  |  |  |  |  |  |
| Load2 |  |  |  |  |  |  |  |  |
| Store1 |  |  |  |  |  |  |  |  |
| Store2 |  |  |  |  |  |  |  |  |

$R O B_{j}$ ROB $_{k}:$ sources not yet available
ROB pos: ROB entry number where instruction is located


| Reorder Buffer (ROB) |  |  |  |  |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| ROB Entry\# | Busy | Op | Status | Destination | Value |  |
| 1 |  |  |  |  |  |  |
| 2 |  |  |  |  |  |  |
| 3 |  |  |  |  |  |  |
| 4 |  |  |  |  |  |  |
| 5 |  |  |  |  |  |  |
| 6 |  |  |  |  |  |  |



## Decoupled execution model for bundled (paired) instructions

The state diagram depicts the model for a dynamically scheduled, speculative execution microarchitecture equipped with a Reorder Buffer (ROB) and a set of Reservation Stations (RS). The RSs are allocated during the ISSUE phase, denoted as RAT (Register Alias Allocation Table) in INTEL microarchitectures, as follows: a bundle ( 2 instructions) if fetched from the QUEUE of decoded instructions and ISSUED if there is a free couple of consecutive entries in the ROB (head and tail of the ROB queue do not match); instructions are moved into a RS (if available) when all of its operands are available. Access memory instructions are allocated in the ROB and then moved to a load/store buffer (if available) when operands (address and data, if proper) are available .
States are labelled as follows:

| WO: | Waiting for Operands (at least one of the operands is not available) |
| :--- | :--- |
| RE: | Ready for Execution (all operands are available) |
| DI: | Dispatched (posted to a free RS or load/store buffer) |
| EX: | Execution (moved to a load/store buffer or to a matching and free UF) |
| WB: | Write Back (result is ready and is returned to the Rob by using in exclusive mode the Common Data Bus CDB) |
| RR: | Ready to Retire (result available or STORE has completed) |
| CO: | Commit (result is copied to the final ISA register) |

State transitions happen at the following events:
from QUEUE to WO:
from QUEUE to RE:
loop at WO:
from WO to RE:
loop at RE:
from RE to DI :
loop on DI :
from DI to EX:
loop at EX:
from EX to WB :
from EX to RR:
loop at RR:
from RR to CO :

ROB entry available, operand missing
ROB entry available, all operands available
waiting for operand(s)
all operands available
waiting for a free RS or load/store buffer
RS or load/store buffer available
waiting for a free UF
UF available
multi-cycle execution in a UF, or waiting for CDB
result written to the ROB with exclusive use of CDB
STORE completed, branch evaluted
instruction completed, not at the head of the ROB, or bundled with a not RR instruction
bundle of RR instructions at the head of the ROB, no exception raised

## Resources

Register-to-Register instructions hold resources as follows:
ROB: from state WO (or RE) up to CO, inclusive;
RS: state DI
UF: EX and WB
Load/Store instructions hold resources as follows:
ROB: from state WO (or RE) up to CO, inclusive;
Load buffer: from state WO (or RE) up to WB
Store buffer: from state WO (or RE) up to EX (do not use WB)
Forwarding: a write on the CDB (WB) makes the operand available to the consumer in the same clock cycle. If the consumer is doing a state transition from QUEUE to WO or RE, that operand is made available; if the consumer is in WO, it goes to RE in the same clock cycle of WB for the producer.
Branches: they compute Next-PC and the branch condition in EX and optionally forward Next-PC to the "in-order" section of the pipeline (Fetch states) in the next clock cycle. They do not enter WB and go to RR instead.

