Register Scoreboards: Beyond Simple Forwarding in Pipelined Processors

Introduction

Data forwarding (also known as bypassing) is a fundamental technique in pipelined processor design that allows results to be used before they’re written back to the register file. For single-cycle operations in a classic RISC pipeline, forwarding elegantly solves most Read-After-Write (RAW) data hazards. However, when we introduce multi-cycle operations like multiplication, division, or memory loads, simple forwarding becomes insufficient. This article explores why register scoreboards are essential for handling these complex scenarios, using a practical RISC-V implementation as a case study.


The Promise and Limitations of Forwarding

How Basic Forwarding Works

In a standard 5-stage RISC pipeline (Fetch, Decode, Execute, Memory, Writeback), forwarding allows the Execute stage to receive operands directly from later pipeline stages rather than waiting for writeback to the register file:

Time:     t0    t1    t2    t3    t4    t5
Instr1:   IF    ID    EX    MEM   WB
Instr2:         IF    ID    EX    MEM   WB
                            ↑
                            Result forwarded from EX/MEM at t3

When instruction 2 depends on instruction 1’s result, we can forward from the EX/MEM pipeline register directly to the Execute stage, eliminating the need to stall.
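A minimal Python sketch of the operand selection described here, assuming the textbook arrangement in which the Execute stage prefers the EX/MEM register, then MEM/WB, then the register file. The PipeReg record, its field names, and forward_operand are hypothetical and are not taken from the Tiny Vedas code.

```python
from typing import NamedTuple, Optional


class PipeReg(NamedTuple):
    """Destination info held in a pipeline register (hypothetical fields)."""
    rd: int          # destination register number
    reg_write: bool  # True if the instruction writes a register
    value: int       # result carried by this pipeline register


def forward_operand(rs: int, regfile_value: int,
                    ex_mem: Optional[PipeReg],
                    mem_wb: Optional[PipeReg]) -> int:
    """Pick the newest value of source register rs for the Execute stage."""
    # x0 is hardwired to zero in RISC-V, so it is never forwarded.
    if rs == 0:
        return 0
    # EX/MEM holds the most recent producer, so it takes priority.
    if ex_mem is not None and ex_mem.reg_write and ex_mem.rd == rs:
        return ex_mem.value
    # Otherwise use the older result sitting in MEM/WB.
    if mem_wb is not None and mem_wb.reg_write and mem_wb.rd == rs:
        return mem_wb.value
    # No in-flight producer: the register file value is current.
    return regfile_value
```

When both pipeline registers target the same register, the EX/MEM value wins because it comes from the younger of the two producers.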

Where Forwarding Falls Short

Consider a multiply operation that takes 3 cycles to complete:

Time:          t0    t1    t2    t3    t4    t5    t6
MUL r3,r1,r2:  IF    ID    EX1   EX2   EX3   MEM   WB
ADD r4,r3,r5:        IF    ID    EX    MEM   WB
                          ↑
                          Where is r3?

At time t2, when the ADD instruction is in the decode stage and needs r3, the multiply is only in its first execution cycle (EX1). The ADD would normally enter Execute at t3, but the product leaves EX3 at the end of t4 and cannot be forwarded until t5, three cycles after the ADD first asked for it. The ADD therefore has to be held for two cycles, and a forwarding network on its own has no way of knowing that.
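The timing in this example reduces to simple arithmetic. The sketch below assumes a result can be bypassed in the cycle after the producer's last execute cycle. The function stall_cycles and its parameter names are illustrative only.

```python
def stall_cycles(producer_ex_start: int, producer_latency: int,
                 consumer_ex_start: int) -> int:
    """Stall cycles a dependent instruction needs before entering Execute.

    producer_ex_start: cycle of the producer's first execute stage
    producer_latency:  number of execute cycles the producer occupies
    consumer_ex_start: cycle in which the consumer would enter Execute
                       if there were no hazard
    """
    # First cycle in which the result can be bypassed to a consumer in EX.
    ready_cycle = producer_ex_start + producer_latency
    # A consumer arriving on or after that cycle does not stall.
    return max(0, ready_cycle - consumer_ex_start)
```

For the diagram's MUL (EX1 at t2, three execute cycles) and an ADD that would enter Execute at t3, stall_cycles(2, 3, 3) gives 2. A single-cycle producer in the same position, stall_cycles(2, 1, 3), gives 0.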

The Multi-Cycle Operation Challenge

Different Operation Latencies

Modern processors must handle operations with varying execution times:

ALU operations: 1 cycle (ADD, SUB, AND, OR, XOR, shifts)
Multiply operations: 1-4 cycles (depending on implementation)
Divide operations: 8-32+ cycles (often iterative)
Load operations: 1+ cycles (cache hit) to 100+ cycles (cache miss)

Each of these creates a different forwarding challenge:

Pipelined multi-cycle ops (multiply): Results available at different stages
Blocking multi-cycle ops (divide): No intermediate results to forward
Variable latency ops (loads): Unpredictable completion time

The Information Gap

The core problem is that forwarding logic needs to know:

Which registers are currently unavailable? (RAW hazard detection)
When will each register’s value be ready? (Stall duration calculation)
Where should the value be forwarded from? (Bypass path selection)

For single-cycle operations, all three questions have trivial answers. For multi-cycle operations, we need a mechanism to track in-flight register writes.
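To make the three questions concrete, here is a hedged sketch of a tracking table that could answer all of them. It is a thought experiment rather than a description of any real core: InFlightTable, InFlightWrite, and the source labels are invented for illustration, and a ready_cycle of None stands for variable-latency operations such as loads.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class InFlightWrite:
    ready_cycle: Optional[int]  # None when the completion time is unknown
    source: str                 # illustrative label for the bypass point


class InFlightTable:
    """Tracks pending register writes, keyed by destination register."""

    def __init__(self) -> None:
        self._pending: Dict[int, InFlightWrite] = {}

    def issue(self, rd: int, ready_cycle: Optional[int], source: str) -> None:
        # x0 is hardwired to zero, so writes to it are never tracked.
        if rd != 0:
            self._pending[rd] = InFlightWrite(ready_cycle, source)

    def complete(self, rd: int) -> None:
        self._pending.pop(rd, None)

    def is_unavailable(self, rs: int) -> bool:
        """Question 1: is there a pending write to rs?"""
        return rs in self._pending

    def ready_at(self, rs: int) -> Optional[int]:
        """Question 2: when will rs be ready (None if unknown or not pending)?"""
        entry = self._pending.get(rs)
        return entry.ready_cycle if entry is not None else None

    def forward_from(self, rs: int) -> Optional[str]:
        """Question 3: where would the value be forwarded from?"""
        entry = self._pending.get(rs)
        return entry.source if entry is not None else None
```

A one-bit-per-register scoreboard keeps only the answer to the first question and leaves the other two to the writeback and forwarding logic.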

Enter the Register Scoreboard

What is a Register Scoreboard?

A register scoreboard is a hardware structure that tracks which registers have pending writes. At its simplest, it’s a bit vector where each bit indicates whether a register has an in-flight operation:

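logic [31:0] scoreboard;  // One bit per register

// issue_* and wb_* are illustrative names for the issuing and completing instruction
always_ff @(posedge clk or negedge rstn) begin
    if (!rstn) begin
        scoreboard <= '0;
    end else begin
        // Clear bit when operation completes
        if (wb_rd_wr_en) scoreboard[wb_rd_addr] <= 1'b0;
        // Set bit when issuing long-latency operation (x0 is never tracked).
        // Written last so that a same-cycle set of the same register wins.
        if (issue_long_op && issue_rd_addr != 5'd0) scoreboard[issue_rd_addr] <= 1'b1;
    end
end

// Check for hazards
assign raw_hazard = scoreboard[rs1_addr] | scoreboard[rs2_addr];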
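The same bit vector can be modeled behaviorally in a few lines of Python, which is handy for checking the set, clear, and lookup rules before writing RTL. This is a sketch under the assumption that x0 is never tracked. The Scoreboard class and its method names are illustrative.

```python
class Scoreboard:
    """Behavioral model of a one-bit-per-register scoreboard."""

    def __init__(self, n_reg: int = 32) -> None:
        self.n_reg = n_reg
        self.bits = 0  # bit i set means register i has a pending write

    def set(self, rd: int) -> None:
        # Mark rd as pending; x0 is hardwired to zero and never tracked.
        if rd != 0:
            self.bits |= 1 << rd

    def clear(self, rd: int) -> None:
        # Mark rd as available again once its write has completed.
        self.bits &= ~(1 << rd)

    def raw_hazard(self, rs1: int, rs2: int) -> bool:
        # A hazard exists if either source register has a pending write.
        mask = (1 << rs1) | (1 << rs2)
        return (self.bits & mask) != 0
```

A single bit cannot distinguish two outstanding writes to the same register, so a design using this scheme also has to rule that case out or track it separately.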

Implementation in the Tiny Vedas Core

Looking at a real implementation from the Tiny Vedas RISC-V core, the register scoreboard module (rsb) is instantiated in IDU1 and driven by the outputs of IDU0 (idu0_out):

rsb #(
    .N_REG(32)
) rsb_idu0_i (
    .clk           (clk),
    .rstn          (rstn),
    .pipe_flush    (pipe_flush),
    .rs1_addr      (idu0_out.rs1_addr),
    .rs2_addr      (idu0_out.rs2_addr),
    .rs1_rd_en     (idu0_out.rs1 & idu0_out.legal),
    .rs2_rd_en     (idu0_out.rs2 & idu0_out.legal),
    .rs1_hit       (rs1_rsb_hit_idu0),
    .rs2_hit       (rs2_rsb_hit_idu0),
    .set_rd_addr   (idu0_out.rd_addr),
    .set_rd_wr_en  (idu0_out.legal & (idu0_out.mul | idu0_out.load) & ~(pipe_stall | idu0_rsb_hit_stall)),
    .clear_rd_addr (exu_wb_rd_addr),
    .clear_rd_wr_en(exu_wb_rd_wr_en)
);

Key observations:

Selective tracking: Only multiply and load operations set the scoreboard bit (idu0_out.mul | idu0_out.load)
Early detection: The scoreboard is checked against the instruction coming out of IDU0 (idu0_out), before its operands are read from the register file in IDU1
Writeback clearing: Scoreboard bits are cleared when results actually write back
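The set-enable expression is the densest line in the instantiation, so it helps to restate it as a plain predicate. The Python function below mirrors the .set_rd_wr_en port expression, with one boolean per signal. It is a reading aid, not part of the core.

```python
def set_rd_wr_en(legal: bool, mul: bool, load: bool,
                 pipe_stall: bool, rsb_hit_stall: bool) -> bool:
    """Mirror of the .set_rd_wr_en port expression, one bool per signal."""
    tracked_op = mul or load                       # only long-latency ops
    issuing = not (pipe_stall or rsb_hit_stall)    # instruction actually moves on
    return legal and tracked_op and issuing
```

One plausible reason for the stall gating (an inference, not something stated in the source) is that a stalled instruction has not issued yet. If it set its destination bit while waiting, an instruction such as MUL r3,r3,r4 could end up stalling on its own pending write.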

Scoreboard + Forwarding: A Hybrid Approach

The scoreboard doesn’t replace forwarding; it complements it. The implementation shows this hybrid approach:

// WB to IDU1 forwarding
assign idu1_out_i.rs1_data = ((exu_wb_rd_addr == idu0_out.rs1_addr) & idu0_out.rs1 & exu_wb_rd_wr_en)
                              ? exu_wb_data : rs1_data;
assign rs1_fwd_idu0 = ((exu_wb_rd_addr == idu0_out.rs1_addr) & idu0_out.rs1 & exu_wb_rd_wr_en);

// Only stall if scoreboard hit AND no forwarding available
assign idu0_rsb_hit_stall = (rs1_rsb_hit_idu0 & ~rs1_fwd_idu0) | (rs2_rsb_hit_idu0 & ~rs2_fwd_idu0);

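The processor only stalls when there’s both a scoreboard hit (register in flight) AND no forwarding path available. If the value happens to be at the writeback stage, forwarding rescues us from the stall. This matters because, assuming the scoreboard bit is cleared on the clock edge that ends the writeback cycle, the bit still reads as set while the result is already present on exu_wb_data. Without the forwarding term, the dependent instruction would wait one extra cycle. Only the rs1 path is shown here; rs2_fwd_idu0 is presumably generated the same way from idu0_out.rs2_addr.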
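The effect of the forwarding term can be checked with a small cycle-by-cycle model. The sketch below restates the two assign expressions as Python predicates and counts stall cycles for one dependent instruction. It assumes the scoreboard bit clears on the clock edge that ends the writeback cycle. The function cycles_stalled and the register number used inside it are illustrative.

```python
def wb_forward(rs_addr: int, rs_used: bool,
               wb_rd_addr: int, wb_rd_wr_en: bool) -> bool:
    """Writeback-to-decode forwarding match for one source operand."""
    return wb_rd_wr_en and rs_used and wb_rd_addr == rs_addr


def rsb_hit_stall(rs1_hit: bool, rs1_fwd: bool,
                  rs2_hit: bool, rs2_fwd: bool) -> bool:
    """Stall only on a scoreboard hit that forwarding cannot cover."""
    return (rs1_hit and not rs1_fwd) or (rs2_hit and not rs2_fwd)


def cycles_stalled(decode_cycle: int, wb_cycle: int,
                   wb_forwarding: bool) -> int:
    """Stall cycles for a consumer whose rs1 producer writes back in wb_cycle."""
    if decode_cycle > wb_cycle:
        return 0  # the producer has already written back
    bit_set = True  # scoreboard bit for the awaited register (r3 here)
    stalls = 0
    cycle = decode_cycle
    while True:
        wb_active = cycle == wb_cycle
        fwd = wb_forwarding and wb_forward(3, True, 3, wb_active)
        if not rsb_hit_stall(bit_set, fwd, False, False):
            return stalls
        stalls += 1
        if wb_active:
            bit_set = False  # assumed to clear at the end of the WB cycle
        cycle += 1
```

With a consumer in decode at cycle 2 and the producer writing back at cycle 6, cycles_stalled(2, 6, True) returns 4 and cycles_stalled(2, 6, False) returns 5, so the forwarding term saves exactly one cycle in this model.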

Conclusion

Register scoreboards solve a fundamental problem in pipelined processor design. While simple forwarding works well for single-cycle operations, multi-cycle operations require a mechanism to track in-flight register writes. The scoreboard provides this tracking at low hardware cost (in its simplest form, one bit per register plus set, clear, and lookup logic), and it works alongside forwarding rather than replacing it.
