Register Scoreboards: Beyond Simple Forwarding in Pipelined Processors
Why operand forwarding alone breaks down with multi-cycle operations—and how a register scoreboard fixes it.
Introduction
Data forwarding (also known as bypassing) is a fundamental technique in pipelined processor design that allows results to be used before they’re written back to the register file. For single-cycle operations in a classic RISC pipeline, forwarding elegantly solves most Read-After-Write (RAW) data hazards. However, when we introduce multi-cycle operations like multiplication, division, or memory loads, simple forwarding becomes insufficient. This article explores why register scoreboards are essential for handling these complex scenarios, using a practical RISC-V implementation as a case study.
The Promise and Limitations of Forwarding
How Basic Forwarding Works
In a standard 5-stage RISC pipeline (Fetch, Decode, Execute, Memory, Writeback), forwarding allows the Execute stage to receive operands directly from later pipeline stages rather than waiting for writeback to the register file:
Time:    t0   t1   t2   t3   t4   t5
Instr1:  IF   ID   EX   MEM  WB
Instr2:       IF   ID   EX   MEM  WB
                        ↑
              Forwarding from EX/MEM
When instruction 2 depends on instruction 1’s result, we can forward from the EX/MEM pipeline register directly to the Execute stage, eliminating the need to stall.
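The bypass selection described above can be sketched as a small behavioral model. This is a minimal Python sketch, not RTL: the `PipeReg` class and `read_operand` function are hypothetical names introduced for illustration, and the model assumes a producer can sit in either the EX/MEM or MEM/WB pipeline register.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PipeReg:
    """Result latched in a pipeline register (hypothetical model, not RTL)."""
    rd: int        # destination register index
    wr_en: bool    # True if the instruction writes a register
    value: int     # result carried down the pipeline


def read_operand(rs: int, regfile: List[int],
                 ex_mem: Optional[PipeReg],
                 mem_wb: Optional[PipeReg]) -> int:
    """Select the Execute-stage operand, preferring the youngest producer."""
    if rs == 0:
        return 0  # x0 is hard-wired to zero in RISC-V
    # EX/MEM holds the more recent instruction, so it is checked first.
    for stage in (ex_mem, mem_wb):
        if stage is not None and stage.wr_en and stage.rd == rs:
            return stage.value
    return regfile[rs]  # no producer in flight: the register file is current


regfile = [0] * 32
regfile[3] = 111                               # stale value still in the register file
ex_mem = PipeReg(rd=3, wr_en=True, value=42)   # Instr1 result, one stage ahead
forwarded = read_operand(3, regfile, ex_mem, None)
unrelated = read_operand(3, regfile, PipeReg(rd=7, wr_en=True, value=9), None)
```

Here `forwarded` picks up the value from EX/MEM instead of the stale register file entry, while `unrelated` falls through to the register file because no in-flight instruction targets r3.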
Where Forwarding Falls Short
Consider a multiply operation that takes 3 cycles to complete:
Time:          t0   t1   t2   t3   t4   t5   t6
MUL r3,r1,r2:  IF   ID   EX1  EX2  EX3  MEM  WB
ADD r4,r3,r5:       IF   ID   EX   MEM  WB
                         ↑
                    Where is r3?
At time t2, when the ADD instruction reaches the decode stage and needs to read r3, the multiply is still in its first execution cycle (EX1). The result won’t be available for forwarding until time t5, three cycles later, yet the ADD would normally enter Execute at t3. Forwarding alone cannot close that gap: the ADD has to stall for two cycles, and something has to detect that it must.
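The timing in this example can be checked with a few lines of arithmetic. The sketch below is a simplified model under one stated assumption: a result becomes forwardable in the cycle after the producer's last execute cycle (from the pipeline register that follows it). The function name `stall_cycles` and its parameters are hypothetical.

```python
def stall_cycles(producer_ex_start: int, producer_latency: int,
                 consumer_ex_start: int) -> int:
    """Cycles the consumer must wait before it can enter Execute.

    producer_ex_start: cycle in which the producer enters its first EX stage
    producer_latency:  number of EX cycles the producer needs
    consumer_ex_start: cycle in which the consumer would enter EX unstalled
    """
    # Assumption: the result is forwardable one cycle after the last EX cycle.
    ready_cycle = producer_ex_start + producer_latency
    return max(0, ready_cycle - consumer_ex_start)


# MUL enters EX1 at t2 and needs 3 cycles; ADD would enter EX at t3.
mul_stalls = stall_cycles(producer_ex_start=2, producer_latency=3,
                          consumer_ex_start=3)

# A single-cycle ALU producer in the same slot needs no stall at all.
alu_stalls = stall_cycles(producer_ex_start=2, producer_latency=1,
                          consumer_ex_start=3)
```

With these inputs the multiply case yields two stall cycles (ready at t5, wanted at t3), and the single-cycle case yields zero, matching the basic forwarding diagram earlier.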
The Multi-Cycle Operation Challenge
Different Operation Latencies
Modern processors must handle operations with varying execution times:
- ALU operations: 1 cycle (ADD, SUB, AND, OR, XOR, shifts)
- Multiply operations: 1-4 cycles (depending on implementation)
- Divide operations: 8-32+ cycles (often iterative)
- Load operations: 1+ cycles (cache hit) to 100+ cycles (cache miss)
Each of these creates a different forwarding challenge:
- Pipelined multi-cycle ops (multiply): Results available at different stages
- Blocking multi-cycle ops (divide): No intermediate results to forward
- Variable latency ops (loads): Unpredictable completion time
The Information Gap
The core problem is that forwarding logic needs to know:
- Which registers are currently unavailable? (RAW hazard detection)
- When will each register’s value be ready? (Stall duration calculation)
- Where should the value be forwarded from? (Bypass path selection)
For single-cycle operations, all three questions have trivial answers. For multi-cycle operations, we need a mechanism to track in-flight register writes.
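To make the three questions concrete, here is a hedged Python sketch of a table that records every in-flight register write. It is an illustrative model only: the `InFlightTable` and `PendingWrite` names, the string labels for bypass points, and the assumption of a known fixed latency at issue time are all introduced here and do not come from any particular core.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class PendingWrite:
    ready_cycle: int   # first cycle in which the value can be bypassed
    source: str        # label of the bypass point (illustrative only)


class InFlightTable:
    """Tracks pending register writes (hypothetical model, fixed latencies)."""

    def __init__(self) -> None:
        self.pending: Dict[int, PendingWrite] = {}

    def issue(self, rd: int, now: int, latency: int, source: str) -> None:
        if rd != 0:  # x0 is never a real destination in RISC-V
            self.pending[rd] = PendingWrite(now + latency, source)

    def is_unavailable(self, rs: int, now: int) -> bool:
        """Question 1: is this register's value still being produced?"""
        entry = self.pending.get(rs)
        return entry is not None and now < entry.ready_cycle

    def stall_for(self, rs: int, now: int) -> int:
        """Question 2: how many cycles until the value is ready?"""
        entry = self.pending.get(rs)
        return max(0, entry.ready_cycle - now) if entry else 0

    def bypass_source(self, rs: int) -> Optional[str]:
        """Question 3: where should the value be forwarded from?"""
        entry = self.pending.get(rs)
        return entry.source if entry else None

    def retire(self, rd: int) -> None:
        self.pending.pop(rd, None)  # value is now in the register file


table = InFlightTable()
table.issue(rd=3, now=2, latency=3, source="EX3/MEM")  # 3-cycle multiply
blocked = table.is_unavailable(3, now=3)
wait = table.stall_for(3, now=3)
where = table.bypass_source(3)
```

A table like this only works when the latency is known at issue. For variable-latency operations such as loads, the ready cycle cannot be filled in up front, which is one reason a simpler pending bit per register is attractive.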
Enter the Register Scoreboard
What is a Register Scoreboard?
A register scoreboard is a hardware structure that tracks which registers have pending writes. At its simplest, it’s a bit vector where each bit indicates whether a register has an in-flight operation:
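logic [31:0] scoreboard; // One pending-write bit per register
always_ff @(posedge clk) begin
  if (!rstn) begin
    scoreboard <= '0; // Assumes rstn is an active-low synchronous reset
  end else begin
    // Clear bit when the operation completes
    if (clear_rd_wr_en) scoreboard[clear_rd_addr] <= 1'b0;
    // Set bit when issuing a long-latency operation (listed last so a same-cycle set wins)
    if (set_rd_wr_en) scoreboard[set_rd_addr] <= 1'b1;
  end
end
// Check for hazards
assign raw_hazard = scoreboard[rs1_addr] | scoreboard[rs2_addr];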
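The same bit-vector idea can be exercised in software. The following Python class is a minimal behavioral sketch, not a translation of any real module: the `Scoreboard` class, its `tick` method, and the choice that a set beats a clear to the same register in the same cycle are assumptions made for illustration.

```python
from typing import Optional


class Scoreboard:
    """One pending-write bit per register (hypothetical behavioral model)."""

    def __init__(self, n_reg: int = 32) -> None:
        self.n_reg = n_reg
        self.bits = 0  # bit i set means register i has a write in flight

    def hit(self, rs: int) -> bool:
        """True if the source register is still waiting for its value."""
        return bool((self.bits >> rs) & 1)

    def tick(self, set_rd: Optional[int] = None,
             clear_rd: Optional[int] = None) -> None:
        """Model one clock edge with optional set and clear requests."""
        if clear_rd is not None:
            self.bits &= ~(1 << clear_rd)
        # Assumption: set is applied last, so a newly issued write to the
        # same register keeps the bit set.
        if set_rd is not None:
            self.bits |= 1 << set_rd


sb = Scoreboard()
sb.tick(set_rd=3)                  # issue MUL r3, r1, r2
hazard_while_busy = sb.hit(3) or sb.hit(5)   # ADD r4, r3, r5 checks r3 and r5
sb.tick(clear_rd=3)                # MUL result written back
hazard_after_wb = sb.hit(3) or sb.hit(5)
```

While the multiply is in flight the ADD sees a hazard on r3. Once the bit is cleared at writeback, the same check passes.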
Implementation in the Tiny Vedas Core
Looking at a real implementation from the Tiny Vedas RISC-V core, the register scoreboard module is instantiated in IDU1:
rsb #(
.N_REG(32)
) rsb_idu0_i (
.clk (clk),
.rstn (rstn),
.pipe_flush (pipe_flush),
.rs1_addr (idu0_out.rs1_addr),
.rs2_addr (idu0_out.rs2_addr),
.rs1_rd_en (idu0_out.rs1 & idu0_out.legal),
.rs2_rd_en (idu0_out.rs2 & idu0_out.legal),
.rs1_hit (rs1_rsb_hit_idu0),
.rs2_hit (rs2_rsb_hit_idu0),
.set_rd_addr (idu0_out.rd_addr),
.set_rd_wr_en (idu0_out.legal & (idu0_out.mul | idu0_out.load) & ~(pipe_stall | idu0_rsb_hit_stall)),
.clear_rd_addr (exu_wb_rd_addr),
.clear_rd_wr_en(exu_wb_rd_wr_en)
);
Key observations:
- Selective tracking: Only multiply and load operations set the scoreboard bit (idu0_out.mul | idu0_out.load)
- Early detection: Scoreboard is checked in IDU0, before register file read in IDU1
- Writeback clearing: Scoreboard bits are cleared when results actually write back
Scoreboard + Forwarding: A Hybrid Approach
The scoreboard doesn’t replace forwarding; it complements it. The implementation shows this hybrid approach:
// WB to IDU1 forwarding
assign idu1_out_i.rs1_data = ((exu_wb_rd_addr == idu0_out.rs1_addr) & idu0_out.rs1 & exu_wb_rd_wr_en)
? exu_wb_data : rs1_data;
assign rs1_fwd_idu0 = ((exu_wb_rd_addr == idu0_out.rs1_addr) & idu0_out.rs1 & exu_wb_rd_wr_en);
// rs2_fwd_idu0 is presumably derived the same way from the rs2 fields (not shown in this excerpt)
// Only stall if scoreboard hit AND no forwarding available
assign idu0_rsb_hit_stall = (rs1_rsb_hit_idu0 & ~rs1_fwd_idu0) | (rs2_rsb_hit_idu0 & ~rs2_fwd_idu0);
The processor only stalls when there’s both a scoreboard hit (register in-flight) AND no forwarding path available. If the value happens to be at the writeback stage, forwarding rescues us from the stall.
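The stall rule is easy to trace cycle by cycle in a small model. The Python sketch below mirrors the boolean structure of the assignments above. The function names `wb_forward` and `rsb_hit_stall` are hypothetical, and the scenario (an ADD reading r3 and r5 while a multiply that writes r3 is in flight) is an assumed example rather than a trace from the core.

```python
def wb_forward(rs_addr: int, rs_used: bool,
               wb_rd_addr: int, wb_wr_en: bool) -> bool:
    """True when writeback is producing exactly the register being read."""
    return rs_used and wb_wr_en and wb_rd_addr == rs_addr


def rsb_hit_stall(rs1_hit: bool, rs1_fwd: bool,
                  rs2_hit: bool, rs2_fwd: bool) -> bool:
    """Stall only for scoreboard hits that forwarding cannot cover."""
    return (rs1_hit and not rs1_fwd) or (rs2_hit and not rs2_fwd)


# ADD r4, r3, r5 is waiting while a MUL that writes r3 is in flight.
rs1_hit, rs2_hit = True, False     # scoreboard: r3 pending, r5 free

# Cycle A: writeback is retiring a different register (r9): no forwarding.
fwd_a = wb_forward(3, True, wb_rd_addr=9, wb_wr_en=True)
stall_a = rsb_hit_stall(rs1_hit, fwd_a, rs2_hit, False)

# Cycle B: the MUL result for r3 reaches writeback: forwarding covers the hit.
fwd_b = wb_forward(3, True, wb_rd_addr=3, wb_wr_en=True)
stall_b = rsb_hit_stall(rs1_hit, fwd_b, rs2_hit, False)
```

In cycle A the ADD stalls. In cycle B the scoreboard bit is still set, but the matching writeback suppresses the stall, so the instruction proceeds one cycle earlier than it would if it waited for the bit to clear.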
Conclusion
Register scoreboards solve a fundamental problem in pipelined processor design. While simple forwarding works well for single-cycle operations, multi-cycle operations require a mechanism to track in-flight register writes. The scoreboard provides this tracking with minimal hardware overhead: in its simplest form, one pending bit per register plus the logic to set, clear, and look up those bits.