Hardware techniques to exploit ILP

Exploiting local ILP

The processor provides slots to host the local window of instructions from which local ILP is exploited.
In a pipeline, each stage represents a slot in which an instruction resides. An n-stage pipeline may host up to n dynamically consecutive instructions (typically, a pipeline hosts between one and a few tens of instructions).
Hardware techniques like bypassing and branch delay slots help keep the pipeline filled.
Bypassing serves to shorten the stall caused by a RAW (read-after-write) dependency. The value is forwarded to the reading instruction as soon as it is computed, rather than after it has been written back (on the next figure, the value r1-1 is forwarded to the bne instruction).
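The gain from forwarding can be sketched with a toy timing model. This assumes a classic 5-stage pipeline (IF ID EX MEM WB, not stated in the text) where the consumer needs the value at the start of its EX stage, and where, without forwarding, the register file is written in the first half of WB and readable in the second half of the same cycle.

```python
# Assumed 5-stage pipeline: IF(1) ID(2) EX(3) MEM(4) WB(5).
# The producer computes its result at the end of EX and writes it to
# the register file in WB (write first half, read second half of the
# same cycle). The dependent instruction, fetched one cycle later,
# needs the value at the start of its EX stage (cycle 4 if unstalled).

def raw_stall_cycles(forwarding: bool) -> int:
    value_ready = 3 if forwarding else 5   # end of EX vs. end of WB
    consumer_ex = 4                        # consumer's EX cycle, no stall
    return max(0, value_ready + 1 - consumer_ex)

print(raw_stall_cycles(forwarding=True))   # 0: value bypassed EX -> EX
print(raw_stall_cycles(forwarding=False))  # 2: must wait for write-back
```

With forwarding, the sub/bne pair of the figure runs back to back; without it, the bne loses two cycles waiting for r1 to reach the register file.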
A branch delay slot serves to hide the latency of a computed control-flow instruction. While the branch target address is computed, the following instructions are fetched and inserted into the pipeline to be run. They are semantically part of the computation preceding the branch even though they statically follow it (on the next figure, the add instruction is run whether the branch is taken or not).
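A compiler fills the delay slot by hoisting an instruction the branch does not depend on from before the branch to after it. The sketch below uses hypothetical instruction tuples (name, destination, sources) and only checks that the branch does not read the moved result; a real scheduler also checks dependences against every instruction the candidate moves past.

```python
# Hypothetical instruction encoding: (name, dest, sources).
# Simplified legality check: the hoisted instruction must not produce
# a source of the branch (full dependence checking is omitted here).

def fill_delay_slot(block):
    """block ends with a branch; fill one delay slot after it."""
    *body, branch = block
    for i in range(len(body) - 1, -1, -1):
        name, dest, srcs = body[i]
        if dest not in branch[2]:          # branch does not read its result
            moved = body.pop(i)
            return body + [branch, moved]  # moved instr fills the slot
    return body + [branch, ("nop", None, ())]

block = [("add", "r4", ("r2", "r3")),
         ("sub", "r1", ("r1",)),
         ("bne", None, ("r1",))]
scheduled = fill_delay_slot(block)
print(scheduled)
# the add lands in the delay slot: it is executed whether or not the
# branch is taken, matching the semantics described above
```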
To increase the size of the local window, superscalar, VLIW, or EPIC processors are used.
A superscalar processor offers multiple identical and generic pipelines. Each pipeline can handle any instruction.
A VLIW (Very Long Instruction Word) processor offers multiple specific pipelines. Each pipeline can handle a subset of the instructions, e.g. an integer computation pipeline, a memory access pipeline, and a branch pipeline. The compiler packs multiple instructions into a large word adapted to the set of available pipelines.
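The packing step can be sketched as a greedy bundler. The word layout below (one integer slot, one memory slot, one branch slot) is an assumption for illustration, and data dependences between the packed instructions are ignored for brevity; a real bundler checks them.

```python
# Assumed VLIW word layout: one slot per kind in {"int", "mem", "branch"}.
# Greedy bundling: add instructions to the current word until a slot
# type repeats, then start a new word. Dependence checks are omitted.

def bundle(instrs):
    words, current = [], {}
    for kind, text in instrs:
        if kind in current:          # slot already taken: close the word
            words.append(current)
            current = {}
        current[kind] = text
    if current:
        words.append(current)
    return words

prog = [("int", "add r1,r2,r3"), ("mem", "ld r4,0(r5)"),
        ("int", "sub r6,r6,1"), ("branch", "bne r6,loop")]
words = bundle(prog)
for w in words:
    print(w)
# the four instructions fit in two words: {add, ld} then {sub, bne}
```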
An EPIC (Explicitly Parallel Instruction Computing) processor is a VLIW processor with a large set of registers that allows static register renaming. Static renaming increases local ILP by removing some name dependencies, which helps the compiler find more instructions to fill the words.
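Renaming removes WAR and WAW (name) dependencies by giving each new write a fresh register and redirecting later reads to the latest mapping. A minimal sketch, with registers named p0, p1, ... as an assumed convention:

```python
# Each new write to an architectural register gets a fresh register,
# so only true RAW dependencies remain; reads follow the latest mapping.

def rename(instrs, n_arch):
    mapping = {r: f"p{r}" for r in range(n_arch)}  # initial identity map
    fresh = n_arch
    out = []
    for dest, srcs in instrs:                      # (dest, sources) pairs
        new_srcs = tuple(mapping[s] for s in srcs)
        mapping[dest] = f"p{fresh}"                # fresh destination
        fresh += 1
        out.append((mapping[dest], new_srcs))
    return out

# r1 = r2+r3 ; r2 = r1+1 (WAR on r2) ; r1 = r4+r5 (WAW on r1)
renamed = rename([(1, (2, 3)), (2, (1,)), (1, (4, 5))], n_arch=8)
print(renamed)
# → [('p8', ('p2', 'p3')), ('p9', ('p8',)), ('p10', ('p4', 'p5'))]
```

After renaming, the third instruction no longer conflicts with the first two and may be scheduled in parallel with them.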
To hide long latencies (e.g. a division or a memory access with multi-level cache misses) the processor may run instructions out-of-order.
An out-of-order processor issues instructions according to the availability of their sources and of the needed operator (functional unit). For example, while a memory load is in flight, instructions independent of the loaded value can be issued even though they follow the load in the dynamic trace.
The processor is organized around a set of concurrent engines rather than a pipeline.
The fetch engine reads instructions in-order and saves them in a fetch buffer.
The decode and rename engine reads, in order, the instructions saved by the fetch engine. It eliminates name dependencies and detects RAW dependencies through register renaming. The needed operator is identified by decoding the instruction. The renamed instructions are saved in a ReOrder Buffer (ROB), which represents the window from which local ILP can be exploited.
The issue engine selects ready renamed instructions, i.e. those whose sources and needed operator are all available. The operation may take multiple cycles. When the result is computed, it is written to a write-back buffer and the instructions in the ROB waiting for the written value are notified. Waiting instructions may be issued out-of-order.
The write-back engine selects write-back buffer entries to write their result to the destination register file. Terminated instructions are marked in the ROB. Instructions may terminate out-of-order.
The commit engine removes the terminated instructions from the ROB in-order.
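The interplay of these engines can be sketched with a toy cycle-level model, assuming renaming has already been done (unique destinations), unlimited issue width, and unlimited functional units. It shows issue happening out of order while commit stays in program order:

```python
# Toy out-of-order core: an instruction issues as soon as its sources
# are ready, its result becomes available after its latency, and the
# ROB head commits strictly in program order.

def simulate(program):
    # program: list of (name, dest, srcs, latency), in program order
    ready = {}              # register -> cycle its value becomes available
    rob = [{"name": n, "dest": d, "srcs": s, "lat": l, "done": None}
           for n, d, s, l in program]
    issue_order, commit_order, cycle, head = [], [], 0, 0
    while head < len(rob):
        for e in rob:       # issue pass (unlimited width and units)
            if e["done"] is None and all(ready.get(r, 0) <= cycle
                                         for r in e["srcs"]):
                e["done"] = cycle + e["lat"]
                ready[e["dest"]] = e["done"]
                issue_order.append(e["name"])
        while (head < len(rob) and rob[head]["done"] is not None
               and rob[head]["done"] <= cycle):
            commit_order.append(rob[head]["name"])   # in-order commit
            head += 1
        cycle += 1
    return issue_order, commit_order

prog = [("load", "r1", (), 4),            # long-latency load
        ("add",  "r2", ("r1",), 1),       # depends on the load
        ("mul",  "r3", ("r4", "r5"), 1)]  # independent: issues early
print(simulate(prog))
# → (['load', 'mul', 'add'], ['load', 'add', 'mul'])
```

The independent mul issues under the in-flight load, exactly the situation described above, yet it commits last because commit follows program order.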
To further hide long latencies, out-of-order processors can be equipped with speculation units. For example, a branch predictor speculates on the outcome of a branch instruction. The predicted target is fetched before the branch has been computed. If the bet is successful, it is as if the branch computation had a latency equal to the prediction duration (e.g. a single cycle). Hence, the branch computation latency is partially (multiple-cycle prediction) or even fully (single-cycle prediction) hidden.
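As an illustration, here is a 2-bit saturating-counter predictor, a classic scheme chosen here for concreteness (the text does not name a specific predictor). On a loop branch taken nine times and then not taken, it mispredicts only on the first and last iterations:

```python
# 2-bit saturating counter per branch PC:
# 0-1 predict not-taken, 2-3 predict taken.

class TwoBitPredictor:
    def __init__(self):
        self.table = {}                       # branch PC -> counter

    def predict(self, pc):
        return self.table.get(pc, 1) >= 2     # True = predict taken

    def update(self, pc, taken):
        c = self.table.get(pc, 1)
        self.table[pc] = min(3, c + 1) if taken else max(0, c - 1)

bp = TwoBitPredictor()
hits = 0
outcomes = [True] * 9 + [False]   # loop branch: taken 9 times, exits once
for taken in outcomes:
    hits += bp.predict(0x40) == taken
    bp.update(0x40, taken)
print(f"{hits}/{len(outcomes)} correct")   # → 8/10 correct
```

Every correct prediction turns the branch into a single-cycle operation from the fetch engine's point of view; every misprediction exposes the full branch latency plus the pipeline flush.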
A speculative processor may speculate on branches, loads, values and anything which has a long latency. Branch speculation helps fill the ROB, while other speculations related to execution help empty it, making room for new instructions. The more efficient the execution speculation is, the more accurate the branch predictor must be to keep the ROB filled and provide a high local ILP.

Exploiting global ILP

Multithreading (or SMT, for Simultaneous MultiThreading) is a substitute for speculation to hide latencies.
A multithreaded processor inputs instructions from multiple threads. After renaming, the instructions seem to belong to a single thread, as the architectural registers of the different input threads are mapped into a single set of renaming registers. The threads share the issue engine, including the set of functional units. The write-back engine sends each result to its thread's destination.
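This shared mapping can be sketched with an assumed free-list allocator: both threads may write "their" r1, yet after renaming the two writes land in distinct registers of the one shared physical file.

```python
# Sketch of SMT renaming: architectural registers of each thread map
# into one shared physical register file via a free list (assumed
# allocator; register names p0, p1, ... are an illustrative convention).

free_list = [f"p{i}" for i in range(16)]
mapping = {}                                  # (thread, arch_reg) -> phys reg

def rename_write(thread, arch_reg):
    mapping[(thread, arch_reg)] = free_list.pop(0)
    return mapping[(thread, arch_reg)]

# both threads write "their" r1: no conflict in the shared file
p_a = rename_write("T0", "r1")
p_b = rename_write("T1", "r1")
print(p_a, p_b)   # → p0 p1: same architectural name, distinct registers
```

Once renamed, instructions from both threads intermix freely in the issue engine, which is what lets one thread's work fill the latency bubbles of the other.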
If the hosted threads belong to different applications, multithreading improves processor throughput but not application latency (latency may even be slightly degraded because of some cycles stolen by the other concurrently running threads). But when the hosted threads are all related to a single parallelized application (e.g. an OpenMP code), then they do improve its latency through global ILP exploitation.

Exploiting global and local ILP

To exploit both local and global ILP, a multithreaded out-of-order processor is required. Out-of-order computation exploits local ILP and multithreading exploits global ILP. As multithreading is a substitute for speculation, the processor may avoid speculation.

The processor may host a single core or multiple cores. For a given die area, the aim is to maximize instruction throughput. There is a trade-off between a set of superscalar, multithreaded, out-of-order, speculative cores and a set of scalar, multithreaded, out-of-order, non-speculative cores.
It is probably more efficient to rely on many scalar cores than on a few superscalar ones.

The scalar cores should allow out-of-order computation to facilitate inter-thread producer-to-consumer synchronisation. An instruction reading a value written by an instruction of another thread waits until the producing instruction has sent the value to be consumed.
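The producer/consumer wait described above can be mimicked at the software level with a blocking queue: the consumer blocks until the producer's value arrives, just as the reading instruction waits for the producing instruction of the other thread. A minimal sketch using Python threads:

```python
# Inter-thread producer-to-consumer hand-off: the consumer's "read"
# blocks until the producer's "write" delivers the value.

import queue
import threading

channel = queue.Queue(maxsize=1)

def producer():
    channel.put(41 + 1)           # "write": send the computed value

def consumer(out):
    out.append(channel.get())     # "read": blocks until the value arrives

result = []
t_c = threading.Thread(target=consumer, args=(result,))
t_p = threading.Thread(target=producer)
t_c.start(); t_p.start()
t_c.join(); t_p.join()
print(result)   # → [42]
```

The consumer thread is started first on purpose: it simply waits in channel.get() until the producer delivers, which is the hardware behaviour the paragraph describes.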