Transmitting a result

The p_swre instruction is used to send a result to the continuation (i.e. the forked hart).
The value is written to a result buffer private to the receiving hart.
If the sending hart and the receiving hart are not on neighboring cores, the result travels along a line linking all the cores, advancing by one core per cycle.

The p_lwre instruction is used to receive a result from the forking hart.
The value is read from the current hart's result buffer once it is full (this synchronizes the sender and the receiver).
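
The synchronization through a full result buffer can be pictured with a small software model. The C sketch below is only an illustration under stated assumptions, not a description of the hardware: the names result_buffer_t, model_swre and model_lwre are hypothetical, the buffer is assumed to hold a single pending value, and a successful read is assumed to empty it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical software model of one hart's private result buffer. */
typedef struct {
    int32_t value; /* the transmitted result */
    bool full;     /* set by the sender, tested by the receiver */
} result_buffer_t;

/* Models p_swre: the sender deposits a value in the receiver's buffer. */
static void model_swre(result_buffer_t *rb, int32_t value) {
    assert(!rb->full); /* assumption of this sketch: one pending result at a time */
    rb->value = value;
    rb->full = true;
}

/* Models one issue attempt of p_lwre.
 * Returns false while the buffer is empty (the receiver keeps waiting),
 * true once a value has been read into *rd. */
static bool model_lwre(result_buffer_t *rb, int32_t *rd) {
    if (!rb->full) {
        return false;
    }
    *rd = rb->value;
    rb->full = false; /* assumption of this sketch: reading empties the buffer */
    return true;
}
```

In this model, a call to model_lwre made before model_swre fails and leaves the destination untouched, which mirrors a receiving instruction that waits for the matching sending one.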

In the following example, which computes a dot product, register a0 holds the computed result.
The function iter returns its result (a chunk dot product) through register a0 with the instruction p_swre t0, a0. In this example, the result is always sent to the same core or to the preceding one, in a single cycle.
The receiving hart gets the transmitted result with the p_lwre instruction. Usually, the receiving instruction is decoded by its core before the sending one is decoded in another core. The out-of-order issue mechanism implemented in each core makes the receiving instruction wait for the sending one, while letting the code that follows be fetched, decoded, issued and executed.
In this example, the 32 harts of the processor are filled with calls to iter and with as many result-receiving instructions. Each time an iter call returns, its result-sending instruction p_swre t0, a0 matches a waiting p_lwre, which is then issued.
The execution behaviour is: fill the harts in parallel, compute each chunk dot product in parallel, accumulate the partial sums sequentially and then print.
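
The phases listed above can be sketched in plain C. This is a sequential model under stated assumptions, not the parallel code of the example: the chunk length CHUNK and the function dot_product are hypothetical, iter stands in for the function of the same name in the text, and the result array plays the role of the per-hart result buffers. On the processor the first loop runs in parallel over the harts; here it is an ordinary loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_HARTS 32 /* from the text: the processor has 32 harts */
#define CHUNK 4      /* hypothetical chunk length, chosen for illustration */

/* Chunk dot product; stands in for the iter function of the example. */
static int32_t iter(const int32_t *a, const int32_t *b, size_t n) {
    int32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}

/* Dot product of two vectors of NUM_HARTS * CHUNK elements. */
static int32_t dot_product(const int32_t *a, const int32_t *b) {
    assert(a != NULL && b != NULL);
    int32_t result[NUM_HARTS]; /* models one result buffer per hart */

    /* Fill the harts and compute each chunk dot product.
     * Storing into result[h] models the p_swre of the returning iter call. */
    for (size_t h = 0; h < NUM_HARTS; h++) {
        result[h] = iter(a + h * CHUNK, b + h * CHUNK, CHUNK);
    }

    /* Accumulate the partial sums sequentially.
     * Each read of result[h] models a p_lwre. */
    int32_t total = 0;
    for (size_t h = 0; h < NUM_HARTS; h++) {
        total += result[h];
    }
    return total;
}
```

The final printing step is left to the caller. The point of the sketch is the shape of the computation: the chunk products are independent, whereas the accumulation consumes the results one after the other.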