Part III

Instruction Execution
Pipelining is a CPU implementation technique whereby multiple instructions are overlapped in execution.

- Break CPU instructions into smaller units and pipeline.
- E.g., classical five-stage pipeline for RISC:

```
instr. i
IF → ID → EX → MEM → WB

instr. i + 1
IF → ID → EX → MEM → WB

instr. i + 2
IF → ID → EX → MEM → WB
```

parallel execution
Pipelining in CPUs

Ideally, a $k$-stage pipeline improves performance by a factor of $k$.

- Slowest (sub-)instruction determines clock frequency.
  - Ideally, break instructions into $k$ equi-length parts.
- Issue one instruction per clock cycle (IPC = 1).
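
As a rough worked derivation of the factor-$k$ claim, assume $n$ instructions, $k$ stages of equal length $t$, and no stalls (the symbols $n$ and $t$ are introduced here for illustration only):

\[
T_{\text{seq}} = n \cdot k \cdot t, \qquad
T_{\text{pipe}} = (k + n - 1) \cdot t, \qquad
\frac{T_{\text{seq}}}{T_{\text{pipe}}} = \frac{n \cdot k}{k + n - 1} \;\longrightarrow\; k \quad (n \to \infty)
\]

For example, with $k = 5$ and $n = 1000$ the speedup is $5000 / 1004 \approx 4.98$; the first instruction still needs $k$ cycles to fill the pipeline.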

Examples:

- Intel Pentium 4: 31+ pipeline stages
- Intel Core i7: 16 stages
Hazards

The effectiveness of pipelining is hindered by hazards.

Structural Hazard
Different pipeline stages need same functional unit (resource conflict; e.g., memory access ↔ instruction fetch)

Data Hazard
Result of one instruction not ready before access by later instruction.

Control Hazard
Arises from branches or other instructions that modify PC ("data hazard on PC register").

Hazards lead to pipeline stalls that decrease IPC.
Structural Hazards

A structural hazard will occur if a CPU has only one memory access unit and instruction fetch and memory access are scheduled in the same cycle.

Resolution:
- **Provision** hardware accordingly (e.g., separate fetch units)
- **Schedule** instructions (at compile- or runtime)
Structural hazards can also occur because functional units are **not fully pipelined**.

- *E.g.*, a (complex) floating point unit might not accept new data on every clock cycle.
- Often a space/cost ↔ performance trade-off.
## Data Hazards

<table>
<thead>
<tr>
<th>Instruction</th>
<th>Operands</th>
</tr>
</thead>
<tbody>
<tr>
<td>LD</td>
<td>R1, 0(R2)</td>
</tr>
<tr>
<td>DSUB</td>
<td>R4, R1, R5</td>
</tr>
<tr>
<td>AND</td>
<td>R6, R1, R7</td>
</tr>
<tr>
<td>OR</td>
<td>R8, R1, R9</td>
</tr>
<tr>
<td>XOR</td>
<td>R10, R1, R11</td>
</tr>
</tbody>
</table>

- Subsequent instructions read R1 before LD has written it (the WB stage writes register results).
- This would cause an incorrect execution result.

![Data Processing Pipeline Diagram](image_url)
Data Hazards

Resolution:

- **Forward** result data from instruction to instruction.
  - Could resolve hazard \( \text{LD} \leftrightarrow \text{AND} \) on previous slide (forward \( R1 \) between cycles 2 and 3).
  - **Cannot** resolve hazard \( \text{LD} \leftrightarrow \text{DSUB} \) on previous slide.

- **Schedule** instructions (at compile- or runtime).
  - Cannot avoid all data hazards.

- Detecting data hazards can be hard, *e.g.*, if they go through memory.

\[
\begin{aligned}
\text{SD} & \quad R1, 0(R2) \\
\text{LD} & \quad R3, 0(R4)
\end{aligned}
\]
Tight loops are a good candidate to improve instruction scheduling.

```c
for (i = 1000; i > 0; i = i - 1)
    x[i] = x[i] + s;
```

source: Hennessy & Patterson, Chapter 2
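
One common way to give the scheduler more freedom in such a loop is to unroll it, so that several independent updates sit between two loop branches. The following is a minimal sketch, assuming the trip count is a multiple of four; the function name `add_scalar_unrolled` and the unroll factor are illustrative choices, not taken from the slides.

```c
#include <assert.h>

/* Adds s to x[1..n], unrolled by a factor of four.
 * Assumption: n is a non-negative multiple of 4 (e.g., 1000),
 * and x has at least n + 1 elements (x[0] is unused). */
void add_scalar_unrolled(double *x, int n, double s)
{
    assert(n >= 0 && n % 4 == 0);
    for (int i = n; i > 0; i -= 4) {
        /* four independent updates: none reads the result of another */
        x[i]     = x[i]     + s;
        x[i - 1] = x[i - 1] + s;
        x[i - 2] = x[i - 2] + s;
        x[i - 3] = x[i - 3] + s;
    }
}
```

The four statements in the body carry no data dependences on each other, so loads, additions, and stores of different elements can be interleaved, and the loop overhead (decrement and branch) is paid once per four elements.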
Control Hazards

Control hazards are often more severe than are data hazards.

- Simplest implementation: **flush pipeline, redo instr. fetch**

```
branch instr. i
IF → ID → EX → MEM → WB

instr. i + 1
IF → idle → idle → idle → idle

target instr.
IF → ID → EX → MEM → WB

target instr. + 1
IF → ID → EX → MEM → WB
```

With increasing pipeline depths, the penalty gets **worse**.
A simple optimization is to only flush if the branch was taken.

- Penalty only occurs for taken branches.
- If the two outcomes have different (known) likelihoods:
  - Generate code such that a non-taken branch is more likely.
- Aborting a running instruction is harder when the branch outcome is known late.
  - Should not change exception behavior.

This scheme is called predicted-untaken.

- Likewise: predicted-taken (but often less effective)
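
As a back-of-the-envelope estimate for predicted-untaken, assume a base CPI of 1, a branch frequency $f_b$, a fraction $p_t$ of taken branches, and a penalty of $c$ cycles per taken branch (all symbols and the numbers below are illustrative assumptions, not measurements):

\[
\text{CPI}_{\text{eff}} = 1 + f_b \cdot p_t \cdot c
\]

With $f_b = 0.2$, $p_t = 0.6$ and $c = 3$, this gives $\text{CPI}_{\text{eff}} = 1 + 0.36 = 1.36$, that is, an IPC of roughly $0.74$. Since $c$ grows with the number of stages in front of the branch resolution, deeper pipelines make the same misprediction rate more expensive.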
Modern CPUs try to predict the target of a branch and execute the target code speculatively.

- Prediction must happen early (ID stage too late).

Thus: **Branch Target Buffers (BTBs)**

- Lookup Table: PC → ⟨predicted target, taken?⟩.

<table>
<thead>
<tr>
<th>Lookup PC</th>
<th>Predicted PC</th>
<th>Taken?</th>
</tr>
</thead>
<tbody>
<tr>
<td>⋮</td>
<td>⋮</td>
<td>⋮</td>
</tr>
</tbody>
</table>

- Consult Branch Target Buffer parallel to instruction fetch.
  - If entry for current PC can be found: follow prediction.
  - If not, create entry after branching.

- Inner workings of modern branch predictors are highly involved (and typically kept secret).
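
The lookup-and-update protocol described above can be sketched as a small direct-mapped table. This is a simplified model under stated assumptions: the table size, the indexing by `pc >> 2`, and the names `btb_predict` and `btb_update` are hypothetical, and real predictors are considerably more elaborate.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BTB_ENTRIES 1024            /* hypothetical size; must be a power of two */

typedef struct {
    bool     valid;
    uint64_t lookup_pc;             /* PC of the branch instruction */
    uint64_t predicted_pc;          /* target observed last time */
    bool     taken;                 /* outcome observed last time */
} btb_entry;

typedef struct { btb_entry e[BTB_ENTRIES]; } btb;

static unsigned btb_slot(uint64_t pc)
{
    assert((BTB_ENTRIES & (BTB_ENTRIES - 1)) == 0);
    return (unsigned)((pc >> 2) & (BTB_ENTRIES - 1));
}

/* Consulted in parallel to instruction fetch: returns the predicted next PC. */
uint64_t btb_predict(const btb *t, uint64_t pc, uint64_t fallthrough_pc)
{
    const btb_entry *e = &t->e[btb_slot(pc)];
    if (e->valid && e->lookup_pc == pc && e->taken)
        return e->predicted_pc;
    return fallthrough_pc;          /* no entry, or predicted not taken */
}

/* Called once the branch has resolved: create or refresh the entry. */
void btb_update(btb *t, uint64_t pc, uint64_t target_pc, bool taken)
{
    btb_entry *e = &t->e[btb_slot(pc)];
    e->valid = true;
    e->lookup_pc = pc;
    e->predicted_pc = target_pc;
    e->taken = taken;
}
```

This model simply repeats the last observed outcome; it is meant to illustrate the table structure, not the prediction quality of an actual CPU.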
Selection queries are sensitive to branch prediction:

```sql
SELECT COUNT(*)
    FROM lineitem
    WHERE quantity < n
```

Or, written as C code:

```c
for (unsigned int i = 0; i < num_tuples; i++)
    if (lineitem[i].quantity < n)
        count++;
```
Selection Conditions (Intel Q6700)

![Graph showing execution time vs predicate selectivity for Intel Q6700. The x-axis represents predicate selectivity from 0% to 100%, and the y-axis represents execution time in arbitrary units (a.u.). The graph exhibits a curvilinear trend, peaking around 60% selectivity.]
Predication: Turn **control flow** into **data flow**.

```c
for (unsigned int i = 0; i < num_tuples; i++)
    count += (lineitem[i].quantity < n);
```

- This code does **not** use a branch any more.³
- The price we pay is a + operation for **every** iteration.
- Execution cost should now be **independent** of predicate selectivity.

³except to implement the loop
Predication (Intel Q6700)
This was an example of **software predication**.

✎ **How about this query?**

```sql
SELECT quantity
FROM lineitem
WHERE quantity < n
```

Some CPUs also support **hardware predication**.

- *E.g.*, Intel Itanium2:
  Execute **both** branches of an if-then-else and discard one result.
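```c
#include <stdbool.h>

/* Selection primitive: stores the positions i with in[i] < V in res
 * and returns their number. res needs room for n entries.
 * BRANCH_VERSION is a hypothetical macro that selects the variant. */
int sel_lt_int_col_int_val(int n, int *res, int *in, int V) {
    int j = 0;
    for (int i = 0; i < n; i++) {
#ifdef BRANCH_VERSION
        /* branch version */
        if (in[i] < V)
            res[j++] = i;
#else
        /* predicated version: always write, advance j only on a match */
        bool b = (in[i] < V);
        res[j] = i;
        j += b;
#endif
    }
    return j;
}
```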

Two Cursors

The `count += ...` still causes a **data hazard**.

- This limits the CPU's ability to execute instructions in parallel.

Some tasks can be rewritten to use **two cursors**:

```c
for (unsigned int i = 0; i < num_tuples / 2; i++) {
    count1 += (data[i] < n);
    count2 += (data[i + num_tuples / 2] < n);
}

count = count1 + count2;
```
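
The loop above covers only `2 * (num_tuples / 2)` items, so an odd `num_tuples` leaves the last item uncounted. A minimal sketch of a complete routine, assuming an `int` array and a hypothetical function name `count_lt_two_cursors`:

```c
#include <assert.h>
#include <stddef.h>

/* Counts the items smaller than n using two independent accumulators. */
size_t count_lt_two_cursors(const int *data, size_t num_tuples, int n)
{
    assert(data != NULL || num_tuples == 0);
    size_t half = num_tuples / 2;
    size_t count1 = 0, count2 = 0;

    for (size_t i = 0; i < half; i++) {
        count1 += (data[i] < n);          /* cursor 1: first half  */
        count2 += (data[i + half] < n);   /* cursor 2: second half */
    }
    /* odd num_tuples: the last item is covered by neither cursor */
    if (num_tuples % 2)
        count1 += (data[num_tuples - 1] < n);

    return count1 + count2;
}
```

The two accumulators form separate dependency chains, so the additions of one cursor do not have to wait for those of the other; the single `if` after the loop runs once and does not affect the inner loop.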
Experiments (Intel Q6700)
Conjunctive Predicates

In general, we have to handle multiple predicates:

\[
\begin{aligned}
&\text{SELECT } A_1, \ldots, A_n \\
&\text{FROM } R \\
&\text{WHERE } p_1 \text{ AND } p_2 \text{ AND } \ldots \text{ AND } p_k
\end{aligned}
\]

The standard C implementation uses `&&` for the conjunction:

```c
for (unsigned int i = 0; i < num_tuples; i++)
    if (p_1 && p_2 && ... && p_k)
        ...;
```
Conjunctive Predicates

The `&&` operators introduce even more branches. Because of short-circuit evaluation, the use of `&&` is equivalent to:

```cpp
for (unsigned int i = 0; i < num_tuples; i++)
    if (p_1)
        if (p_2)
            ...
                if (p_k)
                    ...;
```

An alternative is the use of the bitwise `&`, which evaluates all of its operands (each predicate must yield 0 or 1):

```cpp
for (unsigned int i = 0; i < num_tuples; i++)
    if (p_1 & p_2 & ... & p_k)
        ...
```
Conjunctive Predicates

This allows us to express queries with conjunctive predicates without branches.

```c
for (unsigned int i = 0; i < num_tuples; i++)
{
    answer[j] = i;
    j += (p_1 & p_2 & ... & p_k);
}
```
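
To make the pattern concrete, here is a minimal sketch with two predicates. The tuple layout, the predicates `quantity >= lo` and `discount < d`, and the function name are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tuple layout for illustration. */
typedef struct { int quantity; int discount; } tuple;

/* Stores the positions of all tuples with quantity >= lo AND discount < d
 * in answer and returns how many qualified.
 * answer needs room for num_tuples entries. */
size_t select_conj_nobranch(const tuple *t, size_t num_tuples,
                            int lo, int d, size_t *answer)
{
    assert(answer != NULL);
    size_t j = 0;
    for (size_t i = 0; i < num_tuples; i++) {
        answer[j] = i;                        /* written unconditionally */
        /* each comparison yields 0 or 1, so & acts as a conjunction */
        j += (t[i].quantity >= lo) & (t[i].discount < d);
    }
    return j;
}
```

Only the first `j` entries of `answer` are valid after the call; a non-qualifying position is simply overwritten in the next iteration.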
Experiments (Sun UltraSparc IIi)

Fig. 2. Three implementations: Sun.

for each one if we can derive estimates for the function cost and selectivity for the optimization algorithm. Since the loop code is small, we can probably tolerate thousands of such queries with a small expansion in the executable code size. However, for ad-hoc queries we need to be able to allow the functions to be specified at run-time. There are two complementary problems. First, executing a function call (and potentially dereferencing a function pointer as well) may be a significant performance overhead in a tight inner loop. Second, we don't know the selectivities and function costs until query time, and these statistics are important for the selection of the appropriate inner-loop plan. There are several potential solutions to this problem. We outline one below.

When responding to an ad-hoc query, we still may have time to perform the optimization described above, compile a new version of the loop, with the appropriate combination of && and &, and link it into the running code. Systems such as Tempo [Consel and Noel 1996; Noel et al. 1998] allow such run-time compilation. Run-time code specialization of this sort would be beneficial only if the optimization time plus the compilation time are smaller than the improvement in the running-time of the resulting plan. As we saw in Sections 4.4 and 4.5, the optimization time is relatively small. The code to be compiled is also relatively small. For scans of large tables, such an approach may indeed pay off.

Run-time code specialization is different from self-modifying code. With self-modification, a program changes its own byte-code during its execution. While such a technique might actually present the most efficient solution to our code specialization problem, code modification is generally considered to be a bad idea. Such code is not reentrant, sharable, able to reside in ROM; it leads to cache coherency problems; it isn't easy to understand and it is architecture dependent.

6.3 Internal Parallelism

The results for the experiment of Section 3.2 on a Sun UltraSparc are given in Figure 2. Unlike the Pentium, as the selectivity approaches 1, the performance of the && plan continues to worsen. The reason for this behavior is that the Sun can execute multiple instructions at a time. For the & algorithm and the …

source: ACM Transactions on Database Systems, Vol. 29, No. 1, March 2004
A query compiler could use a **cost model** to select between variants.

- `p && q`: When \( p \) is highly selective, this might amortize the double branch misprediction risk.
- `p & q`: Number of branches halved, but \( q \) is evaluated regardless of \( p \)'s outcome.
- `j += ...`: Performs a memory write in **each** iteration.

**Notes:**

- Sometimes, `&&` is necessary to prevent null pointer dereferences: `if (p && p->foo == 42)`.
- Exact behavior is hardware-specific.
Unfortunately, predicting the cost of a variant might be hard.

→ Many parameters involved: characteristics of data, machine, workload, etc.

E.g., branching vs. no-branching in TPC-H Q12:

![Graph showing CPU cycles per tuple for branching and no-branching in TPC-H Q12.](image)

Idea:

- Generate **variants** of primitive operators.
  - with/without branching
  - different compilers
  - operator parameters (hash table configuration, etc.)
- Try to **learn** cost model for each variant.
- **Exploit and explore:**
  - **Profile** every execution to refine the cost model.
  - Choose variant based on cost model (exploit), **but** with a small probability choose a **random variant** (explore).
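
The exploit-and-explore loop can be sketched as a simple epsilon-greedy chooser. This is an illustrative model only: the number of variants, the running-mean cost model, the small linear congruential generator, and all names are assumptions, not the mechanism of any particular system.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_VARIANTS 3                 /* hypothetical, e.g. branching / no branching / ... */

typedef struct {
    double   avg_cost[NUM_VARIANTS];   /* learned cost model: mean cost per execution */
    uint64_t runs[NUM_VARIANTS];       /* number of profiled executions */
    uint32_t rng_state;                /* state of a small linear congruential generator */
    double   explore_prob;             /* probability of choosing a random variant */
} variant_chooser;

static uint32_t next_random(variant_chooser *c)
{
    c->rng_state = c->rng_state * 1664525u + 1013904223u;
    return c->rng_state >> 8;
}

/* Exploit the cost model, but explore with probability explore_prob. */
int choose_variant(variant_chooser *c)
{
    double u = (double)(next_random(c) & 0xFFFFu) / 65536.0;  /* in [0, 1) */
    if (u < c->explore_prob)
        return (int)(next_random(c) % NUM_VARIANTS);          /* explore */

    int best = 0;
    for (int v = 0; v < NUM_VARIANTS; v++) {
        if (c->runs[v] == 0)
            return v;                  /* unprofiled variant: measure it first */
        if (c->avg_cost[v] < c->avg_cost[best])
            best = v;
    }
    return best;                       /* exploit */
}

/* Profile every execution: fold the measured cost into the running mean. */
void record_cost(variant_chooser *c, int v, double cost)
{
    assert(v >= 0 && v < NUM_VARIANTS);
    c->runs[v]++;
    c->avg_cost[v] += (cost - c->avg_cost[v]) / (double)c->runs[v];
}
```

A plain running mean reacts slowly when the data characteristics change mid-query; a windowed or decayed average would be one way to let the model forget old measurements.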
Micro Adaptivity

Vector-At-A-Time Execution:

→ Re-consider variant choice for every n vectors.

→ Adapt to specifics of the particular query/operator.

→ Even adjust to varying characteristics as the query progresses.

[Figure panels: (a) Q14: Selection (`select_>=_sint_col_sint_val`), variants: no branching, branching, micro adaptive. (b) Q7: Selection (`select_<=_sint_col_sint_val`), variants: clang, gcc, icc, micro adaptive.]
Use Case: (De-)Compression

Compression can help overcome the I/O bottleneck of modern CPUs.

- disk ↔ memory
- memory ↔ cache (!)
- Column stores have high potential for compression. ✏️ Why?

But:

- (De-)compression has to be fast.
- 200–500 MB/s (LZRW1 and LZOP) won’t help us much.
- Aim for multi-gigabyte per second decompression speeds.
- Maximum compression rate is not a goal.
MonetDB/X100 implements lightweight compression schemes:

**PFOR (Patched Frame-of-Reference)**
small integer values that are positive offsets from a base value; one base value per (disk) block

**PFOR-DELTA (PFOR on Deltas)**
encode **differences** between subsequent items using PFOR

**PDICT (Patched Dictionary Compression)**
integer codes refer into an array for values (the dictionary)

All three compression schemes allow **exceptions**, values that are too far from the base value or not in the dictionary.
E.g., compress the digits of $\pi$ using 3-bit PFOR compression.

<table>
<tbody>
<tr>
<th>compressed data</th>
<td>header</td>
<td>3 1 4 1 5 ⊥ 2 6 5 3 5 ⊥ ⊥ 7 ⊥ 3 2</td>
</tr>
<tr>
<th>exceptions</th>
<td></td>
<td>9 9 8 9</td>
</tr>
</tbody>
</table>

Decompressed numbers: 31415926535897932
During decompression, we have to consider all the exceptions:

```c
for (i = j = 0; i < n; i++)
    if (code[i] != ⊥)
        output[i] = DECODE (code[i]);
    else
        output[i] = exception[--j];
```

For PFOR, `DECODE` is a simple addition:

```c
#define DECODE(a) ((a) + base_value)
```

⚠️ The **branch** in the above code may bear a high **misprediction risk**.
Misprediction Cost

Figure 6.4: Decompression bandwidth, branch miss rate and instructions-per-cycle depending on the exception rate.

Demonstrates most clearly on Pentium4 how NAIVE decompression throughput rapidly deteriorates as the exception rate gets nearer to 50%. The cause are branch mispredictions on the if-then-else test for an exception, that becomes impossible to predict. In the graph on top, we see that the IPC takes a nosedive to 0.5 at that point, showing that branch mispredictions are severely penalized by the 31 stage pipeline of Pentium4.

To avoid this problem, we propose the following alternative “patch” approach:

```
#include <stdlib.h>

int Decompress<ANY>( int n, int bitwidth, ANY *__restrict__ output, void *__restrict__ input, ANY *__restrict__ exception, int *next_exception)
{
    int next, code[n], cur = *next_exception;
    UNPACK[bitwidth](code, input, n); /* bit-unpack the values */
    /* LOOP1: decode regardless of exceptions */
    
    return 0;
}
```

We collected IPC, cache misses, and branch misprediction statistics using CPU event counters on all test platforms.

As with predication, we can avoid the high misprediction cost if we’re willing to invest some unnecessary work.

Run decompression in **two phases**:

1. **Decompress** all regular fields, but don’t care about exceptions.
2. Work in all the exceptions and **patch** the result.

```c
/* ignore exceptions during decompression */
for (i = 0; i < n; i++)
    output[i] = DECODE(code[i]);

/* patch the result */
foreach exception
    patch corresponding output item;
```
We don't want to use a branch to find all exception targets!

**Thus:** interpret values in “exception holes” as **linked list**:

\[
\underbrace{\text{header} \;\; 3 \;\; 1 \;\; 4 \;\; 1 \;\; 5 \;\; \boxed{5} \;\; 2 \;\; 6 \;\; 5 \;\; 3 \;\; 5 \;\; \boxed{0} \;\; \boxed{1} \;\; 7 \;\; \boxed{3} \;\; 3 \;\; 2}_{\text{compressed data}}
\quad \cdots \quad
\underbrace{9 \;\; 9 \;\; 8 \;\; 9}_{\text{exceptions}}
\]

Boxed entries are the exception holes; each hole stores the distance to the next hole, minus one.

→ Can now traverse exception holes and patch in exception values.
The resulting decompression routine is branch-free:

```c
/* ignore exceptions during decompression */
for (i = 0; i < n; i++)
    output[i] = DECODE(code[i]);

/* patch the result (traverse linked list) */
j = 0;
for (cur = first_exception; cur < n; cur = next) {
    next = cur + code[cur] + 1;
    output[cur] = exception[--j];
}
```

→ See slide 129 for experimental data on two-loop decompression.
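
The routine above can be checked against the $\pi$ example. The sketch below wraps the two loops in a function; the signature, the name `pfor_decompress`, and the convention that `exception` points just past the exception area (so that `exception[--j]` reads it back to front) are assumptions made for this illustration.

```c
#include <assert.h>

#define DECODE(a) ((a) + base_value)

/* Two-loop PFOR decompression.
 * code:            n codes; exception holes hold the distance to the
 *                  next hole minus one
 * exception:       points just past the exception area (assumption),
 *                  which is read back to front
 * first_exception: position of the first hole (>= n if there is none) */
void pfor_decompress(const int *code, int n, int base_value,
                     const int *exception, int first_exception, int *output)
{
    assert(n >= 0 && first_exception >= 0);
    int i, j, cur, next;

    /* LOOP 1: ignore exceptions during decompression */
    for (i = 0; i < n; i++)
        output[i] = DECODE(code[i]);

    /* LOOP 2: patch the result (traverse linked list) */
    j = 0;
    for (cur = first_exception; cur < n; cur = next) {
        next = cur + code[cur] + 1;
        output[cur] = exception[--j];
    }
}
```

For the digits of $\pi$ with base value 0, the codes are `3 1 4 1 5 5 2 6 5 3 5 0 1 7 3 3 2`, the exception area holds `9 9 8 9`, and the first hole is at position 5; the second loop then visits positions 5, 11, 12, and 14 and stops because the next position (18) lies beyond the 17 codes. The only remaining branches are the loop conditions.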
Example

3-bit-PFOR-compressed representation of the digits of $e$?

$e = 2.71828182845904523536028747135266249775724709369959574966967627724076630353547594571382178525\ldots$
PFOR Compression Speed

Section 6.2: Super-scalar compression

![Graphs showing IPC and Bandwidth for Xeon 3GHz, Opteron 2GHz, and Itanium 1.3GHz.](image)

On Itanium2, the branch mispredictions are avoided thanks to branch predication explained in Section 2.1.5.2. As a result, the performance of the NAIVE kernel closely tracks that of PFOR and PDICT, as presented in the rightmost graph in Figure 6.4. Overall, the patching schemes are clearly to be preferred over the NAIVE approach, as they are faster on all tested architectures.

6.2.5 Compression

Previous database compression work mainly focuses on decompression performance, and views compression as a one-time investment that is amortized by repeated use of the compressed data. This is caused by the low throughput of compression, often an order of magnitude slower than decompression (see Figure 6.2), such that compression bandwidth is clearly lower than I/O write.

The actual execution of instructions is handled in individual functional units:
- e.g., load/store unit, ALU, floating point unit.
- Often, some units are replicated.

Chance to execute multiple instructions at the same time.

Intel’s Nehalem, for instance, can process up to 4 instructions at the same time.
- IPC can be as high as 4.

→ Such CPUs are called superscalar CPUs.
Higher IPCs are achieved with help of **dynamic scheduling**.

Instructions are **dispatched** to **reservation stations**.

- They are **executed** as soon as all hazards are cleared.
- **Register renaming** helps to reduce data hazards.

This technique is also known as **Tomasulo’s algorithm**.
Example: Dynamic Scheduling in MIPS

[Figure: MIPS floating-point unit with dynamic scheduling. Components shown: instruction queue (fed from the instruction unit), FP registers, address unit, load buffers, store buffers, reservation stations, FP adders, FP multipliers, memory unit; connected by operand buses, operation bus, and common data bus (CDB).]
Instruction-Level Parallelism

Usually, not all units can be kept busy with a single instruction stream:

Reasons:
- data hazards, cache miss stalls, ...
**Idea:** Use the spare slots for an independent instruction stream.

This technique is called **simultaneous multithreading**.⁴

Surprisingly few changes are required to implement it.

- Tomasulo’s algorithm requires **virtual registers** anyway.
- Need separate fetch units for both streams.

⁴Intel uses the term “hyperthreading.”
Resource Sharing

Threads **share** most of their resources:

- **caches** (all levels),
- branch prediction functionality (to some extent).

This may have **negative effects**... 
- threads that **pollute** each other’s caches

...but also **positive effects**.
- threads that **cooperatively** use the cache?
Use Cases

Tree-based indexes:

Hash-based indexes:

Both cases depend on hard-to-predict pointer chasing.
Helper Threads

Idea:

- Next to the main processing thread run a helper thread.
- Purpose of the helper thread is to prefetch data.
- Helper thread works ahead of the main thread.

- Main thread populates work-ahead set with pointers to prefetch.
Consider the traversal of a tree-structured index:

```
foreach input item do
    read root node, prefetch level 1 ;
    read node on tree level 1, prefetch level 2 ;
    read node on tree level 2, prefetch level 3 ;
    ...
```

Helper thread will not have enough time to prefetch.
**Thus:** Process input in groups.

```
foreach group g of input items do
    foreach item in g do
        read root node, prefetch level 1 ;
    foreach item in g do
        read node on tree level 1, prefetch level 2 ;
    foreach item in g do
        read node on tree level 2, prefetch level 3 ;
    ...
```

Data may now have arrived in caches by the time we reach next level.
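
The group-wise traversal can be sketched in C for an implicit binary search tree stored in an array. Several things here are assumptions for illustration: the tree layout (children of node `k` at `2k` and `2k + 1`), the group size, the function names, and the use of the GCC/Clang builtin `__builtin_prefetch` as a single-threaded stand-in for handing pointers to a helper thread.

```c
#include <assert.h>
#include <stddef.h>

#define GROUP_SIZE 8   /* hypothetical group size */

/* One step down an implicit binary search tree stored in
 * tree[1 .. 2^levels - 1]: the children of node k are 2k and 2k + 1. */
static size_t descend(const int *tree, size_t k, int key)
{
    return 2 * k + (key >= tree[k]);
}

/* Item-at-a-time traversal, for reference. */
size_t lookup_single(const int *tree, int levels, int key)
{
    size_t k = 1;
    for (int l = 0; l < levels; l++)
        k = descend(tree, k, key);
    return k - ((size_t)1 << levels);   /* leaf slot */
}

/* Group-at-a-time traversal: all items of a group finish level l
 * before any of them touches level l + 1. */
void lookup_group(const int *tree, int levels,
                  const int *keys, size_t num_keys, size_t *result)
{
    assert(levels >= 1);
    size_t pos[GROUP_SIZE];
    size_t num_nodes = ((size_t)1 << levels) - 1;

    for (size_t base = 0; base < num_keys; base += GROUP_SIZE) {
        size_t g = num_keys - base < GROUP_SIZE ? num_keys - base : GROUP_SIZE;

        for (size_t i = 0; i < g; i++)
            pos[i] = 1;                                 /* start at the root */

        for (int l = 0; l < levels; l++) {
            for (size_t i = 0; i < g; i++) {
                pos[i] = descend(tree, pos[i], keys[base + i]);
                if (pos[i] <= num_nodes)
                    __builtin_prefetch(&tree[pos[i]]);  /* request next level early */
            }
        }
        for (size_t i = 0; i < g; i++)
            result[base + i] = pos[i] - ((size_t)1 << levels);
    }
}
```

Between issuing the prefetch for an item and reading that node on the next level, the loop works on the other items of the group, which is the time window the prefetch needs.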
Helper Thread

Helper thread accesses addresses listed in work-ahead set, e.g.,

```c
temp += *((int *) p);
```

- **Purpose:** load data into caches
  - 💡 **Why not use `prefetch` assembly instructions?**
- Only **read** data; do **not** affect semantics of main thread.
- Use a **ring buffer** for work-ahead set.
- **Spin-lock** if helper thread is too fast.

🔍 **Which thread is going to be the faster one?**
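
The work-ahead set can be sketched as a fixed-size ring buffer. The version below is deliberately single-threaded to show only the data structure; the capacity and all names are hypothetical, and an actual two-thread implementation would additionally need atomic (or otherwise synchronized) accesses to `head` and `tail`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define WORK_AHEAD_SLOTS 64          /* hypothetical capacity; power of two */

typedef struct {
    const void *slot[WORK_AHEAD_SLOTS];
    size_t head;                     /* next slot the main thread writes */
    size_t tail;                     /* next slot the helper thread reads */
} work_ahead_set;

/* Main thread: publish a pointer to prefetch.
 * Returns false if the buffer is full. */
bool was_push(work_ahead_set *w, const void *p)
{
    assert(w != NULL);
    if (w->head - w->tail == WORK_AHEAD_SLOTS)
        return false;
    w->slot[w->head % WORK_AHEAD_SLOTS] = p;
    w->head++;
    return true;
}

/* Helper thread: take the next pointer and read from it to pull the
 * data into the cache. Returns false if the set is empty (this is
 * where the helper would spin). */
bool was_touch_next(work_ahead_set *w, int *temp)
{
    assert(w != NULL);
    if (w->tail == w->head)
        return false;
    const void *p = w->slot[w->tail % WORK_AHEAD_SLOTS];
    w->tail++;
    *temp += *(const int *)p;        /* read only; the sum itself is irrelevant */
    return true;
}
```

The helper only reads through the stored pointers, so it cannot change what the main thread computes.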
Problems

There’s a high chance that both threads access the same cache line at the same time.

- Must ensure in-order processing.
- CPU will raise a Memory Order Machine Clear (MOMC) event when it detects parallel access.
  - Pipelines flushed to guarantee in-order processing.
  - MOMC events cause a high penalty.
- Effect is worst when the helper thread spins to wait for new data.

Thus:
- Let helper thread work backward.
Experiments (Tree-Structured Index)

Figure 5: Clustered and Unclustered Probes

Figure 4(c) shows the distribution of cache miss events per probe.


© Jens Teubner · Data Processing on Modern Hardware · Winter 2018/19