Data Processing on Modern Hardware

Jens Teubner, TU Dortmund, DBIS Group
jens.teubner@cs.tu-dortmund.de

Summer 2014
Part IV

Vectorization
Pipelining is one technique to leverage available hardware parallelism.

- Separate chip regions for individual tasks execute independently.
- Advantage: Use parallelism, but maintain **sequential execution semantics** at front-end (here: assembly instruction stream).
- We discussed problems around **hazards** in the previous chapter.
- VLSI technology limits the degree up to which pipelining is feasible. (↗H. Kaeslin. Digital Integrated Circuit Design. Cambridge Univ. Press.)
Hardware Parallelism

Chip area can as well be used for **other types of parallelism**:

- Task 1
- Task 2
- Task 3

Computer systems typically use identical hardware circuits, but their function may be controlled by different **instruction streams** $s_i$:
Do you know an example of this architecture?

This is your multi-core CPU!

Also called MIMD: Multiple Instructions, Multiple Data
(Single-core is SISD: Single Instruction, Single Data.)
Most modern processors also include a **SIMD** unit:

Execute same assembly instruction on a set of values.

Also called **vector unit**; **vector processors** are entire systems built on that idea.
The processing model is typically based on **SIMD registers** or **vectors**:

\[
\begin{array}{ccccc}
    & a_1 & a_2 & \cdots & a_n \\
  + & b_1 & b_2 & \cdots & b_n \\
  \hline
    & a_1 + b_1 & a_2 + b_2 & \cdots & a_n + b_n
\end{array}
\]

Typical values (*e.g.*, x86-64):

- 128-bit wide registers (**xmm0** through **xmm15**).
- Usable as 16 × 8 bit, 8 × 16 bit, 4 × 32 bit, or 2 × 64 bit.
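
The lane widths listed above decide where carries stop. The following is a minimal sketch, assuming an x86-64 compiler with SSE2 intrinsics (`<emmintrin.h>`); the function names `add_bytes` and `add_words` are hypothetical and only serve the illustration.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics; SSE2 is baseline on x86-64 */
#include <stddef.h>
#include <stdint.h>

/* out = a + b, treating every 16-byte block as sixteen 8-bit lanes. */
static void add_bytes(const uint8_t *a, const uint8_t *b, uint8_t *out, size_t n)
{
    assert(n % 16 == 0);  /* simplification: whole registers only */
    for (size_t i = 0; i < n; i += 16) {
        __m128i x = _mm_loadu_si128((const __m128i *) (a + i));
        __m128i y = _mm_loadu_si128((const __m128i *) (b + i));
        _mm_storeu_si128((__m128i *) (out + i), _mm_add_epi8(x, y));
    }
}

/* Same bits, but every 16-byte block is treated as four 32-bit lanes. */
static void add_words(const uint8_t *a, const uint8_t *b, uint8_t *out, size_t n)
{
    assert(n % 16 == 0);
    for (size_t i = 0; i < n; i += 16) {
        __m128i x = _mm_loadu_si128((const __m128i *) (a + i));
        __m128i y = _mm_loadu_si128((const __m128i *) (b + i));
        _mm_storeu_si128((__m128i *) (out + i), _mm_add_epi32(x, y));
    }
}
```

Adding `0xFF` and `1` in byte 0 wraps to `0` within the lane for `add_bytes`, while `add_words` carries into byte 1 of the same 32-bit lane. In neither case does a carry cross a lane boundary.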
Much of a processor’s control logic depends on the number of in-flight instructions and/or the number of registers, but not on the size of registers.

→ scheduling, register renaming, dependency tracking, . . .

SIMD instructions make independence explicit.

→ No data hazards within a vector instruction.
→ Check for data hazards only between vectors.
→ data parallelism

Parallel execution promises \(n\)-fold performance advantage.

→ (Not quite achievable in practice, however.)
How can I make use of SIMD instructions as a programmer?

1. **Auto-Vectorization**
   - Some compilers automatically detect opportunities to use SIMD.
   - Approach rather limited; don’t rely on it.
   - Advantage: platform independent

2. **Compiler Attributes**
   - Use `__attribute__((vector_size (...)))` annotations to state your intentions.
   - Advantage: platform independent
     (Compiler will generate non-SIMD code if the platform does not support it.)
```c
/*
 * Auto vectorization example (tried with gcc 4.3.4)
 */
#include <stdlib.h>
#include <stdio.h>

int main (int argc, char **argv)
{
    int a[256], b[256], c[256];

    for (unsigned int i = 0; i < 256; i++)
    {
        a[i] = i + 1;
        b[i] = 100 * (i + 1);
    }

    for (unsigned int i = 0; i < 256; i++)
    {
        c[i] = a[i] + b[i];
    }

    printf("c = [ %i, %i, %i, %i ]\n", c[0], c[1], c[2], c[3]);

    return EXIT_SUCCESS;
}
```
Resulting assembly code (gcc 4.3.4, x86-64):

```
loop:
  movdqu (%r8,%rcx), %xmm0 ; load a and b
  addl $1, %esi
  movdqu (%r9,%rcx), %xmm1 ; into SIMD registers
  paddd %xmm1, %xmm0 ; parallel add
  movdqa %xmm0, (%rax,%rcx) ; write result to memory
  addq $16, %rcx ; loop (increment by
  cmpl %r11d, %esi ; SIMD length of 16 bytes)
  jb loop
```
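```c
/* Use attributes to trigger vectorization */
#include <stdlib.h>
#include <stdio.h>

typedef int v4si __attribute__((vector_size (16)));

union int_vec {
  int  val[4];
  v4si  vec;
};
typedef union int_vec int_vec;

int
main (int argc, char **argv)
{
  int_vec a, b, c;

  /* initialize inputs (same constants as in the assembly listing below) */
  a.val[0] = 1;   a.val[1] = 2;   a.val[2] = 3;   a.val[3] = 4;
  b.val[0] = 100; b.val[1] = 200; b.val[2] = 300; b.val[3] = 400;

  /* element-wise addition of all four lanes */
  c.vec = a.vec + b.vec;

  printf ("c = [ %i, %i, %i, %i ]\n",
          c.val[0], c.val[1], c.val[2], c.val[3]);

  return EXIT_SUCCESS;
}
```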
Resulting assembly code (gcc, x86-64):

```
movl  $1, -16(%rbp)  ; assign constants
movl  $2, -12(%rbp)  ; and write them
movl  $3, -8(%rbp)   ; to memory
movl  $4, -4(%rbp)
movl  $100, -32(%rbp)
movl  $200, -28(%rbp)
movl  $300, -24(%rbp)
movl  $400, -20(%rbp)
movdqa -32(%rbp), %xmm0 ; load b into SIMD register xmm0
paddd -16(%rbp), %xmm0 ; SIMD xmm0 = xmm0 + a (four 32-bit lanes)
movdqa %xmm0, -48(%rbp) ; write SIMD xmm0 back to memory
movl  -40(%rbp), %ecx ; load c into scalar
movl  -44(%rbp), %edx ; registers (from memory)
movl  -48(%rbp), %esi
movl  -36(%rbp), %r8d
```

Data transfers scalar ↔ SIMD go through memory.
3. **Use C Compiler Intrinsics**

- Invoke SIMD instructions directly via compiler macros.
- Programmer has good control over instructions generated.
- Code is no longer portable to other architectures.
- Benefit (over hand-written assembly): compiler manages register allocation.
- Risk: If not done carefully, automatic glue code (casts, etc.) may make code inefficient.
```c
/*
 * Invoke SIMD instructions explicitly via intrinsics.
 */
#include <stdlib.h>
#include <stdio.h>

#include <emmintrin.h>  /* SSE2: __m128i, _mm_add_epi32, ... */

int
main (int argc, char **argv)
{
  int a[4], b[4], c[4];
  __m128i x, y;

  a[0] = 1;   a[1] = 2;   a[2] = 3;   a[3] = 4;
  b[0] = 100; b[1] = 200; b[2] = 300; b[3] = 400;

  /* unaligned loads: a and b are not guaranteed to be 16-byte aligned */
  x = _mm_loadu_si128 ((__m128i *) a);
  y = _mm_loadu_si128 ((__m128i *) b);

  x = _mm_add_epi32 (x, y);

  _mm_storeu_si128 ((__m128i *) c, x);

  printf ("c = [ %i, %i, %i, %i ]\n", c[0], c[1], c[2], c[3]);

  return EXIT_SUCCESS;
}
```
Resulting assembly code (gcc, x86-64):

```
movdqu -16(%rbp), %xmm1 ; _mm_loadu_si128()
movdqu -32(%rbp), %xmm0 ; _mm_loadu_si128()
paddd  %xmm0, %xmm1     ; _mm_add_epi32()
movdqu %xmm1, -48(%rbp) ; _mm_storeu_si128()
```
SIMD functionality naturally fits a number of **scan-based** database tasks:

- **arithmetics**
  
  ```sql
  SELECT price + tax AS net_price
  FROM orders
  ```

  This is what the code examples on the previous slides did.

- **aggregation**
  
  ```sql
  SELECT COUNT(*)
  FROM lineitem
  WHERE price > 42
  ```

  📘 How can this be done efficiently? Similar: \( \text{SUM}(\cdot) \), \( \text{MAX}(\cdot) \), \( \text{MIN}(\cdot) \), ...
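
One possible answer to the question above, as a minimal sketch: compare four values at a time and subtract the resulting lane masks from four partial counters. The sketch assumes x86-64 with SSE2, 32-bit integer prices, and an input length that is a multiple of four; the function name `count_greater` is hypothetical.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Count the entries of price[0..n-1] that are greater than threshold. */
static uint32_t count_greater(const int32_t *price, size_t n, int32_t threshold)
{
    assert(n % 4 == 0);  /* simplification: no scalar tail loop */

    const __m128i limit  = _mm_set1_epi32(threshold);
    __m128i       counts = _mm_setzero_si128();  /* four partial counters */

    for (size_t i = 0; i < n; i += 4) {
        __m128i v  = _mm_loadu_si128((const __m128i *) (price + i));
        __m128i gt = _mm_cmpgt_epi32(v, limit);  /* -1 where v > limit, else 0 */
        counts = _mm_sub_epi32(counts, gt);      /* subtracting -1 adds 1 */
    }

    /* Combine the four partial counters once, after the loop. */
    uint32_t lanes[4];
    _mm_storeu_si128((__m128i *) lanes, counts);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

The loop body contains no branch and no SIMD-to-scalar transfer; the only transfer happens once at the end. `SUM`, `MAX`, and `MIN` could follow the same pattern with a different per-lane update.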
Selection queries are slightly more tricky:

- There are no branching primitives for SIMD registers.
  - What would their semantics be anyhow?
- Moving data between SIMD and scalar registers is quite expensive.
  - Either go through memory, move one data item at a time, or extract sign mask from SIMD registers.

Thus:

- Use SIMD to generate bit vector; interpret it in scalar mode.

If we can count with SIMD, why can’t we play the $j += (\cdots)$ trick?
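
The "bit vector, then scalar" approach can be sketched as follows. This is a minimal sketch under assumptions: x86-64 with SSE2, 32-bit integer column values, input length a multiple of four, and an output buffer with room for `n` positions. The name `select_greater` is hypothetical. Note that the `j += (...)` step stays in scalar code here, since each SIMD lane would otherwise need its own output position.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics (also provides _mm_movemask_ps) */
#include <stddef.h>
#include <stdint.h>

/* Write all positions i with col[i] > threshold to out[] and return
 * their number.  out[] must have room for n entries. */
static size_t select_greater(const int32_t *col, size_t n,
                             int32_t threshold, uint32_t *out)
{
    assert(n % 4 == 0);  /* simplification: no scalar tail loop */

    const __m128i limit = _mm_set1_epi32(threshold);
    size_t j = 0;

    for (size_t i = 0; i < n; i += 4) {
        __m128i v  = _mm_loadu_si128((const __m128i *) (col + i));
        __m128i gt = _mm_cmpgt_epi32(v, limit);

        /* Extract the sign bit of each 32-bit lane into a scalar:
         * bit k of 'bits' is set iff col[i + k] > threshold. */
        int bits = _mm_movemask_ps(_mm_castsi128_ps(gt));

        /* Interpret the bit vector in scalar mode, without branching. */
        for (int lane = 0; lane < 4; lane++) {
            out[j] = (uint32_t) (i + lane);  /* overwritten if no match */
            j += (bits >> lane) & 1;
        }
    }
    return j;
}
```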
Decompression

**Column decompression** (↗ slides 116ff.) is a good candidate for SIMD optimization.

- Use case: $n$-bit fixed-width **frame of reference** compression; phase 1 (ignore exception values).
  - → no branching, no data dependence

- With 128-bit SIMD registers (9-bit compression):

[Figure: a 128-bit SIMD register, byte positions 15 … 0, holding the packed 9-bit values $v_{13}, \dots, v_0$; values straddle byte boundaries.]

↗Willhalm et al. SIMD-Scan: Ultra Fast in-Memory Table Scan using on-Chip Vector Processing Units. *VLDB 2009.*
**Step 1:** Bring data into proper 32-bit words:

[Figure: input register, byte positions 15 … 0, holding the packed values $v_{13}, \dots, v_0$.]

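shuffle mask (by target byte position):

| 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
|----|----|----|----|----|----|---|---|---|---|---|---|---|---|---|---|
| FF | FF | 4  | 3  | FF | FF | 3 | 2 | FF | FF | 2 | 1 | FF | FF | 1 | 0 |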

- Use **shuffle instructions** to move **bytes** within SIMD registers.
- `__m128i out = _mm_shuffle_epi8 (in, shufmask);`
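
The shuffle step can be tried in isolation. The following is a minimal sketch, assuming x86-64 with SSSE3 and GCC or Clang (the `target` attribute lets the function compile without `-mssse3`); the function name `shuffle_step` is hypothetical. A mask byte with its high bit set (`0xFF`, written as `-1` below) zeroes the target byte.

```c
#include <assert.h>
#include <stdint.h>
#include <tmmintrin.h>  /* SSSE3: _mm_shuffle_epi8 */

/* Copy the byte pairs (0,1), (1,2), (2,3), (3,4) of 'in' into the low
 * halves of four 32-bit words; the upper two bytes of each word are 0. */
__attribute__((target("ssse3")))
static void shuffle_step(const uint8_t in[16], uint8_t out[16])
{
    assert(in && out);

    /* _mm_set_epi8 takes bytes from position 15 down to position 0,
     * i.e. in the same order as the shuffle mask shown above. */
    const __m128i shufmask = _mm_set_epi8(-1, -1, 4, 3,
                                          -1, -1, 3, 2,
                                          -1, -1, 2, 1,
                                          -1, -1, 1, 0);

    __m128i v = _mm_loadu_si128((const __m128i *) in);
    _mm_storeu_si128((__m128i *) out, _mm_shuffle_epi8(v, shufmask));
}
```

With input bytes `0, 1, 2, …, 15`, the four output words contain the byte pairs `(0,1)`, `(1,2)`, `(2,3)`, and `(3,4)`, each followed by two zero bytes.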
**Step 2:** Make all four words identically bit-aligned:

- Shift 0 bits: $v_3$
- Shift 1 bit: $v_2$
- Shift 2 bits: $v_1$
- Shift 3 bits: $v_0$

SIMD shift instructions do not support variable shift amounts!
**Step 3:** Word-align data and mask out invalid bits:

- `__m128i shifted = _mm_srli_epi32 (in, 3);`
- `__m128i result = _mm_and_si128 (shifted, maskval);`
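
Steps 1 to 3 can be put together for the first four 9-bit values. This is a minimal sketch under several assumptions: values are packed least-significant bit first, the target is x86-64 with SSE4.1 and GCC or Clang, and the per-lane shift of step 2 is emulated by a multiplication with powers of two (one common workaround for the missing variable shift; whether the cited paper does exactly this is not confirmed here). The name `unpack4x9` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <smmintrin.h>  /* SSE4.1: _mm_mullo_epi32 (pulls in SSSE3, SSE2) */

/* Decompress the four 9-bit values v0..v3 stored in the first five
 * bytes of 'in' into four 32-bit integers. */
__attribute__((target("sse4.1")))
static void unpack4x9(const uint8_t in[16], uint32_t out[4])
{
    assert(in && out);

    /* Step 1: bring the bytes of v0..v3 into separate 32-bit words.
     * Afterwards word i holds v_i at bit offset i. */
    const __m128i shufmask = _mm_set_epi8(-1, -1, 4, 3,
                                          -1, -1, 3, 2,
                                          -1, -1, 2, 1,
                                          -1, -1, 1, 0);
    __m128i words = _mm_shuffle_epi8(
        _mm_loadu_si128((const __m128i *) in), shufmask);

    /* Step 2: shift word i left by (3 - i) bits, emulated by multiplying
     * with 8, 4, 2, 1.  _mm_set_epi32 lists lanes 3 down to 0. */
    const __m128i factors = _mm_set_epi32(1, 2, 4, 8);
    __m128i aligned = _mm_mullo_epi32(words, factors);

    /* Step 3: every value now sits at bit offset 3; shift and mask. */
    const __m128i maskval = _mm_set1_epi32(0x1FF);
    __m128i shifted = _mm_srli_epi32(aligned, 3);
    __m128i result  = _mm_and_si128(shifted, maskval);

    _mm_storeu_si128((__m128i *) out, result);
}
```

For the packed bytes `FF 01 54 5D 05`, which encode the values 511, 0, 341, and 171, the function returns exactly these four integers.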
For the bit alignment (i.e. extracting and storing) step discussed in Section 4.1.3, we realized the independent 32-bit SIMD shift.

Time to decompress 1 billion integers (Xeon X5560, 2.8 GHz).

[Figure: query time [ms] per compression bit case, comparing unoptimized scalar, optimized scalar, and vectorized implementations.]

Source: Willhalm et al. SIMD-Scan: Ultra Fast in-Memory Table Scan using on-Chip Vector Processing Units. VLDB 2009.
Some SIMD instructions require hard-coded parameters. Thus: **Expand** code explicitly for all possible values of $n$.

- There are at most 32 of them.
- Fits with operator specialization in column-oriented DBMSs (↗ slide 54).

- Loading **constants** into SIMD registers can be relatively expensive (and the number of registers limited).
  - One register for shuffle mask and one register to shift data (step 2) is enough.

- For larger $n$, a compressed word may span **more than 4 bytes**.
  - Additional tricks needed (shift and blend).
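
Explicit code expansion can be sketched with a preprocessor macro that generates one specialized routine per bit width, plus a dispatcher. The sketch below is scalar on purpose, so that it stays short; the same pattern would apply to intrinsics that need constant arguments. All names (`DEFINE_UNPACK`, `unpack_9`, `unpack_for_width`) are hypothetical, only three of the up to 32 widths are expanded, values are assumed to be packed least-significant bit first, and the input is assumed to be padded by at least four bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Expand one decoder per bit width N.  Inside each expansion, N is a
 * literal constant, so mask and stride are known at compile time. */
#define DEFINE_UNPACK(N)                                                  \
    static void unpack_##N(const uint8_t *in, uint32_t *out, size_t cnt)  \
    {                                                                     \
        const uint32_t mask = (uint32_t) ((UINT64_C(1) << (N)) - 1);      \
        assert(in && out);                                                \
        for (size_t i = 0; i < cnt; i++) {                                \
            size_t   bit = i * (N);                                       \
            uint64_t w   = 0;                                             \
            for (int b = 0; b < 5; b++)  /* value spans at most 5 bytes */\
                w |= (uint64_t) in[bit / 8 + b] << (8 * b);               \
            out[i] = (uint32_t) (w >> (bit % 8)) & mask;                  \
        }                                                                 \
    }

DEFINE_UNPACK(4)
DEFINE_UNPACK(9)
DEFINE_UNPACK(12)

typedef void (*unpack_fn)(const uint8_t *, uint32_t *, size_t);

/* Pick the specialized routine once per column, not once per value. */
static unpack_fn unpack_for_width(unsigned n)
{
    switch (n) {
    case 4:  return unpack_4;
    case 9:  return unpack_9;
    case 12: return unpack_12;
    default: return NULL;  /* width not expanded in this sketch */
    }
}
```

A column scan would call `unpack_for_width` once for the column's bit width and then run the returned routine over the whole column.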
Vectorized Predicate Handling

Sometimes it may be sufficient to decompress only partially.

*E.g.*, search queries $v_i < c$:

![Diagram showing vectorized predicate handling](image)

- Only shuffle and mask (but don’t shift).
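
One way to read "shuffle and mask, but don't shift" is to shift the search constant instead of the data: after the shuffle, word $i$ holds $v_i$ at bit offset $i$, so $v_i < c$ can be tested as $(\text{word}_i \wedge \text{mask}_i) < (c \ll i)$. The following is a minimal sketch of that idea for four 9-bit values, assuming x86-64 with SSSE3, GCC or Clang, and least-significant-bit-first packing; the name `less_than_mask4x9` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <tmmintrin.h>  /* SSSE3: _mm_shuffle_epi8 (pulls in SSE2) */

/* Return a 4-bit mask; bit i is set iff the packed 9-bit value v_i < c. */
__attribute__((target("ssse3")))
static int less_than_mask4x9(const uint8_t in[16], uint32_t c)
{
    assert(in && c <= 0x1FF);

    /* Shuffle: word i receives the two bytes that contain v_i. */
    const __m128i shufmask = _mm_set_epi8(-1, -1, 4, 3,
                                          -1, -1, 3, 2,
                                          -1, -1, 2, 1,
                                          -1, -1, 1, 0);
    __m128i words = _mm_shuffle_epi8(
        _mm_loadu_si128((const __m128i *) in), shufmask);

    /* Mask: keep only the 9 bits of v_i, still at bit offset i.
     * _mm_set_epi32 lists lanes 3 down to 0. */
    const __m128i fieldmask = _mm_set_epi32(0x1FF << 3, 0x1FF << 2,
                                            0x1FF << 1, 0x1FF);
    __m128i field = _mm_and_si128(words, fieldmask);

    /* Compare against the constant, pre-shifted once per lane.  All
     * operands are below 2^12, so the signed comparison is safe. */
    const __m128i bound = _mm_set_epi32((int) (c << 3), (int) (c << 2),
                                        (int) (c << 1), (int) c);
    __m128i lt = _mm_cmplt_epi32(field, bound);

    return _mm_movemask_ps(_mm_castsi128_ps(lt));
}
```

The pre-shifted constants depend only on $c$ and the bit width, so they can be prepared once per query rather than once per register.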
Vectorized Predicate Handling: Performance

The performance improvement is generally higher for the bit cases up to 8 bits, where 8 values can be processed in parallel in one SSE register. Therefore, the average speedup factor is 1.58 over all bit cases.

Figure 12. Speedup for decompression by vectorization

The speedup of the SIMD implementation for searching a value (full table scan) in 1B records is shown in Figure 13. The experimental test set for bit case 4 consists of the natural numbers modulo 2^4. Again, the measurements were performed 10 times on a test program executing the search routine as described in Section 4.2, and the median of the 10 runs was used for computing the speedup. For the lower bit compression cases, the search result is very large for a single search value (e.g., if 2 bits are used, a quarter of our test data set is returned). For bit cases 27 onwards, special care is needed to handle compressed values that span across 5 Bytes as shown in Figure 7. As a result, this reduces the performance advantage to the extent that for bit case 31, the vectorized implementation was slower than the scalar version. However, the average speedup factor of a full table scan is still 2.16. In practice, the SIMD implementation is only used in bit cases where it is faster than its scalar counterpart, which is the dominant scenario.

Figure 13. Speedup of full table scan by vectorization

If the result of a full table scan is returned as a bit vector, the running time is independent of the number of hits. However, if a list of indexes is returned, the running time increases for large results as storing the results cannot fully exploit the benefit of storing vector instructions. The best speedup is therefore achieved for very selective queries as graphed in Figure 14, which displays the Speedup vs. Selectivity. Again, entries were processed 10 times and the median was recorded. Each point in the graph displays the average speedup over all bit cases. The overall speedup average is 1.63.

Figure 14. Speedup of full table scan by selectivity

In real world scenarios, and according to our experience at SAP, the compression bits used to compact database columns are mainly in the range of 8 to 16 bits. Figure 15 shows the practical distribution of the compression bit cases against the running time contribution of the table scan routines for a typical customer scenario. Taking this distribution into account, the (weighted) speedup factor for a full table scan is 2.45 over all bit cases.

Figure 15: Running time distribution for customer workload

Finally, we executed the vectorized search in parallel on different processor cores to verify its scalability. Figure 16 shows that the vectorized search scales almost linearly up to eight cores that are installed on the evaluation system. The memory bandwidth leaves sufficient headroom for future processors with

- Speedup versus optimized scalar implementation.

Source: Willhalm et al., SIMD-Scan: Ultra Fast in-Memory Table Scan using on-Chip Vector Processing Units. VLDB 2009.
Use Case: Tree Search

Another SIMD application: in-memory **tree lookups**.

Base case: **binary tree**, scalar implementation:

```c
for (unsigned int i = 0; i < n_items; i++) {
    k = 1; /* tree[1] is root node */
    for (unsigned int lvl = 0; lvl < height; lvl++)
        k = 2 * k + (item[i] <= tree[k]);
    result[i] = data[k];
}
```

- Represent binary tree as array `tree[·]` such that children of `n` are at positions `2n` and `2n + 1`. 
Can we vectorize the outer loop? (i.e., find matches for four input items in parallel)

- Iterations of the outer loop are independent.
- There is no branch in the loop body.

⚠️ Current SIMD implementations do not support scatter/gather!
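
Without a gather instruction, the four node loads of a vectorized outer loop have to be issued as scalar loads. The following is a minimal sketch of that situation, assuming x86-64 with SSE2 and 32-bit integer keys; the name `tree_search4` is hypothetical. It keeps the navigation rule `k = 2 * k + (item <= tree[k])` from the scalar code and returns the final `k` for each of the four items.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Navigate four search items through the tree in parallel.
 * tree[1] is the root; k_out[i] receives the final k for item[i]. */
static void tree_search4(const int32_t *tree, unsigned height,
                         const int32_t item[4], uint32_t k_out[4])
{
    assert(tree && item && k_out);

    const __m128i one   = _mm_set1_epi32(1);
    const __m128i items = _mm_loadu_si128((const __m128i *) item);
    __m128i k = one;

    for (unsigned lvl = 0; lvl < height; lvl++) {
        /* Emulated gather: SIMD -> memory -> four scalar loads -> SIMD. */
        uint32_t idx[4];
        _mm_storeu_si128((__m128i *) idx, k);
        __m128i node = _mm_set_epi32(tree[idx[3]], tree[idx[2]],
                                     tree[idx[1]], tree[idx[0]]);

        /* gt is -1 where item > node and 0 where item <= node, hence
         * 2k + (item <= node) == 2k + 1 + gt. */
        __m128i gt = _mm_cmpgt_epi32(items, node);
        k = _mm_add_epi32(_mm_add_epi32(_mm_add_epi32(k, k), one), gt);
    }

    _mm_storeu_si128((__m128i *) k_out, k);
}
```

The comparison and the index arithmetic run in SIMD, but every level pays for a round trip through memory and four scalar loads, which is the cost the warning above refers to.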
Can we vectorize the inner loop?

- **Data dependency** between loop iterations (variable \( k \)).
- Intuitively: Cannot navigate multiple steps at a time, since first navigation steps are not (yet) known.

**But:**

- Could *speculatively* navigate levels ahead.
**“Speculative” Tree Navigation**

**Idea:** Do comparisons for two levels in parallel.

![Tree Diagram]

*E.g.*, 

1. Compare with nodes 1, 2, and 3 in parallel.
2. Follow link to node 6 and compare with nodes 6, 12, and 13.
3. . . .
SIMD Blocking

Pack tree sub-regions into SIMD registers.

Re-arrange data in memory for this.

Kim et al. FAST: Fast Architecture Sensitive Tree Search on Modern CPUs and GPUs. SIGMOD 2010.
SIMD and Scalar Registers

E.g., search key 59:

- SIMD cmp
- movemask
- scalar register

→ SIMD to compare, scalar to navigate, movemask in-between.
Use scalar `movemask` result as **index** in **lookup table**:

![Tree Navigation Diagram](image.png)
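
The compare, movemask, lookup sequence can be sketched for a single block of three nodes (a root and its two children). This is a minimal sketch under assumptions: x86-64 with SSE2, 32-bit keys, a block stored as `[root, left, right]` with `left < root < right`, and keys greater than a node going to the right. The names `child_in_block` and `child_of_mask` are hypothetical, and the key values in the usage note are made up.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics (also provides _mm_movemask_ps) */
#include <limits.h>
#include <stdint.h>

/* For a two-level block, return which of its four child subtrees
 * (0 = leftmost ... 3 = rightmost) the search continues in. */
static int child_in_block(const int32_t block[3], int32_t key)
{
    /* Index: movemask result (bit 0 = root, bit 1 = left, bit 2 = right).
     * Only four bit patterns can occur in a sorted block; -1 = invalid. */
    static const int8_t child_of_mask[8] = { 0, -1, 1, 2, -1, -1, -1, 3 };

    /* Fourth lane is padding that no key can exceed. */
    __m128i nodes = _mm_set_epi32(INT_MAX, block[2], block[1], block[0]);

    /* SIMD compare: lane i becomes -1 iff key > node i. */
    __m128i gt = _mm_cmpgt_epi32(_mm_set1_epi32(key), nodes);

    /* movemask: one bit per lane, now in a scalar register. */
    int mask = _mm_movemask_ps(_mm_castsi128_ps(gt));

    assert(mask < 8 && child_of_mask[mask] >= 0);
    return child_of_mask[mask];  /* scalar navigation via lookup table */
}
```

For a block with keys 40 (root), 20, and 60, search key 59 is greater than the root and the left child but not the right child, so the mask is binary `011` and the lookup table selects child 2.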

Hierarchical Blocking

Blocking is a good idea also beyond SIMD.

Image source: Kim et al. FAST: Fast Architecture Sensitive Tree Search on Modern CPUs and GPUs. SIGMOD 2010.
SIMD Tree Search: Performance

Source: Kim et al. FAST: Fast Architecture Sensitive Tree Search on Modern CPUs and GPUs. SIGMOD 2010.