### Data Processing on Modern Hardware

Jens Teubner, TU Dortmund, DBIS Group jens.teubner@cs.tu-dortmund.de

Summer 2014

### Part VII

# FPGAs for Data Processing

#### Motivation

Modern hardware features a number of "speed-up tricks":

- caches,
- instruction scheduling (out-of-order exec., branch prediction, ...),
- parallelism (SIMD, multi-core),
- throughput-oriented designs (GPUs).

Combining these "tricks" is essentially an **economic choice**:

- → chip space ≡ €€€
- → chip space ↔ component selection ↔ workload

### Another Constraint: Power

Can use transistors for either logic or caches.



### Heterogeneous Hardware



|                       | (a) Large-Core Homogeneous | (b) Small-Core Homogeneous                   | (c) Heterogeneous                            |
|-----------------------|----------------------------|----------------------------------------------|----------------------------------------------|
| Large-core throughput | 1                          |                                              | 1                                            |
| Small-core throughput |                            | Pollack's Rule: (5/25)<sup>0.5</sup> = 0.45  | Pollack's Rule: (5/25)<sup>0.5</sup> = 0.45  |
| Total throughput      | 6                          | 13                                           | 11                                           |

### Field-Programmable Gate Arrays

**Field-Programmable Gate Arrays (FPGAs)** are yet another point in the design space.

- "Programmable hardware."
- Make (some) design decisions **after** chip fabrication.

### **Promises** of FPGA technology:

- → Build application-/workload-specific circuit.
- → Spend chip space only on functionality that you really need.
- → Tune for throughput, latency, energy consumption, …
- → Overcome limits of general-purpose hardware with regard to task at hand (e.g., I/O limits).

## Field-Programmable Gate Arrays



- An array of logic gates
- Functionality fully programmable
- Re-programmable after deployment ("in the field")
- → "programmable hardware"

- FPGAs can be configured to implement **any** logic circuit.
- Complexity bound by available chip space.
  - → Obviously, the effective chip space is less than in custom-fabricated chips (ASICs).

### Basic FPGA Architecture



- chip layout: 2D array
- Components
  - CLB: Configurable Logic Block ("logic gates")
  - IOB: Input/Output Block
  - DCM: Digital Clock Manager
- Interconnect Network
  - signal lines
  - configurable switch boxes

## Signal Routing



## Configurable Logic Block (CLB)



### Programming FPGAs

Programming is usually done using a hardware description language.

- E.g., **VHDL**<sup>6</sup>, Verilog
- High-level circuit description

Circuit description is compiled into a **bitstream**, then loaded into SRAM cells on the FPGA:



<sup>6</sup>VHSIC Hardware Description Language

### Example: VHDL

HDLs enable programming language-like descriptions of hardware circuits.

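```
library ieee;
use ieee.std_logic_1164.all;

-- Entity declaration is not on the slide; the 8-bit port widths are assumed.
entity compare is
  port (A, B : in  std_logic_vector(7 downto 0);
        C    : out std_logic);
end compare;

architecture Behavioral of compare is
begin
  -- Combinational process: re-evaluated whenever A or B changes.
  process (A, B)
  begin
    if (A = B) then
      C <= '1';
    else
      C <= '0';
    end if;
  end process;
end Behavioral;
```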

VHDL can be synthesized, but also executed in software (simulation).

### Real-World Hardware



- Simplified Virtex-5 XC5VFXxxxT floor plan
- Frequently used high-level components are provided in discrete silicon
- BlockRAM (BRAM): set of blocks that each store up to 36 kbits of data
- DSP48 slices: 25x18-bit multipliers followed by a 48-bit accumulator
- CPU: two full embedded PowerPC 440 cores

### Development Board with Virtex-5 FPGA



|                      | Virtex-5 XC5VLX110T |
|----------------------|---------------------|
| Lookup Tables (LUTs) | 69,120              |
| Block RAM (kbit)     | 5,328               |
| DSP48 Slices         | 64                  |
| PowerPC Cores        | 0                   |
| max. clock speed     | ≈ 450 MHz           |
| release year         | 2006                |

source: Xilinx Inc., ML50x Evaluation Platform, User Guide.



Low-level speed of configurable gates is slower than in custom-fabricated chips (clock frequencies: ~100 MHz).

→ Compensate with efficient circuit for problem at hand.

#### State Machines

The key asset of FPGAs is their inherent **parallelism**.

- Chip areas naturally operate independently and in parallel.

For example, consider finite-state automata.



→ non-deterministic automaton for `.*abc.*d`

#### State Machines

How would you implement an automaton in software?

Problems with state machine implementations in software:

- In **non-deterministic automata**, several states can be active at a time, which requires **iterative** execution on sequential hardware.
- **Deterministic automata** avoid this problem at the expense of a significantly higher **state count**.
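The iterative software approach can be sketched in Python for the non-deterministic automaton for `.*abc.*d`. This is only an illustration: the state numbering (0 to 4), the `TRANSITIONS` table, and the function names are assumptions and do not appear on the slides.

```python
# Transitions as (source state, required symbol or None for "any", target state).
TRANSITIONS = [
    (0, None, 0),  # .* before "abc"
    (0, "a", 1),
    (1, "b", 2),
    (2, "c", 3),
    (3, None, 3),  # .* before "d"
    (3, "d", 4),
]
ACCEPT = 4


def step(active, symbol):
    """Compute the set of active states after reading one symbol."""
    nxt = set()
    for src, label, dst in TRANSITIONS:  # sequential loop over all transitions
        if src in active and (label is None or label == symbol):
            nxt.add(dst)
    return nxt


def matches(text):
    """Return True as soon as the accepting state becomes active."""
    active = {0}
    for ch in text:
        active = step(active, ch)
        if ACCEPT in active:
            return True
    return False


print(matches("xxabcyyd"), matches("abd"))
```

Every input symbol costs one pass over the transition table here, which is the iterative execution the slide refers to.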

#### State Machines in Hardware

Automata can be translated mechanically into hardware circuits.

- each state → flip-flop
   (A flip-flop holds a single bit of information, just the right amount to keep the 'active' / 'not active' information.)
- transitions:
  - → **signals** ("wires") between states
  - **conditioned** on current input symbol (→ 'and' gate)
  - multiple sources for one flip-flop input → 'or' gate.
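The mapping above can be mimicked in software as a rough model, again for `.*abc.*d`. In this sketch every flip-flop is one Boolean, `and` stands in for the gate that conditions a transition on the input symbol, and `or` merges multiple sources. The names `clock_cycle` and `run` and the five-state layout are assumptions for illustration, not part of the slides.

```python
def clock_cycle(ff, ch):
    """One clock cycle: all next flip-flop values derive from the current ones."""
    q0, q1, q2, q3, q4 = ff
    is_a, is_b, is_c, is_d = ch == "a", ch == "b", ch == "c", ch == "d"
    return (
        q0,                   # start state: self-loop on any symbol
        q0 and is_a,          # 'and' gate: source active AND symbol is 'a'
        q1 and is_b,
        (q2 and is_c) or q3,  # two sources for one flip-flop: 'or' gate
        q3 and is_d,          # accepting state
    )


def run(text):
    """Feed one symbol per cycle; report whether the accept flip-flop was ever set."""
    ff = (True, False, False, False, False)
    for ch in text:
        ff = clock_cycle(ff, ch)
        if ff[4]:
            return True
    return False


print(run("xxabcyyd"), run("abd"))
```

Unlike the loop over a transition table, every expression in `clock_cycle` corresponds to a small piece of logic that, on an FPGA, would be evaluated simultaneously with all others.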

### State Machines in Hardware







### Use Case: Network Intrusion Detection

Analyze network traffic using **regular expressions**.

- Scan for known attack tools.
- Prevent exploitation of known security holes.
- Scan for shell code.

E.g., Snort (http://www.snort.org/)

→ Hundreds of (regular expression-based) rules.

**Idea:** Instantiate a hardware state machine for each rule.

- → Leverage available hardware parallelism.
- → Challenge: optimize for high throughput.

### Predicate Decoding

### **Optimization 1:** Centralized character classification



→ Optimizes for **space**, **not** for speed.

#### Character/predicate decoder:

- Use FPGA logic resources **or**
- use on-chip **BRAM** (configure as ROM and use as lookup table).
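The BRAM variant amounts to a 256-entry read-only table that maps each input byte to a bit vector of predicate results. A minimal sketch, assuming five hypothetical predicates (`is_a` to `is_d` and `is_digit`) that are not taken from the slides:

```python
# Hypothetical predicates; a real rule set would derive these from its regexes.
PREDICATES = [
    ("is_a", lambda byte: byte == ord("a")),
    ("is_b", lambda byte: byte == ord("b")),
    ("is_c", lambda byte: byte == ord("c")),
    ("is_d", lambda byte: byte == ord("d")),
    ("is_digit", lambda byte: ord("0") <= byte <= ord("9")),
]

# "ROM" contents: one word per possible input byte, one bit per predicate.
ROM = [
    sum(1 << bit for bit, (_, test) in enumerate(PREDICATES) if test(byte))
    for byte in range(256)
]


def decode(byte):
    """Classify a byte with a single table lookup."""
    return ROM[byte]


def holds(name, byte):
    """Extract one predicate signal from the decoded word."""
    bit = [n for n, _ in PREDICATES].index(name)
    return bool((ROM[byte] >> bit) & 1)


print(format(decode(ord("a")), "05b"), holds("is_digit", ord("7")))
```

All state machines can then share the decoded predicate signals instead of each comparing the raw input byte themselves.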

### Predicate Decoding Factored Out



### Signal Propagation Delay

**Signal propagation delays** determine a circuit's **speed**.

- Here: One state transition per clock cycle.
- Longest signal path → maximum clock frequency



### Propagation Delays and Many State Machines

Straightforward design with many rules and one input:



### **Pipelining**

#### **Optimization 2: Pipelining**

→ What matters is longest path between any two registers (flip-flops).



- → Introduce **pipeline registers**.
- → Flip side of the idea?
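A back-of-the-envelope model may help here. The sketch below derives the maximum clock frequency from the longest register-to-register path and shows one possible answer to the flip-side question: every pipeline register adds a clock cycle of latency (and occupies flip-flops). All delay values are made-up numbers, and register setup overhead is ignored.

```python
def max_clock_mhz(path_delays_ns):
    """Maximum clock frequency, limited by the longest register-to-register path."""
    return 1000.0 / max(path_delays_ns)


# Hypothetical numbers: one 8 ns combinational path ...
unpipelined = [8.0]
# ... cut by two pipeline registers into three shorter paths.
pipelined = [3.0, 3.0, 2.0]

for name, paths in (("unpipelined", unpipelined), ("pipelined", pipelined)):
    print(f"{name}: {max_clock_mhz(paths):.1f} MHz, latency {len(paths)} cycle(s)")
```

With these assumed delays the clock rises from 125 MHz to roughly 333 MHz, while a signal now needs three cycles instead of one to traverse the circuit.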

### Pipelining in Practice



## Multi-Character Matching

In a finite state automaton, the state $s_{i+1}$ at step $i+1$ depends on

- the previous state  $s_i$ ,
- the input symbol  $\sigma_i$ , and
- a transition function f:

$$s_{i+1} = f(s_i, \sigma_i) .$$

Consequently:

$$s_{i+2} = f(s_{i+1}, \sigma_{i+1}) = f(f(s_i, \sigma_i), \sigma_{i+1}) .$$

That is, with the help of a new transition function

$$F(s_i, \sigma_i, \sigma_{i+1}) \stackrel{\text{def}}{=} f(f(s_i, \sigma_i), \sigma_{i+1}) ,$$

an automaton can accept two input symbols per clock cycle.
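The composition can be checked in a few lines of Python. The sketch reuses the automaton for `.*abc.*d` with an assumed state numbering; `f` operates on sets of active states, and `F` is defined exactly as $f(f(s_i, \sigma_i), \sigma_{i+1})$. The `EDGES` table and the helper names are illustrative assumptions.

```python
from functools import reduce

# (source state, required symbol or None for "any", target state)
EDGES = [(0, None, 0), (0, "a", 1), (1, "b", 2), (2, "c", 3), (3, None, 3), (3, "d", 4)]


def f(state, symbol):
    """Single-symbol transition function on sets of active states."""
    return frozenset(dst for src, label, dst in EDGES
                     if src in state and label in (None, symbol))


def F(state, sym1, sym2):
    """Two-symbol transition function, defined as f(f(s, sym1), sym2)."""
    return f(f(state, sym1), sym2)


def run_single(text):
    """Consume one symbol per step."""
    return reduce(f, text, frozenset({0}))


def run_double(text):
    """Consume two symbols per step; a trailing odd symbol uses f."""
    state = frozenset({0})
    for i in range(0, len(text) - 1, 2):
        state = F(state, text[i], text[i + 1])
    if len(text) % 2:
        state = f(state, text[-1])
    return state


print(run_single("xxabcyyd") == run_double("xxabcyyd"))
```

In this model an accepting state that is active only after the first symbol of a pair is not visible at the pair boundary, so a design that reports matches would presumably have to expose that intermediate result as well.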

### Multi-Character Encoding

In hardware:



- Trade-off: space ↔ performance
- Downside: longer signal paths

### Putting it Together (Snort Workload)



Yang et al. Compact Architecture for High-Throughput Regular Expression Matching on FPGA. ANCS 2008.

(Virtex-4 LX100;  $\approx$  100k 4-LUTs;  $\approx$  100k flip-flops)

### Use Case: XML Projection

#### Example:

#### **Projection paths:**

```
keep descendants
{ //regions//item,
   //regions//item/name #,
   //regions//item/incategory }
```

**Challenge:** Avoid explicit synthesis for each query.

### Advantage: FPGA System Integration

Here: In-network filtering



In general: FPGA in the data path.

- disk → CPU
- memory → CPU
- …

### XPath → Finite State Automata

Automaton for //a/b/c//d:



In hardware: (see also earlier slides)



### Compilation to Hardware



#### Skeleton Automaton

### Separate the difficult parts from the latency-critical ones



### Skeleton Automaton

**Thus:** Build skeleton automaton that can be **parameterized** to implement **any** projection query.



#### Intuitively:

- Runtime configuration determines the presence of the `*` self-loops.
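A simplified software analogue of this idea: one fixed matching routine (the "skeleton") whose behavior is set purely by parameters, namely one tag per segment plus a flag that enables the `*` self-loop. The sketch matches a root-to-node tag path rather than a raw XML stream, and the names `configure` and `matches` are assumptions for illustration only.

```python
import re


def configure(path):
    """Turn a path such as '//a/b/c//d' into skeleton parameters.

    Each segment becomes (tag, has_loop); has_loop is True for '//',
    meaning the state in front of the segment keeps its '*' self-loop.
    """
    return [(tag, axis == "//") for axis, tag in re.findall(r"(//?)([^/]+)", path)]


def matches(segments, tag_path):
    """Check whether a root-to-node tag path ends in the accepting state."""
    accept = len(segments)
    active = {0}
    for tag in tag_path:
        nxt = set()
        for state in active:
            if state == accept:
                continue  # accepting state has no outgoing transitions here
            seg_tag, has_loop = segments[state]
            if has_loop:
                nxt.add(state)      # '*' self-loop enabled by configuration
            if tag == seg_tag:
                nxt.add(state + 1)  # advance on a matching tag
        active = nxt
    return accept in active


cfg = configure("//a/b/c//d")
print(cfg, matches(cfg, ["x", "a", "b", "c", "y", "d"]))
```

Switching to another query only changes the parameter list returned by `configure`; the matching routine itself stays untouched, which mirrors how a skeleton avoids re-synthesis per query.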

### Again: Pipelining



→ Side effect: Can support self and descendant-or-self axes.

# Scalability



## Application Speedup



# Skyline Queries

#### **Problem:**

- Pareto-optimal set of multi-dimensional data points.
- x **dominates** y ( $x \prec y$ ) iff for every dimension i:  $x_i \leq y_i$  and for at least one dimension j:  $x_j < y_j$ .
- Skyline points: all y not dominated by any x.
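The definitions translate directly into a naive quadratic reference implementation. The sketch below assumes points are tuples of equal length and that smaller values are better, as in the definition of $\prec$ above; the sample data is made up.

```python
def dominates(x, y):
    """x dominates y: no worse in every dimension, strictly better in at least one."""
    return (all(a <= b for a, b in zip(x, y))
            and any(a < b for a, b in zip(x, y)))


def skyline(points):
    """All points that are not dominated by any other point (O(n^2) comparisons)."""
    return [y for y in points if not any(dominates(x, y) for x in points)]


points = [(1, 5), (2, 2), (5, 1), (3, 3), (2, 6)]
print(skyline(points))
```

Here `(3, 3)` is dominated by `(2, 2)` and `(2, 6)` by `(1, 5)`, so three points remain.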



→ Parallelize, keep on-chip routing distance short

# "Lemming's Got Talent"

- → Lemmings have multiple skills (dimensions)
- → Determine "best" Lemmings

#### Let Lemmings battle on a narrow bridge:



- $p_0$ dominates $q_i$ → $q_i$ falls off the bridge.
- $q_i$ dominates $p_0$ → $p_0$ falls off the bridge, $q_i$ becomes new $p_0$.
- Battle undecided → let $q_i$ requeue.
- A Lemming that has survived a full round is a "skyline Lemming."

# "Lemming's Got Talent"—Second Year

To speed up the process, let a **set of** $p_j$ stay on the bridge:



- → Challengers $q_i$ fight against multiple $p_j$ in turn.
- → $q_i$ and/or multiple $p_j$ might fall off the bridge.
- → Keep surviving $q_i$ on bridge if there is space, otherwise requeue.
- → Standard algorithm Block Nested Loops (BNL).

```
foreach Lemming q_i ∈ queue do
    isDominated = false;
    foreach Lemming p_j ∈ bridge do
        if q_i.timestamp > p_j.timestamp then
            bridge.movetoskyline(p_j);   /* p_j ∈ Lemming skyline */
        else if q_i ≺ p_j then
            bridge.drop(p_j);
        else if p_j ≺ q_i then
            isDominated = true;
            break;
    if not isDominated then
        timestamp(q_i);
        if bridge.isFull() then
            queue.insert(q_i);
        else
            bridge.insert(q_i);
```
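For experimentation, the pseudocode can be turned into a runnable Python sketch. Several details are assumptions that the slide leaves open: fresh tuples start with timestamp 0, timestamps come from a simple counter, the bridge is a list of bounded size, and whatever remains on the bridge once the queue is empty is reported as skyline.

```python
from collections import deque
from itertools import count


def dominates(x, y):
    """x dominates y (smaller is better in every dimension)."""
    return (all(a <= b for a, b in zip(x, y))
            and any(a < b for a, b in zip(x, y)))


def bnl_skyline(points, bridge_size):
    clock = count(1)                        # assumed timestamp source
    queue = deque((p, 0) for p in points)   # 0 = not yet timestamped
    bridge, result = [], []
    while queue:
        q, q_ts = queue.popleft()
        dominated, survivors = False, []
        for idx, (p, p_ts) in enumerate(bridge):
            if q_ts > p_ts:
                result.append(p)            # p has met every tuple still queued
            elif dominates(q, p):
                continue                    # p falls off the bridge
            elif dominates(p, q):
                dominated = True
                survivors.extend(bridge[idx:])  # p and the rest stay untouched
                break
            else:
                survivors.append((p, p_ts))
        bridge = survivors
        if not dominated:
            ts = next(clock)
            if len(bridge) >= bridge_size:
                queue.append((q, ts))       # bridge full: requeue
            else:
                bridge.append((q, ts))
    result.extend(p for p, _ in bridge)     # queue empty: survivors are skyline
    return result


points = [(1, 5), (2, 2), (5, 1), (3, 3), (4, 4), (2, 6)]
print(sorted(bnl_skyline(points, bridge_size=2)))
```

The bridge size only affects how often tuples are requeued, not the result, which makes it easy to compare runs with a small and a large bridge.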

# Block Nested Loops Algorithm

Design goal of BNL: Eliminate I/O Bottleneck



→ Compute load remains (mostly) unchanged.

# "Lemming's Got Talent"—Third Year

Let multiple (pairs of) Lemmings battle in parallel.



- Challengers $q_i$ move from left to right.
- Potential skyline Lemmings  $p_j$  move from right to left.
- Either can fall off the cliff if dominated.
- On the right end, challengers become potential skyline Lemmings (if there is space on the bridge), otherwise they requeue.

### Parallel BNL with FPGAs

Parallel battles can be realized on distinct processing nodes  $\nu_i$ .



- Nodes form a list where  $\nu_j$  only communicates with  $\nu_{j-1}$  and  $\nu_{j+1}$ .
  - → Challengers $q_i$ forwarded from left to right.
  - → Potential skyline tuples forwarded from right to left.
- Effectively,  $q_i$  scans over current window (as in BNL).
- **Trick:** Causality still holds.  $q_i$  "sees" effect of any preceding challenger, but not of any following challenger.

### **Implementation**

- Let all  $\nu_i$  operate in lock-step.<sup>7</sup>
- Process in two alternating phases:
  - **Evaluate:** Compute dominance; drop tuples if need be.
  - **Shift:** Exchange data ("Lemmings") between nodes.
- In practice, exchanging tuples is trickier. For high dimensionality, data can be passed only **one dimension at a time**.



<sup>7</sup>We tried to avoid this when we did "handshake joins" on multi-core hardware, because of the high synchronization cost. But on FPGAs this is really cheap.

### **Experiments**

Randomly distributed data; seven dimensions (1.48 % skyline density).



### Experiments

Correlated data; seven dimensions (0.013 % skyline density).



→ FPGA bottlenecked by the memory interface of the particular FPGA board.

## Experiments

Anti-correlated data; seven dimensions (19.8 % skyline density).



→ Benefit of FPGA solution is greatest when it is most needed (*i.e.*, when running times are very high).

## The Frequent Item Problem

#### **Problem:**

Given an input stream S, which items in S occur most often?

- Exact solution too expensive ($\mathcal{O}(\min\{|S|, |A|\})$ space)
- Good **approximate** solutions available.
  - Space-Saving by Metwally et al.
  - In-depth study: Cormode and Hadjieleftheriou (VLDB 2008)

# Space-Saving (Metwally et al., TODS 2006)

Space-Saving tries to "monitor" only items that are frequent.

```
foreach stream item x ∈ S do
    find bin b_x with b_x.item = x;            /* lookup by item */
    if such a bin was found then
        b_x.count ← b_x.count + 1;
    else
        b_min ← bin with minimum count value;  /* lookup by count */
        b_min.count ← b_min.count + 1;
        b_min.item ← x;
```

#### Main complexity:

- Look up bin that monitors the input item x.
- Find bin with minimum count value.
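A direct, unoptimized Python rendering of Space-Saving makes the two lookups explicit. It assumes `k` bins that start empty with count 0 and that `None` never occurs as a stream item; both lookups are plain linear scans here, whereas efficient software implementations typically rely on auxiliary data structures.

```python
def space_saving(stream, k):
    """Approximate frequent-item counts using k bins."""
    items = [None] * k   # assumed initial state: empty bins
    counts = [0] * k
    for x in stream:
        if x in items:                       # lookup by item
            counts[items.index(x)] += 1
        else:
            m = counts.index(min(counts))    # lookup by count
            counts[m] += 1                   # inherit and increment the old count
            items[m] = x
    return {item: c for item, c in zip(items, counts) if item is not None}


result = space_saving("abacabad", k=2)
print(result)
```

Each input item increments exactly one counter, so the counts always sum to the stream length; items that replaced a bin may therefore be overestimated.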

# Space-Saving in Software



Code by Cormode and Hadjieleftheriou, Intel Core2 Duo, 2.66 GHz

## Data-Parallel Frequent Item on FPGAs

**Idea:** Use available (data) parallelism to make searches efficient.

Perform all item searches in parallel:



Find bin with **minimum count** using a tree:
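The tree-based minimum search can be modeled as a pairwise tournament in which each level corresponds to one layer of comparators. The function name `tree_min` and the returned depth are illustrative additions; the sketch assumes a non-empty list of counts.

```python
def tree_min(counts):
    """Return (index of a minimum bin, number of comparator levels)."""
    level = list(enumerate(counts))          # (bin index, count) pairs
    depth = 0
    while len(level) > 1:
        nxt = []
        for j in range(0, len(level) - 1, 2):
            a, b = level[j], level[j + 1]
            nxt.append(a if a[1] <= b[1] else b)  # one comparator per pair
        if len(level) % 2:
            nxt.append(level[-1])            # odd element passes through
        level = nxt
        depth += 1
    return level[0][0], depth


print(tree_min([5, 3, 8, 1, 4, 7, 2, 6]))
```

For eight bins the tournament needs three levels, so the tree depth (and with it the signal path) grows logarithmically with the number of bins.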



### **Evaluation**



**Problem:** Increasing signal propagation delays.

Teubner, Müller, and Alonso. FPGA Acceleration for the Frequent Item Problem. ICDE 2010.

### Don't Think in Software

- Organize monitored items as an **array** (→ keep things local).



1. **Compare** input item $x_1$ to content of bin $b_i$ (and **increment** *count* value if a match was found).
2. **Order** bins $b_i$ and $b_{i+1}$ according to *count* values.
3. **Move** $x_1$ forward in the array and repeat.

→ Drop $x_1$ into **last bin** if no match can be found.

# **Pipelining**

The idea seems **terribly inefficient**:  $\mathcal{O}(\# \text{ bins})$  vs.  $\mathcal{O}(\log(\# \text{ bins}))$ .

#### **But:**

- All sub-tasks are simple, all processing stays local.
- Thus, the processing of multiple input items can be **parallelized**.



Multiple input items  $x_i$  can traverse this **pipeline** if they keep sufficient distance.

## Algorithm

```
foreach stream item x ∈ S do
    i ← 1;
    while i < k do
        if b_i.item = x then
            b_i.count ← b_i.count + 1;
            continue foreach;
        else if b_i.count < b_{i+1}.count then
            swap contents of b_i and b_{i+1};
        else
            i ← i + 1;
    /* replace last bin if x was not found */
    b_k.count ← b_k.count + 1;
    b_k.item ← x;
```
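A sequential Python transcription of this algorithm can serve as a functional reference. It uses 0-based indices, assumes bins start empty with count 0, and processes one item at a time; it does not model several items travelling through the pipeline concurrently.

```python
def array_frequent_items(stream, k):
    """Array-based Space-Saving variant: compare, order neighbors, move on."""
    items = [None] * k   # assumed initial state: empty bins
    counts = [0] * k
    for x in stream:
        i, found = 0, False
        while i < k - 1:
            if items[i] == x:
                counts[i] += 1
                found = True
                break                        # "continue foreach"
            elif counts[i] < counts[i + 1]:
                # order neighbors so that the smaller count moves toward the end
                items[i], items[i + 1] = items[i + 1], items[i]
                counts[i], counts[i + 1] = counts[i + 1], counts[i]
            else:
                i += 1
        if not found:
            counts[k - 1] += 1               # last bin holds the minimum seen
            items[k - 1] = x
    return {item: c for item, c in zip(items, counts) if item is not None}


print(array_frequent_items("abacabad", k=2))
```

Because the smaller of two neighboring counts is always pushed toward the end during a traversal, the last bin holds a minimum count by the time an unmatched item arrives there.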

### **Evaluation**



Teubner, Müller, and Alonso. FPGA Acceleration for the Frequent Item Problem. ICDE 2010.