## Data Processing on Modern Hardware

Jens Teubner, TU Dortmund, DBIS Group jens.teubner@cs.tu-dortmund.de

Summer 2016

## Part VII

# FPGAs for Data Processing

#### Motivation

Modern hardware features a number of "speed-up tricks":

- caches,
- instruction scheduling (out-of-order exec., branch prediction, ...),
- parallelism (SIMD, multi-core),
- throughput-oriented designs (GPUs).

Combining these "tricks" is essentially an **economic choice**:

- $\rightarrow$  chip space  $\equiv$  €€€
- $\rightarrow$  chip space  $\leftrightarrow$  component selection  $\leftrightarrow$  workload

#### Another Constraint: Power

Can use transistors for either logic or caches.



## Heterogeneous Hardware



| Configuration              | Large-core throughput | Small-core throughput                 | Total throughput |
|----------------------------|-----------------------|---------------------------------------|------------------|
| (a) Large-core homogeneous | 1                     |                                       | 6                |
| (b) Small-core homogeneous |                       | Pollack's Rule: $(5/25)^{0.5} = 0.45$ | 13               |
| (c) Heterogeneous          | 1                     | Pollack's Rule: $(5/25)^{0.5} = 0.45$ | 11               |
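
The totals in panels (a) to (c) can be checked with a few lines of arithmetic. The sketch below applies Pollack's Rule (per-core throughput grows roughly with the square root of core size) to the relative sizes 25 and 5 from the table. The core counts (6 large; 30 small; 2 large plus 20 small) are assumptions chosen so that every configuration fills the same area budget and matches the totals shown; they are not stated on the slide.

```python
import math

LARGE_CORE_SIZE = 25  # relative size, taken from the "(5/25)" term
SMALL_CORE_SIZE = 5


def core_throughput(size, reference=LARGE_CORE_SIZE):
    """Pollack's Rule: throughput scales with the square root of core size."""
    return math.sqrt(size / reference)


def total_throughput(n_large, n_small):
    """Sum the per-core throughput over all cores on the chip."""
    return (n_large * core_throughput(LARGE_CORE_SIZE)
            + n_small * core_throughput(SMALL_CORE_SIZE))


# Assumed core counts; each configuration occupies 150 size units.
configurations = {
    "(a) large-core homogeneous": (6, 0),
    "(b) small-core homogeneous": (0, 30),
    "(c) heterogeneous": (2, 20),
}

for name, (n_large, n_small) in configurations.items():
    print(name, round(total_throughput(n_large, n_small)))
```

Under these assumptions, many small cores give the highest aggregate throughput, while the mixed design gives up some of it to keep a few fast cores for sequential work.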

## Field-Programmable Gate Arrays

**Field-Programmable Gate Arrays (FPGAs)** are yet another point in the design space.

- "Programmable hardware."
- Make (some) design decisions **after** chip fabrication.

#### **Promises** of FPGA technology:

- → Build application-/workload-specific circuit.
- → Spend chip space only on functionality that you really need.
- → Tune for throughput, latency, energy consumption, ...
- → Overcome limits of general-purpose hardware with regard to task at hand (e.g., I/O limits).

## Field-Programmable Gate Arrays



- An array of logic gates
- Functionality fully programmable
- Re-programmable after deployment ("in the field")
- → "programmable hardware"

- FPGAs can be configured to implement **any** logic circuit.
- Complexity bound by available chip space.
  - → Obviously, the effective chip space is less than in custom-fabricated chips (ASICs).

## Field-Programmable Gate Arrays

FPGAs are **not** instruction set processors.

→ Cannot run (sequential) programs.

One **could** build an instruction set processor using an FPGA.

- $\rightarrow$  Bad idea. FPGA  $\approx 14 \times$  slower than equivalent ASIC.
- → If you want an instruction set processor, buy an instruction set processor.

#### Instead:

- Create arbitrary logic circuits.
- Hardware description language (HDL).

### Basic FPGA Architecture



- chip layout: 2D array
- Components
  - CLB: Configurable Logic Block ("logic gates")
  - IOB: Input/Output Block
  - DCM: Digital Clock Manager
- Interconnect Network
  - signal lines
  - configurable switch boxes

## Signal Routing



## Configurable Logic Block (CLB)



## Programming FPGAs

Programming is usually done using a hardware description language.

- E.g., VHDL<sup>12</sup>, Verilog
- High-level circuit description

Circuit description is compiled into a **bitstream**, then loaded into SRAM cells on the FPGA.



<sup>12</sup>VHSIC Hardware Description Language

### Example: VHDL

HDLs enable programming language-like descriptions of hardware circuits.

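```
library ieee;
use ieee.std_logic_1164.all;

-- Entity is not shown on the slide; the 8-bit port widths are an assumption.
entity compare is
  port (A, B : in  std_logic_vector(7 downto 0);
        C    : out std_logic);
end compare;

architecture Behavioral of compare is
begin
  -- Combinational equality comparator: re-evaluated whenever A or B changes.
  process (A, B)
  begin
    if (A = B) then
      C <= '1';
    else
      C <= '0';
    end if;
  end process;
end Behavioral;
```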

VHDL can be synthesized, but also executed in software (simulation).

#### Real-World Hardware



- Simplified Virtex-5 XC5VFXxxxT floor plan
- Frequently used high-level components are provided in discrete silicon
- BlockRAM (BRAM): set of blocks that each store up to 36 kbits of data
- DSP48 slices: 25×18-bit multipliers followed by a 48-bit accumulator
- CPU: two full embedded PowerPC 440 cores

### Development Board with Virtex-5 FPGA



|                      | Virtex-5 XC5VLX110T |
|----------------------|---------------------|
| Lookup Tables (LUTs) |                     |
| Block RAM (kbit)     | 5,328               |
| DSP48 Slices         | 64                  |
| PowerPC Cores        | 0                   |
| max. clock speed     | ≈ 450 MHz           |
| release year         |                     |

source: Xilinx Inc., ML50x Evaluation Platform, User Guide,



Low-level speed of configurable gates is slower than in custom-fabricated chips (clock frequencies: ~100 MHz).

→ Compensate with efficient circuit for problem at hand.

#### State Machines

The key asset of FPGAs is their inherent **parallelism**.

- Chip areas naturally operate independently and in parallel.

For example, consider finite-state automata.



→ non-deterministic automaton for `.*abc.*d`

#### State Machines

How would you implement an automaton in software?

Problems with state machine implementations in software:

- In **non-deterministic automata**, several states can be active at a time, which requires **iterative** execution on sequential hardware.
- **Deterministic automata** avoid this problem at the expense of a significantly higher **state count**.
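
One possible software answer, as a minimal sketch: simulate the non-deterministic automaton for `.*abc.*d` by keeping a set of active states and iterating over it for every input symbol. The state numbering (0 to 4) and the transition table are assumptions made for illustration.

```python
# Transition table of an NFA for .*abc.*d (state numbers are hypothetical).
# A label of None stands for "any symbol".
TRANSITIONS = {
    0: [(None, 0), ("a", 1)],  # leading .* loop, then 'a'
    1: [("b", 2)],
    2: [("c", 3)],
    3: [(None, 3), ("d", 4)],  # inner .* loop, then 'd'
    4: [],                     # accepting state
}
ACCEPTING = 4


def run_nfa(text):
    """Return True if the whole input matches .*abc.*d."""
    active = {0}
    for symbol in text:
        following = set()
        # Sequential hardware has to visit the active states one by one.
        for state in active:
            for label, target in TRANSITIONS[state]:
                if label is None or label == symbol:
                    following.add(target)
        active = following
    return ACCEPTING in active


print(run_nfa("xxabcyyd"))
print(run_nfa("abd"))
```

The inner loop over `active` is the iterative execution mentioned above: the work per input symbol grows with the number of states that are active at the same time.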

#### State Machines in Hardware

Automata can be translated mechanically into hardware circuits.

- each state → flip-flop
   (A flip-flop holds a single bit of information. Just the right amount to keep the 'active' / 'not active' information.)
- transitions:
  - → signals ("wires") between states
  - **conditioned** on current input symbol (~ 'and' gate)
  - multiple sources for one flip-flop input → 'or' gate.
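
The translation rules above can be mimicked in software to see how the circuit behaves. The sketch below models an automaton for `.*abc.*d` with one Boolean per state (the flip-flops) and computes all next values from the current ones in a single step, the way a clock edge would. The state names `q0` to `q4` are assumptions, and this is a simulation of the idea, not a hardware description.

```python
def clock_cycle(flip_flops, symbol):
    """Compute the next flip-flop values from the current ones in one step."""
    q0, q1, q2, q3, q4 = flip_flops
    return (
        q0,                             # start state loops on any symbol
        q0 and symbol == "a",           # 'and' gate: source active and input is 'a'
        q1 and symbol == "b",
        (q2 and symbol == "c") or q3,   # 'or' gate: two sources feed q3
        q3 and symbol == "d",           # accepting state
    )


def matches(text):
    """Feed one symbol per clock cycle; report whether q4 is active at the end."""
    flip_flops = (True, False, False, False, False)
    for symbol in text:
        flip_flops = clock_cycle(flip_flops, symbol)
    return flip_flops[4]


print(matches("xxabcyyd"))
print(matches("abd"))
```

Each line of the returned tuple corresponds to the input logic of one flip-flop. In hardware all of them are evaluated at the same time, so the cost per input symbol does not depend on how many states are active.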

### State Machines in Hardware







## Use Case: XML Projection

#### Example:

#### **Projection paths:**

```
keep descendants
{ //regions//item,
   //regions//item/name #,
   //regions//item/incategory }
```

Challenge: Avoid explicit synthesis for each query.

## Advantage: FPGA System Integration

Here: In-network filtering



In general: FPGA in the data path.

- disk → CPU
- memory → CPU
- ...

#### XPath → Finite State Automata

Automaton for //a/b/c//d:



In hardware: (see also earlier slides)



## Compilation to Hardware



#### Skeleton Automaton

### Separate the difficult parts from the latency-critical



#### Skeleton Automaton

**Thus:** Build skeleton automaton that can be **parameterized** to implement **any** projection query.



#### Intuitively:

- Runtime configuration determines the presence of the `*` self-loops.
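
A rough software analogue of this idea, as a sketch under assumptions: the matcher below has a fixed chain structure and is configured only through parameters, namely one tag name per step and a flag (the axis) that says whether the wildcard self-loop is present. The function names, the event format, and the per-depth stack are hypothetical choices for illustration; only the child (`/`) and descendant (`//`) steps of paths such as `//a/b/c//d` are handled.

```python
import re


def parse_path(path):
    """Split e.g. '//a/b/c//d' into [('//', 'a'), ('/', 'b'), ...]."""
    return re.findall(r"(//|/)([^/]+)", path)


def count_matches(steps, events):
    """Run the parameterized chain over ('open' | 'close', tag) events."""
    final = len(steps)
    stack = [{0}]  # active states, one set per nesting depth
    hits = 0
    for kind, tag in events:
        if kind == "close":
            stack.pop()
            continue
        following = set()
        for state in stack[-1]:
            if state == final:
                continue
            axis, name = steps[state]
            if axis == "//":
                following.add(state)      # self-loop enabled for this step
            if name == tag:
                following.add(state + 1)  # tag matches: advance along the chain
        hits += final in following
        stack.append(following)
    return hits


doc = [("open", t) for t in "abcxd"] + [("close", t) for t in "dxcba"]
print(count_matches(parse_path("//a/b/c//d"), doc))
print(count_matches(parse_path("//a/d"), doc))
```

Switching to a different query changes only the `steps` parameters; the matching code stays the same, which mirrors the goal of avoiding a new synthesis run per query.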

## **Pipelining**



→ Side effect: Can support self and descendant-or-self axes.

## Scalability



## Application Speedup

