





## Vector Accelerator for RISC-V

### **Oded Eini and Lionn Bruckstein**

### **Instructor: Professor Ran Ginosar**

June 1<sup>st</sup>, 2020

## Background

- Integer Instruction: **mul** x2 x5 x8  $\# \gamma = \alpha \times \beta$

  $\alpha \times \beta = \gamma: \quad 4 \times (-2) = -8$

- Vector Instruction: **v.madd** x6 x7 x10  $\# V3 = V1 \cdot V2$
- Vector Processor
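The contrast between the two instruction kinds can be sketched in a few lines of Python. This is only an illustrative model: the slide does not spell out the exact semantics of **v.madd**, so the sketch assumes $V1 \cdot V2$ means an element-wise product, and the function names `scalar_mul` and `vector_mul` are hypothetical.

```python
def scalar_mul(alpha, beta):
    """Model of an integer instruction: one product per instruction."""
    return alpha * beta


def vector_mul(v1, v2):
    """Model of a vector instruction: one instruction covers every element.

    Assumption: the operation is an element-wise product of equal-length vectors.
    """
    if len(v1) != len(v2):
        raise ValueError("vector operands must have the same length")
    return [a * b for a, b in zip(v1, v2)]


gamma = scalar_mul(4, -2)                  # the scalar example from the slide
v3 = vector_mul([4, 1, 3], [-2, 5, 0])     # made-up operand values
print(gamma, v3)
```

The scalar version needs one instruction per element, while the vector version expresses the same work as a single instruction over whole operands.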



## Motivation

- Growing need for vector processors: machine-learning, DSP, etc.
- Currently implemented in pipelined microarchitectures = complicated
- Multi-Cycle Microarchitecture

## **Has Never Been Explored Before**

## Let Us Begin From the End



1. Vector processor is faster and more energy efficient
2. Easier to implement with Multi-Cycle than with Pipeline

## **RISC-V Multi Cycle FSM**






## **Project Scope**

- Explore "Vector Extension" in the environment of a RISC-V Multi-Cycle core
- RISC-V Vector Processor From A to Z:
  - Design
  - Implement
  - Validate
  - Evaluate



## **Project Milestones**

- 1) Study and Research
- 2) Integer Core Simulative  $\rightarrow$  FPGA
- 3) Design and Implement a Vector Processing Unit
- 4) Integration: Vector Unit into the Integer Core
- 5) Develop a Performance Measurement System
- 6) Integer Core vs. Vector Processor Comparative Evaluation



## **Project Platforms**

#### **NetFPGA SUME**



#### **Power Meter**











## **Project Milestones**

- 1) Study and Research
- **2) Integer Core Simulative  $\rightarrow$  FPGA**
- 3) Design and Implement a Vector Processing Unit
- 4) Integration: Vector Unit into the Integer Core
- 5) Develop a Performance Measurement System
- 6) Integer Core vs. Vector Processor Comparative Evaluation



## **Convert Simulative Core into Synthesizable Core**

- Challenges:
  - Memory Block Implementation
  - Syntax Issues
  - Setup and Hold Time



#### **Appeared Simple – Reality Proved Us Wrong**




#### Vector Extension






## **VPU Integration**





#### Large Scale Design + Widespread Integration

## **Project Milestones**

- 1) Study and Research
- 2) Integer Core Simulative  $\rightarrow$  FPGA
- 3) Design and Implement a Vector Processing Unit
- 4) Integration: Vector Unit into the Integer Core
- **5) Develop a Performance Measurement System**
- 6) Integer Core vs. Vector Processor Comparative Evaluation



## **Comparison Parameters**

- Hardware Complexity
- Program Execution Time
- Power and Energy

#### **Theoretical Evaluation is Not Enough – True Measurement Required**

## **Performance Measurement System**

- Actual Power Measurement
- Creating Artificial GPIO Communication Interfaces
- Study the NetFPGA-SUME Schematics
- Study the Power-Manager datasheet
- Study and implement  $I^2C$  Interface



## **Physical Components**



## **Measurement System**








## **Comparative Evaluation**

- Hardware Complexity
- Experimental Results
  - ✓ Power
  - ✓ Execution Time



# Hardware Complexity

## Layout – Mapping Onto FPGA Logic

| Resource | Integer Core Utilization | Vector Processor Utilization |
|----------|--------------------------|------------------------------|
| LUT      | 1350                     | 4154                         |
| FF       | 1295                     | 5037                         |
| BRAM     | 1.50                     | 1.50                         |
| DSP      | 3                        | 6                            |
| IO       | 10                       | 10                           |
| MMCM     | 1                        | 1                            |
| LUTRAM   |                          | 48                           |

×4 Hardware Complexity
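As a quick check on the ×4 figure, the per-resource ratios can be computed directly from the utilization table. The sketch below is only an illustration: the dictionary names are hypothetical, and only resources reported for both designs are compared.

```python
# Utilization figures copied from the table above.
integer_core = {"LUT": 1350, "FF": 1295, "BRAM": 1.5, "DSP": 3}
vector_processor = {"LUT": 4154, "FF": 5037, "BRAM": 1.5, "DSP": 6}

# Ratio of vector-processor usage to integer-core usage, per resource.
ratios = {name: vector_processor[name] / integer_core[name] for name in integer_core}

for name, ratio in ratios.items():
    print(f"{name}: x{ratio:.2f}")
```

The flip-flop count grows the most (close to ×4) and LUTs by roughly ×3, so the headline ×4 appears to follow the largest of these ratios.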

# **Experimental Results**

**Benchmark:** Inner-Product

500-element vectors × 1,048,575 iterations (0xFFFFF)

## **Experiment No. 1: Single-Core Power Consumption**

|                  | Integer Processor | Vector Processor |
|------------------|-------------------|------------------|
| Power            | 3.4 [W]           | 3.41 [W]         |
| Time Per Program | 325 [µs]          | 31 [µs]          |

#### Who are the consumers?

## **FPGA Resources Utilization**



#### Single-Core



#### 71 Cores



## **Isolating the Background Consumption**

- Hypothesis: Consumption is Linear in the Number of Cores
- $$Power = P_{background} + [P_{single \ core} \times (number \ of \ cores)]$$

1) $$\underbrace{P_{70 \ Cores}}_{calculation} = \underbrace{P_{71 \ Cores} - P_{1 \ Core}}_{measurement}$$

2) $$P_{Single \ Core} \approx \frac{P_{70 \ Cores}}{70}$$



## **Experiment No. 2 – Power Consumption Calculation**

|                               | Integer Processor | Vector Processor |
|-------------------------------|-------------------|------------------|
| Single-Core (measured)        | 3.4 [W]           | 3.41 [W]         |
| 71 Cores (measured)           | 4.9 [W]           | 5.13 [W]         |
| Power per Core (calculated)   | 21 [mW]           | 24.6 [mW]        |
| Background Power (calculated) | 3.378 [W]         | 3.385 [W]        |
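The calculation behind this table can be reproduced with a short sketch. It assumes the linear model stated earlier (total power = background power + per-core power × number of cores); the function name `isolate_power` and its parameter names are hypothetical.

```python
def isolate_power(p_single_w, p_many_w, n_many=71):
    """Split measured board power into per-core and background parts.

    Assumes P(n) = P_background + n * P_core, so the difference between the
    n_many-core and the single-core measurement is the power of the
    (n_many - 1) additional cores.
    """
    p_core_w = (p_many_w - p_single_w) / (n_many - 1)
    p_background_w = p_single_w - p_core_w
    return p_core_w, p_background_w


# Measured values from the table: single-core and 71-core power in watts.
int_core_w, int_background_w = isolate_power(3.4, 4.9)
vec_core_w, vec_background_w = isolate_power(3.41, 5.13)

print(f"Integer: {int_core_w * 1e3:.1f} mW per core, {int_background_w:.3f} W background")
print(f"Vector:  {vec_core_w * 1e3:.1f} mW per core, {vec_background_w:.3f} W background")
```

The results agree with the table up to rounding in the last digit.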

## **Final Experiment - Hypothesis Confirmation**



#### **The Model Is Indeed Linear**

## **Summary of Results**

| Parameter                      | Integer Core (RV32I) | Vector Processor (RV32V) | Ratio (Integer / Vector) |
|--------------------------------|----------------------|--------------------------|--------------------------|
| Hardware Complexity            |                      |                          | 0.25                     |
| Power                          | 21 [mW]              | 26 [mW]                  | 0.8                      |
| Time Per Program               | 325 [µs]             | 31 [µs]                  | 10.5                     |
| Energy Per Program             | 6,825 [nJ]           | 806 [nJ]                 | 8.5                      |
| Figure of Merit <sup>[1]</sup> |                      |                          | 88.8                     |

$$\text{Figure of Merit} = \frac{(Time \times Energy)_{RV32I}}{(Time \times Energy)_{RV32V}}$$

[1] Also known as the Energy-Delay Product: $EDP = E \times D = P \times D^2$
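The energy and figure-of-merit rows follow directly from the per-core power and time-per-program values in the summary table. A minimal sketch of that arithmetic, with hypothetical variable and function names:

```python
def energy_nj(power_mw, time_us):
    """Energy per program: mW x µs = nJ."""
    return power_mw * time_us


# Per-core power [mW] and time per program [µs] from the summary table.
p_int_mw, t_int_us = 21, 325
p_vec_mw, t_vec_us = 26, 31

e_int_nj = energy_nj(p_int_mw, t_int_us)
e_vec_nj = energy_nj(p_vec_mw, t_vec_us)

# Figure of merit: ratio of energy-delay products (integer / vector).
figure_of_merit = (t_int_us * e_int_nj) / (t_vec_us * e_vec_nj)

print(f"Energy per program: {e_int_nj} nJ vs. {e_vec_nj} nJ")
print(f"Time ratio:   {t_int_us / t_vec_us:.1f}")
print(f"Energy ratio: {e_int_nj / e_vec_nj:.1f}")
print(f"Figure of merit: {figure_of_merit:.1f}")
```

Because the figure of merit multiplies the time ratio by the energy ratio, the roughly tenfold speedup enters twice, which is why it dominates the result despite the vector processor's slightly higher power.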

## Innovation

- We offer a new vector processor structure that has never been explored before
- Exploiting a "marginal advantage" of the otherwise inefficient multi-cycle microarchitecture produced clear-cut results



## **Engineering Difficulties**

- Multidisciplinary Project
- Hands-On Experience
- Real Engineering Difficulties Require Real Engineering Solutions









# Special thanks to Prof. Ran Ginosar and HS DSL's Staff