Multi-threaded Systolic Array on FPGA

Systolic arrays are a latency-efficient way to perform matrix computations, specifically matrix multiplication. Sparse matrices under-utilize the array because zero operands leave computation units idle. To boost utilization, these idle cycles can be used to compute results for a different calculation thread, which also improves overall performance. The goal of this project is to implement both a regular systolic array and a multi-threaded systolic array on an FPGA, in order to weigh the performance benefits against the overhead in a physical design.

A systolic array is a homogeneous network of computation units (nodes) connected in a grid. Each node computes a partial result, stores it, and passes the input data on to its neighboring nodes. Different connectivity schemes between the nodes dictate the calculation the systolic array performs. In this project we focused on matrix multiplication, a use case that is prevalent in many evolving fields today (such as deep learning). Systolic arrays are highly efficient at computing matrix multiplication because of the way they work, but they also demand a lot of resources to build.

When there is a zero in the input data, every node in the row or column it is fed into skips its calculation unit for that element (a product with zero is zero, so nothing needs to be calculated), meaning the utilization of the array drops. In sparse matrices, this under-utilization has a high impact. One way to use these "dead" cycles is to compute input data from a different calculation (a different thread). This recovers utilization and speeds up computation. The goal of this project is to implement both a regular systolic array and a multi-threaded systolic array on an FPGA, in order to weigh the performance benefits against the overhead in a physical design.
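The utilization drop described above can be illustrated with a small software model. The following Python sketch is not the project's FPGA design; it is a minimal cycle-level simulation that assumes an output-stationary array (each node accumulates one element of the result) with skewed inputs, and it counts a node's cycle as "useful" only when both operands are nonzero. The names `systolic_matmul` and `random_sparse`, the array size, and the density values are all hypothetical and chosen for illustration.

```python
import random


def systolic_matmul(A, B):
    """Simulate an output-stationary systolic array computing C = A x B.

    Assumption: node (i, j) receives the operand pair (A[i][s], B[s][j])
    at cycle t = s + i + j, which models inputs skewed by one cycle per
    row and column. Returns (C, useful_macs, total_slots).
    """
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    # The last operand pair reaches node (n-1, m-1) at cycle k + n + m - 3.
    cycles = k + n + m - 2
    useful = 0
    for t in range(cycles):
        for i in range(n):
            for j in range(m):
                s = t - i - j  # index of the operand pair arriving this cycle
                if 0 <= s < k:
                    a, b = A[i][s], B[s][j]
                    if a != 0 and b != 0:
                        C[i][j] += a * b  # the multiplier does real work
                        useful += 1
                    # otherwise the product is zero and the slot is "dead"
    return C, useful, n * m * cycles


def random_sparse(rows, cols, density, rng):
    """Build a matrix whose entries are nonzero with probability `density`."""
    return [[rng.randint(1, 9) if rng.random() < density else 0
             for _ in range(cols)] for _ in range(rows)]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    for density in (1.0, 0.5, 0.2):
        A = random_sparse(8, 8, density, rng)
        B = random_sparse(8, 8, density, rng)
        _, useful, total = systolic_matmul(A, B)
        print(f"density={density:.1f}  utilization={useful / total:.2f}")
```

Even for fully dense inputs this model reports utilization below 1, because nodes sit idle while the skewed data fills and drains the array. The slots lost to zero operands come on top of that, and they are the "dead" cycles that a second thread could in principle occupy. The model ignores the buffering and control logic such sharing would need, which is the overhead the physical design is meant to measure.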