
FuriosaAI RNGD processor for sustainable AI computing

FuriosaAI RNGD without cooler

We are hearing more and more about sustainable AI computation, and FuriosaAI has a solution for that with RNGD. This is almost the opposite of many of the AI compute platforms we have heard about today: instead of chasing maximum performance, this is a lower-power compute solution.

This is the last talk of the day after over a dozen, and it will be conducted live, so please excuse any typos.

FuriosaAI RNGD processor for sustainable AI computing

Here are the specs of the card. It is not specifically designed to be the fastest AI chip on the market.

FuriosaAI RNGD Hot Chips 2024_Page_05

Here’s a look at the card with its cooler.

FuriosaAI RNGD without cooler and with cooler

The target TDP for air-cooled data centers is only 150 W.

FuriosaAI RNGD Hot Chips 2024_Page_06

The chip is built on a 5 nm process with 12-layer HBM3 and TSMC CoWoS-S packaging.

FuriosaAI RNGD Hot Chips 2024_Page_07

Instead of focusing on the H100 or B100, FuriosaAI is targeting the NVIDIA L40S. We wrote a big article about the L40S a while back. The goal is not only to offer similar performance, but also to deliver that performance while consuming less power.

FuriosaAI RNGD Hot Chips 2024_Page_08

Efficiency comes from the hardware, the software, and the algorithms.

FuriosaAI RNGD Hot Chips 2024_Page_09

One of the challenges for FuriosaAI was to work on the abstraction layer between hardware and software.

FuriosaAI RNGD Hot Chips 2024_Page_11

Tensor contraction is one of FuriosaAI’s big operations. In BERT, it accounted for over 99% of FLOPS.

FuriosaAI RNGD Hot Chips 2024_Page_12

Normally we have matrix multiplication as a primitive instead of tensor contraction.

FuriosaAI RNGD Hot Chips 2024_Page_13

Instead, the abstraction occurs at the level of tensor contraction.
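To sketch the distinction the slides draw (the function names here are illustrative, not Furiosa's API): a matrix multiply is one fixed instance of a more general tensor contraction, which keeps some indices and sums over the shared ones.

```python
# Sketch: matrix multiplication is a special case of tensor contraction.
# Illustrative pure-Python code, not Furiosa's actual primitive.

def matmul(a, b):
    """C[i][j] = sum_k A[i][k] * B[k][j] -- the classic 2D primitive."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

def batched_contraction(a, b):
    """C[b][i][j] = sum_k A[b][i][k] * B[b][k][j].
    A contraction keeps the batch index b and sums over the shared
    index k. A chip whose primitive is the whole contraction is free
    to schedule the b, i, j, and k loops however best fits its
    storage and compute units."""
    return [matmul(ab, bb) for ab, bb in zip(a, b)]
```

With matmul as the primitive, the batch loop lives outside the hardware's view; with the contraction as the primitive, the scheduler sees all four loops at once.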

FuriosaAI RNGD Hot Chips 2024_Page_14

Furiosa adds a low-level summation to its primitive.

FuriosaAI RNGD Hot Chips 2024_Page_15

Here the matrices A and B are multiplied to produce C.

FuriosaAI RNGD Hot Chips 2024_Page_16

Furiosa then takes this and maps it onto the actual architecture, with its storage and compute units.

FuriosaAI RNGD Hot Chips 2024_Page_17

From here, an entire tensor contraction can be a primitive.

FuriosaAI RNGD Hot Chips 2024_Page_18

By taking spatial and temporal orchestration into account, they can increase efficiency and utilization.

FuriosaAI RNGD Hot Chips 2024_Page_19

Furiosa says it has flexible reconfiguration, which is important to keep performance high with varying batch sizes.

FuriosaAI RNGD Hot Chips 2024_Page_20

Here is a look at the RNGD implementation.

FuriosaAI RNGD Hot Chips 2024_Page_21

Here are the interconnect networks, which are also used for accessing memory.

FuriosaAI RNGD Hot Chips 2024_Page_22

Furiosa uses PCIe Gen5 x16 for chip-to-chip communication. It also uses P2P via a PCIe switch for direct accelerator-to-accelerator communication, so if XConn gets its switches right, this is a fantastic pairing.

FuriosaAI RNGD Hot Chips 2024_Page_23

Furiosa supports SR-IOV for virtualization.

FuriosaAI RNGD Hot Chips 2024_Page_24

The company has worked on signal and power integrity to ensure reliability.

FuriosaAI RNGD Hot Chips 2024_Page_25

Here is a flow chart of how Furiosa LLM works.

FuriosaAI RNGD Hot Chips 2024_Page_27

The compiler compiles each partition of the model, which is then assigned across multiple devices.
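A toy illustration of the partitioning idea (a hypothetical helper, not Furiosa's compiler): split a model's layers into contiguous partitions, one per device, balancing the count as evenly as possible.

```python
# Toy sketch of assigning model layers to devices for pipelined
# execution. Illustrative only -- not Furiosa's actual compiler logic.

def partition_layers(num_layers, num_devices):
    """Split layer indices 0..num_layers-1 into contiguous partitions,
    one per device, with sizes differing by at most one."""
    base, extra = divmod(num_layers, num_devices)
    partitions, start = [], 0
    for d in range(num_devices):
        size = base + (1 if d < extra else 0)  # spread the remainder
        partitions.append(list(range(start, start + size)))
        start += size
    return partitions
```

Each partition would then be compiled as its own unit and mapped to its device.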

FuriosaAI RNGD Hot Chips 2024_Page_28

The compiler optimizes the model for performance improvements and energy efficiency.

FuriosaAI RNGD Hot Chips 2024_Page_29

For example, the serving framework performs continuous batching to achieve better utilization.
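Continuous batching can be illustrated with a toy simulation (hypothetical numbers and helpers, not the actual serving framework): static batching holds every slot until the longest request in the group finishes, while continuous batching refills a freed slot immediately.

```python
# Toy comparison of static vs. continuous batching for LLM serving.
# Illustrative sketch only; request "lengths" are decode-step counts.

def static_batching_steps(lengths, batch_size):
    """Fixed groups: each group runs until its longest request is
    done, so short requests leave their slots idle."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths, batch_size):
    """Refill a slot as soon as its request completes, so the batch
    stays full while work remains."""
    pending = list(lengths)
    slots = []  # remaining decode steps per active request
    steps = 0
    while pending or slots:
        while pending and len(slots) < batch_size:
            slots.append(pending.pop(0))
        steps += 1
        slots = [s - 1 for s in slots if s > 1]
    return steps
```

With requests of lengths [5, 1, 1, 1] and a batch of 2, static batching takes 6 steps while continuous batching takes 5: the freed slots behind the long request keep doing useful work.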

FuriosaAI RNGD Hot Chips 2024_Page_30

The company has a graph-based automation tool to support quantization. Furiosa supports a number of different formats, including FP8 and INT4.
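As a minimal sketch of what INT4 quantization involves (illustrative helpers, not Furiosa's graph-based tool): map floats onto signed 4-bit integers with a single per-tensor scale and clamp to the representable range.

```python
# Minimal symmetric INT4 quantization sketch. Illustrative only --
# not Furiosa's quantization tooling.

def quantize_int4(values):
    """Map floats to signed 4-bit integers in [-8, 7] using one
    per-tensor scale chosen from the largest magnitude."""
    scale = max(abs(v) for v in values) / 7 or 1.0  # avoid zero scale
    q = [max(-8, min(7, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the quantized integers."""
    return [v * scale for v in q]
```

For unclipped values the round-trip error is bounded by half the scale, which is the trade the automation tool has to manage per layer.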

FuriosaAI RNGD Hot Chips 2024_Page_31

Here is the company’s development methodology.

FuriosaAI RNGD Hot Chips 2024_Page_32

Closing words

There was a lot to cover here. The short summary is that the company is using its compiler and software stack to map AI inference onto its lower-power SoC, delivering lower-power AI inference.
