Microsoft has recently released BitNet.cpp, a new inference framework designed specifically for 1-bit large language models (LLMs). It enables fast and efficient inference for models like BitNet b1.58, which Microsoft introduced earlier this year in a comprehensive paper. The framework ships with a suite of optimized kernels that currently support lossless inference on CPU, with NPU and GPU support planned for the future.
The key innovation behind BitNet.cpp lies in how it represents model parameters, also known as weights: with roughly 1.58 bits each. This is a significant reduction compared to traditional LLMs, which typically store weights as 16-bit floating-point values (FP16 or BF16), or in newer low-precision formats such as NVIDIA's FP4. BitNet b1.58 restricts each weight to one of three values: -1, 0, or 1; since log2(3) ≈ 1.58, a ternary weight carries about 1.58 bits of information, which is where the model gets its name. Despite this reduction, the paper reports that the model matches full-precision LLMs of the same size and training data in end-task performance.
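To make the idea concrete, here is a minimal NumPy sketch of the absmean quantization scheme described in the BitNet b1.58 paper, which maps a full-precision weight matrix onto the three values {-1, 0, 1}. The function name `quantize_ternary` is illustrative only and is not part of BitNet.cpp's API:

```python
import numpy as np

def quantize_ternary(weights: np.ndarray, eps: float = 1e-5):
    """Absmean quantization (BitNet b1.58 paper): scale the weight
    matrix by its mean absolute value, then round each entry to the
    nearest value in {-1, 0, 1}."""
    gamma = np.mean(np.abs(weights)) + eps            # per-tensor scale
    quantized = np.clip(np.round(weights / gamma), -1, 1)
    return quantized.astype(np.int8), gamma

# Toy usage: quantize a small random weight matrix.
w = np.random.randn(4, 4).astype(np.float32)
w_q, gamma = quantize_ternary(w)
print(w_q)     # every entry is -1, 0, or 1
print(gamma)   # dequantize with w ≈ gamma * w_q
```

Note that the `int8` array here is only for readability; a real kernel would pack several ternary weights into each byte, which is how the format approaches its theoretical ~1.58 bits per weight.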