Non quia difficilia sunt non audemus, sed quia non audemus difficilia sunt
Home -> Publications
Home
  Publications
    
edited volumes
  Awards
  Research
  Teaching
  Miscellaneous
  Full CV [pdf]
  BLOG






  Events








  Past Events





Publications of Torsten Hoefler
Tim Dettmers, Ruslan A. Svirschevski, Vage Egiazarian, Denis Kuznedelev, Elias Frantar, Saleh Ashkboos, Alexander Borzunov, Torsten Hoefler, Dan Alistarh:

 SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression

(In The Twelfth International Conference on Learning Representations, May 2024)

Abstract

Recent advances in large language model (LLM) pretraining have led to high-quality LLMs with impressive abilities. By compressing such LLMs via quantization to 3-4 bits per parameter, they can fit into memory-limited devices such as laptops and mobile phones, enabling personalized use. However, quantization down to 3-4 bits per parameter usually leads to moderate-to-high accuracy losses, especially for smaller models in the 1-10B parameter range, which are well-suited for edge deployments. To address this accuracy issue, we introduce the Sparse-Quantized Representation (SpQR), a new compressed format and quantization technique which enables for the first time near-lossless compression of LLMs across model scales, while reaching similar compression levels to previous methods. SpQR works by identifying and isolating outlier weights, which cause particularly-large quantization errors, and storing them in higher precision, while compressing all other weights to 3-4 bits, and achieves relative accuracy losses of less than 1% in perplexity for highly-accurate LLaMA and Falcon LLMs. This makes it possible to run 33B parameter LLM on a single 24 GB consumer GPU without any performance degradation at 15% speedup thus making powerful LLMs available to consumer without any downsides. SpQR comes with efficient algorithms for both encoding weights into its format, as well as decoding them efficiently at runtime. Specifically, we provide an efficient GPU inference algorithm for SpQR which yields faster inference than 16-bit baselines at similar accuracy, while enabling memory compression gains of more than 4x.

Documents

download article:     
 

BibTeX

@inproceedings{,
  author={Tim Dettmers and Ruslan A. Svirschevski and Vage Egiazarian and Denis Kuznedelev and Elias Frantar and Saleh Ashkboos and Alexander Borzunov and Torsten Hoefler and Dan Alistarh},
  title={{SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression}},
  year={2024},
  month={May},
  booktitle={The Twelfth International Conference on Learning Representations},
  source={http://www.unixer.de/~htor/publications/},
}


serving: 44.200.140.218:54004© Torsten Hoefler