Google AI: Memory Strain Slashed, Accuracy Unchanged
30 Mar
Summary
- Google's TurboQuant cuts LLM key-value cache memory use by a factor of six.
- Compression is applied post hoc, with no additional model training.
- Attention computations run up to eight times faster.

Large language models (LLMs) face significant memory pressure because they cache key and value vectors (the KV cache) for every token already processed, so the cache grows with context length. Google has introduced TurboQuant, a system designed to relieve this bottleneck.
TurboQuant employs a two-stage approach. In the first stage, PolarQuant converts cached vectors into polar coordinates, representing each pair of components by a radius and an angle rather than two Cartesian values; these polar codes can be quantized much more aggressively, sharply reducing memory needs. The transformation also minimizes normalization overhead.
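To make the first stage concrete, the sketch below quantizes a vector pair-by-pair in polar form: each two-component sub-vector becomes a radius and an angle, each stored in a few bits. The function names, bit widths, and the plain uniform quantizers are illustrative assumptions, not PolarQuant's published design.

```python
import numpy as np

def polar_quantize(vec, r_bits=4, theta_bits=4):
    """Quantize consecutive (x, y) pairs of a vector in polar form.

    Illustrative sketch only; bit widths and the uniform quantizers
    are assumptions, not the actual PolarQuant configuration.
    """
    pairs = vec.reshape(-1, 2)                    # group into 2-D sub-vectors
    r = np.linalg.norm(pairs, axis=1)             # radius of each pair
    theta = np.arctan2(pairs[:, 1], pairs[:, 0])  # angle in [-pi, pi]

    # Uniformly quantize radius and angle to a few bits each.
    r_max = r.max() or 1.0
    r_q = np.round(r / r_max * (2 ** r_bits - 1))
    t_q = np.round((theta + np.pi) / (2 * np.pi) * (2 ** theta_bits - 1))
    return r_q.astype(np.uint8), t_q.astype(np.uint8), r_max

def polar_dequantize(r_q, t_q, r_max, r_bits=4, theta_bits=4):
    """Reconstruct the vector from its quantized polar codes."""
    r = r_q / (2 ** r_bits - 1) * r_max
    theta = t_q / (2 ** theta_bits - 1) * 2 * np.pi - np.pi
    return np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1).reshape(-1)
```

With 4 bits each for radius and angle, a pair of 32-bit floats shrinks to a single byte; the reconstruction error is then bounded by the radius and angle step sizes.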
The second stage, Quantized Johnson-Lindenstrauss (QJL), refines the compression further: it reduces vector components to single bits while preserving the inner products between queries and keys, correcting small errors introduced by the first stage. As a result, attention scores computed from the compressed cache stay close to their exact values.
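A minimal sketch of the second idea: project each key with a random Gaussian matrix, keep only the sign bit of each projected coordinate, and estimate query-key inner products (and hence attention scores) from those bits plus the key's norm. The sqrt(pi/2) scaling follows the QJL construction, but the function names and interface here are assumptions for illustration.

```python
import numpy as np

def qjl_sketch(key, proj):
    """1-bit sketch of a key: the sign of each random projection."""
    return (proj @ key) >= 0            # boolean array, one bit per row of proj

def qjl_score(query, key_bits, key_norm, proj):
    """Estimate <query, key> from the sign sketch and the key's norm.

    For Gaussian s, E[sign(<s, k>) * <s, q>] = sqrt(2/pi) * <q, k/||k||>,
    so averaging over the m rows of proj and scaling by
    sqrt(pi/2) * ||k|| gives an unbiased inner-product estimate.
    """
    m = proj.shape[0]
    signs = np.where(key_bits, 1.0, -1.0)
    return np.sqrt(np.pi / 2) * key_norm / m * (proj @ query) @ signs
```

Only the keys are reduced to bits; the query stays in full precision at score time, which is what keeps the estimate accurate despite 1-bit storage.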
Testing indicates TurboQuant can cut key-value cache memory usage by a factor of six while maintaining downstream accuracy, quantizing entries to as few as three bits without any model retraining. Attention computations over the compressed cache have also been observed to run up to eight times faster than standard 32-bit operations on high-end hardware.
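To put those numbers in perspective, here is a back-of-the-envelope KV cache size calculation using the standard layers x heads x head-dim x sequence-length accounting. The model shape is a hypothetical example, not any specific Google model, and the comparison baseline (16-bit storage) is an assumption.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bits):
    """Bytes needed to store keys and values (the factor of 2) for
    every layer, head, head dimension, and cached token."""
    return 2 * layers * kv_heads * head_dim * seq_len * bits // 8

# Hypothetical model: 32 layers, 8 KV heads, head dim 128, 128k context.
fp16_bytes = kv_cache_bytes(32, 8, 128, 128_000, 16)  # 16-bit baseline
q3_bytes = kv_cache_bytes(32, 8, 128, 128_000, 3)     # 3-bit quantized

print(fp16_bytes / 2**30)  # ≈ 15.6 GiB
print(q3_bytes / 2**30)    # ≈ 2.9 GiB
```

Against a 16-bit baseline, 3-bit storage alone yields roughly a 5.3x reduction; the six-fold figure in the article presumably reflects the full pipeline and the baseline used in Google's evaluation.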