Google Gemma Speeds Up AI Locally
6 May
Summary
- Google's Gemma open models now feature MTP for faster local AI.
- MTP uses speculative decoding to guess future tokens, speeding generation.
- Gemma 4 models are tuned for local hardware, unlike cloud-based Gemini.

Google has enhanced its Gemma open models, released this spring, with experimental Multi-Token Prediction (MTP) drafters aimed at boosting local AI performance. The MTP models employ speculative decoding to predict future tokens, accelerating generation compared to conventional one-token-at-a-time decoding. The Gemma models are built on the same foundational technology as Google's advanced Gemini AI but are optimized for local execution, enabling users to run them on consumer GPUs. This allows AI experimentation on personal hardware, reducing reliance on cloud-based systems and improving data privacy. Google has also updated the Gemma 4 license to the more permissive Apache 2.0.

Traditional large language models generate tokens one by one, a process that can be slow on consumer hardware because it is bound by memory bandwidth rather than compute. MTP addresses this by using smaller, faster drafter models to speculatively generate tokens while the main model works on context. These drafters, such as the 74-million-parameter Gemma 4 E2B, share the main model's key-value cache and use sparse decoding to efficiently predict likely next tokens, improving overall generation speed.