Google's Gemma 4 AI models get 3x speed boost by predicting future tokens
Summary
Google has improved the speed of its local AI models, called Gemma 4, using a new technique called Multi-Token Prediction (MTP). This method predicts several future tokens at once, allowing the models to respond up to three times faster on a range of devices while keeping the same accuracy.
Key Facts
- Google released the Gemma 4 AI models for local use, allowing users to run AI on their own devices.
- Gemma 4 models use the same technology as Google's larger Gemini AI but are optimized for personal hardware.
- Multi-Token Prediction (MTP) helps the AI generate multiple tokens (words or pieces of words) ahead of time to speed up responses.
- MTP uses smaller, faster "drafter" models to guess future tokens, which are then checked by the main, larger model.
- This process reduces the waiting time by producing tokens in parallel rather than one by one.
- The speed increase varies by device, with some phones seeing nearly a 3x boost and Apple’s M4 processor seeing 2.5x faster performance.
- Google changed the license for Gemma 4 to Apache 2.0, making it easier for developers to use and modify.
- MTP works around hardware limits on memory bandwidth: verifying a batch of drafted tokens in one pass reduces how often the model's weights must be moved from memory during generation.
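The draft-and-verify loop described in the bullets above can be sketched in code. This is a minimal, runnable illustration of speculative decoding (the mechanism behind MTP as the article describes it), not Google's implementation: the "main" and "drafter" models here are toy deterministic functions standing in for neural networks, and all names (`main_model`, `drafter_model`, `speculative_decode`, `draft_len`) are invented for this sketch.

```python
# Toy sketch of the draft-and-verify loop behind Multi-Token Prediction.
# Real systems use neural models; here both "models" are cheap
# deterministic functions so the control flow is runnable end to end.

def main_model(context):
    # Stand-in for the large, accurate model: the next token is a
    # fixed function of the last token in the context.
    return (context[-1] * 31 + 7) % 100

def drafter_model(context):
    # Stand-in for the small, fast drafter. It usually agrees with the
    # main model but deliberately guesses wrong on some inputs.
    guess = (context[-1] * 31 + 7) % 100
    return guess if context[-1] % 5 else (guess + 1) % 100

def speculative_decode(context, num_tokens, draft_len=4):
    out = list(context)
    while len(out) - len(context) < num_tokens:
        # 1. The drafter proposes draft_len tokens one by one (cheap).
        draft, ctx = [], list(out)
        for _ in range(draft_len):
            t = drafter_model(ctx)
            draft.append(t)
            ctx.append(t)
        # 2. The main model checks every drafted position. In a real
        #    system this happens in a single parallel forward pass, so
        #    the weights are read from memory once per draft instead of
        #    once per token; here we loop for clarity.
        ctx = list(out)
        for t in draft:
            expected = main_model(ctx)
            if t == expected:
                out.append(t)
                ctx.append(t)
            else:
                # First mismatch: keep the main model's token and
                # throw away the rest of the draft. Output therefore
                # always matches what the main model alone would emit.
                out.append(expected)
                break
    return out[len(context):][:num_tokens]

tokens = speculative_decode([42], num_tokens=8)
print(tokens)
```

Because every drafted token is accepted only if the main model agrees, the output is token-for-token identical to decoding with the main model alone; the speedup comes purely from accepting several tokens per expensive verification step, which matches the article's claim that accuracy is unchanged.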
This is a fact-based summary from The Actual News.