I tried a new 8B local LLM, and its design might be the biggest shift since DeepSeek R1
Most of the small reasoning models that have shipped in the past year are variations on a theme: a familiar transformer backbone, a Mixture-of-Experts wrapper, grouped-query attention (or something like Gated DeltaNet, in Qwen's case) for a smaller KV cache, and a heavy reinforcement learning stage at the end. Performance improves year on year, but the architecture of what's actually running still has much the same shape it had when DeepSeek R1 arrived.
Boye Vlasblom
Netherlands
Published by: aplhsindia.in
