Most of the small reasoning models that have shipped in the past year are variations on a theme: a familiar transformer backbone, a Mixture-of-Experts wrapper, grouped-query attention (or, in Qwen’s case, something like Gated DeltaNet) for a smaller KV cache, and a heavy reinforcement learning stage at the end. Performance keeps improving, but the architecture of what’s actually running has much the same shape it had when DeepSeek R1 arrived.