Running large language models on local hardware not only spares you the monthly subscriptions that cloud providers charge, but also keeps your private data out of the hands of large corporations. But unless you’re willing to spend thousands of dollars on a top-of-the-line graphics card, you’re bound to run out of VRAM when you try to run models with more than 15B parameters. Sure, 7B and 9B models can get the job done for productivity tasks, but sub-10B LLMs (or even their sub-20B counterparts, for that matter) aren’t the best fit for hardcore coding workloads or tasks that demand precise output.
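To get a rough feel for why VRAM runs out as parameter counts grow, here is a minimal back-of-the-envelope sketch. It counts only the memory needed to hold the weights (parameters multiplied by bytes per parameter) and ignores the KV cache, activations, and runtime overhead, so real usage will be higher. The function name `weight_memory_gib` is hypothetical, and the 16-bit and 4-bit precisions are assumptions chosen for illustration, not figures from this article.

```python
def weight_memory_gib(params_billions: float, bits_per_param: int) -> float:
    """Approximate memory for the model weights alone, in GiB.

    Assumption: every parameter is stored at the same precision.
    KV cache, activations, and framework overhead are NOT included.
    """
    total_bytes = params_billions * 1e9 * bits_per_param / 8
    return total_bytes / 2**30


if __name__ == "__main__":
    # Model sizes mentioned in the text, in billions of parameters.
    for size in (7, 9, 15, 20):
        fp16 = weight_memory_gib(size, 16)  # assumed 16-bit weights
        q4 = weight_memory_gib(size, 4)     # assumed 4-bit quantized weights
        print(f"{size}B: ~{fp16:.1f} GiB at 16-bit, ~{q4:.1f} GiB at 4-bit")
```

By this estimate, a 15B model at 16-bit precision needs roughly 28 GiB for its weights alone, before any working memory is counted. Compare the numbers against your own card’s VRAM to see which sizes are realistic.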