LLM Inference Machine for $300

You can absolutely run Qwen-2.5 32B on this machine. And of course, Llama-3.1 8B and Llama-3.2 Vision 11B are no problem at all.

Now, before you get too excited, there’s a catch: this rig won’t break any speed records (more on that later). But if you’re after a budget-friendly way to do LLM research, this build might be just what you need.

Here’s a breakdown of the parts and the amazing prices I got them for:

The motherboard was a crazy good find: a broken PCIe latch got me a killer deal. The Ryzen 3400G is outdated by today’s standards, with only 4 cores and 8 threads, but for a GPU-focused inference rig, it’s more than enough. Bonus: its Vega iGPU frees up the PCIe slot for the real star of the show, the M40 GPU.

Speaking of the GPU, it’s a Maxwell-era data center card with a massive 24GB of VRAM. That much memory is essential for running hefty 32B models (quantized, of course).
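
As a rough sanity check (my own back-of-the-envelope math, not measured numbers): at Q4_K_M, each weight takes roughly 4.5–5 bits, so a 32B-parameter model works out to about 32 × 10⁹ × 4.8 / 8 ≈ 19 GB for the weights alone. That fits in 24GB with a few gigabytes to spare for the KV cache and other overhead.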

While you can find a used M40 on eBay for around $90 these days, I had to buy an additional cooling solution (two small fans in a 3D-printed shroud), since data center GPUs rely on server chassis airflow and don’t come with their own fans or blowers like their consumer counterparts do.

Here are the token generation speeds for several instruction-tuned models, quantized to 4-bit (Q4_K_M), measured with llama-bench:

[Figure: Local LLM Machine]
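
For context, llama-bench is the benchmarking tool that ships with llama.cpp. A typical invocation looks something like this (the model filename here is just an illustration, not the exact file I used):

    llama-bench -m qwen2.5-32b-instruct-q4_k_m.gguf -ngl 99

With -ngl set high enough to offload every layer to the M40, the tool reports prompt-processing and token-generation throughput in tokens per second.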

Performance is all relative. Compared to an RTX 3000-series card, the M40 is definitely the slower sibling: roughly 5x slower. But then again, an RTX 3090 is roughly 10x more expensive. Meanwhile, a more affordable RTX 3080 might limit your options with its 10GB (or 12GB for the enthusiast version) of VRAM.

An RTX 2080 Ti with 11GB VRAM could be a nice upgrade. Prices in the used market are dropping ($250 or less at the time of writing), and it delivers a solid 3x speed boost compared to the M40. Double the cost for triple the speed? That’s a pretty sweet deal!

How about Apple Silicon? The M2 Pro with its Metal GPU is roughly 25% faster than the M40. It wins easily in areas like portability, efficiency, and noise levels, but it comes with a significantly higher cost.

Coding assistance is a proven home-run use case for powerful LLMs. This is where the M40’s massive 24GB VRAM shines, enabling you to run the fantastic Qwen-2.5 Coder 32B model. Pair it with Continue.dev as your coding assistant, and you’ve got a powerful combo that could replace tools like GitHub Copilot or Codeium, particularly for medium-complexity projects.
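
One way to wire this up (a sketch of the setup, with the filename and port as illustrative placeholders): serve the quantized model with llama.cpp’s llama-server, then point Continue at the local endpoint.

    llama-server -m qwen2.5-coder-32b-instruct-q4_k_m.gguf -ngl 99 --port 8080

Continue supports local backends (llama.cpp, Ollama, and friends), so completions and chat requests go to that server instead of a cloud API.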

The best part? Privacy and data security. With local LLM inference, your precious source code stays on your machine.

Now, should I go all in? Is it time to add a second M40?
