What would run Qwen3-0.6B faster on CPU: a Surface Go from 2018 or a Raspberry Pi 5 from 2023?
What if you knew the respective CPUs: an Intel Pentium 4415Y versus an ARM Cortex A76; did your guess change?
The 4415Y processor is clocked at 1.6 GHz and the A76 is around 3.0 GHz. How much faster would it run? 2x? 5x?
Answer: it’s the Raspberry Pi 5 and by a difference of at least 10x.
## Finding a mobile LLM runner
While spring cleaning my collection of hardware, I found the two aforementioned devices. These are equipped with mobile CPUs, but are formidable homelab server machines IMO. Both
- have 8 GB RAM,
- have 128 GB of storage (NVMe on the Surface Go, MicroSD on the Pi 5),
- take < 7 W of idle power, and
- are Wi-Fi enabled.
Paired with tiny contemporary models like Qwen3-0.6B, a Tailnet VPN, and the hype surrounding OpenClaw and friends, I thought I’d set up my own home agent without succumbing to the Mac Mini fervor.
## Benchmarking
To keep experiments controlled, I used llama.cpp (a wonderful piece of software). The only constraint is that it must run on the CPU. Whatever optimizations are enabled on those chips are fair game.
Both devices are on Debian (Pi 5 on Raspberry Pi OS, which is based on Debian). Pro tip: if you own a Surface and want to repurpose it as a Linux server, check out linux-surface. I wouldn’t do this benchmark on Windows.
### Setup
I grabbed a copy of the llama.cpp repo, checked out a recent release, and built the suite of tools.

```shell
# Get the code.
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp/
# Go to a recent release.
git checkout b8407
# Configure.
cmake -S ~/llama.cpp \
    -B ~/llama.cpp/build \
    -G Ninja \
    -DCMAKE_BUILD_TYPE=Release
# Build.
cmake --build ~/llama.cpp/build \
    --config Release \
    -j4
```
Once those were ready, I ran llama-bench with Qwen3-0.6B Q4_K_M from Unsloth. To keep things fair, I limited the run to 2 threads (the Surface Go only has 2 cores).

```shell
./build/bin/llama-bench -t 2 -m ~/Models/Qwen3-0.6B-Q4_K_M.gguf
```
### Results
I’ve formatted the output to be a little nicer. The metric is tokens per second (t/s) for two stages of the LLM inference pipeline: pp512 and tg128.
| Device | # threads | pp512 (t/s) | tg128 (t/s) |
|---|---|---|---|
| Surface Go | 2 | 2.87 ± 0.00 | 2.39 ± 0.00 |
| Raspberry Pi 5 | 2 | 68.35 ± 0.19 | 25.71 ± 0.13 |
| Raspberry Pi 5 | 4 | 122.48 ± 1.36 | 25.08 ± 0.79 |
pp512 stands for prompt processing (512 tokens). It measures how fast the device processes input tokens (a.k.a. the prefill stage). A higher pp512 means a lower time to first token (TTFT).

tg128 stands for text generation (128 tokens). It measures how fast the device generates tokens once the prefill stage is complete. A higher tg128 means a lower time per output token (TPOT).
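To make those throughputs concrete, here’s a quick back-of-the-envelope conversion into wall-clock time, assuming a full 512-token prompt and 128 generated tokens, using the 2-thread means from the table:

```shell
# Convert the measured t/s means into seconds of wall-clock time
# for a 512-token prefill and 128 generated tokens.
awk 'BEGIN {
  printf "Surface Go : prefill %5.1f s, generation %5.1f s\n", 512/2.87,  128/2.39
  printf "Pi 5 (2t)  : prefill %5.1f s, generation %5.1f s\n", 512/68.35, 128/25.71
}'
```

In other words, the Surface Go would spend about three minutes just reading a 512-token prompt before emitting its first token, where the Pi 5 takes under eight seconds.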
## Analysis
The numbers translate to completely unusable on the Surface and pretty decent on the Pi 5! My gut said the Pentium would run circles around the Cortex, but the opposite is true. At two threads, the Pi 5 delivers roughly 24x faster prompt processing (over 40x with all four cores) and ~10x faster token generation than the Surface Go.
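The speedup ratios come straight from the table; a one-liner to double-check them:

```shell
# Speedups of the Pi 5 over the Surface Go, from the measured means.
awk 'BEGIN {
  printf "pp512, 2 threads: %.1fx\n", 68.35/2.87    # prefill speedup
  printf "pp512, 4 threads: %.1fx\n", 122.48/2.87   # prefill, all cores
  printf "tg128, 2 threads: %.1fx\n", 25.71/2.39    # generation speedup
}'
```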
So what gave the Pi 5 such a leg up over the Surface Go?
## It’s where the corners are cut
I initially thought it came down to the era in which the chips were released. The Pentium 4415Y launched in Q2 2017, about 9 years ago. The ARM Cortex A76? 2018, just a year later. “AI” was absent from the layman’s vocabulary back then, so that wasn’t it.
Remember how I decided that all CPU optimizations are fair game? That turns out to be the Surface’s Achilles heel. Scientific computing thrives on single instruction, multiple data (SIMD) features in a CPU. These let the CPU apply the same arithmetic operation to many data elements with a single machine instruction. A CPU with rich SIMD support turbocharges LLM inference, since an LLM is essentially one giant pile of multiply-adds.
The intense corner-cutting on the Pentium 4415Y chip (to keep its power draw low) left it without AVX or AVX2, so SSE2 is its saving grace. llama.cpp will leverage SSE2, but that path only provides baseline performance. The clock speed doesn’t help much either. Conversely, the Cortex A76 is packed with SIMD features. Among the available SIMD instructions in the Cortex A76 are:
- dot products (`__ARM_FEATURE_DOTPROD`),
- fused multiply-accumulate (`__ARM_FEATURE_FMA`), and
- 16-bit float vector arithmetic (`__ARM_FEATURE_FP16_VECTOR_ARITHMETIC`) with native 16-bit float types (`__ARM_NEON_FP`).
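If you want to see what your own CPU advertises, the kernel exposes these capabilities in `/proc/cpuinfo`. A quick sketch (the flag spellings are the Linux ones, e.g. `asimddp` for DOTPROD and `asimdhp`/`fphp` for FP16 on ARM, `sse2`/`avx2` on x86; the list grepped for below is illustrative, not exhaustive):

```shell
# Inspect the SIMD features the kernel reports for this CPU.
# On x86 the relevant line is "flags"; on ARM it is "Features".
grep -m1 -iE '^(flags|features)' /proc/cpuinfo | tr ' ' '\n' | \
  grep -iE '^(sse2|avx|avx2|asimddp|asimdhp|fphp)$' \
  || echo "none of the listed flags found"
```

On the Pi 5 you should see `asimddp`, `asimdhp`, and `fphp` in the output; on the Surface Go, `sse2` but no `avx2`.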
Big props to the ARM team for designing such a power-efficient chip with all these capabilities.
## EOF
I’ll keep benchmarking to find the right model for my local LLM device. Qwen3-0.6B is a good start. If I can get better-quality outputs at ~30 t/s with another model, I’d pivot. As of right now, Qwen3.5-0.8B is a bit sluggish on the Pi 5, although its outputs show some promise.
Also, PicoClaw crossed my radar recently. It’s a clone of OpenClaw, lean on resources, and suits my experimental setup. Hosting the LLM close to the agent seems like a good idea.
I also want to benchmark more modern chips. AI compute has certainly risen to first-class-citizen status in recent silicon, and I wouldn’t be surprised if newer CPUs knock these benchmarks out of the park. What matters now is getting the best bang for the buck.