The Fastest Llama: Uncovering the Speed of LLMs

Francesco Mattia

Posted on September 1, 2024

Curious about LLM Speed? I Tested Local vs Cloud GPUs (and CPUs too!)

I've been itching to compare the speed of locally-run LLMs against the big players like OpenAI and Anthropic. So, I decided to put my curiosity to the test with a series of experiments across different hardware setups.

I started with LM Studio and Ollama on my trusty laptop, but then I thought, "Why not push it further?" So, I fired up my PC with an RTX 3070 GPU and dove into some cloud options like RunPod, AWS, and vast.ai. I wanted to see not just the speed differences but also get a handle on the costs involved.

Now, I'll be the first to admit my test wasn't exactly scientific. I used just two prompts for inference, which some might argue is a bit basic. But hey, it gives us a solid starting point to compare speeds across different GPUs and to understand the difference between prompt evaluation (input) and response (output) speeds.

Check out this table of results.

| Device | Cost/hr | Phi-3 input (t/s) | Phi-3 output (t/s) | Phi-3 IO ratio | Llama3 input (t/s) | Llama3 output (t/s) | Llama3 IO ratio |
| --- | --- | --- | --- | --- | --- | --- | --- |
| M1 Pro | - | 96.73 | 30.63 | 3.158 | 59.12 | 25.44 | 2.324 |
| RTX 3070 | - | 318.68 | 103.12 | 3.090 | 167.48 | 64.15 | 2.611 |
| g5g.xlarge (T4G) | $0.42 | 185.55 | 60.85 | 3.049 | 88.61 | 42.33 | 2.093 |
| g5.12xlarge (4x A10G) | $5.672 | 266.46 | 105.97 | 2.514 | 131.36 | 68.07 | 1.930 |
| A40 (runpod) | $0.49 (spot) | 307.51 | 123.73 | 2.485 | 153.41 | 79.33 | 1.934 |
| L40 (runpod) | $0.69 (spot) | 444.29 | 154.22 | 2.881 | 212.25 | 97.51 | 2.177 |
| RTX 4090 (runpod) | $0.49 (spot) | 470.42 | 168.08 | 2.799 | 222.27 | 101.43 | 2.191 |
| 2x RTX 4090 (runpod) | $0.99 (spot) | 426.73 | 40.95 | 10.4 | 168.60 | 111.34 | 1.51 |
| RTX 3090 (vast.ai) | $0.24 | 335.49 | 142.02 | 2.36 | 145.47 | 88.99 | 1.63 |
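
A quick note on the columns: the input and output figures are the prompt-evaluation and generation rates that ollama reports for each request, and the IO ratio is simply input speed divided by output speed (so the M1 Pro, for example, evaluates Phi-3 prompts roughly 3x faster than it generates tokens). If you want to see where those numbers come from without the full benchmark script, here's a minimal sketch that reads the same metrics straight from ollama's /api/generate response. It assumes ollama is running locally on its default port with the models already pulled; it is not the llm-benchmark code itself.

```python
# Minimal sketch: compute input/output tokens-per-second from ollama's
# /api/generate metrics (durations are reported in nanoseconds).
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local ollama endpoint

def measure(model: str, prompt: str) -> None:
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=600,
    ).json()

    input_tps = resp["prompt_eval_count"] / (resp["prompt_eval_duration"] / 1e9)
    output_tps = resp["eval_count"] / (resp["eval_duration"] / 1e9)
    print(f"{model}: input {input_tps:.2f} t/s, output {output_tps:.2f} t/s, "
          f"IO ratio {input_tps / output_tps:.2f}")

for model in ("phi3", "llama3"):
    measure(model, "Why is the sky blue?")
```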

Setup and Specs: For the Tech-Curious

I ran tests on a variety of setups, from cloud services to my local machines. Below is a quick rundown of the hardware. I wrote in more detail about running LLMs in the cloud here.

The benchmarks are run using the Python scripts at https://github.com/MinhNgyuen/llm-benchmark.git, which lean on ollama for inference. So on every environment the steps are the same: set up ollama and Python, pull the models we want to test, and run the scripts.

On runpod (starting from ollama/ollama Docker template):

# basic setup (on ubuntu)
apt-get update
apt-get install -y python3 python3-pip python3.10-venv git

# pull the models we want to test
ollama pull phi3; ollama pull llama3

# create and activate a python virtual environment
python3 -m venv venv
source venv/bin/activate

# download the benchmarking script and install its dependencies
git clone https://github.com/MinhNgyuen/llm-benchmark.git
cd llm-benchmark
pip install -r requirements.txt

# run the benchmarking script against the installed models with two prompts
python benchmark.py --verbose --skip-models nomic-embed-text:latest --prompts "Why is the sky blue?" "Write a report on the financials of Nvidia"

System specs

| Environment | Hardware specification | VRAM | Software |
| --- | --- | --- | --- |
| AWS EC2 | g5g.xlarge, Nvidia T4G | 16GB | ollama |
| AWS EC2 | g5.12xlarge, 4x Nvidia A10G | 96GB | ollama |
| runpod | Nvidia A40 | 48GB | ollama |
| runpod | Nvidia L40 | 48GB | ollama |
| runpod | Nvidia RTX 4090 | 24GB | ollama |
| runpod | 2x Nvidia RTX 4090 | 48GB | ollama |
| vast.ai | Nvidia RTX 3090 | 24GB | ollama |
| Local | Mac M1 Pro, 8 CPU cores (6P + 2E) + 14 GPU cores | 16GB (unified) | LM Studio |
| Local | PC, Nvidia RTX 3070 (LLM on GPU) | 8GB | LM Studio |
| Local | PC, Ryzen 5500, 6 CPU cores (LLM on CPU) | 64GB (system RAM) | LM Studio |
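
Several of these setups pair a fairly large model with a modest amount of VRAM, so on the ollama-based machines it can be worth double-checking that the whole model actually stays on the GPU instead of spilling layers to the CPU. Here's a small sketch using ollama's /api/ps endpoint (available in recent ollama releases; field names may vary slightly across versions):

```python
# Minimal sketch: ask ollama which models are loaded and how much of each
# sits in VRAM (GET /api/ps lists the currently running models).
import requests

ps = requests.get("http://localhost:11434/api/ps", timeout=10).json()
for m in ps.get("models", []):
    size = m.get("size", 0)
    size_vram = m.get("size_vram", 0)
    placement = "fully on GPU" if size_vram >= size else "partially on CPU"
    print(f'{m["name"]}: {size_vram / 1e9:.1f} GB in VRAM ({placement})')
```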

But Wait, What About CPUs?

Curious about CPU performance compared to GPUs? I ran a quick test to give you an idea. I used a single prompt across three different setups:

  1. A Mac, which uses its integrated GPUs
  2. A PC with an Nvidia GPU, which, as expected, gave the best speed results
  3. A PC running solely on its CPU

For this test, I used LM Studio, which lets you choose where to load the LLM layers, so you can conveniently decide whether or not to use your system's GPU. I ran the tests with the temperature set to 0, using the prompt "Who is the president of the US?"
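
LM Studio shows TTFT (time to first token) and tokens per second in its UI, but you can also get rough numbers yourself through its local OpenAI-compatible server. A minimal sketch, assuming the LM Studio server is running on its default port (1234) with a model loaded; "local-model" is a placeholder name, and counting streamed chunks is only an approximation of the token count:

```python
# Minimal sketch: rough TTFT and generation-speed measurement against
# LM Studio's local OpenAI-compatible server (default: http://localhost:1234/v1).
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

start = time.perf_counter()
first_token_at = None
chunks = 0

stream = client.chat.completions.create(
    model="local-model",  # placeholder; LM Studio serves whichever model is loaded
    messages=[{"role": "user", "content": "Who is the president of the US?"}],
    temperature=0,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        chunks += 1  # each streamed chunk is roughly one token

elapsed = max(time.perf_counter() - first_token_at, 1e-6)
print(f"TTFT: {first_token_at - start:.2f}s, ~{chunks / elapsed:.0f} tok/s (approx.)")
```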

Here are the results:

| Model | Device | TTFT (time to first token) | Generation speed |
| --- | --- | --- | --- |
| Phi3 mini 4k instruct q4 | M1 Pro | 0.04s | ~35 tok/s |
| Phi3 mini 4k instruct q4 | RTX 3070 | 0.01s | ~97 tok/s |
| Phi3 mini 4k instruct q4 | Ryzen 5 | 0.07s | ~13 tok/s |
| Meta Llama 3 Instruct 7B | M1 Pro | 0.17s | ~23 tok/s |
| Meta Llama 3 Instruct 7B | RTX 3070 | 0.02s | ~64 tok/s |
| Meta Llama 3 Instruct 7B | Ryzen 5 | 0.13s | ~7 tok/s |
| Gemma It 2B Q4_K_M | M1 Pro | 0.02s | ~63 tok/s |
| Gemma It 2B Q4_K_M | RTX 3070 | 0.01s | ~170 tok/s |
| Gemma It 2B Q4_K_M | Ryzen 5 | 0.05s | ~23 tok/s |

My takeaways

  1. Dedicated GPUs are speed demons: They outperform Macs when it comes to inference speed, especially considering the costs.

  2. Size matters (for models): Smaller models can provide a viable experience even on lower-end hardware, as long as you've got the RAM or VRAM to back it up (a rough way to estimate this follows the list).

  3. CPUs? Not so hot for inference: Your average desktop CPU is still far slower than a dedicated GPU.

  4. Gaming GPUs for the win: A beastly gaming GPU like the 4090 is quite cost-effective and can deliver top-notch results, comparable to an H100. Multiple GPUs didn't necessarily make things faster in this scenario.
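
On the RAM/VRAM point from takeaway 2, a hand-wavy way to estimate whether a quantized model fits is to take parameters × bits ÷ 8 for the weights and leave a little headroom for the KV cache and runtime. A rough sketch (the overhead figure is a guess, and real usage also grows with context length):

```python
# Back-of-the-envelope memory estimate for a quantized model.
# Weights take roughly params * bits / 8 bytes; overhead_gb is a rough
# allowance for the KV cache and runtime, not a measured value.
def approx_memory_gb(params_billion: float, quant_bits: int, overhead_gb: float = 1.5) -> float:
    weights_gb = params_billion * quant_bits / 8  # e.g. 8B params at 4-bit ≈ 4 GB of weights
    return weights_gb + overhead_gb

for name, params_b, bits in [("phi3 (3.8B, q4)", 3.8, 4), ("llama3 (8B, q4)", 8, 4)]:
    print(f"{name}: ~{approx_memory_gb(params_b, bits):.1f} GB")
```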

This little experiment has been a real eye-opener for me, and I'm eager to dive deeper. I'd love to hear your thoughts! What other tests would you like to see? Any specific hardware or models you're curious about?
