The Fastest Llama: Uncovering the Speed of LLMs
Francesco Mattia
Posted on September 1, 2024
Curious about LLM Speed? I Tested Local vs Cloud GPUs (and CPUs too!)
I've been itching to compare the speed of locally-run LLMs against the big players like OpenAI and Anthropic. So, I decided to put my curiosity to the test with a series of experiments across different hardware setups.
I started with LM Studio and Ollama on my trusty laptop, but then I thought, "Why not push it further?" So, I fired up my PC with an RTX 3070 GPU and dove into some cloud options like RunPod, AWS, and vast.ai. I wanted to see not just the speed differences but also get a handle on the costs involved.
Now, I'll be the first to admit my test wasn't exactly scientific. I used just two prompts for inference, which some might argue is a bit basic. But hey, it gives us a solid starting point to compare speeds across different GPUs and to understand the difference between prompt evaluation (input) and response generation (output) speeds.
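A quick note on what "input" and "output" speed mean here: Ollama reports token counts and durations for both the prompt evaluation phase and the generation phase with every response, so the two rates fall straight out of those numbers. Here is a minimal sketch of the calculation, hitting Ollama's /api/generate endpoint directly (durations come back in nanoseconds):
import requests

# ask the local Ollama server for a completion; non-streaming, so the timing stats arrive in one JSON blob
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "phi3", "prompt": "Why is the sky blue?", "stream": False},
).json()

# tokens per second for prompt evaluation (input) and generation (output)
input_tps = resp["prompt_eval_count"] / (resp["prompt_eval_duration"] / 1e9)
output_tps = resp["eval_count"] / (resp["eval_duration"] / 1e9)
print(f"input: {input_tps:.2f} t/s, output: {output_tps:.2f} t/s, IO ratio: {input_tps / output_tps:.2f}")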
Check out this table of results.
Device | Cost/hr | Phi-3 input (t/s) | Phi-3 output (t/s) | Phi-3 IO ratio | Llama3 input (t/s) | Llama3 output (t/s) | Llama3 IO ratio |
---|---|---|---|---|---|---|---|
M1 Pro | - | 96.73 | 30.63 | 3.158 | 59.12 | 25.44 | 2.324 |
RTX 3070 | - | 318.68 | 103.12 | 3.090 | 167.48 | 64.15 | 2.611 |
g5g.xlarge (T4G) | $0.42 | 185.55 | 60.85 | 3.049 | 88.61 | 42.33 | 2.093 |
g5.12xlarge (4x A10G) | $5.672 | 266.46 | 105.97 | 2.514 | 131.36 | 68.07 | 1.930 |
A40 (runpod) | $0.49 (spot) | 307.51 | 123.73 | 2.485 | 153.41 | 79.33 | 1.934 |
L40 (runpod) | $0.69 (spot) | 444.29 | 154.22 | 2.881 | 212.25 | 97.51 | 2.177 |
RTX 4090 (runpod) | $0.49 (spot) | 470.42 | 168.08 | 2.799 | 222.27 | 101.43 | 2.191 |
2x RTX 4090 (runpod) | $0.99 (spot) | 426.73 | 40.95 | 10.4 | 168.60 | 111.34 | 1.51 |
RTX 3090 (vast.ai) | $0.24 | 335.49 | 142.02 | 2.36 | 145.47 | 88.99 | 1.63 |
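To put the hourly prices into perspective, here's a rough back-of-the-envelope conversion to cost per million output tokens. It assumes the GPU is kept fully busy for the whole hour, which real workloads rarely manage, so treat it as a best case rather than a realistic quote.
# rough cost per 1M output tokens at full utilization (best case)
def cost_per_million_tokens(usd_per_hour: float, tokens_per_second: float) -> float:
    tokens_per_hour = tokens_per_second * 3600
    return usd_per_hour / tokens_per_hour * 1_000_000

# RTX 4090 spot instance on runpod, Llama3 output speed from the table above
print(f"${cost_per_million_tokens(0.49, 101.43):.2f} per 1M output tokens")  # ~$1.34
By the same arithmetic, the 4x A10G instance at $5.672/hr lands around $23 per million output tokens, which is a big part of why the consumer cards look so attractive here.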
Setup and Specs: For the Tech-Curious
I ran tests on a variety of setups, from cloud services to my local machines. Below is a quick rundown of the hardware. I wrote in more detail about running LLMs in the cloud here.
The benchmarks are run using these Python scripts: https://github.com/MinhNgyuen/llm-benchmark.git, which lean on Ollama for inference. So in any environment we need to set up Ollama and Python, pull the models we want to test, and then run the benchmark.
On RunPod (starting from the ollama/ollama Docker template):
# basic setup (on Ubuntu)
apt-get update
apt-get install -y python3 python3-pip python3-venv git
# pull the models we want to test
ollama pull phi3; ollama pull llama3
# create and activate a Python virtual environment for the benchmark
python3 -m venv venv
source venv/bin/activate
# download the benchmarking script and install its dependencies
git clone https://github.com/MinhNgyuen/llm-benchmark.git
cd llm-benchmark
pip install -r requirements.txt
# run the benchmarking script against the installed models with these prompts
python benchmark.py --verbose --skip-models nomic-embed-text:latest --prompts "Why is the sky blue?" "Write a report on the financials of Nvidia"
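If the benchmark errors out, a quick sanity check that the Ollama server is reachable and the models are actually pulled can save some head-scratching:
# sanity check: list the pulled models and fire a one-off generation at the Ollama API
ollama list
curl http://localhost:11434/api/generate -d '{"model": "phi3", "prompt": "Why is the sky blue?", "stream": false}'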
System specs
Environment | Hardware Specification | Memory | Software |
---|---|---|---|
AWS EC2 | g5g.xlarge, Nvidia T4G | 16GB VRAM | ollama |
AWS EC2 | g5.12xlarge, 4x Nvidia A10G | 96GB VRAM | ollama |
runpod | Nvidia A40 | 48GB VRAM | ollama |
runpod | Nvidia L40 | 48GB VRAM | ollama |
runpod | Nvidia RTX 4090 | 24GB VRAM | ollama |
runpod | 2x Nvidia RTX 4090 | 48GB VRAM | ollama |
vast.ai | Nvidia RTX 3090 | 24GB VRAM | ollama |
Local Mac | M1 Pro, 8 CPU cores (6p + 2e) + 14 GPU cores | 16GB unified (V)RAM | LM Studio |
Local PC | Nvidia RTX 3070, LLM on GPU | 8GB VRAM | LM Studio |
Local PC | Ryzen 5500, 6 CPU cores, LLM on CPU | 64GB RAM | LM Studio |
But Wait, What About CPUs?
Curious about CPU performance compared to GPUs? I ran a quick test to give you an idea. I used a single prompt across three different setups:
- A Mac, which uses its integrated GPUs
- A PC with an Nvidia GPU, which expectedly gave the best speed results
- A PC running solely on its CPU
For this test, I used LM Studio, which gives you flexibility over where to load the LLM layers, conveniently letting you choose whether or not to use your system's GPU. I ran the tests with the temperature set to 0, using the prompt "Who is the president of the US?"
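As an aside, LM Studio also runs an OpenAI-compatible local server (on port 1234 by default), so instead of reading the numbers off the UI you can measure TTFT and generation speed with a few lines of Python. A rough sketch follows; the model name is a placeholder for whatever identifier LM Studio shows for your loaded model, and counting streamed chunks only approximates counting tokens:
import time
from openai import OpenAI

# LM Studio exposes an OpenAI-compatible server; the API key is ignored but the client requires one
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

start = time.perf_counter()
first_token_at = None
chunks = 0

# stream the response so the first token can be timed separately from the rest
stream = client.chat.completions.create(
    model="phi-3-mini-4k-instruct",  # placeholder: use the identifier LM Studio shows for your model
    messages=[{"role": "user", "content": "Who is the president of the US?"}],
    temperature=0,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        chunks += 1

# time to first token, plus a rough tokens-per-second figure (chunks ~ tokens)
print(f"TTFT: {first_token_at - start:.2f}s")
print(f"~{chunks / (time.perf_counter() - first_token_at):.0f} tok/s")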
Here are the results:
Model | Device | TTFT (time to first token) | Speed |
---|---|---|---|
Phi3 mini 4k instruct q4 | M1 Pro | 0.04s | ~35 tok/s |
Phi3 mini 4k instruct q4 | RTX 3070 | 0.01s | ~97 tok/s |
Phi3 mini 4k instruct q4 | Ryzen 5 | 0.07s | ~13 tok/s |
Meta Llama 3 Instruct 8B | M1 Pro | 0.17s | ~23 tok/s |
Meta Llama 3 Instruct 8B | RTX 3070 | 0.02s | ~64 tok/s |
Meta Llama 3 Instruct 8B | Ryzen 5 | 0.13s | ~7 tok/s |
Gemma It 2B Q4_K_M | M1 Pro | 0.02s | ~63 tok/s |
Gemma It 2B Q4_K_M | RTX 3070 | 0.01s | ~170 tok/s |
Gemma It 2B Q4_K_M | Ryzen 5 | 0.05s | ~23 tok/s |
My takeaways
Dedicated GPUs are speed demons: They outperform Macs when it comes to inference speed, especially considering the costs.
Size matters (for models): Smaller models can provide a viable experience even on lower-end hardware, as long as you've got the RAM or VRAM to back it up.
CPUs? Not so hot for inference: Your average desktop CPU is still vastly slower than a dedicated GPU.
Gaming GPUs for the win: A beastly gaming GPU like the 4090 is quite cost-effective and can deliver top-notch results, comparable to an H100. Multiple GPUs didn't necessarily make things faster in this scenario.
This little experiment has been a real eye-opener for me, and I'm eager to dive deeper. I'd love to hear your thoughts! What other tests would you like to see? Any specific hardware or models you're curious about?