Getting Started with llama.cpp on Linux! (Updated+)


Shanu Kumawat

Posted on March 19, 2024


Introduction

llama.cpp is a wonderful project for running LLMs locally on your system. It is lightweight, provides state-of-the-art performance, and supports GPU offloading, so you can use your GPU to accelerate inference.

I personally use it to run LLMs on my Arch system, and I found it performs better than Ollama. While setting it up, I found the documentation confusing and there were no guides specifically for Arch Linux, so I decided to write this article after figuring things out.
So, let's get started.

There are four ways to use it:

  • Method 1: Clone the repository and build it locally (see the build instructions)
  • Method 2: If you are using macOS or Linux, install llama.cpp via brew, flox, or nix
  • Method 3: Use a Docker image (see the Docker documentation)
  • Method 4: Download a pre-built binary from the releases page

Guide

Method 1

  1. Let's start by cloning the repo and changing into its directory:
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
  2. Next we need to build the project. This can be done with plain make, but if, like me, you have an Nvidia GPU and want to use it for offloading, you will need to build with CUDA support, which requires the CUDA toolkit. You can install it from your Linux distro's package manager (e.g. apt install nvidia-cuda-toolkit); here I am using an AUR helper, but you can also install it manually.
paru -S cuda
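
Once installed, you can quickly check that the toolkit is on your PATH (nvcc ships with the CUDA toolkit):

nvcc --version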

Now, to build the project with CUDA, run:

make GGML_CUDA=1

Using CMake:

cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

It might take a while. Once the build is finished, we can finally run LLMs.

The environment variable GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 can be used to enable unified memory on Linux. This allows swapping to system RAM instead of crashing when the GPU VRAM is exhausted.
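
For example, you could export it in your shell before launching any of the binaries built above:

# let the CUDA backend spill over into system RAM when VRAM runs out
export GGML_CUDA_ENABLE_UNIFIED_MEMORY=1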


Method 2

Homebrew

On Mac and Linux, the Homebrew package manager can be used via

brew install llama.cpp

The formula is automatically updated with new llama.cpp releases. More info: https://github.com/ggerganov/llama.cpp/discussions/7668
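
To confirm the install, you can check that the binaries are on your PATH (assuming the formula installs llama-cli and llama-server, as recent releases do):

which llama-cli llama-server
llama-cli --version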

Nix

On Mac and Linux, the Nix package manager can be used via

nix profile install nixpkgs#llama-cpp

For flake-enabled installations.

Or

nix-env --file '<nixpkgs>' --install --attr llama-cpp

For non-flake-enabled installations.

This expression is automatically updated within the nixpkgs repo.

Flox

On Mac and Linux, Flox can be used to install llama.cpp within a Flox environment via

flox install llama-cpp

Flox follows the nixpkgs build of llama.cpp.


Method 3

Follow the Docker documentation here.
It's well documented; you will not have any problems following it.
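
If you want a feel for what that route looks like, here is a rough sketch based on the upstream Docker docs (the image tag, model path, and flags are illustrative and may have changed):

# serve a GGUF model from a local models directory on port 8080
docker run -v /path/to/models:/models -p 8080:8080 ghcr.io/ggerganov/llama.cpp:server \
    -m /models/model.gguf --port 8080 --host 0.0.0.0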

How to use

In the llama.cpp folder (or in build/bin, if you built with CMake) you will find two executables, llama-cli and llama-server, which can be used to run the model.
For CLI mode, just run:

./llama-cli -m path-to-model -n no-of-tokens -ngl no-of-layers-to-offload-to-gpu
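
For example (the model path here is only illustrative; point -m at whatever GGUF file you have downloaded):

./llama-cli -m ./models/mistral-7b-instruct-v0.2.Q4_K_M.gguf -n 256 -ngl 33 -p "Tell me a joke."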

And to start the server, run:

./llama-server -m path-to-model -n no-of-tokens -ngl no-of-layers-to-offload-to-gpu
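
For example (again with an illustrative model path):

./llama-server -m ./models/mistral-7b-instruct-v0.2.Q4_K_M.gguf -ngl 33 --port 8080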

You can now open your browser, go to http://localhost:8080/, and a web UI will appear.

[Screenshot: llama.cpp web UI]
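
The server also exposes an HTTP API, so you can query it directly instead of using the web UI. For example, here is a completion request with curl against the server's documented /completion endpoint (the prompt and token count are just examples):

curl http://localhost:8080/completion -H "Content-Type: application/json" \
    -d '{"prompt": "Building a website can be done in 10 simple steps:", "n_predict": 64}'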

Now you can have fun with your local LLM.
I hope you found this article helpful.
