Introducing GPUStack: An open-source GPU cluster manager for running LLMs

What is GPUStack?

We are thrilled to launch GPUStack, an open-source GPU cluster manager for running Large Language Models (LLMs). Even though LLMs are widely available as public cloud services, organizations cannot easily host their own LLM deployments for private use. They need to install and manage complex clustering software such as Kubernetes and then figure out how to install and manage the AI tool stack on top. Popular ways to run LLMs locally, such as LMStudio and LocalAI, work only on a single machine.

GPUStack allows you to create a unified cluster from any brand of GPUs in Apple MacBooks, Windows PCs, and Linux servers. Administrators can deploy LLMs from popular repositories such as Hugging Face. Developers can then access LLMs just as easily as accessing public LLM services from vendors like OpenAI or Microsoft Azure.

For more details about GPUStack, visit:

GitHub repo: https://github.com/gpustack/gpustack

User guide: https://docs.gpustack.ai

Why GPUStack?

Today, organizations that want to host LLMs on a cluster of GPU servers have to do a lot of work to integrate a complex software stack. By using GPUStack, organizations no longer need to worry about cluster management, GPU optimization, LLM inference engines, usage and metering, user management, API access, and dashboard UI. GPUStack is a complete software platform for building your own LLM-as-a-Service (LLMaaS).

As the following figure illustrates, the admin deploys models into GPUStack from a repository like Hugging Face, and then developers can connect to GPUStack to use these models in their applications.

Key features of GPUStack

GPU cluster setup and resource aggregation

GPUStack aggregates all GPU resources within a cluster. It is designed to support all GPU vendors, including Nvidia, Apple, AMD, Intel, Qualcomm, and others. GPUStack is compatible with laptops, desktops, workstations, and servers running macOS, Windows, and Linux.

The initial release of GPUStack supports Windows PCs and Linux servers with Nvidia graphics cards, and Apple Macs.

Deployment and Inference for Models

GPUStack supports distributed deployment and inference of LLMs across a cluster of GPU machines.

GPUStack selects the best inference engine for running the given LLM on the given GPU. The first LLM inference engine supported by GPUStack is llama.cpp, which allows GPUStack to support GGUF models from Hugging Face and all models listed in the Ollama library (ollama.com/library).

You can run any model on GPUStack by first converting it to GGUF format and uploading it to Hugging Face or the Ollama library.

Support for other inference engines, such as vLLM, is on our roadmap.

Note: GPUStack will automatically schedule the model you select to run on machines with appropriate resources, relieving you of manual intervention. If you want to assess the resource consumption of your chosen model, you can use our GGUF Parser project: https://github.com/gpustack/gguf-parser-go. We intend to provide more detailed tutorials in the future.
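To make the idea of resource-aware scheduling concrete, here is a minimal sketch of one possible placement policy: choose the worker with the most free GPU memory that can still fit the model. This is not GPUStack's actual scheduler, which this post does not describe. The `Worker` type, the `pick_worker` function, and all memory figures are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Worker:
    """Hypothetical view of a worker node; not a GPUStack type."""
    name: str
    free_vram_gib: float


def pick_worker(workers: List[Worker], required_gib: float) -> Optional[Worker]:
    """Return the fitting worker with the most free VRAM, or None if none fits."""
    candidates = [w for w in workers if w.free_vram_gib >= required_gib]
    if not candidates:
        return None
    return max(candidates, key=lambda w: w.free_vram_gib)


# Illustrative cluster; the numbers are made up.
workers = [
    Worker("macbook", free_vram_gib=12.0),
    Worker("linux-server", free_vram_gib=70.0),
    Worker("windows-pc", free_vram_gib=6.0),
]

chosen = pick_worker(workers, required_gib=8.0)
print(chosen.name if chosen else "no worker fits")
```

A real scheduler would also weigh factors such as CPU memory, models already running, and GPU vendor support; the sketch only shows why an estimate of a model's memory needs is useful input.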

GPU acceleration is recommended for inference, but GPUStack also supports CPU inference, which is slower. Alternatively, using a mix of GPU and CPU for inference can maximize resource utilization, which is particularly useful in edge or resource-constrained environments.
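Mixed GPU and CPU inference usually means placing some of a model's layers on the GPU and leaving the rest on the CPU. The sketch below estimates such a split with simple arithmetic. It assumes all layers are the same size and ignores the KV cache and other runtime overhead, so treat it as a rough illustration only; `split_layers` and the figures are hypothetical, not part of GPUStack.

```python
from typing import Tuple


def split_layers(total_layers: int, layer_size_gib: float,
                 free_vram_gib: float) -> Tuple[int, int]:
    """Return (gpu_layers, cpu_layers) under a uniform-layer-size assumption."""
    if layer_size_gib <= 0:
        raise ValueError("layer_size_gib must be positive")
    # How many whole layers fit in the free VRAM, clamped to [0, total_layers].
    fits = int(free_vram_gib // layer_size_gib)
    gpu_layers = max(0, min(total_layers, fits))
    return gpu_layers, total_layers - gpu_layers


# Illustrative numbers: 32 layers of 0.5 GiB each, 6 GiB of free VRAM.
gpu_layers, cpu_layers = split_layers(32, 0.5, 6.0)
print(f"GPU layers: {gpu_layers}, CPU layers: {cpu_layers}")
```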

Easy integration with your applications

GPUStack offers OpenAI-compatible APIs and provides an LLM playground along with API keys. The playground enables AI developers to experiment with and customize your LLMs, and seamlessly integrate them into AI-enabled applications.

Additionally, you can use the metrics GPUStack provides to understand how your AI applications utilize various LLMs. This helps administrators manage GPU resource consumption effectively.

Observability metrics for GPUs and LLMs

GPUStack provides comprehensive metrics for performance, utilization, and status monitoring.

For GPUs, administrators can use GPUStack to monitor real-time resource utilization and system status. Based on these metrics:

Administrators perform scaling, optimization, and other maintenance operations.
GPUStack adjusts its model scheduling algorithm.

For LLMs, developers can use GPUStack to access metrics like token throughput, token usage, and API request throughput. These metrics help developers evaluate model performance and optimize their applications. GPUStack plans to support auto-scaling based on these inference performance metrics in future releases.
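As an illustration of what a token throughput figure means, the sketch below derives tokens per second from a single request. It assumes the response carries a `usage` object with `prompt_tokens` and `completion_tokens`, as the OpenAI chat completions format defines. How GPUStack computes its own metrics is not described here, and `tokens_per_second` is a hypothetical helper.

```python
def tokens_per_second(usage: dict, elapsed_seconds: float) -> float:
    """Generated tokens divided by wall-clock time for one request."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return usage["completion_tokens"] / elapsed_seconds


# Illustrative values, not measurements.
usage = {"prompt_tokens": 20, "completion_tokens": 120, "total_tokens": 140}
print(f"{tokens_per_second(usage, elapsed_seconds=4.0):.1f} tokens/s")
```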

Authentication and access control

GPUStack also provides authentication and role-based access control (RBAC) for enterprises. Users on the platform can have either admin or regular user roles. This guarantees that only authorized administrators can deploy and manage LLMs and that only authorized developers can utilize them.
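The two-role model can be pictured as a mapping from roles to permitted actions. The following is a minimal sketch of that idea, not GPUStack's implementation: the `Role` enum, the permission names, and `is_allowed` are all hypothetical.

```python
from enum import Enum


class Role(Enum):
    ADMIN = "admin"
    USER = "user"


# Hypothetical permission names chosen for illustration.
PERMISSIONS = {
    Role.ADMIN: {"deploy_model", "manage_model", "use_model"},
    Role.USER: {"use_model"},
}


def is_allowed(role: Role, action: str) -> bool:
    """Return True if the role's permission set contains the action."""
    return action in PERMISSIONS.get(role, set())


print(is_allowed(Role.ADMIN, "deploy_model"))
print(is_allowed(Role.USER, "deploy_model"))
```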

GPUStack Use Cases

GPUStack unlocks a world of possibilities for running LLMs on GPUs from any vendor. Here are just a few examples of what you can achieve with GPUStack:

Aggregate existing MacBooks, Windows PCs, and other GPU resources to offer a low-cost LLMaaS for a development team.
In resource-constrained environments, aggregate multiple edge nodes to provide LLMaaS on CPU resources.
Create your own enterprise-wide LLMaaS in your own data center for highly sensitive workloads that cannot be hosted in a cloud.

Getting Started with GPUStack

Installation

Linux or macOS

GPUStack provides a script to install it as a service on systemd- or launchd-based systems. To install GPUStack using this method, execute:

curl -sfL https://get.gpustack.ai | sh -

You have now deployed and started the GPUStack server, which also serves as the first worker node. You can access the GPUStack page at http://myserver (replace myserver with the IP address or domain of the host where you installed GPUStack).

Log in to GPUStack with username admin and the default password. You can run the following command to get the password for the default setup:

cat /var/lib/gpustack/initial_admin_password

To add additional worker nodes and form a GPUStack cluster, please run the following command on each worker node:

curl -sfL https://get.gpustack.ai | sh - --server-url http://myserver --token mytoken

Replace http://myserver with your GPUStack server URL and mytoken with your secret token for adding workers. To retrieve the token in the default setup from the GPUStack server, use the following command:

cat /var/lib/gpustack/token

Alternatively, follow the instructions shown in GPUStack to add workers.

Windows

Run PowerShell as administrator, then run the following command to install GPUStack:

Invoke-Expression (Invoke-WebRequest -Uri "https://get.gpustack.ai" -UseBasicParsing).Content

You can access the GPUStack page at http://myserver (replace myserver with the IP address or domain of the host where you installed GPUStack).

Log in to GPUStack with username admin and the default password. You can run the following command to get the password for the default setup:

Get-Content -Path (Join-Path -Path $env:APPDATA -ChildPath "gpustack\initial_admin_password") -Raw

Optionally, you can add extra workers to form a GPUStack cluster by running the following command on other nodes:

Invoke-Expression "& { $((Invoke-WebRequest -Uri "https://get.gpustack.ai" -UseBasicParsing).Content) } -ServerURL http://myserver -Token mytoken"

In the default setup, you can run the following to get the token used for adding workers:

Get-Content -Path (Join-Path -Path $env:APPDATA -ChildPath "gpustack\token") -Raw

For other installation scenarios, please refer to our installation documentation at: https://docs.gpustack.ai/docs/quickstart

Serving LLMs

As an LLM administrator, you can log in to GPUStack as the default system admin, navigate to Resources to monitor your GPU status and capacities, and then go to Models to deploy any open-source LLM into the GPUStack cluster. This enables you to provide these LLMs to regular users for integration into their applications. This approach helps you to efficiently utilize your existing resources and deliver stable LLM services for various needs and scenarios.

Access GPUStack to deploy the LLMs you need. Choose models from Hugging Face (only GGUF format is currently supported) or the Ollama library, download them to your local environment, and run them.

GPUStack will automatically schedule the model to run on an appropriate worker.

You can manage and maintain LLMs by checking API requests, token consumption, token throughput, resource utilization status, and more. This helps you decide whether to scale up or upgrade LLMs to ensure service stability.

Integrating with your applications

As an AI application developer, you can log in to GPUStack as a regular user and navigate to Playground from the menu. Here, you can interact with the LLM using the UI playground.

Next, visit API Keys to generate and save your API key. Return to Playground to customize your LLM by adjusting the system prompt, adding few-shot learning examples, or tuning the prompt parameters. When you're done, click View Code and select your preferred code format (curl, Python, Node.js) along with the API key. Use this code in your applications to enable communication with your private LLMs.
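In the OpenAI chat format, a system prompt and few-shot examples are simply entries in the `messages` list: one `system` message, then alternating `user` and `assistant` pairs, then the real question. The helper below is a small sketch of that layout; `build_messages` and the example content are hypothetical and are not code generated by GPUStack's View Code feature.

```python
import json
from typing import Dict, List, Tuple


def build_messages(system_prompt: str,
                   examples: List[Tuple[str, str]],
                   user_input: str) -> List[Dict[str, str]]:
    """Assemble a chat message list with few-shot examples before the question."""
    messages = [{"role": "system", "content": system_prompt}]
    for question, answer in examples:
        # Each example is shown to the model as a past user/assistant exchange.
        messages.append({"role": "user", "content": question})
        messages.append({"role": "assistant", "content": answer})
    messages.append({"role": "user", "content": user_input})
    return messages


messages = build_messages(
    "You classify support tickets as 'bug' or 'question'.",
    [("The app crashes on login.", "bug"),
     ("How do I reset my password?", "question")],
    "Export to CSV produces an empty file.",
)
print(json.dumps(messages, indent=2))
```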

You can now access the OpenAI-compatible API. For example, using curl:

export GPUSTACK_API_KEY=myapikey
curl http://myserver/v1-openai/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $GPUSTACK_API_KEY" \
  -d '{
    "model": "llama3",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Hello!"
      }
    ],
    "stream": true
  }'
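The same request can be sketched in Python using only the standard library. The URL, model name, and `GPUSTACK_API_KEY` variable are the placeholders from the curl example. The stream parsing assumes the server follows OpenAI's server-sent events framing (`data: {...}` lines ending with `data: [DONE]`), which this post does not spell out. The function names are hypothetical.

```python
import json
import os
import urllib.request
from typing import Dict, Iterator, List, Optional

BASE_URL = "http://myserver/v1-openai"  # placeholder host from the curl example


def build_chat_request(api_key: str, model: str,
                       messages: List[Dict[str, str]],
                       stream: bool = True) -> urllib.request.Request:
    """Build (but do not send) a chat completions request."""
    body = json.dumps({"model": model, "messages": messages, "stream": stream})
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body.encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def parse_stream_line(line: str) -> Optional[str]:
    """Return the text in one 'data: {...}' line, or None if there is none."""
    line = line.strip()
    if not line.startswith("data:"):
        return None  # blank separator lines or other event fields
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":
        return None  # assumed end-of-stream marker
    choices = json.loads(payload).get("choices") or []
    if not choices:
        return None
    return (choices[0].get("delta") or {}).get("content")


def stream_chat(request: urllib.request.Request) -> Iterator[str]:
    """Send the request and yield text fragments as they arrive."""
    with urllib.request.urlopen(request) as response:
        for raw_line in response:
            text = parse_stream_line(raw_line.decode("utf-8"))
            if text:
                yield text


request = build_chat_request(
    os.environ.get("GPUSTACK_API_KEY", "myapikey"),
    "llama3",
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
)
print(request.get_method(), request.full_url)
```

The block only builds the request, so it runs without a server. To send it, iterate over `stream_chat(request)` and print each fragment as it arrives.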

Join Our Community

Please find more information about GPUStack at: https://gpustack.ai.

If you encounter any issues or have suggestions for GPUStack, feel free to join our Community for support from the GPUStack team and to connect with fellow users globally.

We are actively enhancing the GPUStack project and plan to introduce new features in the near future, including support for multimodal models, additional accelerators like AMD ROCm or Intel oneAPI, and more inference engines. Before getting started, we encourage you to follow and star our project on GitHub at gpustack/gpustack to receive instant notifications about all future releases. We welcome your contributions to the project.

About Us

GPUStack is brought to you by Seal, Inc., a team dedicated to enabling AI access for all. Our mission is to enable enterprises to use AI to conduct their business, and GPUStack is a significant step towards achieving that goal.

Quickly build your own LLMaaS platform with GPUStack! Start experiencing the ease of creating GPU clusters locally, running and using LLMs, and integrating them into your applications.
