Jozu Hub vs. Docker Hub? Which One Works Best for AI/ML?

Jesse Williams

Posted on November 22, 2024

Container registries like Jozu Hub and Docker Hub give developers a way to store, distribute, and manage container images, and with them the software applications packaged in containers. These registries support modern software development practices, including continuous integration/continuous deployment (CI/CD), microservices, cloud-native applications, and, more recently, machine learning (ML).

Docker Hub is by far the most popular registry for developers and DevOps teams. It was built for application development, and it helped drive the adoption of containers and microservices, serving as the starting point for countless projects. It's not surprising that when LLM development took off, many teams looked to Docker Hub (and Dockerfiles) as a jumping-off point for their projects, only to discover that these projects come with their own set of unique challenges.

We experienced this first-hand before creating KitOps. In this post, I'll go over why we created KitOps (and Jozu Hub, the only OCI registry purpose-built for hosting KitOps ModelKits) and why it's uniquely suited to AI and ML development.

But first, let's go over the basics of container registries.

What is a container registry?

A container registry is a centrally managed repository, or collection of repositories, that developers use to store, access, and manage container images. A container image bundles the files, dependencies, and configuration an application needs across its various stages. By storing and managing your container images in a registry, you can adopt agile, CI/CD, and DevOps approaches to building software.

One key feature of container registries is that users can upload their images to public or private repositories, pull them for deployment, and connect them to container orchestration platforms like Kubernetes and Docker Swarm. Besides uploading images and connecting to orchestration platforms, container registries offer a few other features (a basic push/pull workflow is sketched after this list):

  • Provide ML engineers and developers with a centralized platform for tracking and using their Docker images.
  • Perform vulnerability scanning and automated tests for security checks.
  • Support container image versioning, making it easier for developers to manage multiple application iterations.
  • Integrate with CI/CD tools to streamline the workflow for building, testing, and deploying container images, so teams can commit code changes and automatically build and deploy new images.
  • Support public and private repositories. Public repositories are accessible to everyone, while private repositories are restricted to authorized users.
  • Manage access and permissions for different team members' accounts so users can pull and push images with a third-party client.
  • Scale to accommodate an organization's image storage needs and adapt to different deployment options.
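
As a concrete example, pushing an image to a registry and pulling it back down typically looks like this with the Docker CLI (the registry host, namespace, and tag below are placeholders, not real repositories):

docker build -t registry.example.com/team/my-app:1.0 .
docker push registry.example.com/team/my-app:1.0
docker pull registry.example.com/team/my-app:1.0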

Why do ML engineers need a container registry?

For starters, ML engineers and AI developers need to store all the components used in a machine learning (ML) project, not just the code but also the data and dependencies, to maintain consistency and reproducibility across different stages. Once every component and asset has been packaged into an image, a container registry organizes and manages the versions of that image.

Taxonomy of Docker Hub terms and concepts (source: Microsoft Learn, https://learn.microsoft.com/en-us/dotnet/architecture/microservices/container-docker-introduction/docker-containers-images-registries)

Besides reproducibility, container registries enable sharing of the entire development environment, from dependencies and codebase to configurations and runtime settings. This promotes collaboration and scalability within the team, allowing developers to share resources easily and have the agility to pull a container image and run it to reproduce a result.

Where Docker Hub falls short in ML workflows

For ML engineers and AI developers, Docker Hub is handy because it provides an environment for every stage of the ML lifecycle, from development to deployment. It lets you encapsulate your entire dependency stack, from the base OS through packaged dependencies to source code, as a single Docker image. Using Docker Desktop, developers can collaborate without compromising consistency and portability.

Dockerfile within Docker images (source: https://www.geeksforgeeks.org/what-is-docker-hub/)

However, as great as that sounds, Docker Hub falls short of the demands of ML projects in the following ways:

  • Iterative and complex process
  • Storage and bandwidth limits
  • Graphics Processing Unit (GPU) optimization

Iterative and complex process
For one, the whole ML process is iterative, with each step often requiring a different set of dependencies and producing something new to keep track of. Because of this, a new, separate container image is frequently needed. While Docker Hub can effectively track changes to your image's content (an important aspect of your ML project), the frequent updates in ML workflows still require careful management of dependencies and versioning, which is exactly what Jozu Hub was designed for.

Storage and bandwidth limits
Another issue is storage and bandwidth. Large-scale projects can have very large components, such as dependencies, models, and datasets, which can be a problem for Docker Hub. This is mainly because Docker Hub imposes storage and bandwidth limits, especially on free accounts. If your components exceed those limits, pushing and pulling images becomes a challenge.

Graphics Processing Unit (GPU) optimization
The last issue is that Docker Hub's base images aren't pre-configured or optimized for GPU dependencies. Most ML projects, and especially AI projects, require GPU support because GPUs let models train faster and more efficiently. Without a pre-configured image, developers must manually install and configure the Compute Unified Device Architecture (CUDA) and CUDA Deep Neural Network library (cuDNN) dependencies needed for GPU processing.

This manual setup can be a burden, especially for teams that need a consistent environment across different machines.
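
For example, rather than installing CUDA and cuDNN by hand, teams often start from a GPU-enabled base image and pass the GPU through at runtime. The sketch below is illustrative only: the image tag is an example, and the --gpus flag requires the NVIDIA Container Toolkit on the host.

docker pull nvidia/cuda:12.2.0-runtime-ubuntu22.04
docker run --rm --gpus all nvidia/cuda:12.2.0-runtime-ubuntu22.04 nvidia-smi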

While this article doesn't go into detail, you should definitely check out this guide on when to Dockerize vs. when to use ModelKit to know when Docker Hub works for your projects.

How Jozu Hub differs from Docker Hub for ML workflow

Docker Hub is a dependable choice if you primarily focus on containerization or want to use general-purpose images like a base Linux image. However, it lacks some specialized features.

Jozu Hub is a SaaS registry (with on-prem coming soon!) offering a ModelKit-first experience designed for storing, tracking, and deploying your large language models (LLMs) and AI/ML projects. It works hand in hand with KitOps, an open-source collaboration tool for data teams, powered by ModelKits and the Kitfile.

A ModelKit is an OCI-compliant packaging format that bundles a project into one trackable and shareable artifact (meaning you can deploy a ModelKit through the same pipelines you use for container images, such as Jenkins or Dagger.io). Besides standardizing the project's packaging, it makes team collaboration and deployment straightforward. The ModelKit works closely with the Kitfile, a YAML-based configuration file for managing ModelKit projects. A Kitfile is divided into five parts, as shown in the image below: package for metadata, model for the serialized model's details, docs for documentation, code for the project's codebase, and datasets for the data.

Making your own Kitfile

Each part of the Kitfile ensures that the ModelKit's configuration is handled efficiently and correctly, maintaining security throughout the process. Unlike a Dockerfile, it does this by explicitly defining everything from the model to the datasets, code, metadata, and other artifacts.

To better understand how this works in practice, the snippet below shows a minimal ModelKit configuration for distributing a pair of datasets. It includes author metadata under package and two datasets, one for training and another for validation, each with its name, file path, license, and description where applicable. You can learn more and see more ModelKit examples in the official KitOps documentation.

A minimal ModelKit for distributing a pair of datasets
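
Based on that description, a Kitfile along these lines would do the job. This is a sketch that follows the Kitfile layout described above; the package name, author, paths, and license values are placeholders rather than the original example's values.

manifestVersion: "1.0"
package:
  name: dataset-example
  version: 1.0.0
  authors: ["Your Name"]
datasets:
  - name: training
    path: ./data/train.csv
    license: Apache-2.0
    description: Training split for the example project
  - name: validation
    path: ./data/validate.csv
    description: Validation split used to evaluate the model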

What makes Jozu Hub a great choice for your AI/ML projects?

Jozu Hub addresses the issues mentioned earlier through its ModelKit system, offering developers a streamlined approach to managing models and their dependencies.

Let’s explore a couple more benefits of ModelKits.

  • Model versioning.
  • Integration with MLOps platforms and tools.
  • Resources optimized for ML workloads.

Model Versioning
Jozu Hub has an advanced versioning and tagging system tailored to the ML space. Tagging a ModelKit with the kit tag command creates an immutable, reproducible artifact, so you can manage versions without compromising the model's integrity across the ML lifecycle. Better still, developers can pack and unpack an entire package with a single command, for example:

kit unpack [flags] [registry/]repository[:tag|@digest]
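
A typical versioning workflow might look like the following. The repository reference and tag are made up for illustration, and the flags reflect the KitOps CLI as I understand it, so confirm them with kit --help.

kit pack . -t jozu.ml/acme/sentiment-model:v1
kit push jozu.ml/acme/sentiment-model:v1
kit pull jozu.ml/acme/sentiment-model:v1
kit unpack jozu.ml/acme/sentiment-model:v1 -d ./workspace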

Integration with MLOps platforms and tools
Jozu Hub comes with integration features, which matter because each step of an ML project tends to involve a specialized tool. It supports GPUs, integrates with AI frameworks, and works with other OCI-compliant registries.

ModelKits also let you keep using your favorite model, experimentation, version control (for example, GitHub), and MLOps tools. You can work with the tools you're most familiar with, without wondering whether you can store everything in a container, which also makes adoption within a team much faster.
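
Because ModelKits are standard OCI artifacts, you can also push the same artifact to another OCI-compliant registry. The sketch below is only an assumption-labeled example: the target registry, repository, and tag are placeholders, and the kit login and kit tag usage follows my reading of the KitOps docs, so verify against kit --help.

kit login ghcr.io
kit tag jozu.ml/acme/sentiment-model:v1 ghcr.io/acme/sentiment-model:v1
kit push ghcr.io/acme/sentiment-model:v1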

Resources optimized for ML workloads
Unlike Docker Hub images, ModelKits include only the essential assets, resulting in lightweight packages that are quick and easy to deploy. Each ModelKit is immutable, with a unique digest that ensures the correct data and artifacts are pulled every time, so as a developer you can be confident you are deploying the model you intend to. Furthermore, you can specify and easily manage dependencies, since everything is shipped together. Similar to the Docker CLI, KitOps comes with its own CLI tool.
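
When deploying, you often need only part of a ModelKit, for example the model weights but not the training data. The line below sketches that idea; the --model and -d flags are my recollection of the KitOps CLI and the reference is illustrative, so check kit unpack --help before relying on them.

kit unpack jozu.ml/acme/sentiment-model:v1 --model -d ./serve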

I hope that helps. If you have questions about integrating Jozu Hub or KitOps, join the conversation on Discord, open an issue on our GitHub repository, or start using KitOps today.
