Understanding SafeTensors: A Secure Alternative to Pickle for ML Models


Luke Hinds

Posted on October 23, 2024


As machine learning models become increasingly leveraged in areas such as CodeGen, the security of model serialization formats has never been more pressing. The introduction of safetensors by Hugging Face is a solid, security-centric step forward: it addresses serious security concerns inherited from the Python world while also bringing some performance improvements along for the ride. Let's dive into why safetensors exists, the security issues it solves, and why you should consider using it in your ML projects.

The Pickle Problem

Python's pickle format has long been the default serialization choice for many ML frameworks, including PyTorch. While convenient, pickle has a fundamental security flaw: it can execute arbitrary code during deserialization. This occurs because pickle was designed to serialize Python objects along with their behavior, not just their data.
Consider this malicious example:

import pickle
import os

class MaliciousPayload:
    def __reduce__(self):
        return (os.system, ('rm -rf /',))

# Creating and serializing malicious data
malicious_data = MaliciousPayload()
with open('model.pkl', 'wb') as f:
    pickle.dump(malicious_data, f)

When this pickle file is deserialized, it will attempt to execute a destructive command (granted, it would need root privileges to do real damage). But the payload can do anything it pleases, with full shell access, so anything goes: exfiltrate sensitive data, install malware, execute remote code - you get the picture!
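To make that concrete, here's a minimal sketch of the victim's side; simply loading the file is enough to trigger the payload (the filename matches the example above):

import pickle

# Merely deserializing the file runs os.system('rm -rf /') via __reduce__
with open('model.pkl', 'rb') as f:
    data = pickle.load(f)  # the command executes during this call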

This vulnerability is particularly concerning in the ML ecosystem, where sharing pre-trained or fine-tuned models is now common practice. Users routinely download models from various sources and load them largely unfettered, potentially exposing themselves to serious risk. For example, Hugging Face now hosts over 1 million models, and very few of them are vetted for safety.

Enter SafeTensors

Safetensors was created specifically to address these security concerns while also providing additional benefits. Its stand-out qualities are substantial and illustrate well why you should be using it if you're in the business of shipping any sort of machine learning model (and not just the large language variety).

Secure by Design

The fundamental security advantage of safetensors lies in its limited scope. Unlike pickle, which can serialize arbitrary Python objects and code, safetensors is purpose-built to store only numerical tensors and their metadata. This restricted capability means that even if an attacker crafts a malicious safetensors file, it cannot execute arbitrary code during deserialization.
The format uses a simple header-content structure:

  • A JSON header containing metadata and tensor information
  • The actual tensor data in a flat binary format

This separation makes it impossible to embed executable code or malicious payloads within the file structure.
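To illustrate just how simple that layout is, here's a minimal sketch that reads only the JSON header of a safetensors file (assuming a file like the model.safetensors we create later in this post):

import json
import struct

def read_safetensors_header(path):
    """Read the JSON header of a safetensors file without touching tensor data."""
    with open(path, "rb") as f:
        # First 8 bytes: little-endian unsigned 64-bit length of the JSON header
        header_len = struct.unpack("<Q", f.read(8))[0]
        # Next header_len bytes: JSON mapping tensor names to dtype, shape and data offsets
        return json.loads(f.read(header_len))

# Example: inspect which tensors a file declares without loading any of them
# print(read_safetensors_header("model.safetensors"))

Everything after the header is flat tensor bytes, which is why there is nowhere for executable code to hide.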

Performance Improvements

Beyond security, safetensors brings significant performance benefits. It supports memory mapping (mmap) of model files, which gives us faster model loading and reduced memory usage while loading.

The performance impact is particularly noticeable when loading large models. For example, loading a BERT model with safetensors can be up to 3x faster than with pickle-based formats.
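If you want to see this on your own hardware, here's a rough timing sketch; it assumes you already have the same weights saved both ways (model.bin via torch.save and model.safetensors via save_file):

import time
import torch
from safetensors.torch import load_file

start = time.perf_counter()
_ = torch.load("model.bin", map_location="cpu")    # pickle-based load
print(f"pickle-based load: {time.perf_counter() - start:.2f}s")

start = time.perf_counter()
_ = load_file("model.safetensors", device="cpu")   # safetensors load
print(f"safetensors load:  {time.perf_counter() - start:.2f}s")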

Framework Agnostic

While initially developed for the Hugging Face ecosystem, safetensors is designed to be framework-agnostic and can therefore be used with PyTorch, TensorFlow, JAX and other ML frameworks.
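As a small sketch of what that looks like in practice (the tensor name and file name here are just for illustration), you can write a file from NumPy and read the very same file back as PyTorch tensors:

import numpy as np
from safetensors.numpy import save_file as np_save_file
from safetensors.torch import load_file as torch_load_file

# Save tensors from NumPy...
np_save_file({"embedding": np.random.rand(10, 4).astype(np.float32)}, "shared.safetensors")

# ...and load the exact same file as PyTorch tensors
weights = torch_load_file("shared.safetensors")
print(type(weights["embedding"]))  # <class 'torch.Tensor'>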

How to use 'em

So let's jump right on in and look at how to use safetensors in practice, including conversion from pickle.

The design allows for partial loading of models, which can be particularly useful when working with huge models under memory constraints.

from safetensors import safe_open
from safetensors.torch import save_file, load_file
import torch

# Creating and saving tensors
tensors = {
    "weight": torch.randn(1000, 1000),
    "bias": torch.randn(1000)
}

# Saving to safetensors format
save_file(tensors, "model.safetensors")

# Loading specific tensors efficiently
with safe_open("model.safetensors", framework="pt") as f:
    tensor = f.get_tensor("weight")  # Only loads this tensor

Migration and Compatibility

Converting existing pickle-based models to safetensors is straightforward.

from transformers import AutoModel
import torch
from safetensors.torch import save_file

# Load existing pickle-based model
model = AutoModel.from_pretrained("bert-base-uncased")

# Convert to state dict
state_dict = model.state_dict()

# Save as safetensors
save_file(state_dict, "converted_model.safetensors")
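As a quick sanity check, you can load the converted weights straight back into the model (reusing model and the file name from above):

from safetensors.torch import load_file

# Load the converted weights and push them back into the model
state_dict = load_file("converted_model.safetensors")
model.load_state_dict(state_dict)

Recent versions of transformers can also save in this format directly via model.save_pretrained(output_dir, safe_serialization=True), so for new checkpoints you may not even need the manual conversion step.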

When to Use SafeTensors

Anytime you can, essentially, but especially when:

  • You expect users to load models from untrusted sources
  • You need faster model loading times
  • You're working with large models and memory efficiency is important

Wrapping up

Safetensors represents a decent security lift in ML model serialization, offering a more secure, efficient, and flexible alternative to pickle. While pickle remains widely used (and can be safe in certain circumstances when well understood), the benefits of safetensors make it a no-brainer choice for new projects and a worthwhile migration target for existing ones.

Security in ML systems isn't just about model accuracy, safety, and data privacy; it's also about the security of the entire ML pipeline and AI supply chain, and that includes model serialization. Safetensors helps close a significant security gap while bringing additional performance benefits to your ML workflow.
