Sending complex structs to Ruby from Rust

leehambley

Lee Hambley

Posted on July 20, 2020

Actually, this works for any language with a Foreign Function Interface (FFI), so the same applies to Node.js, Python, or anything else (even lower-level languages such as C and C++).

C's Application Binary Interface (ABI) is the standard by which nearly all languages interoperate. C defines how memory should be laid out, and if two languages can agree on that, and on the calling conventions (how and where arguments and return values are provided in memory) then you can call one from the other.

This isn't absolutely universal. Go, for example, with its runtime and garbage collector, makes this much, much more difficult. (My theory is that Google does so much with gRPC and Protobufs that C interop in the same memory space wasn't a design goal of Go, because the "Google calling convention" is at a much higher abstraction level.)

Anyway, I was working on a piece of tooling that should be consumed from Node.js and Ruby, and I thought Rust would be an amazing place to start: I can code the logic once and simply use it from both languages. I've had minimal experience with FFI before, and minimal experience building C libraries to consume (from other C programs, at the time), so I wanted to tackle the hard part first.


Getting Started

My test program exposed a struct, a simple thing with three fields (two strings and an enum) which contain metadata about reserved words (words my program can't allow you to use for your own purposes):

pub struct Word {
    pub word: &'static str,
    pub reason: &'static str,
    pub kind: Kind,
}

Two things to point out:

  • The strings are compiled in: the list of reserved words is not dynamic, so the &'static lifetime annotation lets those str references point at the memory location inside the library/binary where the compiled strings land.
  • The kind field is an enum. There are only two kinds, and I believe using enums wherever possible over strings is really important, but we do have to do some work to make sure it can behave like a string:
use std::fmt;
#[derive(Debug, PartialEq)]
pub enum Kind {
    Runner,
    Flag,
}
impl fmt::Display for Kind {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        fmt::Debug::fmt(self, f)
    }
}

The derived traits are cool: #[derive(Debug)] has the compiler teach my enum how to print its variant name, and the impl fmt::Display block reuses that output, which in turn gives us to_string(). So the struct on the Rust side has two string members and one enum member which we can bend into a string when we need to. To complete the example before moving into the FFI boundary code, let's see the function that exposes these:

pub fn words() -> Vec<Word> {
    return vec![
        Word {
            word: "python",
            reason: "test test test test",
            kind: Kind::Runner,
        },
        Word {
            word: "bash3",
            reason: "Used as an extension to activate the Bash (v3) runner.",
            kind: Kind::Runner,
        },
    ];
}

Here, you can see the strings are hard-coded and &'static, and the kind is a kind of singleton instance (if you like) of the Runner member of the enum.
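To make the "bend it into a string" part concrete, here is a minimal sketch of the same enum with a small helper. The kind_label function is a made-up name for illustration and is not part of the original program.

```rust
use std::fmt;

#[derive(Debug, PartialEq)]
pub enum Kind {
    Runner,
    Flag,
}

// Display reuses the derived Debug output, which is the variant name.
impl fmt::Display for Kind {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        fmt::Debug::fmt(self, f)
    }
}

/// Hypothetical helper: renders a kind as an owned String.
pub fn kind_label(kind: &Kind) -> String {
    // to_string() comes from the blanket `impl<T: Display> ToString for T`.
    kind.to_string()
}
```

Calling kind_label(&Kind::Runner) yields the variant name as a String, which is the form the FFI layer can later turn into a C string.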


C Interoperability

Unfortunately, we can't send this struct over a boundary to any C ABI compatible program: the ABI doesn't know what to do with a Rust string slice (&str), and it doesn't know how to unpack the meaning of our enum. The biggest problems are that C expects strings to be terminated with a 0x00 byte (null byte), whereas a &str is a pointer plus a length with no terminator, and that a Rust enum without an explicit #[repr] has no layout that C code can rely on. If we sent our Word to a Ruby or Node program, it would try to follow the rules of the C ABI, but we're not complying, so it would almost certainly read past some memory limit looking for the end of a string, or misinterpret the enum, and cause a segmentation fault, which is fatal for the program.

For C ABI interop, Rust provides std::ffi and std::os::raw, which contain types that are C ABI compatible.
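As a small illustration of the terminator problem, the sketch below shows the bytes C would see once a Rust string goes through std::ffi::CString. The c_bytes helper is a hypothetical name used only for this example.

```rust
use std::ffi::CString;

/// Hypothetical helper: the bytes C would see for a given Rust string,
/// or None when the input contains an interior NUL byte.
pub fn c_bytes(s: &str) -> Option<Vec<u8>> {
    // CString::new copies the bytes and appends the terminating 0x00.
    // It fails with NulError if the input already contains a 0x00.
    let c = CString::new(s).ok()?;
    Some(c.as_bytes_with_nul().to_vec())
}
```

A &str such as "python" is six bytes with no terminator; the CString version is seven bytes, the last one being 0x00. An input with a NUL in the middle is rejected, because C would treat that byte as the end of the string.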

With that in mind, I elected to keep all my Rust code Rustish, and define a boundary function to take my Rust data structures and make them work with C. Here's what I came up with for the types; we'll talk about the function right after:

use std::os::raw::c_char;
#[repr(C)]
pub struct ReservedWord {
  word: *mut c_char,
  reason: *mut c_char,
  kind: *mut c_char,
}

Reminder: in C, strings are pointers to the first char (an 8-bit byte) at an address, where you can simply iterate by taking the initial address and adding sizeof(char) until the character you read is a 0x00 null byte. Whether it's *mut or *const (in Rust) doesn't make a difference to the layout, but the APIs we'll use to provide a raw pointer happen to return *mut, not *const, so we can do the same; it doesn't hurt anything (and as soon as we send a pointer to another program we are outside Rust's safety guarantees, so *mut somewhat encodes that to remind us).
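The "walk until you hit 0x00" idea can be sketched in Rust itself. This is only an illustration of what a C caller does; c_strlen and demo_len are made-up names, and real code would normally use std::ffi::CStr instead.

```rust
use std::ffi::CString;
use std::os::raw::c_char;

/// Counts bytes up to (not including) the terminating NUL, like C's strlen.
///
/// # Safety
/// `ptr` must point at a valid NUL-terminated string.
pub unsafe fn c_strlen(ptr: *const c_char) -> usize {
    let mut len = 0;
    // Step one char at a time until the 0x00 terminator is read.
    while unsafe { *ptr.add(len) } != 0 {
        len += 1;
    }
    len
}

/// Hypothetical demo: builds a C string and measures it the C way.
pub fn demo_len(s: &str) -> usize {
    let c = CString::new(s).expect("no interior NUL");
    // `c` stays alive for the whole call, so the pointer is valid here.
    unsafe { c_strlen(c.as_ptr()) }
}
```

If the terminator were missing, the loop would keep reading past the end of the allocation, which is exactly the failure mode described above.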

The #[repr(C)] is interesting: it tells Rust's compiler to lay this struct out as C would expect. In this case sizeof(ReservedWord) is 3 × sizeof(*mut c_char), which is 24 bytes on a 64-bit platform (12 bytes on a 32-bit one).
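The size claim is easy to check rather than imagine. The following sketch reports the size and alignment of the struct for whatever target it is compiled on; reserved_word_layout is a hypothetical helper name.

```rust
use std::mem::{align_of, size_of};
use std::os::raw::c_char;

#[allow(dead_code)]
#[repr(C)]
pub struct ReservedWord {
    word: *mut c_char,
    reason: *mut c_char,
    kind: *mut c_char,
}

/// Hypothetical helper: (size, alignment) of ReservedWord in bytes.
pub fn reserved_word_layout() -> (usize, usize) {
    (size_of::<ReservedWord>(), align_of::<ReservedWord>())
}
```

Because all three fields are pointers, there is no padding: the size is three pointer widths and the alignment is that of a single pointer.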

We could possibly take a shortcut now, because in the example the strings are &'static, and we wouldn't run into lifecycle/lifetime issues, because as long as our library is loaded, the strings are in scope. But for my use-case, this was only true for the teaching example, I knew that I'd soon need to send data over the C ABI that didn't have a simple static lifetime.

So, we want to write a function with the following signature:

#[no_mangle]
extern "C" fn reserved_words() -> *mut ReservedWords {
}

The #[no_mangle] attribute stops the compiler from mangling this function's name, so the symbol in the resulting library is exactly reserved_words and is available for anyone who loads us to call.

You'll also notice the extern "C" in the preamble. That's important too: it tells the compiler to use the C calling convention for this function.

You may have noticed I'm returning a *mut ReservedWords, plural. We haven't mentioned that yet, but because of the way arrays work in C (hint: like seemingly everything else, they are pointers) we need to tell the caller how many ReservedWord to expect, so they can read precisely that many and stay within the memory boundaries. Here's the struct definition:

#[repr(C)]
pub struct ReservedWords {
  len: usize,
  words: *mut ReservedWord,
}

usize is interesting: its size depends on the platform you are compiling for. It is 4 bytes on 32-bit targets and 8 bytes on 64-bit targets.

Using a size type over a bare int or uint just helps us communicate the intent of this field a bit better.


OK, pack it up!

So there's no real way to sugar coat this, here's 20 lines of code which implement that function correctly, we'll talk about the important parts right after:

use std::ffi::CString;

#[no_mangle]
extern "C" fn reserved_words() -> *mut ReservedWords {
  let mut v: Vec<ReservedWord> = vec![]; // (1)

  for word in reserved::words() { // (2)
    let w = CString::new(word.word).expect("boom:word"); // (3) (4)
    let r = CString::new(word.reason).expect("boom:reason");
    // Stringify this word's own kind rather than hard-coding Kind::Runner.
    let k = CString::new(word.kind.to_string()).expect("boom:kind"); // (5)

    v.push(ReservedWord {
      word: w.into_raw(), // (6)
      reason: r.into_raw(),
      kind: k.into_raw(),
    }); // (7)
  }

  let rw = ReservedWords {
    len: v.len(), // (8)
    words: Box::into_raw(v.into_boxed_slice()) as *mut ReservedWord, // (9)
  };

  return Box::into_raw(Box::new(rw)); // (10)
}

The biggest thing we've introduced here is Box. If you're new to Rust, know that Box is a way to store data on the heap, not on the stack. There's a lot more to this, but suffice it to say, it's a good way to let data outlive the function that created it, because we need this data to exist even after our function has returned (if we stored it on the stack, it would cease to exist when the stack frame was popped and we returned to the caller).

Allocating on the heap isn't free, and oftentimes when you hear about high performance "alloc free" code, it means the programmers succeeded in never needing to allocate on the heap, working only within the stack (which is pre-allocated for you, and much, much less time-expensive).
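Before looking at the annotated version, here is the Box mechanism on its own, as a minimal sketch with a plain integer. The names leak_to_heap and reclaim are invented for this example.

```rust
/// Moves a value to the heap and hands out a raw pointer to it.
/// After this call Rust no longer tracks (or frees) the allocation.
pub fn leak_to_heap(value: u64) -> *mut u64 {
    Box::into_raw(Box::new(value))
}

/// Takes ownership back; the allocation is freed when the Box is dropped.
///
/// # Safety
/// `ptr` must come from `leak_to_heap` and must not be reclaimed twice.
pub unsafe fn reclaim(ptr: *mut u64) -> u64 {
    // Dereferencing the Box moves the value out and frees the heap memory.
    unsafe { *Box::from_raw(ptr) }
}
```

Between the two calls the value lives on the heap with no owner on the Rust side, which is the state we want our data to be in while a foreign caller is reading it.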

So, let's check out the annotated code:

  1. We're using a vector again. Vec<T> doesn't meet the C ABI requirements, but we'll fix that a few lines later.
  2. A simple loop to copy them over. Note that at this point we still don't know (or care) how many there are; we'll count them later.
  3. CString::new accepts a variety of types, and it is quite happy to take a &'static str. CString will copy our string and terminate it with a null byte; it's the type we need to use to expose a string to C properly.
  4. CString::new returns a Result<CString, NulError>. In my case the strings are statically allocated and hard-coded, so I'm taking liberties with the error checking and unwrapping it with .expect("boom:word").
  5. The impl fmt::Display we wrote for the enum Kind (on top of #[derive(Debug)]) allows us to call to_string() on it. That returns something like Runner as a dynamically allocated String, and CString::new is just as happy to accept this.
  6. Two things happening here:
    1. into_raw is really interesting: it consumes the CString and hands back a raw pointer, essentially causing Rust to forget about the string, which means the string will continue to exist even after our function has returned.
    2. The let w = CString::new... line is kept separate for readability. Because into_raw() takes ownership, chaining it directly onto CString::new(..).expect(..) would work too. The real trap is as_ptr(), which only borrows: call it on a temporary and the CString is dropped at the end of the statement, leaving a dangling pointer.
  7. We just push an instance of the ReservedWord into the vector. At this point a ReservedWord (this is the C ABI compatible one) is 24 bytes (on a 64-bit platform) of densely packed pointers to strings that are somewhere in memory.
  8. Calling len() on Vec<T> returns a usize, so that's why we used that in our struct. This is the first time we need to know how many things are in that vector.
  9. v.into_boxed_slice() will take our vector and trim any unused space at the end, turning it into a boxed slice whose elements sit contiguously on the heap, which is a layout C can handle. Now that we know the length, we can safely work with a slice, so that's fine. Box::into_raw() doesn't allocate anything new; it gives up ownership of that heap allocation and disassociates it from Rust's memory safety guarantees, so we just successfully "leaked" a bunch of ReservedWords 🚀. Interestingly, this method returns a *mut [ReservedWord], but slices aren't defined in the C ABI, so we just cast the pointer to *mut ReservedWord. We know the size, and we're pointing at just the first one anyway, so this is fine, and keeps us compatible with the C ABI.
  10. Lastly, we "leak" the ReservedWords struct, containing the length and the pointer to the first ReservedWord.
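Steps 8 and 9 are the least obvious, so here they are in isolation as a sketch over a Vec<u32>. The helper names leak_vec and reclaim_vec are assumptions made for the illustration.

```rust
/// Hypothetical helper: leaks a vector as a (pointer, length) pair,
/// the shape a C caller can consume.
pub fn leak_vec(v: Vec<u32>) -> (*mut u32, usize) {
    let len = v.len();
    // into_boxed_slice discards spare capacity, so capacity == len afterwards.
    let boxed: Box<[u32]> = v.into_boxed_slice();
    // Box::into_raw yields a fat *mut [u32]; the cast keeps only the address.
    let ptr = Box::into_raw(boxed) as *mut u32;
    (ptr, len)
}

/// Rebuilds the vector so Rust owns (and will free) the memory again.
///
/// # Safety
/// `ptr` and `len` must come from a single `leak_vec` call, used only once.
pub unsafe fn reclaim_vec(ptr: *mut u32, len: usize) -> Vec<u32> {
    // Capacity equals length because the data was a boxed slice.
    unsafe { Vec::from_raw_parts(ptr, len, len) }
}
```

The length has to travel alongside the pointer, because once the fat pointer is cast to a thin one the element count is no longer recoverable from the pointer itself.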

My apologies that this was quite in-depth. I fought tooth and nail for the knowledge in each of those ten bullet points. Of course I read the docs, but until I had the mental model, I couldn't really decipher what I was reading. In hindsight, the docs make it crystal clear what's going on here.

Specifically, if you take only one thing away from this post, it is that you must explicitly leak every piece of memory that you want to make available outside Rust. In this case that's three strings for each ReservedWord, one ReservedWord for each reserved word, and finally the ReservedWords that is the entry point for any C ABI caller to access this data.

If you don't do this, your library may work some or all of the time. I could access most structs most of the time when I was taking liberties with this code; in those cases I had been lucky that the system hadn't reused or reallocated that memory. In a test script on an otherwise idle machine that will be true more of the time; in a busy production environment I'd expect you to discover this mistake earlier.


Use From Ruby

No exhaustive commentary here, just a block of code to show how to use this from Ruby, check the FFI docs for more:

require "ffi"

module Mitre

  class ReservedWord < FFI::Struct
    layout :word, :string,
           :reason, :string,
           :kind, :string
  end

  class ReservedWords < FFI::Struct
    layout :len,  :size_t, # matches Rust's usize
           :words, :pointer
  end

  extend FFI::Library

  ffi_lib begin
    prefix = Gem.win_platform? ? "" : "lib"
    "#{File.expand_path("./target/debug/", __dir__)}/#{prefix}mitre.#{FFI::Platform::LIBSUFFIX}"
  end

  attach_function :reserved_words, [ ], :pointer
  attach_function :free_reserved_words, [:pointer], :void
end

def print_rw(rw)
  puts "Word: #{rw[:word]} | Kind: #{rw[:kind]} | Reason: #{rw[:reason]}"
end

rws = Mitre::ReservedWords.new(Mitre.reserved_words())
puts rws.to_ptr
puts "There are #{rws[:len]} reserved words"
0.upto(rws[:len]-1) do |i|
  rw = Mitre::ReservedWord.new(rws[:words] + (i * Mitre::ReservedWord.size))
  print_rw(rw)
end

# Hand the memory back to Rust once we are done reading it.
Mitre.free_reserved_words(rws.to_ptr)

Reclaiming memory

Keen eyes who made it this far might have spotted the free_reserved_words function. This is a simple one: to avoid leaking memory, we need to free it. We could possibly call free() from Ruby, and it might even work some of the time, on some systems. However, the differences in which allocator was used when compiling Rust and Ruby (or anything else), and even in how that allocator is configured at runtime, can be significant.

The rule is always, whoever allocates frees, so let's look at that:

use std::ffi::CString;

#[no_mangle]
extern "C" fn free_reserved_words(ptr: *mut ReservedWords) {
  if ptr.is_null() {
    eprintln!("free_reserved_words() error got NULL ptr!");
    ::std::process::abort();
  }
  unsafe {
    // Re-own the header, then the array (capacity == len for a boxed slice).
    let w: Box<ReservedWords> = Box::from_raw(ptr);
    let words: Vec<ReservedWord> = Vec::from_raw_parts(w.words, w.len, w.len);
    for word in words {
      // Re-own each string so it is freed when the CString is dropped.
      drop(CString::from_raw(word.kind));
      drop(CString::from_raw(word.reason));
      drop(CString::from_raw(word.word));
    }
  }
}

No need for exhaustive explanations here either, really. from_raw is the exact opposite of into_raw, just as Vec::from_raw_parts undoes the into_boxed_slice / Box::into_raw pair we used on the Vec<T>.

It is sufficient to make Rust aware of all the pointers again, to "reclaim" them, and let the function simply return. Rust's ownership model will do the rest: everything is dropped, and the clean-up happens automagically.
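To see the whole leak, read, reclaim cycle in one place without leaving Rust, here is a self-contained sketch. It is not the code from my library: build_words, read_words and free_words are stand-in names, the word list is shortened, #[no_mangle] is left out because nothing is exported, and free_words returns early on NULL instead of aborting.

```rust
use std::ffi::{CStr, CString};
use std::os::raw::c_char;

#[repr(C)]
pub struct ReservedWord {
    pub word: *mut c_char,
    pub reason: *mut c_char,
    pub kind: *mut c_char,
}

#[repr(C)]
pub struct ReservedWords {
    pub len: usize,
    pub words: *mut ReservedWord,
}

/// Stand-in for reserved_words(): leaks every string, the array and the header.
pub extern "C" fn build_words() -> *mut ReservedWords {
    let raw = |s: &str| CString::new(s).expect("no interior NUL").into_raw();
    let v = vec![
        ReservedWord { word: raw("python"), reason: raw("test"), kind: raw("Runner") },
        ReservedWord { word: raw("bash3"), reason: raw("Bash (v3) runner"), kind: raw("Runner") },
    ];
    let len = v.len();
    let words = Box::into_raw(v.into_boxed_slice()) as *mut ReservedWord;
    Box::into_raw(Box::new(ReservedWords { len, words }))
}

/// Reads the words the way a foreign caller would: base pointer plus length,
/// then scanning each string up to its NUL terminator.
///
/// # Safety
/// `ptr` must come from `build_words` and must not have been freed.
pub unsafe fn read_words(ptr: *const ReservedWords) -> Vec<String> {
    unsafe {
        let header = &*ptr;
        let items = std::slice::from_raw_parts(header.words, header.len);
        items
            .iter()
            .map(|w| CStr::from_ptr(w.word).to_string_lossy().into_owned())
            .collect()
    }
}

/// Mirror image of build_words: re-owns the header, the array and every string.
///
/// # Safety
/// `ptr` must come from `build_words` and must be freed at most once.
pub unsafe extern "C" fn free_words(ptr: *mut ReservedWords) {
    if ptr.is_null() {
        return;
    }
    unsafe {
        let header = Box::from_raw(ptr);
        let words = Vec::from_raw_parts(header.words, header.len, header.len);
        for w in words {
            drop(CString::from_raw(w.word));
            drop(CString::from_raw(w.reason));
            drop(CString::from_raw(w.kind));
        }
    }
}
```

Every into_raw in build_words has exactly one matching from_raw (or from_raw_parts) in free_words. Running a sketch like this under a leak checker or Miri is one way to confirm nothing was missed.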


Wrapping Up

I'm a Rust beginner, but I had this working after just a few short hours of experimentation. Some background in the topics helps, but it's a testament to how well designed Rust is as a language.

Happy hacking!
