Sending complex structs to Ruby from Rust
Lee Hambley
Posted on July 20, 2020
Actually this works for any language using the Foreign Function Interface (FFI), so the same can be said of Node.js, Python, or anything else (even lower level languages such as C and C++).
C's Application Binary Interface (ABI) is the standard by which nearly all languages interoperate. C defines how memory should be laid out, and if two languages can agree on that, and on the calling conventions (how and where arguments and return values are provided in memory) then you can call one from the other.
This isn't absolutely universal: Go, for example, with its runtime and garbage collector, makes this much, much more difficult. (My theory is that Google does so much with gRPC and Protobufs that C interop in the same memory space wasn't a design goal of Go, because the "Google calling convention" sits at a much higher level of abstraction.)
Anyway, I was working on a piece of tooling that should be consumed from Node.js and Ruby, and I thought Rust would be an amazing place to start: I can code the logic once and simply use it from both languages. I'd had minimal experience with FFI before, and minimal experience building C libraries to consume (from other C programs, at the time), so I wanted to tackle the hard part first.
Getting Started
My test program exposed a struct, a simple thing with two string fields and one enum field, which together hold metadata about reserved words (words my program can't allow you to use for your own purposes):
pub struct Word {
    pub word: &'static str,
    pub reason: &'static str,
    pub kind: Kind,
}
Two things to point out:
- The strings are compiled in, hard-coded at compile time. The list of reserved words is not dynamic, so the &'static lifetime annotation allows those str pointers to point at a memory location inside the library/binary where those compiled strings land.
- The kind field is an enum. There are only two kinds, and I believe using enums wherever possible over strings is really important, but we do have to do some work to make sure it can behave like a string:
use std::fmt;

#[derive(Debug, PartialEq)]
pub enum Kind {
    Runner,
    Flag,
}

impl fmt::Display for Kind {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        fmt::Debug::fmt(self, f)
    }
}
This combination is cool: #[derive(Debug)] has the compiler generate a Debug implementation that prints the variant name, and the impl fmt::Display block delegates to it, which teaches my enum how to stringify itself. So the struct on the Rust side has two string members and one enum member which we can bend into a string when we need to. To complete the example before moving on to the FFI boundary code, let's see the function that exposes these:
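To make the stringification concrete, here is a minimal sketch of the same enum with its `Display` impl, plus a small helper (`kind_label` is a hypothetical name, not part of the original program) that calls `to_string()`. The only assumption is that `Display` forwards to the derived `Debug` output, as in the snippet above.

```rust
use std::fmt;

#[derive(Debug, PartialEq)]
pub enum Kind {
    Runner,
    Flag,
}

// Delegate Display to the derived Debug impl, so a unit variant prints as its name.
impl fmt::Display for Kind {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        fmt::Debug::fmt(self, f)
    }
}

/// Hypothetical helper: render a `Kind` as an owned `String`.
pub fn kind_label(kind: &Kind) -> String {
    // `to_string()` comes from the blanket `ToString` impl for every `Display` type.
    kind.to_string()
}
```

Calling `kind_label(&Kind::Runner)` yields the string `Runner`, which is exactly what later gets wrapped for the C side.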
pub fn words() -> Vec<Word> {
    vec![
        Word {
            word: "python",
            reason: "test test test test",
            kind: Kind::Runner,
        },
        Word {
            word: "bash3",
            reason: "Used as an extension to activate the Bash (v3) runner.",
            kind: Kind::Runner,
        },
    ]
}
Here you can see that the strings are hard-coded and &'static, and that kind is a kind of singleton instance (if you like) of the Runner member of the enum.
C Interoperability
Unfortunately, we can't send this struct over a boundary to any C API compatible program. The ABI doesn't know what to do with a Rust string slice (&str), and it doesn't know how to unpack the meaning of our enum. The biggest problems are that C expects strings to be terminated with a 0x00 byte (the null byte), whereas a &str is a pointer plus a length with no terminator, and that a plain Rust enum has no layout that C is guaranteed to understand. If we sent our Word to a Ruby or Node program, it would try to follow the rules of the C ABI, but we're not complying, so it would almost certainly read past some memory limit while looking for the end of a string or trying to interpret the enum, and cause a segmentation fault, which is fatal for the program.
For C ABI interop, Rust provides the std::ffi and std::os::raw modules, which contain types that are C ABI compatible.
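As a rough illustration of the mismatch, the sketch below compares a `&str` reference with a `CString`. The helper names `str_ref_size` and `c_bytes` are made up for this example; the point is only that a `&str` carries its length next to the pointer, while `CString` stores the bytes followed by a terminating zero.

```rust
use std::ffi::CString;
use std::mem::size_of;

/// Size of a `&str` reference in bytes: a data pointer plus a length.
pub fn str_ref_size() -> usize {
    size_of::<&str>()
}

/// Bytes of a `CString` built from `s`, including the trailing NUL that C expects.
pub fn c_bytes(s: &str) -> Vec<u8> {
    // `CString::new` fails if the input already contains a 0x00 byte.
    let c = CString::new(s).expect("input must not contain interior NUL bytes");
    c.as_bytes_with_nul().to_vec()
}
```

For `"bash3"` the Rust string is 5 bytes long, while `c_bytes("bash3")` returns 6 bytes, the last one being `0`.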
With that in mind, I elected to keep all my Rust code Rust-ish and define a boundary function that takes my Rust data structures and makes them work with C. Here's what I came up with for the types; we'll talk about the function right after:
use std::os::raw::c_char;

#[repr(C)]
pub struct ReservedWord {
    word: *mut c_char,
    reason: *mut c_char,
    kind: *mut c_char,
}
Reminder: in C, a string is a pointer to its first char (an 8-bit byte). You iterate by taking the initial address and repeatedly adding sizeof(char) until the character you read is a 0x00 null byte. Whether the pointer is *mut or *const (in Rust) makes no difference to the C side, but the APIs we'll use to hand over a raw pointer (into_raw on CString and on Box) happen to return *mut rather than *const, so we do the same. It doesn't hurt anything (and, as soon as we send a pointer to another program we are outside Rust's safety guarantees, so *mut somewhat encodes that to remind us).
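The walk-until-NUL idea can be sketched in Rust itself. `manual_strlen` and `demo_len` are hypothetical helpers written for this illustration; they count bytes the way C's `strlen` would and compare the result with what `CStr` reports.

```rust
use std::ffi::{CStr, CString};
use std::os::raw::c_char;

/// Count the bytes before the terminating NUL, the way C's `strlen` does.
///
/// # Safety
/// `ptr` must point at a valid NUL-terminated string.
pub unsafe fn manual_strlen(ptr: *const c_char) -> usize {
    let mut len = 0;
    loop {
        // Step one `c_char` at a time from the start address.
        let byte = unsafe { *ptr.add(len) };
        if byte == 0 {
            return len;
        }
        len += 1;
    }
}

/// Returns (length found by walking the pointer, length reported by `CStr`).
pub fn demo_len(s: &str) -> (usize, usize) {
    let c = CString::new(s).expect("input must not contain interior NUL bytes");
    // Borrowed pointer: `c` stays alive until the end of this function.
    let ptr = c.as_ptr();
    let manual = unsafe { manual_strlen(ptr) };
    let via_cstr = unsafe { CStr::from_ptr(ptr) }.to_bytes().len();
    (manual, via_cstr)
}
```

Both numbers agree for any input without interior NUL bytes, which is the contract a C, Ruby, or Node caller relies on.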
The #[repr(C)] attribute is interesting: it tells Rust's compiler to lay this struct out as C would expect. In this case sizeof(ReservedWord) is 3 × sizeof(*mut c_char), which is 24 bytes on a 64-bit platform (12 bytes on a 32-bit one).
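The size claim is easy to check with `std::mem::size_of`. The sketch below repeats the struct and adds a hypothetical helper, `reserved_word_layout`, that reports size and alignment; with three pointers and no padding, the size comes out as three pointer widths (24 bytes on a typical 64-bit target, 12 on a 32-bit one).

```rust
use std::mem::{align_of, size_of};
use std::os::raw::c_char;

#[repr(C)]
pub struct ReservedWord {
    word: *mut c_char,
    reason: *mut c_char,
    kind: *mut c_char,
}

/// Hypothetical helper: (size, alignment) of `ReservedWord` in bytes.
pub fn reserved_word_layout() -> (usize, usize) {
    (size_of::<ReservedWord>(), align_of::<ReservedWord>())
}
```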
We could possibly take a shortcut now, because in the example the strings are &'static, and we wouldn't run into lifetime issues: as long as our library is loaded, the strings stay valid. But for my use case this was only true for the teaching example; I knew that I'd soon need to send data over the C ABI that didn't have a simple static lifetime.
So, we want to write a function with the following signature:
#[no_mangle]
extern "C" fn reserved_words() -> *mut ReservedWords {
}
The #[no_mangle] attribute tells the compiler not to mangle the symbol name for this function in the resulting library, so it is exported under exactly the name reserved_words and anyone who loads us can call it.
You'll also notice the extern "C" in the preamble. That's important too: it tells the compiler to use the C calling convention for this function.
You may have noticed I'm returning a *mut ReservedWords, plural. We haven't mentioned that yet, but because of the way arrays work in C (hint: like seemingly everything else, they are pointers) we need to tell the caller how many ReservedWord entries to expect, so they can read precisely that many and stay within the memory boundaries. Here's the struct definition:
#[repr(C)]
pub struct ReservedWords {
    len: usize,
    words: *mut ReservedWord,
}
usize is interesting: its size depends on the platform you are compiling for, 4 bytes on 32-bit targets and 8 bytes on 64-bit ones, and on mainstream platforms it corresponds to C's size_t. Using a size type over a bare int or uint just helps us communicate the intent of this field a bit better.
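The length-plus-pointer convention can be sketched with a simpler payload. `IntArray`, `pack`, `sum`, and `release` are hypothetical names used only here: the struct mirrors the shape of `ReservedWords` but holds plain integers, and the reader reconstructs a slice from the start pointer and the count, which is what a foreign caller effectively does.

```rust
use std::{ptr, slice};

/// Hypothetical stand-in for `ReservedWords`, holding plain integers.
#[repr(C)]
pub struct IntArray {
    len: usize,
    items: *mut i32,
}

/// Move a vector to the heap as a boxed slice and keep only pointer and length.
pub fn pack(values: Vec<i32>) -> IntArray {
    let len = values.len();
    // A boxed slice holds exactly `len` elements, with no spare capacity.
    let items = Box::into_raw(values.into_boxed_slice()) as *mut i32;
    IntArray { len, items }
}

/// Read the elements back from the start pointer plus the element count.
pub fn sum(arr: &IntArray) -> i32 {
    let view = unsafe { slice::from_raw_parts(arr.items, arr.len) };
    view.iter().sum()
}

/// Rebuild the boxed slice so Rust frees the allocation.
pub fn release(arr: IntArray) {
    unsafe { drop(Box::from_raw(ptr::slice_from_raw_parts_mut(arr.items, arr.len))) };
}
```

Without the `len` field the reader would have no way to know where the array ends, since the pointer alone only identifies the first element.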
OK, pack it up!
So there's no real way to sugar coat this, here's 20 lines of code which implement that function correctly, we'll talk about the important parts right after:
use std::ffi::CString;

#[no_mangle]
extern "C" fn reserved_words() -> *mut ReservedWords {
    let mut v: Vec<ReservedWord> = vec![]; // ①
    for word in reserved::words() { // ②
        // ③ wrap each string in a NUL-terminated CString; ④ unwrap the Result
        let w = CString::new(word.word).expect("boom:word");
        let r = CString::new(word.reason).expect("boom:reason");
        // ⑤ stringify this word's own kind via its Display impl
        let k = CString::new(word.kind.to_string()).expect("boom:kind");
        v.push(ReservedWord {
            word: w.into_raw(), // ⑥
            reason: r.into_raw(),
            kind: k.into_raw(),
        }); // ⑦
    }
    let rw = ReservedWords {
        len: v.len(), // ⑧
        words: Box::into_raw(v.into_boxed_slice()) as *mut ReservedWord, // ⑨
    };
    Box::into_raw(Box::new(rw)) // ⑩
}
The biggest thing we've introduced here is Box. If you're new to Rust, know that Box is a way to store data on the heap rather than on the stack. There's a lot more to this, but suffice it to say it's a good way to get past the lifetime restrictions, because we need this data to exist even after our function has returned (if we stored it on the stack, it would cease to exist when the stack frame was popped and we returned to the caller).
Allocating on the heap isn't free. Often, when you hear about high-performance "alloc-free" code, it means the programmers succeeded in never needing to allocate on the heap, using only the stack (which is pre-allocated for you, and much, much cheaper in time).
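A small sketch of why the heap matters here: the value below is created inside a function, yet remains readable after that function has returned, because `Box::into_raw` hands ownership to a raw pointer instead of dropping the box. The function names (`make_counter`, `bump`, `destroy_counter`) are invented for this example.

```rust
/// Allocate a value on the heap and hand out a raw pointer to it.
/// The allocation outlives this function because the Box is never dropped here.
pub fn make_counter(start: u32) -> *mut u32 {
    let boxed = Box::new(start);
    Box::into_raw(boxed)
}

/// Increment the value behind the pointer and return the new value.
///
/// # Safety
/// `ptr` must come from `make_counter` and must not have been destroyed yet.
pub unsafe fn bump(ptr: *mut u32) -> u32 {
    unsafe {
        *ptr += 1;
        *ptr
    }
}

/// Take ownership back so the allocation is freed; returns the final value.
///
/// # Safety
/// `ptr` must come from `make_counter` and must not be used afterwards.
pub unsafe fn destroy_counter(ptr: *mut u32) -> u32 {
    let boxed = unsafe { Box::from_raw(ptr) };
    *boxed
}
```

A local `u32` on the stack could not be returned by pointer like this; it would be gone as soon as `make_counter` returned.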
So, let's check out the annotated code:
- ① We're using a vector again. Vec<T> doesn't meet the C ABI requirements, but we'll fix that a few lines later.
- ② A simple loop to copy them over. Note that at this point we still don't know (or care) how many there are; we'll count them later.
- ③ CString::new accepts anything that can be converted into a Vec<u8>, so it is quite happy to take a &'static str. CString will wrap our string and terminate it with a null byte; it's the type we need to use to expose a string to C properly.
- ④ CString::new returns a Result<CString, NulError>. In my case the strings are statically allocated and hard-coded, so I'm taking liberties with the error checking and unwrapping it with .expect("boom:word").
- ⑤ The impl fmt::Display we wrote for enum Kind (backed by #[derive(Debug)]) allows us to call to_string() on our enum. That returns something like Runner as a dynamically allocated String, and CString::new is just as happy to accept this.
- ⑥ Two things are happening here:
  - into_raw is really interesting: it consumes the CString and essentially causes Rust to forget about it, which means the string will continue to exist even after our function has returned.
  - The let w = CString::new... bindings sit on their own lines mostly for readability. Because into_raw() takes ownership, chaining it directly onto CString::new(...).expect(...) would also work. What would not work is calling as_ptr() on such a temporary: the CString would be dropped at the end of the statement and the pointer would dangle.
- ⑦ We just push an instance of ReservedWord into the vector. At this point a ReservedWord (the C ABI compatible one) is 24 bytes (on a 64-bit platform) of densely packed pointers to strings that are somewhere in memory.
- ⑧ Calling len() on a Vec<T> returns a usize, which is why we used that in our struct. This is the first time we need to know how many things are in the vector.
- ⑨ v.into_boxed_slice() takes our vector, trims any unused capacity at the end, and gives us a heap-allocated slice of exactly len contiguous elements, which is what C expects of an array. Box::into_raw() then releases that allocation from Rust's ownership tracking, so we just successfully "leaked" a bunch of ReservedWord values 🚀. Interestingly, this method wants to return a *mut [ReservedWord], but slices aren't defined in the C ABI, so we just cast the pointer to *mut ReservedWord. We know the length, and we're pointing at just the first one anyway, so this is fine and keeps us compatible with the C ABI.
- ⑩ Lastly, we "leak" the ReservedWords struct, containing the length and the pointer to the first ReservedWord.
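The ten steps can be exercised end to end without leaving Rust. The following is a sketch under stated assumptions, not the article's actual library: `build` takes hypothetical (word, reason, kind) triples instead of calling `reserved::words()`, `read_words` plays the role of the foreign caller by touching only raw pointers, and `free_words` is an illustrative clean-up routine.

```rust
use std::ffi::{CStr, CString};
use std::os::raw::c_char;
use std::{ptr, slice};

#[repr(C)]
pub struct ReservedWord {
    word: *mut c_char,
    reason: *mut c_char,
    kind: *mut c_char,
}

#[repr(C)]
pub struct ReservedWords {
    len: usize,
    words: *mut ReservedWord,
}

// Copy a Rust string into a NUL-terminated heap allocation and release ownership.
fn leak_str(s: &str) -> *mut c_char {
    CString::new(s).expect("no interior NUL").into_raw()
}

/// Hypothetical builder: leak every string, then the array, then the outer struct.
pub fn build(input: &[(&str, &str, &str)]) -> *mut ReservedWords {
    let v: Vec<ReservedWord> = input
        .iter()
        .map(|&(w, r, k)| ReservedWord {
            word: leak_str(w),
            reason: leak_str(r),
            kind: leak_str(k),
        })
        .collect();
    let len = v.len();
    let words = Box::into_raw(v.into_boxed_slice()) as *mut ReservedWord;
    Box::into_raw(Box::new(ReservedWords { len, words }))
}

/// Read the `word` fields back through raw pointers only, as a foreign caller would.
///
/// # Safety
/// `rws` must be a live pointer returned by `build`.
pub unsafe fn read_words(rws: *const ReservedWords) -> Vec<String> {
    let rws = unsafe { &*rws };
    let items = unsafe { slice::from_raw_parts(rws.words, rws.len) };
    items
        .iter()
        .map(|rw| unsafe { CStr::from_ptr(rw.word) }.to_string_lossy().into_owned())
        .collect()
}

/// Reclaim everything that `build` leaked.
///
/// # Safety
/// `rws` must come from `build` and must not be used afterwards.
pub unsafe fn free_words(rws: *mut ReservedWords) {
    if rws.is_null() {
        return;
    }
    unsafe {
        let outer = Box::from_raw(rws);
        let items = Box::from_raw(ptr::slice_from_raw_parts_mut(outer.words, outer.len));
        for rw in items.iter() {
            drop(CString::from_raw(rw.word));
            drop(CString::from_raw(rw.reason));
            drop(CString::from_raw(rw.kind));
        }
    }
}
```

Every allocation made in `build` has a matching reclaim in `free_words`: three strings per entry, the array, and the outer struct.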
My apologies that this was quite in-depth. I fought tooth and nail for the knowledge in each of those ten bullet points. Of course I read the docs, but until I had the mental model I couldn't really decipher what I was reading. In hindsight, the docs make it crystal clear what's going on here.
Specifically, if you take only one thing away from this post, it is that you must explicitly leak every piece of memory that you want to make available outside Rust. In this case that's three strings for each ReservedWord, the array holding one ReservedWord for each reserved word, and finally the ReservedWords struct that is the entry point for any C ABI caller to access this data.
If you don't do this, your library may work some or all of the time. I could access most structs most of the time when I was taking liberties with this code; in those cases I had been lucky that the system hadn't reused or reallocated that memory. In a test script on an otherwise idle machine that will probably be true more of the time; in a busy production environment I'd expect you to discover this mistake sooner.
Use From Ruby
No exhaustive commentary here, just a block of code to show how to use this from Ruby, check the FFI docs for more:
require "ffi"

module Mitre
  class ReservedWord < FFI::Struct
    layout :word, :string,
           :reason, :string,
           :kind, :string
  end

  class ReservedWords < FFI::Struct
    # usize on the Rust side corresponds to size_t, not an 8-bit integer
    layout :len, :size_t,
           :words, :pointer
  end

  extend FFI::Library
  ffi_lib begin
    prefix = Gem.win_platform? ? "" : "lib"
    "#{File.expand_path("./target/debug/", __dir__)}/#{prefix}mitre.#{FFI::Platform::LIBSUFFIX}"
  end
  attach_function :reserved_words, [], :pointer
  attach_function :free_reserved_words, [:pointer], :void
end

def print_rw(rw)
  puts "Word: #{rw[:word]} | Kind: #{rw[:kind]} | Reason: #{rw[:reason]}"
end

ptr = Mitre.reserved_words
rws = Mitre::ReservedWords.new(ptr)
puts rws.to_ptr
puts "There are #{rws[:len]} reserved words"
0.upto(rws[:len] - 1) do |i|
  rw = Mitre::ReservedWord.new(rws[:words] + (i * Mitre::ReservedWord.size))
  print_rw(rw)
end
# Hand the memory back to Rust once we are done reading it
Mitre.free_reserved_words(ptr)
Reclaiming memory
Keen eyes who made it this far might have spotted the free_reserved_words function. This is a simple one: to avoid leaking memory, we need to free it. We could possibly call free() from Ruby, and it might even work some of the time, on some systems. However, differences in which allocator Rust and Ruby (or anything else) were built with, and even how that allocator is configured at runtime, can be significant.
The rule is always, whoever allocates frees, so let's look at that:
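// Needs std::ffi::CString in scope, as in reserved_words() above.
#[no_mangle]
extern "C" fn free_reserved_words(ptr: *mut ReservedWords) {
    if ptr.is_null() {
        eprintln!("free_reserved_words() error got NULL ptr!");
        ::std::process::abort();
    }
    unsafe {
        // Take back ownership of the outer struct
        let w: Box<ReservedWords> = Box::from_raw(ptr);
        // Rebuild the vector; a boxed slice has capacity == len
        let words: Vec<ReservedWord> = Vec::from_raw_parts(w.words, w.len, w.len);
        for word in words {
            // Each string is freed when the reclaimed CString is dropped
            drop(CString::from_raw(word.kind));
            drop(CString::from_raw(word.reason));
            drop(CString::from_raw(word.word));
        }
    }
}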
No need for exhaustive explanations here either, really. from_raw is the exact opposite of into_raw, and Vec::from_raw_parts undoes into_boxed_slice for the Vec<T> (the boxed slice's capacity equals its length, which is why w.len is passed twice).
It is sufficient to make Rust aware of all the pointers again, to "reclaim" them, and let the function simply return. Rust's ownership model will do the rest: each reclaimed value is dropped at the end of its scope, so the clean-up happens automagically.
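To see the "automagic" part happen, the sketch below counts destructor calls. `Noisy`, `leak`, `reclaim`, and `drops` are hypothetical names for this illustration: leaking with `Box::into_raw` runs no destructor, and merely rebuilding the `Box` with `from_raw` and letting it fall out of scope runs it exactly once.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Counts how many `Noisy` values have been dropped so far.
static DROPS: AtomicUsize = AtomicUsize::new(0);

pub struct Noisy {
    _payload: u8,
}

impl Drop for Noisy {
    fn drop(&mut self) {
        DROPS.fetch_add(1, Ordering::SeqCst);
    }
}

/// Number of `Noisy` destructors that have run.
pub fn drops() -> usize {
    DROPS.load(Ordering::SeqCst)
}

/// Leak a heap-allocated `Noisy`: no destructor runs when this function returns.
pub fn leak() -> *mut Noisy {
    Box::into_raw(Box::new(Noisy { _payload: 0 }))
}

/// Rebuild the Box and let it go out of scope, which drops and frees the value.
///
/// # Safety
/// `ptr` must come from `leak` and must not be used afterwards.
pub unsafe fn reclaim(ptr: *mut Noisy) {
    let _owned = unsafe { Box::from_raw(ptr) };
    // `_owned` is dropped here, at the end of the scope.
}
```

The same mechanism is at work in the free function above: nothing is freed explicitly, the reclaimed values simply go out of scope.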
Wrapping Up
I'm a Rust beginner, but I had this working after just a few short hours of experimentation. Some background in the topics helps, but it's a testament to how well designed Rust is as a language.
Happy hacking!