When Ferrous Metals Corrode, pt. XXII

Intro

This post is a summary of chapter 23 in the Rust book, "Foreign Functions", the final chapter in the book.

Apparently there's lots of code out there that's not written in Rust. Rust's foreign function interface (FFI) allows calling C and C++ functions from Rust, providing access to low-level facilities. As an example the chapter shows linking with libgit2.

Finding Common Data Representations

Rust and C share a common denominator: machine language. To understand how Rust values are represented in C, or vice versa, you need to consider their machine-level representations. C and Rust have a lot in common there. For example, Rust's usize and C's size_t are identical, and structs are essentially the same concept in both languages.

To explore those similarities between Rust and C types, the book starts with primitives and then moves on to more complex types. However, C has always been quite liberal about its types' representations, which can vary in size or in whether they are signed or unsigned. Rust's std::os::raw module defines a set of Rust types that are guaranteed to have the same representation as their C counterparts. These types cover the primitive integer and character types, which is convenient given Rust's primary use as a systems programming language.
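To get a feel for how much these types can vary, here is a small sketch that lists the sizes of a few std::os::raw types on whatever target it is compiled for. The function names are my own, not from the book.

```rust
use std::mem::size_of;
use std::os::raw::{c_char, c_int, c_long, c_short};

/// Names and byte sizes of a few C-compatible types on the current target.
fn c_type_sizes() -> [(&'static str, usize); 4] {
    [
        ("c_char", size_of::<c_char>()),
        ("c_short", size_of::<c_short>()),
        ("c_int", size_of::<c_int>()),
        ("c_long", size_of::<c_long>()),
    ]
}

/// Print the table. c_long is the one most likely to differ between
/// platforms (for example 64-bit Linux versus 64-bit Windows).
fn print_c_type_sizes() {
    for (name, size) in c_type_sizes().iter() {
        println!("{:8} {} bytes", name, size);
    }
}
```

Calling print_c_type_sizes() from main prints one line per type; the exact numbers depend on the platform, which is the whole point of using these aliases instead of hard-coding i32 or i64.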

I've already come across the #[repr(C)] attribute, which asks Rust to arrange struct fields the way a C compiler would.

For instance this C struct:

typedef struct {
    char *message;
    int klass;
} git_error;

Would be identical to this Rust struct:

use std::os::raw::{c_char, c_int};

#[repr(C)]
pub struct git_error {
    pub message: *const c_char,
    pub klass: c_int
}

Note that #[repr(C)] affects only the layout of the struct as a whole, not the representation of its individual fields. For this particular struct there would probably not be much difference, but for enums, for example, the difference could be marked, as Rust's compiler does some heavy optimization there.
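To make the layout effect visible, here is a minimal sketch of my own. It uses a struct rather than an enum because field offsets are easy to measure, and it assumes a typical target where u32 is 4-byte aligned. The type and function names are made up for illustration.

```rust
use std::mem::size_of;

/// Declaration order is preserved and C-style padding is inserted.
#[repr(C)]
struct CLayout {
    a: u8,
    b: u32,
    c: u8,
}

/// Same fields with Rust's default layout, which leaves the compiler free
/// to reorder them.
#[allow(dead_code)]
struct RustLayout {
    a: u8,
    b: u32,
    c: u8,
}

/// Byte offsets of the three fields of a CLayout value.
fn c_layout_offsets() -> (usize, usize, usize) {
    let v = CLayout { a: 1, b: 2, c: 3 };
    let base = &v as *const CLayout as usize;
    (
        &v.a as *const u8 as usize - base,
        &v.b as *const u32 as usize - base,
        &v.c as *const u8 as usize - base,
    )
}

/// (size of CLayout, size of RustLayout) in bytes.
fn layout_sizes() -> (usize, usize) {
    (size_of::<CLayout>(), size_of::<RustLayout>())
}
```

With #[repr(C)] the padding after a and c is mandatory, so the struct occupies 12 bytes on such a target. The default layout is unspecified, and the compiler is free to pack the two u8 fields together and end up with something smaller.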

Strings are a bit harder. Passing strings between Rust and C is difficult because they have different representations. For instance Rust strings don't have null terminators but use a length field and can in fact contain null characters, making them incompatible. To address this, Rust uses the CString and CStr types from the std::ffi module to represent owned and borrowed null-terminated arrays of bytes. These types have limited methods and are mainly used for construction and conversion to other types.
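A quick sketch of the round trip, using only std::ffi. The two helper names are hypothetical; the calls they wrap (CString::new, CStr::to_string_lossy) are the standard library's.

```rust
use std::ffi::{CStr, CString, NulError};

/// Copy a Rust string into a newly allocated, null-terminated C string.
/// Fails if the input contains an interior NUL byte.
fn to_c_string(s: &str) -> Result<CString, NulError> {
    CString::new(s)
}

/// Turn a borrowed C string back into an owned Rust String, replacing any
/// invalid UTF-8 with the U+FFFD replacement character.
fn to_rust_string(c: &CStr) -> String {
    c.to_string_lossy().into_owned()
}
```

For example, to_c_string("hello") yields the six bytes of "hello" plus the terminating zero, while to_c_string("he\0llo") returns an error, because a C consumer would otherwise see the string cut short at the embedded null.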

Declaring Foreign Functions and Variables

This makes the libc strlen function available to Rust:

use std::os::raw::c_char;

extern {
    fn strlen(s: *const c_char) -> usize;
}

Rust assumes that functions declared inside extern blocks use C calling conventions and treats them as unsafe functions. The unsafe is necessary for functions that take a raw pointer, because Rust cannot enforce the contract that C requires. With an extern block in place, Rust can call C functions like strlen as if they were regular Rust functions, although the function's type still records that it is a C function.

use std::ffi::CString;

let rust_str = "I'll be back";
let null_terminated = CString::new(rust_str).unwrap();
unsafe {
    assert_eq!(strlen(null_terminated.as_ptr()), 12);
}

Rust works with C strings through the CString type, which lets Rust pass null-terminated C strings as function arguments or return values. The CString::new function builds a null-terminated C string by appending a null byte to the bytes of a Rust string. CString dereferences to CStr, whose as_ptr method returns the *const c_char pointer that C functions such as strlen expect. The conversion might entail a copy of the string, so conversion costs can vary.

Global variables in C:

extern char **environ;

Would need to be written like this in Rust:

use std::ffi::CStr;
use std::os::raw::c_char;

extern {
    static environ: *mut *mut c_char;
}

Printing this:

unsafe {
    if !environ.is_null() && !(*environ).is_null() {
        let var = CStr::from_ptr(*environ);
        println!("first environment variable: {}",
                 var.to_string_lossy())
    }
}

Using Functions from Libraries

To use external libraries in Rust, you can use a #[link] attribute to specify the library to link with (see above for the extern keyword), e.g.:

use std::os::raw::c_int;

#[link(name = "git2")]
extern {
    pub fn git_libgit2_init() -> c_int;
    pub fn git_libgit2_shutdown() -> c_int;
}
...

To tell the linker where to search for libs you'll need a build.rs, for example:

fn main() {
    println!(r"cargo:rustc-link-search=native=/home/jimb/libgit2-0.25.1/build");
}

This tells the linker where to find the library at build time, but the program also needs to find the shared library (.so) at runtime, either by placing the library in a canonical location or via LD_LIBRARY_PATH. Alternatively, use static linking.

See the build script docs for details.

A Raw Interface to libgit2

In this section, the text outlines two questions that need to be answered in order to properly use libgit2 in Rust: how to use libgit2 functions in Rust, and how to build a safe Rust interface around them.

First, the book proceeds to answer the first question by writing a program that opens a Git repository and prints out the head commit, using a raw interface that requires a large collection of functions and types from libgit2 – essentially a large unsafe block.

Doing this by hand can be tedious; a tool that helps with creating these interfaces is the bindgen crate which can be used from build scripts.

Then we wrap that raw interface in a safe wrapper to enforce libgit2 rules.

libgit2 uses positive or zero return codes for success and negative codes for failure. For error details, use the giterr_last function which returns a pointer to a git_error structure. This structure is owned by libgit2 and should not be freed by the user. A Rust interface would typically use Result to handle errors, but in the raw interface, a custom function will be used instead:

use std::ffi::CStr;
use std::os::raw::c_int;

fn check(activity: &'static str, status: c_int) -> c_int {
    if status < 0 {
        unsafe {
            let error = &*raw::giterr_last();
            println!("error while {}: {} ({})",
                     activity,
                     CStr::from_ptr(error.message).to_string_lossy(),
                     error.klass);
            std::process::exit(1);
        }
    }

    status
}

// Usage
check("initializing library", raw::git_libgit2_init());

This constructs a string from a C string as before.

Printing out commits:

unsafe fn show_commit(commit: *const raw::git_commit) {
    let author = raw::git_commit_author(commit);

    let name = CStr::from_ptr((*author).name).to_string_lossy();
    let email = CStr::from_ptr((*author).email).to_string_lossy();
    println!("{} <{}>\n", name, email);

    let message = raw::git_commit_message(commit);
    println!("{}", CStr::from_ptr(message).to_string_lossy());
}

Here we create an unsafe function called show_commit for showing author and message of a git commit. The func takes a pointer to a git_commit and uses git_commit_author and git_commit_message funcs from the libgit2 library to get at the necessary data. The API relies on raw pointers – Rust cannot enforce lifetime checks here. This could lead to crashes if dangling pointers are accidentally created. Also the code assumes that the text is in UTF-8 format (but Git allows other encodings as well – leaving out proper handling of text encodings for the sake of brevity here).

Next bits:

let mut repo = ptr::null_mut();
check("opening repository",
      raw::git_repository_open(&mut repo, path.as_ptr()));

In Rust, we use git_repository_open to open a Git repository at a given path, which then initializes a git_repository object. By the libgit2 convention objects returned via pointer-to-pointer are owned by the caller (who is responsible for freeing).

For looking up the object hash of a commit:

let oid = {
    let mut oid = mem::MaybeUninit::uninit();
    check("looking up HEAD",
          raw::git_reference_name_to_id(oid.as_mut_ptr(), repo, c_name));
    oid.assume_init()
};

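We find the object hash of the repository's current head commit using git_reference_name_to_id. Rust's MaybeUninit reserves space for the value without initializing it first, so libgit2 can fill it in with no additional overhead; calling assume_init is then our promise that the value really has been initialized.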
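The output-parameter pattern can be tried out without libgit2. In this sketch fill_id is a hypothetical stand-in for a C function (written in Rust so the example is self-contained), and get_id is a safe wrapper that only calls assume_init once the status code says the value was written.

```rust
use std::mem::MaybeUninit;

/// Hypothetical stand-in for a C function with an output parameter: on
/// success it writes 20 bytes through `out` and returns 0; on failure it
/// leaves `out` untouched and returns -1.
unsafe fn fill_id(out: *mut [u8; 20], succeed: bool) -> i32 {
    if succeed {
        unsafe { out.write([0xab; 20]) };
        0
    } else {
        -1
    }
}

/// Safe wrapper: assume_init is reached only after the status check passed.
fn get_id(succeed: bool) -> Option<[u8; 20]> {
    let mut id = MaybeUninit::<[u8; 20]>::uninit();
    let status = unsafe { fill_id(id.as_mut_ptr(), succeed) };
    if status < 0 {
        // The buffer was never written, so it must not be read.
        return None;
    }
    Some(unsafe { id.assume_init() })
}
```

The ordering matters: calling assume_init on the failure path would read uninitialized memory, which is exactly the kind of rule the raw interface leaves for us to follow by hand.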

After obtaining the commit's obj identifier and looking up the actual commit, we call the show_commit fn to display commit information before freeing the commit and repository objects and shutting down the library.

show_commit(commit);

raw::git_commit_free(commit);

raw::git_repository_free(repo);

check("shutting down library", raw::git_libgit2_shutdown());

A Safe Interface to libgit2

The book goes on to present a safe interface to libgit2.

The raw interface to libgit2 is a perfect example of an unsafe feature: it certainly can be used correctly (as we do here, so far as we know), but Rust can’t enforce the rules you must follow. Designing a safe API for a library like this is a matter of identifying all these rules and then finding ways to turn any violation of them into a type or borrow-checking error.

Here, then, are libgit2’s rules for the features the program uses:

  • You must call git_libgit2_init before using any other library function. You must not use any library function after calling git_libgit2_shutdown.

  • All values passed to libgit2 functions must be fully initialized, except for output parameters.

  • When a call fails, output parameters passed to hold the results of the call are left uninitialized, and you must not use their values.

  • A git_commit object refers to the git_repository object it is derived from, so the former must not outlive the latter. (This isn’t spelled out in the libgit2 documentation; we inferred it from the presence of certain functions in the interface and then verified it by reading the source code.)

  • Similarly, a git_signature is always borrowed from a given git_commit, and the former must not outlive the latter. (The documentation does cover this case.)

  • The message associated with a commit and the name and email address of the author are all borrowed from the commit and must not be used after the commit is freed.

  • Once a libgit2 object has been freed, it must never be used again.

As it turns out, you can build a Rust interface to libgit2 that enforces all of these rules, either through Rust’s type system or by managing details internally.

Since even libgit2's initialization function can return an error code, a proper error handling mechanism is needed. An idiomatic Rust interface should have its own Error type that captures the libgit2 failure code, error message, and class from giterr_last. This must implement the standard Error, Debug, and Display traits. Additionally, it requires its own Result type that utilizes the custom Error type:

use std::os::raw::c_int;
use std::ffi::CStr;

fn check(code: c_int) -> Result<c_int> {
    if code >= 0 {
        return Ok(code);
    }

    unsafe {
        let error = raw::giterr_last();

        // libgit2 ensures that (*error).message is always non-null and null
        // terminated, so this call is safe.
        let message = CStr::from_ptr((*error).message)
            .to_string_lossy()
            .into_owned();

        Err(Error {
            code: code as i32,
            message,
            class: (*error).klass as i32
        })
    }
}

The main difference between this function and the check function from the raw version is that it creates an Error value rather than printing an error message and exiting the program.
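The Error and Result types that check relies on are not shown above. A minimal sketch, consistent with the fields check fills in, might look like the following. The check_with function and the two accessors are hypothetical additions of mine so the example runs without libgit2; check_with takes the message and class as parameters instead of reading them from giterr_last.

```rust
mod git {
    use std::error;
    use std::fmt;
    use std::result;

    /// Failure code, message, and class of a failed library call.
    #[derive(Debug)]
    pub struct Error {
        code: i32,
        message: String,
        class: i32,
    }

    impl Error {
        pub fn code(&self) -> i32 {
            self.code
        }

        pub fn class(&self) -> i32 {
            self.class
        }
    }

    impl fmt::Display for Error {
        fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
            // Displaying an Error shows just the message text.
            write!(f, "{}", self.message)
        }
    }

    impl error::Error for Error {}

    /// Result alias that fixes the error type to our own Error.
    pub type Result<T> = result::Result<T, Error>;

    /// Hypothetical stand-in for check: non-negative codes are success,
    /// negative codes become an Error value.
    pub fn check_with(code: i32, message: &str, class: i32) -> Result<i32> {
        if code >= 0 {
            return Ok(code);
        }
        Err(Error { code, message: message.to_string(), class })
    }
}
```

Because Error implements the standard Error trait, callers can use ? to propagate it or box it as Box<dyn std::error::Error> alongside other error types.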

Next comes the initialization of the libgit2 library. We create a safe interface around a Repository type representing an open Git repository, with methods for resolving references, looking up commits, and so on.

The Repository struct is defined with a private raw field so only code within the module can access the raw::git_repository pointer:

pub struct Repository {
    raw: *mut raw::git_repository
}

We can guarantee that each Repository points to a valid git_repository object by making a successful open call the only way to create a Repository:

impl Repository {
    pub fn open<P: AsRef<Path>>(path: P) -> Result<Repository> {
        ensure_initialized();

        let path = path_to_cstring(path.as_ref())?;
        let mut repo = ptr::null_mut();
        unsafe {
            check(raw::git_repository_open(&mut repo, path.as_ptr()))?;
        }
        Ok(Repository { raw: repo })
    }
}

By ensuring that interaction with the safe interface begins with a Repository value, it follows that ensure_initialized is called before any libgit2 functions. The definition of ensure_initialized is:

fn ensure_initialized() {
    static ONCE: std::sync::Once = std::sync::Once::new();
    ONCE.call_once(|| {
        unsafe {
            check(raw::git_libgit2_init())
                .expect("initializing libgit2 failed");
            assert_eq!(libc::atexit(shutdown), 0);
        }
    });
}

The std::sync::Once type runs the initialization code in a thread-safe way. Only the first thread that calls ONCE.call_once runs the given closure; any other thread calling it in the meantime blocks until the first call completes, and later calls return immediately without running the closure again.
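The run-exactly-once behaviour is easy to check in isolation. In this sketch a counter stands in for the real library initialization, and several threads race to call ensure_initialized. The counter and the init_count_after_racing helper are my own scaffolding.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Once;
use std::thread;

static INIT: Once = Once::new();
static INIT_COUNT: AtomicUsize = AtomicUsize::new(0);

/// Runs the closure at most once per process, however many threads call it.
fn ensure_initialized() {
    INIT.call_once(|| {
        // Stand-in for a one-time library initialization call.
        INIT_COUNT.fetch_add(1, Ordering::SeqCst);
    });
}

/// Spawn `threads` threads that all call ensure_initialized, wait for them,
/// and report how many times the initialization closure has run in total.
fn init_count_after_racing(threads: usize) -> usize {
    let handles: Vec<_> = (0..threads)
        .map(|_| thread::spawn(ensure_initialized))
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    INIT_COUNT.load(Ordering::SeqCst)
}
```

However many threads take part, and however often the helper is called, the count stays at one.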

The initialization closure calls git_libgit2_init and checks the result, relying on expect for checking successful initialization. To guarantee that the program calls git_libgit2_shutdown, the initialization closure uses the C library atexit function:

extern fn shutdown() {
    unsafe {
        if let Err(e) = check(raw::git_libgit2_shutdown()) {
            eprintln!("shutting down libgit2 failed: {}", e);
            std::process::abort();
        }
    }
}

Previously, in the "Unwinding" section, the text noted that panics crossing language boundaries result in undefined behavior. So the shutdown function must handle errors from raw::git_libgit2_shutdown without using .expect. And since exit must not be called from within an atexit handler, it terminates the process with std::process::abort instead.

Calling git_libgit2_shutdown is the safe API's responsibility so we know existing libgit2 objects are safe to use. To close a repository, the Repository value must be dropped:

impl Drop for Repository {
    fn drop(&mut self) {
        unsafe {
            raw::git_repository_free(self.raw);
        }
    }
}

With this the raw::git_repository pointer can't be used after it's freed.
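The same ownership pattern can be sketched without libgit2. Here resource_new and resource_free are hypothetical stand-ins for a C library's allocate/free pair (implemented with Box so the example is self-contained), and Handle plays the role of Repository.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

static FREED: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical stand-in for a C allocation function.
fn resource_new(value: i32) -> *mut i32 {
    Box::into_raw(Box::new(value))
}

/// Hypothetical stand-in for the matching C free function.
/// Safety: `p` must come from resource_new and must not be used afterwards.
unsafe fn resource_free(p: *mut i32) {
    drop(unsafe { Box::from_raw(p) });
    FREED.fetch_add(1, Ordering::SeqCst);
}

/// Owns the raw pointer; the field is private, so nobody else can free it.
struct Handle {
    raw: *mut i32,
}

impl Handle {
    fn new(value: i32) -> Handle {
        Handle { raw: resource_new(value) }
    }

    fn get(&self) -> i32 {
        // Valid for as long as the Handle exists, since only drop frees it.
        unsafe { *self.raw }
    }
}

impl Drop for Handle {
    fn drop(&mut self) {
        unsafe { resource_free(self.raw) }
    }
}

/// Number of resources freed so far.
fn freed_count() -> usize {
    FREED.load(Ordering::SeqCst)
}
```

Since the pointer is only reachable through the Handle, and the free call only happens in drop, use-after-free and double-free are both ruled out by ordinary ownership rules.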

The Repository::open method uses the private path_to_cstring function with separate definitions for Unix-like systems and Windows:

use std::ffi::CString;

#[cfg(unix)]
fn path_to_cstring(path: &Path) -> Result<CString> {
    use std::os::unix::ffi::OsStrExt;
    Ok(CString::new(path.as_os_str().as_bytes())?)
}

#[cfg(windows)]
fn path_to_cstring(path: &Path) -> Result<CString> {
    match path.to_str() {
        Some(s) => Ok(CString::new(s)?),
        None => {
            let message = format!("Couldn't convert path '{}' to UTF-8",
                                  path.display());
            Err(message.into())
        }
    }
}

Libgit2 complicates matters by accepting null-terminated C strings as paths on all platforms. On Windows, it presumes C strings are valid UTF-8, converting them internally to 16-bit paths. This isn't ideal since Windows allows non-well-formed Unicode filenames, which cannot be represented in UTF-8, making it impossible to pass such names to libgit2.

In Rust, std::path::Path is the appropriate representation for a filesystem path, compatible with Windows or POSIX systems. However, some Path values on Windows aren't well-formed UTF-8 and can't be passed to libgit2. The path_to_cstring function's behavior is the best we can achieve given libgit2's limitations.

The path_to_cstring definitions use conversions to the Error type, as shown:

impl From<String> for Error {
    fn from(message: String) -> Error {
        Error { code: -1, message, class: 0 }
    }
}

impl From<std::ffi::NulError> for Error {
    fn from(e: std::ffi::NulError) -> Error {
        Error { code: -1, message: e.to_string(), class: 0 }
    }
}
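To see how these From impls cooperate with the ? operator, here is a self-contained sketch. The trimmed-down Error struct and the to_c_name function are hypothetical, made up for this illustration.

```rust
use std::ffi::{CString, NulError};

/// Trimmed-down error type for illustration.
#[derive(Debug)]
struct Error {
    code: i32,
    message: String,
}

impl From<String> for Error {
    fn from(message: String) -> Error {
        Error { code: -1, message }
    }
}

impl From<NulError> for Error {
    fn from(e: NulError) -> Error {
        Error { code: -1, message: e.to_string() }
    }
}

/// Hypothetical helper: build a C string from a reference name.
fn to_c_name(name: &str) -> Result<CString, Error> {
    if name.is_empty() {
        // From<String> turns a plain message into an Error.
        return Err(String::from("name must not be empty").into());
    }
    // `?` converts a NulError into our Error through From<NulError>.
    Ok(CString::new(name)?)
}
```

The function body never mentions NulError: the ? operator looks up the From impl and performs the conversion, which keeps call sites short while still funnelling every failure into one error type.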

To resolve a Git reference to an object identifier (Oid), we create an Oid struct, and add a method to the Repository to perform the lookup:


use std::mem;
use std::os::raw::c_char;


/// The identifier of some sort of object stored in the Git object
/// database: a commit, tree, blob, tag, etc. This is a wide hash of the
/// object's contents.
pub struct Oid {
    pub raw: raw::git_oid
}

impl Repository {
    // Define a pub method reference_name_to_id that takes a ref to self
    // and a string slice name as input and returns a Result carrying an Oid
    // value or an error.
    pub fn reference_name_to_id(&self, name: &str) -> Result<Oid> {
        // Create a CString from the passed name string slice or return an err
        let name = CString::new(name)?;
        unsafe {
            // Create a MaybeUninit wrapper around oid (which will hold the result)
            // Perform the ref lookup using git_reference_name_to_id and store
            // the result in oid. If the lookup fails, the check function will return
            // an error. Otherwise, oid should be initialized.
            let oid = {
                let mut oid = mem::MaybeUninit::uninit();
                check(raw::git_reference_name_to_id(
                        oid.as_mut_ptr(), self.raw,
                        name.as_ptr() as *const c_char))?;
                oid.assume_init()
            };
            // Return an Oid value wrapped in an Ok.
            Ok(Oid { raw: oid })
        }
    }
}

Define a Commit type and a method Repository::find_commit, shown here as annotated code:

use std::marker::PhantomData;

// Define a public struct Commit with a lifetime parameter 'repo.
pub struct Commit<'repo> {
    // This is a ptr to a usable `git_commit` structure.
    raw: *mut raw::git_commit,
    // This PhantomData field indicates that Rust should treat Commit<'repo>
    // as if it held a ref with lifetime 'repo to some Repository.
    _marker: PhantomData<&'repo Repository>
}

// public method find_commit that takes a ref to self and a ref to an Oid and
// returns a Result carrying a Commit value or an error. 
impl Repository {
    pub fn find_commit(&self, oid: &Oid) -> Result<Commit> {
        // Initialize commit to a null ptr
        let mut commit = ptr::null_mut();
        unsafe {
            // Perform the commit lookup using git_commit_lookup and store the result
            // in commit. If the lookup fails, the check function will return an error.
            check(raw::git_commit_lookup(&mut commit, self.raw, &oid.raw))?;
        }
        // Return a Commit value wrapped in an Ok.
        Ok(Commit { raw: commit, _marker: PhantomData })
    }
}
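The PhantomData trick can be tried on a toy type. In this sketch Owner and View are hypothetical stand-ins for Repository and Commit: View holds only a raw pointer, and the marker field is what ties it to the Owner's lifetime.

```rust
use std::marker::PhantomData;

/// Owns some heap data, like a Repository owns its git_repository.
struct Owner {
    data: Box<i32>,
}

/// Holds only a raw pointer, but the marker makes Rust treat it as if it
/// held a reference of lifetime 'owner to an Owner.
struct View<'owner> {
    raw: *const i32,
    _marker: PhantomData<&'owner Owner>,
}

impl Owner {
    fn new(value: i32) -> Owner {
        Owner { data: Box::new(value) }
    }

    /// The returned View borrows from self, so it cannot outlive it.
    fn view<'a>(&'a self) -> View<'a> {
        View { raw: &*self.data as *const i32, _marker: PhantomData }
    }
}

impl<'owner> View<'owner> {
    fn get(&self) -> i32 {
        // Sound because the borrow checker keeps the Owner alive.
        unsafe { *self.raw }
    }
}

// The borrow checker rejects code like this, because `o` is dropped while
// the View is still in use:
//
//     let v = { let o = Owner::new(1); o.view() };
//     v.get();
```

Without the marker, View would carry no lifetime at all and nothing would stop it from dangling. PhantomData is zero-sized, so the struct stays the size of one pointer.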

More Commit impl: freeing Commits, getting Signatures and commit messages


// Define a Drop implementation for Commit<'repo> that frees its raw::git_commit when dropped.
impl<'repo> Drop for Commit<'repo> {
    fn drop(&mut self) {
        unsafe {
            // Free the Commit using raw::git_commit_free.
            raw::git_commit_free(self.raw);
        }
    }
}

// Define two public methods for Commit<'repo>.
impl<'repo> Commit<'repo> {
    // Define a public method author that returns a Signature.
    pub fn author(&self) -> Signature {
        unsafe {
            // Wrap the borrowed git_signature pointer in a Signature.
            Signature {
                raw: raw::git_commit_author(self.raw),
                // Again PhantomData field 
                _marker: PhantomData
            }
        }
    }

    // Define a public method message that returns an Option containing the commit message.
    pub fn message(&self) -> Option<&str> {
        unsafe {
            // Get the message from the Commit's raw::git_commit using raw::git_commit_message.
            let message = raw::git_commit_message(self.raw);
            // Convert the char pointer to a &str using char_ptr_to_str.
            char_ptr_to_str(self, message)
        }
    }
}

This is the Signature type mentioned above:

// Define a public struct Signature with a lifetime parameter 'text.
pub struct Signature<'text> {
    // This is a ptr to a const raw::git_signature object.
    raw: *const raw::git_signature,
    // PhantomData for lifetime -- we borrow everything from elsewhere
    // so we tell Rust it'll need same lifetime as 'text
    _marker: PhantomData<&'text str>
}

Methods for Signatures:


// Define two public methods for Signature<'text>.
impl<'text> Signature<'text> {
    // public method name that returns an Option containing the author's name.
    pub fn name(&self) -> Option<&str> {
        unsafe {
            // Convert the name char ptr to a &str using char_ptr_to_str.
            char_ptr_to_str(self, (*self.raw).name)
        }
    }

    // public method email that returns an Option containing the author's email.
    pub fn email(&self) -> Option<&str> {
        unsafe {
            // Convert the email char ptr to a &str using char_ptr_to_str.
            char_ptr_to_str(self, (*self.raw).email)
        }
    }
}

/// Try to borrow a `&str` from `ptr`, given that `ptr` may be null or
/// refer to ill-formed UTF-8. Give the result a lifetime as if it were
/// borrowed from `_owner`.
///
/// Safety: if `ptr` is non-null, it must point to a null-terminated C
/// string that is safe to access for at least as long as the lifetime of
/// `_owner`.
unsafe fn char_ptr_to_str<T>(_owner: &T, ptr: *const c_char) -> Option<&str> {
    if ptr.is_null() {
        None
    } else {
        // Convert the char ptr to a &CStr using CStr::from_ptr, then convert the &CStr
        // to a &str using to_str. If conversion fails, return None.
        CStr::from_ptr(ptr).to_str().ok()
    }
}

CStr::from_ptr returns an unbounded lifetime &CStr, which is usually inaccurate. To obtain a more accurately bounded reference, the _owner parameter can be included to attribute its lifetime to the return value's type.
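Here is a small sketch that exercises the three cases char_ptr_to_str has to handle. It repeats the function so the example stands alone, and uses a CString as the owner; the two describe helpers are hypothetical.

```rust
use std::ffi::{CStr, CString};
use std::os::raw::c_char;
use std::ptr;

/// Same idea as above: the result borrows with the lifetime of `_owner`.
unsafe fn char_ptr_to_str<T>(_owner: &T, ptr: *const c_char) -> Option<&str> {
    if ptr.is_null() {
        None
    } else {
        unsafe { CStr::from_ptr(ptr) }.to_str().ok()
    }
}

/// View the owner's bytes as text, if they are valid UTF-8.
fn describe(owner: &CString) -> Option<&str> {
    // Safe to call: the pointer comes from `owner` and lives as long as it.
    unsafe { char_ptr_to_str(owner, owner.as_ptr()) }
}

/// What happens when the C side hands back a null pointer.
fn describe_null(owner: &CString) -> Option<&str> {
    unsafe { char_ptr_to_str(owner, ptr::null()) }
}
```

A well-formed string comes back as Some, while both a null pointer and bytes that are not valid UTF-8 come back as None. The returned &str cannot outlive the CString passed as owner.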

Finally the new and improved main func:

fn main() {
    let path = std::env::args_os().skip(1).next()
        .expect("usage: git-toy PATH");

    let repo = git::Repository::open(&path)
        .expect("opening repository");

    let commit_oid = repo.reference_name_to_id("HEAD")
        .expect("looking up 'HEAD' reference");

    let commit = repo.find_commit(&commit_oid)
        .expect("looking up commit");

    let author = commit.author();
    println!("{} <{}>\n",
             author.name().unwrap_or("(none)"),
             author.email().unwrap_or("none"));

    println!("{}", commit.message().unwrap_or("(none)"));
}

This concludes the tour. We've gone from a raw interface to a slim, safe API, and seen how each piece of functionality was wrapped so that Rust itself enforces correct usage of the library.

Coda

This concludes my tour through the "Programming Rust" book. Working through the book has been a rewarding experience. Rust is likely the most complex language I've ever encountered. However, it was worth it – the abstractions around data ownership make a lot of sense, and are exactly what one needs to create code that is both safe and efficient. The Rust compiler is a great help here too, providing feedback on possible issues or bugs. I have had conversations with people who don't like "to fight the compiler". But I don't mind that much – it's more like the compiler being a sidekick helping you to code.

I feel like this quote from Bjarne Stroustrup applies to Rust as well:

In general, C++ implementations obey the zero-overhead principle: What you don’t use, you don’t pay for. And further: What you do use, you couldn’t hand code any better.

Rust aims to be a modern and safe programming language while also providing control of the raw capabilities of the machine with minimal run-time overhead. Its borrow checker and zero-cost abstractions let safe code cover most of that ground, and when needed, unsafe code and the foreign function interface provide the rest. Rust emphasizes the use of unsafe features for building safe APIs. The standard library provides safe abstractions, implemented with some unsafe code behind the scenes. It's not simple – but Rust is safe, fast, concurrent, and effective. It is suitable for building large, fast, secure, and robust systems that utilize the full power of the hardware.

I'm not a well-versed Rust programmer by far; I have only written a few smaller tools in Rust so far. But I'm looking forward to doing more with the language in the future.