Peterlog

Intro

For this post I'm summarizing chapter 22 of the book, "Unsafe Code".

Unsafe code is for bypassing Rust's safety mechanisms. You're telling the compiler "this'll work out, trust me". Not all safety mechanisms are bypassed, but you can now call unsafe funcs, dereference raw pointers, and call functions in C/C++ (which are of course always unsafe ;-)). Other checks still apply: type safety, lifetime checks and such.

Unsafe from What?

Regular Rust code promises automatic safety. With unsafe code, there (possibly) are other rules that users might need to follow to avoid undefined behaviour – those should be outlined explicitly in documentation. Basically, you're in C-land now and are free to shoot yourself in the foot with any size caliber, including but not limited to outright crashes, SEGV, or insidious exploits.

Unsafe Blocks

An unsafe block is a block of code that may use unsafe features. Like any other block, its value is the value of its final expression, or () if it has none.

fn main() {
    let mut a: usize = 0;
    let ptr = &mut a as *mut usize;
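    unsafe {
        // Deliberately broken: this writes three words past `a`,
        // which is undefined behaviour and will likely crash.
        *ptr.offset(3) = 0x7ffff72f484c;
    }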
}

Within an unsafe block one may:

Call unsafe funcs
Deref raw pointers
Read fields of unions
Access mutable static variables
Access FFI funcs and variables
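As a small sketch of the first two items (the helper names first_byte and ascii_to_str are made up for illustration), the functions below wrap a raw pointer dereference and a call to an unsafe function in unsafe blocks, and use each block's value as an expression:

```rust
/// Returns the first byte of `bytes`, or `None` if the slice is empty.
fn first_byte(bytes: &[u8]) -> Option<u8> {
    if bytes.is_empty() {
        return None;
    }
    // Creating a raw pointer is safe; only dereferencing it is not.
    let ptr: *const u8 = bytes.as_ptr();
    // SAFETY: the slice is non-empty, so `ptr` points at a valid `u8`.
    // The unsafe block evaluates to its final expression.
    let byte = unsafe { *ptr };
    Some(byte)
}

/// Converts ASCII bytes to `&str` without a second UTF-8 validation pass.
fn ascii_to_str(bytes: &[u8]) -> Option<&str> {
    if !bytes.is_ascii() {
        return None;
    }
    // SAFETY: ASCII is a subset of UTF-8, and we just checked for ASCII.
    Some(unsafe { std::str::from_utf8_unchecked(bytes) })
}
```

Both functions stay safe to call because they check their inputs before entering the unsafe block.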

Example: An Efficient ASCII String Type

The book presents an instructive example, an efficient String type that guarantees valid ASCII. The key is that although the type presents a clean and safe API, internally it uses some unsafe code for efficiency.

mod my_ascii {
    /// An ASCII-encoded string.
    /// It derives some standard features automatically
    #[derive(Debug, Eq, PartialEq)]
    pub struct Ascii(
        // This must hold only well-formed ASCII text:
        // bytes from `0` to `0x7f`. -- This is our promise
        Vec<u8>
    );

    impl Ascii {
        /// Create an `Ascii` from the ASCII text in `bytes`.
        /// As a safeguard, we return a
        /// `NotAsciiError` error if `bytes` contains any non-ASCII
        /// characters.
        pub fn from_bytes(bytes: Vec<u8>) -> Result<Ascii, NotAsciiError> {
            if bytes.iter().any(|&byte| !byte.is_ascii()) {
                return Err(NotAsciiError(bytes));
            }
            Ok(Ascii(bytes))
        }
    }

    // When conversion fails, we give back the vector we couldn't convert.
    // This should implement `std::error::Error`; omitted for brevity.
    #[derive(Debug, Eq, PartialEq)]
    pub struct NotAsciiError(pub Vec<u8>);

    // Safe, efficient conversion, implemented using unsafe code.
    impl From<Ascii> for String {
        fn from(ascii: Ascii) -> String {
            // If this module has no bugs, this is safe, because
            // well-formed ASCII text is also well-formed UTF-8.
            // W/o that condition this would not be safe 
            unsafe { String::from_utf8_unchecked(ascii.0) }
        }
    }
    ...
}

Some key points:

The type is public but the data store byte vector is not, i.e. protected from outside access
In the constructor, the given data is checked for ASCII-ness
Because at this point we know we have only valid ASCII data we can safely forgo further checking when converting to String – this is done in an unsafe { ... } block
In conclusion, we have a type with a safe API, which uses unsafe methods internally for efficiency but wraps those in a way that makes them safe

Wrapping an existing data type with some additional rules is a common Rust pattern called a newtype.

Unsafe Functions

Functions can be marked unsafe, e.g.: unsafe fn foo() -> Bar { ... }. This means that the caller has to take extra precautions, i.e. fulfil an extra contract; otherwise we have undefined behaviour. Rust won't compile a call to an unsafe fn unless the call sits inside an unsafe block or another unsafe fn.

For example, here is an unsafe constructor for the Ascii type from above:

// This must be placed inside the `my_ascii` module.
impl Ascii {
    /// Construct an `Ascii` value from `bytes`, without checking
    /// whether `bytes` actually contains well-formed ASCII.
    ///
    /// This constructor is infallible, and returns an `Ascii` directly,
    /// rather than a `Result<Ascii, NotAsciiError>` as the `from_bytes`
    /// constructor does.
    ///
    /// # Safety
    ///
    /// The caller must ensure that `bytes` contains only ASCII
    /// characters: bytes no greater than 0x7f. Otherwise, the effect is
    /// undefined.
    pub unsafe fn from_bytes_unchecked(bytes: Vec<u8>) -> Ascii {
        Ascii(bytes)
    }
}

The docs spell it out clearly: to use this fn you have to ensure somehow that the bytes you feed into the fn are valid ASCII; the fn won't handle this for you.

How can unsafe code break? Example:

// Imagine that this vector is the result of some complicated process
// that we expected to produce ASCII. Something went wrong!
let bytes = vec![0xf7, 0xbf, 0xbf, 0xbf];

let ascii = unsafe {
    // This unsafe function's contract is violated
    // when `bytes` holds non-ASCII bytes.
    Ascii::from_bytes_unchecked(bytes)
};

let bogus: String = ascii.into();

// `bogus` now holds ill-formed UTF-8. Parsing its first character produces
// a `char` that is not a valid Unicode code point. That's undefined
// behavior, so the language doesn't say how this assertion should behave.
assert_eq!(bogus.chars().next().unwrap() as u32, 0x1fffff);

Key things here:

bugs outside an unsafe block can spread into unsafe blocks, e.g. when buggy data violates a contract we thought was fulfilled
when run, the breakage might well occur outside an unsafe block, again by spreading bogus values around

This also means that unsafe blocks should have very clear boundaries, preferably checking data on ingress/egress. And, unsafe parts need to explain their contracts clearly so users (third party or yourself) can uphold them.

Unsafe Block or Unsafe Function?

Fns should be marked unsafe if it's possible to misuse them. Fns should explain in their docs the contracts that need to be fulfilled then. It doesn't matter if the unsafe fn uses unsafe features in its body or not – the potential for misuse and the necessity of a contract is what matters. If the fn itself is safe even though it might use unsafe features, wrap those in an unsafe block, and don't mark the fn itself unsafe.
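A minimal sketch of that distinction, using two made-up helpers (get_or_zero and get_fast are illustrative names, not from the book): both use the same unsafe feature, but only the one that can be misused is marked unsafe.

```rust
/// Safe fn: it checks the index itself, so no caller can misuse it.
fn get_or_zero(values: &[u32], index: usize) -> u32 {
    if index < values.len() {
        // SAFETY: `index < values.len()` was checked just above.
        unsafe { *values.get_unchecked(index) }
    } else {
        0
    }
}

/// Unsafe fn: skipping the check moves the obligation to the caller.
///
/// # Safety
///
/// `index` must be less than `values.len()`.
unsafe fn get_fast(values: &[u32], index: usize) -> u32 {
    // SAFETY: the caller promises that `index` is in bounds.
    unsafe { *values.get_unchecked(index) }
}
```

A caller of get_fast has to write its own unsafe block, which is the visible sign that it has taken over the contract.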

Undefined Behavior

Rust defines a set of rules that it expects programs to follow. With normal, "safe" code it'll enforce them via the compiler, and barring any bugs in the compiler this should result in well-defined, "safe" behaviour. With unsafe code, however, it's the programmer's job to make sure these rules are not violated:

Reading uninitialized mem is forbidden
All primitive values must have valid values, e.g. refs mustn't be null, bools either 0 or 1, etc.
Reference rules as per chapter 5, e.g. refs mustn't live longer than their referent, shared refs are r/o, mutable access is exclusive
No following null, incorrectly aligned or dangling pointers
Boundaries for pointers, see below
No data races, i.e. two threads accessing the same mem without synchronization, and at least one of them writing
No unwinding across an FFI call
Follow the contracts of the std lib

The semantic model for unsafe code is not yet complete so this list might change over time – likely adding new rules.

If those rules are not upheld (in unsafe code), the result is undefined behaviour: bugs can show up e.g. because Rust relies on assumptions when optimizing code that no longer hold.

Unsafe Traits

These are traits with a contract the compiler can't verify, for example the built-in Send and Sync traits. By declaring a trait unsafe we're just saying that there's something we can't check automatically. By implementing an unsafe trait we promise we'll take care of the unsafe aspects. E.g., if we implement Send for a type of our own we have to mark the impl as unsafe, and thereby promise that the type is safe to send across threads.
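As a sketch of how such a trait might look, here is a hypothetical marker trait Zeroable (the trait and the zeroed_vector helper are illustrative assumptions). Implementers promise that an all-zero byte pattern is a valid value; the safe function then relies on that promise:

```rust
/// Hypothetical marker trait: implementers promise that the all-zero
/// byte pattern is a valid value of the type. The compiler can't check this.
pub unsafe trait Zeroable {}

// SAFETY: zero is a valid value for these integer types.
unsafe impl Zeroable for u8 {}
unsafe impl Zeroable for u32 {}

/// Safe to call: the `Zeroable` bound carries the promise we rely on.
pub fn zeroed_vector<T: Zeroable>(len: usize) -> Vec<T> {
    let mut vec = Vec::with_capacity(len);
    // SAFETY: the buffer has room for `len` elements, and `T: Zeroable`
    // guarantees that zeroed bytes form valid values of `T`.
    unsafe {
        std::ptr::write_bytes(vec.as_mut_ptr(), 0, len);
        vec.set_len(len);
    }
    vec
}
```

Implementing Zeroable for a reference type would compile just as well, but would break the promise, since a zeroed reference is null. That is exactly the kind of mistake the unsafe keyword on the impl is meant to flag.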

Raw Pointers

Raw pointers take us firmly into C/C++ land. They allow us to build structures that can't be done with checked pointers, and can also be useful for interfacing with C/C++ programs. Dereferencing raw pointers is confined to unsafe blocks – but creating, assigning, comparing them is fine everywhere.

Raw pointers come in shared/mutable flavors: a *mut T allows modifying its referent, while a *const T only grants read access. Raw pointers to unsized types are fat pointers: a *const [u8] carries a length alongside the address, and a pointer to a trait object such as *mut dyn std::io::Write carries a vtable.
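One way to see the difference between thin and fat raw pointers is to compare their sizes. This is only a sketch (raw_pointer_sizes is a made-up helper), and the exact layout is what current compilers do rather than something to rely on:

```rust
use std::fmt::Debug;
use std::mem::size_of;

/// Sizes in bytes of a thin pointer, a slice pointer,
/// and a trait-object pointer, in that order.
fn raw_pointer_sizes() -> (usize, usize, usize) {
    (
        size_of::<*const u8>(),        // address only
        size_of::<*const [u8]>(),      // address + length
        size_of::<*const dyn Debug>(), // address + vtable pointer
    )
}
```

On typical 64-bit targets this returns 8 bytes for the thin pointer and 16 bytes for each of the two fat pointers.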

Raw pointer example:

let mut x = 10;
// create a mutable pointer
let ptr_x = &mut x as *mut i32;

let y = Box::new(20);
// y already is a pointer, we're converting to raw
let ptr_y = &*y as *const i32;

unsafe {
    // deref raw pointers --> unsafe
    *ptr_x += *ptr_y;
}
assert_eq!(x, 30);

Note the conversion with the as operator. Sometimes as needs to be chained for more complex conversions. Many types also offer as_ptr and as_mut_ptr methods that return a raw pointer to their contents, and owning pointer types such as Box, Rc and Arc have into_raw and from_raw functions for converting to and from raw pointers.
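A short sketch of the into_raw / from_raw round trip for Box (the function name increment_via_raw is made up for illustration):

```rust
/// Hands a boxed value over to a raw pointer and takes it back.
fn increment_via_raw(value: i32) -> i32 {
    let boxed = Box::new(value);
    // `into_raw` gives up ownership: nothing frees the allocation
    // until we rebuild the Box.
    let raw: *mut i32 = Box::into_raw(boxed);
    unsafe {
        // SAFETY: `raw` came from `Box::into_raw` just above, so it is
        // valid, aligned, and not aliased.
        *raw += 1;
        // Rebuild the Box exactly once so the allocation is freed again.
        let boxed = Box::from_raw(raw);
        *boxed
    }
}
```

Forgetting the from_raw call would leak the allocation; calling it twice on the same pointer would free it twice.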

Example of a null pointer:

fn option_to_raw<T>(opt: Option<&T>) -> *const T {
    match opt {
        None => std::ptr::null(),
        Some(r) => r as *const T
    }
}

assert!(!option_to_raw(Some(&("pea", "pod"))).is_null());
assert_eq!(option_to_raw::<i32>(None), std::ptr::null());

With raw pointers there is no automagic Deref; dereferencing has to be done manually. Rust's raw pointers don't do pointer arithmetic via +; there are specialized methods for that: offset, wrapping_offset and others.

Raw pointers are neither Send nor Sync. This is not because raw pointers are inherently unsafe to share between threads (dereferencing them needs an unsafe block wherever they end up), but more of a cautious default.
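If a type holding a raw pointer really should cross threads, the promise has to be made explicitly. Below is a sketch with a hypothetical wrapper SendPtr and helper bump_on_other_thread (both names are assumptions for illustration); it uses std::thread::scope so the spawned thread is joined before the borrow ends:

```rust
use std::thread;

/// Hypothetical wrapper around a raw pointer.
struct SendPtr(*mut i32);

// SAFETY: we only ever build a `SendPtr` from an exclusive borrow and
// hand it to a single thread, so no other thread touches the referent.
unsafe impl Send for SendPtr {}

impl SendPtr {
    /// Going through a method makes the closure below capture the whole
    /// wrapper rather than just its (non-`Send`) raw pointer field.
    fn get(&self) -> *mut i32 {
        self.0
    }
}

/// Increments `value` on another thread.
fn bump_on_other_thread(value: &mut i32) {
    let ptr = SendPtr(value as *mut i32);
    thread::scope(|s| {
        s.spawn(move || {
            // SAFETY: `value` is exclusively borrowed for the whole scope,
            // and the scope joins this thread before returning.
            unsafe { *ptr.get() += 1 };
        });
    });
}
```

Without the unsafe impl, the call to spawn is rejected at compile time because the closure is not Send.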

Dereferencing Raw Pointers Safely

Some guidelines for dereferencing raw pointers safely:

no null ptr, no dangling ptr, no uninitialized memory
need to have proper alignment for their referent type
borrowing needs to observe the borrowing rules, i.e. refs may not outlive their referent, shared access must be r/o, mutable access is exclusive
referents must be well-formed values
offset and wrapping_offset methods must only be used within the originally alloc'ed block
take care to store only well-formed values

Example: RefWithFlag

The book goes on to present an example of storing a bit along with a ref. This is possible for many types which must be aligned to even addresses – and whose address LSB therefore always is zero, freeing it up for storing a flag – for instance to mark it for garbage collection.

Code:

mod ref_with_flag {
    use std::marker::PhantomData;
    use std::mem::align_of;

    /// A `&T` and a `bool`, wrapped up in a single word.
    /// The type `T` must require at least two-byte alignment.
    ///
    /// If you're the kind of programmer who's never met a pointer whose
    /// 2⁰-bit you didn't want to steal, well, now you can do it safely!
    /// ("But it's not nearly as exciting this way...")
    pub struct RefWithFlag<'a, T> {
        ptr_and_bit: usize,
        behaves_like: PhantomData<&'a T> // occupies no space
    }

    impl<'a, T: 'a> RefWithFlag<'a, T> {
        pub fn new(ptr: &'a T, flag: bool) -> RefWithFlag<T> {
            assert!(align_of::<T>() % 2 == 0);
            RefWithFlag {
                ptr_and_bit: ptr as *const T as usize | flag as usize,
                behaves_like: PhantomData
            }
        }

        // Retrieve the ref -- mask out the LSB
        pub fn get_ref(&self) -> &'a T {
            unsafe {
                let ptr = (self.ptr_and_bit & !1) as *const T;
                // deref ptr, then borrow
                &*ptr
            }
        }

        // Retrieve the flag -- return true if LSB is 1
        pub fn get_flag(&self) -> bool {
            self.ptr_and_bit & 1 != 0
        }
    }
}

This type holds a &T and a bool in a single word, where T must be some type which requires even alignment (so, e.g. u8 won't work). The PhantomData (zero-sized) is necessary to tell Rust about the lifetime of the borrowed value.

Nullable Pointers

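Null ptrs in Rust are the same as in C: a zero address. Ptrs can be checked for null with the is_null method, or converted with the as_ref method, which returns an Option (None for a null pointer). Similarly as_mut for mut pointers. Note that as_ref and as_mut are themselves unsafe fns, since a non-null pointer may still be dangling.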
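A sketch of the as_ref approach, with a made-up helper read_or_default. The helper is itself marked unsafe because a null check alone cannot prove that a non-null pointer is valid:

```rust
/// Reads the `i32` behind `ptr`, or returns `default` for a null pointer.
///
/// # Safety
///
/// `ptr` must either be null or point to a valid, initialized `i32`
/// that stays alive for the duration of the call.
unsafe fn read_or_default(ptr: *const i32, default: i32) -> i32 {
    // SAFETY: the caller guarantees that a non-null `ptr` is valid.
    match unsafe { ptr.as_ref() } {
        Some(value) => *value, // non-null: we got a `&i32`
        None => default,       // null pointer
    }
}
```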

Type Sizes and Alignments

Sized values use a constant number of bytes in mem and need some alignment. Get the size with std::mem::size_of::<T>(), and the alignment with std::mem::align_of::<T>(). Alignments are always a power of two, and a type's size is always rounded up to a multiple of its alignment.

For unsized types use std::mem::size_of_val and std::mem::align_of_val.
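A small sketch of these functions in use; size_and_align and dynamic_size are made-up wrapper names. The concrete numbers for a given type depend on the target platform, so the comments only state what is expected on common targets:

```rust
use std::mem::{align_of, size_of, size_of_val};

/// Size and alignment of a sized type, in bytes.
fn size_and_align<T>() -> (usize, usize) {
    (size_of::<T>(), align_of::<T>())
}

/// Size in bytes of a value whose type may be unsized, e.g. `str` or `[T]`.
fn dynamic_size<T: ?Sized>(value: &T) -> usize {
    size_of_val(value)
}
```

For example, size_and_align::<(f32, u8)>() is expected to give (8, 4) on common targets: five bytes of data, rounded up to a multiple of the four-byte alignment. dynamic_size("diminutive") gives 10, the length of the string in bytes.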

Pointer Arithmetic

Arrays, slices and vectors are contiguous blocks of memory, so pointer arithmetic works as you'd expect: the i'th element of an array starts i * size_of::<T>() bytes after the start of the array. With something like the below:

struct Iter<'a, T> {
    ptr: *const T,
    end: *const T,
    ...
}

Then iteration should stop when ptr == end.

It's ok if a ptr points to the first byte after an array; this can be useful for bounds checking. The value behind that ptr must not be accessed of course. Use the p.offset(n) method to get a pointer to the n'th element after p. Note though that using offset to produce a pointer beyond that first byte past the end, or before the start of the array, is undefined behaviour even if the pointer is never dereferenced. Use wrapping_offset instead if you need that.
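The ptr/end scheme can be sketched as a plain loop over a slice. The function name sum_raw is made up, and it uses the add method (offset with an unsigned count) rather than offset itself:

```rust
/// Sums a slice by walking it with a pair of raw pointers.
fn sum_raw(values: &[i32]) -> i32 {
    let mut ptr: *const i32 = values.as_ptr();
    // SAFETY: one past the last element is the furthest `add` may go.
    // We compare against `end` but never dereference it.
    let end: *const i32 = unsafe { ptr.add(values.len()) };
    let mut total = 0;
    while ptr != end {
        // SAFETY: `ptr` lies strictly before `end`, so it points at an
        // element of `values`, and stepping by one stays within bounds
        // or lands exactly on `end`.
        unsafe {
            total += *ptr;
            ptr = ptr.add(1);
        }
    }
    total
}
```

For an empty slice ptr and end are equal from the start, so the loop body never runs.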

Moving into and out of Memory

When moving values around, the former owner is said to be uninitialized. In reality the former owner probably (at least for a time) still holds the same bits, e.g. the pointer, capacity and length of a String; it's just that Rust prevents us from using it, for safety reasons. That is, the former owner is not treated as live.

In unsafe code we can manage our own memory, but then we have to track ourselves which values are considered live and which are not.

Library functions that help with this:

std::ptr::read(src): Read a value, giving the caller ownership. It's the responsibility of your program to ensure src is treated as uninitialized
std::ptr::write(dest, value): Write value to dest. Dest must be uninitialized. The referent now owns the value
std::ptr::copy(src, dst, count): Moves an array, reading/writing over count values.
ptr.copy_to(dst, count): Same as above, moving from ptr to dst
std::ptr::copy_nonoverlapping(src, dst, count): Same as above but optimized for non-overlapping regions (caller has to ensure that this contract holds)
ptr.copy_to_nonoverlapping(dst, count): Same as above
read_unaligned, write_unaligned: Reading and writing, but the pointers don't have to be aligned as would normally be the case. Could be slower
read_volatile, write_volatile: Same as volatile reads and writes in C, i.e. for mem-mapped devices and similar
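To illustrate the bookkeeping, here is a sketch of a swap built from read, copy_nonoverlapping and write. The name swap_raw is made up, and std::mem::swap already does this job; the point is only to show which location counts as live at each step:

```rust
use std::ptr;

/// Swaps two values by moving them through raw pointers.
fn swap_raw<T>(a: &mut T, b: &mut T) {
    let pa: *mut T = a;
    let pb: *mut T = b;
    // SAFETY: both pointers come from distinct exclusive references, so
    // they are valid, aligned, and do not overlap. Nothing in this block
    // can panic, so no location is left half-moved.
    unsafe {
        // Move the value out of `*pa`; `*pa` now counts as uninitialized.
        let tmp = ptr::read(pa);
        // Move `*pb` into `*pa`; now `*pb` counts as uninitialized.
        ptr::copy_nonoverlapping(pb, pa, 1);
        // Move the saved value into `*pb`; both locations are live again.
        ptr::write(pb, tmp);
    }
}
```

This works for non-Copy types such as String, because each value is owned by exactly one location at any time and nothing is dropped along the way.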

Example: GapBuffer

The book goes on to present a container type called a gap buffer: an array that has its spare capacity (the "gap") not at the end of the user data but somewhere in the middle (or anywhere really). The motivation is to speed up consecutive insertion operations (think text editor) by placing the gap where the insertion point is.

In the example the container type is sped up by having unsafe code manage the buffer object.

Panic Safety in Unsafe Code

Rust's panic feature itself is not unsafe. However, in unsafe code we might temporarily relax our promises wrt. code safety, and if we panic before we get a chance to get back to a consistent state we're left with an inconsistency. I.e. take special care in unsafe code with calls that might panic.

Reinterpreting Memory with Unions

Unions in Rust are similar to C unions – i.e. data types that can be interpreted in more than one way.

For example, the below is a data type that can be used either as a float or an int:

union FloatOrInt {
    f: f32,
    i: i32,
}

// we create an int value
let mut one = FloatOrInt { i: 1 };

// accessing the union is only possible in
// unsafe code
assert_eq!(unsafe { one.i }, 0x00_00_00_01);

// lets mutate it to a float!
one.f = 1.0;
assert_eq!(unsafe { one.i }, 0x3F_80_00_00);

In the above example both fields are 32 bits wide, but it's entirely possible to have unions with fields of different sizes, in which case the size of the union is determined by its largest field.

Assigning to unions is always possible, but accessing unions must be done in unsafe code – there is no tag like with enums that tells Rust which type of value currently is held. There is no guarantee that a specific value can be interpreted a certain way – it's up to the user to ensure the bit patterns are valid for the intended use.

There is an attribute #[repr(C)] that ensures values in a union are laid out a certain way – i.e. that all fields start at offset 0.

Unions don't know how their fields should be dropped, therefore all the fields must be Copy. The std::mem::ManuallyDrop wrapper type offers a workaround for non-Copy fields though.
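A sketch of the ManuallyDrop workaround, using a hypothetical union IntOrString and two made-up helper functions. Since the union never drops its fields, the code has to take the String back out itself, or it would leak:

```rust
use std::mem::ManuallyDrop;

/// Hypothetical union with one `Copy` field and one non-`Copy` field.
union IntOrString {
    i: u64,
    s: ManuallyDrop<String>,
}

/// Stores a `String` in the union and moves it back out.
fn through_union(text: &str) -> String {
    let mut u = IntOrString { s: ManuallyDrop::new(text.to_string()) };
    // SAFETY: `s` is the field we just initialized, and we take the
    // value out exactly once, so it is neither leaked nor dropped twice.
    unsafe { ManuallyDrop::take(&mut u.s) }
}

/// Stores an integer in the union and reads it back.
fn int_through_union(n: u64) -> u64 {
    let u = IntOrString { i: n };
    // SAFETY: `i` is the field we just initialized.
    unsafe { u.i }
}
```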

Matching Unions

This works like matching on a struct, except that each pattern has to specify exactly one field (and of course the match still needs unsafe)


union SmallOrLarge {
    s: bool,
    l: u64
}

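// a union value to match on; `s` is the field that was initialized
let u = SmallOrLarge { s: true };

unsafe {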
    match u {
        SmallOrLarge { s: true } => { println!("boolean true"); }
        SmallOrLarge { l: 2 } => { println!("integer 2"); }
        _ => { println!("something else"); }
    }
}

Borrowing Unions

Union fields are not distinct values: they share the same memory. Borrowing one field of a union therefore borrows the union as a whole, and the usual borrowing rules then apply across all fields.

Coda

This was a tricky chapter. I feel like you can get a lot of value from unsafe code: for performance, for interacting with external code, for custom memory layouts, etc. But you need to be aware of Rust internals and expectations, as the compiler won't help you much; unsafe code will need an extra dose of testing and review.

sabaini gmbh
