Intro
For this post I'm summarizing chapter 22 of the book, "Unsafe Code".
Unsafe code is for bypassing Rust's safety mechanisms. You're telling the compiler "this'll work out, trust me". Not all safety mechanisms are bypassed – but you can now call unsafe funcs, dereference raw pointers, and call functions written in C/C++ (which are of course always unsafe ;-)). Other checks still apply: type safety, lifetime checks and such.
Unsafe from What?
Regular Rust code promises automatic safety. With unsafe code, there (possibly) are other rules that users might need to follow to avoid undefined behaviour – those should be outlined explicitly in documentation. Basically, you're in C-land now and are free to shoot yourself in the foot with any size caliber, including but not limited to outright crashes, SEGV, or insidious exploits.
Unsafe Blocks
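An unsafe block is a block of code that may use unsafe features. The value of the block is the value of its final expression, or () if it has none. The example below compiles, but it deliberately writes outside the variable it points to, which is undefined behaviour:
fn main() {
    let mut a: usize = 0;
    let ptr = &mut a as *mut usize;
    unsafe {
        // Out of bounds: `ptr` only covers the single `usize` in `a`.
        *ptr.offset(3) = 0x7ffff72f484c;
    }
}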
Within an unsafe block one may:
Call unsafe funcs
Deref raw pointers
Access fields of unions
Access mutable static variables
Access FFI funcs and variables
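To make the first two items concrete, here is a minimal sketch of my own (not from the book). The names read_i32 and unsafe_features_demo are made up for illustration.

```rust
/// Hypothetical helper, not from the book.
///
/// # Safety
/// `p` must be non-null, properly aligned, and point to a live `i32`.
unsafe fn read_i32(p: *const i32) -> i32 {
    // Dereferencing a raw pointer is itself an unsafe operation.
    unsafe { *p }
}

/// Returns the same value twice: once read via the unsafe function,
/// once via a direct dereference.
pub fn unsafe_features_demo() -> (i32, i32) {
    let x = 41;
    // Creating a raw pointer is allowed in safe code.
    let p = &x as *const i32;
    // Calling an unsafe function requires an unsafe block.
    let via_fn = unsafe { read_i32(p) };
    // So does dereferencing the raw pointer directly.
    let direct = unsafe { *p };
    (via_fn, direct)
}
```

Both reads are fine here because p is derived from a live local variable; the unsafe blocks only mark the spots where we vouch for that ourselves.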
Example: An Efficient ASCII String Type
The book presents an instructive example, an efficient String type that guarantees valid ASCII. The key is that although the type presents a clean and safe API, internally it uses some unsafe code for efficiency.
mod my_ascii {
    /// An ASCII-encoded string.
    /// It derives some standard features automatically
    #[derive(Debug, Eq, PartialEq)]
    pub struct Ascii(
        // This must hold only well-formed ASCII text:
        // bytes from `0` to `0x7f`. -- This is our promise
        Vec<u8>
    );

    impl Ascii {
        /// Create an `Ascii` from the ASCII text in `bytes`.
        /// As a safeguard, we return a
        /// `NotAsciiError` error if `bytes` contains any non-ASCII
        /// characters.
        pub fn from_bytes(bytes: Vec<u8>) -> Result<Ascii, NotAsciiError> {
            if bytes.iter().any(|&byte| !byte.is_ascii()) {
                return Err(NotAsciiError(bytes));
            }
            Ok(Ascii(bytes))
        }
    }

    // When conversion fails, we give back the vector we couldn't convert.
    // This should implement `std::error::Error`; omitted for brevity.
    #[derive(Debug, Eq, PartialEq)]
    pub struct NotAsciiError(pub Vec<u8>);

    // Safe, efficient conversion, implemented using unsafe code.
    impl From<Ascii> for String {
        fn from(ascii: Ascii) -> String {
            // If this module has no bugs, this is safe, because
            // well-formed ASCII text is also well-formed UTF-8.
            // W/o that condition this would not be safe
            unsafe { String::from_utf8_unchecked(ascii.0) }
        }
    }
    ...
}
Some key points:
The type is public but the data store byte vector is not, i.e. protected from outside access
In the constructor, the given data is checked for ASCII-ness
Because at this point we know we have only valid ASCII data we can safely forgo further checking when converting to String – this is done in an unsafe { ... } block
In conclusion, we have a type with a safe API, which uses unsafe methods internally for efficiency but wraps those in a way that makes them safe
Wrapping an existing data type with some additional rules – a common Rust pattern – is called a newtype
Unsafe Functions
Functions can be marked unsafe, e.g.: unsafe fn foo() -> Bar { ... }. This means that the caller has to take extra precautions, i.e. fulfil an extra contract; otherwise we have undefined behaviour. Rust won't compile a call to an unsafe fn outside of an unsafe block.
For example, String::from_utf8_unchecked, which we used above, is unsafe. We can give our own type a similar unchecked constructor:
// This must be placed inside the `my_ascii` module.
impl Ascii {
    /// Construct an `Ascii` value from `bytes`, without checking
    /// whether `bytes` actually contains well-formed ASCII.
    ///
    /// This constructor is infallible, and returns an `Ascii` directly,
    /// rather than a `Result<Ascii, NotAsciiError>` as the `from_bytes`
    /// constructor does.
    ///
    /// # Safety
    ///
    /// The caller must ensure that `bytes` contains only ASCII
    /// characters: bytes no greater than 0x7f. Otherwise, the effect is
    /// undefined.
    pub unsafe fn from_bytes_unchecked(bytes: Vec<u8>) -> Ascii {
        Ascii(bytes)
    }
}
The docs spell it out clearly: to use this fn you have to ensure somehow that the bytes you feed into the fn are valid ASCII; the fn won't handle this for you.
How can unsafe code break? Example:
// Imagine that this vector is the result of some complicated process
// that we expected to produce ASCII. Something went wrong!
let bytes = vec![0xf7, 0xbf, 0xbf, 0xbf];
let ascii = unsafe {
    // This unsafe function's contract is violated
    // when `bytes` holds non-ASCII bytes.
    Ascii::from_bytes_unchecked(bytes)
};
let bogus: String = ascii.into();
// `bogus` now holds ill-formed UTF-8. Parsing its first character produces
// a `char` that is not a valid Unicode code point. That's undefined
// behavior, so the language doesn't say how this assertion should behave.
assert_eq!(bogus.chars().next().unwrap() as u32, 0x1fffff);
Key things here:
bugs outside an unsafe block can spread into unsafe blocks, i.e. buggy data can violate contracts we thought were fulfilled
when run, the breakage might well occur outside an unsafe block, again by spreading bogus values around
This also means that unsafe blocks should have very clear boundaries, preferably checking data on ingress/egress. And unsafe parts need to explain their contracts clearly so users (third parties or yourself) know what they have to uphold.
Unsafe Block or Unsafe Function?
Fns should be marked unsafe if it's possible to misuse them. Fns should explain in their docs the contracts that need to be fulfilled then. It doesn't matter if the unsafe fn uses unsafe features in its body or not – the potential for misuse and the necessity of a contract is what matters. If the fn itself is safe even though it might use unsafe features, wrap those in an unsafe block, and don't mark the fn itself unsafe.
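A small sketch of my own to illustrate the distinction (the function names are hypothetical, not from the book): the first fn can be misused, so it is marked unsafe and documents its contract; the second one uses an unsafe feature internally but can't be misused, so it stays safe.

```rust
/// Copies out the element at `index` without a bounds check.
///
/// # Safety
/// `index` must be less than `values.len()`.
pub unsafe fn get_unchecked_copy(values: &[i32], index: usize) -> i32 {
    // `get_unchecked` is an unsafe method of slices in the standard library.
    unsafe { *values.get_unchecked(index) }
}

/// Safe wrapper: no caller can trigger undefined behaviour through this
/// function, so it is not marked unsafe even though its body uses unsafe code.
pub fn first_or_zero(values: &[i32]) -> i32 {
    if values.is_empty() {
        0
    } else {
        // Sound: we just checked that index 0 is in bounds.
        unsafe { get_unchecked_copy(values, 0) }
    }
}
```

The check in first_or_zero is what discharges the contract of the unsafe fn, which is why the wrapper itself needs no "Safety" section.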
Undefined Behavior
Rust defines a set of rules that it expects programs to follow. With normal, "safe" code it'll enforce them via the compiler, and barring any bugs in the compiler this should result in deterministic, "safe" behaviour. With unsafe code however it's the programmer's job to make sure these rules are not violated:
Reading uninitialized mem is forbidden
All primitive values must have valid values, e.g. refs mustn't be null, bools either 0 or 1, etc.
Reference rules as per chapter 5, e.g. refs mustn't live longer than their referent, shared refs are r/o, mutable access is exclusive
No following null, incorrectly aligned or dangling pointers
Boundaries for pointers, see below
No data races, i.e. two threads accessing the same mem without synchronization, and at least one of them writing
No unwinding across a FFI call
Follow the contracts of the std lib
The semantic model for unsafe code is not yet complete so this list might change over time – likely adding new rules.
If those rules are not upheld (in unsafe code), bugs can result, e.g. because Rust makes assumptions when optimizing code that no longer hold.
Unsafe Traits
These are traits with a contract the compiler can't check or enforce – for example the built-in Send and Sync traits. By declaring a trait unsafe we're saying that implementers must uphold something that can't be checked automatically. By implementing an unsafe trait we promise we'll take care of those aspects. E.g., if we implement Send for a type of our own we have to mark the impl as unsafe, and thereby promise that the type is safe to send across threads.
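Here's a sketch of what declaring and implementing an unsafe trait could look like. The trait Zeroable and the function zeroed_vec are hypothetical names I picked for illustration; the contract is that the all-zero byte pattern must be a valid value of the implementing type.

```rust
/// Hypothetical unsafe trait.
///
/// # Safety
/// Implementers promise that a value consisting only of zero bytes
/// is a valid value of the type.
pub unsafe trait Zeroable {}

// `unsafe impl` is where we take responsibility for the contract.
unsafe impl Zeroable for u8 {}
unsafe impl Zeroable for u32 {}

/// Builds a vector of `len` zeroed elements. This is a safe function:
/// it relies on the `Zeroable` contract rather than on its callers.
pub fn zeroed_vec<T: Zeroable>(len: usize) -> Vec<T> {
    let mut v: Vec<T> = Vec::with_capacity(len);
    unsafe {
        // Fill the first `len` slots of the spare capacity with zero bytes...
        std::ptr::write_bytes(v.as_mut_ptr(), 0, len);
        // ...and only then declare them initialized.
        v.set_len(len);
    }
    v
}
```

Implementing Zeroable for a type like &T would be a bug (a null reference is not a valid value), and nothing in the compiler would catch it. That is exactly why the impl has to be marked unsafe.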
Raw Pointers
Raw pointers take us firmly into C/C++ land. They allow us to build structures that can't be done with checked pointers, and can also be useful for interfacing with C/C++ programs. Dereferencing raw pointers is confined to unsafe blocks – but creating, assigning, comparing them is fine everywhere.
Raw pointers come in shared/mutable flavors – a *mut T allows modifying its referent, while a *const T only grants read access. Raw pointers to unsized types (e.g. slices or trait objects) are fat pointers, which also carry a length or a vtable pointer.
Raw pointer example:
let mut x = 10;
// create a mutable pointer
let ptr_x = &mut x as *mut i32;
let y = Box::new(20);
// y already is a pointer, we're converting to raw
let ptr_y = &*y as *const i32;
unsafe {
    // deref raw pointers --> unsafe
    *ptr_x += *ptr_y;
}
assert_eq!(x, 30);
Note the conversion with the as operator. Sometimes the as operator needs to be chained for more complex conversions. Many types also offer as_ptr and as_mut_ptr methods that return a raw pointer to their contents, and owning pointer types have into_raw and from_raw methods for converting to and from raw pointers.
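A minimal sketch of these conversion methods, using Box::into_raw / Box::from_raw and the slice method as_ptr from the standard library. The function names are my own.

```rust
/// Turns a Box into a raw pointer, modifies the value through it,
/// and turns it back into a Box so the allocation is freed normally.
pub fn box_round_trip() -> i32 {
    let b = Box::new(20);
    // Safe: gives up ownership and hands us a raw pointer.
    let raw: *mut i32 = Box::into_raw(b);
    unsafe {
        *raw += 1;
        // Unsafe: we promise `raw` came from `Box::into_raw`
        // and has not been freed or converted back already.
        let b = Box::from_raw(raw);
        *b
    }
}

/// Reads the first element of a slice through the pointer from `as_ptr`.
pub fn first_via_as_ptr(values: &[i32]) -> Option<i32> {
    if values.is_empty() {
        None
    } else {
        // Sound: a non-empty slice has a valid first element.
        Some(unsafe { *values.as_ptr() })
    }
}
```

Forgetting the from_raw call would not be undefined behaviour, but it would leak the allocation.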
Example of a null pointer:
fn option_to_raw<T>(opt: Option<&T>) -> *const T {
    match opt {
        None => std::ptr::null(),
        Some(r) => r as *const T
    }
}
assert!(!option_to_raw(Some(&("pea", "pod"))).is_null());
assert_eq!(option_to_raw::<i32>(None), std::ptr::null());
With raw pointers there is no automagic Deref: the . operator won't dereference them for you, so you have to write (*ptr).field. Rust's raw pointers don't do pointer arithmetic via + either; there are specialized methods for this: offset, wrapping_offset and others.
Raw pointers are neither Send nor Sync. This is not because raw pointers are inherently unsafe to send or share between threads (you still need an unsafe block to dereference them) but more of a cautious default.
Dereferencing Raw Pointers Safely
Some guidelines for dereferencing raw pointers safely:
no null ptr, no dangling ptr, no uninitialized memory
pointers need to have proper alignment for their referent type
borrowing needs to observe the borrowing rules, i.e. refs may not outlive their referent, shared access must be r/o, mutable access is exclusive
referents must be well-formed values
pointers computed with the offset and wrapping_offset methods must only be used to access memory within the originally allocated block
when writing through a raw pointer, take care to store only well-formed values
Example: RefWithFlag
The book goes on to present an example of storing a bit along with a ref. This is possible for many types which must be aligned to even addresses – and whose address LSB therefore always is zero, freeing it up for storing a flag – for instance to mark it for garbage collection.
Code:
mod ref_with_flag {
    use std::marker::PhantomData;
    use std::mem::align_of;

    /// A `&T` and a `bool`, wrapped up in a single word.
    /// The type `T` must require at least two-byte alignment.
    ///
    /// If you're the kind of programmer who's never met a pointer whose
    /// 2⁰-bit you didn't want to steal, well, now you can do it safely!
    /// ("But it's not nearly as exciting this way...")
    pub struct RefWithFlag<'a, T> {
        ptr_and_bit: usize,
        behaves_like: PhantomData<&'a T> // occupies no space
    }

    impl<'a, T: 'a> RefWithFlag<'a, T> {
        pub fn new(ptr: &'a T, flag: bool) -> RefWithFlag<T> {
            assert!(align_of::<T>() % 2 == 0);
            RefWithFlag {
                ptr_and_bit: ptr as *const T as usize | flag as usize,
                behaves_like: PhantomData
            }
        }

        // Retrieve the ref -- mask out the LSB
        pub fn get_ref(&self) -> &'a T {
            unsafe {
                let ptr = (self.ptr_and_bit & !1) as *const T;
                // deref ptr, then borrow
                &*ptr
            }
        }

        // Retrieve the flag -- return true if LSB is 1
        pub fn get_flag(&self) -> bool {
            self.ptr_and_bit & 1 != 0
        }
    }
}
This type holds a &T and a bool in a single machine word, where T must be some type which requires even alignment (so, e.g. u8 won't work). The zero-sized PhantomData field is necessary to tie the lifetime 'a to the type, so that Rust treats it like a &'a T.
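The bit trick itself involves no unsafe code at all. As a sketch of just the arithmetic, on plain usize values and with made-up helper names pack and unpack:

```rust
/// Stores `flag` in the least significant bit of an even address.
/// (Hypothetical helper; `addr` stands in for a pointer cast to usize.)
pub fn pack(addr: usize, flag: bool) -> usize {
    // Only even addresses have a free LSB.
    assert!(addr % 2 == 0);
    addr | flag as usize
}

/// Splits a packed word back into the address and the flag.
pub fn unpack(word: usize) -> (usize, bool) {
    // `!1` has every bit set except the LSB, so this masks the flag out.
    (word & !1, word & 1 != 0)
}
```

The unsafe part of RefWithFlag is only the step of turning the unpacked address back into a reference.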
Nullable Pointers
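A null raw pointer in Rust is a zero address, the same as in C. Ptrs can be checked for null with the is_null method, or with the as_ref method, which returns an Option (None for a null pointer). Similarly, there is as_mut for mut pointers. Both as_ref and as_mut are unsafe, since a non-null pointer must still point to a valid value.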
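A short sketch of as_ref in use. The function value_or_default is a hypothetical name of mine; it is marked unsafe because as_ref only handles the null case for us, not dangling or misaligned pointers.

```rust
/// Reads through `p`, falling back to 0 for a null pointer.
///
/// # Safety
/// If `p` is non-null it must be aligned and point to a live `i32`.
pub unsafe fn value_or_default(p: *const i32) -> i32 {
    // `as_ref` turns null into `None` and anything else into `Some(&i32)`.
    match unsafe { p.as_ref() } {
        Some(r) => *r,
        None => 0,
    }
}
```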
Type Sizes and Alignments
Sized values use a constant number of bytes in mem and need some alignment. Get the size with std::mem::size_of::<T>(), and the alignment with std::mem::align_of::<T>(). Alignments are always a power of two, and a type's size is always rounded up to a multiple of its alignment.
For unsized types use std::mem::size_of_val and std::mem::align_of_val.
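A quick sketch for poking at these functions. The wrappers layout_of and bytes_in are hypothetical names of mine; the std::mem functions are the real ones.

```rust
use std::mem::{align_of, size_of, size_of_val};

/// Returns (size, alignment) of a sized type.
pub fn layout_of<T>() -> (usize, usize) {
    (size_of::<T>(), align_of::<T>())
}

/// Returns the number of bytes occupied by a slice, an unsized value.
pub fn bytes_in<T>(values: &[T]) -> usize {
    size_of_val(values)
}
```

For example, layout_of::<u8>() is (1, 1), and a slice of five i32 values takes 20 bytes. Exact alignments of larger types can differ between platforms, so I wouldn't hard-code them.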
Pointer Arithmetic
Arrays, slices and vectors are contiguous blocks of memory, so pointer arithmetic works as you'd expect – the i'th element of an array starts at byte offset size_of::<T>() * i from the start of the array. With something like the below:
struct Iter<'a, T> {
    ptr: *const T,
    end: *const T,
    ...
}
Then iteration should stop if ptr == end.
It's ok if a ptr points to the first byte after the end of an array – this can be useful for bounds checking. That ptr must not be dereferenced of course. Use the p.offset(n) method to get a pointer to the n'th element after p; note though that using offset to produce a pointer further past the end of the array (or before its start) is undefined behaviour, even if it's never dereferenced. Use wrapping_offset instead in that case.
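The ptr/end scheme can be sketched as a plain loop. This is my own minimal example, with the hypothetical name sum_with_raw_pointers; it uses the add method, which is offset with a usize count.

```rust
/// Sums a slice by walking a raw pointer from the first element
/// up to (but not including) the one-past-the-end position.
pub fn sum_with_raw_pointers(values: &[i32]) -> i32 {
    let mut ptr = values.as_ptr();
    // One past the end: fine to compute and compare, never dereferenced.
    let end = unsafe { ptr.add(values.len()) };
    let mut total = 0;
    while ptr != end {
        unsafe {
            // `ptr` is strictly before `end`, so it points at a live element.
            total += *ptr;
            // Advancing by one stays within the slice or lands exactly on `end`.
            ptr = ptr.offset(1);
        }
    }
    total
}
```

Comparing raw pointers needs no unsafe block; only computing the offsets and dereferencing do.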
Moving into and out of Memory
When a value is moved, its former owner is said to be uninitialized. In reality the former owner probably (at least for a time) still holds the same bits, e.g. the same pointer, capacity and length – it's just that Rust prevents us from using it for safety reasons. That is, the former owner is not treated as live.
In unsafe code we can manage our own memory, but then we have to track ourselves which values are considered live and which are not.
Library functions that help with this:
std::ptr::read(src)
Read a value, giving the caller ownership. It's the responsibility of your program to ensure src is treated as uninitialized
std::ptr::write(dest, value)
Write value to dest. Dest must be uninitialized. The referent now owns the value
std::ptr::copy(src, dst, count)
Moves an array of count values from src to dst; the two regions may overlap.
ptr.copy_to(dst, count)
Same as above, moving from ptr to dst
std::ptr::copy_nonoverlapping(src, dst, count)
Same as above but optimized for non-overlapping regions (caller has to ensure that this contract holds)
ptr.copy_to_nonoverlapping(dst, count)
Same as above
read_unaligned, write_unaligned
Reading and writing, but the pointers don't have to be aligned as would normally be the case. Could be slower
read_volatile, write_volatile
Same as volatile reads and writes in C, i.e. for mem-mapped devices and similar
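As a sketch of how read, write and copy_nonoverlapping fit together, here is a hand-rolled swap (the std library already has std::mem::swap; the name swap_via_raw is mine). The comments track which location is live at each step.

```rust
use std::ptr;

/// Swaps two values by moving them through raw pointers.
pub fn swap_via_raw<T>(a: &mut T, b: &mut T) {
    let pa: *mut T = a;
    let pb: *mut T = b;
    unsafe {
        // Move the value out of `a`; `a` must now be treated as uninitialized.
        let tmp = ptr::read(pa);
        // Move `b`'s value into `a`; now `b` is the uninitialized one.
        // Two distinct `&mut` references never overlap, as required here.
        ptr::copy_nonoverlapping(pb, pa, 1);
        // Move the saved value into `b`; both locations are live again.
        ptr::write(pb, tmp);
    }
}
```

None of the three calls can panic, so there is no point at which a panic could leave a value duplicated or missing.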
Example: GapBuffer
The book goes on to present a container type called a gap buffer – an array that has its spare capacity (the "gap") not at the end of the user data but somewhere in the middle (or anywhere really); the motivation is to speed up consecutive insertion operations (think text editor) by placing the gap where the insertion point is.
In the example the container type is sped up by having unsafe code manage the buffer object.
Panic Safety in Unsafe Code
Rust's panic feature itself is not unsafe. However, in unsafe code we might temporarily relax our promises with respect to code safety – and if we panic before we get a chance to get back to a consistent state we're left with an inconsistency. I.e. take special care in unsafe code with calls that might panic.
Reinterpreting Memory with Unions
Unions in Rust are similar to C unions – i.e. data types that can be interpreted in more than one way.
For example, the below is a data type that can be used either as a float or an int
union FloatOrInt {
    f: f32,
    i: i32,
}
// we create an int value
let mut one = FloatOrInt { i: 1 };
// accessing the union is only possible in
// unsafe code
assert_eq!(unsafe { one.i }, 0x00_00_00_01);
// let's mutate it to a float!
one.f = 1.0;
assert_eq!(unsafe { one.i }, 0x3F_80_00_00);
In the above example both fields are 32 bits wide, but it's entirely possible to have unions with fields of different sizes, in which case the union is as large as its largest field.
Assigning to union fields is always possible, but reading them must be done in unsafe code – there is no tag like with enums that tells Rust which type of value currently is held. There is no guarantee that a specific value can be interpreted a certain way – it's up to the user to ensure the bit patterns are valid for the intended use.
There is an attribute #[repr(C)] that ensures the fields of a union are laid out a certain way – i.e. that all fields start at offset 0.
Unions don't know how their fields should be dropped, therefore all the fields must be Copy. The std::mem::ManuallyDrop wrapper type offers a workaround for non-Copy fields though.
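A sketch of the ManuallyDrop workaround, with a hypothetical union IntOrString of my own. ManuallyDrop::new and ManuallyDrop::take are the real std functions; everything else is made up for illustration.

```rust
use std::mem::ManuallyDrop;

/// A union with a non-Copy field, wrapped in ManuallyDrop.
pub union IntOrString {
    pub i: u64,
    pub s: ManuallyDrop<String>,
}

/// Moves the String out of the union.
///
/// # Safety
/// `u` must currently hold a `String` in its `s` field.
pub unsafe fn take_string(mut u: IntOrString) -> String {
    // `take` moves the value out; the union itself never drops its fields,
    // so the String ends up being dropped exactly once, by whoever owns
    // the returned value.
    unsafe { ManuallyDrop::take(&mut u.s) }
}
```

If we never took the String out, it would simply leak, since Rust can't know which field is live when the union goes out of scope.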
Matching Unions
This works like matching on a struct, except that each pattern has to specify exactly one field (and of course the match still needs unsafe)
union SmallOrLarge {
    s: bool,
    l: u64
}
unsafe {
    match u {
        SmallOrLarge { s: true } => { println!("boolean true"); }
        SmallOrLarge { l: 2 } => { println!("integer 2"); }
        _ => { println!("something else"); }
    }
}
Borrowing Unions
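Union fields are not distinct values: borrowing one field of a union borrows the union as a whole. So, following the normal borrowing rules, a mutable borrow of one field rules out any other borrow of that or any other field.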
Coda
This was a tricky chapter. I feel like you can get a lot of value from unsafe code – for performance reasons, for interacting with external code, for custom memory layouts, etc. But you need to be aware of Rust internals and expectations, as the compiler won't help you much – unsafe code will need an extra dose of testing and review.