When Ferrous Metals Corrode, pt. XVI

Intro

All about bytes, chars, Unicode, formatting and regexes. Covers chapter 17, "Strings and Text".

The book starts off with an overview of Unicode, which I will skip here, except to say that Rust's String and str represent text as UTF-8.

Characters (char)

Rust chars are 32-bit values, each holding a single Unicode code point

Classifying Characters

Some classification methods:

  • ch.is_numeric()

  • ch.is_alphabetic()

  • ch.is_alphanumeric()

  • ch.is_whitespace()

  • ch.is_control()

  • ch.is_ascii()

  • ch.is_ascii_alphabetic()

  • ch.is_ascii_digit()

  • ch.is_ascii_hexdigit()

  • ch.is_ascii_alphanumeric()

  • ch.is_ascii_control()

  • ch.is_ascii_graphic()

  • ch.is_ascii_uppercase()

  • ch.is_ascii_lowercase()

  • ch.is_ascii_punctuation()

  • ch.is_ascii_whitespace()
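To see how the Unicode-aware and the ASCII-only methods differ, here's a minimal sketch. The classify helper and its labels are made up for this note, not taken from the book.

```rust
/// Hypothetical helper (not from the book): label a char using a few
/// of the classification methods listed above.
fn classify(ch: char) -> &'static str {
    if ch.is_ascii_digit() {
        "ascii digit"
    } else if ch.is_numeric() {
        // Unicode numerics outside ASCII, e.g. vulgar fractions
        "other numeric"
    } else if ch.is_alphabetic() {
        "alphabetic"
    } else if ch.is_whitespace() {
        "whitespace"
    } else {
        "other"
    }
}
```

The order of the checks matters: every ASCII digit is also numeric in the Unicode sense, so the narrower test has to come first.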

Handling Digits

Some functions for handling digits.

ch.to_digit(radix)

Returns Some(u32) if the char is an ASCII digit in the given radix, None otherwise. For a radix larger than 10, ASCII letters (either case) serve as the extra digits

std::char::from_digit(num, radix)

Converts a u32 into an Option<char>, the reverse of the above; None if num is not a valid digit in that radix

ch.is_digit(radix)

Same as ch.to_digit(radix) != None
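A small sketch of the digit functions in action. Both helper names are my own, and treating the empty string as zero is just a simplification for the example.

```rust
/// Hypothetical helper: parse a string of hex digits into a number.
/// Returns None on any non-hex character or on overflow.
/// Note: an empty string yields Some(0) in this sketch.
fn parse_hex(s: &str) -> Option<u32> {
    let mut value: u32 = 0;
    for ch in s.chars() {
        let digit = ch.to_digit(16)?; // accepts 0-9, a-f and A-F
        value = value.checked_mul(16)?.checked_add(digit)?;
    }
    Some(value)
}

/// Hypothetical helper: the reverse direction for a single digit.
fn hex_digit(n: u32) -> Option<char> {
    std::char::from_digit(n, 16) // always produces lowercase letters
}
```

For instance parse_hex("ff") and parse_hex("FF") both give Some(255), while hex_digit(16) is None because 16 is not a single digit in base 16.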

Case Conversion for Characters

ch.is_lowercase(), ch.is_uppercase()

Test for Unicode lower/uppercase-ness

ch.to_lowercase(), ch.to_uppercase()

Convert to lower/uppercase. This returns an iterator as in Unicode this could result in more than one char, e.g. for 'ß':

// The uppercase form of the German letter "sharp S" is "SS":
let mut upper = 'ß'.to_uppercase();
assert_eq!(upper.next(), Some('S'));
assert_eq!(upper.next(), Some('S'));
assert_eq!(upper.next(), None);

Conversions to and from Integers

The as operator casts any char to an integer of any width, masking off upper bits if required. It will also convert any u8 into a char. Larger ints must go through std::char::from_u32, which returns None if the int doesn't describe a valid code point and Some(char) otherwise.
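A quick sketch of both directions; the two helper names are hypothetical.

```rust
/// Hypothetical helper: the char that follows `ch` in code point order,
/// or None if that value is not a valid char (e.g. a surrogate).
fn next_char(ch: char) -> Option<char> {
    // char -> u32 is always lossless; u32 -> char has to be checked
    std::char::from_u32(ch as u32 + 1)
}

/// Hypothetical helper: cast to u8, silently keeping only the low 8 bits.
fn low_byte(ch: char) -> u8 {
    ch as u8
}
```

The masking can bite: low_byte('\u{20AC}') (the euro sign) is 0xAC, which has nothing to do with how that char is encoded in UTF-8.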

String and str

Creating String Values

String::new()

new empty String

String::with_capacity(n)

new String with pre-alloc buffer

str_slice.to_string()

new String, with contents of the slice

iter.collect()

concat elems from the iterable into a String

slice.to_owned()

copy of slice as a fresh String

Simple Inspection

slice.len()

len in bytes (not chars!)

slice.is_empty()

true if empty

slice[range]

results in a slice of a slice

slice.split_at(i)

split a slice in two and return both parts

slice.is_char_boundary(i)

true if i is at a char boundary
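Because offsets are in bytes, slicing at an arbitrary index can panic in the middle of a multi-byte char. A minimal sketch of guarding against that with is_char_boundary (the helper name is made up):

```rust
/// Hypothetical helper: the longest prefix of `s` that is at most
/// `max_bytes` long and still ends on a char boundary.
fn truncate_to_boundary(s: &str, max_bytes: usize) -> &str {
    if max_bytes >= s.len() {
        return s;
    }
    let mut end = max_bytes;
    // Offset 0 is always a boundary, so this loop terminates.
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    &s[..end]
}
```

With "naïve" (6 bytes, 5 chars) a limit of 3 falls inside the two-byte 'ï', so the result is "na".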

Appending and Inserting Text

Methods:

string.push(ch)

add char to String

string.push_str(slice)

add slice to String

string.extend(iter)

add all elems from iterator to String

string.insert(i, ch)

insert char at i – needs to shift all the later elements

string.insert_str(i, slice)

same as above but for a whole slice

Traits:

std::fmt::Write

String implements this trait, so the write! and writeln! macros can append formatted text to it. They return a Result, but writing to a String never actually fails

Add<&str>, AddAssign<&str>

this allows the use of the + and += operators with a String on the left and a &str on the right. The left-hand String is taken by value and its buffer is reused. Building up text by appending onto the end of a String is efficient; even more so if adequate space has been pre-allocated via String::with_capacity()
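A sketch combining several of the above; the function names and the capacity guess of 16 bytes per item are assumptions for illustration only.

```rust
use std::fmt::Write;

/// Hypothetical helper: render `items` as a numbered list.
fn numbered_list(items: &[&str]) -> String {
    // Rough pre-allocation; the 16 bytes per line is just a guess.
    let mut out = String::with_capacity(items.len() * 16);
    for (i, item) in items.iter().enumerate() {
        // Writing to a String cannot fail, so unwrap is fine here.
        writeln!(out, "{}. {}", i + 1, item).unwrap();
    }
    out
}

/// Hypothetical helper: exercise push, push_str, insert_str and +.
fn greeting(name: &str) -> String {
    let mut s = String::from("Hello");
    s.push(',');
    s.push_str(" !");
    s.insert_str(s.len() - 1, name); // insert just before the '!'
    s + " Welcome." // consumes s and reuses its buffer
}
```

greeting("Ada") comes out as "Hello, Ada! Welcome.".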

Removing and Replacing Text

Some methods for removing and replacing (without changing capacity):

string.clear()

make it empty

string.truncate(n)

throw away everything after byte offset n (no-op if the string is already n bytes or shorter)

string.pop()

pop off last char as an Option (None if string was empty)

string.remove(i)

remove and return char at byte offset i, shift rest forward

string.drain(range)

return an iterator over the chars in range; the range is removed from the string once the iterator is dropped

string.replace_range(range, replacement)

fill in replacement for range (don't have to be same len)
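Two tiny sketches built on these methods; cut and chomp are names I picked, not std functions.

```rust
use std::ops::Range;

/// Hypothetical helper: remove a byte range from `s` and return
/// the removed text as a new String.
fn cut(s: &mut String, range: Range<usize>) -> String {
    s.drain(range).collect()
}

/// Hypothetical helper: drop one trailing newline, if present.
fn chomp(s: &mut String) {
    if s.ends_with('\n') {
        s.pop();
    }
}
```

Starting from "chocolate", cut(&mut s, 3..6) returns "col" and leaves "choate" behind; a following s.replace_range(0..3, "st") turns that into "state".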

Conventions for Searching and Iterating

Some naming conventions from the stdlib

r*

functions that start with r process from end to begin

*n

iterators ending in n return a max. number of matches

*indices

iterators that also return byte offsets

Patterns for Searching Text

When the std lib talks of a pattern one of four things is accepted:

  • a char which matches the char value

  • a string, matching string val

  • a FnMut(char) -> bool closure, matches every char where the closure returns true

  • a &[char] i.e. slice of char values matches every char that's in the slice

The corresponding trait is std::str::Pattern
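The four pattern kinds side by side, as a minimal sketch (first_matches is a hypothetical name):

```rust
/// Hypothetical helper: run `find` with each of the four pattern kinds
/// and return the byte offsets of the first match for each.
fn first_matches(s: &str) -> [Option<usize>; 4] {
    [
        s.find('='),                          // a char
        s.find("=>"),                         // a string
        s.find(|c: char| c.is_ascii_digit()), // a closure
        s.find(&['<', '>'][..]),              // a slice of chars
    ]
}
```

For "a <= 10 => b" this yields offsets 3, 8, 5 and 2 respectively.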

Searching and Replacing

slice.contains(pattern)

true if there's a match

slice.starts_with(pattern), slice.ends_with(pattern)

true if matches start/end

slice.find(pattern), slice.rfind(pattern)

Return Some(i) with the byte offset of the match

slice.replace(pattern, replacement)

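return a new String with every match of pattern replaced by replacement. Replacement is eager (each match is consumed as soon as it's found), so overlapping matches can give surprising results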

slice.replacen(pattern, replacement, n)

as above but max n matches
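There's replacen for the first n matches but nothing for "only the last one", which makes a nice excuse to use rfind. A sketch; replace_last is my own helper, not a std method.

```rust
/// Hypothetical helper: replace only the last occurrence of `from`.
fn replace_last(s: &str, from: &str, to: &str) -> String {
    match s.rfind(from) {
        // i is a byte offset, so it can be used directly for slicing
        Some(i) => format!("{}{}{}", &s[..i], to, &s[i + from.len()..]),
        None => s.to_string(),
    }
}
```

As for the eager replacement: "cabababababbage".replace("aba", "***") gives "c***b***babbage", since matches that overlap an already replaced one are never seen.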

Iterating over Text

Methods that let you split text up and iterate over it. Most of these are DoubleEndedIterator, i.e. they can be reversed (.rev())

slice.chars()

iterator over chars

slice.char_indices()

get chars and corresponding byte offsets

slice.bytes()

iterator over bytes

slice.lines()

iterator over lines (trimming newlines)

slice.split(pattern)

split on pattern; not reversible if pattern is a &str

slice.rsplit(pattern)

as above but scan from end

slice.split_terminator(pattern), slice.rsplit_terminator(pattern)

as above, but the pattern is treated as a terminator rather than a separator: if there's a match at the very end of the slice, no trailing empty element is produced

slice.splitn(n, pattern), slice.rsplitn(n, pattern)

as above but return max n elements

slice.split_whitespace(), slice.split_ascii_whitespace()

split on whitespace (either ascii or unicode whitespace)

slice.matches(pattern), slice.rmatches(pattern)

iterator over matches of pattern

slice.match_indices(pattern), slice.rmatch_indices(pattern)

as above but return (offset, match) tuples
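A sketch of two typical uses, a key/value split with splitn and a terminator-style split. Both helper names are hypothetical.

```rust
/// Hypothetical helper: split "key=value" at the first '=' only.
fn key_value(line: &str) -> Option<(&str, &str)> {
    let mut parts = line.splitn(2, '=');
    let key = parts.next()?;
    let value = parts.next()?; // None if there was no '=' at all
    Some((key.trim(), value.trim()))
}

/// Hypothetical helper: fields of a record in which every field
/// is terminated by ';'.
fn fields(record: &str) -> Vec<&str> {
    record.split_terminator(';').collect()
}
```

fields("x;y;") is ["x", "y"], whereas a plain split(';') on the same input would add an empty third element.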

Trimming

slice.trim()

trim leading and trailing whitespace

slice.trim_matches(pattern), slice.trim_start_matches(pattern), slice.trim_end_matches(pattern)

trim pattern from start and/or end

slice.strip_prefix(pattern), slice.strip_suffix(pattern)

return Some value with trimmed slice, if there's a match at beginning or end of slice, otherwise None
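The difference between trimming and stripping: the trim family removes as many matches as it finds and always returns a slice, while strip_prefix / strip_suffix remove exactly one and tell you whether it was there. A sketch with a made-up helper:

```rust
/// Hypothetical helper: pull the value out of a `--name=value` argument,
/// dropping surrounding double quotes from the value.
fn flag_value<'a>(arg: &'a str, name: &str) -> Option<&'a str> {
    arg.strip_prefix("--")?
        .strip_prefix(name)?
        .strip_prefix('=')
        .map(|v| v.trim_matches('"'))
}
```

flag_value("--output=x", "out") is None: after stripping "out" the rest starts with "put", not with '='.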

Case Conversion for Strings

The slice.to_uppercase() and slice.to_lowercase() methods create a new string

Parsing Other Types from Strings

The std::str::FromStr trait defines how a type is parsed from a string. The usual machine types like usize, f64, bool etc. all implement it. So does char, for single-character strings, as well as IpAddr.

String slices have a parse method for converting to a specific type (but calling T::from_str() directly might be just as readable).
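Both spellings in a short sketch; the helper names are mine.

```rust
use std::num::ParseIntError;
use std::str::FromStr;

/// Hypothetical helper: parse whitespace-separated numbers,
/// stopping at the first one that fails.
fn parse_all(s: &str) -> Result<Vec<u32>, ParseIntError> {
    // collect() can gather an iterator of Results into a Result<Vec<_>, _>
    s.split_whitespace().map(|w| w.parse::<u32>()).collect()
}

/// Hypothetical helper: the same idea via from_str, ignoring the error.
fn parse_float(s: &str) -> Option<f64> {
    f64::from_str(s.trim()).ok()
}
```

Note that parsing is strict about whitespace, hence the trim() before from_str.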

Converting Other Types to Strings

std::fmt::Display trait

for the formatting macros, e.g. assert_eq!(format!("{}, wow", "doge"), "doge, wow");

.to_string()

the ToString trait defines this method; it is implemented automatically for every type that implements Display

std::fmt::Debug trait

format a value for debugging, e.g. via the {:?} format specifier
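How the three fit together, sketched with a made-up Celsius type (nothing like it appears in the book):

```rust
use std::fmt;

/// Hypothetical example type; Debug comes from the derive.
#[derive(Debug)]
struct Celsius(f64);

/// Implementing Display also gives us .to_string() for free.
impl fmt::Display for Celsius {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        write!(f, "{:.1} °C", self.0)
    }
}
```

With this, Celsius(21.5).to_string() is "21.5 °C", while {:?} prints the programmer-facing form Celsius(21.5).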

Borrowing as Other Text-Like Types

Reminder about the AsRef<str>, AsRef<[u8]>, AsRef<Path>, AsRef<OsStr> traits so std lib functions can take strings which will get autoconverted to the correct type.

Also the Borrow<str> trait so that Strings can be used as keys in HashMap and BTreeMap
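A sketch of what these traits buy you in practice; shout and lookup are hypothetical names.

```rust
use std::collections::HashMap;

/// Hypothetical helper: accepts &str, String, &String, ... alike.
fn shout<S: AsRef<str>>(s: S) -> String {
    s.as_ref().to_uppercase()
}

/// Hypothetical helper: thanks to Borrow<str>, a map keyed by String
/// can be queried with a plain &str, no temporary String needed.
fn lookup(map: &HashMap<String, u32>, key: &str) -> Option<u32> {
    map.get(key).copied()
}
```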

Accessing Text as UTF-8

slice.as_bytes()

get a shared ref to a byte slice

string.into_bytes()

take ownership and convert to a Vec<u8> (the String's buffer is handed over, not copied)

Producing Text from Bytes

These create UTF-8 text from bytes

str::from_utf8(byte_slice)

make a string; returns either Ok(&str) or an error if the bytes were not well-formed utf8

String::from_utf8(vec)

construct a String from a byte vec, taking ownership; returns either Ok(String) or an Err whose into_bytes() method hands back the original vec

String::from_utf8_lossy(byte_slice)

construct a string with Unicode replacement chars where necessary; returns a Cow<str>.

String::from_utf8_unchecked(vec)

just wrap up your byte vec without checking if it's well-formed; unsafe.

str::from_utf8_unchecked(byte_slice)

similar to above
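A sketch of the checked conversion plus its error value; valid_prefix is a made-up helper that relies on Utf8Error::valid_up_to().

```rust
/// Hypothetical helper: the longest well-formed UTF-8 prefix of `bytes`.
fn valid_prefix(bytes: &[u8]) -> &str {
    match std::str::from_utf8(bytes) {
        Ok(s) => s,
        // valid_up_to() is the offset up to which the input was fine,
        // so re-checking that prefix is expected to succeed.
        Err(e) => std::str::from_utf8(&bytes[..e.valid_up_to()]).unwrap_or(""),
    }
}
```

For b"hi\xffthere" this returns "hi", whereas String::from_utf8_lossy on b"hi\xff" would give "hi" followed by U+FFFD.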

Putting Off Allocation

Quick example of using Cow in a common situation: either calculate some value or fall back to a static one:

use std::borrow::Cow;

// get a name, return a Cow enum
fn get_name() -> Cow<'static, str> {
    std::env::var("USER")        // we try to grab from the env
        .map(|v| Cow::Owned(v))  // and construct a Cow
        // but if getting user from env fails, borrow from a static str
        .unwrap_or(Cow::Borrowed("whoever you are"))
}

With this in place we only alloc if we really need it.

There's some special support for Cow values holding strings, providing From and Into conversions, which makes it possible to just write:

fn get_name() -> Cow<'static, str> {
    std::env::var("USER")
        .map(|v| v.into())
        .unwrap_or("whoever you are".into())
}

Strings as Generic Collections

Strings share some traits with collections, such as Default (an empty string) and Extend, tacking on chars, slices, … at the end.
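A minimal sketch of Default and Extend together; the initials helper is hypothetical.

```rust
/// Hypothetical helper: first char of every word, glued together.
fn initials(words: &[&str]) -> String {
    let mut out = String::default(); // same as String::new()
    // Extend<char>: append every char the iterator yields
    out.extend(words.iter().filter_map(|w| w.chars().next()));
    out
}
```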

Formatting Values

These macros interpret the formatting mini lang we already encountered:

  • format! for building strings

  • println!, print! macros for printing to stdout

  • writeln!, write! macros for any output stream

  • panic! for termination messages

For supporting user-defined types in formatting, look at the std::fmt module.

The formatting macros all work with shared refs, they never take ownership.

The formatting mini lang uses {...} as "format parameters"; their form is {which:how}, both parts of which are optional, so {} is the default which and how. "Which" selects the argument to format, either by index or by name, while "how" describes the display formatting.

Formatting Text Values

This applies to text values (string slices, Strings, chars). For these the "how" has several optional parts:

  • text length limit

  • minimum field width

  • alignment

  • padding character
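All four parts in one format parameter, as a sketch. The row helper and its column widths are arbitrary choices of mine.

```rust
/// Hypothetical helper: one row of a two-column text table.
fn row(label: &str, value: &str) -> String {
    // label: '.' padding, left-aligned (<), min width 12, cut to 10 chars
    // value: right-aligned (>) in a field of width 6
    format!("{:.<12.10}{:>6}", label, value)
}
```

Within the "how", the order is padding character, alignment, minimum width, then a dot and the length limit. Text is left-aligned by default.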

Formatting Numbers

Optional "how"-params for numeric values:

  • padding and alignment, as above

  • + character to force sign output

  • 0 character to output leading zeroes

  • minimum field width

  • precision for floating-point

  • notation, b for binary, o for octal, or x or X for hexadecimal…
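Two sketches for the numeric options; both helpers and their field widths are made up for illustration.

```rust
/// Hypothetical helper: format an RGB triple as a CSS-style hex color.
fn hex_color(r: u8, g: u8, b: u8) -> String {
    // 0 = leading zeroes, 2 = minimum width, x = lowercase hex
    format!("#{:02x}{:02x}{:02x}", r, g, b)
}

/// Hypothetical helper: right-align an amount with two decimals
/// in a field nine characters wide.
fn price(amount: f64) -> String {
    format!("{:>9.2}", amount)
}
```

Numbers are right-aligned by default, so the > in price is only there for emphasis.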

Formatting Other Types

These types have formatting built in as well: Errors, IpAddr, SocketAddr, booleans. Length limit, field width, and alignment controls work as expected for these.

Formatting Values for Debugging

The {:?} parameter formats values for debugging. Using {:#?} adds some pretty-printing to the output. Including an "x" in the format params will print numbers in hex instead of dec, e.g.: {:02x?}
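The three debug flavours as a sketch; the helper names are hypothetical.

```rust
use std::fmt::Debug;

/// Hypothetical helper: single-line debug output.
fn compact<T: Debug>(value: &T) -> String {
    format!("{:?}", value)
}

/// Hypothetical helper: pretty-printed, one element per line.
fn pretty<T: Debug>(value: &T) -> String {
    format!("{:#?}", value)
}

/// Hypothetical helper: debug output with every number as 2-digit hex.
fn hex_dump(bytes: &[u8]) -> String {
    format!("{:02x?}", bytes)
}
```

hex_dump(&[10, 255]) gives "[0a, ff]": the number flags are applied to each element of the slice.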

Formatting Pointers for Debugging

Typically refs, boxes, Rc values and similar are followed through to their referent when formatted. To get at the actual pointer addresses, use the {:p} format.

Referring to Arguments by Index or Name

You can access formatting arglists out-of-order by providing indexes as in format!("{1},{0},{2}", "zeroth", "first", "second")

More useful is the possibility to access format values by name as in the below:

format!("{description:.<25}{quantity:2} @ {price:5.2}",
        price=3.25,
        quantity=3,
        description="Maple Turmeric Latte")

These look like Python's keyword args but are really just a feature of the formatting macros.

These styles can also be mixed.

Dynamic Widths and Precisions

To determine field widths at run time use something like this:

format!("{:>1$}", content, get_width())

The 1$ tells the macro to use the value of get_width() as a field width.

Instead of num$ you can also use name$:

format!("{:>width$.limit$}", content,
        width=get_width(), limit=get_limit())

Finally, with .* the next positional arg is taken as the precision (for text, the length limit); note that it comes before the value itself:

format!("{:.*}", get_limit(), content)
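The same three forms as self-contained functions, with plain usize parameters standing in for get_width() and get_limit(). The function names are mine.

```rust
/// Width taken from positional argument 1.
fn right_align(content: &str, width: usize) -> String {
    format!("{:>1$}", content, width)
}

/// Width and length limit taken from named arguments.
fn align_and_clip(content: &str, width: usize, limit: usize) -> String {
    format!("{:>width$.limit$}", content, width = width, limit = limit)
}

/// Length limit taken from the next positional argument via `.*`;
/// the limit comes first, then the value to format.
fn clip(content: &str, limit: usize) -> String {
    format!("{:.*}", limit, content)
}
```

Whatever supplies the width or the precision has to be a usize.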

Formatting Your Own Types

The std::fmt lib has a number of traits you can implement for your own type to gain formatting. The structure of those traits is the same, for instance to implement the Display trait for a (user-defined) Complex type:

use std::fmt;

impl fmt::Display for Complex {
    fn fmt(&self, dest: &mut fmt::Formatter) -> fmt::Result {
        let im_sign = if self.im < 0.0 { '-' } else { '+' };
        write!(dest, "{} {} {}i", self.re, im_sign, f64::abs(self.im))
    }
}

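Although fmt returns a Result, implementations should only pass along errors coming from the Formatter and never originate their own, i.e. the formatting logic itself should be infallible.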

Using the Formatting Language in Your Own Code

To make user-defined functions accept formatting params use the format_args! macro and the std::fmt::Arguments type.

Example logging function that takes formatting args:

fn logging_enabled() -> bool { ... }

use std::fs::OpenOptions;
use std::io::Write;

fn write_log_entry(entry: std::fmt::Arguments) {
    if logging_enabled() {
        // Keep things simple for now, and just
        // open the file every time.
        let mut log_file = OpenOptions::new()
            .append(true)
            .create(true)
            .open("log-file-name")
            .expect("failed to open log file");
        // note: the File type supports writing formatted data
        log_file.write_fmt(entry)
            .expect("failed to write to log");
    }
}

// usage
write_log_entry(format_args!("Hark! {:?}\n", mysterious_value));

For extra convenience, wrap the latter in a macro (more on macros later):

macro_rules! log { // no ! needed after name in macro definitions
    ($format:tt, $($arg:expr),*) => (
        write_log_entry(format_args!($format, $($arg),*))
    )
}

// usage
log!("O day and night, but this is wondrous strange! {:?}\n",
     mysterious_value);
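To try out the fmt::Arguments plumbing without touching the filesystem, here's a sketch that appends to an in-memory String instead of a file. append_entry is a hypothetical stand-in for write_log_entry, not the book's code.

```rust
use std::fmt::{self, Write};

/// Hypothetical stand-in for write_log_entry: append one formatted
/// entry plus a newline to an in-memory log.
fn append_entry(log: &mut String, entry: fmt::Arguments) {
    // fmt::Write::write_fmt is the String counterpart of the
    // io::Write::write_fmt call used on the File above.
    log.write_fmt(entry)
        .expect("writing to a String should not fail");
    log.push('\n');
}
```

Called as append_entry(&mut log, format_args!("x = {}", 5)), this leaves "x = 5" plus a newline in log.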

Regular Expressions

There's a non-std regex crate that offers basic matching/searching. It has neither backrefs nor lookahead / lookbehind matches, but it does guarantee that search time is linear in text and pattern size, and it should be safe for untrusted input.

Basic Regex Use

Example usage:

use regex::Regex;

// match a version string
// compile re from &str, propagating the error with ? if compilation fails
let semver = Regex::new(r"(\d+)\.(\d+)\.(\d+)(-[-.[:alnum:]]*)?")?;

// Simple search, with a Boolean result.
let haystack = r#"regex = "0.2.5""#;
assert!(semver.is_match(haystack));

// You can retrieve capture groups:
let captures = semver.captures(haystack)
    .ok_or("semver regex should have matched")?;
assert_eq!(&captures[0], "0.2.5"); // all
assert_eq!(&captures[1], "0"); // first group

// .get() returns an option while the above indexing would panic
assert_eq!(captures.get(4), None);


let haystack = "In the beginning, there was 1.0.0. \
                For a while, we used 1.0.1-beta, \
                but in the end, we settled on 1.2.4.";
// collect all matches, converting each Match to a &str
let matches: Vec<&str> = semver.find_iter(haystack)
    .map(|match_| match_.as_str())
    .collect();
assert_eq!(matches, vec!["1.0.0", "1.0.1-beta", "1.2.4"]);

Building Regex Values Lazily

Like in other envs compiling regex patterns is expensive; it's best to compile them once and reuse.

The lazy_static crate helps with creating static values on demand:

use lazy_static::lazy_static;
use regex::Regex;

lazy_static! {
    static ref SEMVER: Regex
        = Regex::new(r"(\d+)\.(\d+)\.(\d+)(-[-.[:alnum:]]*)?")
              .expect("error parsing regex");
}

This will generate a static SEMVER pattern that is initialized on first use

Normalization

This section discusses the non-std unicode-normalization crate which supports unicode normalization methods, i.e. normalizing strings into a specific form in the presence of unicode combining characters

Coda

So, this was quite a long chapter. I've elided many of the details, since conceptually there's not much new here (and I'm unlikely to recall all the details later anyway). Still, text processing is important and I feel there were some valuable practical hints here.