Intro
All about bytes, chars, unicode, formatting and regexes. Covers chapter 17, "Strings and Text"
The book starts off with an overview of Unicode, which I will skip here, except to say that Rust's String and str represent text as UTF-8.
Characters (char)
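Rust chars are 32-bit values, each holding a single Unicode code point (strictly speaking a Unicode scalar value, so surrogate code points are excluded).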
Classifying Characters
Some classification methods:
ch.is_numeric()
ch.is_alphabetic()
ch.is_alphanumeric()
ch.is_whitespace()
ch.is_control()
ch.is_ascii()
ch.is_ascii_alphabetic()
ch.is_ascii_digit()
ch.is_ascii_hexdigit()
ch.is_ascii_alphanumeric()
ch.is_ascii_control()
ch.is_ascii_graphic()
ch.is_ascii_uppercase()
ch.is_ascii_lowercase()
ch.is_ascii_punctuation()
ch.is_ascii_whitespace()
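As a rough illustration of how these methods combine, here is a minimal sketch. The classify helper is hypothetical (not from the book or the std lib) and the order of the checks is just one possible choice:

```rust
/// Hypothetical helper: label a char using the std classification methods.
fn classify(ch: char) -> &'static str {
    if ch.is_ascii_digit() {
        "ascii digit"
    } else if ch.is_numeric() {
        // Unicode numeric characters outside the ASCII range
        "non-ascii numeric"
    } else if ch.is_alphabetic() {
        "alphabetic"
    } else if ch.is_whitespace() {
        // checked before is_control, since '\n' is both
        "whitespace"
    } else if ch.is_control() {
        "control"
    } else {
        "other"
    }
}
```

Note that the Unicode-aware methods and the is_ascii_* methods can disagree: an Arabic-Indic digit is numeric but not an ASCII digit.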
Handling Digits
Some functions for handling digits.
ch.to_digit(radix)
Returns Some(u32) if the char is an ASCII digit in the given radix, otherwise None. For a radix larger than 10, ASCII letters (either case) serve as the additional digits
std::char::from_digit(num, radix)
Converts a u32 into Some(char) if it is a valid digit in the given radix, otherwise None; the reverse of the above
ch.is_digit(radix)
Same as
ch.to_digit(radix) != None
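A minimal sketch of the digit functions in use. parse_hex is a hypothetical helper written for these notes; in real code u32::from_str_radix would do the same job:

```rust
/// Hypothetical helper: parse a string of hex digits by hand.
/// Returns None on a non-hex char or on overflow; an empty string gives Some(0).
fn parse_hex(s: &str) -> Option<u32> {
    let mut value: u32 = 0;
    for ch in s.chars() {
        // to_digit yields None for anything that is not a digit in radix 16
        let digit = ch.to_digit(16)?;
        value = value.checked_mul(16)?.checked_add(digit)?;
    }
    Some(value)
}
```

Going the other way, std::char::from_digit(11, 16) gives Some('b'); it always produces lowercase letters.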
Case Conversion for Characters
ch.is_lowercase(), ch.is_uppercase()
Test for Unicode lower/uppercase-ness
ch.to_lowercase(), ch.to_uppercase()
Convert to lower/uppercase. This returns an iterator as in Unicode this could result in more than one char, e.g. for 'ß':
// The uppercase form of the German letter "sharp S" is "SS":
let mut upper = 'ß'.to_uppercase();
assert_eq!(upper.next(), Some('S'));
assert_eq!(upper.next(), Some('S'));
assert_eq!(upper.next(), None);
Conversions to and from Integers
The as cast will make any char into an integer of any width, masking off upper bits if required. It will also convert any u8 into a char. Larger ints must use std::char::from_u32; this returns None if the int doesn't describe a valid code point and Some(char) otherwise.
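A small sketch of these conversions; the function name conversions_demo is made up for these notes, the calls themselves are plain std:

```rust
fn conversions_demo() {
    // char -> integer: `as` never fails, it truncates if necessary
    assert_eq!('*' as i32, 42);
    assert_eq!('\u{0CA0}' as u16, 0xca0);
    assert_eq!('\u{0CA0}' as i8, -0x60); // upper bits masked off

    // u8 -> char is always valid
    assert_eq!(97u8 as char, 'a');

    // wider integers go through from_u32, which checks validity
    assert_eq!(std::char::from_u32(0x2764), Some('\u{2764}'));
    assert_eq!(std::char::from_u32(0xD800), None); // surrogate
    assert_eq!(std::char::from_u32(0x110000), None); // beyond Unicode range
}
```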
String and str
Creating String Values
String::new()
new empty String
String::with_capacity(n)
new String with pre-alloc buffer
str_slice.to_string()
new String, with contents of the slice
iter.collect()
concat elems from the iterable into a String
slice.to_owned()
copy of slice as a fresh String
Simple Inspection
slice.len()
len in bytes (not chars!)
slice.is_empty()
true if empty
slice[range]
results in a slice of a slice
slice.split_at(i)
split a slice in two and return both parts
slice.is_char_boundary(i)
true if i is at a char boundary
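Because lengths and offsets are in bytes, slicing at an arbitrary offset can panic in the middle of a multi-byte char. A minimal sketch of guarding against that with is_char_boundary; truncate_to_boundary is a hypothetical helper, not a std function:

```rust
/// Hypothetical helper: longest prefix of `s` that is at most `max_bytes`
/// long and does not cut a char in half.
fn truncate_to_boundary(s: &str, max_bytes: usize) -> &str {
    if max_bytes >= s.len() {
        return s;
    }
    let mut end = max_bytes;
    // offset 0 is always a boundary, so this loop terminates
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    &s[..end]
}
```

With "naïve" (5 chars, 6 bytes) a limit of 3 bytes lands inside the ï, so the helper backs up and returns "na".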
Appending and Inserting Text
Methods:
string.push(ch)
add char to String
string.push_str(slice)
add slice to String
string.extend(iter)
add all elems from iterator to String
string.insert(i, ch)
insert char at i – needs to shift all the later elements
string.insert_str(i, slice)
same as above but for a whole slice
Traits:
std::fmt::Write
String implements this trait, so the write!/writeln! macros can append formatted text to a String. They return a Result, but writing into a String is actually infallible.
Add<&str>, AddAssign<&str>
These allow the use of the + and += operators with a String on the left and a &str on the right. The left-hand String is taken by value and its buffer is reused.
Building up text by appending onto the end of a String is efficient; even more so if adequate space has been pre-allocated via String::with_capacity()
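A minimal sketch combining the appending methods with write!. The helper names and the capacity of 32 are arbitrary choices for illustration:

```rust
use std::fmt::Write; // needed so that write! works on a String

/// Hypothetical helper: builds "Hello, a, b!" from a list of names.
fn build_greeting(names: &[&str]) -> String {
    let mut out = String::with_capacity(32);
    out.push_str("Hello");
    for name in names {
        // writing into a String cannot fail, so unwrap() is fine here
        write!(out, ", {}", name).unwrap();
    }
    out.push('!');
    out
}

/// `+` consumes the left-hand String and reuses its buffer.
fn concat(left: String, right: &str) -> String {
    left + right
}
```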
Removing and Replacing Text
Some methods for removing and replacing (without changing capacity):
string.clear()
make it empty
string.truncate(n)
throw away everything after byte offset n (no-op if the string is already that short or shorter)
string.pop()
pop off last char as an Option (None if string was empty)
string.remove(i)
remove and return char at byte offset i, shift rest forward
string.drain(range)
return an iterator over the chars in the given byte range; the range is removed from the string once the iterator is dropped
string.replace_range(range, replacement)
fill in replacement for range (the two don't have to be the same length)
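A short sketch chaining several of these methods; edit_demo is just a made-up name, and the intermediate values are noted in the comments:

```rust
fn edit_demo() -> String {
    let mut s = String::from("chocolate");

    // drain yields the removed chars; afterwards the range is gone from `s`
    let removed: String = s.drain(3..6).collect();
    assert_eq!(removed, "col");
    assert_eq!(s, "choate");

    assert_eq!(s.pop(), Some('e')); // s is now "choat"
    assert_eq!(s.remove(0), 'c');   // s is now "hoat"
    s.truncate(2);                  // s is now "ho"

    // the replacement may be longer or shorter than the range it replaces
    s.replace_range(..1, "Oh n");
    s
}
```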
Conventions for Searching and Iterating
Some naming conventions from the stdlib
- r*
functions that start with r work from the end of the text towards the beginning
- *n
iterators whose names end in n limit themselves to a given maximum number of matches
- *_indices
iterators whose names end in _indices also return byte offsets
Patterns for Searching Text
When the std lib talks of a pattern, one of four things is accepted:
a char, which matches that character
a String, &str or &&str, which matches a substring equal to the pattern
a FnMut(char) -> bool closure, which matches every single char for which the closure returns true
a &[char], i.e. a slice of char values, which matches any single char that appears in the slice
The corresponding trait is std::str::pattern::Pattern
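The four kinds of pattern can be sketched with find, which accepts any of them (pattern_demo is a made-up name):

```rust
fn pattern_demo() {
    let s = "one, two;three";

    // a char
    assert_eq!(s.find(','), Some(3));
    // a string slice
    assert_eq!(s.find("two"), Some(5));
    // a closure over chars
    assert_eq!(s.find(|c: char| c == ';'), Some(8));
    // a slice of chars: matches whichever of them comes first
    assert_eq!(s.find(&[';', ','][..]), Some(3));
}
```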
Searching and Replacing
slice.contains(pattern)
true if there's a match
slice.starts_with(pattern), slice.ends_with(pattern)
true if matches start/end
slice.find(pattern), slice.rfind(pattern)
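Return Some(i) with the byte offset of the first (or, for rfind, the last) match, or None if there is no match
slice.replace(pattern, replacement)
return a new String with every match of pattern replaced by replacement; matching is eager, i.e. non-overlapping matches are replaced left to right as soon as they're found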
slice.replacen(pattern, replacement, n)
as above but max n matches
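A quick sketch of the searching and replacing methods; the "cabababababbage" case shows what eager replacement means when matches could overlap:

```rust
fn search_demo() {
    let s = "The quick brown fox";
    assert!(s.contains("quick"));
    assert!(s.starts_with('T'));
    assert!(s.ends_with("fox"));
    assert_eq!(s.find('o'), Some(12));  // first 'o', in "brown"
    assert_eq!(s.rfind('o'), Some(17)); // last 'o', in "fox"
    assert_eq!(s.find('z'), None);

    // replace works left to right and never revisits replaced text
    assert_eq!("cabababababbage".replace("aba", "***"), "c***b***babbage");
    // replacen stops after the given number of replacements
    assert_eq!("a-b-c".replacen('-', "+", 1), "a+b-c");
}
```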
Iterating over Text
Methods that let you split text up and iterate over it. Most of these are DoubleEndedIterators, i.e. they can be reversed with .rev()
slice.chars()
iterator over chars
slice.char_indices()
get chars and corresponding byte offsets
slice.bytes()
iterator over bytes
slice.lines()
iterator over lines (trimming newlines)
slice.split(pattern)
split on pattern; the iterator is only reversible if the pattern can be searched from both ends, which a &str cannot
slice.rsplit(pattern)
as above but scan from end
slice.split_terminator(pattern), slice.rsplit_terminator(pattern)
as above but pattern is treated as a terminator rather than a separator: if there's a match at the very end of the slice, no trailing empty element is produced
slice.splitn(n, pattern), slice.rsplitn(n, pattern)
as above but return max n elements
slice.split_whitespace(), slice.split_ascii_whitespace()
split on whitespace (either ascii or unicode whitespace)
slice.matches(pattern), slice.rmatches(pattern)
iterator over matches of pattern
slice.match_indices(pattern), slice.rmatch_indices(pattern)
as above but return (offset, match) tuples
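The differences between the splitting variants are easiest to see side by side. A small sketch (iteration_demo is a made-up name):

```rust
fn iteration_demo() {
    // char_indices pairs each char with its byte offset ('\u{e9}' takes 2 bytes)
    let pairs: Vec<(usize, char)> = "\u{e9}lan".char_indices().collect();
    assert_eq!(pairs, vec![(0, '\u{e9}'), (2, 'l'), (3, 'a'), (4, 'n')]);

    // split keeps the trailing empty piece, split_terminator drops it
    let parts: Vec<&str> = "a,b,".split(',').collect();
    assert_eq!(parts, vec!["a", "b", ""]);
    let parts: Vec<&str> = "a,b,".split_terminator(',').collect();
    assert_eq!(parts, vec!["a", "b"]);

    // splitn / rsplitn stop after n pieces; the last piece holds the rest
    let parts: Vec<&str> = "a,b,c".splitn(2, ',').collect();
    assert_eq!(parts, vec!["a", "b,c"]);
    let parts: Vec<&str> = "a,b,c".rsplitn(2, ',').collect();
    assert_eq!(parts, vec!["c", "a,b"]);

    // a char pattern can be searched from both ends, so .rev() works
    let parts: Vec<&str> = "a,b,c".split(',').rev().collect();
    assert_eq!(parts, vec!["c", "b", "a"]);

    let words: Vec<&str> = " one \t two\n".split_whitespace().collect();
    assert_eq!(words, vec!["one", "two"]);

    let hits: Vec<(usize, &str)> = "abcXabc".match_indices("abc").collect();
    assert_eq!(hits, vec![(0, "abc"), (4, "abc")]);
}
```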
Trimming
slice.trim()
trim leading and trailing whitespace
slice.trim_matches(pattern), slice.trim_start_matches(pattern), slice.trim_end_matches(pattern)
trim all matches of pattern from both ends, from the start, or from the end, respectively
slice.strip_prefix(pattern), slice.strip_suffix(pattern)
return Some value with trimmed slice, if there's a match at beginning or end of slice, otherwise None
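A brief sketch of the trimming methods. Note that the trim_* methods remove repeated matches while strip_prefix/strip_suffix remove at most one and report whether anything matched:

```rust
fn trim_demo() {
    assert_eq!("\t  hi \n".trim(), "hi");
    assert_eq!("001990".trim_start_matches('0'), "1990");
    assert_eq!("xxhixx".trim_matches('x'), "hi");
    assert_eq!("hi!!!".trim_end_matches('!'), "hi");

    // strip_* return an Option, so a missing prefix/suffix is visible
    assert_eq!("file.rs".strip_suffix(".rs"), Some("file"));
    assert_eq!("file.rs".strip_prefix("x"), None);
}
```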
Case Conversion for Strings
The slice.to_uppercase() and slice.to_lowercase() methods return a freshly allocated String with the converted text.
Parsing Other Types from Strings
Types that implement the std::str::FromStr trait can be parsed from a string. The usual machine types like usize, f64, bool etc. all implement it. char also does, for single-character strings, as does IpAddr.
String slices have a parse method for converting to a specific type, which usually needs a type annotation (calling T::from_str() directly might be just as readable).
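A minimal sketch of both spellings, from_str and parse (parse_demo is a made-up name; the inputs are arbitrary):

```rust
use std::net::IpAddr;
use std::str::FromStr;

fn parse_demo() {
    // calling the trait method directly
    assert_eq!(usize::from_str("3628800"), Ok(3628800));
    assert_eq!(bool::from_str("true"), Ok(true));
    assert!(f64::from_str("not a float").is_err());

    // char parses only from single-character strings
    assert_eq!("x".parse::<char>(), Ok('x'));
    assert!("xyz".parse::<char>().is_err());

    // parse needs to be told the target type, here via turbofish
    let addr = "127.0.0.1".parse::<IpAddr>().unwrap();
    assert!(addr.is_loopback());
}
```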
Converting Other Types to Strings
std::fmt::Display
trait for the formatting macros ({} format specifier), e.g.
assert_eq!(format!("{}, wow", "doge"), "doge, wow");
.to_string()
the ToString trait defines this method; it is implemented automatically for every type that implements Display
std::fmt::Debug
trait to format a value for debugging, e.g. via the {:?} format specifier
Borrowing as Other Text-Like Types
Reminder about the AsRef<str>, AsRef<[u8]>, AsRef<Path>, AsRef<OsStr>
traits so std lib functions can take strings which will get autoconverted to the correct type.
Also the Borrow<str>
trait so that Strings can be used as keys in HashMap and BTreeMap
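A minimal sketch of what these traits buy you in practice. Both helpers, extension_of and lookup, are hypothetical and exist only to show the trait bounds at work:

```rust
use std::collections::HashMap;
use std::path::Path;

/// Hypothetical helper: accepts &str, String, &Path, PathBuf, ... via AsRef<Path>.
fn extension_of<P: AsRef<Path>>(path: P) -> Option<String> {
    path.as_ref()
        .extension()
        .map(|ext| ext.to_string_lossy().into_owned())
}

/// Thanks to Borrow<str>, a HashMap<String, _> can be queried with a plain &str,
/// without allocating a String for the lookup.
fn lookup(map: &HashMap<String, u32>, key: &str) -> Option<u32> {
    map.get(key).copied()
}
```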
Accessing Text as UTF-8
slice.as_bytes()
get a shared ref to a byte slice
string.into_bytes()
take ownership and convert to a Vec<u8>
Producing Text from Bytes
These create Unicode text from UTF-8 encoded bytes
str::from_utf8(byte_slice)
make a string slice; returns either Ok(&str) or an error if the bytes were not well-formed UTF-8
String::from_utf8(vec)
construct a String from a byte vec, taking ownership; returns either Ok(String) or an Err whose accessor methods (e.g. into_bytes()) give back the failing data
String::from_utf8_lossy(byte_slice)
construct a string with Unicode replacement chars where necessary; returns a Cow<str>, which only allocates if a replacement was actually needed.
String::from_utf8_unchecked(vec)
just wrap up your byte vec without checking if it's well-formed; unsafe.
str::from_utf8_unchecked(byte_slice)
similar to above
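A short sketch of the safe constructors (bytes_demo is a made-up name; the byte values are just sample data):

```rust
use std::borrow::Cow;

fn bytes_demo() {
    // 0xe9 0x8c 0x86 is the UTF-8 encoding of U+9306
    let good = vec![0xe9, 0x8c, 0x86];
    assert_eq!(String::from_utf8(good).ok(), Some("\u{9306}".to_string()));

    // a stray continuation byte up front makes this ill-formed
    let bad = vec![0x9f, 0xf0, 0xa6, 0x80];
    let err = String::from_utf8(bad).unwrap_err();
    // the error hands the original bytes back
    assert_eq!(err.into_bytes(), vec![0x9f, 0xf0, 0xa6, 0x80]);

    // lossy conversion borrows when the input is already valid...
    assert!(matches!(String::from_utf8_lossy(b"plain"), Cow::Borrowed("plain")));
    // ...and substitutes U+FFFD for invalid sequences otherwise
    assert_eq!(String::from_utf8_lossy(b"ab\xffcd"), "ab\u{fffd}cd");
}
```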
Putting Off Allocation
Quick example of using Cow in a common situation: compute some value at run time, but fall back to a static value if that fails:
use std::borrow::Cow;
// get a name, return a Cow enum
fn get_name() -> Cow<'static, str> {
    std::env::var("USER") // we try to grab from the env
        .map(|v| Cow::Owned(v)) // and construct a Cow
        // but if getting user from env fails, borrow from a static str
        .unwrap_or(Cow::Borrowed("whoever you are"))
}
With this in place we only alloc if we really need it.
There's some special support for Cow values holding strings in the form of From and Into conversions, which makes it possible to just write:
fn get_name() -> Cow<'static, str> {
    std::env::var("USER")
        .map(|v| v.into())
        .unwrap_or("whoever you are".into())
}
Strings as Generic Collections
Strings share some traits with collections, such as Default (an empty string) and Extend, which tacks chars, slices, … onto the end.
Formatting Values
These macros interpret the formatting mini-language we already encountered:
format!
for building strings
println!, print!
macros for printing to stdout
writeln!, write!
macros for any output stream
panic!
for termination messages
For supporting user-defined types in formatting, look at the std::fmt module.
The formatting macros all work with shared refs, they never take ownership.
The formatting mini-language uses {...} as "format parameters"; their form is {which:how}. Both parts are optional, so {} uses the default which and how. "Which" selects the argument to format, either by index or by name, while "how" controls the display formatting.
Formatting Text Values
This applies to text values (string slices, Strings, chars). For these the "how" has several optional parts:
text length limit
minimum field width
alignment
padding character
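A small sketch of the four text options, alone and combined (text_format_demo is a made-up name). In the "how" part they appear in the order fill, alignment, width, then .limit:

```rust
fn text_format_demo() {
    // minimum field width: pads on the right by default, never truncates
    assert_eq!(format!("{:4}", "bookends"), "bookends");
    assert_eq!(format!("{:12}", "bookends"), "bookends    ");
    // text length limit
    assert_eq!(format!("{:.4}", "bookends"), "book");
    // alignment: < left, ^ centre, > right
    assert_eq!(format!("{:>12}", "bookends"), "    bookends");
    assert_eq!(format!("{:^12}", "bookends"), "  bookends  ");
    // padding character goes in front of the alignment
    assert_eq!(format!("{:=^12}", "bookends"), "==bookends==");
    // everything at once: fill '*', centred, width 12, limit 4
    assert_eq!(format!("{:*^12.4}", "bookends"), "****book****");
}
```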
Formatting Numbers
Optional "how"-params for numeric values:
padding and alignment, as above
+ character to force sign output
0 character to output leading zeroes
minimum field width
precision for floating-point
notation: b for binary, o for octal, or x or X for hexadecimal…
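A sketch of the numeric options (number_format_demo is a made-up name). The # flag used in the last two lines is not mentioned in the list above; it asks for an explicit radix prefix:

```rust
fn number_format_demo() {
    assert_eq!(format!("{:+}", 1234), "+1234");        // forced sign
    assert_eq!(format!("{:08}", 1234), "00001234");    // leading zeroes, width 8
    assert_eq!(format!("{:<8}", 1234), "1234    ");    // numbers right-align by default
    assert_eq!(format!("{:.2}", 1234.5678), "1234.57"); // float precision
    assert_eq!(format!("{:08.2}", 1234.5678), "01234.57");

    // notations
    assert_eq!(format!("{:b}", 1234), "10011010010");
    assert_eq!(format!("{:o}", 8), "10");
    assert_eq!(format!("{:x}", 1234), "4d2");
    assert_eq!(format!("{:X}", 1234), "4D2");

    // with the # flag the radix prefix is included (and counts towards the width)
    assert_eq!(format!("{:#x}", 1234), "0x4d2");
    assert_eq!(format!("{:#010x}", 1234), "0x000004d2");
}
```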
Formatting Other Types
These types have formatting built in as well: Errors, IpAddr, SocketAddr, booleans. Length limit, field width, and alignment controls work as expected for these.
Formatting Values for Debugging
The {:?}
parameter formats values for debugging. Using {:#?}
adds some pretty-printing to the output. Including an "x" in the format params will print numbers in hex instead of dec, e.g.: {:02x?}
Formatting Pointers for Debugging
Typically refs, boxes, Rc values and similar are formatted by following the pointer through to the referent. To get at the actual pointer addresses, use the {:p} format.
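A minimal sketch of the difference, using Rc so that two handles can share one address (pointer_demo is a made-up name; the addresses themselves vary from run to run):

```rust
use std::rc::Rc;

fn pointer_demo() {
    let original = Rc::new("mazurka".to_string());
    let cloned = original.clone();
    let impostor = Rc::new("mazurka".to_string());

    // {} follows the pointer and shows the text
    assert_eq!(format!("{}", cloned), "mazurka");
    assert_eq!(format!("{}", impostor), "mazurka");

    // {:p} shows the address: clones share it, a separate allocation does not
    assert_eq!(format!("{:p}", original), format!("{:p}", cloned));
    assert_ne!(format!("{:p}", original), format!("{:p}", impostor));
}
```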
Referring to Arguments by Index or Name
You can access formatting arglists out-of-order by providing indexes as in format!("{1},{0},{2}", "zeroth", "first", "second")
More useful is the possibility to access format values by name as in the below:
format!("{description:.<25}{quantity:2} @ {price:5.2}",
        price=3.25,
        quantity=3,
        description="Maple Turmeric Latte")
These look like Python's keyword args but really are only defined within the macro.
These styles can also be mixed.
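A sketch of the three styles: by index, by name, and mixed (args_demo is a made-up name). When mixing, the plain {} parameters count through the positional arguments independently of any explicit indexes:

```rust
fn args_demo() {
    // by index
    assert_eq!(
        format!("{1},{0},{2}", "zeroth", "first", "second"),
        "first,zeroth,second"
    );

    // by name; the order of the named arguments does not matter
    assert_eq!(
        format!("{description:.<25}{quantity:2} @ {price:5.2}",
                price = 3.25,
                quantity = 3,
                description = "Maple Turmeric Latte"),
        "Maple Turmeric Latte..... 3 @  3.25"
    );

    // mixed: named arguments must come after the positional ones
    assert_eq!(
        format!("{mode} {2} {} {}", "people", "eater", "purple", mode = "flying"),
        "flying purple people eater"
    );
}
```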
Dynamic Widths and Precisions
To determine field widths at run time use something like this:
format!("{:>1$}", content, get_width())
The 1$
tells the macro to use the value of get_width()
as a field width.
Instead of num$ you can also use name$:
format!("{:>width$.limit$}", content,
        width=get_width(), limit=get_limit())
Finally, writing * in place of the precision (the text length limit) takes the next positional argument as that precision:
format!("{:.*}", get_limit(), content)
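The three variants can be sketched as small wrappers. The function names are made up; the width and limit parameters stand in for get_width() and get_limit():

```rust
/// Width taken from positional argument 1.
fn right_align(content: &str, width: usize) -> String {
    format!("{:>1$}", content, width)
}

/// Width and text length limit taken from named arguments.
fn clip_and_align(content: &str, width: usize, limit: usize) -> String {
    format!("{:>width$.limit$}", content, width = width, limit = limit)
}

/// `.*` consumes the next positional argument as the precision,
/// so the limit has to come before the content.
fn clip(content: &str, limit: usize) -> String {
    format!("{:.*}", limit, content)
}
```

Widths and precisions supplied this way have to be of type usize.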
Formatting Your Own Types
The std::fmt
lib has a number of traits you can implement for your own type to gain formatting. The structure of those traits is the same, for instance to implement the Display trait for a (user-defined) Complex type:
use std::fmt;
impl fmt::Display for Complex {
    fn fmt(&self, dest: &mut fmt::Formatter) -> fmt::Result {
        let im_sign = if self.im < 0.0 { '-' } else { '+' };
        write!(dest, "{} {} {}i", self.re, im_sign, f64::abs(self.im))
    }
}
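Although fmt returns a fmt::Result, an implementation should never originate errors itself; it should only propagate failures reported by the Formatter it writes to.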
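To try the Display impl out, a type definition is needed. A self-contained sketch, assuming Complex is a plain struct with two f64 fields named re and im (the notes above don't show its definition):

```rust
use std::fmt;

// Assumed definition; only the field names re and im are implied by the impl.
struct Complex {
    re: f64,
    im: f64,
}

impl fmt::Display for Complex {
    fn fmt(&self, dest: &mut fmt::Formatter) -> fmt::Result {
        let im_sign = if self.im < 0.0 { '-' } else { '+' };
        // the only possible error comes from `dest`, and write! passes it on
        write!(dest, "{} {} {}i", self.re, im_sign, f64::abs(self.im))
    }
}
```

Implementing Display also provides .to_string() for free, so Complex { re: 1.0, im: -2.0 }.to_string() works without further code.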
Using the Formatting Language in Your Own Code
To make user-defined functions accept formatting params use the format_args!
macro and the std::fmt::Arguments
type.
Example logging function that takes formatting args:
fn logging_enabled() -> bool { ... }
use std::fs::OpenOptions;
use std::io::Write;
fn write_log_entry(entry: std::fmt::Arguments) {
    if logging_enabled() {
        // Keep things simple for now, and just
        // open the file every time.
        let mut log_file = OpenOptions::new()
            .append(true)
            .create(true)
            .open("log-file-name")
            .expect("failed to open log file");
        // note: the File type supports writing formatted data
        log_file.write_fmt(entry)
            .expect("failed to write to log");
    }
}
// usage
write_log_entry(format_args!("Hark! {:?}\n", mysterious_value));
For extra convenience, wrap the latter in a macro (more on macros later):
macro_rules! log { // no ! needed after name in macro definitions
    ($format:tt, $($arg:expr),*) => (
        write_log_entry(format_args!($format, $($arg),*))
    )
}
// usage
log!("O day and night, but this is wondrous strange! {:?}\n",
     mysterious_value);
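The same idea can be tried without touching the file system. A minimal sketch that appends to an in-memory String instead of a log file; append_entry and log_to are hypothetical names, and the macro takes the whole argument list as token trees rather than a format plus expressions:

```rust
use std::fmt::{self, Write};

/// Hypothetical stand-in for write_log_entry: appends to a String buffer.
fn append_entry(log: &mut String, entry: fmt::Arguments<'_>) {
    // writing into a String cannot fail
    log.write_fmt(entry).expect("writing to a String cannot fail");
}

/// Hypothetical convenience macro: forwards everything after the buffer
/// to format_args!, so it accepts whatever format! accepts.
macro_rules! log_to {
    ($log:expr, $($arg:tt)*) => {
        append_entry($log, format_args!($($arg)*))
    };
}
```

Since format_args! borrows its arguments and builds no String of its own, the only allocation here is the growth of the buffer.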
Regular Expressions
There's a non-std regex crate that offers regular expression matching/searching. It has neither backreferences nor lookahead/lookbehind matches, but it does guarantee that search time is linear in the size of the pattern and in the length of the text, and it should be safe for untrusted input.
Basic Regex Use
Example usage:
use regex::Regex;
// match a version string
// compile re from &str; the ? passes the error on to the caller if compilation fails
let semver = Regex::new(r"(\d+)\.(\d+)\.(\d+)(-[-.[:alnum:]]*)?")?;
// Simple search, with a Boolean result.
let haystack = r#"regex = "0.2.5""#;
assert!(semver.is_match(haystack));
// You can retrieve capture groups:
let captures = semver.captures(haystack)
    .ok_or("semver regex should have matched")?;
assert_eq!(&captures[0], "0.2.5"); // the whole match
assert_eq!(&captures[1], "0"); // first group
// .get() returns an Option, whereas indexing a group that did not match would panic
assert_eq!(captures.get(4), None);
let haystack = "In the beginning, there was 1.0.0. \
                For a while, we used 1.0.1-beta, \
                but in the end, we settled on 1.2.4.";
// collect all matches and convert each to a &str
let matches: Vec<&str> = semver.find_iter(haystack)
    .map(|match_| match_.as_str())
    .collect();
assert_eq!(matches, vec!["1.0.0", "1.0.1-beta", "1.2.4"]);
Building Regex Values Lazily
Like in other envs compiling regex patterns is expensive; it's best to compile them once and reuse.
The lazy_static crate helps with creating static values on demand:
use lazy_static::lazy_static;
use regex::Regex;
lazy_static! {
    static ref SEMVER: Regex
        = Regex::new(r"(\d+)\.(\d+)\.(\d+)(-[-.[:alnum:]]*)?")
            .expect("error parsing regex");
}
This will generate a static SEMVER pattern that is initialized on first use.
Normalization
This section discusses the non-std unicode-normalization crate, which supports the Unicode normalization forms, i.e. converting strings into one specific form in the presence of Unicode combining characters.
Coda
So, this was quite a long chapter; I've elided many of the details since, conceptually, there wasn't too much new stuff (and I'm unlikely to recall all the details later anyway). Still, text processing is important and I feel there were some valuable practical hints here.