Clojure: Collecting things

Last time I've explored scalar data types. There are a lot of details to discover there, but on the whole that was dry stuff.

Now for something a bit more complex: collection data types (putting the List in Lisp).

Persistence and Terminology

Immutable sequences are called persistent in The Book. This means that changes to a vector or list result in a new sequence which contains the modified data, whereas the previous version of the sequence is still around.

For example, let's create a vector v:

user> (def v [1 2 3])
#'user/v

Now replace 1 with 7, and bind the result to v1, which now contains [7 2 3]:

user> (def v1 (replace {1 7} v))
#'user/v1
user> v1
[7 2 3]

The original vector v is unchanged however:

user> v
[1 2 3]

I.e. the replace function didn't modify the original.

It is possible, however, to create Java arrays that do allow in-place modification, however. The function into-array creates such an array, and aset performs in-place modification:

user> (def v (into-array [1 2 3]))
#'user/v
user> (seq v)
(1 2 3)
user> (aset v 1 666)
666
user> (seq v)
(1 666 3)

The book promises that persistent listst are efficient by only doing a kind of copy-on-write, ie. constructing new versions of lists by doing shallow copies where possible. I remember that Erlang seemed to do a lot of mem copying when working with lists, I hope Clojure is more frugal here. A quick tests seems to indicate that indeed Clojure doesn't copy whole lists for simple cases:

user> (def l (vec (range 10000000)))
#'user/l
user> (time
       (dotimes [n 1000]
         (count l)))
"Elapsed time: 0.143 msecs"
nil
user> (time
       (dotimes [n 1000]
         (count (conj l n))))
"Elapsed time: 2.666 msecs"
nil

Some Definitions

Apropos sequences. To straighten the terminology, here is an overview of terms relating to collections and how they're meant to be used in Clojure. Important for me, because Python does use some of these terms, but with different semantics.

Term Brief Example
Collection A composite datatype (fn 1 2), [1], {a: 2, :b 3}, Sequential Ordered [2 3], (:a :b) Sequence A sequential that may or may not exist yet (map fn coll)

Equality

There are three basic categories or partitions: sequentials, sets, and maps. One important property of those partitions are that two values belonging to different partitions will never compare equal, even if their values might be.

Two sequentials that are equal:

user> (= [] '())
true
user> (= [1] '(1))
true

Vectors and sets never are however:

user> (= [] '#{})
false
user> (= [1] '#{1})
false

Everything that implements java.util.List is included in the sequential partition.

The Sequence

The sequence API comprises two functions: first and next. They should return the first element in a sequential, and the next one respectively. Pretty minimal -- note eg. the absence of previous and last functions. Cutting down to first and next allows to treat streams and consumer queues as sequences; it's really the most minimal interface to sequentials one can think of.

Usually the seq function is used to adapt a collection to a sequence, eg.:

user> (seq [1 2 3])
(1 2 3)
user> (seq {:a 1 :b 2})
([:a 1] [:b 2])
user> (seq #{:a :b :c})
(:a :c :b)

There are more functions that extract a sequence out of a given collection. For instance, maps support keys and values:

user> (def m {:a "a string" :b "b string"})
#'user/m
user> (keys m)
(:a :b)
user> (vals m)
("a string" "b string")

Vectors support the rseq function, which returns a reverse sequence:

user> (rseq [1 2 3])
(3 2 1)

Vectors

I've used vectors a lot already. Vectors are not entirely unlike arrays, ie. they can be indexed by integers and cannot be sparse (next index after n is n+1).

Building vectors

Some functions for building vectors:

user> [1 2] ; literal
[1 2]
user> (vec '(1 2 3)) ; vector from list
[1 2 3]
user> (vector 1 2 3) ; vector from varargs
[1 2 3]
user> (into [1 "string"] '(:a :b)) ; extend a vector with a list
[1 "string" :a :b]

Standard vectors can hold any datatype. However, it's possible to get vectors for primitive types. To do this, one can use the vector-of function with a keyword of :int, :long, :float, :double, :byte, :short, :boolean, or :char.

Construct a vector of integers and fill in some values. Note that some values will be cast automatically, but not all:

user> (def intv (vector-of :int))
#'user/intv
user> intv
[]
user> (into intv [1 2 3.4])
[1 2 3]
user> (into intv [1 2 3.4 "gugu"])
ClassCastException java.lang.String cannot be cast to java.lang.Character  clojure.lang.RT.intCast (RT.java:1087)

Accessing vectors

How to access vectors? Numbered items from a vector can be retrieved by the nth function, the get function, and by using the vector as a function:

user> (def v [\a \b \c])
#'user/v
user> (nth v 0)
\a
user> (get v 1)
\b
user> (v 2)
\c
user> (nth v 23)
IndexOutOfBoundsException   clojure.lang.PersistentVector.arrayFor (PersistentVector.java:107)
user> (nth v 23 "not found")
"not found"
user> (get v 23)
nil
user> (get v 23 "not found")
"not found"
user> (v 23)
IndexOutOfBoundsException   clojure.lang.PersistentVector.arrayFor (PersistentVector.java:107)

There is no support for "not found" values when using the vector as a function.

Items in a vector can be "changed" (they're persistent, so a new vector which shares parts with the old vector is returned) with the assoc function:

user> (assoc v 0 "an A")
["an A" \b \c]

Vectors as Stacks

Vectors can be used as stacks. To this end they support the conj and pop operations (the push/pop operations for classic stacks).

However, since vectors are an immutable data type, these operations do not actually change the vector they operate on, but return a new vector:

user> (def my-plates [])
#'user/my-plates
user> (conj my-plates "big plate")
["big plate"]
user> my-plates
[]
user> (def new-plates (conj my-plates "big plate"))
#'user/new-plates
user> new-plates
["big plate"]
user> (peek new-plates)
"big plate"
user> (pop new-plates)
[]

Subvectors

Not entirely unlike Pythons' slices, subvectors provide efficient access to parts of vectors. The subvec function returns a part of an original vector, starting at, but ending before the given positions. Positions are of course zero-based.

user> (def elements [:water :air :fire :earth])
#'user/elements
user> (subvec elements 1 3)
[:air :fire]
user> (subvec elements 0 1)
[:water]

Vectors and maps

Casting a map to a seq results in a vector of key/value pairs:

user> (def pet {:name "Felix" :age 4})
#'user/pet
user> pet
{:age 4, :name "Felix"}
user> (vec pet)
[[:age 4] [:name "Felix"]]

One can use some of the usual operations on seq as well:

user> (conj pet [:mood "Grumpy"])
{:age 4, :name "Felix", :mood "Grumpy"}
user> (count pet)
2

Sometimes have to cast explicitly though:

user> (nth pet 1)
UnsupportedOperationException nth not supported on this type: PersistentHashMap  clojure.lang.RT.nthFrom (RT.java:798)
user> (nth (seq pet) 1)
[:name "Felix"]

Entries are actually objects of type MapEntry:

user> (class (first pet))
clojure.lang.MapEntry

Maps can be used for destructuring:

user> (doseq [[k v] pet] (prn k) (prn v))
:age
4
:name
"Felix"
nil

They also support key and val operations which return key and value of the individual entry though:

user> (key (first pet))
:age
user> (val (first pet))
4

Vectors: Non-uses

  • Vectors are not sparse, ie. indices are sequential.
  • Vectors are inefficient when used as queues

Lists: Code-Form

Lists are used for code-form, ie. to call funcs and macros, etc. In general, use vectors for data instead of lists. Lists can be used as stacks though, and show slightly different performance characteristics due to different memory layout when compared to vectors. Don't use lists to look something up by index (elements have to walked), and of course don't use lists when you really want a set. Also, lists cannot be used as queues.

Persistent Queues

Persistent queues are collections, and don't lend themselves as workflow or communication mechanisms, for instance as a tool to funnel jobs to worker processes. There are different data structures that deal with those use cases.

There is no special literal syntax. Use the EMPTY queue to instantiate a new one. The printed representation is not very informative either:

user> (def q clojure.lang.PersistentQueue/EMPTY)
#'user/q
user> q
#<PersistentQueue clojure.lang.PersistentQueue@1>

Use conj to add elements to the end:

user> (def q (conj clojure.lang.PersistentQueue/EMPTY :me :myself :i))
#'user/q

By overloading the print method we can give queues a nice printed representation, for instance the "fish":

user> (defmethod print-method clojure.lang.PersistentQueue [q, w]
        (print-method '<- w)
        (print-method (seq q) w)
        (print-method '-< w))
#<MultiFn clojure.lang.MultiFn@7423c7f5>
user> q
<-(:me :myself :i)-<

Persistent Sets

Persistent Sets implement the customary set behaviour, ie. a collection of unique elements. Elements that compare the same are not unique regarding this. Note that the equality rules regarding types mentioned above come into play here, eg. sets, maps and sequentials will never compare equal, even though their elements might be. Use the into function to add elements to sets:

user> (def all #{:void :air :fire :water :earth})
#'user/all
user> (into all [:air])
#{:water :void :fire :air :earth}
user> (into #{[:dust]} '(:dust))
#{[:dust] :dust}
user> (into #{[:dust]} ['(:dust)])
#{[:dust]}

Sets double as functions which return a given element if its contained in the set, and nil otherwise:

user> (#{:red :blue} :red)
:red
user> (#{:red :blue} :green)
nil

A common Clojure idiom to check if a sequential contains any of the given values is to use the some function with a set predicate. The some function will return the first logically true value of the sequential when the predicate is applied, nil otherwise:

user> (some even? [1 2 3])
true
user> (some even? [1 3 5])
nil

With a set predicate:

user> (some #{:a :b} [:b :c :d])
:b
user> (some #{:a :b} [:c :d])
nil

Create sorted sets with the sorted-set function. Nota bene: all elements of the set must be comparable:

user> (sorted-set :x :a :b)
#{:a :b :x}
user> (sorted-set :x :a :b 1)
ClassCastException ...

To test if a map or a set (which are built on maps) contain a key, use the contains? function:

user> (contains? {:a 1 :b 2} :b)
true
user> (contains? #{:a :b :c} :b)
true

The clojure.set namespace holds set operators intersection, union, and difference.

Intersection and union are rather straightforward. Difference is a set subtraction:

user> (clojure.set/intersection #{:a :b :c} #{:a :b :x})
#{:a :b}
user> (clojure.set/union #{:a :b :c} #{:a :b :x})
#{:x :a :c :b}
user> (clojure.set/difference #{:a :b :c} #{:a :b :x})
#{:c}

Thinking in Maps

Maps are at the heart of every Python programmer, so I feel right at home with the extensive support for maps in Clojure.

We had literals already. Note how a map is also a function of its key:

user> (def m {:a 1 :b 2})
#'user/m
user> (m :a)
1

Generic maps are constructed via the hash-map function:

user> (hash-map :a 1 :b 2)
{:a 1, :b 2}

If we got a list of items already in place, use the apply function:

user> (apply hash-map [ :a 1 :b 2])
{:a 1, :b 2}

Any type can be a key for hashes -- eg. functions:

user> (hash-map :a 1 hash-map 2)
{:a 1, #<core$hash_map clojure.core$hash_map@3d03f309> 2}

Casting to a seq gets back a list of items (actually a special sequential type):

user> (seq m)
([:a 1] [:b 2])
user> (type (seq m))
clojure.lang.PersistentHashMap$NodeSeq

Use into to put k,v vectors into a map. If need be, cast them to a vec:

user> (into {} [[:a 1] [:b 2]])
{:a 1, :b 2}
user> (into {} (map vec '[(:a 1) (:b 2)]))
{:a 1, :b 2}

Wrapup

There's a lot of depth to Clojure types, and this is especially true of collection types. The seq abstraction is ubiquitous -- little wonder, it's LISP as in LISt Processing.

Next up: Functional Programming.

 · 
peter