Introducing MicroCeph

Intro

This post introduces MicroCeph, a Ceph packaging we've been working on extensively over the last few months. First, I'll talk a bit about Ceph and its workings, then describe MicroCeph, the problems it solves and why I think it's cool, and then walk through some examples of setting up and using MicroCeph:

  • setting up a single node cluster

  • setting up S3 object storage

  • using pseudo disks or machines for testing

  • and setting up a multi-node cluster.

About Ceph

At its heart, Ceph is a distributed storage system engineered to carry out massive data management tasks, which it accomplishes through the use of the RADOS (Reliable Autonomic Distributed Object Store) system. Its architecture is designed to distribute data across various machines in a scalable fashion.

To understand Ceph, one must consider how it processes data. Instead of keeping data in a single block, Ceph disassembles it into smaller pieces, known as objects or "chunks". These objects then get evenly distributed across various machines and disks. This technique is a bit like distributing data across a RAID, except the data is distributed across a cluster of storage machines.

Data Safety and Replication

Data safety is of paramount importance in any storage system. One way Ceph ensures this is by copying objects across the cluster, a process known as data replication. If a given node or disk fails, the data is still safely stored in another location within the cluster. This distributed replication mechanism is core to Ceph's ability to offer fault tolerance and high availability.

The nice thing is: Ceph's data distribution and failover processes are fully automatic, removing the need for manual management. These automated processes save substantial time and resources. Have a disk go bad, a machine crash or a complete rack go offline? Ceph will, if resources are available, automatically recover from such a situation and you might not even notice, apart from increased bandwidth and IOPS usage while data is reshuffled – no client I/O is interrupted. It's really quite neat!

A key concept here is the replication factor: how many copies of your data Ceph will keep around. This is configurable, but a good default is RF (replication factor) = 3, i.e. 3 copies of your data are kept. One practical advantage of 3 copies is that it's the minimum number where you can take a majority vote should the copies ever get out of sync.
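
The replication factor is a per-pool setting in Ceph, so once a cluster is up (as in the examples below) you can inspect or change it with the standard Ceph CLI. A small sketch, using a hypothetical pool name mypool:

ubuntu@stor-0:~$ sudo ceph osd pool get mypool size   # prints e.g. "size: 3"
ubuntu@stor-0:~$ sudo ceph osd pool set mypool size 3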

Another concept to understand is the "failure domain". If your failure domain is at host level, your cluster is guaranteed to survive the loss of a machine; if it's at rack level, the cluster guarantees there will be no data loss even if a complete rack is lost, and so on. In practical terms this determines where Ceph will place your data copies (replicas). E.g. with failure domain host and replication factor 3, Ceph will ensure copies of your data are placed on 3 different machines. This of course means you will need a minimum of 3 machines! The same applies at other failure domain levels: from failure domain disk (the smallest) up to racks, rooms or complete data centers.
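
On a running cluster you can see which failure domain the placement rules use by dumping the CRUSH rules. A sketch – replicated_rule is the usual default rule name but may differ on your setup; the "type" in the chooseleaf step is the failure domain (e.g. host or osd):

ubuntu@stor-0:~$ sudo ceph osd crush rule ls
ubuntu@stor-0:~$ sudo ceph osd crush rule dump replicated_rule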

So, Why MicroCeph?

Using Ceph can be demanding due to its inherent complexities. These complexities, designed to make Ceph a one-size-fits-all general-purpose storage solution, can be arduous for small clusters, impacting overall Ceph adoption. Enter MicroCeph.

MicroCeph is focused on the small scale. It simplifies deploying and operating a Ceph cluster, making it easier to consume. The solution fits small but growing clusters, testing environments, home labs, and development instances. It offers minimal overhead, quick and predictable deployments, and straightforward operations for bootstrapping, adding OSDs, and enabling and placing services.

Implementation Notes

MicroCeph is installed via a snap package, which keeps installations uniform and isolated from the host, for reliability and consistency.

Within the package, MicroCeph manages the clustering of hosts, which in turn supports the underlying cluster of Ceph services. A distributed SQLite store tracks nodes, disks, service placement, and configuration, thus offering centralized administration.

Snaps should work on most Linux distributions. The examples below were done on an Ubuntu Jammy install.

Example I: setting up a single-node MicroCeph cluster

On to some hands-on examples!

MicroCeph can scale down to a single node with 3 disks and 4GB of memory. Note that MicroCeph uses complete disk drives; it does not work with disk partitions.
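
If you're unsure which drives are free, plain lsblk gives a quick overview (as also shown in Example III below); drives without partitions or mountpoints are candidates:

ubuntu@stor-0:~$ lsblk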

Installation and bootstrapping a single node cluster would look like this:

ubuntu@stor-0:~$ sudo snap install microceph
microceph 0+git.02d9e5d from Canonical** installed
ubuntu@stor-0:~$ sudo microceph cluster bootstrap
ubuntu@stor-0:~$ # no output

This sets up the Ceph cluster itself. We can check the Ceph status with:

ubuntu@stor-0:~$ sudo ceph -s
  cluster:
    id:     3a22cbc6-4919-42da-b7cf-d13fb434ac1f
    health: HEALTH_WARN
            OSD count 0 < osd_pool_default_size 3

  services:
    mon: 1 daemons, quorum stor-0 (age 98s)
    mgr: stor-0(active, since 92s)
    osd: 0 osds: 0 up, 0 in

  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:       

Our Ceph cluster has a HEALTH_WARN condition because we don't have any disks (OSDs in Ceph parlance) configured yet – so no place to actually store data.

To set up MicroCeph you need at least 3 unused disks. Note that the disks will be completely managed by MicroCeph; do not use disks holding any kind of valuable data!

Adding disks is done with microceph disk add. Assuming our 3 disks are named /dev/sdb, /dev/sdc and /dev/sdd, adding them comes down to running a loop like:

ubuntu@stor-0:~$ for d in sdb sdc sdd ; do
    sudo microceph disk add /dev/$d --wipe
done

Note the --wipe flag at the end of the command? This tells MicroCeph to clear any data that might still linger on the disks. Again, don't use disks with any valuable data.
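
To double-check that the disks were registered, MicroCeph can list them (a quick sanity check; output omitted here):

ubuntu@stor-0:~$ sudo microceph disk list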

As an aside, if you don't have 3 unused disks but still want to try out MicroCeph see below for some options.

With 3 disks now added to the cluster, Ceph status looks better:

ubuntu@stor-0:~$ sudo ceph -s
  cluster:
    id:     3a22cbc6-4919-42da-b7cf-d13fb434ac1f
    health: HEALTH_OK

  services:
    mon: 1 daemons, quorum stor-0 (age 6m)
    mgr: stor-0(active, since 6m)
    osd: 3 osds: 3 up (since 10s), 3 in (since 12s)

  data:
    pools:   1 pools, 1 pgs
    objects: 2 objects, 577 KiB
    usage:   68 MiB used, 112 GiB / 112 GiB avail
    pgs:     1 active+clean  

This tells us we have a monitor daemon, a mgr daemon, and 3 OSDs configured, and internal health checks are OK.

Congratulations, you just set up a fully functional Ceph cluster!

Example II: Configuring S3 Object Storage

A popular use case for MicroCeph is as a backend for S3-compatible object storage via the RadosGW service. Imagine you are developing an app that can utilize S3 storage and would like to test it from your dev environment – with MicroCeph you could point your app at your local cluster setup for testing S3.

Enabling S3-compatible storage:

ubuntu@stor-0:~$ sudo microceph enable rgw

The rgw here refers to the Ceph RadosGW service.
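
A quick way to confirm the gateway is responding (a sketch, assuming the default RadosGW port of 80) is an anonymous request, which should return a small ListAllMyBucketsResult XML document:

ubuntu@stor-0:~$ curl -s http://localhost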

To make use of this service you will need to configure a user. The command below will create myuser and display an S3 access_key and a secret_key:

ubuntu@stor-0:~$ sudo radosgw-admin user create --uid=myuser --display-name=myuser
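
The command prints a JSON document containing a keys array. If you have the jq utility installed, a convenient sketch for pulling the keys out again later is:

ubuntu@stor-0:~$ sudo radosgw-admin user info --uid=myuser | jq -r '.keys[0].access_key, .keys[0].secret_key'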

You can now configure your S3-compatible client to access the Ceph RadosGW endpoint. For instance, using the s3cmd command-line client you'd access the endpoint on the stor-0 machine like below. Pass in the name or IP address of your MicroCeph machine as well as the access and secret keys created for myuser:

# Create a bucket
peter@pirx:~$ s3cmd --host stor-0 --host-bucket="stor-0/%(bucket)" --access_key=fooAccessKey --secret_key=fooSecretKey --no-ssl mb s3://testbucket
Bucket 's3://testbucket/' created

# Upload an image and make it publicly accessible
peter@pirx:~$ s3cmd --host stor-0 --host-bucket="stor-0/%(bucket)" --access_key=fooAccessKey --secret_key=fooSecretKey --no-ssl put -P robben.jpeg s3://testbucket
upload: 'robben.jpeg' -> 's3://testbucket/robben.jpeg'  [1 of 1]
 216491 of 216491   100% in    1s   111.74 KB/s  done
Public URL of the object is: http://stor-0/testbucket/robben.jpeg

The image robben.jpeg is then publicly accessible at http://stor-0/testbucket/robben.jpeg.
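
As a quick smoke test from any machine that can reach stor-0, you can fetch the object anonymously (it was uploaded with -P, i.e. public):

peter@pirx:~$ curl -sO http://stor-0/testbucket/robben.jpeg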

Example III: Pseudo Machines or Disks

If you want to try out MicroCeph but don't have a machine with 3 unused disks around, there are a few options. Naturally they won't offer the same level of performance and redundancy as running natively, but they should work fine for testing, experimentation or development.

Option 1: Run in a VM

MicroCeph happily runs virtualized. You will need virtualization software that can add extra block devices, though.

If you are running Linux, LXD VMs would be an excellent option; but other virtualization applications like virt-manager, VirtualBox, Parallels or VMware should work as well.

To set up an LXD VM with 3 disks:

peter@pirx:~$ lxc launch images:ubuntu/22.04/cloud vm-0 --vm -c limits.cpu=2 -c limits.memory=4GiB
Creating vm-0
Starting vm-0

peter@pirx:~$ for i in $(seq 1 3); do
    lxc storage volume create default osd-$i --type block size=4GiB
    lxc config device add vm-0 osd-$i disk pool=default source=osd-$i
done

Storage volume osd-1 created
Device osd-1 added to vm-0
Storage volume osd-2 created
Device osd-2 added to vm-0
Storage volume osd-3 created
Device osd-3 added to vm-0


peter@pirx:~$ lxc exec vm-0 bash
root@vm-0:~# lsblk 
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
sda      8:0    0   24G  0 disk 
├─sda1   8:1    0  100M  0 part /boot/efi
└─sda2   8:2    0 23.9G  0 part /
sdb      8:16   0    4G  0 disk 
sdc      8:32   0    4G  0 disk 
sdd      8:48   0    4G  0 disk 
root@vm-0:~# # ... install MicroCeph and add disks sdb, sdc and sdd

The above shows that you now have a VM with 3 disks sdb, sdc and sdd. Install MicroCeph as in the first example.
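
Spelled out, the steps inside the VM mirror Example I (running as root in the VM shell, so no sudo needed):

root@vm-0:~# snap install microceph
root@vm-0:~# microceph cluster bootstrap
root@vm-0:~# for d in sdb sdc sdd ; do microceph disk add /dev/$d --wipe ; done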

Option 2: Create loopback devices

If you have a suitable machine but lack disks, one option is to run OSDs on loopback devices. These are simulated block devices backed by files on disk. Like the virtualization option above this won't provide real redundancy, but it might still be fine for your use case.

Loopback devices come standard with Linux. E.g. to create 3 loopback devices of 4G each, backed by files in the /srv directory:

ubuntu@small-0:~$ for i in $(seq 1 3); do
    loop_file="$(sudo mktemp -p /srv mctest-${i}-XXXX.img)"
    sudo truncate -s 4G "${loop_file}"
    sudo losetup --show -f "${loop_file}"
done

/dev/loop3
/dev/loop4
/dev/loop5

This creates 3 block devices named /dev/loopX. One wrinkle, though, is that current snapd implementations don't accept block devices with this naming pattern. There's a patch merged, but not yet released, to address this.

Until that patch is released we need a workaround: binding the loopback devices to device names that snapd does accept.

To do this, run mknod with a major device number of 7 (for loopback devices) and, as the minor number, the number of the loopback device. E.g. with the loopback devices from the example above, /dev/loop3, /dev/loop4 and /dev/loop5, you'd run:

ubuntu@small-0:~$ sudo mknod -m 0660 /dev/sdia b 7 3
ubuntu@small-0:~$ sudo mknod -m 0660 /dev/sdib b 7 4
ubuntu@small-0:~$ sudo mknod -m 0660 /dev/sdic b 7 5

This creates the block devices /dev/sdia, /dev/sdib and /dev/sdic and maps them to /dev/loop3, /dev/loop4 and /dev/loop5.

With this in place, instruct MicroCeph to create OSDs on the simulated disks:

ubuntu@small-0:~$ for d in sdia sdib sdic ; do
    sudo microceph disk add /dev/$d
done

Example IV: Scaling up MicroCeph to a multi-node cluster

So far we have looked at single-node deployments. For added redundancy, to spread load and to add capacity, it's of course often useful to deploy MicroCeph across a cluster of machines.

As the cluster size increases, MicroCeph automatically deploys Ceph service daemons on the first three nodes. This boosts redundancy and availability. If desired, additional instances of services can be added to more machines.
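
For example, to place an extra RadosGW instance on a specific node you can pass a placement target to the enable command. A sketch; check microceph enable --help on your version for the exact flag name:

ubuntu@node-0:~$ sudo microceph enable rgw --target node-1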

As a test scenario, assume we have 3 machines with one extra disk each which we want to cluster. Our end state then would be a Ceph cluster with services and data in a 3x redundant configuration.

For this example assume our 3 machines will be called node-0, node-1, node-2; assume each has a /dev/sda for the rootfs, and an extra /dev/sdb which we'll use for an OSD.

I'll start by ssh'ing into the nodes and installing the MicroCeph snap:

peter@pirx:~$ for i in 0 1 2 ; do ssh node-$i -- sudo snap install microceph ; done
microceph 0+git.02d9e5d from Canonical** installed
microceph 0+git.02d9e5d from Canonical** installed
microceph 0+git.02d9e5d from Canonical** installed

Next, I'll bootstrap on node-0 (we could pick any node; they are all equal peers):

peter@pirx:~$ ssh node-0 -- sudo microceph cluster bootstrap  

I'll then need to join nodes 1 and 2 to the cluster.

To do this I generate a token for each joining node by ssh'ing into node-0 and running microceph cluster add. This outputs a token which I'll store in a variable.

Then I join the cluster by ssh'ing into node-1 and node-2 and running microceph cluster join with the token. This looks something like:

# Generate tokens for node-1, node-2
peter@pirx:~$ tok1=$( ssh node-0 -- sudo microceph cluster add node-1 )
peter@pirx:~$ tok2=$( ssh node-0 -- sudo microceph cluster add node-2 )

# Use token to join the cluster
peter@pirx:~$ ssh node-1 -- sudo microceph cluster join $tok1
peter@pirx:~$ ssh node-2 -- sudo microceph cluster join $tok2

# Check: we now have 3 nodes (but no OSDs):
peter@pirx:~$ ssh node-0 -- sudo microceph status
MicroCeph deployment summary:
- node-0 (10.0.8.155)
  Services: mds, mgr, mon
  Disks: 0
- node-1 (10.0.8.75)
  Services: mds, mgr, mon
  Disks: 0
- node-2 (10.0.8.42)
  Services: mds, mgr, mon
  Disks: 0

The last command above is microceph status, which prints a list of nodes, services and disks known to the cluster (IP addresses will of course vary).

As we can see above we don't have any OSDs to actually store things on, so let's add our spare /dev/sdb on each node:

peter@pirx:~$ for i in 0 1 2 ; do ssh node-$i -- sudo microceph disk add /dev/sdb ; done

To verify, ssh into node-0 and print the status:

peter@pirx:~$ ssh node-0
Last login: Thu Aug 10 16:19:34 2023 from 10.0.8.1
ubuntu@node-0:~$ sudo microceph status
MicroCeph deployment summary:
- node-0 (10.0.8.155)
  Services: mds, mgr, mon, osd
  Disks: 1
- node-1 (10.0.8.75)
  Services: mds, mgr, mon, osd
  Disks: 1
- node-2 (10.0.8.42)
  Services: mds, mgr, mon, osd
  Disks: 1
ubuntu@node-0:~$ ceph -s
  cluster:
    id:     775d16b4-159a-4a55-a714-0ae420b5bf70
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum node-0,node-1,node-2 (age 9m)
    mgr: node-0(active, since 13m), standbys: node-1, node-2
    osd: 3 osds: 3 up (since 79s), 3 in (since 81s)

  data:
    pools:   1 pools, 1 pgs
    objects: 2 objects, 577 KiB
    usage:   68 MiB used, 112 GiB / 112 GiB avail
    pgs:     1 active+clean

Sweet, we've got ourselves a 3-node Ceph cluster with 3 OSDs in total!

Adding more nodes and disks works in a similar way. To spread data and IOPS evenly it's usually best to add nodes in multiples of the replication factor, i.e. with the default RF=3 you'd go for 3, 6, 9, … nodes, each node ideally having the same disk configuration.
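
As a sketch, growing this example from 3 to 6 nodes (with hypothetical machines node-3 through node-5, each again with a spare /dev/sdb) repeats the install/join/add-disk cycle:

peter@pirx:~$ for i in 3 4 5 ; do ssh node-$i -- sudo snap install microceph ; done
peter@pirx:~$ for i in 3 4 5 ; do
    tok=$( ssh node-0 -- sudo microceph cluster add node-$i )
    ssh node-$i -- sudo microceph cluster join $tok
    ssh node-$i -- sudo microceph disk add /dev/sdb
done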

Coda

I'm really happy with how MicroCeph has turned out so far – although it's still a young project and there are lots of things I'd like to add: better support for scaling down clusters, auto-clustering for even more convenient setup, automatically adding disks, and automatically creating loopback devices for testing and lab setups.

Let me know what you think – comments and suggestions welcome.

Some helpful links: