Machine Birthing

Richard Gooch

Background

Growing machine capacity in a datacentre environment is often done by rolling in multiple racks of machines, wiring them in and powering them up. Once powered up, it is common for operations staff to use scripts and other automation tools to birth (install and configure) the machines. These automation tools typically build on top of other tools which were designed to birth a single machine (for example, by booting from an installation CD/ISO image). The layers of tools can make the birthing process less reliable and less efficient, and leave the machine in a state where it is ready for further configuration rather than ready for actual useful work. These tools often neglect other aspects of the machine life-cycle, such as automated repairs.

This document describes the design of a fully automated, robust, reliable and efficient architecture for (re)birthing machines at large scale. The design target is that 100 racks of machines can be turned on and within an hour all the new machines are available for real work, without any further human intervention or any preparatory software configuration. The software system that will implement this architecture is called the Birther.

The Birther system depends on the Dominator system, which is likely to be the limiting factor in how quickly machines can be made available for real work. A more focussed design target for the Birther system is therefore that it can hand machines over for Domination at a higher rate (machines per second) than the rate at which the Dominator can complete full system updates.

High-level Design

The system comprises the following components:

The following diagram shows how these components are connected: [Figure: Birther system components]

The MDB

The MDB is the sole source of truth which defines the intended state of the fleet. It lists all the known machines in the fleet and records the name, IP address, MAC address, required system image, repair state and so on.
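As a rough illustration of what one MDB entry might hold, the sketch below models the fields named above as a Python dataclass. The class name, field names, example values and the `index_by_mac` helper are all hypothetical; the real MDB schema is not specified in this document. Indexing by MAC address is shown because a PXE boot request identifies the machine by its hardware address.

```python
from dataclasses import dataclass
import ipaddress


@dataclass(frozen=True)
class MachineRecord:
    """One MDB entry (hypothetical field names, not the real schema)."""
    name: str
    ip_address: ipaddress.IPv4Address
    mac_address: str
    required_image: str
    repair_state: str


def index_by_mac(records):
    """Build a MAC -> record lookup, rejecting duplicate MAC addresses."""
    index = {}
    for record in records:
        key = record.mac_address.lower()
        if key in index:
            raise ValueError(f"duplicate MAC address in MDB: {key}")
        index[key] = record
    return index


# Example values are placeholders (documentation IP range, local MAC).
example = MachineRecord(
    name="host1",
    ip_address=ipaddress.IPv4Address("192.0.2.10"),
    mac_address="02:00:00:00:00:01",
    required_image="example-image",
    repair_state="healthy",
)
by_mac = index_by_mac([example])
```

Rejecting duplicate MAC addresses is a design choice made for this sketch: if the MDB is the sole source of truth, an ambiguous lookup is better surfaced as an error than resolved silently.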

The Birther

The Birther listens for PXE boot requests from any machine, and consults the MDB to determine what kind of response to send. In all cases, a response is sent. The following MDB states are defined:

The Birther stores PXE boot request and response statistics in the MDB so that persistently failing machines can be detected.
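The request handling described in this section can be sketched as a lookup from MDB state to boot response, with a counter per machine and response. This is a minimal sketch under stated assumptions: only the `rebirth` state is named in this document, so the other state names (`birth`, `clean`, `healthy`), the `LOCAL_DISK` response and the policy for machines missing from the MDB are hypothetical. The in-memory `Counter` stands in for the statistics that the Birther would store in the MDB.

```python
from collections import Counter
from enum import Enum


class Response(Enum):
    BOOTSTRAP_IMAGE = "bootstrap"            # full image, includes burn-in test
    FAST_BOOTSTRAP_IMAGE = "fast-bootstrap"  # same image, burn-in test skipped
    LOCAL_DISK = "local-disk"                # assumption: boot the installed system


# Hypothetical state names; only "rebirth" appears in this document.
STATE_TO_RESPONSE = {
    "birth": Response.BOOTSTRAP_IMAGE,
    "rebirth": Response.BOOTSTRAP_IMAGE,
    "clean": Response.FAST_BOOTSTRAP_IMAGE,
    "healthy": Response.LOCAL_DISK,
}


class Birther:
    def __init__(self, mdb, unknown_response=Response.LOCAL_DISK):
        self.mdb = mdb  # stand-in for the MDB: MAC address -> state name
        # Placeholder policy for machines not in the MDB (an assumption).
        self.unknown_response = unknown_response
        self.stats = Counter()  # (MAC, response) -> number of requests

    def handle_pxe_request(self, mac):
        """Always return a response, and record it for failure detection."""
        state = self.mdb.get(mac)
        response = STATE_TO_RESPONSE.get(state, self.unknown_response)
        self.stats[(mac, response)] += 1
        return response


birther = Birther({"02:00:00:00:00:01": "rebirth", "02:00:00:00:00:02": "clean"})
first = birther.handle_pxe_request("02:00:00:00:00:01")
```

A machine that keeps requesting the same image would show a steadily growing count for its `(MAC, response)` pair, which is the kind of signal a separate system could use to flag persistent failures.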

The Boot Server

The Boot Server contains a DHCP server and a TFTP server and serves requests for the Bootstrap Image. It responds to requests on both the private IP network used for temporary addresses and the main IP network used for permanent addresses. The Hypervisor in SmallStack (part of the Dominator ecosystem) contains a Boot Server which is integrated with the ecosystem (including image building and distribution). Consult your favourite search engine for generic implementations.

The Bootstrap Image

The Bootstrap Image contains:

The configuration tool performs initial setup and then hands the machine over to the Dominator.

The Fast Bootstrap Image

This is the same as the Bootstrap Image except that a burn-in test is not performed.

The Miracle of Birth

Consider the first power on of a machine. The following sequence will ensue:

Repairing (rebirthing) Machines

If a machine is found to be persistently failing (e.g. stuck in a reboot loop), a separate automated system may decide that a rebirthing is required. If so, that system will set the state of the machine in the MDB to rebirth and on the next reboot the Birther will send a PXE boot response to boot the Bootstrap Image. The flow is almost the same as above for birthing machines, with the following exceptions:

How unhealthy machines are detected, how sick they are judged to be and which steps are required to heal them are topics of another paper about Machine Lifecycle Management. The Birther and the Dominator are foundational components in a larger system.

Cleaning Machines

Cleaning a machine is almost identical to rebirthing, except that the burn-in test is not performed. This is useful if a machine is re-assigned to a different owner, so that any potentially sensitive data are removed before the machine is available to the new owner. The burn-in test is not needed (the machine is healthy), so it is best to avoid that step (which can take many minutes or even hours, depending on how exhaustive the test is). A fast re-assignment facilitates building a responsive Metal as a Service system, if so desired.

In the simplest case, data can be “cleaned” by re-making the file-systems. This restricts data exfiltration to more advanced attackers, since the old data are no longer reachable through the file-system. If the secure encryption features of the storage media are used, throwing away the old encryption keys is a fast and effective way to erase the storage media.

Calculating Performance Targets

One of the limitations on birthing machines is how quickly they can fetch the Bootstrap Image from the Boot Server. Considering the following environment:

the Boot Server should be able to service 100 fetches per second. This is much faster than the rate at which the Dominator can perform full system updates (its limit is 1 machine per second, assuming it does not have any peer-to-peer enhancements). Clearly, optimising the Birther system would be premature, and will probably never be needed.
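The arithmetic behind this comparison can be sketched as follows. The link speed, image size and machines-per-rack figures below are illustrative assumptions chosen so that the fetch rate matches the 100 fetches per second quoted above; they are not the environment parameters from the original design. The function names are hypothetical, and the estimate ignores protocol overhead, so it is an upper bound rather than a measurement.

```python
def boot_server_fetch_rate(link_gbit_per_s, image_mbytes):
    """Upper bound on Bootstrap Image fetches per second for one Boot Server."""
    link_bytes_per_s = link_gbit_per_s * 1e9 / 8
    return link_bytes_per_s / (image_mbytes * 1e6)


def seconds_to_useful_work(machines, birther_rate, dominator_rate):
    """Time to process a batch of machines, limited by the slower stage."""
    return machines / min(birther_rate, dominator_rate)


# Assumed inputs: 10 Gbit/s link, 12.5 MB image, 36 machines per rack.
fetch_rate = boot_server_fetch_rate(link_gbit_per_s=10, image_mbytes=12.5)

# From the text: 100 racks, and a Dominator limit of 1 machine per second.
total_seconds = seconds_to_useful_work(
    machines=100 * 36,
    birther_rate=fetch_rate,
    dominator_rate=1.0,
)
```

With these assumed inputs the Boot Server rate is two orders of magnitude above the Dominator rate, so the batch time is set entirely by the Dominator, which is the point the paragraph above makes.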