A simple Disruption Manager Service.
The disruption-manager may be used to limit the number of concurrent disruptive updates to subs. While most updates are usually non-disruptive, some may cause an essential service to restart. These are marked in the image trigger as HighImpact so that subd can call the Disruption Manager.
This simple Disruption Manager Service reads tags from the MDB data for a machine to determine whether to permit a disruptive update or deny until later. The relevant tags are:
DisruptionManagerGroupIdentifier: the identifier of the group in which the machine is counted. Example values are NomadNodes for Nomad workers, Kubelets for Kubernetes nodes and Prometheus for Prometheus collectors. If unspecified, the value of the RequiredImage field is used as the group identifier. If the empty string is specified, the machine is counted as part of the default global group. If the group identifier changes while a machine is not in the denied disruption state, the behaviour is undefined.
A readiness check may also be specified. This may be used to give a service instance time to become ready before another instance is disrupted. Another service instance is not disrupted until the check returns an HTTP 200 status code to signify readiness, or until the DisruptionManagerReadyTimeout is reached (default 15 minutes if unspecified). Go template expansion is applied to this string, using the MDB Machine data.
The disruption-manager provides a web interface on port
6979 which provides a status page, access to performance metrics and logs. If disruption-manager is running on host myhost then the URL of the main status page is http://myhost:6979/. An RPC over HTTP interface is also provided over the same port.
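The tag handling described earlier (the group identifier fallback and the Go template expansion) can be illustrated with a minimal sketch. It is not taken from the disruption-manager source: the machine struct is a reduced, hypothetical stand-in for the MDB machine data, "unspecified" is assumed to mean that the tag is absent, the use of text/template (rather than html/template) is an assumption, and the template string and port 8080 are invented for illustration.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// machine is a hypothetical, reduced stand-in for the MDB machine data.
type machine struct {
	Hostname      string
	RequiredImage string
	Tags          map[string]string
}

// groupIdentifier applies the fallback rules: use the tag if it is present
// (even if empty), otherwise fall back to the RequiredImage field.
func groupIdentifier(m machine) string {
	if id, ok := m.Tags["DisruptionManagerGroupIdentifier"]; ok {
		return id // An empty string selects the default global group.
	}
	return m.RequiredImage
}

// expandTemplate expands a Go template string using the machine data.
func expandTemplate(text string, m machine) (string, error) {
	tmpl, err := template.New("ready").Parse(text)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, m); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	m := machine{
		Hostname:      "nomad-node-0",
		RequiredImage: "example-image",
		Tags:          map[string]string{"BusinessUnit": "core-team"},
	}
	// No group tag is set, so the RequiredImage value is used.
	fmt.Println("group:", groupIdentifier(m))
	// The template string below is a made-up example.
	url, err := expandTemplate("http://{{.Hostname}}:8080/ready", m)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("ready URL:", url)
}
```

Distinguishing an absent tag from an empty one with the two-value map lookup keeps the "default global group" case separate from the RequiredImage fallback.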
disruption-manager is started at boot time, usually by one of the provided init scripts. The disruption-manager process is baby-sat by the init script; if the process dies the init script will re-start disruption-manager. It may be stopped with the command:
service disruption-manager stop
which also kills the baby-sitting init script. It may be started with the command:
service disruption-manager start
There are many command-line flags which may change the behaviour of disruption-manager but the defaults should be adequate for most deployments. Built-in help is available with the command:
disruption-manager -h
RPC access is restricted using TLS client authentication. Disruption-Manager expects a root certificate in the file /etc/ssl/CA.pem, which it trusts to sign certificates that grant access.
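For readers unfamiliar with TLS client authentication, the following sketch shows how a generic Go server could be configured to accept only clients whose certificates are signed by a trusted root. It illustrates the general technique with the standard library and is not the disruption-manager implementation; the function name clientAuthConfig is hypothetical.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
	"os"
)

// clientAuthConfig builds a server-side TLS configuration that requires
// every client to present a certificate signed by a CA in caPEM.
func clientAuthConfig(caPEM []byte) (*tls.Config, error) {
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		return nil, errors.New("no certificates found in CA data")
	}
	return &tls.Config{
		ClientCAs:  pool,
		ClientAuth: tls.RequireAndVerifyClientCert,
	}, nil
}

func main() {
	// The path is the one named in the text above.
	caPEM, err := os.ReadFile("/etc/ssl/CA.pem")
	if err != nil {
		fmt.Println("cannot read CA file:", err)
		return
	}
	cfg, err := clientAuthConfig(caPEM)
	if err != nil {
		fmt.Println("cannot build TLS config:", err)
		return
	}
	fmt.Println("client certificates required:",
		cfg.ClientAuth == tls.RequireAndVerifyClientCert)
}
```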
The Disruption Manager receives requests with MDB data for a machine and the requested operation. The preferred protocol is SRPC. The supported operations are:
Any other request will return an error.
Regardless of the (valid) argument provided, the (new) disruption state is returned, and may be one of the following:
A machine which is in permitted or requested state for more than an hour since the last request operation will move to the denied state.
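The expiry rule above can be expressed as a small function. This is a sketch under assumptions: the function and constant names are hypothetical, states are modelled as plain strings, and "more than an hour" is read as strictly greater than one hour.

```go
package main

import (
	"fmt"
	"time"
)

// stateTimeout is the one hour limit stated in the text.
const stateTimeout = time.Hour

// effectiveState returns the state a machine should be in, given its
// recorded state and the time elapsed since its last request operation.
func effectiveState(state string, sinceLastRequest time.Duration) string {
	expirable := state == "permitted" || state == "requested"
	if expirable && sinceLastRequest > stateTimeout {
		return "denied"
	}
	return state
}

func main() {
	fmt.Println(effectiveState("permitted", 30*time.Minute))
	fmt.Println(effectiveState("requested", 90*time.Minute))
}
```

A consequence of this rule is that a client waiting in the requested state needs to repeat its request within the hour to avoid falling back to denied.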
As an alternative to the SRPC interface, a POST request may be sent to the /api/v1/request endpoint, containing a JSON-encoded payload with the machine MDB data and the requested operation. For example, a request for disruption:
{
    "MDB": {
        "Hostname": "nomad-node-0",
        "Tags": {
            "BusinessUnit": "core-team",
            "DisruptionManagerGroupIdentifier": "NomadNodes"
        }
    },
    "Request": "request"
}
The following response would be returned if disruption is permitted:
{
    "Response": "permitted"
}
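A client for this endpoint might look like the sketch below. The struct types model only the fields shown in the example payloads, and all type and function names are hypothetical. The Content-Type header and the expectation of a 200 status on success are assumptions, and the sketch omits any authentication the endpoint may require. So that the program runs without a real service, main starts a local stand-in server that always answers "permitted".

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// machine models only the MDB fields shown in the example payload.
type machine struct {
	Hostname string
	Tags     map[string]string
}

type disruptionRequest struct {
	MDB     machine
	Request string
}

type disruptionResponse struct {
	Response string
}

// encodeRequest builds the JSON payload for the given machine and operation.
func encodeRequest(hostname string, tags map[string]string, op string) ([]byte, error) {
	return json.Marshal(disruptionRequest{
		MDB:     machine{Hostname: hostname, Tags: tags},
		Request: op,
	})
}

// decodeResponse extracts the disruption state from a JSON response body.
func decodeResponse(body []byte) (string, error) {
	var resp disruptionResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return "", err
	}
	return resp.Response, nil
}

// postRequest sends the payload to the /api/v1/request endpoint under baseURL.
func postRequest(baseURL string, payload []byte) (string, error) {
	resp, err := http.Post(baseURL+"/api/v1/request", "application/json",
		bytes.NewReader(payload))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status: %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return decodeResponse(body)
}

func main() {
	// Stand-in for the real service, for demonstration only.
	srv := httptest.NewServer(http.HandlerFunc(
		func(w http.ResponseWriter, r *http.Request) {
			fmt.Fprint(w, `{"Response": "permitted"}`)
		}))
	defer srv.Close()
	payload, err := encodeRequest("nomad-node-0", map[string]string{
		"BusinessUnit":                     "core-team",
		"DisruptionManagerGroupIdentifier": "NomadNodes",
	}, "request")
	if err != nil {
		fmt.Println("encode error:", err)
		return
	}
	state, err := postRequest(srv.URL, payload)
	if err != nil {
		fmt.Println("request error:", err)
		return
	}
	fmt.Println("disruption state:", state)
}
```

Against a real deployment, srv.URL would be replaced by the base URL of the disruption-manager, for example http://myhost:6979.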