Maintenance
Shutting down/starting all instances:
gnt-instance stop|start --all [--no-remember]
Blocking/Unblocking jobs:
gnt-cluster queue [un]drain
Pausing/resuming the watcher:
gnt-cluster watcher pause <timespec>|continue
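For example (a sketch only; the 3600-second window is an arbitrary value, adjust it to the expected length of the maintenance), the watcher can be paused around a maintenance window and resumed afterwards:
gnt-cluster watcher pause 3600   # do not auto-restart instances for an hour
# ... perform the maintenance work ...
gnt-cluster watcher continue     # resume normal watcher operation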
How to use your cluster
In a big cluster you want to organize nodes into groups. Ganeti will make sure instances' primary and secondary nodes are in the same group.
Rule of thumb: One group per subnet:
# gnt-group add group2
# gnt-group rename default group1
# gnt-group assign-nodes group2 node20 node21 node22 ...
# gnt-instance change-group --to group1 instance_name
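To sanity-check the resulting layout (a sketch using standard query fields and the example names above):
gnt-group list                           # overview of all node groups
gnt-node list -o name,group              # which group each node belongs to
gnt-instance list -o name,pnode,snodes   # primary/secondary placement per instance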
Master failover:
# run this on a master candidate
gnt-cluster master-failover
# use --no-voting on a 2-node cluster
(An experimental Linux-HA integration is present in 2.7)
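To confirm the failover worked, the cluster can be asked who the master is and which nodes remain master candidates (a sketch, no extra setup assumed):
gnt-cluster getmaster                    # prints the current master node
gnt-node list -o name,master_candidate   # remaining master candidates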
We can move instances off a node when we want to perform maintenance on it.
Drain the node, move its instances, check, then set it offline:
gnt-node modify -D yes node2   # mark as "drained"
gnt-node migrate node2         # migrate instances
gnt-node evacuate node2        # remove DRBD secondaries
gnt-node info node2            # check your work
gnt-node modify -O yes node2   # mark as "offline"
It is now safe to power off node2
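Before actually powering it off, it can be worth double-checking that node2 no longer holds any instances (a sketch using standard node fields):
gnt-node list -o name,offline,pinst_cnt,sinst_cnt node2
# pinst_cnt and sinst_cnt should both be 0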
If a node (node3 here) dies unexpectedly:
# set the node offline
gnt-node modify -O yes node3
# use --auto-promote, or manually promote another node,
# if node3 was a master candidate
(This step can also be automated using Linux-HA)
# failover instances to their secondaries
gnt-node failover --ignore-consistency node3
# or, for each instance:
gnt-instance failover \
  --ignore-consistency web
# restore redundancy
gnt-node evacuate -I hail node3
# or, for each instance:
gnt-instance replace-disks \
{-n node1 | -I hail } web
(The autorepair tool in Ganeti 2.7 can automate these two steps)
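Once the failover and replace-disks jobs have finished, a couple of standard checks confirm that redundancy is back (a sketch, reusing the example names):
gnt-cluster verify-disks                        # check and re-activate instance disks
gnt-instance list -o name,pnode,snodes,status   # placement and status of every instance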
After a node comes back:
gnt-node add --readd node3
Then it's a good idea to rebalance the cluster:
hbal -L -X
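hbal only prints the proposed moves unless -X is given, so the plan can be previewed before executing it:
hbal -L      # dry run: compute and show the planned moves
hbal -L -X   # execute the moves through the master daemon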
Graceful shutdown of the whole cluster before powering off the nodes:
gnt-cluster verify
gnt-cluster queue drain
gnt-cluster watcher pause 6000
gnt-instance stop --all --no-remember
gnt-job list --running   # check if jobs have completed
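Before powering anything off, it may help to confirm that everything really stopped (a sketch using standard query fields):
gnt-instance list -o name,status   # no instance should show as running any more
gnt-job list --running             # should eventually be empty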
Emergency shutdown (faster):
gnt-instance stop --all --no-remember
After a graceful shutdown, return the cluster to service:
gnt-cluster queue undrain
gnt-cluster watcher continue
The watcher will restart all instances within 10-20 minutes. Then check that the cluster is healthy:
gnt-cluster verify
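To see whether the watcher is really active again and how far the restarts have progressed, something like the following can be used (a sketch):
gnt-cluster watcher info           # shows whether the watcher is still paused
gnt-instance list -o name,status   # watch the instances coming back up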
Upgrading the Ganeti software on all nodes, from the master node:
alias gnt-dsh='dsh -cf /var/lib/ganeti/ssconf_online_nodes'
# Stop Ganeti
gnt-dsh /etc/init.d/ganeti stop
# Now unpack/upgrade the new version on all nodes, e.g.:
gnt-dsh apt-get install ganeti2=2.7.1-1 ganeti-htools=2.7.1-1
# Now upgrade the config and restart
/usr/lib/ganeti/tools/cfgupgrade
gnt-dsh /etc/init.d/ganeti start
gnt-cluster redist-conf
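After the restart, two quick checks confirm that the nodes agree on the new version and that the cluster is healthy (a sketch, no extra tooling assumed):
gnt-cluster version   # software and configuration versions as seen by the master
gnt-cluster verify    # full cluster health check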
For more details, see the Ganeti administrator's guide.
Questions?