Ganeti (introduction and updates)
A cluster virtualization manager.
© 2010-2011 Google
Use under GPLv2+ or CC-by-SA
Some images borrowed/modified from Lance Albertson
Ganeti at FOSDEM 2012
Saturday, 14:00 Janson, Internals (yesterday afternoon)
Sunday, 10:00 Chavanne, Getting Started (here and now)
Outline
- Introduction to Ganeti
- Latest features
- Using Ganeti in practice
- How we use Ganeti in production
What can it do?
- Manage clusters of physical machines
- Deploy Xen/KVM/LXC virtual machines on them
- Live migration
- Resiliency to failure (data redundancy over DRBD)
- Cluster balancing
- Ease of repairs and hardware swaps
Ideas
- Make the entry barrier to virtualization as low as possible
- Easy to install/manage
- No specialized hardware needed (e.g. SANs)
- Lightweight (no "expensive" dependencies)
- Scale to enterprise ecosystems
- Manage from 1 to ~200 host machines simultaneously
- Access to advanced features (DRBD, live migration)
- Be a good open source citizen
- Design and code discussions are open
- External contributions are welcome
- Cooperate with other "big scale" Ganeti users
Terminology
- Node: a virtualization host
- Nodegroup: a homogeneous set of nodes
- Instance: a virtualization guest
- Cluster: a set of nodes, managed as a collective
- Job: a Ganeti operation
Technologies
- Linux and standard utils (iproute2, bridge-utils, ssh)
- KVM/Xen/LXC
- DRBD, LVM, or SAN
- Python (plus a few modules)
- socat
- Haskell (optional)
Node roles (management level)
- Master Node
- runs ganeti-masterd, rapi, noded and confd
- Master candidates
- have a full copy of the config, can become master
- run ganeti-confd and noded
- Regular nodes
- cannot become master
- get only part of the config
- Offline nodes (out of the cluster for repairs)
Node roles (instance hosting level)
- VM capable nodes (can host instances)
- Drained nodes (no new instances are placed on them)
- Offline nodes (out of the cluster for repairs)
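A quick way to see each node's current role from the command line (illustrative; field names can vary slightly between Ganeti versions):
gnt-node list -o name,role,master_candidate,drained,offline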
New features in 2.4
The very stable version (since Mar 2011):
- out-of-band management (OOB)
- vhost-net support (KVM)
- hugepages support (KVM)
- initial nodegroups
New features in 2.5
At RC level, due for release soon:
- shared storage (SAN) support
- improved nodegroups (scalability, evacuate, commands)
- master IP turnup customization
- full SPICE support (KVM)
- node health/power/EPO commands (OOB)
New features in 2.6
Soon to be frozen:
- RBD support (Ceph)
- initial memory ballooning (KVM, Xen)
- CPU pinning
- OVF export/import support
- support for customizing DRBD parameters
- policies for better resource modeling
What to expect
Just ideas, not promises:
- Full dynamic memory support
- Better instance networking customization
- Rolling reboot
- Better automation, self-healing, availability
- Higher scalability
- KVM block device migration
- Better OS installation
Initializing your cluster
The node needs to be set up following our installation guide.
gnt-cluster init [-s ip] ... \
--enabled-hypervisors=kvm cluster
gnt-cluster
Cluster wide operations:
gnt-cluster info
gnt-cluster modify [-B/H/N ...]
gnt-cluster verify
gnt-cluster master-failover
gnt-cluster command/copyfile ...
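For example (parameter values are only illustrative):
# change the cluster-wide default instance memory
gnt-cluster modify -B memory=512
# run a shell command on every node
gnt-cluster command uptime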
Adding nodes
gnt-node add [-s ip] node2
gnt-node add [-s ip] node3
Adding instances
# install instance-{debootstrap, image}
gnt-os list
gnt-instance add -t drbd \
{-n node3:node2 | -I hail } \
-o debootstrap+default web
ping web
ssh web # easy with OS hooks
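A slightly fuller add, spelling out disk size and backend parameters (instance name and sizes are illustrative):
gnt-instance add -t drbd -I hail \
  -s 10G -B memory=512,vcpus=2 \
  -o debootstrap+default web2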
gnt-node
Per node operations:
gnt-node remove node4
gnt-node modify \
[ --master-candidate yes|no ] \
[ --drained yes|no ] \
[ --offline yes|no ] node2
gnt-node evacuate/failover/migrate
gnt-node powercycle
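For example, to empty a node before planned maintenance (a sketch):
gnt-node migrate node2 # move primary instances away
gnt-node modify --drained yes node2 # stop new allocations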
gnt-instance
Instance operations:
gnt-instance start/stop i0
gnt-instance modify ... i0
gnt-instance info i0
gnt-instance migrate i0
gnt-instance console i0
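For example, to give an instance more memory (value is illustrative; without ballooning the change applies at the next restart):
gnt-instance modify -B memory=2048 i0
gnt-instance reboot i0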
-t drbd
DRBD provides redundancy for instance data and makes live migration possible
without shared storage between the nodes.
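An existing plain (LVM-only) instance can also be converted to DRBD; a sketch (secondary node name is illustrative, the instance must be stopped):
gnt-instance stop web
gnt-instance modify -t drbd -n node2 web
gnt-instance start web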
Recovering from failure
# set the node offline
gnt-node modify -O yes node3
Recovering from failure
# failover instances to their secondaries
gnt-node failover --ignore-consistency node3
# or, for each instance:
gnt-instance failover \
--ignore-consistency web
Recovering from failure
# restore redundancy
gnt-node evacuate -I hail node3
# or, for each instance:
gnt-instance replace-disks \
{-n node1 | -I hail } web
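Once the failed node comes back from repairs, re-add it and re-check the cluster:
gnt-node add --readd node3
gnt-cluster verify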
gnt-backup
Manage instance exports/backups:
gnt-backup export -n node1 web
gnt-backup import -t plain \
{-n node3 | -I hail } --src-node node1 \
--src-dir /tmp/myexport web
gnt-backup list
gnt-backup remove
Controlling Ganeti
- Command line (*)
- Ganeti Web manager
- Developed by osuosl.org and grnet.gr
- RAPI (RESTful HTTP interface) (*)
- On-cluster "luxi" interface (*)
- luxi is currently JSON over a Unix socket
- client code exists for Python and Haskell
(*) Programmable interfaces
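For example, RAPI can be queried with plain HTTP tools (default port 5080; whether reads need authentication depends on your configuration):
curl -k https://cluster:5080/2/instances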
Job Queue
- Ganeti operations generate jobs in the master (with the exception of queries)
- Jobs execute concurrently
- You can cancel non-started jobs, inspect the queue status, and inspect jobs
gnt-job list
gnt-job info
gnt-job watch
gnt-job cancel
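Most gnt-* commands accept --submit to return a job ID immediately instead of waiting, which pairs nicely with gnt-job watch (the job ID below is illustrative):
gnt-instance migrate --submit i0
gnt-job watch 1234 # job ID as printed by --submit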
gnt-group
Managing node groups:
gnt-group add
gnt-group assign-nodes
gnt-group evacuate
gnt-group list
gnt-group modify
gnt-group remove
gnt-group rename
gnt-instance change-group
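For example, to create a second group and move a node and an instance into it (names are illustrative):
gnt-group add group2
gnt-group assign-nodes group2 node4
gnt-instance change-group --to group2 web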
Running Ganeti in production
What should you add?
- Monitoring/Automation
- Check host disks, memory, load
- Trigger events (evacuate, send to repairs, readd node, rebalance)
- Automated host installation/setup (config management)
- Self service use
- Instance creation and resize
- Instance console access
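For rebalancing, the htools balancer (the optional Haskell part) can be scripted; a sketch:
hbal -L # show the proposed moves, using the local luxi socket
hbal -L -X # execute them through the job queue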
Production cluster
As we use it in a Google Datacentre:
Fleet at Google
Instance provisioning at Google
Auto node repair at Google
Auto node readd at Google
People running Ganeti
- Google (Corporate Computing Infrastructure)
- grnet.gr (Greek Research & Technology Network)
- osuosl.org (Oregon State University Open Source Lab)
- fsffrance.org (according to docs on their website and trac)
- ...
Conclusion
Questions? Feedback? Ideas? Flames?
© 2010-2011 Google
Use under GPLv2+ or CC-by-SA
Some images borrowed/modified from Lance Albertson