Made For Cloud: KVM vCPU Scheduling

I was interested in understanding how CPU resource management works in a Linux KVM based solution such as oVirt, and in drawing some comparisons with other hypervisor technologies such as vSphere ESX. This isn't meant to be an in-depth analysis of every feature of each hypervisor. I'm interested in KVM, and the other references are just to help position things in my mind.

I'm going to start with ESX to create a baseline against which I can compare KVM. For CPU management [1], ESX 3.x and later uses what is affectionately known as relaxed co-scheduling (co-scheduling is also known as gang scheduling [2]). It is 'relaxed' because ESX releases prior to 3.x used strict co-scheduling. Under strict co-scheduling the rate of progress (work) is tracked for each vcpu in a VM. Some vcpus execute at faster rates than others, so an amount of 'skew', the difference in work rate, is measured and recorded. Once that skew reaches a certain limit a co-stop is performed, and the VM cannot be re-dispatched (co-start) until enough pcpus are available to schedule all of the VM's vcpus simultaneously. So, what was 'relaxed'? Rather than requiring every vcpu to be scheduled on a pcpu simultaneously at co-start, only the vcpus that accrued skew greater than the threshold need to be co-started. This means that a potentially smaller number of pcpus is required for the co-start. If sufficient pcpus are available the co-start defaults back to the strict behaviour, but the scheduler no longer insists on it when they are not.
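To make the difference between strict and relaxed co-start concrete, here is a toy sketch in Python. It is not how ESX is implemented: the threshold value, the progress units and all of the names (VCpu, skew, needs_co_stop, co_start_set) are my own assumptions for illustration. Skew is modelled simply as how far each vcpu lags behind its fastest sibling.

```python
from dataclasses import dataclass

# Hypothetical threshold in arbitrary units of progress (not a real ESX value).
SKEW_THRESHOLD = 3


@dataclass
class VCpu:
    name: str
    progress: int = 0


def skew(vcpus):
    """Return each vcpu's lag behind the fastest sibling in the same VM."""
    lead = max(v.progress for v in vcpus)
    return {v.name: lead - v.progress for v in vcpus}


def needs_co_stop(vcpus, threshold=SKEW_THRESHOLD):
    """A co-stop is triggered once any sibling lags by more than the threshold."""
    return any(lag > threshold for lag in skew(vcpus).values())


def co_start_set(vcpus, strict, threshold=SKEW_THRESHOLD):
    """Names of the vcpus that must be scheduled together to restart the VM.

    strict:  every vcpu in the VM needs a pcpu at the same time.
    relaxed: only the vcpus whose skew exceeds the threshold need one.
    """
    if strict:
        return [v.name for v in vcpus]
    lags = skew(vcpus)
    return [v.name for v in vcpus if lags[v.name] > threshold]


vm = [VCpu("vcpu0", 10), VCpu("vcpu1", 9), VCpu("vcpu2", 5), VCpu("vcpu3", 10)]
print(needs_co_stop(vm))               # True: vcpu2 lags by 5
print(co_start_set(vm, strict=True))   # all four vcpus
print(co_start_set(vm, strict=False))  # ['vcpu2']
```

In this example the strict co-start needs four free pcpus at once, while the relaxed co-start only needs one for the lagging vcpu.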

With ESX 4 and later the scheduler is further relaxed. Hypervisor execution time (on behalf of the vcpu) is no longer tracked as contributing to the skew, and the way the skew is measured and tracked has changed. Now, so long as two vcpus make progress there is no increase in skew, and vcpus that advance too far are stopped to allow the other vcpus to catch up. Underneath all of this, if there are sufficient pcpus to schedule all vcpus simultaneously then the original strict co-scheduling is performed.

ESX also has a proportional share based algorithm for CPU that calculates a proportional entitlement of a vcpu to a pcpu, as well as a relative priority that is determined from the consumption pattern. A vcpu that consumes less than its entitlement is considered higher priority than one that consumes more. This proportional entitlement is hierarchical and extends to groups of VMs as part of a resource pool.
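As a rough illustration of the idea (not the ESX algorithm itself), the sketch below divides a fixed capacity according to share values and then orders consumers by how much of their entitlement they have used. The names, the share values and the use of MHz as the unit are all assumptions I have made for the example.

```python
def entitlements(shares, capacity_mhz):
    """Split capacity proportionally to each consumer's share value."""
    total = sum(shares.values())
    return {name: capacity_mhz * s / total for name, s in shares.items()}


def dispatch_order(consumed, entitled):
    """Order consumers by consumed/entitled ratio, lowest (highest priority) first."""
    return sorted(consumed, key=lambda name: consumed[name] / entitled[name])


# Hypothetical share values and consumption figures.
shares = {"web": 2000, "db": 1000, "batch": 1000}
entitled = entitlements(shares, capacity_mhz=4000)
consumed = {"web": 1800, "db": 400, "batch": 1000}

print(entitled)                            # web is entitled to half the capacity
print(dispatch_order(consumed, entitled))  # ['db', 'web', 'batch']
```

Here db has used only 40% of its entitlement, so it is favoured over web (90%) and batch (100%).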

In reality there's more to the ESX scheduler, but I'm actually interested in how it works in the KVM world.

The KVM approach is to re-use as much of the Linux infrastructure as possible: "Don't re-invent the wheel" is the KVM mantra. This means that KVM uses the Linux CFS (Completely Fair Scheduler [3]) by default.

Within KVM, each vcpu is mapped to a Linux thread (a schedulable task inside the VM's QEMU process), which in turn utilises hardware assistance to create the necessary 'smoke and mirrors' for virtualisation. As such, a vcpu is just another task to the CFS and, importantly, to cgroups. As a resource manager, cgroups allow Linux to manage the allocation of resources, typically proportionally, and to constrain allocations. cgroups also apply to memory, network and I/O. Groups of tasks can be made part of a scheduling group so that resource allocation requirements apply to hierarchical groups of tasks.
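To show how hierarchical shares combine, here is a small Python sketch that works out the fraction of CPU each leaf group would receive under full contention. It assumes cgroup v1 style cpu.shares weights (where 1024 is the default), that every group is busy, and that tasks only live in leaf groups. The group names and the tree layout are hypothetical.

```python
def effective_fractions(tree, parent_fraction=1.0, prefix=""):
    """Return {path: fraction of total CPU} for every leaf group.

    tree maps a group name to (shares, children), where children is
    another tree of the same shape (empty dict for a leaf).
    """
    total = sum(shares for shares, _ in tree.values())
    result = {}
    for name, (shares, children) in tree.items():
        path = f"{prefix}/{name}"
        # A group's slice is its weight relative to its siblings,
        # scaled by whatever its parent received.
        fraction = parent_fraction * shares / total
        if children:
            result.update(effective_fractions(children, fraction, path))
        else:
            result[path] = fraction
    return result


# Hypothetical layout: a 'prod' pool weighted 3:1 against a 'test' pool.
hierarchy = {
    "prod": (3072, {"vm1": (1024, {}), "vm2": (1024, {})}),
    "test": (1024, {"vm3": (1024, {})}),
}

for path, fraction in effective_fractions(hierarchy).items():
    print(f"{path}: {fraction:.1%}")
```

With these weights the two prod VMs split 75% of the CPU between them and the single test VM gets the remaining 25%, but only while everything is competing for CPU.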

Note: To get the concept of a cluster resource pool, these cgroup definitions would need to be maintained at the cluster level via something like ovirt-engine and implemented by a policy agent such as VDSM.

As with ESX, the KVM 'share' based resource model typically only applies at times of contention. In order to provide 'capping' of resources, cgroups and the scheduler use the concept of 'bandwidth controls' to establish upper limits on resource consumption, even during periods when the system is unconstrained. This feature appeared in the 3.2+ kernels.
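A cap of this kind is expressed as a quota of run time per period. The sketch below only does the arithmetic and does not touch a real system. It assumes the cgroup v1 file names cpu.cfs_period_us and cpu.cfs_quota_us, a 100 ms period and a quota of -1 meaning 'no limit'. The helper names are my own.

```python
DEFAULT_PERIOD_US = 100_000  # 100 ms, assumed default for cpu.cfs_period_us


def bandwidth_settings(cpu_limit, period_us=DEFAULT_PERIOD_US):
    """Translate a cap expressed in CPUs (e.g. 1.5) into period/quota values.

    cpu_limit=None means uncapped, which is written as a quota of -1.
    """
    if cpu_limit is None:
        quota_us = -1
    else:
        quota_us = int(round(cpu_limit * period_us))
    return {"cpu.cfs_period_us": period_us, "cpu.cfs_quota_us": quota_us}


def runtime_granted_us(demand_us, quota_us):
    """Run time a group gets in one period; the rest is throttled until the next."""
    if quota_us < 0:
        return demand_us
    return min(demand_us, quota_us)


print(bandwidth_settings(1.5))               # quota of 150000 us per 100000 us period
print(runtime_granted_us(200_000, 150_000))  # 150000: capped even on an idle host
```

Unlike shares, the quota applies whether or not anything else wants the CPU, which is what makes it a cap rather than a weight.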

You're probably thinking, does this mean that KVM based vcpus can suffer from problems such as the vcpus of a multi-CPU VM spinning unnecessarily when one of its vcpus is holding a lock and isn't being dispatched? The simple answer is yes. The scheduler does not know the relationship between the tasks it is scheduling, so it could also, for example, schedule all the vcpus of a multi-CPU VM on the same pcpu.

Does Linux have a gang scheduler? Not for production use, but there is an experimental gang scheduler that you can try via this patch [4]. It is well explained in this LCA conference presentation by the author [5]. However, doesn't this mean that you then have all the issues that ESX has with skew tracking and finding ways of relaxing the co-scheduling algorithm? Strict gang scheduling has issues of its own. Fundamentally, for multi-vcpu VMs, the lock holder pre-emption problem (extended spinning on a spin lock because the vcpu holding it is not being scheduled) can be handled without gang scheduling. The hypervisor detects such a spin and stops spinning unnecessarily, preferentially allocating pcpu resources to the vcpu that is holding the spin lock. There are a variety of ways to do this: pv-spinlocks [6] or directed yield on pause loop exits [7].
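The directed yield idea can be sketched very simply. When a vcpu is caught spinning, the hypervisor hands its time slice to a sibling that is ready to run but is not currently on a pcpu, since that sibling may well be the lock holder. The Python below is a toy model under that assumption. The VCpu fields and the pick_yield_target helper are hypothetical and much simpler than what a real hypervisor does.

```python
from dataclasses import dataclass


@dataclass
class VCpu:
    vcpu_id: int
    running: bool          # currently executing on a pcpu
    runnable: bool = True  # has work to do (is not halted or blocked)


def pick_yield_target(spinner, siblings):
    """Choose a sibling to donate the spinner's time slice to.

    A vcpu that is runnable but not running has been pre-empted, so it is
    a plausible lock holder. Returns None if no such sibling exists.
    """
    for candidate in siblings:
        if candidate is spinner:
            continue
        if candidate.runnable and not candidate.running:
            return candidate
    return None


vm = [VCpu(0, running=True), VCpu(1, running=False), VCpu(2, running=True)]
target = pick_yield_target(spinner=vm[0], siblings=vm)
print(target.vcpu_id if target else None)  # 1: the pre-empted sibling
```

Note that the hypervisor does not need to know which vcpu actually holds the lock. Boosting a pre-empted sibling is a heuristic, which is one reason it is cheaper than tracking skew across the whole VM.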

In summary, because KVM inherits a rapidly expanding feature set from Linux, it provides a rich, scalable way of managing multi-vcpu VMs without having to resort to strict co-scheduling.

[1] VMware® vSphere(TM): The CPU Scheduler in VMware ESX® 4.1 Whitepaper
[2] http://en.wikipedia.org/wiki/Gang_scheduling