Tuesday, December 25, 2012

Openstack Orchestration?

Orchestration is a critical piece of cloud infrastructure. Automation and workflows across different domains (system level, enterprise level) need to be executed in a coordinated, consistent and reliable manner. They also need to be predictable and auditable, showing who did exactly what, when and, if possible, why. They must also show the state of each request and deliver metrics on the progress (or otherwise) of those processes. This is not just simple automation; it applies equally to higher-level business processes for managing the cloud infrastructure, such as customer onboarding, billing or provisioning. OpenStack uses tools such as Puppet and Chef, to some extent, to perform single-domain automation, but is this really enough?

This blog post is a thought experiment: how could BPMN 2.0 be used to improve the management and capability of OpenStack clouds?

BPMN, for those interested in learning more, is Business Process Modelling Notation. To quote: “A standard Business Process Modeling Notation (BPMN) will provide businesses with the capability of understanding their internal business procedures in a graphical notation and will give organizations the ability to communicate these procedures in a standard manner. Furthermore, the graphical notation will facilitate the understanding of the performance collaborations and business transactions between the organizations. This will ensure that businesses will understand themselves and participants in their business and will enable organizations to adjust to new internal and B2B business circumstances quickly.”

To me, this seems like a worthy addition to openstack as it allows a standardised way of providing business rigour that is relatively easy to build on.   It also allows business process specialists to design the way they want their cloud to work and communicate that effectively with engineers who need to build and deliver it.

For the practical side of this thought experiment I chose jBPM 5, the JBoss-based software from Red Hat. It is a comprehensive and well thought out piece of software and has the necessary tooling to deliver valuable business results. It’s also enterprise ready, though it appears the current commercial implementation of jBPM doesn’t yet include the BPMN functionality. Since BPMN is the strategy going forward, it should only be a matter of time before it appears there.

So, how does this thought experiment start? In our hypothetical private cloud we’d like to have a provisioning process where internal teams can request virtual instances, but if they require instances with 4 vCPUs (or more) then approval is required from the cloud administrator. Additionally, instances with >= 32GB of memory AND >= 4 vCPUs require internal financial approval from a designated finance role. In all cases an external CMDB must be updated and the approvals stored for audit purposes.

In this hypothetical world, the Horizon web application, or other third-party web applications using the OpenStack APIs, would collect the required provisioning request data and then execute the business process.

For our first hypothetical provisioning request the input parameters are 1 vCPU, 4Gb Ubuntu system.

From the diagram above, the workflow executes and passes all the business control points because the request is under the limits. The ‘provision machine’ sub-process uses the OpenStack API to perform the provisioning request. Depending on the level of complexity required, this can recover from error events or simply pass the error condition along.

For a second hypothetical provisioning request the user requests a 4 vCPU system with 4GB of memory. In this case we’re required to execute the human task ‘Cloud Admin Approval’. As this is a human approval task, the Cloud Administrator role (which could be a team) is notified that the request needs to be reviewed. At this point the process stops and waits for that approval. It should be noted that escalation points can also be managed. The Cloud Administrator either approves or rejects the request, and the process continues to either perform the provisioning or notify the requester of the rejection.

For a third hypothetical request we have 4 vCPU AND 32Gb of memory.   In this case and according to our hypothetical business rules, both the Finance approver AND the Cloud Administrator MUST approve the request before the provisioning commences.  Note:  The Cloud Administrator doesn’t get notified until the Finance Department has approved in this process flow.   Again, both of these approval steps are human tasks and the process is suspended while the approvals are sought.
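As a point of comparison, the routing rules exercised by the three requests above can be sketched in a few lines of Python. This is only an illustration of the decision logic, not of jBPM or of any OpenStack API: the names `ProvisioningRequest` and `required_approvals` are hypothetical, and the sketch assumes the finance rule applies at 4 vCPUs or more, as in the third request.

```python
from dataclasses import dataclass

# Thresholds taken from the hypothetical business rules in this post.
ADMIN_VCPU_THRESHOLD = 4   # 4 vCPUs or more needs Cloud Administrator approval
FINANCE_MEMORY_GB = 32     # 32GB or more (with 4+ vCPUs) also needs Finance approval


@dataclass
class ProvisioningRequest:
    """Hypothetical request data collected by the web front end."""
    vcpus: int
    memory_gb: int
    image: str


def required_approvals(request):
    """Return the approver roles in the order the process would ask them."""
    approvals = []
    needs_admin = request.vcpus >= ADMIN_VCPU_THRESHOLD
    # Finance is asked first, so the administrator is only notified afterwards.
    if needs_admin and request.memory_gb >= FINANCE_MEMORY_GB:
        approvals.append("finance")
    if needs_admin:
        approvals.append("cloud_admin")
    return approvals


for req in (
    ProvisioningRequest(1, 4, "ubuntu"),
    ProvisioningRequest(4, 4, "ubuntu"),
    ProvisioningRequest(4, 32, "ubuntu"),
):
    print(req.vcpus, req.memory_gb, required_approvals(req))
```

The three requests map to no approvals, Cloud Administrator approval only, and Finance followed by Cloud Administrator. Note what the sketch leaves out: waiting for a human, escalation, the audit trail and the CMDB update are all things a process engine provides around this logic.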

Could you write this logic in straight Python code within the Horizon web application? Yes, of course you could, but it should be readily apparent that using BPMN provides, at the simplest level, wonderful documentation of the process that is followed. In addition, it allows the web presentation logic and the provisioning automation to be completely decoupled from the business or implementation logic. This is a significant benefit. Furthermore, it allows for dynamic changes to the business logic being applied, with little to no coding effort required. Not only that, but you can view any request and see where it currently resides within the process. You may see that many requests are waiting for approvals from finance and that perhaps changes to spend limits are required.

Now a logical extension of this thought experiment is to encapsulate the openstack API as ‘reusable sub processes’ within the BPMN framework.  This would allow BPMN designer tools to orchestrate any part of openstack as well as managing approval processes etc. This allows complex internal logic to be expressed using a consistent and standard notation which provides valuable benefits.

As you can imagine, these types of business processes are very common, yet they are not functions that OpenStack deployments typically perform out of the box. The addition of BPMN-based orchestration makes all this (and more) possible.

Next steps for me would be to put this into a blueprint on launchpad and then make some reusable processes/code available, but before I do that, does anyone think the addition of BPMN based orchestration into openstack is a good idea?   Thoughts appreciated.

Friday, November 2, 2012

openstack (devstack) horizon broken on fresh ubuntu 12.10

Ok,  so this is a pretty minor issue but perhaps someone else might benefit from a quick solution.   I was setting up a new openstack developer environment using devstack on a brand new Ubuntu 12.10 (ubuntu-server) and oops I got the following error when accessing horizon.

It seems it's a known bug, but it didn't list a fix.   Well it's easy enough to address.  A quick edit of files/apts/horizon to add nodejs-legacy, re-run ./ and normal operation is restored.

Saturday, October 27, 2012

Getting started with oVirt - part 1

Many people will know by now that oVirt is an open source platform for deploying and managing Intel-based virtual machines. To quote the oVirt website:

The oVirt Project is an open virtualization project for anyone who cares about Linux-based KVM virtualization. Providing a feature-rich server virtualization management system with advanced capabilities for hosts and guests, including high availability, live migration, storage management, system scheduler, and more.

Let's get it installed and running to see what you can do.

The configuration I'm looking to use is three (3) x86_64 machines, each with 4GB of memory, VT-x enabled and 100GB of local disk. Additionally I have an NFS server that will provide shared storage.

Starting at the Get oVirt link you can see that oVirt consists of two (2) types of server.

  1. The engine, a jboss based web application deployed on Fedora 17 and
  2. The oVirt node. In this case I'm deploying the oVirt node image as the hypervisor on the compute nodes.

Starting with the engine, you first need to get F17 up and running; this shouldn't be too difficult. I recommend you perform a 'yum upgrade' before proceeding with the engine install. In my case I also changed the systemd default from graphical mode to multi-user, as I didn't feel the need for a GUI on the engine host.

Things to note:

  1. Your DNS needs to work !   
  2. Get your engine and the nodes into your DNS.  In my case I have,, and

Following the oVirt installation instructions, install the engine from the supplied RPMs. Once the RPMs have all been installed, run engine-setup and answer the questions.

This is the configuration I used :

oVirt Engine will be installed using the following configuration:
http-port:                     80
https-port:                    443
auth-pass:                     ********
default-dc-type:               NFS
db-remote-install:             local
db-local-pass:                 ********
nfs-mp:                        /srv/iso
iso-domain-name:               ISO
config-nfs:                    yes
override-iptables:             yes
Proceed with the configuration listed above? (yes|no): yes

Once the configuration script has executed you can go to the portal URL and log in using the credentials you specified above.

Now it's necessary to build your oVirt nodes so that they can be added into a cluster within the engine. Download the oVirt node image, burn it to a CD and boot your two hypervisor nodes with it.

I'll cover that in the next part.

Thursday, August 23, 2012

oVirt 3.1 released - try it, try it now!

Great news, well done to the team.

oVirt 3.1 includes quite a number of new features. The release notes are over at the oVirt wiki. A notable item is a new 'all in one' mode, which is great for demos as it allows you to host VMs on the same system ovirt-engine is running on. No excuses now not to try it out!

Tuesday, April 24, 2012

oVirt 3.1 release date changed

According to Ofer, the oVirt release manager, we can expect a small delay in the delivery of oVirt 3.1:

Due to multiple integration issues (Java 7, Fedora 17 and JBoss official rpm support) we've decided to postpone the next release of oVirt [1] to June 27th.

This one month delay will hopefully give us enough time to stabilize all the different layers of oVirt, and produce a better release.

Stay tuned,

Ofer Schreiber
oVirt Release Manager


Friday, April 20, 2012

KVM vCPU Scheduling

I was interested in understanding how CPU resource management works in a Linux KVM based solution such as oVirt, and in drawing some comparisons with other hypervisor technologies such as vSphere ESX. This isn't meant to be an in-depth analysis of every feature of each hypervisor; I'm interested in KVM, and the other references are just to help position things in my mind.

Initially, I'm going to start with ESX to create a baseline against which I can compare KVM. For CPU management [1] in ESX (3.x+) we get what is affectionately known as relaxed co-scheduling (co-scheduling is also known as gang scheduling [2]). It's 'relaxed' because prior to ESX 3.x it was strict co-scheduling, which meant that in a VM the rate of progress (work) was tracked for each vcpu. Some would execute at faster rates than others, and as such an amount of 'skew', or difference in work rate, would be measured and recorded. Once that 'skew' reached a certain limit a co-stop was performed, and the VM could not be re-dispatched (co-start) until enough pcpus were available to simultaneously schedule all vcpus in the VM. So, what was 'relaxed'? Basically, rather than requiring a co-start with all vcpus scheduled simultaneously on pcpus, the co-start is performed only for the vcpus that accrued 'skew' greater than the threshold. This means that a potentially smaller number of pcpus is required for the co-start. If sufficient pcpus are available then the co-start defaults back to the strict behaviour, but the scheduler is more relaxed about the situation if there are not.

With ESX 4+ the scheduler is further relaxed. Hypervisor execution time (on behalf of the vcpu) is no longer tracked as contributing to the skew, and the way the skew is measured and tracked has changed. Now, so long as two vcpus make progress there is no increase in skew, and vcpus that advance too much are stopped to allow the other vcpus to catch up. Underneath all of this, if there are sufficient pcpus to schedule all vcpus simultaneously then the original strict co-scheduling is performed.

ESX also has a proportional share based algorithm for CPU that calculates a proportional entitlement of a vcpu to a pcpu, as well as a relative priority which is determined from the consumption pattern. A vcpu that consumes less than its entitlement is considered higher priority than one that consumes more. This proportional entitlement is hierarchically based and extends to groups of VMs as part of a resource pool.
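To make the entitlement idea concrete, here is a toy Python sketch. It is not the ESX algorithm: it simply assumes that priority can be ranked by the ratio of consumed CPU to entitled CPU, so the vcpu furthest below its entitlement is dispatched first. The vcpu names and the numbers are made up for illustration.

```python
def priority_order(vcpus):
    """Sort vcpu names so the one furthest below its entitlement comes first.

    Assumption for illustration only: priority is ranked by consumed/entitled.
    """
    return sorted(
        vcpus,
        key=lambda name: vcpus[name]["consumed"] / vcpus[name]["entitled"],
    )


# Hypothetical vcpus: entitlement and recent consumption as fractions of a pcpu.
vcpus = {
    "web-vcpu0": {"entitled": 0.50, "consumed": 0.20},
    "db-vcpu0": {"entitled": 0.25, "consumed": 0.30},
    "batch-vcpu0": {"entitled": 0.25, "consumed": 0.25},
}

print(priority_order(vcpus))
```

Here web-vcpu0 has used well under its entitlement and is ranked first, while db-vcpu0 has overshot and is ranked last.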

In reality there's more to the ESX scheduler, but I'm actually interested in how it works in the KVM world.

The KVM approach is to re-use as much of the Linux infrastructure as possible: "Don't re-invent the wheel" is the KVM mantra. As such, KVM uses the Linux CFS (Completely Fair Scheduler [3]) by default.

Within KVM, each vcpu is mapped to a Linux process, which in turn uses hardware assistance to create the necessary 'smoke and mirrors' for virtualisation. As such, a vcpu is just another process to the CFS and, importantly, to cgroups, which as a resource manager allow Linux to manage the allocation of resources, typically proportionally, in order to set constrained allocations. cgroups also apply to memory, network and I/O. Groups of processes can be made part of a scheduling group to apply resource allocation requirements to hierarchical groups of processes.

Note: In order to get the concept of a cluster resource pool, these cgroup definitions would need to be maintained at the cluster level via something like ovirt-engine and implemented by a policy agent such as VDSM.

As with ESX, the KVM 'share' based resource model typically only applies at times of constraint. In order to provide 'capping' of resources, cgroups and the scheduler use the concept of 'bandwidth controls' to establish upper limits on resource consumption, even during periods when the system is unconstrained. This feature appeared in the 3.2+ kernels.
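The difference between shares and caps is easy to see with a little arithmetic. The Python sketch below assumes the cgroup v1 cpu controller, where the relative weight is set in `cpu.shares` and the bandwidth limit in `cpu.cfs_quota_us` and `cpu.cfs_period_us`. The VM names and values are hypothetical, and the code only does the sums; it doesn't touch a real cgroup.

```python
def share_entitlement(shares):
    """Fraction of contended CPU time per group, proportional to cpu.shares."""
    total = sum(shares.values())
    return {name: value / total for name, value in shares.items()}


def bandwidth_cap(quota_us, period_us):
    """Upper limit in CPUs from CFS bandwidth control.

    A negative quota (the default of -1) means the group is not capped.
    """
    if quota_us < 0:
        return None
    return quota_us / period_us


# Hypothetical cpu.shares values for three VMs competing for the same CPUs.
shares = {"vm_a": 1024, "vm_b": 512, "vm_c": 512}
print(share_entitlement(shares))

# cpu.cfs_quota_us=50000 with cpu.cfs_period_us=100000 caps a group at half a CPU.
print(bandwidth_cap(50000, 100000))
```

The shares only matter when the VMs are actually competing: vm_a gets half of the contended time and the other two a quarter each. The quota applies all the time, so a group capped at half a CPU stays there even on an otherwise idle host.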

You're probably thinking, does this mean that KVM based vcpus can suffer from problems such as vcpus in multi-cpu VMs spinning unnecessarily when one of its vcpus is holding a lock and isn't being dispatched? The simple answer is yes. The scheduler does not know the relationship between the processes it is scheduling; it could also, for example, schedule all vcpus of a multi-cpu VM on the same pcpu.

Does Linux have a gang scheduler? Not for production use, but there is an experimental gang scheduler that you can try via this patch [4]. This is well explained via this LCA conference presentation by the author. However, doesn't this then mean that you have all the issues that ESX has with skew tracking and finding ways of relaxing the co-scheduling algorithm, as strict gang scheduling has issues of its own? Fundamentally, for multi-vcpu VMs, spin locks causing extended spins due to the holding vcpu process not being scheduled can be catered for with lock holder pre-emption. This allows the hypervisor to detect such a spin and basically stop spinning unnecessarily, preferentially allocating the pcpu resources to the vcpu that is holding the spin lock. There are a variety of ways to do this: pv-spinlocks [6] or directed yield on pause loop exits [7].

The summary is that KVM, due to its inheritance of a rapidly expanding feature set from Linux, provides a rich, scalable way of managing multi-vcpu VMs without having to resort to strict co-scheduling.

Friday, February 10, 2012

First community release of oVirt

After some time and planning, the first community release of oVirt, the KVM based hypervisor and upstream of RHEV, is available.

The release notes are here and if you haven't had a look at what promises to be a robust virtualisation platform then I encourage you to a) have a look and b) join in the development.

From the release announcement :

The first release includes:

* All the components required to operate a running oVirt installation
* oVirt Engine is now running on Jboss AS7 as the application server
* A new Python SDK to support the development of software utilizing the ovirt-engine APIs
* Fedora based oVirt Node