Monday, 23 March 2015

Not all cores are created equal

Within CERN's compute cloud, the hypervisors vary significantly in performance. We generally run the servers for around 5 years before retirement and there are around 3 different configurations selected each year through public procurement.

Benchmarking in High Energy Physics is done using a benchmark suite called HEPSpec 2006 (HS06). This is based on the C++ programs within the Spec 2006 suite, run in parallel according to the number of cores in the server. The performance range is around a factor of 3 between the slowest and the fastest machines [1].

When machines are evaluated after delivery, the HS06 rating for each hardware configuration is saved into a hardware inventory database.

Defining a flavor for each hardware type was not attractive as there are 15 different configurations to consider and users would not easily find out which flavors have free cores. Instead, users ask for the standard flavors, such as the m1.small 1 core virtual machine, and could land on a hypervisor giving 6 HS06 or one giving 16. However, accounting and quotas are done using virtual cores, so the 6 and 16 HS06 virtual cores are considered equivalent.

In order to improve our accounting, we therefore wanted to provide the performance of the VM along with the metering records giving the CPU usage through ceilometer. Initially, we thought that this would require some additional code to be added to ceilometer, but it is actually possible using the standard ceilometer functions with transformers and publishers.

The following approach was implemented.
  • On the hypervisor, we added an additional meter 'hs06' which provides the CPU rating of the VM normalised by the HS06 performance of the hypervisor. This value is determined using the HS06 value stored in the Hardware Database which can be provided to the hypervisor via a Puppet Fact.
  • This data is stored, in addition to the default 'cpu' record in ceilometer
The benefits of this approach are
  • There is no need for external lookup to the hardware database to process the accounting
  • No additional rights for the accounting process are required (such as to read the mapping between VM and hypervisor)
  • Scenarios such as live migration of VMs from one hypervisor to another of different HS06 are correctly handled
  • No modifications to the ceilometer upstream code are required which both improves deployment time and does not invalidate upstream testing
  • Multiple benchmarks can be run concurrently. This allows a smooth migration from HS06 to a following benchmark HS14 by providing both sets of data.
  • Standard ceilometer behaviour is not modified so existing programs such as Heat which use this data can continue to run
  • The information is calculated directly on the hypervisor so it is scalable, and it is calculated inline, which avoids race conditions when the virtual machine is deleted and the mapping of VM to hypervisor is therefore no longer available
The assumptions are
  • There is no overcommitment of CPU. Further enhancements to the configuration would be possible in this area but this would require further meters.
  • The accounting is based on the delivered clock ticks to the hypervisor. This will vary in cases where the hypervisor is running a more recent version of the operating system with a later compiler (and thus probably has a higher HS06 rating). Running older OS versions is therefore correspondingly less efficient.
  • The cloud is running at least the Juno OpenStack release
To implement this feature, the pipeline capabilities of ceilometer are used. These are configured automatically by the puppet-ceilometer component into /etc/ceilometer/pipeline.yaml.
The changes required are in several blocks. In the sources section, a further source needs to be defined to make the CPU metric available for transformation. This polls the CPU meter every 10 minutes (600 seconds) and sends the data to the hs06 sink.
sources:
    - name: hs06_source
      interval: 600
      meters:
          - "cpu"
      sinks:
          - hs06_sink
The hs06_sink processing is defined later in the file in the sinks section
The entry below takes the number of virtual cores of the VM and scales by 10 (which is the example HS06 CPU performance per core) and 0.98 (for the virtualisation overhead factor). It is reported in units of HS06s (i.e. HepSpec 2006). The value of 10 would be derived from the Puppet HS06 value for the machine divided by the number of cores in the server (from the Puppet fact processorcount). Puppet can be used to configure a hard-coded value per hypervisor that is delivered to the machine as a fact and used to generate the pipeline.yaml configuration file.
sinks:
    - name: hs06_sink
      transformers:
          - name: "arithmetic"
            parameters:
                target:
                    name: "hs06"
                    unit: "HS06"
                    type: "gauge"
                    expr: "$(cpu).resource_metadata.vcpus*10*0.98"
      publishers:
          - notifier://
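
As a minimal sketch of where the factor of 10 in the expression comes from, the Python below derives the per-core rating from a hypervisor's total HS06 value and its core count, then builds the expr string for the arithmetic transformer. The function names and the example values (240 HS06 on 24 cores) are illustrative assumptions, not part of the Puppet code used at CERN.

```python
VIRT_OVERHEAD = 0.98  # virtualisation overhead factor quoted in the text


def hs06_per_core(hypervisor_hs06, processorcount):
    """HS06 rating of one physical core (total rating / core count)."""
    return hypervisor_hs06 / processorcount


def arithmetic_expr(hypervisor_hs06, processorcount):
    """Build the expr string for the ceilometer arithmetic transformer."""
    per_core = hs06_per_core(hypervisor_hs06, processorcount)
    return "$(cpu).resource_metadata.vcpus*%g*%g" % (per_core, VIRT_OVERHEAD)


def vm_hs06(vcpus, hypervisor_hs06, processorcount):
    """HS06 rating attributed to a VM, assuming no CPU overcommitment."""
    return vcpus * hs06_per_core(hypervisor_hs06, processorcount) * VIRT_OVERHEAD


if __name__ == "__main__":
    # Hypothetical hypervisor: 240 HS06 in total across 24 cores.
    print(arithmetic_expr(240, 24))
    print(vm_hs06(2, 240, 24))
```

A template that writes pipeline.yaml could use the same arithmetic, taking the two inputs from the HS06 fact and the processorcount fact.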
Once these changes have been done, the ceilometer daemons can be restarted to get the new configuration.
 service openstack-ceilometer-compute restart
If there are errors, these will be reported to /var/log/ceilometer/compute.log. These can be checked with
egrep "(ERROR|WARNING)" /var/log/ceilometer/compute.log
The first messages like "dropping sample with no predecessor" are to be expected as they are handling differences between the previous values and the current ones (such as cpu utilisation).
After 10 minutes or so, ceilometer will poll the CPU, generate the new hs06 value and this can be queried using the ceilometer CLI.
ceilometer meter-list | grep hs06
will include the hs06 meter
| hs06                                | cumulative | HS06        | c6af7651-5fc5-4d37-bf57-c85238ee098c         | 1cdd42569f894c83863e1b76e165a70c | c4b673a3bb084b828ab344a07fa40f54 |
| hs06                                | cumulative | HS06        | e607bece-d9df-4792-904a-3c4adca1b99c         | 1cdd42569f894c83863e1b76e165a70c | c4b673a3bb084b828ab344a07fa40f54 |
and the last 5 entries in the database can be retrieved
ceilometer sample-list -m hs06 -l 5
produces the output
| Resource ID                          | Name | Type  | Volume | Unit | Timestamp           |
| 1fa28676-b41c-4673-9d31-1fa83711725a | hs06 | gauge | 12.0   | HS06 | 2015-03-22T09:19:49 |
| 1fa28676-b41c-4673-9d31-1fa83711725a | hs06 | gauge | 12.0   | HS06 | 2015-03-22T09:16:49 |
| 1fa28676-b41c-4673-9d31-1fa83711725a | hs06 | gauge | 12.0   | HS06 | 2015-03-22T09:13:49 |
| 1fa28676-b41c-4673-9d31-1fa83711725a | hs06 | gauge | 12.0   | HS06 | 2015-03-22T09:10:49 |
| b812c69c-3c9f-4146-952e-078a266b11c5 | hs06 | gauge | 11.0   | HS06 | 2015-03-22T08:54:25 |


  1. Ulrich Schwickerath - "VM benchmarking: update on CERN approach"
  2. Ceilometer architecture
  3. Basic introduction to ceilometer using RDO
  4. Ceilometer configuration guide for transformers
  5. Ceilometer arithmetic guide

Saturday, 21 March 2015

Nova quota usage - synchronization

Nova quota usage frequently gets out of sync with the real usage.
We have been hitting this problem for a couple of releases and it is increasing with the number of users/tenants in the CERN Cloud Infrastructure.

In nova there are two configuration options (“max_age” and “until_refresh”) that define when the quota usage should be refreshed. In our case we have configured them with “-1”, which means the quota usage must be refreshed every time the “_is_quota_refresh_needed” method is called.
For more information about these options, see the great blog post by Mike Dorman.

This worked well in the releases before Havana. The quota gets out of sync and is refreshed the next time a tenant user performs an operation (e.g. create/delete/…).
However, in Havana, with the introduction of “user quotas”, this problem started to be more frequent, even when forcing the quota to refresh every time.

In the CERN Cloud Infrastructure a tenant usually has several users. When a user creates/deletes/… an instance and the quota gets out of sync, it affects all users in the tenant. The quota refresh only updates the resources of the user performing the operation, not all tenant resources. This means that the quota usage of a tenant is only fixed when the user who owns the out-of-sync resource performs an operation.

The source of the quota desync is very difficult to reproduce. In fact, all our attempts to reproduce it consistently have failed.
To fix the quota usage, the operator needs to manually calculate the quota that is in use and update the database. This process is cumbersome and time consuming, and it can lead to the introduction of even more inconsistencies in the database.

In order to improve our operations we developed a small tool to check which quotas are out of sync and fix them if necessary.
The tool is available in the CERN Operations GitHub.

How to use it?

usage: nova-quota-sync [-h] [--all] [--no_sync] [--auto_sync]
                       [--project_id PROJECT_ID] [--config CONFIG]

optional arguments:
  -h, --help            show this help message and exit
  --all                 show the state of all quota resources
  --no_sync             don't perform any synchronization of the mismatch
  --auto_sync           automatically sync all resources (no interactive)
  --project_id PROJECT_ID
                        searches only project ID

  --config CONFIG       configuration file

The tool calculates the resources in use and compares them with the quota usages.
For example, to see all resources in quota usages that are out of sync:

# nova-quota-sync --no_sync

| Project ID  | User ID  |  Instances   |     Cores      |         Ram          |  Status  |
| 58ed2d48... | user_a   |  657 -> 650  |  2628 -> 2600  |  5382144 -> 5324800  | Mismatch |
| 6f999252... | user_b   |    9 -> 8    |    13 -> 11    |    25088 -> 20992    | Mismatch |
| 79d8d0a2... | user_c   |  232 -> 231  |  5568 -> 5544  |  7424000 -> 7392000  | Mismatch |
| 827441b0... | user_d   |   42 -> 41   |    56 -> 55    |   114688 -> 112640   | Mismatch |
| 8a5858da... | user_e   |    2 -> 4    |     2 -> 4     |     1024 -> 2048     | Mismatch |

The quota usage synchronization can be performed interactively per tenant/project (by omitting the “--no_sync” argument) or automatically for all “mismatch” resources with the argument “--auto_sync”.

This tool needs access to the nova database. The database endpoint should be defined in the configuration file (it can be nova.conf). Since it reads and updates the database, be extremely careful when using it.

Note that quota reservations are not considered in the calculations or updated.
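
The comparison between resources in use and recorded quota usage can be sketched in a few lines of Python. This is only an illustration of the idea, not the code of nova-quota-sync: it works on in-memory dictionaries rather than the nova database, and the field names (project_id, user_id, vcpus, memory_mb) are assumptions chosen for the example.

```python
from collections import defaultdict

RESOURCES = ("instances", "cores", "ram")


def compute_usage(instances):
    """Sum instances, cores and ram per (project_id, user_id)."""
    usage = defaultdict(lambda: dict.fromkeys(RESOURCES, 0))
    for inst in instances:
        key = (inst["project_id"], inst["user_id"])
        usage[key]["instances"] += 1
        usage[key]["cores"] += inst["vcpus"]
        usage[key]["ram"] += inst["memory_mb"]
    return dict(usage)


def find_mismatches(recorded, instances):
    """Return {key: {resource: (recorded, real)}} for out-of-sync entries."""
    real = compute_usage(instances)
    empty = dict.fromkeys(RESOURCES, 0)
    mismatches = {}
    for key in set(recorded) | set(real):
        rec = recorded.get(key, empty)
        act = real.get(key, empty)
        diff = {r: (rec[r], act[r]) for r in RESOURCES if rec[r] != act[r]}
        if diff:
            mismatches[key] = diff
    return mismatches
```

Each pair in the result reads like the "recorded -> real" columns in the table above. Fixing a mismatch would mean writing the second value back to the quota usage record, which this sketch deliberately leaves out.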

Tuesday, 17 February 2015

Delegation of rights

At CERN, we have 1st and 2nd level support teams to run the computer centre infrastructure. These groups provide 24x7 coverage for problems and initial problem diagnosis to determine which 3rd line support team needs to be called in the event of a critical problem. Typical operations required are
  • Stop/Start/Reboot server
  • Inspect console
When we ran application services on physical servers, these activities could be performed using a number of different technologies
  • KVM switches
  • IPMI for remote management
  • Power buttons and the console trolley
With a virtual infrastructure, the applications are now running on virtual machines within a project. These operations are not available by default for the 1st and 2nd level teams since only the members of the project can perform these commands. On the other hand, the project administrator rights contain other operations (such as delete or rebuild servers) which are not needed by these teams.

To address this, we have defined an OpenStack policy for the projects concerned. This is an opt-in process so that the project administrator needs to decide whether these delegated rights should be made available (either at project creation or later).

Define operator role

The first step is to define a new role, operator, for the projects concerned. This can be done through the GUI or via the CLI. In CERN's case, we include it in the project creation workflow.

On a default configuration,

$ keystone role-list
|                id                |      name     |
| ef8afe7ea1864b97994451fbe949f8c9 | ResellerAdmin |
| 8fc0ca6ef49a448d930593e65fc528e8 | SwiftOperator |
| 9fe2ff9ee4384b1894a90878d3e92bab |    _member_   |
| 172d0175306249d087f9a31d31ce053a |     admin     |

A new role operator needs to be defined, using the steps from the documentation

 $ keystone role-create --name operator
| Property |              Value               |
|    id    | e97375051a0e4bdeaf703f5a90892996 |
|   name   |             operator             |

and the new role will then appear in the keystone role-list.

Now add a new user operator1

$ keystone user-create --name operator1 --pass operatorpass
| Property |              Value               |
|  email   |                                  |
| enabled  |               True               |
|    id    | f93a50c12c164f329ee15d4d5b0e7999 |
|   name   |            operator1             |
| username |            operator1             |

and add the operator1 account to the role

$ keystone user-role-add --user operator1 --role operator  --tenant demo
$ keystone user-role-list --tenant demo --user operator1

A similar role is defined for accounting which is used to allow the CERN accounting system read-only access to data about instances so that an accounting report can be produced without needing OpenStack admin rights.

For mapping which users are given this role, we use the Keystone V3 functions available through the OpenStack unified CLI.

$ openstack role add --group operatorGroup --project demo operator

Using a group operatorGroup, we are able to define the members in Active Directory and then have those users updated automatically with consistent role sets. The users can also be added explicitly

$ openstack role add --user operator1 --project demo operator

Update nova policy

The key file is policy.json in /etc/nova, which defines the roles and what they can do. There are two parts to the rules. The first is a set of groupings which give a human readable name to a complex rule set, for example a member is someone who has neither the accounting role nor the operator role:

    "context_is_admin":  "role:admin",
    "context_is_member": "not role:accounting and not role:operator",
    "admin_or_owner":  "is_admin:True or (project_id:%(project_id)s and rule:context_is_member)",
    "default": "rule:admin_or_owner",
    "default_or_operator": "is_admin:True or (project_id:%(project_id)s and not role:accounting)",

The particular rules are relatively self-descriptive.

The actions can then be defined using these terms

    "compute:get_all": "rule:default_or_operator",
    "compute:get_all_tenants": "rule:default_or_operator",
    "compute:reboot": "rule:default_or_operator",
    "compute:get_vnc_console": "rule:default_or_operator",
    "compute_extension:console_output": "rule:default_or_operator",
    "compute_extension:consoles": "rule:default_or_operator",

With this, a user group can be defined to allow stop/start/reboot/console while not being able to perform the more destructive operations such as delete.
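
To make the effect of these rules concrete, here is a minimal Python sketch that mirrors the three groupings as plain functions. It is not how nova evaluates policy.json. It also assumes that any action not listed above (compute:delete is used as the example) falls back to the "default" rule, and the is_allowed helper is hypothetical.

```python
def context_is_member(roles):
    # "not role:accounting and not role:operator"
    return "accounting" not in roles and "operator" not in roles


def admin_or_owner(is_admin, same_project, roles):
    # "is_admin:True or (project_id:%(project_id)s and rule:context_is_member)"
    return is_admin or (same_project and context_is_member(roles))


def default_or_operator(is_admin, same_project, roles):
    # "is_admin:True or (project_id:%(project_id)s and not role:accounting)"
    return is_admin or (same_project and "accounting" not in roles)


# Actions given the default_or_operator rule in the text.
OPERATOR_ACTIONS = {
    "compute:get_all",
    "compute:get_all_tenants",
    "compute:reboot",
    "compute:get_vnc_console",
    "compute_extension:console_output",
    "compute_extension:consoles",
}


def is_allowed(action, roles, same_project=True, is_admin=False):
    """Assumption: unlisted actions use the "default" rule (admin_or_owner)."""
    rule = default_or_operator if action in OPERATOR_ACTIONS else admin_or_owner
    return rule(is_admin, same_project, roles)
```

With these definitions, a user holding only the operator role in the project is allowed compute:reboot but refused compute:delete, while an ordinary project member is allowed both.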

Thursday, 5 February 2015

Choosing the right image

Over the past 18 months of production at CERN, we have provided a number of standard images for the end users to use when creating virtual machines.
  • Linux
    • Scientific Linux CERN 5
    • Scientific Linux CERN 6
    • CERN CentOS 7
  • Windows
    • Windows 7
    • Windows 8
    • Windows Server 2008
    • Windows Server 2012
To accelerate deployment of new VMs, we also often have
  • Base images which are the minimum subset of packages on which users can build their custom virtual machines using features such as cloud-init or Puppet to install additional packages
  • Extra images which contain common additional packages and accelerate the delivery of a working environment. These profiles are often close to a desktop like PC-on-demand. Installing a full Office 2013 suite on to a new virtual machine can take over one hour so preparing this in advance saves time for the users.
However, with each of these images and additional software, there is a need to keep the image contents up to date.
  • Known security issues should be resolved within the images rather than relying on the installation of new software after the VM has booted
  • Installation of updates slows down the instantiation of virtual machines
The images themselves are therefore rebuilt on a regular basis and published to the community as public images. The old images, however, should not be deleted as they are needed in the event of a resize or live migration. Images cannot be replaced in Glance since this would lead to inconsistencies on the hypervisors.

As a result, the number of images in the catalog increases on a regular basis. For the web based end user, this can make navigating the Horizon GUI panel for the images difficult and increase the risk that an out of date image is selected.

The approach that we have taken is to build on the image properties, which allow the image maintainer to tag images with attributes. We use the following from the standard list

  • architecture => "x86_64", "i686"
  • os_distro => "slc", "windows"

We also defined additional fields

  • os_distro_major => "6", "5"
  • os_distro_minor => "4"
  • os_edition => "<Base|Extra>" indicates which set of additional packages was installed into the image
  • release_date => "2014-05-02T13:02:00" gives the date at which the image was made available to public users
  • custom_name => "A custom name" allows a text string to override the default name (see below)
  • upstream_provider => URL gives a URL to contact in the event of problems with the image. This is useful where the image is supplied by a 3rd party and the standard support lines should not be used.

With these additional fields, the latest images can be selected and a subset presented for the end user to choose.

The images are sorted using the following sequence of keys

  1. os_distro ASC
  2. os_distro_major DESC
  3. os_distro_minor DESC
  4. architecture DESC
  5. release_date DESC
Images which are from previous releases (i.e. where os_distro, os_distro_major and architecture are the same) are only shown if the 'All' tab is selected.
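
The ordering and the "latest only" filter can be sketched in Python as below. This is not the Horizon patch itself. It assumes each image is a plain dictionary carrying the properties listed above, that the major and minor versions are numeric strings, and that release_date is an ISO 8601 string, so a plain string comparison orders dates correctly.

```python
def sort_images(images):
    """Apply the five sort keys using stable sorts, least significant first."""
    ordered = sorted(images, key=lambda i: i["release_date"], reverse=True)
    ordered = sorted(ordered, key=lambda i: i["architecture"], reverse=True)
    ordered = sorted(ordered, key=lambda i: int(i["os_distro_minor"]), reverse=True)
    ordered = sorted(ordered, key=lambda i: int(i["os_distro_major"]), reverse=True)
    return sorted(ordered, key=lambda i: i["os_distro"])


def latest_only(images):
    """Keep the first image per (os_distro, os_distro_major, architecture)."""
    seen = set()
    latest = []
    for image in sort_images(images):
        group = (image["os_distro"], image["os_distro_major"], image["architecture"])
        if group not in seen:
            seen.add(group)
            latest.append(image)
    return latest
```

Because Python's sort is stable, applying the keys from last to first gives the same result as one multi-key sort with mixed ascending and descending directions. latest_only would feed the default view, and sort_images the 'All' tab.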

The code is in preparation to be proposed as an upstream patch. For the moment, it can be found in the CERN github repository.

Tuesday, 27 January 2015

Exceeding tracked connections

As we increase the capacity of the CERN OpenStack cloud, we've noticed a few cases of an interesting problem where hypervisors lose network connectivity. These hypervisors are KVM based running Scientific Linux CERN 6. The cloud itself is running Icehouse using Nova network.

Connection tracking refers to the ability to maintain state information about a connection in memory tables, such as source and destination IP address and port number pairs (known as socket pairs), protocol types, connection state and timeouts. Firewalls that do this are known as stateful. Stateful firewalling is inherently more secure than its "stateless" counterpart, simple packet filtering.
More details are available at [1].

On busy hypervisors, in the syslog file, we have messages such as

Jan  4 23:14:44 hypervisor kernel: nf_conntrack: table full, dropping packet.

Searching around the internet, we found references to a number of documents [2][3] discussing the limit.

It appears that the default algorithm is pretty simple. For a 64 bit hypervisor,
  • If RAM < 1 GB, the maximum conntrack is set to RAM in bytes  / 32768
  • Otherwise, set to 65536
Our typical hypervisors contain 48GB of memory and 24 cores so a busy server handling physics distributed data access can easily use 1000s of connections, especially if sockets are not being closed correctly. With several instances of these servers on a hypervisor, it is easy to reach the 65536 limit and start to drop new connections.
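
The default rule described above, and how quickly a few busy VMs exhaust it, can be sketched as follows. The function names and the figure of 10000 connections per VM are assumptions made for the example.

```python
GIB = 1024 ** 3


def default_conntrack_max(ram_bytes):
    """Default nf_conntrack_max on a 64 bit host, per the rule in the text."""
    if ram_bytes < GIB:
        return ram_bytes // 32768
    return 65536


def vms_until_full(conntrack_max, connections_per_vm):
    """Number of equally busy VMs that fit before the table is full."""
    return conntrack_max // connections_per_vm


if __name__ == "__main__":
    limit = default_conntrack_max(48 * GIB)  # a 48GB hypervisor
    print(limit, vms_until_full(limit, 10000))
```

The point of the sketch is that the default does not grow with memory above 1 GB, so a 48GB hypervisor gets the same table size as a much smaller machine.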

To keep an eye on the usage, the current and maximum values can be checked using sysctl.

The current usage can be checked using

# sysctl net.netfilter.nf_conntrack_count
net.netfilter.nf_conntrack_count = 6650

The maximum value can be found as follows

# sysctl net.netfilter.nf_conntrack_max
net.netfilter.nf_conntrack_max = 65536

To avoid overload on the hypervisors, the hypervisor conntrack value can be increased and should then be set to the sum of the connections expected from each virtual machine. This can be done using the /etc/sysctl.d directory or with an appropriate configuration management tool.

Note that you'll need to set both net.netfilter.nf_conntrack_max as well as  net.nf_conntrack_max. For the CERN OpenStack cloud we have increased the values from 64k to 512k.
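
As a rough sketch of the sizing rule, the Python below sums the connections expected from each VM and renders the two sysctl settings as text that could be placed in a file under /etc/sysctl.d. The function name and the per-VM figures are illustrative assumptions.

```python
def conntrack_sysctl(expected_per_vm):
    """Render sysctl lines sized to the sum of expected VM connections."""
    total = sum(expected_per_vm)
    keys = ("net.netfilter.nf_conntrack_max", "net.nf_conntrack_max")
    return "".join("%s = %d\n" % (key, total) for key in keys)


if __name__ == "__main__":
    # Hypothetical hypervisor hosting 8 VMs, each allowed 64k connections.
    print(conntrack_sysctl([65536] * 8), end="")
```

With these example figures the total is 524288, which corresponds to the 512k value mentioned above.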

Thursday, 8 January 2015

Using bitnami images with OpenStack

At CERN, we generally use Puppet to configure our production services, using modules from Puppet Forge to quickly set up the appropriate parameters and services.

However, it is often interesting to try out a new software package for a quick investigation. For this in the past, people have used Bitnami on test systems or their laptops where they installed the operating system and then installed the Bitnami application packages.

With an OpenStack cloud, deploying Bitnami configurations can be achieved even more quickly. We are running OpenStack Icehouse and KVM or Hyper-V hypervisors.

The steps are as follows
  • Download the cloud image from Bitnami
  • Load the image into Glance
  • Deploy the image
  • Check the console for messages using Horizon
  • Use the application!
Since the operating system comes with the image, it also avoids issues with pre-requisites or unexpected configurations.

Getting the images from Bitnami

Bitnami provides installers which can be run on operating systems that have been previously installed but also cloud images which include the appropriate operating systems in the virtual machine image.

There are a number of public clouds supported such as Amazon and Azure but also private cloud images for VMware and VirtualBox. For this example, we use the images for VirtualBox as there is a single image file.

A wide variety of appliances are available. For this case, we show the use of WordPress (of course :-). The WordPress virtual machine private cloud list gives download links to the appliance image of Ubuntu, middleware and WordPress in a zip file containing

  • An OVF metadata file
  • A VMDK disk image
We only use the VMDK file.

Loading the VMDK into Glance

A VMDK file is a disk image like a QEMU KVM qcow2 one. KVM also supports VMDK so it is possible to load it directly into Glance. Alternatively, it can be converted into qcow2 using qemu-img if this is needed.

glance image-create --name wordpress-bitnami-vmdk  --file bitnami-wordpress-4.1-0-ubuntu-14.04-OVF-disk1.vmdk --disk-format vmdk --container-format=bare 

This creates the entry in Glance so that new VMs can be created.
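
If the qcow2 route is preferred, the conversion with qemu-img mentioned earlier could be wrapped as in the sketch below. The helper names and file names are hypothetical. The converted file would then be uploaded with --disk-format qcow2 instead of vmdk.

```python
import subprocess


def qemu_img_convert_cmd(vmdk_path, qcow2_path):
    """Command line converting a VMDK disk image to qcow2."""
    return ["qemu-img", "convert", "-f", "vmdk", "-O", "qcow2",
            vmdk_path, qcow2_path]


def convert(vmdk_path, qcow2_path):
    """Run the conversion; raises CalledProcessError if qemu-img fails."""
    subprocess.check_call(qemu_img_convert_cmd(vmdk_path, qcow2_path))
```

Keeping the command construction separate from its execution makes it easy to print and check the command before running it on a large image.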

Creating a new VM

The VM can then be instantiated from this image. 

nova boot --flavor m1.medium --image wordpress-bitnami-vmdk --key-name keypair hostname

The keypair and hostname text should be replaced by your preferred key pair and VM host name.

Check console

Using the graphical interface, you can see the details of the deployment. Since ssh is not enabled by default, you can't log in. Once booted, the console shows a screen giving instructions on how to access the application.

If you wish to log in to the Linux shell, check the account details at the application page on Bitnami.

Use the application

The application can be accessed using the web URL shown on the console. With a few simple OpenStack and web commands, you have a working application instance within a few minutes and can investigate the potential of the application.

Note: it is important that the application is kept up to date. Ensure you follow the Bitnami updates and ensure appropriate security of the server.

Thursday, 6 November 2014

Our cloud in Icehouse

This is going to be a very simple blog to write.

When we upgraded the CERN private cloud to Havana, we wrote a blog post giving some of the details of the approach taken and the problems we encountered.

The high level approach we took this time was the same, component by component. The first ones were upgraded during August and Nova/Horizon last of all in October.

The difference this time is that no significant problems were encountered during the production upgrade.

The most time consuming upgrade was Nova. As last time, we took a very conservative approach and disabled the API access to the cloud during the upgrade. With offline backups and the database migration steps taking several hours given the 1000s of VMs and hypervisors, the API unavailability was around six hours in total. All VMs continued to run during this period without issues.

Additional Functions

With the basic functional upgrade, we are delivering the following additional functions to the CERN end users. These are all based off the OpenStack Icehouse release functions and we'll try to provide more details on these areas in future blogs.
  • Kerberos and X.509 authentication
  • Single Sign On login using CERN's Active Directory/ADFS service and the SAML federation functions from Icehouse
  • Unified openstack client for consistent command lines across the components
  • Windows console access with RDP
  • IPv6 support for VMs
  • Horizon new look and feel based on RDO styling
  • Delegated rights to the operators and system administrators to perform certain limited activities on VMs such as reboot and console viewing so they can provide out of working hours support.


Along with the CERN team, many thanks should go to the OpenStack community for delivering a smooth upgrade across Ceilometer, Cinder, Glance, Keystone, Nova and Horizon.