Friday, 9 June 2017

Experiences with Cinder in Production

The CERN OpenStack cloud service is providing block storage via Cinder since Havana days in early 2014.  Users can choose from seven different volume types, which offer different physical locations, different power feeds, and different performance characteristics. All volumes are backed by Ceph, deployed in three separate clusters across two data centres.

Due to its flexibility, the volume concept has become very popular with users and the service has hence grown during the past years to over 1PB of allocated quota, hosted in more than 4'000 volumes. In this post, we'd like to share some of the features we use and point out some of the pitfalls we've run into when running (a very stable and easy to maintain) Cinder service in production.

Avoiding sadness: Understanding the 'host' parameter

With the intent to increase the resiliency, we configured the service from the start to run on multiple hosts. The three controller nodes were set up in an identical way, so all of them ran the API ('c-api'), scheduler ('c-sched') and volume ('c-vol') services.

With the first upgrades, however, we realised that there was a coupling between a volume and the 'c-vol' service that had created it: each volume is associated with its creation host which, by default, is identified by hostname of the controller. So, when the first controller needed to be replaced, the 'c-sched' wasn't able to find the original 'c-vol' service which would be able to execute volume operations. At the time, we fixed this by changing the corresponding volume entries in the Cinder database to point to the new host that was added.

As the Cinder configuration allows the 'host' to be set directly in 'cinder.conf',  we set this parameter to be the same on all controllers with the idea to remove the coupling between the volume and the specific 'c-vol' which was used to create it. We ran like this for quite a while and although we never saw direct issues related to this setting, in hindsight it may explain some of the issues we had with volumes getting stuck in transitional states. The main problem here is the clean-up being done as the daemons start up: as they assume exclusive access to 'their' volumes, volumes in transient states will be "cleaned up", e.g. their state reset, when a daemon starts, so in a setup with identical 'host's, this may cause undesired interferences.

Taking this into account, our setup has been changed to keep the 'c-api' on all three controller, but run the 'c-vol' and 'c-sched' services on one host only. Closely following the recent work of the Cinder team to improve the locking and allow for Active/Active HA we're looking forward to have Active/Active HA 'c-vol' services fully available again.

Using multiple volume types: QoS and quota classes

The scarce resource on our Ceph backend is not space, but IOPS, and after we handed out the first volumes to users, we quickly realised that some resource management was needed. We achieved this by creating a QoS spec and associating it with the one volume type we had at the time:

# cinder qos-create std-iops write_iops_sec=100 read_iops_sec=100
# cinder qos-associate <std_iops_qos_id> <std_volume_type_id>

This setting does not only allow you to limit the amount of IOPS used on this volume type, but also to define different service levels. For instance, for more demanding use cases we added a high IOPS volume type to which access is granted on a per request basis:

# cinder type-create high-iops
# cinder qos-create high-iops write_iops_sec=500 read_iops_sec=500
# cinder qos-associate <high_iops_qos_id> <high_iops_volume_type_id>

Note that both types are provided by the same backend and physical hardware (which also allows for a conversion without data movement between these types using 'cinder retype')! Note also that for attached volumes a detach/re-attach cycle is needed to have QoS changes taking effect.

In order to manage the initial default quotas for these two (and the other five volume types the service offers), we use Cinder's support for quota classes. As apart from the std-iops volume type all other volume types are only available on request, the initial quota is usually set to '0'. So, in order to create the default quotas for a new type, we would hence update the default quota class by running a command like:

# cinder type-create new-type
# cinder quota-class-update --volume-type new-type --volumes 0 --snapshots 0 --gigabytes 0 default

Of course, this method can also be used to define different initial quotas for new volume types, but it is in any case a way to avoid setting the initial quotas explicitly after project creation.

Fixing a long-standing issue: Request timeouts and DB deadlocks

For quite some time, our Cinder deployment had suffered from request timeouts leading to volumes left in error states when doing parallel deletions. Though easily reproducible, this was infrequent (and subsequently received the corresponding attention ...). Recently, however, this became a much more severe issue with the increased use of Magnum and Kubernetes clusters (which use volumes and hence launch parallel volumes deletions at larger scale when being removed). This affected the overall service availability (and, subsequently, received the corresponding attention here as well ...).

In this situations, the 'c-vol' logs showed lines like

"Deadlock detected when running 'reservation_commit': Retrying ..."

and hence indicated locking problem. We weren't able to pinpoint in the code how a deadlock would occur, though. A first change that mitigated the situation was to reduce the 'innodb_lock_wait_timeout' from its default value of 50 seconds to 1 second: the client was less patient and exercised the retry logic decorates the database interactions much earlier. Clearly, this did not address the underlying problem, but at least allowed the service to handle these parallel deletions in a much better way.

The real fix, suggested by a community member, implied to try and change a setting we had carried forward since the initial setup of the service: the connection string in 'cinder.conf' was not specifying a driver and hence using the mysql Python wrapper (rather than the recommended 'pymysql' Python implementation). After changing our connection from

connection = mysql://cinder:<pw>@<host>:<port>/cinder


connection = mysql+pymysql://cinder:<pw>@<host>:<port>/cinder

the problem basically disappeared!

So the underlying reason was the management of the green thread parallelism in the wrapper vs. the native Python implementation: while the former enforces serialisation (and hence eventually deadlocks in SQLAlchemy), the latter allows for proper parallel execution of the requests to the database. The OpenStack oslo team is now looking into issuing a warning when it detects this obsolete setting.

As using the 'pymysql' driver is generally recommended and, for instance, default in devstack deployments, volunteers to help with this issue had a really hard time to reproduce the issues we experienced ... another lesson learnt when keeping services running for a longer period :)

Tuesday, 6 June 2017

OpenStack papers community on Zenodo

At the recent summit in Boston, Doug Hellmann and I were discussing research around OpenStack, both the software itself but also how it is used by applications. There are many papers being published in proceedings of conferences and PhD theses but finding out about these can be difficult. While these papers may not necessarily lead to open source code contribution, the results of this research is a valuable resource for the community.

Increasingly, publications are made with Open Access conditions which are free of all restrictions on access. For example, all projects receiving European Union Horizon 2020 funding are required to make sure that any peer-reviewed journal article they publish is openly accessible, free of charge. Reviewing with the OpenStack scientific working group, Open access was also felt to be consistent with OpenStack's Open principles of Open Source, Open Design, Open Development and Open Community.

There are a number of different repositories available where publications such as this can be made available. The OpenStack scientific working group are evaluating potential approaches and Zenodo looks like a good candidate as it is already widely used in the research community, open source on github and the application also runs in the CERN Data Centre on OpenStack. Preservation of data is one of CERN's key missions and this is included in the service delivery for Zenodo.

The name Zenodo is derived from Zenodotus, the first librarian of the Ancient Library of Alexandria and father of the first recorded use of metadata, a landmark in library history.

Accessing the Repository

The list of papers can be seen at Along with keywords, there is a dedicated search facility is available within the community so that relevant papers can be found quickly.

Submitting New Papers

Zenodo allows new papers to be submitted for inclusion into the OpenStack Papers repository. There are a number of steps to be performed.

Please ensure that these papers are available under open access conditions before submitting them to the repository if published elsewhere. Alternatively, if the papers can be published freely, they can be published in Zenodo for the first time and receive the DOI directly.
  1. Log in to Zenodo. This can be done using your github account if you have one or by registering a new account via the 'Sign Up' button.
  2. Once logged in, you can go to the openstack repository at and upload a new paper.
  3. The submission will then be verified before publishing.
To submit for this repository, you need to provide
  • Title of the paper
  • Author list
  • Description (the Abstract is often a good content)
  • Date of publication
If you know the information, please provide the following also
  • DOI (Digital Object Identifier) used to uniquely identify the object. In general, these will already be allocated to the paper since the original publication will have allocated one. If none is specified, Zenodo will create one, which is good for new publications but bad practice to generate duplicate DOIs for published works. So please try to find the original, which also it helps with future cross referencing.
  • There are optional fields at upload time for adding more metadata (to make it machine readable), such as “Journal” and “Conference”. Adding journal information improves the searching and collating of documents for the future so if this information is known, it is good to enter it.
Zenodo provides the synchronisation facilities for repositories to exchange information (OAI 2.0). Planet OpenStack feeds using this would be an interesting enhancement to consider or adding RSS support to Zenodo would be welcome contributions.

Tuesday, 2 May 2017

Migrating to Keystone Fernet Tokens

For several OpenStack releases, the Identity service in OpenStack offers an additional token format than UUID, called Fernet. This token format has a series of advantages over the UUID, the most prominent for us is that it doesn't need to be persisted. We were also interested in a speedup of the token creation and validation.

At CERN, we have been using the UUID format for the tokens since the beginning of the cloud deployment in 2013. Normally in the keystone database we have around 300,000 tokens with an expiration of 1 day. In order to keep the database size controlled, we run the token_flush procedure every hour.

In the Mitaka release, all remaining bugs were sorted out and since the Ocata release of OpenStack, Fernet is now the default. Right now, we are running keystone in Mitaka and we decided to migrate to the Fernet token format before the upgrade of the Ocata release.

Before reviewing the upgrade from UUID to Fernet, let's have a brief look on the architecture of the identity service. The service resolves into a set of load balancers and then they redirect to a set of backends running keystone under apache. This allows us to replace/increase the backends transparently.

The very first question is how many keys we would need. If we take the formula from [1]:

fernet_keys = validity / rotation_time + 2

If we have a validity of 24 hours and a rotation every 6 hours, we would need 24/6 + 2 = 6 keys

As we have several backends, the first task is to distribute the keys among the backends, for that we are using puppet that provides the secrets in the /etc/keystone/fernet-keys folder. With that we ensure that a new introduced backend will always have the last set of keys available.

The second task is how to rotate them. In our case we are using a cronjob in our rundeck installation that rotates the secrets and introduces a new one. This job is doing exactly the same as the keystone-manage token-flush command. One important aspect is that on each rotation, you need to reload or restart the Apache daemon to load the keys from the disk.

So we prepare all this changes in the production setup quite some time ago, and we were testing that the keys were correctly updated and distributed. On April 5th, we decided to go on. This is the picture of the API messages during the intervention.

There are two distinct steps in the API messages usage. The first one is a peak of invalid tokens. This is triggered by the end users trying to validate UUID tokens after the change. The second peak is related to OpenStack services that are using trusts, like Magnum and Heat. From our past experience, these services can be affected by a massive invalidation of tokens. The trust credentials are cached and you need to restart both services so both services will get their Fernet tokens.

Below is a picture of the size of the token table in the keystone database, as we are now in Fernet is going slowly down to zero, due to the hourly token_flush command I mentioned earlier.

The last picture is the response time of the identity service during the change. As you can see the response time is better than on UUID as stated in [2]

In the Ocata release, more improvements are on the way to improve the response time, and we are working to update the identity service in the near future.


  1. Fernet token FAQ at
  2. Fernet token performance at
  3. Payloads for Fernet token format at
  4. Deep dive into Fernet at

Thursday, 9 February 2017

os_type property for Windows images on KVM

The OpenStack images have a long list of properties which can set to describe the image meta data. The full list is described in the documentation. This blog reviews some of these settings for Windows guests running on KVM, in particular for Windows 7 and Windows 2008R2.

At CERN, we've used a number of these properties to help users filter images such as the OS distribution and version but also added some additional properties for specific purposes such as

  • when the image was released (so the images can be sorted by date)
  • whether the image is the latest recommended one (such as setting the CentOS 7.2 image to not recommended when CentOS 7.3 comes out)
  • which CERN support team provided the image 

For a typical Windows image, we have

$ glance image-show 9e194003-4608-4fe3-b073-00bd2a774a57
| Property          | Value                                                          |
| architecture      | x86_64                                                         |
| checksum          | 27f9cf3e1c7342671a7a0978f5ff288d                               |
| container_format  | bare                                                           |
| created_at        | 2017-01-27T16:08:46Z                                           |
| direct_url        | rbd://b4f463a0-c671-43a8-bd36-e40ab8d233d2/images/9e194003-4   |
| disk_format       | raw                                                            |
| hypervisor_type   | qemu                                                           |
| id                | 9e194003-4608-4fe3-b073-00bd2a774a57                           |
| min_disk          | 40                                                             |
| min_ram           | 0                                                              |
| name              | Windows 10 - With Apps [2017-01-27]                            |
| os                | WINDOWS                                                        |
| os_distro         | Windows                                                        |
| os_distro_major   | w10entx64                                                      |
| os_edition        | DESKTOP                                                        |
| os_version        | UNKNOWN                                                        |
| owner             | 7380e730-d36c-44dc-aa87-a2522ac5345d                           |
| protected         | False                                                          |
| recommended       | true                                                           |
| release_date      | 2017-01-27                                                     |
| size              | 37580963840                                                    |
| status            | active                                                         |
| tags              | []                                                             |
| updated_at        | 2017-01-30T13:56:48Z                                           |
| upstream_provider |   |
| virtual_size      | None                                                           |
| visibility        | public                                                         |

Recently, we have seen some cases of Windows guests becoming unavailable with the BSOD error "CLOCK_WATCHDOG_TIMEOUT (101)".  On further investigation, these tended to occur around times of heavy load on the hypervisors such as another guest doing CPU intensive work.

Windows 7 and Windows Server 2008 R2 were the guest OSes where these problems were observed. Later OS levels did not seem to show the same problem.

We followed the standard processes to make sure the drivers were all updated but the problem still occurred.

Looking into the root cause, the Red Hat support articles were a significant help.

"In the environment described above, it is possible that 'CLOCK_WATCHDOG_TIMEOUT (101)' BSOD errors could be due to high load within the guest itself. With virtual guests, tasks may take more time that expected on a physical host. If Windows guests are aware that they are running on top of a Microsoft Hyper-V host, additional measures are taken to ensure that the guest takes this into account, reducing the likelihood of the guest producing a BSOD due to time-outs being triggered."

These suggested to use the os_type parameter to help inform the hypervisor to use some additional flags. However, the OpenStack documentation explained this was a XenAPI only setting (which would not therefore apply for KVM hypervisors).

It is not always clear which parameters to set for an OpenStack image. Setting os_distro has a value such as 'windows' or 'ubuntu'. While the flavor of the OS could be determined, the setting of os_type is needed to be used by the code.

Thus, in order to get the best behaviour for Windows guests, from our experience, we would recommend setting both the os_distro and os_type as follows.
  • os_distro = 'windows'
  • os_type = 'windows'
When the os_type parameter is set, some additional XML is added to the KVM configuration following the Kilo enhancement.

      <relaxed state='on'/>
      <vapic state='on'/>
      <spinlocks state='on' retries='8191'/>
  <clock offset='localtime'>
    <timer name='pit' tickpolicy='delay'/>
    <timer name='rtc' tickpolicy='catchup'/>
    <timer name='hpet' present='no'/>
    <timer name='hypervclock' present='yes'/>

These changes have led to an improvement when running on a loaded hypervisors, especially for Windows 7 and 2008R2 guests. A bug has been opened for the documentation to explain the setting is not Xen only.


  • Jose Castro Leon performed all of the analysis and testing of the various solutions.


Monday, 9 January 2017

Containers on the CERN cloud

We have recently made the Container-Engine-as-a-Service (Magnum) available in production at CERN as part of the CERN IT department services for the LHC experiments and other CERN communities. This gives the OpenStack cloud users Kubernetes, Mesos and Docker Swarm on demand within the accounting, quota and project permissions structures already implemented for virtual machines.

We shared the latest news on the service with the CERN technical staff (link). This is the follow up on the tests presented at the OpenStack Barcelona (link) and covered in the blog from IBM. The work has been helped by collaborations with Rackspace in the framework of the CERN openlab and the European Union Horizon 2020 Indigo Datacloud project.


At the Barcelona summit, we presented with Rackspace and IBM regarding our additional performance tests after the previous blog post. We expanded beyond the 2M requests/s to reach around 7M where some network infrastructure issues unrelated to OpenStack limited the scaling further.

As we created the clusters, the deployment time increased only slightly with the number of nodes as most of the work is done in parallel. But for 128 node or larger clusters, the increase in time started to scale almost linearly. At the Barcelona summit, the Heat and Magnum teams worked together to develop proposals for how to improve further in future releases, although a 1000 node cluster in 23 minutes is still a good result

Cluster Size (Nodes)
Deployment Time (min)


With the LHC producing nearly 50PB this year, High Energy Physics has some custom storage technologies for specific purposes, EOS for physics data, CVMFS for read-only, highly replicated storage such as applications.

One of the features of providing a private cloud service to the CERN users is to combine the functionality of open source community software such as OpenStack with the specific needs for high energy physics. For these to work, some careful driver work is needed to ensure appropriate access while ensuring user rights. In particular,
  • EOS provides a disk-based storage system providing high-capacity and low-latency access for users at CERN. Typical use cases are where scientists are analysing data from the experiments.
  • CVMFS is used for a scalable, reliable and low-maintenance for read-only data such as software.
There are also other storage solutions we use at CERN such as
  • HDFS for long term archiving of data using Hadoop which uses an HDFS driver within the container.  HDFS works in user space, so no particular integration was required to use it from inside (unprivileged) containers
  • Cinder provides additional disk space using volumes if the basic flavor does not have sufficient. This Cinder integration is offered by upstream Magnum, and work was done in the last OpenStack cycle to improve security by adding support for Keystone trusts.
CVMFS was more straightforward as there is no need to authenticate the user. The data is read-only and can be exposed to any container. The access to the file system is provided using a driver (link) which has been adapted to run inside a container. This saves having to run additional software inside the VM hosting the container.

EOS requires authentication through mechanisms such as Kerberos to identify the user and thus determine what files they have access to. Here a container is run per user so that there is no risk of credential sharing. The details are in the driver (link).

Service Model

One interesting question that came up during the discussions of the container service was how to deliver the service to the end users. There are several scenarios:
  1. The end user launches a container engine with their specifications but they rely on the IT department to maintain the engine availability. This implies that the VMs running the container engine are not accessible to the end user.
  2. The end user launches the engine within a project that they administer. While the IT department maintains the templates and basic functions such as the Fedora Atomic images, the end user is in control of the upgrades and availability.
  3. A variation of option 2., where the nodes running containers are reachable and managed by the end user, but the container engine master nodes are managed by the IT department. This is similar to the current offer from the Google Container Engine and requires some coordination and policies regarding upgrades
Currently, the default Magnum model is for the 2nd option and adding option 3 is something we could do in the near future. As users become more interested in consuming containers, we may investigate the 1st option further


Many applications at use in CERN are in the process of being reworked for a microservices based architecture. A choice of different container engines is attractive for the software developer. One example of this is the file transfer service which ensures that the network to other high energy physics sites is kept busy but not overloaded with data transfers. The work to containerise this application was described at the recent CHEP 2016 FTS poster.

While deploying containers is an area of great interest for the software community, the key value comes from the physics applications exploiting containers to deliver a new way of working. The Swan project provides a tool for running ROOT, the High Energy Physics application framework, in a browser with easy access to the storage outlined above. A set of examples can be found at With the academic paper, the programs used and the data available from the notebook, this allows easy sharing with other physicists during the review process using CERNBox, CERN's owncloud based file sharing solution.

Another application being studied is which allows the general public to run analyses on LHC open data. Typical applications are Citizen Science and outreach for schools.

Ongoing Work

There are a few major items where we are working with the upstream community:
  • Cluster upgrades will allow us to upgrade the container software. Examples of this would be a new version of Fedora Atomic, Docker or the container engine. With a load balancer, this can be performed without downtime (spec)
  • Heterogeneous cluster support will allow nodes to have different flavors (cpu vs gpu, different i/o patterns, different AZs for improved failure scenarios). This is done by splitting the cluster nodes into node groups (blueprint)
  • Cluster monitoring to deploy Prometheus and cAdvisor with Grafana dashboards for easy monitoring of a Magnum cluster (blueprint).


Thursday, 29 September 2016

Hyperthreading in the cloud

The cloud at CERN is used for a variety of different purposes from running personal VMs for development/test, bulk throughput computing to analyse the data from the Large Hadron Collider to long running services for the experiments and the organisation.

The configuration of many of the hypervisors is carefully tuned to maximise the compute throughput, i.e. getting as much compute work done in a given time rather than optimising the individual job performance. Many of the workloads are also nearly all embarrassingly parallel, i.e. each unit of compute can be run without needing to communicate with other jobs. A few workloads, such as QCD, need classical High Performance Computing but these are running on dedicated clusters with Infiniband interconnect compared to the typical 1Gbit/s or 10Gbit/s ethernet for the typical hypervisor.

CERN has a public procurement procedure which awards tenders to the bid with the lowest price for a given throughput compliant with the specifications.  The typical CERN hardware configuration is based on a dual socket configuration and must have at least 2GB/core.

Intel provides a capability for doubling the number of cores on the underlying processor called Simultaneous multithreading or SMT. From the machine perspective, this appears as double the number of cores compared to non-SMT configurations. Enabling SMT requires a BIOS parameter change so resources need to be defined in advance and appropriate capacity planning to define the areas of the cloud which are SMT on or off statically.

The second benefit of an SMT off configuration is the memory per core doubles. A server with 32 SMT on cores and 64GB of memory with hyper-threading has 2GB per core. A change to 16 cores by dropping SMT leads to 4GB per core which can be useful for some workloads.

Setting the BIOS parameters for a subset of the hypervisors causes multiple difficulties
  • With older BIOSes, this is a manual operation. New tools are available on the most recent hardware so this is an operation which can be performed with a Linux program and a reboot.
  • A motherboard replacement requires that the operation is repeated. This can be overlooked as part of the standard repair activities.
  • Capacity planning requires allocation of appropriate blocks of servers. At CERN, we use OpenStack cells to allow the cloud to scale to our needs with each cells having a unique hardware configuration such as particular processor/memory configuration and thus dedicated cells need to be created for the SMT off machines. When these capacities are exceeded, the other unused cloud resources cannot be trivially used but further administration reconfiguration is required.
The reference benchmark for High Energy Physics is HEPSpec06, a subset of the Spec benchmarks which match the typical instruction workload. Using this, run in parallel on each of the cores in a machine, the throughput provided by a given configuration can be measured.

SMT VM configuration Throughput HS06
On2 VMs each 16 cores351
On4 VMs each 8 cores355
Off1 VM of 16 cores284.5

Thus, the total throughput of the server with SMT off is significantly less (284.5 compared to 351) but the individual core performance is higher (284.5/16=17.8 compared to 351/32=11). Where an experiment workflow is serialised for some of the steps, this higher single core performance was a significant gain, but at an operational cost.

To find a cheaper approach, the recent additions of NUMA flavors in OpenStack was used. The hypervisors were configured with SMT on but a flavor was created to only use half of the cores on the server with 4GB/core so that the hypervisors were under committed on cores but committed by memory to avoid another VM being allocated to the unused cores. In our configuration, this was done by adding numa_nodes=2 to the flavor and the NUMA aware scheduler does the appropriate allocation.

This configuration was benchmarked and compared with the SMT On/Off.

SMT VM configuration Throughput HS06
On2 VMs each 16 cores351
Off1 VM of 16 cores284.5
On1 VM of 16 cores with numa_nodes=2283

The new flavor shows similar characteristics to the SMT Off configuration without requiring the BIOS setting change and can therefore be deployed without needing the configuration of dedicated cells with a particular hardware configuration. The Linux and OpenStack schedulers appear to be allocating the appropriate distribution of cores across the processors.

Thursday, 22 September 2016

Our Cloud in Liberty

We have previously posted experiences with upgrades of OpenStack such as
The upgrade to Liberty for the CERN cloud was completed at the end of August. Working with the upstream OpenStack, Puppet and RDO communities, this went pretty smoothly without any issues so there is no significant advice to report. We followed the same approach as the past, gradually upgrading component by component.  With the LHC reaching it's highest data rates so far this year (over 10PB recorded to tape during June), the upgrades needed to be done without disturbing the running VMs.

Some hypervisors are still on Scientific Linux CERN 6 (Kilo) using Python 2.7 in a software collection but the backwards compatibility has allows the rest of the cloud to migrate while we complete the migration of 5000 VMs from old hardware and SLC6 to new hardware on CentOS 7 in the next few months.

After the migration, we did encounter a few problems:
The first two cases are being backported to Liberty so others who have not upgraded may not see these.

We've already started the Mitaka upgrade with Barbican, Magnum and Heat ahead of the other components as we enable a CERN production container service. The other components will follow in the next six months including the experiences of the migration to Neutron which we'll share at the OpenStack summit in October.