Tuesday, 2 May 2017

Migrating to Keystone Fernet Tokens

For several OpenStack releases now, the Identity service in OpenStack has offered an additional token format besides UUID, called Fernet. This token format has a series of advantages over UUID; the most prominent for us is that it does not need to be persisted. We were also interested in a speedup of token creation and validation.

At CERN, we have been using the UUID token format since the beginning of the cloud deployment in 2013. The keystone database normally holds around 300,000 tokens, each with an expiration of one day. In order to keep the database size under control, we run the token_flush procedure every hour.

In the Mitaka release, all remaining bugs were sorted out, and since the Ocata release of OpenStack, Fernet is the default. We are currently running keystone Mitaka and decided to migrate to the Fernet token format before upgrading to Ocata.

Before reviewing the upgrade from UUID to Fernet, let's have a brief look at the architecture of the identity service. The service resolves to a set of load balancers which redirect to a set of backends running keystone under Apache. This allows us to replace or add backends transparently.

The very first question is how many keys we would need. Taking the formula from [1]:

fernet_keys = validity / rotation_time + 2

With a validity of 24 hours and a rotation every 6 hours, we would need 24/6 + 2 = 6 keys.

As we have several backends, the first task is to distribute the keys among them. For that we use Puppet, which provisions the secrets in the /etc/keystone/fernet-keys folder. This ensures that a newly introduced backend will always have the latest set of keys available.

The second task is how to rotate them. In our case, a cronjob in our Rundeck installation rotates the secrets and introduces a new one; this job does essentially the same as the keystone-manage fernet_rotate command. One important aspect is that on each rotation you need to reload or restart the Apache daemon so that it picks up the new keys from disk.
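In our deployment Puppet and Rundeck take care of these steps, but as a rough sketch, the equivalent with the built-in tooling on a single backend would look as follows (crudini is just one way of editing keystone.conf; the max_active_keys value of 6 comes from the calculation above):

# Switch the token provider to Fernet and point to the key repository
$ crudini --set /etc/keystone/keystone.conf token provider fernet
$ crudini --set /etc/keystone/keystone.conf fernet_tokens key_repository /etc/keystone/fernet-keys
$ crudini --set /etc/keystone/keystone.conf fernet_tokens max_active_keys 6

# Create the initial key repository (run once, then distribute to all backends)
$ keystone-manage fernet_setup --keystone-user keystone --keystone-group keystone

# Rotate the keys, e.g. from a cron job every 6 hours, then reload Apache
$ keystone-manage fernet_rotate --keystone-user keystone --keystone-group keystone
$ systemctl reload httpd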

We prepared all these changes in the production setup quite some time ago and tested that the keys were correctly updated and distributed. On April 5th, we decided to go ahead. This is the picture of the API messages during the intervention.


There are two distinct steps in the API message usage. The first is a peak of invalid tokens, triggered by end users trying to validate UUID tokens after the change. The second peak is related to OpenStack services that use trusts, like Magnum and Heat. From our past experience, these services can be affected by a massive invalidation of tokens: the trust credentials are cached, so both services need to be restarted in order to obtain Fernet tokens.

Below is a picture of the size of the token table in the keystone database. Now that we are on Fernet, it is slowly going down to zero, thanks to the hourly token_flush job mentioned earlier.


The last picture is the response time of the identity service during the change. As you can see, the response time is better than with UUID, as stated in [2].


In the Ocata release, further improvements to the response time are on the way, and we are working to update the identity service in the near future.

References:

  1. Fernet token FAQ at https://docs.openstack.org/admin-guide/identity-fernet-token-faq.html
  2. Fernet token performance at http://blog.dolphm.com/openstack-keystone-fernet-tokens/
  3. Payloads for Fernet token format at http://blog.dolphm.com/inside-openstack-keystone-fernet-token-payloads/
  4. Deep dive into Fernet at https://developer.ibm.com/opentech/2015/11/11/deep-dive-keystone-fernet-tokens/

Thursday, 9 February 2017

os_type property for Windows images on KVM


OpenStack images have a long list of properties which can be set to describe the image metadata. The full list is described in the documentation. This blog reviews some of these settings for Windows guests running on KVM, in particular Windows 7 and Windows Server 2008 R2.

At CERN, we've used a number of these properties to help users filter images, such as the OS distribution and version, but we have also added some additional properties for specific purposes such as

  • when the image was released (so the images can be sorted by date)
  • whether the image is the latest recommended one (such as setting the CentOS 7.2 image to not recommended when CentOS 7.3 comes out)
  • which CERN support team provided the image 

For a typical Windows image, we have

$ glance image-show 9e194003-4608-4fe3-b073-00bd2a774a57
+-------------------+----------------------------------------------------------------+
| Property          | Value                                                          |
+-------------------+----------------------------------------------------------------+
| architecture      | x86_64                                                         |
| checksum          | 27f9cf3e1c7342671a7a0978f5ff288d                               |
| container_format  | bare                                                           |
| created_at        | 2017-01-27T16:08:46Z                                           |
| direct_url        | rbd://b4f463a0-c671-43a8-bd36-e40ab8d233d2/images/9e194003-4   |
| disk_format       | raw                                                            |
| hypervisor_type   | qemu                                                           |
| id                | 9e194003-4608-4fe3-b073-00bd2a774a57                           |
| min_disk          | 40                                                             |
| min_ram           | 0                                                              |
| name              | Windows 10 - With Apps [2017-01-27]                            |
| os                | WINDOWS                                                        |
| os_distro         | Windows                                                        |
| os_distro_major   | w10entx64                                                      |
| os_edition        | DESKTOP                                                        |
| os_version        | UNKNOWN                                                        |
| owner             | 7380e730-d36c-44dc-aa87-a2522ac5345d                           |
| protected         | False                                                          |
| recommended       | true                                                           |
| release_date      | 2017-01-27                                                     |
| size              | 37580963840                                                    |
| status            | active                                                         |
| tags              | []                                                             |
| updated_at        | 2017-01-30T13:56:48Z                                           |
| upstream_provider | https://cern.service-now.com/service-portal/function.do?name   |
| virtual_size      | None                                                           |
| visibility        | public                                                         |
+-------------------+----------------------------------------------------------------+

Recently, we have seen some cases of Windows guests becoming unavailable with the BSOD error "CLOCK_WATCHDOG_TIMEOUT (101)".  On further investigation, these tended to occur around times of heavy load on the hypervisors such as another guest doing CPU intensive work.

Windows 7 and Windows Server 2008 R2 were the guest OSes where these problems were observed. Later OS levels did not seem to show the same problem.

We followed the standard processes to make sure the drivers were all updated but the problem still occurred.

While looking into the root cause, we found the Red Hat support articles to be a significant help.

"In the environment described above, it is possible that 'CLOCK_WATCHDOG_TIMEOUT (101)' BSOD errors could be due to high load within the guest itself. With virtual guests, tasks may take more time that expected on a physical host. If Windows guests are aware that they are running on top of a Microsoft Hyper-V host, additional measures are taken to ensure that the guest takes this into account, reducing the likelihood of the guest producing a BSOD due to time-outs being triggered."

These suggested using the os_type parameter to inform the hypervisor to set some additional flags. However, the OpenStack documentation described this as a XenAPI-only setting (which would therefore not apply to KVM hypervisors).

It is not always clear which parameters to set for an OpenStack image. os_distro takes a value such as 'windows' or 'ubuntu', so the flavour of the OS could in principle be determined from it, but it is the os_type setting that is actually used by the code.

Thus, in order to get the best behaviour for Windows guests, from our experience we would recommend setting both os_distro and os_type as follows (an example command is shown after the list).
  • os_distro = 'windows'
  • os_type = 'windows'
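As a sketch, both properties can be set on an existing image with the glance client, here using the image from the listing above (openstack image set --property would achieve the same):

$ glance image-update --property os_distro=windows \
    --property os_type=windows 9e194003-4608-4fe3-b073-00bd2a774a57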
When the os_type parameter is set, some additional XML is added to the KVM configuration following the Kilo enhancement.

<features>
  <acpi/>
  <apic/>
  <hyperv>
    <relaxed state='on'/>
    <vapic state='on'/>
    <spinlocks state='on' retries='8191'/>
  </hyperv>
</features>
...
<clock offset='localtime'>
  <timer name='pit' tickpolicy='delay'/>
  <timer name='rtc' tickpolicy='catchup'/>
  <timer name='hpet' present='no'/>
  <timer name='hypervclock' present='yes'/>
</clock>
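To check that the flags were applied on a given guest, the libvirt domain XML can be inspected directly on the hypervisor (the instance name below is illustrative):

$ virsh dumpxml instance-0000abcd | grep -A 5 '<hyperv>'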

These changes have led to an improvement when running on loaded hypervisors, especially for Windows 7 and 2008 R2 guests. A bug has been opened for the documentation to explain that the setting is not Xen-only.

Acknowledgements

  • Jose Castro Leon performed all of the analysis and testing of the various solutions.


Monday, 9 January 2017

Containers on the CERN cloud

We have recently made the Container-Engine-as-a-Service (Magnum) available in production at CERN as part of the CERN IT department services for the LHC experiments and other CERN communities. This gives the OpenStack cloud users Kubernetes, Mesos and Docker Swarm on demand within the accounting, quota and project permissions structures already implemented for virtual machines.

We shared the latest news on the service with the CERN technical staff (link). This is a follow-up on the tests presented at OpenStack Barcelona (link) and covered in the blog from IBM. The work has been helped by collaborations with Rackspace in the framework of the CERN openlab and the European Union Horizon 2020 Indigo Datacloud project.

Performance

At the Barcelona summit, we presented with Rackspace and IBM on the additional performance tests carried out since the previous blog post. We pushed beyond the 2M requests/s to reach around 7M, where some network infrastructure issues unrelated to OpenStack limited further scaling.

As we created the clusters, the deployment time increased only slightly with the number of nodes, as most of the work is done in parallel. But for clusters of 128 nodes or more, the increase in time started to scale almost linearly. At the Barcelona summit, the Heat and Magnum teams worked together to develop proposals on how to improve this further in future releases, although a 1000 node cluster in 23 minutes is still a good result.


Cluster Size (Nodes)   Concurrency   Deployment Time (min)
2                      50            2.5
16                     10            4
32                     10            4
128                    5             5.5
512                    1             14
1000                   1             23

Storage

With the LHC producing nearly 50PB this year, High Energy Physics has some custom storage technologies for specific purposes: EOS for physics data, and CVMFS for read-only, highly replicated content such as application software.

One of the features of providing a private cloud service to the CERN users is to combine the functionality of open source community software such as OpenStack with the specific needs of high energy physics. For these to work, some careful driver work is needed to ensure appropriate access while respecting user rights. In particular,
  • EOS provides a disk-based storage system providing high-capacity and low-latency access for users at CERN. Typical use cases are where scientists are analysing data from the experiments.
  • CVMFS is used as a scalable, reliable and low-maintenance file system for read-only data such as software.
There are also other storage solutions we use at CERN such as
  • HDFS for long term archiving of data using Hadoop, which uses an HDFS driver within the container. HDFS works in user space, so no particular integration was required to use it from inside (unprivileged) containers.
  • Cinder provides additional disk space using volumes if the basic flavor does not have sufficient space. This Cinder integration is offered by upstream Magnum, and work was done in the last OpenStack cycle to improve security by adding support for Keystone trusts.
CVMFS was more straightforward as there is no need to authenticate the user. The data is read-only and can be exposed to any container. The access to the file system is provided using a driver (link) which has been adapted to run inside a container. This saves having to run additional software inside the VM hosting the container.

EOS requires authentication through mechanisms such as Kerberos to identify the user and thus determine what files they have access to. Here a container is run per user so that there is no risk of credential sharing. The details are in the driver (link).

Service Model

One interesting question that came up during the discussions of the container service was how to deliver the service to the end users. There are several scenarios:
  1. The end user launches a container engine with their specifications but they rely on the IT department to maintain the engine availability. This implies that the VMs running the container engine are not accessible to the end user.
  2. The end user launches the engine within a project that they administer. While the IT department maintains the templates and basic functions such as the Fedora Atomic images, the end user is in control of the upgrades and availability.
  3. A variation of option 2, where the nodes running containers are reachable and managed by the end user, but the container engine master nodes are managed by the IT department. This is similar to the current offer from Google Container Engine and requires some coordination and policies regarding upgrades.
Currently, the default Magnum model is option 2, and adding option 3 is something we could do in the near future. As users become more interested in consuming containers, we may investigate option 1 further.

Applications

Many applications in use at CERN are in the process of being reworked into a microservices-based architecture. A choice of different container engines is attractive for the software developer. One example is the file transfer service, which ensures that the network to other high energy physics sites is kept busy but not overloaded with data transfers. The work to containerise this application was described in the recent CHEP 2016 FTS poster.

While deploying containers is an area of great interest for the software community, the key value comes from the physics applications exploiting containers to deliver a new way of working. The Swan project provides a tool for running ROOT, the High Energy Physics application framework, in a browser with easy access to the storage outlined above. A set of examples can be found at https://swan.web.cern.ch/notebook-galleries. With the academic paper, the programs used and the data available from the notebook, this allows easy sharing with other physicists during the review process using CERNBox, CERN's owncloud based file sharing solution.



Another application being studied is http://opendata.cern.ch/?ln=en which allows the general public to run analyses on LHC open data. Typical applications are Citizen Science and outreach for schools.

Ongoing Work

There are a few major items where we are working with the upstream community:
  • Cluster upgrades will allow us to upgrade the container software. Examples of this would be a new version of Fedora Atomic, Docker or the container engine. With a load balancer, this can be performed without downtime (spec).
  • Heterogeneous cluster support will allow nodes to have different flavors (CPU vs GPU, different I/O patterns, different AZs for improved failure scenarios). This is done by splitting the cluster nodes into node groups (blueprint).
  • Cluster monitoring to deploy Prometheus and cAdvisor with Grafana dashboards for easy monitoring of a Magnum cluster (blueprint).


Thursday, 29 September 2016

Hyperthreading in the cloud

The cloud at CERN is used for a variety of different purposes from running personal VMs for development/test, bulk throughput computing to analyse the data from the Large Hadron Collider to long running services for the experiments and the organisation.

The configuration of many of the hypervisors is carefully tuned to maximise compute throughput, i.e. getting as much compute work done in a given time rather than optimising individual job performance. The workloads are also nearly all embarrassingly parallel, i.e. each unit of compute can be run without needing to communicate with other jobs. A few workloads, such as QCD, need classical High Performance Computing, but these run on dedicated clusters with InfiniBand interconnects rather than the 1Gbit/s or 10Gbit/s Ethernet of the typical hypervisor.

CERN has a public procurement procedure which awards tenders to the bid with the lowest price for a given throughput compliant with the specifications. The typical CERN hardware configuration is based on dual socket servers and must have at least 2GB/core.

Intel provides a capability for doubling the number of logical cores on the underlying processor, called Simultaneous Multithreading (SMT) or Hyper-Threading. From the machine perspective, this appears as double the number of cores compared to a non-SMT configuration. Enabling SMT requires a BIOS parameter change, so resources need to be defined in advance, with appropriate capacity planning to define the areas of the cloud where SMT is statically on or off.

Another benefit of an SMT-off configuration is that the memory per core doubles. A server with 32 SMT-on cores and 64GB of memory has 2GB per core; dropping SMT to 16 cores gives 4GB per core, which can be useful for some workloads.

Setting the BIOS parameters for a subset of the hypervisors causes multiple difficulties
  • With older BIOSes, this is a manual operation. New tools are available on the most recent hardware so this is an operation which can be performed with a Linux program and a reboot.
  • A motherboard replacement requires that the operation is repeated. This can be overlooked as part of the standard repair activities.
  • Capacity planning requires allocation of appropriate blocks of servers. At CERN, we use OpenStack cells to allow the cloud to scale to our needs, with each cell having a unique hardware configuration (such as a particular processor/memory combination), so dedicated cells need to be created for the SMT-off machines. When these capacities are exceeded, the other unused cloud resources cannot be trivially used; further administrative reconfiguration is required.
The reference benchmark for High Energy Physics is HEPSpec06, a subset of the SPEC benchmarks which matches the typical instruction workload. Using this, run in parallel on each of the cores of a machine, the throughput provided by a given configuration can be measured.

SMT   VM configuration       Throughput (HS06)
On    2 VMs each 16 cores    351
On    4 VMs each 8 cores     355
Off   1 VM of 16 cores       284.5

Thus, the total throughput of the server with SMT off is significantly less (284.5 compared to 351) but the individual core performance is higher (284.5/16=17.8 compared to 351/32=11). Where an experiment workflow is serialised for some of the steps, this higher single core performance was a significant gain, but at an operational cost.

To find a cheaper approach, the recent addition of NUMA flavors in OpenStack was used. The hypervisors were configured with SMT on, but a flavor was created to use only half of the cores on the server with 4GB/core, so that the hypervisors were under-committed on cores but fully committed on memory, to avoid another VM being allocated to the unused cores. In our configuration, this was done by adding numa_nodes=2 to the flavor, and the NUMA-aware scheduler does the appropriate allocation.
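A sketch of creating such a flavor with the standard client (name and sizes are illustrative; the corresponding extra spec key in upstream Nova is hw:numa_nodes):

# 16 vCPUs with 4GB/core, spread across the two NUMA nodes of the host
$ openstack flavor create m2.numa --vcpus 16 --ram 65536 --disk 160
$ openstack flavor set m2.numa --property hw:numa_nodes=2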

This configuration was benchmarked and compared with the SMT On/Off.

SMT   VM configuration                      Throughput (HS06)
On    2 VMs each 16 cores                   351
Off   1 VM of 16 cores                      284.5
On    1 VM of 16 cores with numa_nodes=2    283

The new flavor shows similar characteristics to the SMT-off configuration without requiring the BIOS setting change, and can therefore be deployed without needing dedicated cells with a particular hardware configuration. The Linux and OpenStack schedulers appear to allocate the appropriate distribution of cores across the processors.






Thursday, 22 September 2016

Our Cloud in Liberty

We have previously posted our experiences with OpenStack upgrades. The upgrade to Liberty for the CERN cloud was completed at the end of August. Working with the upstream OpenStack, Puppet and RDO communities, this went pretty smoothly without any issues, so there is no significant advice to report. We followed the same approach as in the past, gradually upgrading component by component. With the LHC reaching its highest data rates so far this year (over 10PB recorded to tape during June), the upgrades needed to be done without disturbing the running VMs.

Some hypervisors are still on Scientific Linux CERN 6 (Kilo) using Python 2.7 in a software collection, but the backwards compatibility has allowed the rest of the cloud to migrate while we complete the migration of 5000 VMs from old hardware and SLC6 to new hardware on CentOS 7 in the next few months.

After the migration, we did encounter a few problems. The first two cases are being backported to Liberty, so others who have not yet upgraded may not see them.

We've already started the Mitaka upgrade with Barbican, Magnum and Heat ahead of the other components as we enable a CERN production container service. The other components will follow in the next six months, including the migration to Neutron, the experiences of which we'll share at the OpenStack summit in October.



Friday, 17 June 2016

Scaling Magnum and Kubernetes: 2 million requests per second

Two months ago, we described in this blog post how we deployed OpenStack Magnum in the CERN cloud. It is available as a pre-production service and we're steadily moving towards full production mode as a standard part of the CERN IT service offerings to give Containers-as-a-Service.

As part of this effort, we've started testing the upgrade procedures, the latest being to the final Mitaka release. If you're here to see some fancy load tests, keep reading below, but first some interesting details on the upgrade:
  • We build our own RPMs to include a few patches from post-Mitaka upstream (the most important being the trustee user to support lifecycle operations on the bays) and some CERN customizations (removal of neutron LBaaS and floating IPs, which we don't yet have, adding the CERN Certificate Authority, ...). Check here for the patches and build procedure.
  • We build our own Fedora Atomic 23 image to get more recent versions of docker and kubernetes (1.10 and 1.2 respectively), plus support for an internal distributed filesystem called CVMFS. We use the upstream diskimage-builder procedure with a few additional elements available here.
While discussing how we could further test the service, we thought of this kubernetes blog post, achieving 1 million requests per second against a service running on a kubernetes cluster. We thought we could probably do the same. Requirements included:
  • kubernetes 1.2, which our recent upgrade offered
  • available resources to deploy the cluster, and luckily we were installing a new batch of a few hundred physical nodes which could be used for a day or two
So along with the upgrade, Bertrand and Mathieu got to work to test this setup and we quickly got it up and running.

Quick summary of the setup (a sketch of the equivalent client commands follows the list):
  • 1 kubernetes bay
  • 1 master node, 16 cores (not really needed but why not)
  • 200 minions, 4 cores each
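For reference, a bay of roughly this shape would be created along the following lines with the Mitaka-era magnum client (keypair, network, image and flavor names are illustrative; our actual templates carry the CERN-specific patches mentioned above):

$ magnum baymodel-create --name k8sbaymodel \
    --image-id fedora-atomic-latest \
    --keypair-id mykey \
    --external-network-id ext-net \
    --flavor-id m1.medium \
    --master-flavor-id m1.xlarge \
    --network-driver flannel \
    --coe kubernetes
$ magnum bay-create --name kubebay --baymodel k8sbaymodel --node-count 200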
In total there are 800 cores, which matches the cluster used in the original test. How did our test go?



We ended up trying a bit more and doubled the number to 2 million requests per second :)



We learned a few things on the way:
  • set Heat's max_resources_per_stack to something big (the setting is sketched after this list). Magnum stacks create a lot of resources, and with bays of hundreds of nodes the value gets high enough that unlimited (-1) is tempting; we have it like that now. It leaves the option for people to deploy a stack with so many resources that Heat could break, so we'll investigate what the best value is
  • while creating and deleting many large bays, Heat shows errors like 'TimeoutError: QueuePool limit of size ... overflow ... reached', which we've seen in the past for other OpenStack services. We'll contribute the patch to fix it upstream if it's not there yet
  • latency values get high even before the 1 million barrier; we'll check the demo code and our setup further (using local disk, in this case SSDs, instead of the default volume attachment in Magnum should help)
  • Heat timeout and retry configuration values need to be tuned to manage very large stacks. We're still not sure what the best values are, but will update the post once we have them
  • Magnum shows 'Too many files opened' errors, we also have a fix to contribute for this one
  • Nova, Cinder (bay nodes use a volume), Keystone and all other OpenStack services scaled beautifully. Our cloud usually has a rate of ~150 VMs created and deleted per hour; here's the plot for the test period. We eventually tried bays of up to 1000 nodes
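For the first point, the setting lives in heat.conf on the Heat engine nodes; a minimal sketch (crudini as one way to edit the file, service name as packaged on RDO/CentOS, -1 meaning unlimited):

$ crudini --set /etc/heat/heat.conf DEFAULT max_resources_per_stack -1
$ systemctl restart openstack-heat-engine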


And what's next? 
  • Larger bays: at the end of these tests we deployed a few bigger bays with 300, 500 and 1000 nodes. And in just a couple weeks there will be a new batch of physical nodes arriving, so we plan to upgrade Heat to Mitaka and build on the recent upstream work (by Spyros together with Ton and Winnie from IBM) adding Magnum scenarios to Rally to run additional scale tests and see where it breaks
  • Bay lifecycle: we stopped at launching a large number of requests in a bay, next we would like to perform bay operations (update of number of nodes, node replacement) and see which issues (if any) we find in Magnum
  • New features: lots of upstream work going on, so we'll do regular Magnum upgrades (cinder support, improved bay monitoring, support for some additional internal systems at CERN)
And there's also Swarm and Mesos, we plan on testing those soon as well. And kubernetes updated their test, so stay tuned...

Acknowledgements

  • Bertrand Noel, Mathieu Velten and Spyros Trigazis from CERN IT, for the work upstream and integrating Magnum at CERN, and on getting these demos running
  • Rackspace for their support within the CERN Openlab on running containers at scale
  • Indigo Datacloud building a platform as a service for e-science in Europe
  • Kubernetes for an awesome tool and the nice demo
  • All in the CERN OpenStack Cloud team, for a great service (especially Davide Michelino and Belmiro Moreira for all the work integrating Neutron at CERN)
  • The upstream Magnum team, for building what is now looking like a great service, and we look forward for what's coming next (bay drivers, bare metal support, and much more)
  • Tim, Arne and Jan for letting us use the new hardware for a few days

Saturday, 30 April 2016

Resource management at CERN



As part of the recent OpenStack summit in Austin, the Scientific Working Group was established to look into how scientific organisations can best make use of OpenStack clouds.

During our discussions with more than 70 people (etherpad), we converged on 4 top areas to look at and started to analyse the approaches and common needs. The areas were:
  1. Parallel file system support in Manila. There are a number of file systems supported by Manila but many High Performance Computing sites (HPC) use Lustre which is focussed on the needs of the HPC user community.
  2. Bare metal management looking at how to deploy bare metal for the maximum performance within the OpenStack frameworks for identity, quota and networking. This team will work on understanding additional needs with the OpenStack Ironic project.
  3. Accounting covering the wide range of needs to track usage of resources and showback/chargeback to the appropriate user communities.
  4. Stories is addressing how we collect requirements from the scientific use cases and work with the OpenStack community teams, such as the Product working group, to include these into the development roadmaps along with defining reference architectures on how to cover common use cases such as high performance or high throughput computing clouds in the scientific domain.
Most of the applications run at CERN are high throughput, embarrassingly parallel applications. Simulation and analysis of collisions such as those in the LHC can be farmed out to different compute resources, with each event being handled independently and no need for fast interconnects. While the working group will cover all the areas (and some outside this list), our focus is on accounting (3).

Given limited time available, it was not possible for each of the interested members of the accounting team to explain their environment. This blog is intended to provide the details of the CERN cloud usage, the approach to resource management and some areas where OpenStack could provide additional function to improve the way we manage the accounting process. Within the Scientific Working group, these stories will be refined and reviewed to produce specifications and identify the potential communities who could start on the development.

CERN Pledges

The CERN cloud provides computing resources for the Large Hadron Collider and other experiments. The cloud is currently around 160,000 cores in total, spread across two data centres in Geneva and Budapest. Resources are managed worldwide with the Worldwide LHC Computing Grid (WLCG), which executes over 2 million jobs per day. Compute resources in the WLCG are allocated via a pledge model. Rather than direct funding from the experiments or the WLCG, the sites, supported by their government agencies, commit to provide compute capacity and storage for a period of time as a pledge, and these are recorded in the REBUS system. These are then made available using a variety of middleware technologies.

Given the allocation of resources across hundreds of sites, the experiments then select the appropriate models to place their workloads at each site according to compute/storage/networking capabilities. Some sites will be suitable for simulation of collisions (high CPU, low storage and network). Others provide archival storage and significant storage IOPS for more data intensive applications. For storage, the pledges are made in capacity on disk and tape. The compute resource capacity is pledged in Kilo-HepSpec06 units, abbreviated to kHS06 (based on a subset of the SPEC 2006 benchmark), which allows faster processors to be given a higher weight in the pledge compared to slower ones (as High Energy Physics computing is an embarrassingly parallel, high throughput computing problem).

The pledges are reviewed on a regular basis to check the requests are consistent with the experiments’ computing models, the allocated resources are being used efficiently and the pledges are compatible with the requests.

Within the WLCG, CERN provides the Tier-0 resources for the safe keeping of the raw data and performs the first pass at reconstructing the raw data into meaningful information. The Tier-0 distributes the raw data and the reconstructed output to Tier 1s, and reprocesses data when the LHC is not running.

Procurement Process

The purchases for the Tier-0 compute pledge are translated into a formal procurement process. Given that the annual orders exceed 750 KCHF, the process requires a formal procedure:
  • A market survey to determine which companies in the CERN member states could reply to requests in general areas such as compute servers or disk storage. Typical criteria would be the size of the company, the level of certification with component vendors and offering products in the relevant area (such as industry standard servers) 
  • A tender which specifies the technical specifications and quantity for which an offer is required. These are adjudicated on the lowest cost compliant with specifications criteria. Cost in this case is defined as the cost of the material over 3 years including warranty, power, rack and network infrastructure needed. The quantity is specified in terms of kHS06 with 2GB/core and 20GB storage/core which means that the suppliers are free to try different combinations of top bin processors which may be a little more expensive or lower performing ones which would then require more total memory and storage. Equally, the choice of motherboard components has significant flexibility within the required features such as 19” rack compatible and enterprise quality drives. The typical winning configurations recently have been white box manufacturers.
  • Following testing of the proposed systems to ensure compliance, an order is placed with several suppliers, and the machines are manufactured, delivered, racked up and burnt in using a set of high load stress tests to identify issues such as cooling or firmware problems.
Typical volumes are around 2,000 servers a year in one or two rounds of procurement. The process from start to delivered capacity takes around 280 days, so bulk purchases are needed, followed by allocation to users rather than ordering on request. If issues are found, this process can take significantly longer.

Step                                 Time (Days)             Elapsed (Days)
User expresses requirement           0                       0
Market Survey prepared               15                      15
Market Survey for possible vendors   30                      45
Specifications prepared              15                      60
Vendor responses                     30                      90
Test systems evaluated               30                      120
Offers adjudicated                   10                      130
Finance committee                    30                      160
Hardware delivered                   90                      250
Burn in and acceptance               30 (380 worst case)     280
Total                                                        280+

Given the time the process takes, there are only one to two procurement processes run per year. This means that a continuous delivery model cannot be used and therefore there is a need for capacity planning on an annual basis and to find approaches to use the resources before they are allocated out to their final purpose.

Physical Infrastructure

CERN manages two data centres in Meyrin, Geneva and Wigner, Budapest. The full data is available at the CERN data centre overview page. When hardware is procured, the final destination is defined as part of the order according to rack space, cooling and electrical availability. 




While the installations in Budapest are new installations, some of the Geneva installations involve replacing old hardware. We typically retire hardware between 4 and 5 years old when the CPU power/watt is significantly better with new purchases and the hardware repair costs for new equipment are more predictable and sustainable.

Within the Geneva centre, there are two significant areas, physics and redundant power. Physics power has a single power source which is expected to fail in the event of an electricity cut lasting beyond the few minutes supported by the battery units. The redundant power area is backed by diesels. The Wigner centre is entirely redundant.

Lifecycle

With an annual procurement cycle and 2-3 vendors per cycle, each with their own optimisations to arrive at the lowest cost for the specifications, the hardware is highly heterogeneous. This has a significant benefit when there are issues, such as disk firmware or BMC controllers, that delay one of the deliveries being accepted: the remaining hardware can still be made available to the experiments.

However, since we run the machines for the 3 year warranty and then some additional years on minimal repairs (i.e. simple parts are replaced with components from servers of the same series), we have around 15-20 different hardware configurations for compute servers active in the centre at any time. There are variations between the specifications (as technologies such as SSDs and 10Gb Ethernet became commodity, the new tenders required them) and between vendor responses to the same specifications (e.g. slower memory or different processor models).

These combinations do mean that offering standard flavors for each hardware combination would be very confusing for the users, given that there is no easy way for a user to know if resources are available in a particular flavor except to try to create a VM with that flavor.

Given new hardware deliveries and limited space, there are equivalent retirement campaigns. The aim is to replace the older hardware with more efficient newer boxes that can deliver more HS06 within the same power/cooling envelope. The process to empty machines depends on the workloads running on the servers. Batch workloads generally finish within a couple of weeks, so setting the servers to no longer accept new work just before the retirements is sufficient. For servers and personal build/test machines, we aim to migrate the workloads to capacity on new servers. This operation is increasingly performed using live migration and MPLS to extend the broadcast domains for networks to the new capacity.

Projects and Quota

All new users are allocated a project, “Personal XXXX” where XXXX is their CERN account when they subscribe to the CERN cloud service through the CERN resource portal. The CERN resource portal is the entry point to subscribe to the many services available from the central IT department and for users to list their currently active subscriptions and allocations. The personal projects have a minimal quota for a few cores and GBs of block storage so that users can easily follow the tutorial steps on using the cloud and create simple VMs for their own needs. The default image set is available on personal projects along with the standard ‘m’ flavors which are similar to the ones on AWS.

Shared projects can also be requested for activities which are related to an experiment or department. For these resources, a list of people can be defined as administrators (through CERN’s group management system e-groups) and a quota for cores, memory and disk space requested. Additional flavors can also be asked for according to particular needs such as the CERNVM flavors with a small system disk and a large ephemeral one.

The project requests go through a manual approval process, being reviewed by the IT resource management to check the request against the pledge and the available resources. An adjustment of the share of the central batch farm is also made so that the sum of resources for an experiment continues to be within the pledge.

Once the resource request has been manually approved, the ticket is then passed to the cloud team for execution. Rundeck provides us with a simple tool for performing high privilege operations with good logging and recovery. This is used in many of our workflows such as hardware repair. The Rundeck procedure reads the quota settings from the ticket and executes the appropriate project creation and role allocation requests to Keystone, Nova and Cinder.
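Behind the scenes, the steps executed by that procedure amount to something like the following (project name, e-group and quota values are illustrative, and the role name depends on the deployment):

$ openstack project create --description "Higgs analysis" higgs-analysis
$ openstack role add --project higgs-analysis --group higgs-admins Member
$ openstack quota set --cores 500 --ram 1024000 --gigabytes 10000 higgs-analysis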

Need #1 : CPU performance based allocation and scheduling

As with all requests, there is a mixture of requirements and implementation. The needs are stated according to our current understanding; there may be alternative approaches or compromises which would address these needs in common with other user requirements. One of the aims of the Scientific Working Group is to expose these ideas to other similar users, adapt them to meet the general community and work with the Product Working Group, user committee and developers on an approach. The following are the areas where CERN would be interested in improvements in the underlying software technologies.

The resource manager for the cloud in a high throughput/performance computing environment allocates resources based on performance rather than pure core count.

A user wants to request a processor of a particular minimum performance and is willing to have a larger reduction in his remaining quota.

A user wants to make a request for resources and then iterate according to how much performance related quota they have left to fill the available quota, e.g. give me a VM with less than a certain performance rating.

A resource manager would like to encourage the use of older resources which are often idle.

The quota for slower and faster cores is currently the same (thus users create a VM and delete it if it turns out to be one of the slower type), so there is no incentive to use the slower cores.

As a resource manager preparing an accounting report, the faster cores should have a higher weight against the pledge to ensure the slower cores continue to be used.

The proposal is therefore to have an additional, optional quota on CPU units so that resource managers can allocate out total throughput rather than per core capacities.

An alternative approach would be to define a number of flavors, one for each of the hardware types, each with its own quota. However, there are a number of drawbacks with this approach:
  • The number of flavors to define would be significant (in CERN’s case, around 15-20 different hardware configurations multiplied by 4-5 sizes for small, medium, large, xlarge, ...)
  • The user experience impact would be significant as the user would have to iterate over the available flavors to find free capacity. For example, trying out m4.large first and finding the capacity was all used, then trying m3.large etc.
There is, as far as I know, no per-flavor quota. The specs have been discussed in some operator feedback sessions but the specification did not reach consensus.

Extensible resource tracking, with 'compute units' (such as here) as defined in the specification, seems to be heading in this direction. Many parts of this are not yet implemented, so it is not easy to see whether it addresses the requirements.

Need #2 : Nested Quotas

As an experiment resource co-ordinator, it should be possible to re-allocate resources according to the priorities of the experiment without needing action by the administrators. Thus, moving quotas between projects which are managed by the experiment resource co-ordinators within the pledge allocated by the WLCG.

Nested keystone projects have been in the production release since Kilo. This gives the possibility for role definitions within the nested project structure.
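As a sketch, with the Identity API v3 a child project can be created under a parent using the standard client (project names are illustrative):

$ openstack project create lhcb
$ openstack project create --parent lhcb lhcb-analysis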

The implementation of the nested quota function has been discussed at various summits over the past 3 years. The first implementation proposal, Boson, was for a dedicated service for quota management. However, the PTLs raised concerns about the impact of this approach on performance and maintainability. The alternative of enhancing the quotas in each of the projects has been followed instead (such as in Nova). These implementations, though, have not advanced due to other concerns with quota management, which are leading towards a common library, delimiter, being discussed for Newton.

Need #3 : Spot Market

As a cloud provider, uncommitted resources should be made available at a lower cost but at a lower service level, such as pre-emption and termination at short notice. This mirrors the AWS spot market or the Google Pre-emptible instances. The benefits would be higher utilization of the resources and ability to provide elastic capacity for reserved instances by reducing the spot resources.

A draft specification for this functionality has been submitted along with the proposal which is currently being reviewed. An initial implementation of this functionality (called OpenStack Preemptible Instances Extension, or opie) will be made available soon on github following the work on Indigo Datacloud by IFCA. A demo video is available on YouTube.

Need #4 : Reducing quota below utilization

As an experiment resource co-ordinator, quotas are under regular adjustment to meet the chosen priorities. Where a project has a lower priority but high current utilization, further resource usage should be blocked but existing resources not deleted since the user may still need to complete the processing on those VMs. The resource co-ordinator can then contact the user to encourage the appropriate resources to be deleted.

To achieve this function, one approach would be to allow the quota to be set below the current utilization in order to give the project administrator the time to identify the resources which would be best to be deleted in view of the reduced capacity.
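The operation itself is an ordinary quota update (project name and values are illustrative); the point of this need is that such an update should be accepted even when the new values are below the current usage, simply blocking new allocations:

$ openstack quota set --cores 100 --ram 204800 higgs-analysis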

Need #5 : Components without quota

As a cloud provider, we see some inventive users storing significant quantities of data in the Glance image service. There is only a maximum image size limit; with no limit on the accumulated capacity, this leaves the service open to unplanned storage use.

It is proposed to add quota functionality inside Glance for
  • The total capacity of images stored in Glance
  • The total capacity of snapshots stored in Glance
The number of images and snapshots would be a lower priority enhancement request, since the service risk comes from the total capacity, although the numbers could potentially also be abused.

Given that this is new functionality, it could also be a candidate for the first usage of the new delimiter library.

Need #6 : VM Expiration

As a private cloud provider, some VMs should be time limited to ensure that they are still required by their end users, and should automatically expire if there is no confirmation. Personal projects are often in this category, as users will launch test instances but fail to delete them on completion of the test. In these cases, the users should be required to confirm that they still need the resources and that these are within their currently allocated quota (Need #4).

Acknowledgements

There are too many people who have been involved in the resource management activities to list here. The teams contributing to the description above are:
  • CERN IT teams supporting cloud, batch, accounting and quota
  • BARC, Mumbai for collaborating around the implementation of nested quota
  • Indigo Datacloud team for work on the spot market in OpenStack
  • Other labs in the WLCG and the Scientific Working Group