Tuesday, 2 May 2017

Migrating to Keystone Fernet Tokens

For several OpenStack releases now, the Identity service in OpenStack has offered an additional token format besides UUID, called Fernet. This token format has a series of advantages over UUID; the most prominent for us is that it does not need to be persisted. We were also interested in a speedup of token creation and validation.

At CERN, we have been using the UUID token format since the beginning of the cloud deployment in 2013. The keystone database normally holds around 300,000 tokens, each with an expiration of one day. In order to keep the database size under control, we run the token_flush procedure every hour.

In the Mitaka release, all remaining bugs were sorted out, and since the Ocata release of OpenStack, Fernet is the default. We are currently running keystone Mitaka and decided to migrate to the Fernet token format before upgrading to Ocata.

Before reviewing the upgrade from UUID to Fernet, let's have a brief look at the architecture of the identity service. The service resolves to a set of load balancers which redirect to a set of backends running keystone under Apache. This allows us to replace or add backends transparently.

The very first question is how many keys we would need. Taking the formula from [1]:

fernet_keys = validity / rotation_time + 2

With a validity of 24 hours and a rotation every 6 hours, we would need 24/6 + 2 = 6 keys.

As we have several backends, the first task is to distribute the keys among them. For that we use Puppet, which provisions the secrets in the /etc/keystone/fernet-keys folder. This ensures that a newly introduced backend will always have the latest set of keys available.

The second task is how to rotate them. In our case, a cronjob in our Rundeck installation rotates the secrets and introduces a new one; this job does essentially the same as the keystone-manage fernet_rotate command. One important aspect is that on each rotation you need to reload or restart the Apache daemon so that it picks up the new keys from disk.
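In our deployment Puppet and Rundeck take care of these steps, but as a rough sketch, the equivalent with the built-in tooling on a single backend would look as follows (crudini is just one way of editing keystone.conf; the max_active_keys value of 6 comes from the calculation above):

# Switch the token provider to Fernet and point to the key repository
$ crudini --set /etc/keystone/keystone.conf token provider fernet
$ crudini --set /etc/keystone/keystone.conf fernet_tokens key_repository /etc/keystone/fernet-keys
$ crudini --set /etc/keystone/keystone.conf fernet_tokens max_active_keys 6

# Create the initial key repository (run once, then distribute to all backends)
$ keystone-manage fernet_setup --keystone-user keystone --keystone-group keystone

# Rotate the keys, e.g. from a cron job every 6 hours, then reload Apache
$ keystone-manage fernet_rotate --keystone-user keystone --keystone-group keystone
$ systemctl reload httpd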

We prepared all these changes in the production setup quite some time ago and tested that the keys were correctly updated and distributed. On April 5th, we decided to go ahead. This is the picture of the API messages during the intervention.


There are two distinct steps in the API message usage. The first is a peak of invalid tokens, triggered by end users trying to validate UUID tokens after the change. The second peak is related to OpenStack services that use trusts, like Magnum and Heat. From our past experience, these services can be affected by a massive invalidation of tokens: the trust credentials are cached, so both services need to be restarted in order to obtain Fernet tokens.

Below is a picture of the size of the token table in the keystone database. Now that we are on Fernet, it is slowly going down to zero, thanks to the hourly token_flush job mentioned earlier.


The last picture is the response time of the identity service during the change. As you can see, the response time is better than with UUID, as stated in [2].


In the Ocata release, further improvements to the response time are on the way, and we are working to update the identity service in the near future.

References:

  1. Fernet token FAQ at https://docs.openstack.org/admin-guide/identity-fernet-token-faq.html
  2. Fernet token performance at http://blog.dolphm.com/openstack-keystone-fernet-tokens/
  3. Payloads for Fernet token format at http://blog.dolphm.com/inside-openstack-keystone-fernet-token-payloads/
  4. Deep dive into Fernet at https://developer.ibm.com/opentech/2015/11/11/deep-dive-keystone-fernet-tokens/

Thursday, 9 February 2017

os_type property for Windows images on KVM


OpenStack images have a long list of properties which can be set to describe the image metadata. The full list is described in the documentation. This blog reviews some of these settings for Windows guests running on KVM, in particular Windows 7 and Windows Server 2008 R2.

At CERN, we've used a number of these properties to help users filter images, such as the OS distribution and version, but we have also added some additional properties for specific purposes such as

  • when the image was released (so the images can be sorted by date)
  • whether the image is the latest recommended one (such as setting the CentOS 7.2 image to not recommended when CentOS 7.3 comes out)
  • which CERN support team provided the image 

For a typical Windows image, we have

$ glance image-show 9e194003-4608-4fe3-b073-00bd2a774a57
+-------------------+----------------------------------------------------------------+
| Property          | Value                                                          |
+-------------------+----------------------------------------------------------------+
| architecture      | x86_64                                                         |
| checksum          | 27f9cf3e1c7342671a7a0978f5ff288d                               |
| container_format  | bare                                                           |
| created_at        | 2017-01-27T16:08:46Z                                           |
| direct_url        | rbd://b4f463a0-c671-43a8-bd36-e40ab8d233d2/images/9e194003-4   |
| disk_format       | raw                                                            |
| hypervisor_type   | qemu                                                           |
| id                | 9e194003-4608-4fe3-b073-00bd2a774a57                           |
| min_disk          | 40                                                             |
| min_ram           | 0                                                              |
| name              | Windows 10 - With Apps [2017-01-27]                            |
| os                | WINDOWS                                                        |
| os_distro         | Windows                                                        |
| os_distro_major   | w10entx64                                                      |
| os_edition        | DESKTOP                                                        |
| os_version        | UNKNOWN                                                        |
| owner             | 7380e730-d36c-44dc-aa87-a2522ac5345d                           |
| protected         | False                                                          |
| recommended       | true                                                           |
| release_date      | 2017-01-27                                                     |
| size              | 37580963840                                                    |
| status            | active                                                         |
| tags              | []                                                             |
| updated_at        | 2017-01-30T13:56:48Z                                           |
| upstream_provider | https://cern.service-now.com/service-portal/function.do?name   |
| virtual_size      | None                                                           |
| visibility        | public                                                         |
+-------------------+----------------------------------------------------------------+

Recently, we have seen some cases of Windows guests becoming unavailable with the BSOD error "CLOCK_WATCHDOG_TIMEOUT (101)".  On further investigation, these tended to occur around times of heavy load on the hypervisors such as another guest doing CPU intensive work.

Windows 7 and Windows Server 2008 R2 were the guest OSes where these problems were observed. Later OS levels did not seem to show the same problem.

We followed the standard processes to make sure the drivers were all updated but the problem still occurred.

While looking into the root cause, we found the Red Hat support articles to be a significant help.

"In the environment described above, it is possible that 'CLOCK_WATCHDOG_TIMEOUT (101)' BSOD errors could be due to high load within the guest itself. With virtual guests, tasks may take more time that expected on a physical host. If Windows guests are aware that they are running on top of a Microsoft Hyper-V host, additional measures are taken to ensure that the guest takes this into account, reducing the likelihood of the guest producing a BSOD due to time-outs being triggered."

These suggested using the os_type parameter to inform the hypervisor to set some additional flags. However, the OpenStack documentation described this as a XenAPI-only setting (which would therefore not apply to KVM hypervisors).

It is not always clear which parameters to set for an OpenStack image. os_distro takes a value such as 'windows' or 'ubuntu', so the flavour of the OS could in principle be determined from it, but it is the os_type setting that is actually used by the code.

Thus, in order to get the best behaviour for Windows guests, from our experience we would recommend setting both os_distro and os_type as follows (an example command is shown after the list).
  • os_distro = 'windows'
  • os_type = 'windows'
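As a sketch, both properties can be set on an existing image with the glance client, here using the image from the listing above (openstack image set --property would achieve the same):

$ glance image-update --property os_distro=windows \
    --property os_type=windows 9e194003-4608-4fe3-b073-00bd2a774a57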
When the os_type parameter is set, some additional XML is added to the KVM configuration following the Kilo enhancement.

<features>
  <acpi/>
  <apic/>
  <hyperv>
    <relaxed state='on'/>
    <vapic state='on'/>
    <spinlocks state='on' retries='8191'/>
  </hyperv>
</features>
...
<clock offset='localtime'>
  <timer name='pit' tickpolicy='delay'/>
  <timer name='rtc' tickpolicy='catchup'/>
  <timer name='hpet' present='no'/>
  <timer name='hypervclock' present='yes'/>
</clock>
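To check that the flags were applied on a given guest, the libvirt domain XML can be inspected directly on the hypervisor (the instance name below is illustrative):

$ virsh dumpxml instance-0000abcd | grep -A 5 '<hyperv>'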

These changes have led to an improvement when running on loaded hypervisors, especially for Windows 7 and 2008 R2 guests. A bug has been opened for the documentation to explain that the setting is not Xen-only.

Acknowledgements

  • Jose Castro Leon performed all of the analysis and testing of the various solutions.


Monday, 9 January 2017

Containers on the CERN cloud

We have recently made the Container-Engine-as-a-Service (Magnum) available in production at CERN as part of the CERN IT department services for the LHC experiments and other CERN communities. This gives the OpenStack cloud users Kubernetes, Mesos and Docker Swarm on demand within the accounting, quota and project permissions structures already implemented for virtual machines.

We shared the latest news on the service with the CERN technical staff (link). This is a follow-up on the tests presented at OpenStack Barcelona (link) and covered in the blog from IBM. The work has been helped by collaborations with Rackspace in the framework of the CERN openlab and the European Union Horizon 2020 Indigo Datacloud project.

Performance

At the Barcelona summit, we presented with Rackspace and IBM on the additional performance tests carried out since the previous blog post. We pushed beyond the 2M requests/s to reach around 7M, where some network infrastructure issues unrelated to OpenStack limited further scaling.

As we created the clusters, the deployment time increased only slightly with the number of nodes, as most of the work is done in parallel. But for clusters of 128 nodes or more, the increase in time started to scale almost linearly. At the Barcelona summit, the Heat and Magnum teams worked together to develop proposals on how to improve this further in future releases, although a 1000 node cluster in 23 minutes is still a good result.


Cluster Size (Nodes)   Concurrency   Deployment Time (min)
2                      50            2.5
16                     10            4
32                     10            4
128                    5             5.5
512                    1             14
1000                   1             23

Storage

With the LHC producing nearly 50PB this year, High Energy Physics has some custom storage technologies for specific purposes: EOS for physics data, and CVMFS for read-only, highly replicated content such as application software.

One of the features of providing a private cloud service to the CERN users is to combine the functionality of open source community software such as OpenStack with the specific needs of high energy physics. For these to work, some careful driver work is needed to ensure appropriate access while respecting user rights. In particular,
  • EOS provides a disk-based storage system providing high-capacity and low-latency access for users at CERN. Typical use cases are where scientists are analysing data from the experiments.
  • CVMFS is used as a scalable, reliable and low-maintenance file system for read-only data such as software.
There are also other storage solutions we use at CERN such as
  • HDFS for long term archiving of data using Hadoop, which uses an HDFS driver within the container. HDFS works in user space, so no particular integration was required to use it from inside (unprivileged) containers.
  • Cinder provides additional disk space using volumes if the basic flavor does not have sufficient space. This Cinder integration is offered by upstream Magnum, and work was done in the last OpenStack cycle to improve security by adding support for Keystone trusts.
CVMFS was more straightforward as there is no need to authenticate the user. The data is read-only and can be exposed to any container. The access to the file system is provided using a driver (link) which has been adapted to run inside a container. This saves having to run additional software inside the VM hosting the container.

EOS requires authentication through mechanisms such as Kerberos to identify the user and thus determine what files they have access to. Here a container is run per user so that there is no risk of credential sharing. The details are in the driver (link).

Service Model

One interesting question that came up during the discussions of the container service was how to deliver the service to the end users. There are several scenarios:
  1. The end user launches a container engine with their specifications but they rely on the IT department to maintain the engine availability. This implies that the VMs running the container engine are not accessible to the end user.
  2. The end user launches the engine within a project that they administer. While the IT department maintains the templates and basic functions such as the Fedora Atomic images, the end user is in control of the upgrades and availability.
  3. A variation of option 2, where the nodes running containers are reachable and managed by the end user, but the container engine master nodes are managed by the IT department. This is similar to the current offer from Google Container Engine and requires some coordination and policies regarding upgrades.
Currently, the default Magnum model is option 2, and adding option 3 is something we could do in the near future. As users become more interested in consuming containers, we may investigate option 1 further.

Applications

Many applications in use at CERN are in the process of being reworked into a microservices-based architecture. A choice of different container engines is attractive for the software developer. One example is the file transfer service, which ensures that the network to other high energy physics sites is kept busy but not overloaded with data transfers. The work to containerise this application was described in the recent CHEP 2016 FTS poster.

While deploying containers is an area of great interest for the software community, the key value comes from the physics applications exploiting containers to deliver a new way of working. The Swan project provides a tool for running ROOT, the High Energy Physics application framework, in a browser with easy access to the storage outlined above. A set of examples can be found at https://swan.web.cern.ch/notebook-galleries. With the academic paper, the programs used and the data available from the notebook, this allows easy sharing with other physicists during the review process using CERNBox, CERN's owncloud based file sharing solution.



Another application being studied is http://opendata.cern.ch/?ln=en which allows the general public to run analyses on LHC open data. Typical applications are Citizen Science and outreach for schools.

Ongoing Work

There are a few major items where we are working with the upstream community:
  • Cluster upgrades will allow us to upgrade the container software. Examples of this would be a new version of Fedora Atomic, Docker or the container engine. With a load balancer, this can be performed without downtime (spec).
  • Heterogeneous cluster support will allow nodes to have different flavors (CPU vs GPU, different I/O patterns, different AZs for improved failure scenarios). This is done by splitting the cluster nodes into node groups (blueprint).
  • Cluster monitoring to deploy Prometheus and cAdvisor with Grafana dashboards for easy monitoring of a Magnum cluster (blueprint).


Thursday, 29 September 2016

Hyperthreading in the cloud

The cloud at CERN is used for a variety of different purposes from running personal VMs for development/test, bulk throughput computing to analyse the data from the Large Hadron Collider to long running services for the experiments and the organisation.

The configuration of many of the hypervisors is carefully tuned to maximise compute throughput, i.e. getting as much compute work done in a given time rather than optimising individual job performance. The workloads are also nearly all embarrassingly parallel, i.e. each unit of compute can be run without needing to communicate with other jobs. A few workloads, such as QCD, need classical High Performance Computing, but these run on dedicated clusters with InfiniBand interconnects rather than the 1Gbit/s or 10Gbit/s Ethernet of the typical hypervisor.

CERN has a public procurement procedure which awards tenders to the bid with the lowest price for a given throughput compliant with the specifications. The typical CERN hardware configuration is based on dual socket servers and must have at least 2GB/core.

Intel provides a capability for doubling the number of logical cores on the underlying processor, called Simultaneous Multithreading (SMT) or Hyper-Threading. From the machine perspective, this appears as double the number of cores compared to a non-SMT configuration. Enabling SMT requires a BIOS parameter change, so resources need to be defined in advance, with appropriate capacity planning to define the areas of the cloud where SMT is statically on or off.

Another benefit of an SMT-off configuration is that the memory per core doubles. A server with 32 SMT-on cores and 64GB of memory has 2GB per core; dropping SMT to 16 cores gives 4GB per core, which can be useful for some workloads.

Setting the BIOS parameters for a subset of the hypervisors causes multiple difficulties
  • With older BIOSes, this is a manual operation. New tools are available on the most recent hardware so this is an operation which can be performed with a Linux program and a reboot.
  • A motherboard replacement requires that the operation is repeated. This can be overlooked as part of the standard repair activities.
  • Capacity planning requires allocation of appropriate blocks of servers. At CERN, we use OpenStack cells to allow the cloud to scale to our needs, with each cell having a unique hardware configuration (such as a particular processor/memory combination), so dedicated cells need to be created for the SMT-off machines. When these capacities are exceeded, the other unused cloud resources cannot be trivially used; further administrative reconfiguration is required.
The reference benchmark for High Energy Physics is HEPSpec06, a subset of the SPEC benchmarks which matches the typical instruction workload. Using this, run in parallel on each of the cores of a machine, the throughput provided by a given configuration can be measured.

SMT   VM configuration       Throughput (HS06)
On    2 VMs each 16 cores    351
On    4 VMs each 8 cores     355
Off   1 VM of 16 cores       284.5

Thus, the total throughput of the server with SMT off is significantly less (284.5 compared to 351) but the individual core performance is higher (284.5/16=17.8 compared to 351/32=11). Where an experiment workflow is serialised for some of the steps, this higher single core performance was a significant gain, but at an operational cost.

To find a cheaper approach, the recent addition of NUMA flavors in OpenStack was used. The hypervisors were configured with SMT on, but a flavor was created to use only half of the cores on the server with 4GB/core, so that the hypervisors were under-committed on cores but fully committed on memory, to avoid another VM being allocated to the unused cores. In our configuration, this was done by adding numa_nodes=2 to the flavor, and the NUMA-aware scheduler does the appropriate allocation.
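A sketch of creating such a flavor with the standard client (name and sizes are illustrative; the corresponding extra spec key in upstream Nova is hw:numa_nodes):

# 16 vCPUs with 4GB/core, spread across the two NUMA nodes of the host
$ openstack flavor create m2.numa --vcpus 16 --ram 65536 --disk 160
$ openstack flavor set m2.numa --property hw:numa_nodes=2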

This configuration was benchmarked and compared with the SMT On/Off.

SMT   VM configuration                      Throughput (HS06)
On    2 VMs each 16 cores                   351
Off   1 VM of 16 cores                      284.5
On    1 VM of 16 cores with numa_nodes=2    283

The new flavor shows similar characteristics to the SMT-off configuration without requiring the BIOS setting change, and can therefore be deployed without needing dedicated cells with a particular hardware configuration. The Linux and OpenStack schedulers appear to allocate the appropriate distribution of cores across the processors.






Thursday, 22 September 2016

Our Cloud in Liberty

We have previously posted our experiences with OpenStack upgrades. The upgrade to Liberty for the CERN cloud was completed at the end of August. Working with the upstream OpenStack, Puppet and RDO communities, this went pretty smoothly without any issues, so there is no significant advice to report. We followed the same approach as in the past, gradually upgrading component by component. With the LHC reaching its highest data rates so far this year (over 10PB recorded to tape during June), the upgrades needed to be done without disturbing the running VMs.

Some hypervisors are still on Scientific Linux CERN 6 (Kilo) using Python 2.7 in a software collection, but the backwards compatibility has allowed the rest of the cloud to migrate while we complete the migration of 5000 VMs from old hardware and SLC6 to new hardware on CentOS 7 in the next few months.

After the migration, we did encounter a few problems. The first two cases are being backported to Liberty, so others who have not yet upgraded may not see them.

We've already started the Mitaka upgrade with Barbican, Magnum and Heat ahead of the other components as we enable a CERN production container service. The other components will follow in the next six months, including the migration to Neutron, the experiences of which we'll share at the OpenStack summit in October.



Friday, 17 June 2016

Scaling Magnum and Kubernetes: 2 million requests per second

Two months ago, we described in this blog post how we deployed OpenStack Magnum in the CERN cloud. It is available as a pre-production service and we're steadily moving towards full production mode as a standard part of the CERN IT service offerings to give Containers-as-a-Service.

As part of this effort, we've started testing the upgrade procedures, the latest being to the final Mitaka release. If you're here to see some fancy load tests, keep reading below, but first some interesting details on the upgrade:
  • We build our own RPMs to include a few patches from post-Mitaka upstream (the most important being the trustee user to support lifecycle operations on the bays) and some CERN customizations (removal of neutron LBaaS and floating IPs, which we don't yet have, adding the CERN Certificate Authority, ...). Check here for the patches and build procedure.
  • We build our own Fedora Atomic 23 image to get more recent versions of docker and kubernetes (1.10 and 1.2 respectively), plus support for an internal distributed filesystem called CVMFS. We use the upstream diskimage-builder procedure with a few additional elements available here.
While discussing how we could further test the service, we thought of this kubernetes blog post, achieving 1 million requests per second against a service running on a kubernetes cluster. We thought we could probably do the same. Requirements included:
  • kubernetes 1.2, which our recent upgrade offered
  • available resources to deploy the cluster, and luckily we were installing a new batch of a few hundred physical nodes which could be used for a day or two
So along with the upgrade, Bertrand and Mathieu got to work to test this setup and we quickly got it up and running.

Quick summary of the setup (a sketch of the equivalent client commands follows the list):
  • 1 kubernetes bay
  • 1 master node, 16 cores (not really needed but why not)
  • 200 minions, 4 cores each
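For reference, a bay of roughly this shape would be created along the following lines with the Mitaka-era magnum client (keypair, network, image and flavor names are illustrative; our actual templates carry the CERN-specific patches mentioned above):

$ magnum baymodel-create --name k8sbaymodel \
    --image-id fedora-atomic-latest \
    --keypair-id mykey \
    --external-network-id ext-net \
    --flavor-id m1.medium \
    --master-flavor-id m1.xlarge \
    --network-driver flannel \
    --coe kubernetes
$ magnum bay-create --name kubebay --baymodel k8sbaymodel --node-count 200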
In total there are 800 cores, which matches the cluster used in the original test. How did our test go?



We ended up trying a bit more and doubled the number to 2 million requests per second :)



We learned a few things on the way:
  • set Heat's max_resources_per_stack to something big (the setting is sketched after this list). Magnum stacks create a lot of resources, and with bays of hundreds of nodes the value gets high enough that unlimited (-1) is tempting; we have it like that now. It leaves the option for people to deploy a stack with so many resources that Heat could break, so we'll investigate what the best value is
  • while creating and deleting many large bays, Heat shows errors like 'TimeoutError: QueuePool limit of size ... overflow ... reached', which we've seen in the past for other OpenStack services. We'll contribute the patch to fix it upstream if it's not there yet
  • latency values get high even before the 1 million barrier; we'll check the demo code and our setup further (using local disk, in this case SSDs, instead of the default volume attachment in Magnum should help)
  • Heat timeout and retry configuration values need to be tuned to manage very large stacks. We're still not sure what the best values are, but will update the post once we have them
  • Magnum shows 'Too many files opened' errors, we also have a fix to contribute for this one
  • Nova, Cinder (bay nodes use a volume), Keystone and all other OpenStack services scaled beautifully. Our cloud usually has a rate of ~150 VMs created and deleted per hour; here's the plot for the test period. We eventually tried bays of up to 1000 nodes
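For the first point, the setting lives in heat.conf on the Heat engine nodes; a minimal sketch (crudini as one way to edit the file, service name as packaged on RDO/CentOS, -1 meaning unlimited):

$ crudini --set /etc/heat/heat.conf DEFAULT max_resources_per_stack -1
$ systemctl restart openstack-heat-engine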


And what's next? 
  • Larger bays: at the end of these tests we deployed a few bigger bays with 300, 500 and 1000 nodes. And in just a couple weeks there will be a new batch of physical nodes arriving, so we plan to upgrade Heat to Mitaka and build on the recent upstream work (by Spyros together with Ton and Winnie from IBM) adding Magnum scenarios to Rally to run additional scale tests and see where it breaks
  • Bay lifecycle: we stopped at launching a large number of requests in a bay, next we would like to perform bay operations (update of number of nodes, node replacement) and see which issues (if any) we find in Magnum
  • New features: lots of upstream work going on, so we'll do regular Magnum upgrades (cinder support, improved bay monitoring, support for some additional internal systems at CERN)
And there's also Swarm and Mesos, we plan on testing those soon as well. And kubernetes updated their test, so stay tuned...

Acknowledgements

  • Bertrand Noel, Mathieu Velten and Spyros Trigazis from CERN IT, for the work upstream and integrating Magnum at CERN, and on getting these demos running
  • Rackspace for their support within the CERN Openlab on running containers at scale
  • Indigo Datacloud building a platform as a service for e-science in Europe
  • Kubernetes for an awesome tool and the nice demo
  • All in the CERN OpenStack Cloud team, for a great service (especially Davide Michelino and Belmiro Moreira for all the work integrating Neutron at CERN)
  • The upstream Magnum team, for building what is now looking like a great service, and we look forward for what's coming next (bay drivers, bare metal support, and much more)
  • Tim, Arne and Jan for letting us use the new hardware for a few days

Saturday, 30 April 2016

Resource management at CERN



As part of the recent OpenStack summit in Austin, the Scientific Working Group was established to look into how scientific organisations can best make use of OpenStack clouds.

During our discussions with more than 70 people (etherpad), we converged on 4 top areas to look at and started to analyse the approaches and common needs. The areas were:
  1. Parallel file system support in Manila. There are a number of file systems supported by Manila but many High Performance Computing sites (HPC) use Lustre which is focussed on the needs of the HPC user community.
  2. Bare metal management looking at how to deploy bare metal for the maximum performance within the OpenStack frameworks for identity, quota and networking. This team will work on understanding additional needs with the OpenStack Ironic project.
  3. Accounting covering the wide range of needs to track usage of resources and showback/chargeback to the appropriate user communities.
  4. Stories is addressing how we collect requirements from the scientific use cases and work with the OpenStack community teams, such as the Product working group, to include these into the development roadmaps along with defining reference architectures on how to cover common use cases such as high performance or high throughput computing clouds in the scientific domain.
Most of the applications run at CERN are high throughput, embarrassingly parallel applications. Simulation and analysis of collisions such as those in the LHC can be farmed out to different compute resources, with each event being handled independently and no need for fast interconnects. While the working group will cover all the areas (and some outside this list), our focus is on accounting (3).

Given limited time available, it was not possible for each of the interested members of the accounting team to explain their environment. This blog is intended to provide the details of the CERN cloud usage, the approach to resource management and some areas where OpenStack could provide additional function to improve the way we manage the accounting process. Within the Scientific Working group, these stories will be refined and reviewed to produce specifications and identify the potential communities who could start on the development.

CERN Pledges

The CERN cloud provides computing resources for the Large Hadron Collider and other experiments. The cloud is currently around 160,000 cores in total, spread across two data centres in Geneva and Budapest. Resources are managed worldwide with the Worldwide LHC Computing Grid (WLCG), which executes over 2 million jobs per day. Compute resources in the WLCG are allocated via a pledge model. Rather than direct funding from the experiments or the WLCG, the sites, supported by their government agencies, commit to provide compute capacity and storage for a period of time as a pledge, and these are recorded in the REBUS system. These are then made available using a variety of middleware technologies.

Given the allocation of resources across hundreds of sites, the experiments then select the appropriate models to place their workloads at each site according to compute/storage/networking capabilities. Some sites will be suitable for simulation of collisions (high CPU, low storage and network). Others provide archival storage and significant storage IOPS for more data intensive applications. For storage, the pledges are made in capacity on disk and tape. The compute resource capacity is pledged in Kilo-HepSpec06 units, abbreviated to kHS06 (based on a subset of the SPEC 2006 benchmark), which allows faster processors to be given a higher weight in the pledge compared to slower ones (as High Energy Physics computing is an embarrassingly parallel, high throughput computing problem).

The pledges are reviewed on a regular basis to check the requests are consistent with the experiments’ computing models, the allocated resources are being used efficiently and the pledges are compatible with the requests.

Within the WLCG, CERN provides the Tier-0 resources for the safe keeping of the raw data and performs the first pass at reconstructing the raw data into meaningful information. The Tier-0 distributes the raw data and the reconstructed output to Tier 1s, and reprocesses data when the LHC is not running.

Procurement Process

The purchases for the Tier-0 compute pledge are translated into a formal procurement process. Given that the annual orders exceed 750 KCHF, the process requires a formal procedure:
  • A market survey to determine which companies in the CERN member states could reply to requests in general areas such as compute servers or disk storage. Typical criteria would be the size of the company, the level of certification with component vendors and offering products in the relevant area (such as industry standard servers) 
  • A tender which specifies the technical specifications and quantity for which an offer is required. These are adjudicated on the lowest cost compliant with specifications criteria. Cost in this case is defined as the cost of the material over 3 years including warranty, power, rack and network infrastructure needed. The quantity is specified in terms of kHS06 with 2GB/core and 20GB storage/core which means that the suppliers are free to try different combinations of top bin processors which may be a little more expensive or lower performing ones which would then require more total memory and storage. Equally, the choice of motherboard components has significant flexibility within the required features such as 19” rack compatible and enterprise quality drives. The typical winning configurations recently have been white box manufacturers.
  • Following testing of the proposed systems to ensure compliance, an order is placed with several suppliers, and the machines are manufactured, delivered, racked up and burnt in using a set of high load stress tests to identify issues such as cooling or firmware problems.
Typical volumes are around 2,000 servers a year in one or two rounds of procurement. The process from start to delivered capacity takes around 280 days, so bulk purchases are needed, followed by allocation to users rather than ordering on request. If issues are found, this process can take significantly longer.

Step                                 Time (Days)             Elapsed (Days)
User expresses requirement           0                       0
Market Survey prepared               15                      15
Market Survey for possible vendors   30                      45
Specifications prepared              15                      60
Vendor responses                     30                      90
Test systems evaluated               30                      120
Offers adjudicated                   10                      130
Finance committee                    30                      160
Hardware delivered                   90                      250
Burn in and acceptance               30 (380 worst case)     280
Total                                                        280+

Given the time the process takes, there are only one to two procurement processes run per year. This means that a continuous delivery model cannot be used and therefore there is a need for capacity planning on an annual basis and to find approaches to use the resources before they are allocated out to their final purpose.

Physical Infrastructure

CERN manages two data centres in Meyrin, Geneva and Wigner, Budapest. The full data is available at the CERN data centre overview page. When hardware is procured, the final destination is defined as part of the order according to rack space, cooling and electrical availability. 




While the installations in Budapest are new installations, some of the Geneva installations involve replacing old hardware. We typically retire hardware between 4 and 5 years old when the CPU power/watt is significantly better with new purchases and the hardware repair costs for new equipment are more predictable and sustainable.

Within the Geneva centre, there are two significant areas, physics and redundant power. Physics power has a single power source which is expected to fail in the event of an electricity cut lasting beyond the few minutes supported by the battery units. The redundant power area is backed by diesels. The Wigner centre is entirely redundant.

Lifecycle

With an annual procurement cycle and 2-3 vendors per cycle, each with their own optimisations to arrive at the lowest cost for the specifications, the hardware is highly heterogeneous. This has a significant benefit when there are issues, such as disk firmware or BMC controllers, that delay one of the deliveries being accepted: the remaining hardware can still be made available to the experiments.

However, since we run the machines for the 3 year warranty and then some additional years on minimal repairs (i.e. simple parts are replaced with components from servers of the same series), we have around 15-20 different hardware configurations for compute servers active in the centre at any time. There are variations between the specifications (as technologies such as SSDs and 10Gb Ethernet became commodity, the new tenders required them) and between vendor responses to the same specifications (e.g. slower memory or different processor models).

These combinations do mean that offering standard flavors for each hardware combination would be very confusing for the users, given that there is no easy way for a user to know if resources are available in a particular flavor except to try to create a VM with that flavor.

Given new hardware deliveries and limited space, there are equivalent retirement campaigns. The aim is to replace the older hardware with more efficient newer boxes that can deliver more HS06 within the same power/cooling envelope. The process to empty machines depends on the workloads running on the servers. Batch workloads generally finish within a couple of weeks, so setting the servers to no longer accept new work just before the retirements is sufficient. For servers and personal build/test machines, we aim to migrate the workloads to capacity on new servers. This operation is increasingly performed using live migration and MPLS to extend the broadcast domains for networks to the new capacity.

Projects and Quota

All new users are allocated a project, “Personal XXXX” where XXXX is their CERN account when they subscribe to the CERN cloud service through the CERN resource portal. The CERN resource portal is the entry point to subscribe to the many services available from the central IT department and for users to list their currently active subscriptions and allocations. The personal projects have a minimal quota for a few cores and GBs of block storage so that users can easily follow the tutorial steps on using the cloud and create simple VMs for their own needs. The default image set is available on personal projects along with the standard ‘m’ flavors which are similar to the ones on AWS.

Shared projects can also be requested for activities which are related to an experiment or department. For these resources, a list of people can be defined as administrators (through CERN’s group management system e-groups) and a quota for cores, memory and disk space requested. Additional flavors can also be asked for according to particular needs such as the CERNVM flavors with a small system disk and a large ephemeral one.

The project requests go through a manual approval process, being reviewed by the IT resource management to check the request against the pledge and the available resources. An adjustment of the share of the central batch farm is also made so that the sum of resources for an experiment continues to be within the pledge.

Once the resource request has been manually approved, the ticket is then passed to the cloud team for execution. Rundeck provides us with a simple tool for performing high privilege operations with good logging and recovery. This is used in many of our workflows such as hardware repair. The Rundeck procedure reads the quota settings from the ticket and executes the appropriate project creation and role allocation requests to Keystone, Nova and Cinder.
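Behind the scenes, the steps executed by that procedure amount to something like the following (project name, e-group and quota values are illustrative, and the role name depends on the deployment):

$ openstack project create --description "Higgs analysis" higgs-analysis
$ openstack role add --project higgs-analysis --group higgs-admins Member
$ openstack quota set --cores 500 --ram 1024000 --gigabytes 10000 higgs-analysis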

Need #1 : CPU performance based allocation and scheduling

As with all requests, there is a mixture of requirements and implementation. The needs are stated according to our current understanding; there may be alternative approaches or compromises which would address these needs in common with other user requirements. One of the aims of the Scientific Working Group is to expose these ideas to other similar users, adapt them to meet the general community and work with the Product Working Group, user committee and developers on an approach. The following are the areas where CERN would be interested in improvements in the underlying software technologies.

The resource manager for the cloud in a high throughput/performance computing environment allocates resources based on performance rather than pure core count.

A user wants to request a processor of a particular minimum performance and is willing to have a larger reduction in his remaining quota.

A user wants to make a request for resources and then iterate according to how much performance related quota they have left to fill the available quota, e.g. give me a VM with less than a certain performance rating.

A resource manager would like to encourage the use of older resources which are often idle.

The quota for slower and faster cores is currently the same (thus users create a VM and delete it if it turns out to be one of the slower type), so there is no incentive to use the slower cores.

As a resource manager preparing an accounting report, the faster cores should have a higher weight against the pledge to ensure the slower cores continue to be used.

The proposal is therefore to have an additional, optional quota on CPU units so that resource managers can allocate out total throughput rather than per core capacities.

An alternative approach would be to define a number of flavors, one for each of the hardware types, each with its own quota. However, there are a number of drawbacks with this approach:
  • The number of flavors to define would be significant (in CERN’s case, around 15-20 different hardware configurations multiplied by 4-5 sizes for small, medium, large, xlarge, ...)
  • The user experience impact would be significant as the user would have to iterate over the available flavors to find free capacity. For example, trying out m4.large first and finding the capacity was all used, then trying m3.large etc.
There is, as far as I know, no per-flavor quota. The specs have been discussed in some operator feedback sessions but the specification did not reach consensus.

Extensible resource tracking, with 'compute units' (such as here) as defined in the specification, seems to be heading in this direction. Many parts of this are not yet implemented, so it is not easy to see whether it addresses the requirements.

Need #2 : Nested Quotas

As an experiment resource co-ordinator, it should be possible to re-allocate resources according to the priorities of the experiment without needing action by the administrators. Thus, moving quotas between projects which are managed by the experiment resource co-ordinators within the pledge allocated by the WLCG.

Nested keystone projects have been in the production release since Kilo. This gives the possibility for role definitions within the nested project structure.
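As a sketch, with the Identity API v3 a child project can be created under a parent using the standard client (project names are illustrative):

$ openstack project create lhcb
$ openstack project create --parent lhcb lhcb-analysis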

The implementation of the nested quota function has been discussed at various summits over the past 3 years. The first implementation proposal, Boson, was for a dedicated service for quota management. However, the PTLs raised concerns about the impact of this approach on performance and maintainability. The alternative of enhancing the quotas in each of the projects has been followed instead (such as in Nova). These implementations, though, have not advanced due to other concerns with quota management, which are leading towards a common library, delimiter, being discussed for Newton.

Need #3 : Spot Market

As a cloud provider, uncommitted resources should be made available at a lower cost but at a lower service level, such as pre-emption and termination at short notice. This mirrors the AWS spot market or the Google Pre-emptible instances. The benefits would be higher utilization of the resources and ability to provide elastic capacity for reserved instances by reducing the spot resources.

A draft specification for this functionality has been submitted along with the proposal which is currently being reviewed. An initial implementation of this functionality (called OpenStack Preemptible Instances Extension, or opie) will be made available soon on github following the work on Indigo Datacloud by IFCA. A demo video is available on YouTube.

Need #4 : Reducing quota below utilization

As an experiment resource co-ordinator, quotas are under regular adjustment to meet the chosen priorities. Where a project has a lower priority but high current utilization, further resource usage should be blocked but existing resources not deleted since the user may still need to complete the processing on those VMs. The resource co-ordinator can then contact the user to encourage the appropriate resources to be deleted.

To achieve this function, one approach would be to allow the quota to be set below the current utilization in order to give the project administrator the time to identify the resources which would be best to be deleted in view of the reduced capacity.
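The operation itself is an ordinary quota update (project name and values are illustrative); the point of this need is that such an update should be accepted even when the new values are below the current usage, simply blocking new allocations:

$ openstack quota set --cores 100 --ram 204800 higgs-analysis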

Need #5 : Components without quota

As a cloud provider, we see some inventive users storing significant quantities of data in the Glance image service. There is only a maximum image size limit; with no limit on the accumulated capacity, this leaves the service open to unplanned storage use.

It is proposed to add quota functionality inside Glance for
  • The total capacity of images stored in Glance
  • The total capacity of snapshots stored in Glance
The number of images and snapshots would be a lower priority enhancement request, since the service risk comes from the total capacity, although the numbers could potentially also be abused.

Given that this is new functionality, it could also be a candidate for the first usage of the new delimiter library.

Need #6 : VM Expiration

As a private cloud provider, some VMs should be time limited to ensure that they are still required by their end users, and should automatically expire if there is no confirmation. Personal projects are often in this category, as users will launch test instances but fail to delete them on completion of the test. In these cases, the users should be required to confirm that they still need the resources and that these are within their currently allocated quota (Need #4).

Acknowledgements

There are too many people who have been involved in the resource management activities to list here. The teams contributing to the description above are:
  • CERN IT teams supporting cloud, batch, accounting and quota
  • BARC, Mumbai for collaborating around the implementation of nested quota
  • Indigo Datacloud team for work on the spot market in OpenStack
  • Other labs in the WLCG and the Scientific Working Group