Saturday, 1 August 2015

OpenStack CPU topology for High Throughput Computing

We are starting to look at the latest features of OpenStack Kilo as part of the CERN OpenStack cloud to optimise a number of different compute intensive applications.

We'll break down the tips and techniques into a series of short blog posts. A corresponding set of changes to the upstream documentation will also be made to ensure the options are fully documented.

In the modern CPU world, a server consists of multiple levels of processing units.
  • Sockets, where each of the processor chips is inserted
  • Cores, where each processor contains multiple processing units which can run multiple processes in parallel
  • Threads, where (if settings such as SMT are enabled) each core may run multiple processing threads at the expense of sharing that core
The typical hardware used at CERN is a 2-socket system. This provides the optimum price/performance for our typical high throughput applications, which simulate and process events from the Large Hadron Collider. The aim is not to process a single event as quickly as possible but rather to process the maximum number of events within a given time (within the total computing budget available). As the price of processors varies according to their performance, the selected systems are often not the fastest possible but the ones which give the best performance/CHF.

A typical example of this approach is our use of SMT, which leads to a 20% increase in total throughput although each individual thread runs correspondingly slower. Thus, the typical configuration is

# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 62
Model name:            Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
Stepping:              4
CPU MHz:               2999.953
BogoMIPS:              5192.93
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              20480K
NUMA node0 CPU(s):     0-7,16-23
NUMA node1 CPU(s):     8-15,24-31


By default in OpenStack, the virtual CPUs in a guest are allocated as standalone processors. This means that for a 32 vCPU VM, it will appear as

  • 32 sockets
  • 1 core per socket
  • 1 thread per core
As part of ongoing performance investigations, we wondered about the impact of this topology on CPU bound applications.

With OpenStack Juno, there is a mechanism to pass the desired topology. This can be done through flavors or image properties.

The names are slightly different between the two usages: flavors use properties which start with hw: while images use properties which start with hw_.

The flavor configurations are set by the cloud administrators and the image properties can be set by the project members. The cloud administrator can also set maximum values (e.g. hw:cpu_max_cores) so that the project members cannot define values which are incompatible with the underlying resources.
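
For example, the topology requested below could be set by a cloud administrator on a flavor (a sketch, assuming a flavor named m1.large):

$ nova flavor-key m1.large set hw:cpu_sockets=2 hw:cpu_cores=8 hw:cpu_threads=2

The corresponding image properties can be set by the project members: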


$ openstack image set --property hw_cpu_cores=8 --property hw_cpu_threads=2 --property hw_cpu_sockets=2 0215d732-7da9-444e-a7b5-798d38c769b5

A VM booted with these properties then has the requested topology reflected:

# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 62
Stepping:              4
CPU MHz:               2593.748
BogoMIPS:              5187.49
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              4096K
NUMA node0 CPU(s):     0-31

While this gives the possibility to construct interesting topologies, the performance benefits are not clear. The standard High Energy Physics benchmarks show no significant change. Given that there is no direct mapping between the cores in the VM and the underlying physical ones, this may be because the vCPUs are not pinned to the corresponding sockets/cores/threads, and thus Linux may be optimising for a virtual configuration rather than the real one.
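
A natural next step, which we plan to look at with Kilo, is CPU pinning, where guest vCPUs are bound to dedicated physical CPUs so that the exposed topology matches the real one. A sketch of how this would be requested (assuming a flavor named m1.large and hypervisors configured for pinning):

$ nova flavor-key m1.large set hw:cpu_policy=dedicated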

This work was in collaboration with Sean Crosby (University of Melbourne) and Arne Wiebalck (CERN).

The following documentation reports have been raised:

  • Flavors Extra Specs - https://bugs.launchpad.net/openstack-manuals/+bug/1479270
  • Image Properties - https://bugs.launchpad.net/openstack-manuals/+bug/1480519

Thursday, 21 May 2015

Juno, EL6 and RDO Community.

In 2011, CERN selected OpenStack as its cloud platform. It was natural to choose RDO as our RPM provider; RDO is a community of people using and deploying OpenStack on Red Hat Enterprise Linux, Fedora and distributions derived from these (such as Scientific Linux CERN 6, which powers our hypervisors).

The community decided not to provide an official upgrade path from Icehouse to Juno on el6 systems.

While our internal infrastructure is now moving to CentOS 7, we have to maintain around 2,500 compute nodes under SLC6 during the transition.

As mentioned in the previous blog post, we recently finished the migration from Icehouse to Juno. Part of this effort was to rebuild the Juno RDO packages for RHEL6 derivatives and provide a tested upgrade path from Icehouse.

We are happy to announce that, with the help of the CentOS infrastructure, we have rebuilt the openstack-nova and openstack-ceilometer packages and made them publicly available to the community.

The effort is led by the CentOS Cloud SIG and I'd like to thank particularly Alan Pevec, Haïkel Guemar and Karanbir Singh for their support and time.

For all the information and how to use the Juno EL6 packages please follow this link https://wiki.centos.org/Cloud/OpenStack/JunoEL6QuickStart.

Tuesday, 12 May 2015

Our cloud in Juno

This blog continues our series around upgrades of OpenStack; previous upgrades are documented in earlier posts. At CERN, we do incremental upgrades of the cloud, component by component, giving details of the problems we encounter along the way.

For Juno, we followed the same pattern as previously:
  • cinder
  • glance
  • keystone
  • ceilometer
  • nova
  • horizon
As we are now rolling out our CentOS 7 based controllers, we took the opportunity to do that upgrade also. Many of the controllers themselves are virtualised, which allows us to scale out as needed. An HAProxy configuration allows us to switch rapidly between the services at the different levels.

The motivation to move to CentOS 7 comes from two primary sources:
  • CERN is moving from its own distribution, Scientific Linux CERN, to CentOS 7.
  • The RDO packages are being produced on CentOS 7 now. This means that we can benefit from community testing if we are also on that version.
We'll give more details on the SLC6 environment in a future posting.

We encountered one problem during the upgrade: the LDAP backend for roles compares role names against an upper case definition, while we were using lower case roles in LDAP. This was resolved with a quick workaround and a bug will be reported around https://github.com/openstack/keystone/blob/stable/juno/keystone/assignment/backends/ldap.py#L93.

Other than that, the upgrade proceeded smoothly and we're now looking forward to deploying Heat and starting the planning for the migration to Kilo.

Tuesday, 5 May 2015

Purging Nova databases in a cell environment

The CERN cloud infrastructure has been running in production since July 2013, and in almost two years more than 1,000,000 production VMs have been created in the infrastructure.

During the last few months, we have had an average of 11,000 VMs running, with a creation/deletion rate of between 100 and 200 VMs per hour.


[Figure: Number of VMs created (cumulative)]

[Figure: Difference between VMs created and deleted per month (cumulative)]


The information for all these instances is stored in the database and remains there when the instances are deleted, because nova uses “soft” delete (marking the records as deleted without removing them). As a consequence, the database size grows over time, becoming more difficult to manage and increasing the time taken by operations.

At CERN, we have a policy to preserve deleted instance information in the database for 3 months before removing it.

Nova has functionality to move deleted instances to “shadow” tables (nova-manage db archive_deleted_rows [--max_rows <number>]). This functionality can remove all deleted entries from the main tables, or a maximum number of rows can be specified. However, a row doesn’t correspond to an instance, because some tables have several entries for the same instance. Also, in a cloud that uses cells, running “archive_deleted_rows” with “max_rows” defined will not keep the top and children cells in sync.
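
For example, to archive up to 1000 deleted rows in a single run:

# nova-manage db archive_deleted_rows --max_rows 1000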

In order to remove the database entries of deleted instances in a cell environment with a grace period, we developed a small tool that is available at:

We start by removing deleted instances from the top database, specifying a date up to which the rows should be deleted.
python cern-db-purge --date "2015-02-01 00:00:00" --config nova.conf

Cascading deletes don’t work in the nova database, so the script checks whether the instances were deleted before the specified date and removes all the rows associated with them in the different tables. It also saves the UUID and some more information about each instance in a file. We decided not to delete the instances from the top and children cells at the same time, in order to have more operational control during the interventions.
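
As an illustration of the selection step only (the tool performs more checks than this), the candidate instances in the top database can be found with a query along these lines, assuming access to the nova database with appropriate credentials:

mysql nova -e "SELECT uuid FROM instances WHERE deleted != 0 AND deleted_at < '2015-02-01 00:00:00';"

The UUIDs of these instances are what gets written to the file mentioned above.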

This file is used to remove the deleted instances from the children cells. 
python cern-db-purge --file "delete_these_instances.txt" --cell 'top_cell!child_cell_01' --config nova.conf

The script goes through all instances in the file for that specific child cell and, after some consistency checks, removes all the rows related to those instances.
In this way we make sure that if an instance is removed from the top cell it is also deleted from the children cells.

Depending on the database size, the tool can take several hours to run.

This tool needs access to the nova database and was tested with the Icehouse release. The database endpoint should be defined in the configuration file (it can be nova.conf). Since it reads and updates the database, the administrator should be extremely careful when using it.

Monday, 23 March 2015

Not all cores are created equal

Within CERN's compute cloud, the hypervisors vary significantly in performance. We generally run the servers for around 5 years before retirement and there are around 3 different configurations selected each year through public procurement.

Benchmarking in High Energy Physics is done using a benchmark suite called HEPSpec 2006 (HS06). This is based on the C++ programs within the SPEC 2006 suite, run in parallel according to the number of cores in the server. The performance range is around a factor of 3 between the slowest and the fastest machines [1].


When machines are evaluated after delivery, the HS06 rating for each hardware configuration is saved into a hardware inventory database.

Defining a flavor for each hardware type was not attractive, as there are 15 different configurations to consider and users would not easily find out which flavors have free cores. Instead, users ask for the standard flavors, such as an m1.small 1-core virtual machine, and could land on a hypervisor giving 6 HS06 per core or on one giving 16. However, accounting and quotas are done using virtual cores, so the 6 and the 16 HS06 virtual cores are considered equivalent.

In order to improve our accounting, we therefore wanted to provide the performance of the VM along with the metering records giving the CPU usage through ceilometer. Initially, we thought that this would require some additional code to be added to ceilometer, but it is actually possible using the standard ceilometer functions with transformers and publishers.


The following approach was implemented.
  • On the hypervisor, we added an additional meter 'hs06' which provides the CPU rating of the VM normalised by the HS06 performance of the hypervisor. This value is determined using the HS06 value stored in the Hardware Database which can be provided to the hypervisor via a Puppet Fact.
  • This data is stored in ceilometer, in addition to the default 'cpu' record
The benefits of this approach are
  • There is no need for external lookup to the hardware database to process the accounting
  • No additional rights are required for the accounting process (such as the right to read the mapping between VM and hypervisor)
  • Scenarios such as live migration of VMs from one hypervisor to another of different HS06 are correctly handled
  • No modifications to the ceilometer upstream code are required which both improves deployment time and does not invalidate upstream testing
  • Multiple benchmarks can be run concurrently. This allows a smooth migration from HS06 to a following benchmark HS14 by providing both sets of data.
  • Standard ceilometer behaviour is not modified so existing programs such as Heat which use this data can continue to run
  • The information is calculated directly on the hypervisor, so it is scalable. It is also calculated inline, which avoids race conditions when the virtual machine is deleted and the VM-to-hypervisor mapping is no longer available
The assumptions are
  • There is no overcommitment of CPU. Further enhancements to the configuration would be possible in this area but would require further meters.
  • The accounting is based on the clock ticks delivered to the hypervisor. This will vary in cases where the hypervisor is running a more recent version of the operating system with a later compiler (and thus probably has a higher HS06 rating). Running older OS versions is therefore correspondingly less efficient.
  • The cloud is running at least the Juno OpenStack release
To implement this feature, the pipeline capabilities of ceilometer are used. These are configured automatically by the puppet-ceilometer component into /etc/ceilometer/pipeline.yaml.
The changes required are in several blocks. The first is in the sources section, as indicated by
---
sources:
A further source needs to be defined to get the CPU metric available for transformation. This polls the CPU meter every 10 minutes (600 seconds) and sends the data to the hs06 sink
    - name: hs06_source
      interval: 600
      meters:
          - "cpu"
      sinks:
          - hs06_sink
The hs06_sink processing is defined later in the file in the sinks section
sinks:
The entry below takes the number of virtual cores of the VM and scales it by 10 (the example HS06 CPU performance per core) and by 0.98 (a virtualisation overhead factor). It is reported in units of HS06 (i.e. HepSpec 2006). The value of 10 would be derived from the Puppet HS06 value for the machine divided by the number of cores in the server (from the Puppet fact processorcount). Puppet can be used to configure a hard-coded value per hypervisor that is delivered to the machine as a fact and used to generate the pipeline.yaml configuration file.
    - name: hs06_sink
      transformers:
          - name: "arithmetic"
            parameters:
                target:
                    name: "hs06"
                    unit: "HS06"
                    type: "gauge"
                    expr: "$(cpu).resource_metadata.vcpus*10*0.98"
      publishers:
          - notifier://
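
As an illustration of how the per-core factor can be derived, a hypothetical external Facter fact for a 16-core hypervisor rated at 160 HS06 could look like this:

# /etc/facter/facts.d/hs06.txt (hypothetical external fact)
hs06=160

The Puppet template generating pipeline.yaml would then divide 160 by the processorcount fact (16) to obtain the per-core factor of 10 used in the expr above.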
Once these changes have been done, the ceilometer daemons can be restarted to get the new configuration.
 service openstack-ceilometer-compute restart
If there are errors, these will be reported to /var/log/ceilometer/compute.log. These can be checked with
egrep "(ERROR|WARNING)" /var/log/ceilometer/compute.log
The first messages like "dropping sample with no predecessor" are to be expected, as the transformers handle differences between the previous values and the current ones (such as CPU utilisation).
After 10 minutes or so, ceilometer will poll the CPU, generate the new hs06 value and this can be queried using the ceilometer CLI.
ceilometer meter-list | grep hs06
will include the hs06 meter
| hs06                                | cumulative | HS06        | c6af7651-5fc5-4d37-bf57-c85238ee098c         | 1cdd42569f894c83863e1b76e165a70c | c4b673a3bb084b828ab344a07fa40f54 |
| hs06                                | cumulative | HS06        | e607bece-d9df-4792-904a-3c4adca1b99c         | 1cdd42569f894c83863e1b76e165a70c | c4b673a3bb084b828ab344a07fa40f54 |
and the last 5 entries in the database can be retrieved
ceilometer sample-list -m hs06 -l 5
produces the output
+--------------------------------------+------+-------+--------+------+---------------------+
| Resource ID                          | Name | Type  | Volume | Unit | Timestamp           |
+--------------------------------------+------+-------+--------+------+---------------------+
| 1fa28676-b41c-4673-9d31-1fa83711725a | hs06 | gauge | 12.0   | HS06 | 2015-03-22T09:19:49 |
| 1fa28676-b41c-4673-9d31-1fa83711725a | hs06 | gauge | 12.0   | HS06 | 2015-03-22T09:16:49 |
| 1fa28676-b41c-4673-9d31-1fa83711725a | hs06 | gauge | 12.0   | HS06 | 2015-03-22T09:13:49 |
| 1fa28676-b41c-4673-9d31-1fa83711725a | hs06 | gauge | 12.0   | HS06 | 2015-03-22T09:10:49 |
| b812c69c-3c9f-4146-952e-078a266b11c5 | hs06 | gauge | 11.0   | HS06 | 2015-03-22T08:54:25 |
+--------------------------------------+------+-------+--------+------+---------------------+

References

  1. Ulrich Schwickerath - "VM benchmarking: update on CERN approach" http://indico.cern.ch/event/319819/session/1/contribution/7/material/slides/0.pdf
  2. Ceilometer architecture http://docs.openstack.org/developer/ceilometer/architecture.html
  3. Basic introduction to ceilometer using RDO - https://www.rdoproject.org/CeilometerQuickStart
  4. Ceilometer configuration guide for transformers http://docs.openstack.org/admin-guide-cloud/content/section_telemetry-pipeline-configuration.html
  5. Ceilometer arithmetic guide at https://github.com/openstack/ceilometer-specs/blob/master/specs/juno/arithmetic-transformer.rst

Saturday, 21 March 2015

Nova quota usage - synchronization

Nova quota usage frequently gets out of sync with the real usage consumption.
We have been hitting this problem for a couple of releases, and it increases with the number of users/tenants in the CERN Cloud Infrastructure.

In nova there are two configuration options (“max_age” and “until_refresh”) that define when the quota usage should be refreshed. In our case we have configured them with “-1”, which means the quota usage must be refreshed every time the “_is_quota_refresh_needed” method is called.
For more information about these options you can see a great blog post by Mike Dorman at http://t.co/Q5X1hTgJG1
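
For reference, this is how these options would look in nova.conf (assuming the pre-Kilo placement of the quota options in the [DEFAULT] section):

[DEFAULT]
max_age = -1
until_refresh = -1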

This worked well in the releases before Havana: the quota would get out of sync and be refreshed the next time a tenant user performed an operation (e.g. create/delete/…).
However, in Havana, with the introduction of “user quotas” (https://wiki.openstack.org/wiki/ReleaseNotes/Havana#Quota), this problem started to become more frequent, even when forcing the quota to refresh every time.

In the CERN Cloud Infrastructure a tenant usually has several users. When a user creates/deletes/… an instance and the quota gets out of sync, this affects all users in the tenant. The quota refresh only updates the resources of the user performing the operation, not all tenant resources. This means that the quota usage in a tenant will only be fixed when the user owning the out-of-sync resource performs an operation.

The source of the quota desync is very difficult to reproduce; in fact, all our attempts to reproduce it consistently have failed.
In order to fix the quota usage, the operator needs to manually calculate the quota in use and update the database. This process is cumbersome, time consuming and can lead to the introduction of even more inconsistencies in the database.

In order to improve our operations we developed a small tool to check which quotas are out of sync and fix them if necessary.
The tool is available in CERN Operations github at: https://github.com/cernops/nova-quota-sync

How to use it?

usage: nova-quota-sync [-h] [--all] [--no_sync] [--auto_sync]
                       [--project_id PROJECT_ID] [--config CONFIG]

optional arguments:
  -h, --help            show this help message and exit
  --all                 show the state of all quota resources
  --no_sync             don't perform any synchronization of the mismatch
                        resources
  --auto_sync           automatically sync all resources (no interactive)
  --project_id PROJECT_ID
                        searches only project ID
  --config CONFIG       configuration file

The tool calculates the resources in use and compares them with the quota usages.
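
Conceptually, the usage side of that comparison is a sum over the non-deleted instances, along these lines (a sketch against the standard nova schema; the tool itself performs more checks), which is then compared with the rows in the quota_usages table:

mysql nova -e "SELECT project_id, user_id, COUNT(*) AS instances, SUM(vcpus) AS cores, SUM(memory_mb) AS ram FROM instances WHERE deleted = 0 GROUP BY project_id, user_id;"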
For example, to see all resources in quota usages that are out of sync:

# nova-quota-sync --no_sync

+-------------+----------+--------------+----------------+----------------------+----------+
| Project ID  | User ID  |  Instances   |     Cores      |         Ram          |  Status  |
+-------------+----------+--------------+----------------+----------------------+----------+
| 58ed2d48... | user_a   |  657 -> 650  |  2628 -> 2600  |  5382144 -> 5324800  | Mismatch |
| 6f999252... | user_b   |    9 -> 8    |    13 -> 11    |    25088 -> 20992    | Mismatch |
| 79d8d0a2... | user_c   |  232 -> 231  |  5568 -> 5544  |  7424000 -> 7392000  | Mismatch |
| 827441b0... | user_d   |   42 -> 41   |    56 -> 55    |   114688 -> 112640   | Mismatch |
| 8a5858da... | user_e   |    2 -> 4    |     2 -> 4     |     1024 -> 2048     | Mismatch |
+-------------+----------+--------------+----------------+----------------------+----------+

The quota usage synchronization can be performed interactively per tenant/project (don’t specify the argument --no_sync) or automatically for all “mismatch” resources with the argument “--auto_sync”.
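
For example, to synchronize all mismatched resources without prompting (assuming the database endpoint is defined in /etc/nova/nova.conf):

# nova-quota-sync --auto_sync --config /etc/nova/nova.conf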

This tool needs access to the nova database. The database endpoint should be defined in the configuration file (it can be nova.conf). Since the tool reads and updates the database, be extremely careful when using it.

Note that quota reservations are not considered in the calculations or updated.

Tuesday, 17 February 2015

Delegation of rights

At CERN, we have 1st and 2nd level support teams to run the computer centre infrastructure. These groups provide 24x7 coverage for problems and initial problem diagnosis to determine which 3rd line support team needs to be called in the event of a critical problem. Typical operations required are
  • Stop/Start/Reboot server
  • Inspect console
When we ran application services on physical servers, these activities could be performed using a number of different technologies
  • KVM switches
  • IPMI for remote management
  • Power buttons and the console trolley
With a virtual infrastructure, the applications are now running on virtual machines within a project. These operations are not available by default to the 1st and 2nd level teams, since only the members of the project can perform these commands. On the other hand, the project administrator rights include other operations (such as deleting or rebuilding servers) which are not needed by these teams.

To address this, we have defined an OpenStack policy for the projects concerned. This is an opt-in process so that the project administrator needs to decide whether these delegated rights should be made available (either at project creation or later).

Define operator role

The first step is to define a new role, operator, for the projects concerned. This can be done through the GUI (http://docs.openstack.org/user-guide-admin/content/section_dashboard_admin_manage_roles.html) or via the CLI (http://docs.openstack.org/user-guide-admin/content/admin_cli_manage_projects_users.html). In CERN's case, we include it into the workflow in the project creation.

On a default configuration,

$ keystone role-list
+----------------------------------+---------------+
|                id                |      name     |
+----------------------------------+---------------+
| ef8afe7ea1864b97994451fbe949f8c9 | ResellerAdmin |
| 8fc0ca6ef49a448d930593e65fc528e8 | SwiftOperator |
| 9fe2ff9ee4384b1894a90878d3e92bab |    _member_   |
| 172d0175306249d087f9a31d31ce053a |     admin     |
+----------------------------------+---------------+

A new role, operator, needs to be defined, using the steps from the documentation

 $ keystone role-create --name operator
+----------+----------------------------------+
| Property |              Value               |
+----------+----------------------------------+
|    id    | e97375051a0e4bdeaf703f5a90892996 |
|   name   |             operator             |
+----------+----------------------------------+

and the new role will then appear in the keystone role-list.

Now add a new user operator1

$ keystone user-create --name operator1 --pass operatorpass
+----------+----------------------------------+
| Property |              Value               |
+----------+----------------------------------+
|  email   |                                  |
| enabled  |               True               |
|    id    | f93a50c12c164f329ee15d4d5b0e7999 |
|   name   |            operator1             |
| username |            operator1             |
+----------+----------------------------------+

and add the operator1 account to the role

$ keystone user-role-add --user operator1 --role operator  --tenant demo
$ keystone user-role-list --tenant demo --user operator1

A similar role is defined for accounting which is used to allow the CERN accounting system read-only access to data about instances so that an accounting report can be produced without needing OpenStack admin rights.

For mapping which users are given this role, we use the Keystone V3 functions available through the OpenStack unified CLI.

$ openstack role add --group operatorGroup --project demo operator

Using a group operatorGroup, we are able to define the members in Active Directory and then have those users updated automatically with consistent role sets. The users can also be added explicitly

$ openstack role add --user operator1 --project demo operator


Update nova policy

The key file is policy.json in /etc/nova, which defines the roles and what they can do. There are two parts to the rules. First, a set of groupings which give a human readable name to a complex rule set, such as a member being someone who has neither the accounting role nor the operator role:

    "context_is_admin":  "role:admin",
    "context_is_member": "not role:accounting and not role:operator",
    "admin_or_owner":  "is_admin:True or (project_id:%(project_id)s and rule:context_is_member)",
    "default": "rule:admin_or_owner",
    "default_or_operator": "is_admin:True or (project_id:%(project_id)s and not role:accounting)",

The particular rules are relatively self-descriptive.

The actions can then be defined using these terms

        "compute:get":"rule:default_or_operator",
    "compute:get_all": "rule:default_or_operator",
    "compute:get_all_tenants": "rule:default_or_operator",
    "compute:stop":"rule:default_or_operator",
    "compute:start":"rule:default_or_operator",
    "compute:reboot":"rule:default_or_operator", "compute:get_vnc_console":"rule:default_or_operator",
    "compute:get_spice_console":"rule:default_or_operator",
    "compute:get_console_output":"rule:default_or_operator",
    "compute_extension:console_output": "rule:default_or_operator",
    "compute_extension:consoles": "rule:default_or_operator",


With this, a user group can be defined to allow stop/start/reboot/console access while not being able to perform more destructive operations such as delete.
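
The delegation can be verified by performing an allowed and a disallowed operation as the operator user (a sketch, assuming a test VM named testserver in the demo project and the usual credential environment variables or flags):

$ nova --os-username operator1 --os-tenant-name demo stop testserver      # permitted by default_or_operator
$ nova --os-username operator1 --os-tenant-name demo delete testserver    # refused, as delete falls under admin_or_owner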