LHC Tunnel

Wednesday, 14 August 2013

Managing identities in the cloud

CERN has 11,000 physicists who use the lab's facilities, including the central IT department resources. As with any research environment, there are many students, PhDs and other project members who join one of the experiments at CERN. They need computing accounts to access CERN's cloud, but we also need to make sure these resources are handled correctly when the users are no longer affiliated with the organisation.

Managing Users

For the CERN OpenStack cloud, we wanted complete integration with the site identity management system. With around 200 arrivals/departures per month, managing identities within OpenStack would have been a major effort.

CERN's users are stored in our Active Directory system, which provides a single central store for passwords and user attributes such as full name, organisational unit and location. We also define our user groups using Active Directory so that lists of members of an experiment can be centrally managed and applications share this master source of data for allocating roles to user groups.

Keystone provides the OpenStack authentication service including an LDAP back end. Working with the community during the Folsom release, we developed a number of patches so that Keystone was able to use the LDAP interface to Active Directory (see http://docs.openstack.org/trunk/openstack-compute/admin/content/configuring-keystone-for-ldap-backend.html for details). This allows users from both the command line and Horizon GUI to use OpenStack with their standard credentials.

Where we can, we leave the LDAP schema read-only since there are many other dependencies and major schema changes can cause significant disruption.

Multiple Identities

Historically, users have multiple accounts at CERN.
  • Primary account which is used for their prime activity
  • Secondary accounts which are used where a different identity is needed. Typical examples are an administrator who needs an account with standard user rights for documentation, or an ultra account which is rarely used
  • Service accounts which are shared. Here the user is responsible for the account but is able to transfer the account to another user. Typical examples would be an account used for running a daemon or an application internal resource.
As examples, timbell (my primary account for my day to day work), timothybell (my secondary to simulate a typical low privilege user profile for documentation) and owncloud (a service account related to a specific application).

The cloud identities are structured so that we use primary accounts, with roles within projects reducing the need for secondary accounts. A project with multiple members who manage it covers the service account scenario with respect to resources.

Thus, the cloud can potentially simplify both identity/roles and authentication by focusing on the one user, one account model. We expect exceptions but since one of the aims of the move to the cloud was to simplify our environment, we hope these can be limited to very special circumstances.

Managing Roles

We use the standard conventions for OpenStack roles.
  • Admin is a global role providing 'super-user' access to OpenStack. This is allocated to a group within Active Directory and the only members are the staff who support the cloud within IT.
  • For each project, there is a members list defined. When a project is set up, a group is provided as part of the request which defines the people who are able to perform actions within the project such as VM creation/deletion/reboot.
There is a regular script which ensures that the Active Directory groups are synchronised with those in Keystone.
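
The core of such a synchronisation can be sketched as a set comparison between the two membership lists. This is a minimal illustration, not the script we run: the function name, the dictionary layout and the sample users are all hypothetical, and a real script would read the groups from Active Directory over LDAP and apply the changes through the Keystone API.

```python
def plan_sync(ad_groups, keystone_members):
    """Return, per project, the users to add and remove so Keystone matches AD.

    Both arguments map a project name to a set of user names. The layout is
    an assumption made for this sketch.
    """
    plan = {}
    for project in set(ad_groups) | set(keystone_members):
        wanted = set(ad_groups.get(project, ()))
        current = set(keystone_members.get(project, ()))
        to_add = sorted(wanted - current)
        to_remove = sorted(current - wanted)
        if to_add or to_remove:
            plan[project] = {"add": to_add, "remove": to_remove}
    return plan


# Hypothetical sample data: Active Directory is the master source.
ad_groups = {
    "ATLAS Simulation": {"alice", "bob"},
    "CMS Analysis": {"carol"},
}
keystone_members = {
    "ATLAS Simulation": {"bob", "dave"},
    "CMS Analysis": {"carol"},
}

sync_plan = plan_sync(ad_groups, keystone_members)
print(sync_plan)
```

Treating Active Directory as the only source of truth keeps the script idempotent: running it twice in a row produces an empty plan the second time.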

User Lifecycle

With over 200 arrivals and departures every month, it is important to track the owner of resources to retire them when someone is no longer working on a CERN related activity. 

We use Microsoft Forefront Identity Manager (FIM) as an engine to automatically create users when someone is registered in the CERN Human Resources database and to expire them as they leave.

Users who wish to use the cloud can subscribe via the CERN accounts and resources portal. This creates an account and a personal project for them in a few minutes so they can already start investigating cloud technologies.

The general approach is that personal resources (such as the Personal project in OpenStack) will be removed. VMs will be stopped and deleted. Departing users are removed from their roles. Ownership of shared resources, such as projects, can be transferred before leaving or is automatically passed to the supervisor.

With this lifecycle, the OpenStack resources follow that for other computing resources and there are no orphaned resources.
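
The departure rules above can be sketched in a few lines. This is an illustration only, with a hypothetical project layout and user names; it is not how FIM or our tooling represents the data.

```python
def offboard(user, projects, supervisor):
    """Apply the departure rules to a list of project dictionaries.

    Each project is {"name", "owner", "personal", "members"}. This layout is
    hypothetical and only serves the illustration.
    """
    remaining = []
    for project in projects:
        if project["personal"] and project["owner"] == user:
            # Personal project: removed together with its VMs.
            continue
        updated = dict(project)
        # The departing user loses their roles in shared projects.
        updated["members"] = [m for m in project["members"] if m != user]
        if project["owner"] == user:
            # Ownership was not transferred before leaving: pass it on.
            updated["owner"] = supervisor
        remaining.append(updated)
    return remaining


projects = [
    {"name": "Personal jdoe", "owner": "jdoe", "personal": True,
     "members": ["jdoe"]},
    {"name": "ATLAS Simulation", "owner": "jdoe", "personal": False,
     "members": ["jdoe", "asmith"]},
    {"name": "CMS Analysis", "owner": "asmith", "personal": False,
     "members": ["asmith", "jdoe"]},
]

remaining = offboard("jdoe", projects, supervisor="boss")
for project in remaining:
    print(project["name"], project["owner"], project["members"])
```

After the call, no remaining project is owned by the departed user and none lists them as a member, which is the "no orphaned resources" property.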

To allow FIM and OpenStack to integrate, we developed a service called Cornerstone which provides a SOAP interface for FIM such as create personal project, create shared project, etc. and then performs the automated operations behind the scenes.

One interesting issue was the propagation delays. When a new project is created in FIM, Active Directory is updated but there is a small delay before all the slaves of Active Directory are updated. Thus, for project creation, we use a single Active Directory server to receive the information to avoid inconsistency (at the expense of availability if AD is down). 


As we've rolled out Grizzly, there is now ongoing work on the CERN Grizzly OpenStack to enhance user access. Specifically,
  • Kerberos and X.509 certificates for user authentication are widely used in High Energy Physics work. Kerberos is often used for interactive user authentication. X.509 certificates are also used for users but increasingly as a way to identify services such as automated job submission factories. Now that Keystone supports REMOTE_USER authentication, we can use the Apache Kerberos and certificate authentication methods to front end the Keystone service. This will avoid having to source a profile and enter passwords.
  • Integration of CERN's web based Single Sign On is an attractive option for Horizon. While common passwords are used, the user of Horizon still needs to enter their password to get access to the dashboard. CERN uses Microsoft ADFS to provide a Single Sign On capability which is used for most web applications.
  • We have a team of system administrators who perform the standard operations tasks when there are alarms in our monitoring system. These sysadmins need to be able to start/stop/reboot instances across the cloud but not perform create/delete/... operations. We will investigate how to model this within the existing JSON policy files.
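
To make the sysadmin requirement concrete, here is a toy model of the rule we would like to express. It is not the OpenStack policy file syntax: the action names, the role names and the is_allowed helper are all assumptions made for this sketch, and it ignores project scoping.

```python
import json

# Hypothetical mapping from an action to the roles allowed to perform it.
POLICY = {
    "compute:start": ["admin", "member", "sysadmin"],
    "compute:stop": ["admin", "member", "sysadmin"],
    "compute:reboot": ["admin", "member", "sysadmin"],
    "compute:create": ["admin", "member"],
    "compute:delete": ["admin", "member"],
}


def is_allowed(action, roles):
    """Return True if any of the caller's roles is listed for the action."""
    return bool(set(POLICY.get(action, [])) & set(roles))


print(json.dumps(POLICY, indent=2))
print(is_allowed("compute:reboot", ["sysadmin"]))
print(is_allowed("compute:delete", ["sysadmin"]))
```

The point of the sketch is only that the operations role is listed for the start/stop/reboot actions and absent from create/delete. How this maps onto the real policy files is what remains to be investigated.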
A number of ongoing activities in Havana will make further integration easier:
  • The Keystone V3 API is coming along which will include additional functionality in the area of mapping groups to roles. We will investigate how to map OpenStack roles into Active Directory groups and thus avoid synchronisation scripts.
  • Domains will add an extra level of project handling allowing us to group projects together. This will also create the possibilities of a structured set of roles within our user communities.
We'll be participating in the Havana design discussions around these areas so that we can further streamline our user and identity management in future.

Wednesday, 7 August 2013

Flavors - An English perspective

OpenStack has the capability to define flavors of virtual machines, which specify how many cores and how much swap, disk and memory each one has.

As a native UK English speaker, I find the term flavor to already be a problem. My spell checker fixes it to Flavour and requires regular manual changes. I appeal to the OpenStack technical committee to not accept any term which is not the same in US and UK English or to allow an alias :-)

At CERN, we make available 4 standard flavors modeled on the Amazon ones. These common names are already familiar to public cloud users and allow better compatibility with scripts using EC2.

[Table of the standard flavors, including their System Disk sizes]

For most cases, this set allows us to cover the configurations physicists ask for. There are some inefficiencies which can occur if an app requires 8GB but not much CPU power, or needs an 80GB disk but not much memory. These can be addressed by some overcommitting.

Currently, we overcommit on CPU and also use SMT. The current configuration of hypervisors is 24 cores (i.e. 48 cores with SMT enabled), 96GB memory and three 2TB disks. This matches the configuration of the above flavors for memory and CPU and produces a configuration which has around twice the per-core performance of Amazon when we compare using benchmarks such as HEPSpec2006, which is a subset of the SPEC benchmarks using C++.

Past experience with virtualisation has made us cautious about overcommitting memory. A hypervisor that starts swapping can cause a significant impact on all the VMs. As we gain more operational experience, we may start memory overcommit but we need to establish a baseline performance first.

However, the disk configuration is increasingly becoming a problem. Since our hypervisors run a variety of workloads, we need to mirror the disks to ensure reasonable reliability and also to avoid the operational work of having to re-install, and for the users to re-create VMs, after every disk failure. We use Linux software RAID on the KVM hypervisors running on Scientific Linux 6 (a derivative of RHEL). We have experimented with different combinations for the 3rd disk between making it a spare or a 3rd mirror. Currently, we are running in a 3-way mirror as we found some Linux stability issues on RAID-1 with spare.

The hardware itself was purchased for running bare-metal classic High Throughput Computing batch services. Typical configurations are based around Supermicro Quad systems assembled by European resellers. With these configurations, you would run a single instance of Linux on bare metal and have a batch scheduler (CERN uses LSF) to run the varied workload with fair share between the users.

However, when we use a similar configuration for hypervisors, some interesting effects emerge.
  • Space becomes more limited. Having some space in /var for logs and crash dumps is standard for a Linux host but when we add glance image caches and the backing store for the VMs along with mirroring the disks, it starts to get tight on the hypervisors with 2TB of space.
  • We could potentially be running 48 m1.tiny configurations which would require 960GB of disk space to support their VMs. Operations like suspend to disk become operationally difficult.
  • With only 3 spindles, we are limited for IOPS. The impact of this is reduced since much of the High Energy Physics code is CPU bound or directly accessing storage over the network using protocols such as HTTP or root (a specific protocol developed for accessing HEP data sets).
  • I/O patterns emerge according to the standard Linux schedules. Typical cases are yum updates and the updatedb runs for the locate command, which use the cron.daily schedules at 4am. Suddenly, we have 48 VMs all running updatedb and yum update at exactly the same moment with 3 disks.
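
The arithmetic behind these points can be checked with a short sketch. It assumes that the smallest VM shape is 1 core, 2GB of memory and 20GB of system disk (the figures quoted in this post), and that the 3-way mirror leaves 2TB, taken here as 2000GB, of usable space. The names are illustrative.

```python
# Hypervisor as described: 48 cores with SMT, 96GB memory, 3 mirrored 2TB disks.
HYPERVISOR = {"cores": 48, "memory_gb": 96, "disk_gb": 2000}
SPINDLES = 3

# Assumed smallest VM shape: 1 core, 2GB memory, 20GB system disk.
SMALL_VM = {"cores": 1, "memory_gb": 2, "disk_gb": 20}


def capacity(hypervisor, flavor):
    """How many VMs of this flavor each resource could hold on its own."""
    return {name: hypervisor[name] // flavor[name] for name in flavor}


limits = capacity(HYPERVISOR, SMALL_VM)
vm_count = min(limits.values())          # the scarcest resource decides
disk_needed_gb = vm_count * SMALL_VM["disk_gb"]
vms_per_spindle = vm_count // SPINDLES

print(limits)            # per-resource limits
print(vm_count)          # VMs that fit without overcommit
print(disk_needed_gb)    # system disk space those VMs need
print(vms_per_spindle)   # VMs sharing each physical disk at 4am
```

CPU and memory run out together at 48 VMs, which is the sense in which the hardware matches the flavors. The same 48 VMs take 960GB of the 2TB before logs, crash dumps and the glance image cache are counted, and put 16 VMs on each spindle.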
We get requests for special flavors with more than 4 cores, very large memory or multi-terabyte large system disks. Analysing these cases, there are a number of motivations.
  • Some applications are still scale-up rather than scale-out. In most of these cases, we're suggesting that people delay moving to the CERN private cloud for a few months as we are running at basic service levels (equivalent to Amazon) and the scale-up applications tend to be server consolidation rather than cloud applications. With the improvements coming in Havana and as we bring Cinder external block storage online, more of these use cases can be considered.
  • In other cases, there is a scalability issue with the application itself. As we add more nodes, distributed systems often hit contention bottlenecks and show non-linear performance changes. Creating a new VM for each single core batch job would significantly increase the number of nodes compared to today's load of around 7,000 physical servers.
  • Large external storage is a common request. Applications such as Cassandra or MongoDB are the cases we classify as 'Hippos'. Using the Pets/Cattle analogy from Microsoft/Cloudscaling, we have a cattle server which is redundant but with a large volume of disk storage. These are best served with a Cinder-like solution rather than by creating a system disk of multi-terabyte size. We've been evaluating NetApp, Gluster and Ceph storage solutions in this area and plan to bring a production service online later in the year.
  • While we currently do not use live migration, we will be using it more as we increase the service levels to cover some of the server consolidation use cases in specialised zones within the CERN cloud. Experience with our previous service consolidation environment has shown that large memory servers are a major difficulty, for example when transferring 64GB or more of virtual machine memory in a consistent, transactional manner. Some VMs change their memory faster than we can transfer it between the old and new hypervisors. Thus, we limit our out-of-the-box flavors to 8GB and review other cases to understand the application needs further.
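
The live migration problem can be illustrated with a simple model. It assumes an iterative pre-copy scheme, where each round re-sends the memory dirtied during the previous round. The bandwidth, dirty rates and stopping threshold are hypothetical numbers chosen for the example, not measurements from our hypervisors.

```python
def precopy_rounds(memory_gb, bandwidth_gb_s, dirty_gb_s,
                   stop_gb=0.1, max_rounds=30):
    """Rounds of pre-copy needed before the final pause, or None if the
    amount left to send never drops below stop_gb."""
    remaining = memory_gb
    for round_no in range(1, max_rounds + 1):
        seconds = remaining / bandwidth_gb_s
        # Memory dirtied while this round was being sent (capped at VM size).
        remaining = min(memory_gb, dirty_gb_s * seconds)
        if remaining <= stop_gb:
            return round_no
    return None


# Hypothetical link of 1 GB/s between the old and new hypervisor.
print(precopy_rounds(8, 1.0, 0.25))    # small VM, modest dirty rate
print(precopy_rounds(64, 1.0, 0.25))   # large VM, same dirty rate
print(precopy_rounds(64, 1.0, 1.5))    # dirties memory faster than we send
```

In this model the amount left shrinks by the ratio of dirty rate to bandwidth each round, so a larger VM needs more rounds, and a VM whose dirty rate exceeds the bandwidth never converges.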
While these are reasonable operational restrictions, one of the challenges of the private cloud is how to handle exceptions. Within a private cloud model, assuming no cross charging but a quota model based on pledges, there is a need to reflect the cost of unusual configurations. A 64GB, 8 core VM with 1TB of system disk would be very difficult to pack with a combination of 2GB/1core/20GB VMs. This leads to inefficiencies in resource utilisation for CPU, memory or disk.
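
A rough calculation shows the packing problem. The sketch places the 64GB, 8 core, 1TB VM on one hypervisor and then fills the rest with 2GB/1core/20GB VMs. It assumes 2000GB of usable disk and no overcommit, and the function name is illustrative.

```python
HYPERVISOR = {"cores": 48, "memory_gb": 96, "disk_gb": 2000}
BIG_VM = {"cores": 8, "memory_gb": 64, "disk_gb": 1000}
SMALL_VM = {"cores": 1, "memory_gb": 2, "disk_gb": 20}


def stranded_after(hypervisor, big_vm, small_vm):
    """Place one big VM, fill up with small VMs, report what is left unused."""
    left = {name: hypervisor[name] - big_vm[name] for name in hypervisor}
    fit = min(left[name] // small_vm[name] for name in small_vm)
    stranded = {name: left[name] - fit * small_vm[name] for name in left}
    return fit, stranded


small_count, stranded = stranded_after(HYPERVISOR, BIG_VM, SMALL_VM)
print(small_count)   # small VMs that still fit next to the big one
print(stranded)      # resources nobody can use
```

Memory runs out after 16 small VMs, leaving 24 cores and 680GB of disk idle on that hypervisor. Under a quota model without cross charging, that idle capacity is the hidden cost of the unusual configuration.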

As we look out to the future, there are a number of positive developments for addressing the flavor sprawl.
  • External block storage functionality with Cinder will allow us to cover the large disk storage 'Hippo' use case. An m1.medium can ask for an external volume and use that for database storage.
  • We are investigating Linux KSM options to cover scenarios where multiple identical Linux images are running on a single hypervisor. Under these scenarios, KSM would share the code pages providing significant optimisations for the small VM packing scenarios.
  • Future procurement rounds will be looking at configurations specifically for hypervisors rather than the current approach of re-cycling existing servers which had been purchased for other application profiles. A wide range of options from SSDs for higher IOPS, more disks or even further exploitation of external storage are being investigated.
Overall, the cloud model provides huge flexibility for our users to ask for the configurations they need. In the past, a custom configuration would take many months to deliver (using public procurement models of market surveys, specifications, tendering, adjudication, ordering, installation and burn-in). Physicists can now ask for a new VM and get it within the time to get a coffee. 

While many flavors can provide flexibility, we should not lose sight of the need to maximise efficiency and make sure that CPU, memory and disk are all used at a higher level in the cloud than previously with bare metal dedicated resources.

Upcoming work in this area is to
  • Investigate a current limitation in the experimental cells functionality within OpenStack that we use to achieve scalability. Flavor requests are not passed to child cells and thus adding new flavors is a manual process. We will be working with the community to address this restriction.
  • Exploit underlying virtualisation and block storage solutions to provide standard flavors with the flexibility for additional services which could cover their requirement. Cinder with many back end drivers is one of our top priority areas to deploy.

Thursday, 1 August 2013

The First Week - Projects

The First Week - Hot Topics - Projects!

At CERN, we've recently gone live with our OpenStack based cloud for 11,000 physicists around the world.

The major efforts prior to going live were to perform the integration into the CERN identity management system, define projects and roles and configure the components in a high availability set up using the Puppetlabs tools. Much of this work was generic and patches have been submitted back to the community.

Surprisingly, the major topics on our cloud go-live day were not how to implement applications using cloud technologies, how accounting is performed or support levels for non-standard images.

Instead, the support lines were hot with two topics... flavors and projects! I'll cover flavors in a subsequent posting.

For projects, each user signing up to the CERN private cloud is given a personal quota of 10 cores so they can work in a sandbox to understand how to use clouds. The vast majority of resources will be allocated to shared projects where multiple users collaborate to simulate physics and analyse the results from the Large Hadron Collider to match with the theory.

Users want a descriptive name for their project. Our standard request form asks
  • What is the name of your project ?
  • How many cores/MB memory/GB disk/etc would you like ?
  • Who is the owner for the project ? 
  • Who are the administrators ?
These do not seem to be Bridge of Death questions but actually require much reflection.

From the user's perspective, the name should be simple such as 'Simulation'. From the cloud provider's perspective, we have multiple user groups to support so the name needs to be unique and clear.
With OpenStack, there are upcoming concepts such as domains which will allow us to group a set of projects together and we wish to prepare the ground for those. So, there is a need for fine grain project definition awaiting future methods to group these projects together.

In the end, we settled on
  • Projects start with the LHC experiment such as ATLAS or CMS. The case reflects whether they are acronyms or not (ATLAS stands for A Toroidal LHC Apparatus, CMS for Compact Muon Solenoid which is ironic for something weighing over 12,000 tonnes)
  • Many requests asked for _ or - in the names. We prefer spaces. There will be bugs (we've already fixed one in the dashboard) but this is desirable in the long term.
  • For projects run by the IT department, we use our service catalog based on the ITIL methodology so that each functional element can request a project.
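
As an illustration, the first two conventions could be checked with a few lines of code. This is a sketch rather than a tool we run: the function name is hypothetical, the experiment list contains only the two examples named above, and projects named after IT service catalog entries are not covered.

```python
import re

# Only the two experiments named in the text; a real list would be longer.
EXPERIMENTS = ("ATLAS", "CMS")


def check_project_name(name):
    """Return a list of problems with a proposed experiment project name."""
    problems = []
    prefixes = tuple(experiment + " " for experiment in EXPERIMENTS)
    if not name.startswith(prefixes):
        problems.append("must start with the experiment name and a space")
    if re.search(r"[_-]", name):
        problems.append("use spaces rather than _ or -")
    return problems


for candidate in ("ATLAS Simulation", "atlas_simulation", "CMS Tier-0"):
    print(candidate, check_project_name(candidate))
```

The check is case sensitive on purpose, since the case of the experiment name reflects whether it is an acronym.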
Keeping projects small and focused has a number of benefits
  • The accounting can be more precise as the association between a project and a service is clearer.
  • The list of members for a project can be kept small. With the Grizzly based OpenStack cloud, the policies allow members to restart other VMs in the project. This allows sharing of administration responsibilities but can cause issues if multiple teams are members of a single project.
The disadvantages are
  • Additional administration to set up the projects
  • Dependencies on the upcoming domain features in Keystone which should be arriving in the Havana release
  • Quota management is more effort for the support teams. With large projects, the central team can allocate out a large block of quota to a project and leave the project team to look after the resources.
Ultimately, there is a need for an accounting structure of projects and domains so we can track who is using what. Getting the domain/project structure right at the start feeds into the quota management and accounting model in the future.

The good news on projects and quotas is that Havana has some interesting improvements which will reduce the support load.