Models for Deploying and Scaling the GPII
- 1 Introduction
- 2 Technical Dimensions of the Cloud
- 2.1 Types of Cloud Infrastructure
- 2.2 Background Information: the GPII Services
- 2.3 Cloud Scalability
- 2.4 Scaling Scenarios
- 2.4.1 Basic Configuration
- 2.4.2 Multi-load Balancer Configuration
- 2.4.3 Replicated Database Configuration
- 2.4.4 Cached Configuration
- 2.4.5 Next-Generation Clustering and Distributed Application Management
- 2.4.6 High Availability
- 2.4.7 System Redundancy
- 2.5 The Reference Architecture
- 2.5.1 Details of the Reference Architecture
- 2.5.2 Compute and Storage Hardware
- 2.5.3 Networking Hardware
- 2.5.4 Software Infrastructure
- 2.5.5 Configuration Management
- 3 Conclusion
- 4 Appendix A: Glossary of Software Components
This document summarizes the activities and findings of a project to determine the models and costs associated with deploying and sustaining the GPII technology. The goal of this document is to help answer several key questions:
- What are the technical dimensions of cloud scalability?
- How does the GPII architecture interact with cloud technologies?
- What scenarios and options are available for deploying and supporting the GPII?
- How can the GPII be scaled to support millions of users?
In particular, this report provides concrete examples of some of the deployment options available, how they are technically constituted, and the costs that they may represent to an organization that is deploying the GPII.
The Reference Implementation
For the purposes of this costing project, we designed and tested a specific reference implementation that represents a real-world collection of cloud hardware and software infrastructure capable of hosting the GPII software. The motivation for building this reference architecture was to more clearly and realistically explore the costs and tasks associated with deploying the GPII infrastructure at scale. This reference architecture is based on the Inclusive Design Institute’s (IDI) open research cloud, which is currently being implemented at OCAD University to support critically-needed emerging accessible infrastructure such as the GPII.
The reference architecture serves a number of interrelated purposes; it:
- provides a working, realistic environment for assessing the potential tasks and tools required to deploy the GPII and their associated costs,
- serves as an “applied research” tool for determining realistic hardware and human resource costs for self-hosted infrastructure,
- is a model for comparison with other infrastructure alternatives such as Amazon Web Services (AWS), and
- serves as a driver for developing automated deployment “recipes” and other low-level cloud services required by the GPII.
There are several potential organizational models for sustaining the GPII in production, including support of the requisite infrastructure by small organizations and nonprofits (such as regional schools or library systems), large technology companies, and even nation-wide deployments provided by government or NGOs. In practice, these scenarios will likely coexist and interact with uni- or bi-directional synchronization amongst them. For example, instances of the GPII infrastructure may live inside the boundary of a corporate or government firewall, while public infrastructure will also be available to support the storage and replication of user preferences and data globally. Given the diversity of these models and the differing numbers of users and frequency of use that they will accommodate, the GPII will likely be deployed in many different ways. In particular, we anticipate its use in the context of Infrastructure-as-a-Service public clouds for smaller deployments and in large-scale private clouds for corporate or government installations.
Given this predicted diversity, deployers of the GPII need not copy this reference architecture directly, nor should they see it as a prescription or set of requirements for how the GPII must be deployed. Instead, our reference architecture should be used as a point of comparison and a baseline for selecting different tools and approaches. Different deployments may choose to use different server hardware, networking equipment, software tools, and hosting models, while using aspects of this architecture as a model.
Dev Ops Toolkit
As part of the cloud implementation and cost modelling process, we created a toolkit of automation “recipes” or playbooks that will help developers and deployers more easily provision servers for the GPII in any cloud-based environment.
We also produced a set of reusable load testing scripts that can be used to evaluate the performance characteristics of GPII services such as the Preferences Server in a repeatable, automated fashion.
Structure of this document
The first part of this document describes the factors that influence the cost of deploying the GPII: the components of the GPII that need to be scaled, how scaling can be accomplished, and the technical issues involved with running large systems at scale. This is followed by a description of the reference architecture that was created to explore the tasks and technologies required to deploy and maintain a cloud-based installation of the GPII. Next, the cost of implementing the GPII infrastructure in three potential usage scenarios is estimated and discussed. Finally, the performance testing tools and dev ops toolkit, which are the primary software deliverables of this costing project, are discussed.
Technical Dimensions of the Cloud
There is a vast array of options available today for deploying and scaling web and cloud-based software. From self-hosting to infrastructure-as-a-service (IaaS), and from low-cost commodity servers to high-end proprietary all-in-one servers and networking solutions, deployers and system administrators have never had a greater variety of choices available to them.
From an architectural perspective, the goal of the GPII infrastructure is to support diverse deployment options and avoid prescribing a single server or software stack for deployment. Nonetheless, the GPII technology has been designed from the start to fit naturally within the technical idioms of cloud computing. From this perspective, a cloud-based infrastructure involves several key architectural characteristics:
- Partitioning of individual physical servers (hosts) into smaller, virtualized servers (virtual machines, or guests)
- Grouping many servers into clusters that allow their individual CPU, memory, and storage resources to be allocated homogeneously as a single, larger pool of resources
- Modularization of applications into smaller services or microservices that communicate with each other via HTTP
- Reduction of server-side state (e.g. user session state and stateful APIs in which requests are not self-contained and idempotent) and the requirement for clients to maintain an exclusive or long-running conversation with a particular server, in order to support the large scale distribution of services amongst multiple servers and regions for high availability and horizontal scaling
- Isolation of services into distinct virtual machines or containers so that each may be maintained, upgraded, and developed without impacting the others
- Automated provisioning and maintenance of servers, virtual machines, and containers using configuration management tools such as Ansible, continuous integration tools such as Jenkins, and distributed versioning systems such as Git
- The tendency to treat services as ephemeral and replaceable when they can be easily recreated and spun up with minimal administrative intervention due to configuration management automation. Resources are often stopped and replaced on the fly when problems occur or maintenance is required (because there are duplicates available via horizontal load balancing)
- Use of replication, eventual consistency, and sharding of databases in order to scale them, rather than maintaining a single, monolithic “master” database that must remain always available
Types of Cloud Infrastructure
“The cloud” has become something of an overloaded buzzword, used variously to describe the web at large, centralized storage services such as Dropbox, proprietary platform synchronization services such as iCloud, and document editing tools such as Google Docs and Office 365. We use the term here specifically to refer to server and systems infrastructure that support the large-scale, distributed, high-availability use of applications that synchronize data and resources between local devices (i.e. mobile and desktop operating systems) and web-based data stores. In other words, the cloud here refers to a particular approach to deploying servers, networking infrastructure, and software to support “always available” access to user preferences and adaptation services for accessibility.
There are several types of cloud infrastructure that are relevant to this particular discussion:
- Self-hosted “private” clouds that combine server hardware with management and provisioning software such as OpenStack
- Low-level, vendor-hosted Infrastructure-as-a-Service (IaaS) compute and storage clouds such as Amazon EC2, Digital Ocean, and IBM SoftLayer
- Mid-level hosted application-tier services such as Amazon’s database services, load balancers, and more
- High-level Platform-as-a-Service systems targeted at application developers, which abstract away hardware and operating system details, such as Heroku or IBM’s BlueMix cloud
Background Information: the GPII Services
This section provides a brief refresher on several of the key GPII components that run in the cloud. It is intended to provide background for those who are less familiar with the GPII architecture prior to discussing different models for cloud scaling and how they impact these components.
For a more in-depth discussion of the GPII components, see the GPII Architecture Overview page.
The Preferences Server is a web-based service that provides a REST API for saving and retrieving user preferences in a JSON format. The Preferences Server is a composite of several other microservices that assist in the process of securing and transforming preference sets. It is backed by a CouchDB database for storing preferences sets. The Preferences Server and the Solutions Registry (described below) are the primary storage-oriented services within the GPII architecture.
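The Preferences Server’s save-and-retrieve cycle can be sketched in miniature. The example below is an in-process stand-in only: the class name, method names, and preference keys are illustrative assumptions rather than the actual GPII REST API, and a real deployment would issue HTTP requests against the CouchDB-backed service instead.

```javascript
// Toy in-memory stand-in for the CouchDB-backed preferences store.
class InMemoryPreferencesStore {
  constructor() {
    this.docs = new Map();
  }

  // A real Preferences Server would PUT a JSON document to CouchDB here.
  save(userToken, preferences) {
    this.docs.set(userToken, JSON.stringify(preferences));
    return userToken;
  }

  // A real Preferences Server would GET the document over HTTP.
  fetch(userToken) {
    return JSON.parse(this.docs.get(userToken));
  }
}

const store = new InMemoryPreferencesStore();
store.save("alice", { fontSize: 24, highContrast: true });
console.log(store.fetch("alice").fontSize); // → 24
```

In production, `save` and `fetch` would correspond to PUT and GET requests against the Preferences Server’s REST endpoints, with the JSON document stored in CouchDB.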
Flow Manager and Matchmakers
The Flow Manager is the event-driven, asynchronous orchestrator of the personalization workflow. It is implemented as a single core of cross-platform code that can be run in two different (but collaborating) configurations:
- a very small server that runs on the local device, which is responsible for bringing together the various platform-specific phases of the autopersonalization process and invoking them at the appropriate time
- a cloud-based server that is responsible for orchestrating the preferences fetching, matchmaking, and privacy filtration parts of the workflow, as well as servicing requests from third-party applications when no local Flow Manager is available
In its cloud-based incarnation, the Flow Manager is the primary computational load-bearing component in the GPII architecture. It:
- handles requests from the local Flow Managers installed on user devices
- delegates to the other GPII components such as the Authorization Manager, the Preferences Server, and the Matchmaker
- returns a comprehensive payload of all the settings that should be applied to the user’s current context
Invoked by the Flow Manager, the Matchmaker is responsible for deciding which solutions and settings will best meet the user's needs and returning a comprehensive list of solutions and settings for the user.
The Solutions Registry is the comprehensive source for information about applications, access features, and assistive technologies (referred to collectively as Solutions). It can be queried to determine which solutions are available to the user on their particular platform. The Solutions Registry contains declarative metadata about a solution: how to launch it, where its settings are stored, and what features it provides. This information can then be used to make a successful match with the user's needs and preferences. The Solutions Registry is where vendors and application developers can add entries that describe their software. Like the Preferences Server, the Solutions Registry is storage-oriented and uses CouchDB as its backend database.
Cloud Scalability
The scaling approach used most often for cloud-based infrastructure, which we employ heavily in this reference architecture, is called horizontal scaling. This type of scaling occurs both at the hardware and the software layers, and was a key architectural consideration when designing the GPII autopersonalization system.
Types of Scalability
Discussions of general system scalability are typically divided into two primary techniques: vertical and horizontal scaling.
Vertical scaling refers to an approach where the capacity of a single server node in an environment is upgraded; for example, adding extra memory or faster processors to a server. Vertical scaling can be regarded as an easier approach, but its effectiveness is limited by hardware constraints (e.g. maximum CPU speeds available on the market) and cost; it is appropriate for only some types of applications and does not, in itself, address the need for redundancy.
Horizontal scaling, in contrast, involves introducing additional nodes into the environment as a means of increasing the capacity of a distributed application. This approach has the potential to increase complexity due to the greater number of nodes, as well as increased latency due to the additional number of hops that requests need to traverse. These pitfalls can be addressed by using automation to manage nodes and relying on faster network links for cluster communication. There also needs to be a load balancing service above the application tier which is aware of the number of application servers that can respond to incoming requests. The GPII utilizes horizontal scaling to ensure compute capacity can be added as the incoming load warrants while also ensuring that a small number of node failures do not impact the application service as a whole.
At the hardware level, horizontal scaling involves the deployment of a large number of relatively low-cost compute and storage servers working in tandem as a cluster. Compute nodes typically feature fast CPUs with many cores, while storage nodes emphasize large numbers of disks in a RAID 10 (or similar) configuration. If steady, long-term load increases, new servers and disks can be easily added to the cluster to accommodate this growth. We recommend the use of “commodity” servers that can be easily replaced with new models from a variety of vendors at the end of their lifetime and, more importantly, that can be shifted between compute and storage roles as needed based on application load. In our reference configuration, all nodes feature fast CPUs and large amounts of storage so that they can be used in either a compute or storage configuration.
IaaS providers, on the other hand, take care of abstracting the physical hardware layer away from deployers, who simply request resources by type (compute or storage) and rely on the IaaS stack (e.g. Amazon’s own proprietary systems or the OpenStack open source alternative) to correctly provision the appropriate hardware.
Each commodity server is, in turn, sliced up into a number of smaller virtual machines, which are allocated a limited portion of the server’s overall CPU capabilities, memory, and storage space. This approach serves several purposes, one of which is to make more efficient use of the many cores in modern CPUs. For example, our reference architecture servers each include two six-core CPUs, for a total of 12 available cores. Individual virtual machines can be assigned to each core, making more efficient use of the available resources than would typically be possible with a single, monolithic (and thus highly threaded and more complex) application. Virtualization also helps to provide isolation between layers of the architecture, simplifying maintenance and ensuring that outages on one virtual machine are less likely to affect others.
Deploying many servers in tandem like this, however, requires additional software management infrastructure to help provision, monitor, and allocate these resources. The appeal of hosted IaaS infrastructure (such as Amazon AWS or Digital Ocean) is that a deployer need not be concerned with the mechanics of provisioning and understanding the physical layer. Instead, they can use a rich set of software APIs for managing virtual machines while largely ignoring the layer beneath them. The downside to many IaaS platforms is that they are proprietary and can potentially lead to vendor lock-in.
To address this issue in our own setup, we adopted the open source OpenStack project, which provides a “cloud operating system” responsible for managing, provisioning, and sharing the underlying hardware resources. OpenStack provides APIs and tools that are compatible with those offered by AWS, including APIs for managing virtual storage, compute, and networking resources.
Since OpenStack was still in relatively early development in mid-2013—with all of the rough edges and incompatibilities that are often associated with new open source projects—our reference implementation also includes a set of Nebula One cloud controllers. These consist of a proprietary combination of hardware and a customized distribution of OpenStack, offering a “plug and play” solution to provisioning physical servers. It also provides a user-friendly dashboard that deployers can use to manage their virtual machines.
There are alternatives to dedicated, turnkey controllers such as the Nebula; other vendors provide their own controller hardware, and there are easy-to-install pre-customized versions of OpenStack that can be deployed on regular servers (such as those offered by Piston, Rackspace, RedHat, and Breqwatr). Alternatively, a customized GPII-specific configuration of OpenStack could be developed by a production ops team. Increasingly, too, there are other open source alternatives to OpenStack for managing private clouds. Particularly promising are tools such as Kubernetes and Mesos, which represent the next wave of container-based infrastructure.
Since this document was originally drafted, server virtualization has been augmented with containerization. Containers, in contrast to virtual machines, are lighter-weight modules that can more efficiently use their underlying host’s resources. They’re also simpler to administer and maintain, since multiple containers can share the same kernel and system libraries while having their own isolated file systems. Containerized deployment and scalability is discussed in greater detail below. Since 2014, GPII services have been deployed in a containerized environment using Docker. In this new containerized environment, a small number of dedicated GPII virtual machines are sliced up into smaller containers, each of which holds a different part of the GPII infrastructure (such as a container for the Preferences Server, another for the Cloud-based Flow Manager, etc.).
This section describes several of the key issues with scaling software components in the cloud, focusing on specific characteristics of the GPII architecture that contribute to its scalability.
- Single-threaded asynchronous request-response architecture: as is common to many applications implemented in Node.js, the basic request-response behaviour of a single instance of a GPII component is implemented as a single-threaded asynchronous event loop. This approach simplifies application code relative to multithreaded applications while still allowing for horizontal scaling.
- Deliberate avoidance of server-side storage of client state: GPII components avoid storing client state within themselves. While this does not improve scalability in and of itself, it contributes to horizontal scalability through substantially reducing the complexity of load balancing (see below).
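The single-threaded asynchronous event loop can be illustrated with a few lines of Node.js. The function names below are hypothetical; the point is that many requests can be in flight at once on a single thread, with no locks, because each handler yields control while waiting on I/O:

```javascript
// Simulate non-blocking I/O (e.g. a CouchDB read). While one request
// awaits its result, the single-threaded event loop services others.
function fakeDbRead(requestId) {
  return new Promise((resolve) =>
    setTimeout(() => resolve(`response-${requestId}`), 10));
}

async function handleRequest(requestId) {
  // The event loop is free to run other handlers during this await.
  return await fakeDbRead(requestId);
}

async function main() {
  // Five in-flight requests, one thread, no locks.
  const ids = [0, 1, 2, 3, 4];
  return Promise.all(ids.map((id) => handleRequest(id)));
}

main().then((responses) => console.log(responses));
```

Because a single instance never blocks a thread per request, throughput is bounded by CPU and I/O rather than thread count, and additional capacity is added by running more instances behind a load balancer.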
Based on this foundation, we have elaborated on three primary techniques to help scale the GPII horizontally:
- Load balancing
- Database replication
- Caching
Most GPII components, such as the Preferences Server, respond to HTTP requests from other components in the system in order to carry out their workload. A single instance of a component will eventually reach limits in its ability to respond to requests; these limits are imposed by numerous factors such as the design of the component itself and the computing resources available to it. Once a limit is reached, the component's behaviour will be negatively affected and response times will suffer.
To address the issue of a single instance of a component becoming overloaded, load balancing distributes workloads across multiple instances of the same component so that we can achieve maximum throughput, minimize response time, and avoid overload of any single component. Load balancers employ different algorithms to distribute requests to different instances, such as round robin (which distributes one request per instance in sequential order) and request hashing. Regardless of the approach taken, load balancing is easier when a service doesn’t store server-side state about its clients; a substantial amount of the complexity in load-balancing a production application is typically consumed by strategies to manage the coordination, synchronization, and transfer of server-side client state.
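A round-robin balancer can be sketched in a few lines. The instance names below are hypothetical, and a production deployment would use a dedicated balancer such as HAProxy or Nginx rather than application code; the sketch simply shows why statelessness makes the routing decision trivial:

```javascript
// Distribute requests across identical stateless service instances.
// Because the service keeps no server-side client state, any instance
// can serve any request.
class RoundRobinBalancer {
  constructor(instances) {
    this.instances = instances;
    this.next = 0;
  }

  route(request) {
    const instance = this.instances[this.next];
    this.next = (this.next + 1) % this.instances.length;
    return { instance, request };
  }
}

const balancer = new RoundRobinBalancer(["prefs-1", "prefs-2", "prefs-3"]);
const targets = [];
for (let i = 0; i < 6; i++) {
  targets.push(balancer.route(`GET /preferences/${i}`).instance);
}
console.log(targets); // each instance receives every third request
```

If the service did store per-client state, the balancer would instead need session affinity (“sticky sessions”) or state synchronization between instances, which is precisely the complexity the GPII’s stateless design avoids.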
A database server allows for the storage, retrieval, and general management of data utilized by application servers. The GPII project makes heavy use of CouchDB, one of the more prominent NoSQL databases. CouchDB allows applications to interact with it using the HTTP protocol, which opens the possibility of load balancing incoming database requests using software like HAProxy or Nginx. In other words, request loads can be easily distributed amongst CouchDB instances using standard HTTP load balancing tools instead of requiring database-specific schemes.
To achieve the best performance from any database, efforts should be made to avoid contacting the database unnecessarily by consulting a cache first. In this scenario, frequently-used data is accessed directly from the cache, which typically stores data in memory to provide much faster access to it. If the required data does not exist in the cache (or is too stale), it is fetched from the database; a side effect of retrieving data from the underlying database is to place it in the cache. While caching has not been implemented for the Preferences Server as part of this costing project, research was done to determine the best options for caching. The following options are worthy of further investigation and implementation:
- Using Nginx’s proxy_cache options, it is possible to configure the reverse proxy layer so that the backend services are not queried unnecessarily for static assets.
- For requests that don’t involve authentication, Varnish can be used to serve cached results from dynamic actions performed by backend services (however, few services in the GPII will be available without OAuth-based authentication).
- Memcached provides a distributed key-value store implemented as a pool of servers that are queried by clients. Client libraries exist for a variety of languages, including Node.js. Cached data resides in memory; once a Memcached server reaches the limit of its allocated memory, older values are discarded to make room for new data.
Caching solutions typically also require effort to implement application-appropriate mechanisms that are responsible for invalidating stale cached data.
Figure 1: A diagram of all the software components required to deploy, maintain, and scale cloud-based infrastructure such as the GPII. On the left of the diagram are the various services that are required as part of each virtual machine or container; these include services to support provisioning, logging, monitoring, and backing up servers. At the top, the primary load balancers work to distribute requests to a pool of application servers. A caching layer in the middle reduces strain on the database tier by storing often-used data in a fast cache. The database tier consists of a database load balancer and multiple databases configured in a master-to-master replication scheme.
Scaling Scenarios
We have researched and defined a set of technical blueprints intended to help system administrators and implementers (including ourselves) scale the GPII infrastructure to increasingly higher loads. These comprise four distinct scaling scenarios that build on each other, each introducing a new strategy for handling an increased level of usage. With each new level, the technical complexity of the approach increases along with the demands on server hardware:
- Basic configuration, consisting of a load balancer, multiple Preferences Servers, and one database
- Multi-load balancer configuration, which introduces multiple load balancers that are configured for Transport Layer Security (TLS) and HTTPS
- Replicated database configuration, which introduces multiple databases in a “master-to-master” replication configuration, managed by their own distinct load balancing tier
- Cached configuration, which introduces a caching layer above the databases, substantially reducing the cost of reading preference sets by caching them externally to the database
In addition to defining these scenarios, we also created automated recipes to deploy instances of the Preferences Server using these configurations. From there, reproducible load test scenarios were defined, implemented, and run with each configuration. These tests and automation recipes are described in greater detail below.
Basic Configuration
Figure 2: Basic testing configuration, which consists of a load balancer managing a pool of application server instances and a single database.
In this basic configuration, computational load (i.e. the work of responding to requests from incoming Preferences Server clients) is balanced by multiple instances of the Preferences Server. Since only one load balancer and database is deployed, outages related to the load balancer or database will negatively affect application availability. However, since multiple Preferences Servers are part of the topology, maintenance tasks such as rolling application updates can be easily performed without affecting availability. This configuration is unlikely in practice, but provides a benchmark to determine the “raw” behaviour of a GPII service such as the Preferences Server.
Multi-load Balancer Configuration
Figure 3: Multi-load balancer configuration, which adds a pool of TLS-enabled load balancers managing a pool of application server instances, which access a single database.
By adding multiple load balancers, we eliminate one set of potential points of failure, leaving just the database tier vulnerable to instability or outages. Additionally, Transport Layer Security (TLS), the protocol underlying HTTPS, has been introduced in this configuration to provide encryption of network traffic, protecting sensitive user data.
TLS does, however, require additional computing resources wherever it is used (generally on the load balancers, but possibly further down the stack as well) and is also likely to increase request latency. When possible, TLS session reuse and caching can be employed to reduce these impacts. This configuration is, in essence, the “minimum viable” deployment of a database-backed GPII service such as the Preferences Server or Solutions Registry, since the use of TLS is crucial to protecting user privacy and security.
Replicated Database Configuration
Figure 4: Replicated database configuration, which introduces a database load balancer and multiple instances of the database working in a master-to-master replication scenario.
Multi-master database replication, such as the form provided by CouchDB, allows application servers to contact more than one database server to perform updates. Changes made to one database instance are replicated to the others, without having to nominate a single database instance as the primary one. A common addition to this topology, besides additional database servers, is HAProxy, which accepts incoming requests from application servers and forwards them to the pool of database servers based on a variety of preconfigured conditions, such as health or load checks, load balancing algorithms, or even delegation of a specific class of requests to particular database servers. In this configuration, we introduce multiple database instances to distribute request load across the GPII data tier (e.g. the Preferences Server’s backend database).
Cached Configuration
Figure 5: Cached configuration, which introduces a caching layer between the application and database to reduce load on the database server by storing frequently-used data in memory.
The changes in this last configuration ensure that caching servers, such as Memcached, are contacted by the application servers for all data requests. If a Memcached server has the particular data in its memory cache, then that data is returned to the application server; otherwise a request is made to the database servers, and its response is cached for subsequent requests.
A corollary of making use of Memcached or a similar application-level caching server is that the application code retrieving data must typically be modified to follow this pattern:
1. Request cached data by a named key from Memcached.
2. Use the cached data from Memcached if available.
3. Fall back to a database query if the data is not cached under that key in Memcached, and store the query results in Memcached with specific expiration/invalidation instructions, to speed future retrieval.
In other words, an abstraction in the application tier’s code is required to integrate caching. In the GPII architecture, this abstraction layer has been implemented in the form of pluggable Data Sources, which hide away the mechanics of accessing data into a separate data access layer. A new Memcached-based Data Source strategy could be implemented to support this caching scenario.
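A hypothetical Memcached-style Data Source following this cache-aside pattern might look like the sketch below. The class name and its expiry scheme are illustrative assumptions (using an in-process Map in place of a real Memcached pool), not the GPII’s actual Data Source API:

```javascript
// Cache-aside sketch: check the cache, fall back to the "database",
// then populate the cache with an expiry time so stale entries are
// eventually invalidated.
class CachedDataSource {
  constructor(database, ttlMillis = 60000) {
    this.database = database;         // stand-in for CouchDB
    this.cache = new Map();           // stand-in for a Memcached pool
    this.ttlMillis = ttlMillis;
    this.dbReads = 0;                 // instrumentation for this sketch
  }

  get(key) {
    const entry = this.cache.get(key);
    if (entry && entry.expires > Date.now()) {
      return entry.value;             // fresh cached value: no DB contact
    }
    const value = this.database[key]; // cache miss or stale entry
    this.dbReads += 1;
    this.cache.set(key, { value, expires: Date.now() + this.ttlMillis });
    return value;
  }
}

const source = new CachedDataSource({ alice: { fontSize: 24 } });
source.get("alice"); // first read goes to the database
source.get("alice"); // second read is served from the cache
console.log(source.dbReads); // → 1
```

On a cache hit the database is never contacted; the time-based expiry shown here is one simple invalidation strategy, though real deployments often also invalidate explicitly when the underlying data is written.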
In summary, the benefits of caching are lower request latencies and reduced load on database servers. This configuration was researched but not implemented as part of this costing project due to the implementation complexity. As the GPII code matures and its data retrieval patterns are fully implemented, it will be easier to introduce application-managed caching servers in the form of new Data Source abstractions.
Next-Generation Clustering and Distributed Application Management
Much of this document so far has discussed the use of virtual machines and OpenStack. In 2013, when this report was first drafted, both were important emerging techniques for cloud software deployment. Since then, however, there has been a significant shift in how server resources are partitioned and managed through the use of containers. Containers enable developers to easily write, test, and package up their application for deployment using an environment that is identical to that of the production environment. This significantly reduces the likelihood of tough-to-debug errors related to differences in libraries or versions between development and production systems.
At this point it seems relatively clear that containerization of applications, combined with software and tooling to manage and orchestrate them across multiple nodes, is the future of managing distributed applications at scale using open source tools, especially within a private cloud context. In many cases, containerization is a complement, or addition to, virtualization in the cloud.
Current approaches to containerization take advantage of relatively recent features of the Linux kernel designed to manage the constraint and prioritization of system resources (CPU, memory, network bandwidth) in isolated contexts. Functionally, this permits applications to be structured, deployed, and managed as lightweight self-contained packages, similar to each application running on a virtualized OS but without the same resource overhead. Put simply, containers enable the underlying hardware resources to be used more efficiently (since virtual machines often cause unnecessary resource contention and locks on CPU resources), while also reducing the effort required to maintain and upgrade services (due to sharing a common kernel and core libraries amongst containers). There is tremendous interest and activity in the open source community around containerization at the moment.
In the context of the GPII’s reference implementation, Docker containers are used for the current deployment architecture. Docker is the most prominent container framework at the moment, and has significant adoption by industry in clouds such as Microsoft Azure and IBM BlueMix. Docker containers have proven valuable for reproducibility and predictability when deploying the GPII infrastructure. While the GPII is not yet deployable as a fully containerized infrastructure, several of the core components including the Preferences Server, Matchmaker, and Cloud-based Flow Manager have been deployed in containers since late 2014 (prior to the Cloud4All year 3 review).
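To illustrate the approach, a containerized Node.js service such as the Preferences Server can be described by a short Dockerfile. The file below is a hedged sketch rather than the project's actual build configuration; the base image, port, and start command are assumptions for illustration.

```dockerfile
# Sketch of a Dockerfile for a Node.js-based GPII service (hypothetical).
FROM node:0.10

# Copy the application source into the image and install its dependencies.
COPY . /opt/app
WORKDIR /opt/app
RUN npm install --production

# The port and entry script are assumptions, not the actual GPII values.
EXPOSE 8081
CMD ["node", "index.js"]
```

Because the image bundles the application together with its exact dependency versions, the same artifact that passes tests in development is what runs in production.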
While Docker currently dominates the conversation, the containerization space remains somewhat unstable, and several recent (mid-2015) developments are notable and being tracked for potential use with the GPII:
- The Google-backed Kubernetes container orchestration project
- The launch of the Linux Foundation’s Open Container Initiative
- The increasing maturity of container-focused Linux distributions such as CoreOS and Project Atomic, which discard traditional package management entirely in favour of containers
- The arrival of Docker-competing container formats such as Rocket
Correspondingly, it has become increasingly acknowledged in the public conversation around OpenStack that its combination of complexity and functionality may in many cases exceed the resources and needs of many organizations. The most plausible foreseeable future is a containerized one, whether managed by Kubernetes, Mesos, or an alternative, perhaps even in combination with virtualization and OpenStack clustering.
High Availability
It is important to remember that the goal of the GPII is to deploy an infrastructure intended to support the building of next-generation inclusive projects and services. Like other infrastructure-oriented services (or the Internet itself), it has high availability as a basic requirement of deployment.
Specifically, the vision for eventual high-availability deployment should consider, among others, the factors described below.
A high-availability deployment of the GPII must avoid single points of failure and distribute work across a geographically and technically diverse infrastructure, where the loss of one piece in the distributed network (a data centre power outage, for example) can be absorbed temporarily by other systems in the infrastructure.
The distributed load balancer architecture described in the scenarios above will support the ability to redirect requests to another rack or region. However, the weakest link in the system will be the Domain Name System (DNS) service, which is responsible for resolving a domain name to the IP address of one or more load balancers. The challenge is that DNS for a given domain name is often served from a single centralized location; if it goes down, the site will be unavailable until DNS records can be updated. The solution to this challenge is to use a global DNS service such as those offered by CloudFlare, Amazon Route 53, or SoftLayer (NetScaler). Global DNS also has the advantage of lowering latencies and load times for users by directing their requests to instances of the GPII infrastructure located in data centres closer to them.
Disaster Preparedness, Management and Recovery
In the event of a significant emergency, it may be necessary to migrate the GPII temporarily to the public cloud or other infrastructure. This use case is supported by configuration management and automation tools such as Ansible, which enable any part of the infrastructure to be spun up on new servers or clouds with an absolute minimum of human involvement. Several Ansible-based GPII automation playbooks were developed as part of this costing project.
More effort needs to be invested in researching and developing automated playbooks for each type of failure scenario. As part of the larger GPII production effort, these use cases should be examined, documented, and simulated in several real cloud environments.
Autoscaling, Self-Healing, and Automated Recovery
The deployed GPII should be capable of adjusting its resources automatically to deal with sudden spikes in load. This could include the use of spare private cloud capacity or budgeting for the temporary purchase of compute and/or storage nodes in the public cloud.
Related to autoscaling are self-healing and automated recovery: wherever possible, adverse conditions should be recognized and corrected by the running system, either by components themselves or by external monitoring systems.
The Reference Architecture
Our reference architecture adopts the self-hosted “private cloud” approach, which allows us to investigate all of the component pieces required to host the GPII, from the lowest hardware level up to the software tools used to provision and deploy services. This is in contrast to infrastructure-as-a-service (IaaS) options like Amazon Web Services, where the hardware and low-level software infrastructure is largely hidden from the end-user. While both approaches have their strengths and weaknesses, this reference architecture provides a more detailed, full-stack “lens” through which we can analyse the costs and requirements associated with deploying and supporting the GPII over the long term.
At a high level, our reference implementation consists of the following components:
- Networking equipment
- Cloud controllers
- Compute and storage nodes
- Automated configuration management “recipes” for provisioning virtual machines and containers
These hardware and software infrastructural components — in combination with technical/system administration support staff and the GPII services themselves — represent the primary considerations when determining the cost and scope of putting the GPII into large-scale production.
Details of the Reference Architecture
This section outlines the technical details of the reference architecture, which is based on the IDI open research cluster. The original version of this section was written in 2013, when the GPII was first deployed primarily for testing, development, and Cloud4All project pilots. The text has been updated to reflect some of the refinements and improvements made since then to support, among other things, containerization.
The goal of this section is to provide a concrete instance of one possible private cloud configuration for deploying the GPII. This is intended to shed light on the kinds of questions and considerations that will go into producing a production-level deployment environment for the GPII. It is not prescriptive in nature; the GPII architecture does not depend on any of these specifics (server hardware, controllers, or even the use of OpenStack), and other approaches are also being explored.
Figure 6: An illustration of the physical layout of the IDI open research cluster, which serves as a reference implementation of the GPII. A cluster of servers hosts an array of small virtual machines, all managed by OpenStack-based cloud controllers and connected to the internet via a pair of switches.
Compute and Storage Hardware
A node in an OpenStack cluster provides CPU, RAM, and potentially storage resources when requested. The extent of services provided by a node depends on which OpenStack components are deployed on it. In the case of the IDI cluster, the OpenStack Compute, Block Storage, Object Storage, and Networking components run on each node, enabling them to support a variety of compute and storage needs. In practice, however, the GPII is a compute-heavy application that typically uses only ordinary virtual machines (and containers) with standard filesystem storage (block storage) for databases such as CouchDB.
Dell PowerEdge C2100 servers were used as part of our reference architecture. This represents fairly typical “middle of the road” server hardware that is easy to upgrade and replace from a variety of vendors. It is worth noting that the GPII architecture itself makes few demands in terms of specific hardware; the GPII can run nearly anywhere. As long as a server is capable of running Linux (or, conceivably, Windows Server) and features fast networking, it is likely capable of running the GPII services. Significant effort has been put into ensuring that the software makes no specific hardware demands so that it can easily be deployed in a variety of cloud environments.
When a request is issued to create a virtual machine, it is received by the OpenStack controller, which then determines which node will actually provide the required compute and storage resources. The controller, among other things, hosts the OpenStack Dashboard, Identity, and Image Services. If the controller’s services are not available due to an outage, then new resources and existing VMs cannot be managed. As a result, more than one controller should be part of any production OpenStack cluster in order to provide redundancy in case of a hardware failure or rack outage. In the case of the IDI reference infrastructure, we have three controllers.
While high availability OpenStack configuration details are out of scope for this document, it should be noted that an OpenStack controller’s services require a message queue (RabbitMQ) and database (MySQL), which must also be set up with redundancy in mind (via replication or clustering). OpenStack is, at least currently, more of a “cloud development toolkit” that experienced system integrators can develop and extend to build their own private clouds, rather than something you simply install and use. This is one advantage of using a dedicated controller device such as the Nebula One (or alternatives from Piston, Breqwatr, and others); high availability is built into the appliance, rather than another piece of infrastructure that needs to be implemented prior to using the cluster.
Networking Hardware
All communication between the controllers, nodes, and the VMs in a cloud cluster happens over the network. To ensure that this communication occurs in a reliable and responsive manner, very fast network switches and cabling are required. We recommend that 10GbE switches be used in a GPII private cloud cluster.
Each cabinet, which contains one controller and its nodes, also requires a switch. If more than one cabinet is present (such as the three used in our reference implementation), then another switching layer (referred to as aggregation switches) should be implemented so that a controller in one cabinet has access to the nodes in other cabinets. Our reference architecture made use of the Arista 7050S switch series.
Software Infrastructure
This section describes the various software components of our reference architecture for deploying the GPII. It enumerates and describes the different types of software that must be deployed and configured in order to support a production environment for the GPII, ensuring that all systems are secured, automated, and monitored.
This section can be understood as a kind of “recipe” for building the software components of a production GPII environment; specific tools can be substituted to taste, however each of these components needs to be addressed in some fashion.
The GPII services do not require a particular operating system. Since the GPII realtime framework is based on Node.js, it is capable of being deployed on Linux, Windows or Macintosh servers. In practice, Linux is the most widely used operating system for most cloud environments, and is currently used to run Node.js servers for large scale production sites such as Walmart, Netflix, and PayPal. Any Linux distribution should work.
For the IDI reference cluster, we chose the CentOS 7 Linux distribution because it is well-supported by the Red Hat community and features modern services such as systemd, which makes managing servers simpler and more reliable. Another reason for choosing CentOS is that each major release is supported for about ten years, which can reduce administrative costs by not requiring frequent, time-consuming major upgrades. CentOS is derived from source code published by Red Hat, and Red Hat Enterprise Linux is binary-compatible with it; in cases where commercial vendor support is required, a deployment can therefore move to Red Hat Enterprise Linux.
Configuration Management
Configuration management tools have become a crucial aspect of cloud-based application deployment and management at scale in order to support repeatable and idempotent changes to systems. Such changes can include:
- the deployment of new software
- changes to existing installations
- creation of accounts and services
- critical security updates and patches
Configuration management tools support the establishment of reusable patterns for configuring complex distributed application infrastructure, and contribute substantially to horizontal scaling by allowing new nodes to be brought online quickly, repeatedly, and even automatically to deal with increased demand.
The IDI reference cluster uses the open-source Ansible configuration management solution, along with custom code written in Ansible’s automation language (as “playbooks”). The choice of Ansible was influenced by the simplicity of its architecture relative to competing tools such as Chef or Puppet, and by its philosophy of using common Linux tools such as OpenSSH to manage configuration when possible, rather than inventing new paradigms. In a basic installation, Ansible needs to be installed only on a control host, which then manages other hosts via OpenSSH and Python-based modules (OpenSSH and Python are installed by default on most Linux server distributions).
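To make the playbook idea concrete, the fragment below sketches what a small Ansible playbook might look like. It is illustrative only and not taken from the actual GPII playbooks; the inventory group, package, and service names are assumptions.

```yaml
# Hypothetical playbook: ensure Nginx is installed and running on all
# hosts in a "proxies" inventory group. The play is idempotent:
# re-running it changes nothing if hosts are already in this state.
- hosts: proxies
  become: yes
  tasks:
    - name: Ensure Nginx is installed
      yum:
        name: nginx
        state: present

    - name: Ensure Nginx is running and enabled at boot
      service:
        name: nginx
        state: started
        enabled: yes
```

Each task describes a desired end state rather than a sequence of shell commands, which is what makes repeated, automated runs safe.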
Being able to collect and analyze metrics from hosts and services provides us with a more complete picture of the cluster’s health and also aids in troubleshooting issues. Metrics can include such measurements as:
- Response times for requests at an aggregate, per-host, and per-service level. Other metrics of interest can be calculated from this such as average response time overall or during particular periods of use. Response time is a critical factor in end-user satisfaction.
- Number of non-fatal warnings or errors at an aggregate, per-host and/or per-service level. Especially in multi-service application stacks, some warnings or errors may only begin to occur under conditions of production usage or heavier load.
- General service or host availability. In a load-balanced and/or high-availability environment that may include automatic recovery or replacement of failing services, collecting metrics around general service or host availability (e.g. how often do services go down and need to be replaced or recovered? which ones, and when?) can be important in understanding the longer-term health of an infrastructure and showing possible targets for continuous improvement.
The reference architecture uses two projects to collect and analyze metrics:
- Graphite is a real-time graphing system originally developed for live monitoring of server metrics. At a high level, real-time monitoring with Graphite involves:
- writing code or configurations to forward numeric data of interest to Graphite’s processing backend, known as carbon
- observing the data in Graphite’s web application frontend
- Diamond is an agent that can be installed on hosts to monitor host metrics such as current CPU and memory usage and forward them to Graphite or other real-time monitoring systems. Diamond also features an API to allow the implementation of custom collectors that could be used to monitor and forward arbitrary host or service metrics.
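For illustration, forwarding a datapoint to carbon uses Graphite's simple plaintext protocol: a single line of the form `metric.path value unix-timestamp`, sent over TCP (port 2003 by default). The sketch below, with a hypothetical metric name and host, shows the idea in Python.

```python
import socket
import time

def carbon_message(path, value, timestamp=None):
    # Format one datapoint in Carbon's plaintext protocol:
    # "<metric.path> <value> <unix-timestamp>\n"
    if timestamp is None:
        timestamp = int(time.time())
    return "%s %s %d\n" % (path, value, timestamp)

def send_metric(host, port, path, value):
    # Open a TCP connection to the carbon-cache plaintext listener
    # (conventionally port 2003) and send a single datapoint.
    sock = socket.create_connection((host, port))
    try:
        sock.sendall(carbon_message(path, value).encode("utf-8"))
    finally:
        sock.close()

# Example: record a (hypothetical) response-time metric. In a real
# deployment this would be sent via send_metric("carbon-host", 2003, ...).
msg = carbon_message("gpii.prefsserver.response_ms", 42, timestamp=1438387200)
```

Diamond and similar agents are essentially automated producers of exactly these messages for host-level metrics.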
Along with metrics, it is important to collect the logs generated by hosts and their services to allow for historical analysis of trends, errors, and warning messages, and to assist in debugging and performance analysis. The reference architecture uses Logstash, a project that aids in the collection and storage of logs. Logstash uses Elasticsearch to store and index log data, which can then be analyzed using a frontend such as Kibana. Each host needs to be configured to direct its logs to the hosts running the Logstash services; the default syslog daemon (rsyslog on CentOS and Ubuntu) can fulfill this role.
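As a sketch of the rsyslog side of this setup, a single forwarding rule on each host is typically sufficient; the hostname and port below are hypothetical, and the Logstash syslog input must be configured to listen on the matching port.

```
# /etc/rsyslog.d/forward.conf (sketch; hostname and port are assumptions)
# "*.*" matches all facilities and severities; "@@" selects TCP
# (a single "@" would forward over UDP instead).
*.* @@logstash.example.org:5514
```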
Retrieving data, whether originating from an application server or database, is generally a time-consuming process. Reverse proxies can be configured to cache results from application servers and memory caches can be utilized closer to the data tier. Additional information covering different scenarios related to caching is described above in the ‘Components of the Infrastructure’ section.
DevOps and Performance Testing Toolkit
As a key deliverable of this costing project, we produced a toolkit of automation recipes for several core GPII services, including the Preferences Server and Matchmakers, as well as a suite of reusable performance tests that can be run by developers and testers as the GPII evolves. This toolkit is available for use by GPII developers and integrators on Github at:
Our goals when implementing the toolkit were:
- to provide an initial baseline set of benchmarks for the current preferences server
- to create and share openly a toolkit for measuring the performance and load characteristics of the GPII infrastructure, which can be run repeatedly and regularly
- to establish basic reusable “recipes” for automated provisioning and deployment of virtual machines and software to support common GPII infrastructure (i.e. load balancers, databases, and GPII services) in a cloud-based server environment (e.g. OpenStack, AWS, etc.). In other words, the goal was a toolkit that can be used over and over again to regularly monitor our performance characteristics and to assess the cost and quality of the GPII infrastructure.
Our objective for implementing this testing infrastructure was to establish a baseline performance profile, measured in requests per second, for the Preferences Server in the scenarios outlined above, and then to use that baseline as a reference while we explored scaling techniques. The intention is for this baseline and testing methodology to be incorporated into the development process to measure the performance impacts of new releases of the GPII components.
An effective Preferences Server deployment utilizes several open source technologies. The Preferences Server itself is a Node.js application. Nginx acts as a reverse proxy, accepting HTTPS requests and forwarding them to the Preferences Server. The Preferences Server in turn relies on CouchDB as its data store. This is the application stack that was used in our tests.
The topology above can be implemented using a single server, but using one host for all three services is not ideal in a production environment. A hardware failure on that host would interrupt service for the entire application. Additionally, once the performance limit of the application has been reached on one host, it becomes necessary to distribute the load across more application servers. Nginx allows us to scale horizontally by load balancing several Preferences Server instances. This approach offers better performance, since several instances fulfill requests concurrently. It also provides redundancy, so that stability issues on one or even a few instances will not result in service outages.
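A minimal sketch of this load-balancing arrangement in Nginx configuration follows; the upstream server names, port, and certificate paths are assumptions for illustration, not the actual deployment's values.

```nginx
# Sketch: terminate HTTPS at Nginx and balance requests across several
# Preferences Server instances (hostnames and paths are hypothetical).
upstream preferences_servers {
    server prefs1.internal:8081;
    server prefs2.internal:8081;
    server prefs3.internal:8081;
}

server {
    listen 443 ssl;
    ssl_certificate     /etc/nginx/ssl/gpii.crt;
    ssl_certificate_key /etc/nginx/ssl/gpii.key;

    location / {
        # Requests are distributed round-robin across the upstream group.
        proxy_pass http://preferences_servers;
    }
}
```

Adding capacity then amounts to provisioning another instance and appending one `server` line to the upstream block.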
Although we did not observe performance bottlenecks related to CouchDB in our tests, it would be prudent to horizontally scale the database layer and avoid a single point of failure. This can be achieved by triggering replication events on each CouchDB instance, creating a peer database replication environment where each instance can perform read and write queries. CouchDB requests from the Preferences Server would then be sent to Nginx, which would act as a reverse proxy for each CouchDB instance.
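Replication between two CouchDB instances is triggered by POSTing a small JSON document to one instance's `/_replicate` endpoint. The sketch below only builds that request body; the database URLs are hypothetical.

```python
import json

def replication_doc(source, target, continuous=True):
    # Build the JSON body for a POST to CouchDB's /_replicate endpoint.
    # With continuous=True, CouchDB keeps the target in sync as the
    # source changes, rather than performing a one-shot copy.
    return json.dumps({
        "source": source,
        "target": target,
        "continuous": continuous,
    })

# Hypothetical hosts: replicate the preferences database from couch-a
# to couch-b. A mirror-image request issued on couch-b completes the
# peer setup so that both instances accept reads and writes.
body = replication_doc("http://couch-a:5984/preferences",
                       "http://couch-b:5984/preferences")
```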
It is important to adopt a configuration management strategy when managing numerous hosts and applications; doing so results in faster, reproducible deployments. Cluster management workflows can also require a solution that can run tasks on all (or a selection of) hosts in parallel and then collect the results. Chef and Puppet are two popular tools that can help automate host configuration and software deployments. Unfortunately, fulfilling each solution's list of dependencies can be a formidable task on its own, especially if the command orchestration requirement is also pursued. Ansible is a newer, actively developed tool that allows anyone to document the state of a host or application in a language that almost resembles plain English, while also allowing parallel remote command execution. We utilized both aspects when developing roles for the Preferences Server's stack. An Ansible role can be thought of as a container for desired application state, whether that involves installing packages, creating users, managing services, or practically anything else that we would want to automate. Each role (or several) can then be applied to a group of servers based on the task they need to perform in the cluster.
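A top-level playbook applying such roles to host groups might look like the sketch below; the role and group names are illustrative, not the project's actual ones.

```yaml
# Hypothetical site.yml: apply each role to the inventory group of
# hosts that performs that function in the cluster.
- hosts: proxies
  roles:
    - nginx

- hosts: app_servers
  roles:
    - preferences_server

- hosts: databases
  roles:
    - couchdb
```

Scaling a tier then means adding hosts to the corresponding inventory group and re-running the playbook.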
In order to simulate tens of thousands of users interacting with the Preferences Server, we needed a flexible solution capable of scaling as our needs grew. We considered software as basic as Apache Bench as well as more comprehensive solutions such as JMeter, but in the end Tsung was the best match for our requirements. Tsung provides a holistic approach to load testing that encompasses support for multiple protocols (we focused on HTTP and HTTPS), collecting performance metrics from the hosts being tested, documenting realistic test scenarios in configuration files, and generating detailed reports for each test run. It can also be deployed in a clustered configuration, which means that when we want to simulate requests from hundreds of thousands of users, it is just a matter of provisioning additional Tsung client hosts.
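For reference, a Tsung scenario is described in an XML configuration file. The fragment below is a simplified sketch: it ramps up fifty new simulated users per second for five minutes, each fetching a preferences URL over HTTPS. The server hostname and request path are hypothetical.

```xml
<?xml version="1.0"?>
<!DOCTYPE tsung SYSTEM "/usr/share/tsung/tsung-1.0.dtd">
<tsung loglevel="notice">
  <!-- Hosts that generate load; more clients can be added to scale up. -->
  <clients>
    <client host="localhost" use_controller_vm="true"/>
  </clients>
  <!-- The system under test (hostname is an assumption). -->
  <servers>
    <server host="preferences.example.org" port="443" type="ssl"/>
  </servers>
  <!-- Arrival phase: 50 new simulated users per second for 5 minutes. -->
  <load>
    <arrivalphase phase="1" duration="5" unit="minute">
      <users arrival_rate="50" unit="second"/>
    </arrivalphase>
  </load>
  <!-- Each simulated user performs one GET against a hypothetical path. -->
  <sessions>
    <session name="fetch-preferences" probability="100" type="ts_http">
      <request>
        <http url="/preferences/testUser" method="GET"/>
      </request>
    </session>
  </sessions>
</tsung>
```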
This report described the key activities and findings of the GPII costing project. We described the models and techniques for scaling software in the cloud, determined the types of hardware and software components required for managing cloud infrastructure, and projected cost models for using and maintaining the GPII in production. Additionally, we performed initial performance tests and created automated installation and testing scripts that will support the ongoing development and evaluation of the GPII system.
Looking forward to a large-scale GPII production development and piloting effort, significant further research and implementation is required to determine the best approach for high availability and autoscaling of the GPII, as well as to more fully integrate the emerging benefits of container infrastructure such as Kubernetes, Mesos, and CoreOS.
As large enterprise vendors increasingly wade into the public cloud IaaS market and attempt to compete with Amazon’s market dominance, costs appear to be trending downward. In contrast, the promise of private cloud, self-hosted infrastructure using technologies such as OpenStack and Nebula seems to be significantly moderating; new technologies such as Kubernetes, Docker, and Rocket are likely going to have a significant impact on how we will administer and scale the GPII in practice. This entire space is exciting but highly uncertain and in flux. The GPII’s flexible, modular, and largely stateless microservice architecture, along with our emphasis on automated configuration management and provisioning, will help to ensure that regardless of how the infrastructure layer evolves, we will be prepared to scale to greater load and usage. There is still significant research and work to be done, but the foundations are now in place for a larger, funded GPII production effort.
Appendix A: Glossary of Software Components
Unless otherwise noted, all of the software below is open source.
Ansible
An IT automation framework for application deployment, configuration management and other systems administration tasks. Ansible is part of a class of tools along with Chef, Puppet and SaltStack that have significantly shifted modern systems administration towards standardized, replicable approaches modeled on concepts such as “infrastructure as code” and idempotence.
Cobbler
A Linux-based provisioning server for automating the installation and initial setup of operating systems across networked environments. Cobbler allows for the creation of reproducible patterns for OS installation on a variety of systems such as bare metal servers and virtual machines, allowing new machines in standard configurations to be brought up quickly.
CouchDB
A web-oriented document-based NoSQL database. CouchDB is a popular means of storing persistent data in web application architecture due to its high horizontal scalability, high availability and fault tolerance. NoSQL databases have become increasingly popular and mature in recent years, and are now commonly used instead of or alongside traditional relational database architectures.
Docker
A tool for packaging and running applications in Linux “containers”, lightweight redistributable environments isolated from the main operating system. Docker has been a technology of great significance in the application management world in the last several years due to its powerful abstraction and promise of application portability.
HAProxy
An application for providing proxying and load balancing for TCP- and HTTP-based applications. In a production web application architecture, HAProxy helps distribute requests across multiple instances of the application, contributing to performance, reliability and high availability by ensuring no instance of the application becomes overwhelmed by traffic and that requests can be routed away from unresponsive or crashed instances of the application.
Kubernetes
An orchestration system for containerized applications, Kubernetes manages and distributes Docker-based applications across multi-host environments. Kubernetes reached 1.0 very recently (July 2015) but can be expected to make a significant impact due to its backing by Google and promise of easing the complexities of managing containerized applications in production.
Mesos
A framework for managing, pooling and sharing distributed computing resources, and building applications on top of them. Mesos also offers support for the management of Docker containers as part of its resource management.
Memcached
A distributed in-memory object cache generally used to reduce calls by web servers to databases or other comparatively performance-expensive back-end services. Memcached is small, simple and fast.
Nginx
A web server with reverse proxying features, commonly used as part of high-performance site architectures. Nginx typically serves the same purpose as the better-known Apache web server in an application architecture, but is preferred by many high-performance sites for its greater ease of configurability and performance-oriented architecture.
OpenStack
An umbrella project dedicated to building tools for creating and managing cloud-computing platforms. Organizations deploying OpenStack typically aim to gain the advantages of the public cloud in large-scale provisioning and management within their own datacentre infrastructure. The OpenStack project is governed by a nonprofit corporation and has contributions from many large companies invested in open source such as IBM, Red Hat and Oracle.
Last Updated in August 2015 by Avtar Gill, Michelle D’Souza, Colin Clark, Alan Harnum, Anastasia Cheetham, Giovanni Tirloni, and Jess Mitchell.