Over the last few months I have been working on Ceph, a free unified distributed storage system, implementing some missing features in the RADOS gateway, helping customers with Ceph clusters in production and fixing bugs.
This effort is part of my daily work here at Igalia on upstream projects. As you may know, Igalia works in the Cloud arena, providing development, deployment and orchestration services around interesting open projects.
My goal with this first post is to introduce Ceph in a simple way that makes this marvelous piece of software easy to understand. I will cover the design and main innovations in Ceph together with its architecture, major use cases and relationship with OpenStack (a well-known free and open-source software platform for cloud computing).
Ceph is an object storage based free software storage platform that stores data on a single distributed computer cluster. I would say this definition captures the essence of Ceph perfectly. It is also the foundation for understanding its innovations, its architecture and its performance/scalability factors.
Let's start with object storage. Object storage is a storage architecture that manages data as objects, as opposed to other storage architectures like file systems, which manage data as a file hierarchy, and block storage, which manages data as blocks within sectors and tracks. Each object typically includes the data, a variable amount of metadata, and a globally unique identifier.
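To picture what "data, metadata and a globally unique identifier" means in practice, here is a minimal Python sketch. It is only an illustration of the object model, not how Ceph stores anything: `StoredObject` and `FlatObjectStore` are names I made up, and the store is a plain in-memory dictionary.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    """One object: payload, free-form metadata and a globally unique ID."""
    data: bytes
    metadata: dict = field(default_factory=dict)
    object_id: str = field(default_factory=lambda: uuid.uuid4().hex)


class FlatObjectStore:
    """Hypothetical in-memory store with a flat namespace (no directories, no blocks)."""

    def __init__(self):
        self._objects = {}

    def put(self, data, **metadata):
        # The store hands back the ID; callers never pick a path or a block address.
        obj = StoredObject(data=data, metadata=dict(metadata))
        self._objects[obj.object_id] = obj
        return obj.object_id

    def get(self, object_id):
        return self._objects[object_id]


store = FlatObjectStore()
photo_id = store.put(b"\x89PNG...", content_type="image/png", owner="alice")
photo = store.get(photo_id)
```

The only handle a client keeps is `photo_id`; everything else (payload and metadata) travels with the object itself.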
If we add a smart cluster approach to the previous design we get a reliable object storage service that can scale to many thousands of devices. This reliable object storage service is known as RADOS (Reliable Autonomic Distributed Object Store) in the current Ceph implementation.
But what is a 'smart cluster approach' here? At the petabyte and exabyte scale, systems are necessarily dynamic. They are built incrementally, they grow and contract with the deployment of new storage and decommissioning of old devices, devices fail and recover on a continuous basis, and large amounts of data are created and destroyed. RADOS maintains a consistent view of the data distribution and provides consistent read and write access to data objects.
RADOS also provides storage nodes with complete knowledge of the distribution of data in the system. Devices can therefore act semi-autonomously, using peer-to-peer-like protocols to self-manage data replication, participate in failure detection and respond to device failures and the resulting changes in the distribution of data by replicating or migrating data objects.
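The failure detection part can be pictured as peers watching each other's heartbeats. The sketch below is a deliberately naive version of that idea, not the RADOS protocol: `HeartbeatTracker` is a hypothetical name, timestamps are passed in by hand, and the 20 second grace period is just a value I picked for the example.

```python
class HeartbeatTracker:
    """Hypothetical helper: remembers when each peer was last heard from."""

    def __init__(self, grace_seconds):
        self.grace_seconds = grace_seconds
        self.last_seen = {}

    def heartbeat(self, peer_id, now):
        # Record the time of the latest heartbeat received from this peer.
        self.last_seen[peer_id] = now

    def suspected_failures(self, now):
        # A peer is suspected once it has been silent for longer than the grace period.
        return sorted(
            peer for peer, seen in self.last_seen.items()
            if now - seen > self.grace_seconds
        )


tracker = HeartbeatTracker(grace_seconds=20)
for peer in ("osd.0", "osd.1", "osd.2"):
    tracker.heartbeat(peer, now=0)

tracker.heartbeat("osd.0", now=15)
tracker.heartbeat("osd.2", now=18)

# At t=25 only osd.1 has been silent for more than 20 seconds.
suspects = tracker.suspected_failures(now=25)
```

In a real cluster a suspicion like this is only the first step: it has to be reported and agreed upon before the data distribution changes and objects are re-replicated.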
If we consider the minimal configuration together with the basic components needed to set up a RADOS system, we will have a set of object storage daemons (OSDs) and a small group of monitors (MONs) responsible for managing OSD cluster membership.
In Ceph this OSD cluster membership requires a cluster map. This cluster map specifies cluster membership, device state and the mapping of data objects to devices. The data distribution is specified first by mapping objects to placement groups (PGs) and then mapping each PG onto a set of devices. The algorithm that computes this placement is known as CRUSH (Controlled Replication Under Scalable Hashing).
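To make the two mapping steps a bit more concrete, here is a toy sketch in Python. Please note it is not CRUSH: I use plain rendezvous (highest-random-weight) hashing as a stand-in, and `stable_hash`, `object_to_pg`, `pg_to_osds`, `locate` and the layout of `cluster_map` are all made up for the illustration. What it does share with the real thing is the idea that anybody holding the cluster map can compute the placement locally.

```python
import hashlib


def stable_hash(*parts):
    """Deterministic 64-bit hash, identical on every client and daemon."""
    digest = hashlib.sha256("/".join(str(p) for p in parts).encode()).digest()
    return int.from_bytes(digest[:8], "big")


def object_to_pg(object_name, pg_count):
    # Step 1: the object name alone decides the placement group.
    return stable_hash(object_name) % pg_count


def pg_to_osds(pg_id, osds, replicas):
    # Step 2 (stand-in for CRUSH): rank devices by a per-PG score, keep the top ones.
    ranked = sorted(osds, key=lambda osd: stable_hash(pg_id, osd), reverse=True)
    return ranked[:replicas]


def locate(object_name, cluster_map):
    pg_id = object_to_pg(object_name, cluster_map["pg_count"])
    return pg_id, pg_to_osds(pg_id, cluster_map["up_osds"], cluster_map["replicas"])


cluster_map = {
    "pg_count": 128,
    "replicas": 3,
    "up_osds": ["osd.0", "osd.1", "osd.2", "osd.3", "osd.4", "osd.5"],
}
pg_id, osds = locate("vm-image-42", cluster_map)

# Simulate the failure of the first device in the set: a new map, no lookup table.
after_failure = dict(
    cluster_map,
    up_osds=[osd for osd in cluster_map["up_osds"] if osd != osds[0]],
)
new_pg_id, new_osds = locate("vm-image-42", after_failure)
```

With this stand-in, the object stays in the same PG after the failure, the two surviving devices keep their copies and only one new device joins the set. PGs that never touched the failed device do not move at all.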
With this information in mind we may consider two major innovations in Ceph RADOS:
- The CRUSH algorithm. Ceph clients and Ceph OSD daemons compute object locations (with a hashing function) instead of having to depend on a central lookup table.
- Smart daemons. Ceph OSD daemons and Ceph clients are cluster-aware. This enables OSDs to interact directly with other OSDs and MONs. Ceph clients interact with OSDs directly.
Both items add significant intelligence to the solution, avoiding bottlenecks while pursuing hyperscale at the petabyte and exabyte scale.
At this point we should have enough information to understand the basic Ceph architecture. Let's have a look at the usual block diagram for Ceph:
- RGW. A web services gateway for object storage, compatible with S3 and Swift
- RBD. A reliable, fully distributed block device with cloud platform integration
- CEPHFS. A distributed file system with POSIX semantics and scale-out metadata management
- LIBRADOS. A library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)
- RADOS. A software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors
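As a taste of the LIBRADOS layer, here is a small sketch using the Python binding (`rados`). The calls `Rados(conffile=...)`, `connect()`, `open_ioctx()`, `write_full()`, `set_xattr()`, `read()`, `close()` and `shutdown()` come from that binding. Everything else is an assumption for the example: the helper names, the object name `greeting`, and the `FakeIoctx` stand-in that lets the helper run without a cluster.

```python
def store_and_fetch(ioctx, name, data, xattrs=None):
    """Write an object, attach optional xattrs, and read the object back."""
    ioctx.write_full(name, data)
    for key, value in (xattrs or {}).items():
        ioctx.set_xattr(name, key, value)
    # Note: with the real binding, read() returns at most 8192 bytes by default.
    return ioctx.read(name)


def run_against_cluster(conffile, pool):
    """Assumption: the `rados` Python binding is installed and `pool` exists."""
    import rados  # imported lazily so the sketch loads without Ceph installed

    cluster = rados.Rados(conffile=conffile)
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx(pool)
        try:
            return store_and_fetch(ioctx, "greeting", b"hello ceph", {"lang": b"en"})
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()


class FakeIoctx:
    """In-memory stand-in mimicking the three calls used above, for dry runs."""

    def __init__(self):
        self.objects, self.xattrs = {}, {}

    def write_full(self, name, data):
        self.objects[name] = data

    def set_xattr(self, name, key, value):
        self.xattrs[(name, key)] = value

    def read(self, name):
        return self.objects[name]


fake = FakeIoctx()
echoed = store_and_fetch(fake, "greeting", b"hello ceph", {"lang": b"en"})
```

On a machine with access to a cluster, something like `run_against_cluster("/etc/ceph/ceph.conf", "mypool")` would do the same round trip for real (both the path and the pool name are placeholders).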
Mapping out the major components involved under the hood and their interactions gives a more detailed version of this architecture:
The OpenStack basics
Although this is an introductory post on Ceph, I will briefly describe OpenStack and its relationship with Ceph. It will be useful later.
Ceph may be used alone but some of its most interesting use cases take place as part of OpenStack. A quick overview of OpenStack will be useful to understand how the OpenStack and Ceph components work together to provide reliable and scalable storage.
The current stable release of OpenStack is 'Liberty' and it includes 17 components (compute, image services, object store, etc.). All those components have well-known code names (Nova, Glance, Swift, etc.).
The next picture captures a very high-level abstraction of OpenStack:
As you can see, Glance (VM image manager) and Cinder (block storage) are two core services in the solution.
As mentioned, the previous picture shows a simplified view of OpenStack. A more accurate diagram, including the relationships among the services, is shown in the next picture for 'Folsom', an earlier release (2012).
While OpenStack evolves and includes new services, this 'Folsom' picture should be good enough to introduce the services related to storage and the level of complexity of OpenStack.
So the storage services in place are Swift (object store service), Glance (image service) and Cinder (block storage service).
Those services work in tandem to cover the general and specific requirements for storage in OpenStack.
Using Ceph in OpenStack
The main integration points between OpenStack and Ceph are the object and block device interfaces.
The RADOS gateway (RGW) and the RADOS block device (RBD) interfaces are used to provide the required storage to 5 services (Keystone, Swift, Cinder, Glance and Nova).
It is worth mentioning that the compute service (Nova) interfaces with the RBD layer via a hypervisor. An open source hypervisor that works like a charm with Ceph is QEMU/KVM. It uses librbd and librados.
Another component worth mentioning in the stack is libvirt. OpenStack uses libvirt to configure QEMU/KVM properly.
Ceph RBD is currently the dominant choice among Cinder drivers, as stated in the sixth public survey of OpenStack users (page 31).
The physical deployment of Ceph and OpenStack
Setting up and operating a reliable and scalable storage cluster is always demanding. It requires careful planning across many different aspects. Some of these critical decisions are related to the cluster capacity (RAM, disks, number of nodes, usage profiles, etc.).
Although it is always possible to go with your own custom configuration, some hardware providers offer several standard configurations.
As a random and arbitrary example, we can have a look at the HPE Helion portfolio. This set of solutions is a mix of open-source software and integrated systems for enterprise cloud computing.
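A back-of-the-envelope calculation already shows why capacity planning matters. The sketch below is only rough arithmetic under my own assumptions: every number (10 nodes, 12 disks of 4 TB each, 3 replicas, filling the cluster up to 80%) is invented for the example, and `usable_capacity_tb` is a hypothetical helper, not a Ceph tool.

```python
def usable_capacity_tb(nodes, disks_per_node, disk_tb, replicas, max_fill_ratio):
    """Rough usable capacity: raw space divided by the replica count, minus headroom."""
    raw_tb = nodes * disks_per_node * disk_tb
    return raw_tb / replicas * max_fill_ratio


# Hypothetical cluster: 480 TB raw ends up as roughly 128 TB of usable space.
estimate = usable_capacity_tb(
    nodes=10, disks_per_node=12, disk_tb=4, replicas=3, max_fill_ratio=0.8
)
```

In other words, with these assumptions only about a quarter of the raw disk space is available for data, which is something to keep in mind when sizing the number of nodes and disks.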
The next picture shows the physical space required and how it compares to the different logical components in the architecture.
The new and old use cases
The production of data is expanding at an astonishing pace. Two major drivers in this rapid growth of global data are the analog-to-digital switch (software is everywhere) and the rapid increase in data generation by individuals and companies.
Today's storage use cases are radically different from those of a few years ago. These new use cases are all about storing and retrieving unstructured data like photos, videos and social media at massive scale. All this stuff requires real-time analytics and reporting together with efficient processing.
To meet these requirements, some companies are extending/migrating their current datacenters to support software-defined approaches. As a consequence, those new datacenters apply virtualization concepts such as abstraction, pooling, and automation to all of the data center’s resources and services to achieve IT as a service. In this vision all elements of the infrastructure (compute, storage, networking and security) are virtualized and delivered as a service.
In this context, we can identify some new and well-known use cases across the following 5 categories. The original classification is used by the Red Hat Storage team. Note that I am merging Cloud infrastructure and Virtualization here.
- Big data analytics. Storing, integrating, and analyzing data at petabyte scale
- Cloud infrastructure and Virtualization. Virtual machine storage and storage for tenant applications (Swift/S3 API)
- Rich media. Massive scalability and cost containment (scaling out with commodity hardware)
- File sync and share. Secure mobility, collaboration and the need for anytime, anywhere access to files
- Archival data. Agile, scalable, cost-effective and flexible unified storage (objects, blocks and file systems)
Ceph is used to support all these use cases in production with great results.
Pushing Ceph to the limit
Some folks in the CERN IT Department are pushing Ceph to the limit. They use Ceph as part of an OpenStack deployment and I have to say the numbers are great.
The solution is a large distributed OpenStack infrastructure with around 10,000 VMs and 100,000 CPU cores (1000 Cinder volumes and 1500 Glance images). The Cloud is predominantly used for physics data analysis but they also reported on a long tail of conventional IT services and user-managed application VMs.
If you want to know more about this Ceph cluster operated by CERN, I recommend watching this video from the Vancouver Summit 2015.
In brief, and beyond the great insights shared during the talk, the current Ceph version scales out to 10 PB. At that scale it just works. Beyond that threshold, it requires extra configuration adjustments.
I told you! This piece of software is marvelous!
I plan to add new blog entries to cover some of the new features implemented in the previous months. They are upstream code now so you will be able to enjoy them in Jewel!
If you are looking for some kind of support related to development, design, deployment, etc. in Ceph, or you would love to see some new feature in the next releases, feel free to contact me!