Two of the key promises of Software-Defined Storage (SDS) are higher automation and greater flexibility, which together reduce the cost of storage administration tasks.
This entry will go over the concepts required to understand the virtual data and control paths in SDS solutions, and how metadata convey the data requirements to the automation software with the proper granularity and flexibility.
Although there is no formal definition of Software-Defined Storage (SDS), it is all about decoupling the hardware from the intelligence.
In SDS, the hardware becomes "irrelevant": the intelligence lives in the software, and the hardware may be treated as a cheap commodity.
From an architectural perspective, SDS sees storage in three layers: orchestration, data services (compression, deduplication...) and hardware.
SDS abstracts storage resources to enable pooling, replication, and on-demand provisioning. The result is the ability to aggregate storage arrays into logical pools.
SDS and SDDC
Beyond being a stand-alone technology, SDS is also a building block of the Software-Defined Data Center (SDDC), together with Software-Defined Compute (SDC) and Software-Defined Networking (SDN) technologies.
In SDDC, as introduced by the former VMware CTO Steve Herrod in 2012, "compute, storage, networking, security, and availability services are pooled, aggregated, and delivered as software, and managed by intelligent, policy-driven software".
Software-Defined technology brings new and challenging changes to the industry, but all of them are part of a movement towards promoting a greater role for software systems above the hardware.
Many of the complex and advanced functions that used to require proprietary hardware are now implemented in software running on commodity hardware.
Data Services and Storage Virtualization
Data Services and Storage Virtualization are fundamental blocks to understand the virtual data and control paths in Software-Defined solutions.
Data Services are standards-based, uniform means of supplying a need such as provisioning, protection, availability, performance, security, etc. over the stored data. The behaviour of those data services is defined by policies.
On the other hand, Storage Virtualization is the process of grouping the physical storage from multiple network storage devices so that it looks like a single storage device. Note that this logical storage may be the result of stacking multiple physical and/or logical abstractions.
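As a toy illustration of that grouping (not any particular product's API — all device names and capacities below are invented), storage virtualization can be sketched as presenting several physical devices behind a single logical one:

```python
# Sketch: pooling several physical devices into one logical device.
# Device names and capacities are invented for illustration.

class PhysicalDevice:
    def __init__(self, name, capacity_gb):
        self.name = name
        self.capacity_gb = capacity_gb

class LogicalPool:
    """Presents multiple backing devices as a single storage device."""
    def __init__(self, devices):
        self.devices = devices

    @property
    def capacity_gb(self):
        # The consumer only sees the aggregate capacity,
        # not the individual devices behind it.
        return sum(d.capacity_gb for d in self.devices)

pool = LogicalPool([PhysicalDevice("array-a", 500),
                    PhysicalDevice("array-b", 250),
                    PhysicalDevice("nvme-c", 250)])
print(pool.capacity_gb)  # aggregate view: 1000
```

Note that, as the post says, such logical devices can themselves be stacked: a `LogicalPool` could just as well contain other pools.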
Virtual Data and Control Paths
As described by the Storage Networking Industry Association (SNIA), SDS builds on the virtualization of the Data Path, although SDS is not virtualization alone. Bear in mind that the Control Path needs to be abstracted as a service as well.
The virtual data path is formed by block, file and object interfaces that support applications written to those interfaces. At the same time, the control path uses metadata to express requirements, control data services and describe service-level capabilities.
This is how SDS simplifies management in order to reduce the cost of maintaining the storage infrastructure: managing data services and defining storage policies avoids manual administration, enabling flexibility and automation.
Following this approach, each data object is able to convey its own requirements independently of which virtual storage device it resides on.
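A minimal sketch of the idea, assuming a hypothetical policy engine (the policy names and the `x-policy-tier` metadata key are invented): each object carries metadata expressing its requirements, and the control path maps that metadata to data services regardless of where the object lives:

```python
# Sketch: per-object metadata drives data services in the control path.
# Policy names and metadata keys are hypothetical, for illustration only.

POLICIES = {
    "gold":   {"replicas": 3, "compression": True},
    "bronze": {"replicas": 1, "compression": False},
}

def data_services_for(obj_metadata, default="bronze"):
    """Resolve the data services an object requires from its own metadata."""
    tier = obj_metadata.get("x-policy-tier", default)
    return POLICIES[tier]

# The object conveys its own requirements, independently of the device:
services = data_services_for({"x-policy-tier": "gold"})
print(services)  # {'replicas': 3, 'compression': True}
```

The point of the sketch is the direction of control: the requirement travels with the object, and the software decides how to honour it.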
Virtual Data and Control Paths in Ceph RGW S3
In the picture above, the virtual data and control paths in RGW S3 are outlined in black.
The data path is defined by the AWS S3 interface specification itself. This path is a virtual path because the virtual RGW S3 object store is built on top of the native Ceph object store.
The control path is also defined by the AWS S3 interface, as part of the metadata attached to the data.
The data services are enabled/configured by the storage administrator using 'pools'. These pools are the divisions or partitions used to define key parameters such as the number of placement groups, the CRUSH ruleset or the ownership. RGW S3 runs on top of those pools to provide the final storage service.
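As a sketch of that administrator-side configuration, pools can be created and tuned from the Ceph CLI. The pool name, placement-group counts and rule name below are examples only, and the exact option names vary across Ceph releases; these commands require a running cluster:

```shell
# Create a replicated pool with 128 placement groups (example values).
ceph osd pool create rgw-data 128 128 replicated

# Bind the pool to a CRUSH rule that encodes the placement policy.
ceph osd pool set rgw-data crush_rule replicated_rule

# Tag the pool so RGW can use it.
ceph osd pool application enable rgw-data rgw
```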
In some use cases, choosing the right pool or attaching metadata (zone/region) is enough to specify where the data will live. In other cases this configuration is not needed.
Take as an example the 'x-amz-website-redirect-location' system-defined metadata in the S3 static website hosting feature. It is used to redirect requests for the associated object to another object in the same bucket or to an external URL.
This metadata is a good example of how a policy can allow request redirections specified by the user instead of the administrator.
While attaching metadata could look like an arbitrary way to get the job done, it is a recurrent pattern in S3. Here the metadata are part of the control path, and they drive the behaviour of the use case along the whole storage stack.
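The pattern can be sketched in plain Python (no real S3 code involved — `website_response` is a hypothetical helper standing in for the website-hosting path): given an object's metadata, the stack answers with a redirect instead of the object body:

```python
# Sketch: how 'x-amz-website-redirect-location' metadata could drive a
# redirect in a website-hosting path. Pure illustration, not RGW code.

def website_response(obj_metadata, obj_body):
    """Return (status, headers, body), honouring the redirect metadata."""
    target = obj_metadata.get("x-amz-website-redirect-location")
    if target:
        # The user-supplied metadata, not the administrator, decides this.
        return 301, {"Location": target}, b""
    return 200, {}, obj_body

status, headers, _ = website_response(
    {"x-amz-website-redirect-location": "/new-page.html"}, b"old content")
print(status, headers["Location"])  # 301 /new-page.html
```

The metadata is attached once by the user at upload time; every later request is shaped by it without any administrator intervention.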
Throughout this blog post we explored the virtual data and control paths in SDS, their relationship with SDDC, and how metadata can be used to drive services in Ceph RGW S3.
In case you are interested in SDDC, the VMware SDDC documentation is available online. Bear in mind that the SDDC term covers a broad concept and this documentation covers the VMware perspective only.
- Ansible AWS S3 core module now supports Ceph RGW S3
- The Ceph RGW storage driver goes upstream in Libcloud
- Scalable placement of replicated data in Ceph
- Ceph, a free unified distributed storage system
- On S3, endpoints, regions, signatures and Boto 3