Some notes on the feasibility and results of a proof of concept (PoC) exploring a streaming approach between S3 objects.

Some of the proposed scenarios study the role of the PoC in long-duration use cases, the separation into a control plane and a data plane, and the possibility of embedding it in the network infrastructure itself.

The data streaming approach between S3 objects

In the last two years, new scenarios and use cases demanded by users have emerged that require the transfer and synchronization of data from a source object to a destination object.

These new scenarios and use cases appear naturally in multicloud environments and custom storage solutions that make use of the S3 API as a standard but operate outside the Amazon Web Services (AWS) ecosystem.

The current strategy is to cover these use cases through what has been called a multicloud data controller, a solution that can move data from on-premises storage systems to the Cloud and from one public cloud to others, and even to multiple locations simultaneously.

However, in practice, deploying a multicloud data controller usually means adopting yet another layer of software at scale that must be maintained and managed as part of the solution.

The value proposition of a multicloud data controller usually highlights aspects such as the abstraction of multiple storage sources under a single namespace, unified access to and navigation of the data, transparent, efficient and smart object transfer and data placement, and single-endpoint access.

The data controller thus becomes an element that houses and applies data 'policies', while being aware of the 'mechanisms' and storage services available in the different 'Clouds' and storage backends to which it is connected.

I think some of these aspects, especially those related to the transfer and synchronization of data between objects, can benefit from a streaming approach between objects, with results that would reduce resource consumption both in the multicloud data controller and in the different storage backends that it intermediates.

It is this last "streaming approach" that I refer to in this post as "streaming" between S3 objects: the ability to transfer data from a source object to a destination object through a single continuous flow, "connecting" the data downloaded from the first object to the upload into the second object immediately and with minimal resources.

At scale, this translates into lower resource consumption and lower latency for services interconnected through S3 interfaces.

Thinking about the impact of using streaming with the archive zone in Ceph RGW/S3

The S3 API has become a kind of lingua franca between object storage providers and users.

This is a good thing in that it reduces friction in communication and adoption of tools and libraries within the S3 community. Also note that the S3 API is documented as part of the AWS S3 product and is freely downloadable.

The not-so-positive part of it becoming a lingua franca in the industry is that its development is closed and proprietary: the evolution of the S3 API is aligned with the AWS ecosystem and does not respond to needs or use cases that arise outside of it.

In recent years, the Ceph Object Storage interface based on S3 (RGW/S3) has evolved in parallel with Amazon S3, as expected. In some scenarios and use cases demanded by the community, however, the project has incorporated extensions to the API that allow greater customization and a better object storage experience while remaining compatible with the original specification.

An example of the latter is the 'archive zone', a functionality extension on top of the S3 API, adapted to work with the RGW federation model at the zone level, that was contributed upstream in Ceph Nautilus as a result of one of our projects at Igalia.

If you know the feature, you will know that the main use cases of the 'archive zone' are to protect and maintain a complete S3 object history beyond voluntary or involuntary updates by legitimate users in non-archive zones, and/or to reduce the number of copies of S3 objects in federated configurations.

This functionality is usually useful in normal business operations or in exceptional situations such as a ransomware attack.

Thus, once the 'archive zone' does its job there is always some version of the objects that may need to be 'restored' in the future.

In these cases, restoration is part of the business logic and consists of connecting to the gateway of an archive zone to download the data to be restored from a specific versioned object, and then connecting to the gateway of a non-archive zone to upload that data on top of an existing object. This causes the latest version of the object in all zones to sync and be updated with the data of interest.

In the previous situation, being able to stream between S3 objects located in different zones with different gateways, without requiring intermediate temporary storage and using the minimum possible resources, would be the most efficient way to restore the objects involved.
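A rough sketch of what that restore flow looks like at the HTTP level, using only the Go standard library (gateway hostnames, bucket and key are placeholders, and the AWS4 signing of both requests is omitted); the versionId query parameter selects the version kept by the archive zone:

package poc // illustrative package name, not the PoC's actual code

import (
	"fmt"
	"net/http"
)

// restoreVersion sketches the restore flow: download one specific version
// of an object from the archive zone gateway and upload it on top of the
// existing object in a non-archive zone, so that all zones re-sync with
// the restored data. No temporary file and no full in-memory copy of the
// object are needed: the GET body is fed directly to the PUT request.
func restoreVersion(bucket, key, versionID string) error {
	src := fmt.Sprintf("http://archive-gw/%s/%s?versionId=%s", bucket, key, versionID)
	dst := fmt.Sprintf("http://main-gw/%s/%s", bucket, key)

	resp, err := http.Get(src)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	req, err := http.NewRequest(http.MethodPut, dst, resp.Body)
	if err != nil {
		return err
	}
	// Propagate the object size when the source reports it.
	req.ContentLength = resp.ContentLength

	putResp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer putResp.Body.Close()
	if putResp.StatusCode != http.StatusOK {
		return fmt.Errorf("upload failed: %s", putResp.Status)
	}
	return nil
}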

Streaming feasibility between S3 objects and proof of concept

To test the viability of streaming between S3 objects as we understand it in this blog post, I implemented a small proof of concept (PoC) in Golang.

The implementation is quite simple: it synchronizes a request that downloads data from one object with another request that uploads that data to another object.

That is, the upload follows the download, consuming data from a configurable-sized RAM buffer. This is the minimum necessary to get a PoC to start working with.
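A minimal sketch of that synchronization in Go, using only the standard library (identifier names are illustrative, not the PoC's actual code):

package poc

import (
	"io"
	"net/http"
)

// streamBody shows how the upload follows the download: the downloaded
// body is copied into a pipe through a fixed-size buffer, and the read
// end of the pipe becomes the body of the PUT request for the target
// object. Only bufSize bytes of object data are held in memory per copy.
func streamBody(downloadBody io.Reader, targetURL string, bufSize int) (*http.Request, error) {
	pr, pw := io.Pipe()
	go func() {
		buf := make([]byte, bufSize) // e.g. 4096 bytes
		_, err := io.CopyBuffer(pw, downloadBody, buf)
		pw.CloseWithError(err)
	}()
	// The caller signs and sends this request; the upload proceeds as
	// data arrives from the download.
	return http.NewRequest(http.MethodPut, targetURL, pr)
}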

The first obvious limitation of the PoC is that uploads are 'single PUT', which caps the maximum object size on Amazon S3 at 5 GB per upload. In other storage solutions where this parameter is configurable, it is not something to consider.

Another limiting aspect of the PoC is the need to use a variant of AWS4 authentication in which we ask not to include the body hash in the signature of upload requests, so that we can start uploading data as we read it, before having it available in its entirety.

This is a very efficient way to move data, as it only requires two authenticated requests involving a few KB of overhead and does not consume extra temporary resources on the server. The trade-off is that we are not signing the payload, which may not be acceptable in certain scenarios.
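In Signature Version 4 terms, this corresponds to signing the upload request with the literal UNSIGNED-PAYLOAD as the payload hash, both in the x-amz-content-sha256 header and in the canonical request built by the signer. A sketch of that idea:

package poc

import "net/http"

// unsignedPayload marks an S3 upload request so that the AWS4 signature
// is computed over the literal "UNSIGNED-PAYLOAD" instead of the SHA-256
// of the body, which lets the upload start before the whole body (and
// therefore its hash) is known. Whatever SigV4 implementation signs the
// request must use the same literal as the hashed-payload value.
func unsignedPayload(req *http.Request) {
	req.Header.Set("x-amz-content-sha256", "UNSIGNED-PAYLOAD")
}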

Considering the above factors, the tool is functional in that it shows a possible way of doing what we are looking for and helps to clear up some of the initial doubts about the possibilities of streaming with the current S3 API, memory consumption, timeouts, upload/download times, authentication algorithm, etc.

The PoC supports the Amazon S3 and Ceph RGW providers. Therefore it is possible to copy objects between both systems. This allows us to experiment and simulate "multicloud use cases".

In the case of AWS S3, it works only with the Virginia region (us-east-1), because the rest of the regions need a location header to be set that later has to be integrated into the signing of the client requests, etc.

The PoC is made up of a library and a simple command-line application (CLI) that uses it. JSON is consumed through standard input, where each line represents an operation between a source object and a destination object.

Currently the only supported operation is "copy" since the only thing we are trying to do is download and upload the object simultaneously, making efficient use of a buffer in RAM.

Each entry in the JSON can also be understood as two S3 clients that connect to two "endpoints", a source and a destination. These clients are autonomous and independent. An example of a JSON input is as follows:

{"source":{"aws_keys":{"access_key":"AAAA","secret_key":"BBBB"},"connection":{
"user_agent":"poc","read_buffer":4096,"write_buffer":4096},"endpoint":{"protoc
ol":"http","domain":"rgw1.com","port":80},"region":"us-east-1","bucket_name":"
my-bucket-1","key_name":"my-key-1","version_id":""},"target":{"aws_keys":{"acc
ess_key":"CCCC","secret_key":"DDDD"},"connection":{"user_agent":"poc","read_bu
ffer":4096,"write_buffer":4096},"endpoint":{"protocol":"http","domain":"rgw2",
"port":80},"region":"us-east-1","bucket_name":"my-bucket-2","key_name":"my-key
-2"}}

In a more readable format the entry would look like this:

{
  "source": {
    "aws_keys": {
      "access_key": "AAAA",
      "secret_key": "BBBB"
    },
    "connection": {
      "user_agent": "poc",
      "read_buffer": 4096,
      "write_buffer": 4096
    },
    "endpoint": {
      "protocol": "http",
      "domain": "rgw1.com",
      "port": 80
    },
    "region": "us-east-1",
    "bucket_name": "my-bucket-1",
    "key_name": "my-key-1",
    "version_id": ""
  },
  "target": {
    "aws_keys": {
      "access_key": "CCCC",
      "secret_key": "DDDD"
    },
    "connection": {
      "user_agent": "poc",
      "read_buffer": 4096,
      "write_buffer": 4096
    },
    "endpoint": {
      "protocol": "http",
      "domain": "rgw2",
      "port": 80
    },
    "region": "us-east-1",
    "bucket_name": "my-bucket-2",
    "key_name": "my-key-2"
  }
}
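
As a rough idea, the Go structures such an entry maps to could look like the following (field names mirror the JSON above; these are illustrative reconstructions, not the PoC's actual types):

package poc

// Entry mirrors one line of the JSON input: an autonomous source client
// and an autonomous target client.
type Entry struct {
	Source Object `json:"source"`
	Target Object `json:"target"`
}

// Object describes one side of the copy: credentials, connection
// parameters, endpoint and the object coordinates.
type Object struct {
	AWSKeys    AWSKeys    `json:"aws_keys"`
	Connection Connection `json:"connection"`
	Endpoint   Endpoint   `json:"endpoint"`
	Region     string     `json:"region"`
	BucketName string     `json:"bucket_name"`
	KeyName    string     `json:"key_name"`
	VersionID  string     `json:"version_id,omitempty"`
}

type AWSKeys struct {
	AccessKey string `json:"access_key"`
	SecretKey string `json:"secret_key"`
}

type Connection struct {
	UserAgent   string `json:"user_agent"`
	ReadBuffer  int    `json:"read_buffer"`
	WriteBuffer int    `json:"write_buffer"`
}

type Endpoint struct {
	Protocol string `json:"protocol"`
	Domain   string `json:"domain"`
	Port     int    `json:"port"`
}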

Each entry therefore contains the data needed to sign and authenticate the requests, the source and destination of the objects, the version of the object if necessary, etc.

The "connection" entity represents the parameters that are currently configurable.

In addition to the user agents, which can be useful for identification on the server or in some middleware, there are the read and write buffers of the S3 clients, which must match in the current PoC implementation.

The read/write buffers are the transport buffers of the HTTP(S) connections and could differ in the future. Internally, we pass them directly to the net/http library; they limit the memory consumed for receiving and sending data, and they avoid intermediate copies of data.

The buffer used for downloading is the same buffer used for uploading, through Golang's reader/writer interfaces.
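In net/http terms, this maps to the transport's buffer sizes. A hedged sketch of how an HTTP client could be built from the "connection" parameters, reusing the illustrative Connection type above:

package poc

import "net/http"

// newHTTPClient builds an HTTP client whose transport buffers match the
// "connection" parameters of an entry, so the memory used for receiving
// and sending data is bounded by the configured values.
func newHTTPClient(c Connection) *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			ReadBufferSize:  c.ReadBuffer,  // e.g. 4096
			WriteBufferSize: c.WriteBuffer, // e.g. 4096
		},
	}
}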

The current implementation supports parallel object copying as well.

Parallel copying of objects is configured through a command line switch (-w). An example copying the object "my-key-1" seven times would be the following:

$ cat entries.json | wc -l
7
$ cat entries.json | jq '.source.key_name'
"my-key-1"
"my-key-1"
"my-key-1"
"my-key-1"
"my-key-1"
"my-key-1"
"my-key-1"
$ cat entries.json | jq '.target.key_name'
"my-key-2"
"my-key-3"
"my-key-4"
"my-key-5"
"my-key-6"
"my-key-7"
"my-key-8"

If we do not parallelize (-w 1), the copies run sequentially:

$ cat entries.json | poc -w 1 | jq '.data.target.key_name'
"my-key-2"
"my-key-3"
"my-key-4"
"my-key-5"
"my-key-6"
"my-key-7"
"my-key-8"

Parallelizing with 15 workers (-w 15), the PoC actually runs with 7 workers internally because there is no more workload in the example:

$ cat entries.json | poc -w 15 | jq '.data.target.key_name'
"my-key-7"
"my-key-2"
"my-key-3"
"my-key-5"
"my-key-8"
"my-key-4"
"my-key-6"

To get an idea of the memory that the PoC consumes, we performed the same test with a "my-key-1" object of ~300 MB.

This gives us a theoretical peak of ~2100 MB (7 copies of ~300 MB each) if we were to store the content of all the copies in memory at once.

In reality, with the example configuration (four transport buffers of 4 KB each per operation), the process works with a sustained RAM occupation of about ~6 MB.

The data for a sequential example run would be as follows:

$ cat entries.json | poc -w 1 | jq '.status'
"success"
"success"
"success"
"success"
"success"
"success"
"success"

If we observe the RSS of the PoC process while it is running, we see an increase in real physical RAM of ~0.5 MB by the end of the execution, which could well be related to Go's garbage collector (GC).

In a possible PoC optimization phase, the GC could be run manually past a threshold if necessary.
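One possible way to do that, sketched below, would be to watch the heap with runtime.ReadMemStats and force a collection once an arbitrary threshold is crossed:

package poc

import "runtime"

// maybeCollect forces a garbage collection cycle when the live heap grows
// beyond thresholdBytes. Purely illustrative: calling runtime.GC() by hand
// only makes sense if profiling shows the default GC pacing is not enough.
func maybeCollect(thresholdBytes uint64) {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	if m.HeapAlloc > thresholdBytes {
		runtime.GC()
	}
}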

For the same example running with 7 workers in parallel, it reaches ~6.5 MB. This result is consistent, since the execution time is shorter than in the sequential case, and the difference could well be due to allocations from the Go runtime that the GC has not recycled yet.

For the 4 KB buffer test (the default that comes with net/http), each worker in the current implementation seems to require ~70 KB of RAM.

This ~70 KB also includes the scanner line buffer that the PoC uses to load each line of the JSON input and the intermediate/final structures of the JSON transformation.

The default line buffer uses ~4 KB. It can be changed with the '-s' switch. In principle, configuring this parameter could be interesting when processing thousands or millions of entries with several instances of the PoC running.

In this case we can adjust the line buffer size to the maximum line length of all the JSON inputs, thus avoiding re-allocations and memory fragmentation.
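With Go's bufio.Scanner this amounts to something like the following sketch, where maxLine would come from the '-s' switch:

package poc

import (
	"bufio"
	"encoding/json"
	"io"
)

// readEntries reads one JSON entry per line from r. maxLine sets the
// scanner's maximum (and initial) buffer size; sizing it close to the
// longest expected line avoids re-allocations while scanning.
func readEntries(r io.Reader, maxLine int) ([]Entry, error) {
	sc := bufio.NewScanner(r)
	sc.Buffer(make([]byte, 0, maxLine), maxLine)
	var entries []Entry
	for sc.Scan() {
		var e Entry
		if err := json.Unmarshal(sc.Bytes(), &e); err != nil {
			return nil, err
		}
		entries = append(entries, e)
	}
	return entries, sc.Err()
}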

Regarding the ability to debug PoC executions, we have to distinguish between what can go wrong with the tool itself and what can go wrong with each S3 client running in parallel, both downloading and uploading, in each of the operations.

Everything that has to do with the behavior of the S3 clients is consumed from standard input and reported on standard output, showing both what is going well and what is going wrong.

An example is shown below:

$ cat entries.json | poc | jq
{
  "status": "error",
  "message": "403: The AWS Access Key Id you provided does not exist in our
records.",
  "data": {
    "source": {
      "aws_keys": {
   ...
}
{
  "status": "success",
  "data": {
    "source": {
      "aws_keys": {
   ...
}

What the PoC does at this point is process each of the entries it reads from the JSON input and include it directly under "data" in the output.

"data" together with "status" are two keys that always appear in the processed output of each operation.

"status" can be "success" or "error".

In the case of "error", a message describing the type of error is also included.

The PoC's exit code is 0 if everything went well and 1 if at least one error was found. This makes it possible to know whether everything went well without having to process the whole output.
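A sketch of what that per-operation output could look like in Go (illustrative types, not the PoC's exact implementation):

package poc

import (
	"encoding/json"
	"os"
)

// Result is emitted as one JSON document per processed operation.
type Result struct {
	Status  string `json:"status"`            // "success" or "error"
	Message string `json:"message,omitempty"` // only present on error
	Data    Entry  `json:"data"`              // the input entry, echoed back
}

// report writes one result to standard output and returns this
// operation's contribution to the exit code: 0 on success, 1 on error.
func report(res Result) int {
	_ = json.NewEncoder(os.Stdout).Encode(res)
	if res.Status == "error" {
		return 1
	}
	return 0
}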

A complete and successful log would show the following:

2020/07/18 09:11:36 [INFO] logger initialized ...
2020/07/18 09:11:36 [INFO] client {filename: -, numGoRoutines: 3, lineBufferSize: 4096}
2020/07/18 09:11:36 [INFO] bufReader { lineBufferSize : 4096 }
2020/07/18 09:11:37 [INFO] stats { numOperations : 2, numSuccessOperations : 2, numErrorOperations : 0}
2020/07/18 09:11:37 [INFO] SUCCESS

Finally, there is the aspect of the management and administration of a tool with these characteristics.

It is relevant to note that a tool of these characteristics can have a long execution time: in the order of hours, days or even weeks.

We must also bear in mind that the process can be fed continuously until otherwise indicated.

With this in mind, the PoC has a "daemon" mode that turns the main thread into a "data plane", executing pending JSON entries in multiple threads.

The "daemon" also opens a port for controlling and monitoring the PoC through a simple remote administration web page which in turn makes use of a JSON API.

Running the PoC on network hardware

Another interesting scenario for evaluating the PoC is the possibility of streaming between S3 objects from devices that are part of the network infrastructure, such as physical switches or routers.

I guess the same strategy is also workable with virtual devices although this scenario was not part of the testing.

Since the PoC is implemented in Golang, cross-compiling it to another operating system and architecture is supported and immediate.

The hardware used for the tests in this scenario was a low-end home gateway based on a Broadcom BCM6328 SoC with 320 MHz CPU, 16 MB Flash, 32 MB RAM and 4 x 100M ports. The hardware was flashed with OpenWrt.

Once the PoC was configured, it worked correctly, keeping within reasonable consumption thresholds and not showing anomalous or unexpected behavior.

Regardless of how exotic this scenario may be, it allows us to consider a distributed design of the PoC in which the streaming between objects is distributed and parallelized across the different devices of the network infrastructure, so that execution takes place wherever it is most appropriate.

Wrapping up

New scenarios and emerging use cases, enabled by multicloud solutions and extensions to the original S3 API, favor data "streaming" approaches between objects that can simplify and accelerate current storage operations that employ store-and-forward strategies.

The original S3 API was designed to easily upload and download data through client-server requests. The storage services provided by S3 continue to revolve around the two central abstractions of the service, buckets and objects, but the API does not seek to connect the data of those objects in a continuous, lightweight flow that can be acted upon.

These flows, consisting of the download of data from one object together with the immediate upload of the same data, allow the composition of data sources and destinations (objects) not only within the service itself but also between compatible services that implement the S3 interface.

Throughout the post, the implementation of the above is illustrated through a small proof of concept (PoC) that allowed testing the viability and behavior of these flows in different contexts and situations.

The PoC was fundamentally useful to validate correct functioning, identify the variables involved, and reflect on the range of values observed throughout the tests.

In the Ceph Object Gateway's 'archive zone' feature, we also found a scenario where applying a 'streaming' approach makes it possible for objects in an 'archive zone' and in the remaining zones to be restored and synchronized while consuming very few local and remote resources.

The PoC explored limitations in the API (authentication) and design considerations (maximum object size, etc.) to take into account when streaming between S3 objects.

Beyond using the PoC to study a single execution, multiple tests with flows between objects were run successfully.

Finally, a limited, low-resource scenario where the PoC was executed on an embedded system also allows us to contemplate streaming solutions that incorporate a control plane and a data plane.

In these cases, the data plane can be aligned and integrated with the network infrastructure itself (routers, switches, etc.).
