Container image: deliver it immediately
Hello, my name is Dmitry Svetlyakov, and I head the cloud platform operations group at VKontakte. I have been in systems administration for 12 years, more than 6 of them with container technologies.
There is little information on the RuNet about how to speed up the delivery of container images. I hope our experience will help administrators of large container installations speed up image delivery to end nodes, organize an alternative source for obtaining images, and make the whole process fault-tolerant.
I will cover:
- a 101 course on OCI images;
- the problems we ran into;
- what Peer-to-Peer OCI distribution is;
- how to improve delivery resiliency;
- as a bonus, how to speed up the unpacking of images;
- what we have achieved and how we plan to develop further.
The article is based on my talk at the VK Kubernetes Conference; you can watch the recording.
OCI Image 101
The standard image delivery procedure looks like this: there is a container engine and a registry. All you need to do is request an image and receive it. But in a private cloud this requires authorization. With a standalone engine, you log in directly from the console, and the sensitive data ends up in a file on the file system.
Kubernetes uses Secret objects to store this information. They contain a field with authorization data encoded in base64. For Kubernetes to use the Secret, you can either reference it explicitly in the manifest of any deployment unit or attach it implicitly to a service account. Creating objects of this type is one of the first automation tasks Kubernetes administrators have to solve, either with a custom solution or with some kind of operator.
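For illustration, a minimal sketch of both approaches; the registry address, secret name, and account are placeholders, not our real setup:

```yaml
# kubectl create secret docker-registry regcred \
#   --docker-server=registry.example.com \
#   --docker-username=deploy --docker-password='<token>'
apiVersion: v1
kind: Pod
metadata:
  name: demo
spec:
  containers:
    - name: app
      image: registry.example.com/team/app:1.0.0
  imagePullSecrets:
    - name: regcred   # explicit reference; could instead be attached to a service account
```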
Secrets in Kubernetes are a very slippery topic, as this abstraction does not provide any built-in method for encrypting sensitive data. We do not use the built-in Secrets abstraction and prefer HashiCorp Vault for retrieving any sensitive data. But, unfortunately, there is no way to make image pulls use registry credentials stored in Vault.
It is also worth mentioning the OCI image manifest: a JSON document describing a named image object. It lists the actual contents of the file system, that is, the layers we need to assemble the container's root file system. Its second function is to reference a specific configuration describing how to run the container: for example, which command to execute at startup and which ports we need. That completes the basic course.
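An abridged manifest sketch following the OCI image spec; digests and sizes here are placeholders:

```json
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.manifest.v1+json",
  "config": {
    "mediaType": "application/vnd.oci.image.config.v1+json",
    "digest": "sha256:…",
    "size": 7023
  },
  "layers": [
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:…",
      "size": 32654
    }
  ]
}
```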
Problems on a large scale
My team is developing an internal cloud, using it to run our applications.
Our cloud consists of several distributed data centers, several Kubernetes clusters, and a good number of alternatives besides. The largest cluster has about a thousand machines, and in total we serve more than ten thousand containers.
Applications that run in the cloud are responsible for various subsystems of the site, from rendering pages to executing various API methods such as uploading and processing photos. We also help improve the user experience with ML applications. We run on bare metal, and in addition to the x86 architecture we have ARM, as well as GPU and FPGA accelerators.
At this scale we ran into difficulties with the classical model of image acquisition. We have hundreds of different applications with different code bases, but today I would like to focus on one you already know: kPHP. The core of the site is written in PHP and compiled into a native binary by the kPHP compiler. In our cloud this binary is launched in service mode, where the scope of a task is limited to a single RPC request. We have come a long way and are now actively splitting various functionality into more lightweight services.
Now it looks like this:
- The entire site is updated every half hour, 48 times a day. The update covers both regular bare-metal servers and orchestration systems like Kubernetes. While you are reading this article, the VKontakte website will be updated.
- After assembly, debug symbols are stripped from the binary, but even then it weighs about 2 GB.
- We use about 500 machines in various cloud pools to run kPHP services.
- The registry's throughput is 10 Gbps, but we cannot always use all of it, since the registry also serves other needs.
From these values we can derive how long it takes to download a 2 GB blob to 500 machines from the authoritative registry. The result is a scary 13 minutes and 20 seconds per rollout, or more than ten hours a day. This is the metric we had to contend with.
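The back-of-the-envelope math behind these numbers:

```
2 GB × 8 bit/byte = 16 Gb per machine
16 Gb × 500 machines = 8,000 Gb per rollout
8,000 Gb ÷ 10 Gbps = 800 s = 13 min 20 s
800 s × 48 rollouts per day = 38,400 s ≈ 10.7 hours per day
```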
Before optimizing the distribution of images, we decided on the key points that we considered important to us.
- Keep the current registry as the source of truth, since we have a lot of automation and integrations built on it.
- Maintain end-client compatibility, whether the client is Kubernetes or an alternative technology.
- Fault tolerance and failover. In the delivery pipeline, the registry itself can be described as a black box, subject to human, software, and infrastructure problems. It can fail entirely, leaving images inaccessible. Of course, there is a disaster recovery document, but in the most tragic scenario restoring the service takes time, and I would like to go through it in a calm mode rather than on the engineers' jet thrust.
- Optimize the data transfer graph. In the classical model and at our scale, hundreds of servers request the same image, and we were senselessly shuttling the same set of bytes between data centers.
- Accelerate delivery to match the rollout metric of the rest of the site. This will allow us to keep moving stateless applications into the internal cloud.
Features of Peer-to-Peer Architecture
We already have a mechanism the rest of the site uses to deliver binaries to bare metal: our internal copyfast engine. As the name suggests, its goal is to distribute binary blobs quickly. The engine is based on gossip replication, and over the years we have seen the effectiveness of this method more than once, so we chose P2P as the architecture for distributing images.
Let me recap the basics of P2P (peering). It is a peer-to-peer decentralized network with no dedicated servers: each node is both a client and a server. In such a network, an agent can request any data from its neighbors and serve data it already has. The more agents there are, the more possible connections between them, and as their number grows, so does the total throughput of the P2P cloud.
Our own copyfast engine could bring us OCI images as archives, but the container engine would need to import them, which didn't satisfy all of our initial requirements. We decided to look at the solutions offered by open source.
I would like to note that we chose the solution almost two years ago, and at that time there were two products:
- Alibaba Dragonfly v1.
- Uber Kraken.
We first reviewed Dragonfly v1 and noticed that transferring any chunk between agents required mandatory coordination by the supernode. This meant the network's total bandwidth depended linearly on the coordinator's performance, and it degraded as the size of the cloud or of the transmitted artifact grew.
Today Dragonfly v1 is considered obsolete. The developers suggest using the second version of the product, but, according to the documentation, a supernode still coordinates transfers between end nodes.
So we chose Uber Kraken: each of its components follows the Unix way or, as they say now, is a microservice.
How Uber Kraken works. Each component is a small but important building block of the entire system.
Imagine that the container engine needs to get an image to run. The following components take part:
- Nginx, located on the same machine as the container engine. I will explain a little later why we need it; for now, all you need to know is that it simply forwards requests to the Kraken Agent process.
- Kraken Agent, a process that runs on each of our container hosts, be it a Kubernetes server or something else. It transmits and/or receives binary blobs on request.
You can place Nginx and Kraken Agent differently depending on your needs.
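For illustration, the container engine can be pointed at this local endpoint through a registry mirror. A sketch for containerd (the legacy mirror syntax of containerd 1.x; the registry name and port are assumptions, not our production values):

```toml
# /etc/containerd/config.toml: route pulls for our registry through the local agent
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."registry.example.com"]
  endpoint = ["http://127.0.0.1:16000"]
```

After that, an ordinary image pull transparently goes through Nginx to the Kraken Agent.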
To start the download, the agent needs to understand what exactly it should receive. For this it calls the BuildIndex service.
- The task of BuildIndex is to build the layer graph of the image we want to download. To do this, BuildIndex accesses the authoritative registry, and then "tells" the agent what layers are needed for the image.
- Next, Kraken Tracker comes into play - a component that shows the agent where to download the layers of the desired image. The agent sends it a list of required layers, and the tracker returns a list of end nodes where the agent can get them. The tracker stores the topology in a Redis key-value database.
At this stage, we already have that very P2P cloud for distributing our images: after downloading each layer, the agent informs the tracker that it is ready to seed it to others.
A few words about how an agent builds connections to other agents. Having received the list of seeders, it can also rank them by proximity. For this, each agent carries service information about its own location, for example a server-row or data-center label. Based on the location label, the agent prefers closer neighbors, which optimizes data transfer in the cloud and gives us the desired data locality.
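A minimal sketch of the idea in Go (this is an illustration of location-aware ranking, not Kraken's actual code; all names are hypothetical):

```go
package main

import (
	"fmt"
	"sort"
)

// Peer describes a potential seeder as returned by the tracker.
type Peer struct {
	Addr       string
	Datacenter string
	ServerRow  string
}

// rankPeers orders peers so that those sharing our location labels come first:
// same server row, then same data center, then everyone else.
func rankPeers(peers []Peer, self Peer) []Peer {
	score := func(p Peer) int {
		switch {
		case p.Datacenter == self.Datacenter && p.ServerRow == self.ServerRow:
			return 0 // closest: same row
		case p.Datacenter == self.Datacenter:
			return 1 // same DC, different row
		default:
			return 2 // remote DC, last resort
		}
	}
	sort.SliceStable(peers, func(i, j int) bool { return score(peers[i]) < score(peers[j]) })
	return peers
}

func main() {
	self := Peer{Addr: "10.0.1.5", Datacenter: "dc1", ServerRow: "r7"}
	peers := []Peer{
		{Addr: "10.2.0.9", Datacenter: "dc2", ServerRow: "r1"},
		{Addr: "10.0.1.8", Datacenter: "dc1", ServerRow: "r7"},
		{Addr: "10.0.3.2", Datacenter: "dc1", ServerRow: "r2"},
	}
	fmt.Println(rankPeers(peers, self)) // the same-row neighbor comes first
}
```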
- Kraken Origin is needed when a layer is absent from the P2P network. The agent requests the missing layer from it, and Kraken Origin fetches it from the authoritative registry and caches it for future requests. In this scheme, Kraken Origin is a superseed that keeps as many layers as possible and seeds them to everyone.
Note. You can store the Origin cache directly on the server or use remote storage via the S3 protocol. We chose local storage to reduce cache access latency. The entire control plane can run in master-master mode with any number of replicas; we settled on running one replica of every head component in each of our data centers. Authorization data is stored not in Kubernetes Secrets but in the configuration files of the components that access the registry.
- This architecture can be supplemented with an optional component, Kraken Proxy. Say you have a builder that assembles your images and pushes them to the authoritative registry. If you additionally need to "warm up" the P2P cloud, you can push through Kraken Proxy: it splits the image into its layer graph and individual layers and hands them to the BuildIndex and Origin components.
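As a usage sketch (the proxy address is an assumption based on Kraken's demo configs; adjust to your deployment):

```sh
# Push through Kraken Proxy instead of the registry directly; the proxy forwards
# the image to the authoritative registry and pre-seeds its layers into the cloud.
docker tag team/app:build-42 localhost:15000/team/app:build-42
docker push localhost:15000/team/app:build-42
```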
However, for all its advantages, Uber Kraken is not without drawbacks. These days the project receives less and less attention, and activity around it is falling, so I do not rule out that the fate of Uber Makisu may await it. I think some of you have used that project yourselves.
There is also a mandatory requirement: use unique tags for images, so the system knows an image is up to date without consulting the authoritative registry. In my opinion, this is not a drawback but a genuine feature that (a) teaches good manners and (b) lets you set the download policy to "do not download if available locally".
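A hypothetical pod spec fragment illustrating the combination; the tag format is just an example of what CI might stamp:

```yaml
containers:
  - name: app
    # a unique tag per build: a local copy with this tag can be trusted as current
    image: registry.example.com/team/app:build-7f3a9c1
    imagePullPolicy: IfNotPresent   # skip the download if the image is already on the node
```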
At some point in one of our internal chats, someone said that ImagePullPolicy IfNotPresent is not safe: a hypothetical attacker could replace the local image on the machine, and the engine would not notice. Well, it seems to me that we would have more serious problems than a spoofed image if an attacker were already on our machine. Image substitution is a supply chain vulnerability and a topic for another discussion.
Delivery fault tolerance
In fact, download speed is good, but it is more important to ensure a reliable delivery method. Before diving into this topic, we need to understand what exactly we want to protect.
First of all, we wrote the Anti-Washer module: it protects images from other modules that want to delete images to free up space in the registry. The module is divided into components:
- The first polls all sources to create a list of all actively used images on the site.
- Using this list, the second component assigns an immunity label to the images in the registry so they cannot be deleted while in use.
- The third component constantly, on a schedule, requests every image from the list from Kraken Origin (a sketch of such a warm-up loop follows this list). This "warming up" guarantees that all the necessary images are duplicated in the P2P cloud.
- ???
- PROFIT.
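A minimal sketch of what the warm-up could look like, assuming Origin exposes a registry-compatible endpoint; the host, port, and file format are illustrative, not our actual tooling:

```sh
#!/bin/sh
# Request the manifest of every actively used image, forcing the cache to
# fetch and keep the layers. Reads "repo tag" pairs from a plain-text list.
ORIGIN=http://kraken-origin.local:8081
while read -r repo tag; do
  curl -fsS -o /dev/null \
    -H 'Accept: application/vnd.oci.image.manifest.v1+json' \
    "$ORIGIN/v2/$repo/manifests/$tag" || echo "warm-up failed: $repo:$tag" >&2
done < active-images.txt
```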
As a result, we got two equivalent registries, each of which has the minimum set of images we need. To switch between them, we created a separate "Switch" module.
Remember I promised to tell you about Nginx?
The developers deliberately kept the agent simple and did not build web-server logic into it. Instead, the agent spawns an Nginx child process with the config it needs.
For our module to work regardless of the state of Uber Kraken, we added an extra launch argument to the source code. With it, the agent does not spawn the Nginx process itself, which allowed us to isolate the processes from each other.
Now we can configure Nginx as we need, and we added an upstream check: if the Kraken Agent answers "OK" to a request for the pause image, we consider our P2P cloud available.
Finally, in case the check fails, we need an upstream serving a fallback delivery method. We dusted off a half-forgotten project: the official Docker Registry. It has a proxy mode that caches all fetched layers locally. In addition, it stores the authorization data for talking to the authoritative registry, which preserves compatibility with the current Uber Kraken model. And, of course, we can warm it up with our Anti-Washer module. To keep the backup path tested, some machines use it as their main upstream on an ongoing basis. It turned out to be a kind of minimal Kraken.
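A sketch of what such an Nginx config could look like; ports, names, and the fallback details are illustrative rather than our production config:

```nginx
upstream kraken_agent   { server 127.0.0.1:16000; }
upstream fallback_proxy { server registry-proxy.internal:5000; }

server {
    listen 127.0.0.1:80;

    location /v2/ {
        proxy_pass http://kraken_agent;
        # treat upstream 5xx as a signal to retry via the fallback
        proxy_intercept_errors on;
        error_page 502 503 504 = @fallback;
    }

    location @fallback {
        # pull-through Docker Registry cache as a backup source
        proxy_pass http://fallback_proxy;
    }
}
```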
All of the modules I have talked about have their own health checks, and our monitoring system tells us if we are off course. But we are used to backing up everything, even our own checks, so we also use the k8s-image-availability-exporter project.
This exporter checks the availability of all images and returns metrics in Prometheus format; the project also provides rules for Alertmanager. Thus, if one source of information fails, the second remains in service.
A couple of words about unpacking
It might seem that unpacking is not worth mentioning compared to the rest: received, unpacked, done. Sometimes we forget how much work has already been done for us, and here we should thank Sargun Dhillon, an engineer at Netflix. In 2017 he made a patch to the moby project that was later ported to containerd and other projects. Dhillon added a check for the unpigz binary in the operating system: if it is present, the archive is unpacked with multiple threads; if not, plain single-threaded unpacking is used. So all we have to do is make sure pigz is installed on our operating systems and thank Sargun.
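The check itself is trivial; the package manager line assumes a Debian-like system:

```sh
# The patch looks for `unpigz` in PATH; install pigz to enable multithreaded unpacking.
command -v unpigz >/dev/null 2>&1 || apt-get install -y pigz
```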
Results and plans
The main indicators we fought for were speed and reliability, and we achieved significant results.
- The Pending phase, which includes downloading and unpacking, has become almost 5 times faster. The delivery solution shows the same results as the established solution we use for bare metal, which allows us to keep moving the site into our cloud.
- We have become more efficient in building a data transfer graph and have achieved the desired data-locality.
- We still maintain our authoritative registry and Kraken; over the past year there were only four incidents, including crashes, and none of them affected image delivery to end nodes.
We do not stop there and keep reworking open source solutions to fit our realities. For example, we found an interesting endgame (Endspiel) mode in Kraken. In chess, the endgame is the final phase of the game, when few pieces remain on the board and all remaining combinations can be calculated to the end. Endgame mode in Kraken activates when an agent has almost finished downloading a layer and only a couple of chunks remain; the agent then requests those chunks from several nodes at once. This mode improves the 95th percentile of delivery time. We want to carry this idea over to our copyfast.
In addition, we recently made changes that allow all Kraken components to run in rootless mode. Next we plan to hide all our secrets by teaching the Kraken code to fetch authorization data from the secure storage of HashiCorp Vault, and to update the library for working with Redis, which will let us switch from Redis Sentinel to Redis Cluster.
The principles of transparency and interoperability we adopted at the start of the project will let us easily change our minds and abandon Kraken. That option will come in handy if we grow tired of it, if an improved version of Dragonfly appears, or if a new cool project emerges. That said, we can keep working with Kraken, knowing that the OCI specification has settled down and this solution can easily be adapted to our needs.
FAQ
I also want to answer the questions that I was asked after speaking at the VK Kubernetes Conference.
Do you only deliver kPHP images via Kraken?
- No, we deliver virtually all our images through the Kraken cloud: from images whose file system is just a single Go binary to ML application images bundled with ML models. Delivery speed improves everywhere, though not by as much as in the kPHP case. I will note that speed is the second most important factor after reliability.
I have a lot of images. Why should I store all of them on each agent?
- The agent does not need to store all your images locally. On request it receives from the container engine only what it needs, and seeds to its neighbors only what it already has. Keeping as many images as possible is the job of Kraken Origin.
Your solution has no authorization. Does this mean you lose security?
- We just changed where the authorization data is stored, without abandoning it. Yes, the agent is accessible without authorization, but only on the loopback interface. If an attacker is already inside your system, downloading or replacing images is far from the worst thing they can do.
The VK Cloud Solutions team is developing its own Kubernetes aaS, which was discussed in this article. It would be great if you tested it and shared your feedback; all new users get 3,000 bonus rubles upon registration for testing.