Multi-tenancy in Kubernetes Clusters
Introduction to multi-tenancy
With the expansion of Public Clouds and DevOps, technologies like Kubernetes have become a day-to-day reality for companies of all sizes. A popular choice for a business is to build a single Kubernetes Platform which can be shared by all departments. From a business perspective, it provides the best utilisation of the company’s people and hosting budget. From the technical side, it enables all parts of the business to use the same stack, tooling, scalability, and all other benefits of a modern platform.
The topic of multi-tenancy arises when several Products and Services have grown enough to have their own Product Teams, Security requirements, SLAs, Cost Codes, etc. Without the concept of multi-tenancy, it can be very hard to satisfy the requirements of more than one Product in all of these areas.
The concept of Multi-tenancy consists of several basic definitions:
Tenant. The main Actor in a multi-tenanted Platform. Normally a Tenant represents a Team or Department; it is common for Tenants to follow the Organisational structure.
Workload. A Tenant should not be mistaken for a Workload. A Tenant represents a group of People, whilst a Workload is an application or service those people develop and run. A single Tenant can have multiple Workloads, but any Workload should belong to a single Tenant. There can be exceptions to this rule, but in my experience it is best to avoid them – “if everyone is responsible then no-one is responsible”.
RBAC. Role Based Access Control provides a way to manage the privileges of Tenants in relation to their Workloads. Kubernetes supports RBAC natively, so having multiple Users/Groups with multiple Roles is standard in any Kubernetes Cluster.
Tenant Identity and RBAC
The definition of a Tenant in a Kubernetes Cluster goes hand in hand with Identity and Role Based Access Control, which is a native and core security feature of Kubernetes. Kubernetes Authentication can be implemented via an external OpenID Provider, and Authorisation via native Kubernetes RBAC. Many Organisations tend to use their existing Active Directory to control both via some form of Federation.
It is important to understand that a Tenant is not a User or a Group. A Tenant is a high-level business concept which might include any number of users or “service accounts” combined into groups. Each Group will have its own unique set of permissions restricted to only the Tenant’s own resources.
If a single-tenanted Cluster has a Group called “developers”, a multi-tenanted Cluster will have multiple such groups, one for every Tenant.
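As a sketch of this mapping, assuming a hypothetical tenant named `tenant-a` with a directory group `tenant-a-developers`, the group-to-permissions link can be expressed as a RoleBinding scoped to the Tenant’s Namespace (all names here are illustrative, not prescriptive):

```yaml
# Binds the hypothetical group "tenant-a-developers" to the built-in
# "edit" ClusterRole, but only within the tenant-a namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: tenant-a-developers
  namespace: tenant-a
subjects:
  - kind: Group
    name: tenant-a-developers        # group name as asserted by the OIDC provider
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: edit                         # built-in role: read/write on most namespaced resources
  apiGroup: rbac.authorization.k8s.io
```

Because the ClusterRole is referenced through a RoleBinding rather than a ClusterRoleBinding, its permissions apply only inside `tenant-a`; each additional Tenant gets its own Namespace and an equivalent binding.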
Since Kubernetes does not offer User Management out of the box, Users are managed in an external UserStore like Active Directory, LDAP, etc. An OpenID-enabled Kubernetes Cluster is a prerequisite for multi-tenancy. All hosted Kubernetes offerings come with integration to their respective OIDC Providers (Azure, AWS, GCP, etc.).
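For self-managed clusters, OIDC integration is configured on the API server. A minimal sketch using a kubeadm ClusterConfiguration fragment (the issuer URL and client ID below are placeholders for your own provider):

```yaml
# kubeadm ClusterConfiguration fragment - enables OIDC authentication
# on the kube-apiserver.
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
apiServer:
  extraArgs:
    oidc-issuer-url: "https://login.example.com"  # placeholder identity provider
    oidc-client-id: "kubernetes"                  # placeholder client registration
    oidc-username-claim: "email"
    oidc-groups-claim: "groups"                   # this claim drives RBAC Group subjects
```

The `oidc-groups-claim` setting is what connects the external UserStore’s groups to the per-tenant Groups used in RBAC bindings.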
There are several ways Tenants will interact with the cluster:
- Users accessing resources via the API (kubectl, an SDK, etc.)
- Users accessing Kubernetes via some Application (CI/CD, Monitoring Tools, etc.)
The first use-case is straightforward: the user authenticates and interacts with the Cluster using a Role defined for them.
In the second use-case, if the Application authenticates via the same UserStore as the Cluster (normally via OpenID), individuals can get the same roles as they would have when directly interacting with the Kubernetes API. The Application will present them with functionality limited to their Role defined in Kubernetes. Normally this is done via impersonation headers. Kubernetes Dashboard, Azure Kubernetes Dashboards and Kiali from Istio are examples of such apps. A single deployment of such an Application can be used by all Tenants.
In other cases, the Application may not support OpenID, user impersonation or the passing of an OpenID token to the Kubernetes API. In such cases, an Application’s permissions can be controlled via a Service Account. Depending on the specific use-case, it might be necessary to have one deployment per tenant for such an application if it runs inside the Cluster, or to control users within the Application itself. Various tools which are not yet fully Kubernetes-aware often fall into this category (for example, Jenkins CI/CD).
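A sketch of the Service Account approach for such a tool might look like the following, assuming one deployment per tenant (the tool name and namespace are hypothetical):

```yaml
# Hypothetical per-tenant ServiceAccount for a non-OIDC-aware CI/CD tool.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: jenkins-deployer
  namespace: tenant-a
---
# Grants the ServiceAccount permissions in the tenant-a namespace only,
# so the tool can deploy this tenant's workloads and nothing else.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: jenkins-deployer
  namespace: tenant-a
subjects:
  - kind: ServiceAccount
    name: jenkins-deployer
    namespace: tenant-a
roleRef:
  kind: ClusterRole
  name: edit          # built-in role; a tighter custom Role is preferable in practice
  apiGroup: rbac.authorization.k8s.io
```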
In a multi-tenanted cluster, individual Tenants should see themselves as the sole occupier of the cluster. This means minimising and mitigating as much as possible all the downsides of sharing the cluster.
The main areas requiring compromise are Security and Performance, which need to be balanced, directly or indirectly, against Cost.
Security is a true battleground for Architects creating multi-tenanted platforms. There are no limits on how far you can go: from no security at all up to switching off machines and taking networks offline 🙂
Main areas to focus on:
- Compute Security
- Network Security
Securing compute resources means securing both Masters and Workers.
- Use dedicated Workers, so Containers belonging to different Tenants never run on the same hosts.
- OS Hardening must be applied to Workers and Masters
- Docker is the most popular container engine for Kubernetes and must be secured appropriately to comply with security best practices (no containers running as root, no highly privileged containers, etc.)
- Pod Security Policies (being deprecated) or Pod Security Admission (going forward) allow you to limit the privileges of PODs
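With Pod Security Admission, the restriction is applied per Namespace via labels. A minimal sketch for a hypothetical tenant namespace:

```yaml
# Enforces the "restricted" Pod Security Standard for every POD in the
# namespace: no root containers, no privilege escalation, no host
# namespaces, etc.
apiVersion: v1
kind: Namespace
metadata:
  name: tenant-a
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/warn: restricted
```

Applying the labels at the Namespace level means the policy automatically covers every Workload the Tenant deploys there.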
- When using dedicated workers for Tenants, it is very easy to go an extra step and place them in different networks. All conventional networking security applies here quite well: NACLs, Routing Tables, Security Groups (i.e. Firewalls), etc.
- If workers are shared or sit in the same subnets, Network Policies can be applied on top of the Virtual Network implemented with one of the Kubernetes Network Plugins.
- A ServiceMesh is another way to control communications inside the Cluster and limit each Tenant to their own resources. In addition, it addresses many Application Security aspects which are not specific to multi-tenancy (encryption in transit, token validation, peer-to-peer authentication, mutual TLS, etc.)
- Limiting Egress traffic for Tenants is a very powerful security control. It can be achieved using a combination of the techniques listed above.
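The Network Policy approach above can be sketched as a default-deny posture per tenant namespace, which blocks both cross-tenant ingress and uncontrolled egress (this assumes a CNI plugin that implements NetworkPolicy, such as Calico or Cilium; the namespace name is hypothetical):

```yaml
# Default-deny for the tenant-a namespace: all ingress and egress is
# blocked unless explicitly allowed by further policies.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: tenant-a
spec:
  podSelector: {}        # empty selector matches every POD in the namespace
  policyTypes:
    - Ingress
    - Egress
---
# Re-allow traffic between PODs of the same tenant, plus DNS egress.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace
  namespace: tenant-a
spec:
  podSelector: {}
  ingress:
    - from:
        - podSelector: {}   # any POD within tenant-a
  egress:
    - to:
        - podSelector: {}   # any POD within tenant-a
    - ports:
        - protocol: UDP
          port: 53          # allow DNS lookups on UDP 53
  policyTypes:
    - Ingress
    - Egress
```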
As a security tool, Kubernetes RBAC is used to configure the privileges of Tenants and their Workloads within the Cluster.
- Principle of least privilege should be applied
- Use Namespaces to provide good perimeter controls for each Tenant
- Each Tenant should be able to control only their own resources
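As a sketch of least privilege, instead of the broad built-in roles, a Tenant can be given a custom Role that lists only the resources and verbs their Workloads actually need (the resource list below is illustrative):

```yaml
# Namespace-scoped Role granting only the permissions this hypothetical
# tenant needs, instead of a broad built-in role.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: tenant-workload-manager
  namespace: tenant-a
rules:
  - apiGroups: ["apps"]
    resources: ["deployments", "statefulsets"]
    verbs: ["get", "list", "watch", "create", "update", "patch"]
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch"]   # read-only access to PODs and logs
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get"]                    # no "list": the tenant cannot enumerate secrets
```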
There are a few ways, specific to multi-tenanted clusters, of reducing the impact of a tenant’s workloads on itself and on other tenants:
- Pools of dedicated workers. Allocating separate worker pools to tenants allows them to scale independently and reduces their impact on each other. Sharing Kubernetes masters and using dedicated workers is the easiest way to separate workloads in Clusters with a few large tenants. If the number of tenants is high but their size is small, the cost of managing separate workers might outweigh the benefits of sharing a cluster.
- Apply Limits. All resources should have well-defined limits: Namespaces, Deployments, PODs, etc. This rule can be enforced for all tenants via Open Policy Agent (OPA).
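These limits can also be enforced natively with a ResourceQuota and LimitRange per tenant Namespace (the numbers below are illustrative, not recommendations):

```yaml
# Caps the total resources the tenant-a namespace can consume.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-a-quota
  namespace: tenant-a
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 64Gi
    limits.cpu: "40"
    limits.memory: 128Gi
    pods: "200"
---
# Default requests/limits applied to any container that does not
# declare its own, so no workload escapes the quota accounting.
apiVersion: v1
kind: LimitRange
metadata:
  name: tenant-a-defaults
  namespace: tenant-a
spec:
  limits:
    - type: Container
      default:
        cpu: 500m
        memory: 512Mi
      defaultRequest:
        cpu: 100m
        memory: 256Mi
```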
Billing and Cost
This aspect of multi-tenancy is often overlooked by organisations embarking on this journey. Hosting bills are often handled by a Finance Department and might never affect Tenants if they manage to keep costs low enough.
However, cost per tenant is the best metric for evaluating not only the effectiveness of the Platform overall, but also each individual Tenant. Whether or not the business has a cross-charging mechanism for Tenants, every multi-tenanted Platform should provide this metric.
Resources “outside” the Kubernetes Cluster can be measured and priced using the Provider’s tools if you are using IaaS/PaaS services.
The price of the resources within the Cluster should be split proportionally between tenants based on their usage. The level of precision can be tuned depending on the business requirements.
The common trap people fall into is going into too much detail and complexity when calculating usage:
- calculate based on memory?
- or based on CPU cycles?
and so on. The problem then looks so big that they end up doing nothing at all.
Even a simple and approximate mechanism is better than nothing.
From my experience, the ratio of Memory used by a Tenant’s PODs to the Total Available Memory of the Workers over a period of time gives a good enough measure of the Tenant’s usage. All “shared” costs of the Cluster, like Masters and System Workloads, can be split using the same ratio.
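As a sketch, assuming cAdvisor metrics are scraped by Prometheus, this ratio can be captured in a recording rule (metric and label names can differ between setups, so treat these as assumptions to verify against your own monitoring stack):

```yaml
# Prometheus recording rule: per-namespace share of total cluster memory.
# Assumes one namespace per tenant workload.
groups:
  - name: tenant-cost
    rules:
      - record: namespace:memory_usage:ratio
        expr: |
          sum by (namespace) (container_memory_working_set_bytes{container!=""})
            /
          scalar(sum(machine_memory_bytes))
```

Averaging this ratio over the billing period (for example with `avg_over_time` in a Grafana panel) gives each Tenant’s share, which can then be applied to the total cluster bill, including the shared Master and System Workload costs.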
Prometheus monitoring with a Grafana Dashboard on top makes for a simple and flexible solution.
The topic of multi-tenancy is very deep. In this post we have only scratched the surface and looked into some basic scenarios. We will take a deep dive into individual aspects in future posts.