In this article we will see how Cilium Cluster Mesh can be enhanced with CoreDNS to get a nice multi-cluster experience: one that lets you distribute and access workloads across multiple Kubernetes clusters, and that also works for StatefulSet resources and headless services.
## Introduction
If you have ever fiddled with multiple Kubernetes clusters, whatever the reason, you know there are lots of moving parts and the overall architecture can get complex very fast. There are many tools out there that can help you set up communication and load balancing across clusters, such as Istio, Linkerd, Cilium, Submariner and K8GB, just to name a few. However, most of them are complex both to install and to configure, while others are too simplistic and do not cover all use cases. Even if there is no one-size-fits-all solution, I believe that Cilium offers a nice trade-off between ease of use, features and complexity, and shines when combined with other open source tools.
One challenge you have probably faced if you run stateful workloads in Kubernetes, like databases, is handling complex topologies for HA, resiliency, DR and so on, while automating everything (GitOps) and keeping manual intervention to a minimum.
At work we tried Submariner and it was fine for a while: it had everything we needed, including support for overlapping Pod CIDR ranges and service discovery. But when we switched from OpenShift to Kubernetes, changing the CNI from OpenShift SDN to Cilium, we began running into trouble. I also want to point out that most of the issues were due to some of our legacy OpenShift Pod CIDR configurations and to the lack of guaranteed Submariner support for Cilium; Submariner itself is a great product with really nice networking features, and if you have all the prerequisites to deploy it you should probably give it a try! We then decided to go fully down the Cilium road, and we have been using Cilium Cluster Mesh for a while now.
Every multi-cluster solution handles service discovery and communication in a different way, but something that is indispensable if you want a smooth multi-cluster experience is the capability of calling services in specific clusters. Why is that? Let's introduce the example that will be used throughout the whole article:
Two Kubernetes clusters where a stateful workload is deployed, with one or more replicas (members of the same "stateful cluster") in each.
The stateful workload could be anything (a MongoDB cluster, for example) that needs to be replicated/stretched in multiple clusters. Let’s say that our workload needs to explicitly state in its configuration all the participating members, for example, in pseudocode:
```
members {
  member-0.cluster.A
  member-1.cluster.A
  member-2.cluster.A
  member-3.cluster.B
  member-4.cluster.B
  member-5.cluster.B
}
```
Now, we have to make a few assumptions on the underlying infrastructure/requirements for this setup:
- Pod CIDR ranges must be unique between clusters (as a simplified example, we consider Cluster A to have `10.255.1.0/24` and Cluster B `10.255.2.0/24`).
- Cilium must be installed with Cluster Mesh enabled and set up properly (you can find more in the official documentation; a rough sketch of the CLI steps follows this list).
- If you have any firewall, there are a number of ports to open to let Cilium work its magic: the overall idea is that Kubernetes nodes and pods in Cluster A and Cluster B must be able to communicate, either via encapsulation or via native routing.
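For reference, with the Cilium CLI the Cluster Mesh setup usually boils down to a handful of commands. The following is just a rough sketch (the cluster names, IDs and kubeconfig contexts are placeholders, and the exact flags may vary between Cilium versions); refer to the official documentation for the authoritative steps:

```bash
# Install Cilium with a unique (name, id) pair per cluster.
cilium install --context cluster-a --set cluster.name=cluster-a --set cluster.id=1
cilium install --context cluster-b --set cluster.name=cluster-b --set cluster.id=2

# Enable the Cluster Mesh control plane on both sides and connect them.
cilium clustermesh enable --context cluster-a --service-type NodePort
cilium clustermesh enable --context cluster-b --service-type NodePort
cilium clustermesh connect --context cluster-a --destination-context cluster-b

# Wait for the mesh to be ready.
cilium clustermesh status --context cluster-a --wait
```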
## Cilium with CoreDNS
So, Cilium Cluster Mesh is great: it is tightly integrated with the other features that make Cilium a great CNI (Hubble, network policies, etc.), and it is powerful and easy to configure. Yet, it lacks the capability (at the moment; maybe something like this will be added in the future!) to specify the cluster when accessing a service. There is no way to call, from Cluster A, the pod `member-4` of a stateful workload with a headless service `foo` that resides in Cluster B, using something like `member-4.foo.namespace.cluster.B`; you can only load balance the service `foo` between Cluster A and B (round robin, local first or remote first). This is where CoreDNS comes into play and helps us solve this problem!
With Cluster Mesh we already have routing and network policies for inter-cluster traffic, meaning that `member-1` can directly call `member-4`, like this:
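As a purely illustrative example, assuming `member-4` listens on port 8080 and currently happens to have the pod IP `10.255.2.34`, a call from `member-1` in Cluster A could look like this:

```bash
# Run from a shell inside member-1 (Cluster A), targeting member-4 (Cluster B)
# directly by its pod IP. Both the IP and the port are made up for this example.
curl http://10.255.2.34:8080/
```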
The curl command is just an example to understand the network flow.
So our main goal is to have a proper way of calling `member-4` without relying on its IP (pod IPs are volatile and may change any time the pod is destroyed or rescheduled) and without the hassle of configuring it manually. Something like external-dns could be used, or even `LoadBalancer` services with restrictive selectors (matching only `member-1`, only `member-2`, and so on), but that is complicated, requires manual intervention and needs many virtual IPs (or private/public IPs, in the case of cloud providers).
We already have a DNS service that runs inside Kubernetes and works exactly as we need: CoreDNS, an incredibly flexible DNS server written in Go that works by composition of plugins, which makes it really easy to configure and to adapt to a wide range of use cases. We want to exploit the internal CoreDNS of the two clusters to dynamically resolve the IPs of the pods (there are different queries that can be resolved inside Kubernetes, check the documentation for all the possibilities!) and to let Cilium handle all the rest. We can picture it like this:
The idea here is to expose the CoreDNS service (there are plenty of options, from `LoadBalancer` and `NodePort` to more sophisticated ones, and you can protect/restrict access to it however you like) so that Cluster A will forward requests for a "fake" domain like `cluster.B` (you can choose whatever fits you) to the CoreDNS of Cluster B, where it will be resolved to the pod address, which, again, is routable and perfectly valid for Cilium Cluster Mesh.
## Implementation
There are a few things to configure, but overall this is a really simple customization of the Corefile (the file that configures the CoreDNS server) and the exposure of the CoreDNS service. For this example, CoreDNS will simply be exposed as a `NodePort` service.

Let's start by exposing the `NodePort` service for CoreDNS; this can be done with a simple manifest:
```yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    k8s-app: core-dns-nodeport
    kubernetes.io/name: CoreDNS
  name: core-dns-nodeport
  namespace: kube-system
spec:
  ports:
    - name: dns
      port: 53
      protocol: UDP
      targetPort: 53
      nodePort: 30053
    - name: dns-tcp
      port: 53
      protocol: TCP
      targetPort: 53
      nodePort: 30053
  selector:
    k8s-app: kube-dns
  sessionAffinity: None
  type: NodePort
```
Again, this is just an example; you could expose the CoreDNS service in many other ways, and you should always think about how to secure/limit the access when doing so.
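Before wiring things together, it is worth checking that the exposed service actually answers DNS queries from outside the cluster. A quick `dig` against one of the nodes of the cluster where CoreDNS was just exposed could look like this (the node IP, service name and namespace are placeholders for this example):

```bash
# Query the exposed CoreDNS through the NodePort from a host that can reach the nodes.
dig @192.168.2.1 -p 30053 foo.namespace.svc.cluster.local +short
```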
Then, we can edit the CoreDNS configuration of Cluster A (which by default lives in a ConfigMap named `coredns` in the `kube-system` namespace) and add the following snippet:
```
...
# This means that every query for the "fake" zone cluster.B will be
# handled by this configuration.
cluster.B:53 {
    # This will substitute the cluster.B string in the query with
    # svc.cluster.local, so that it can be handled correctly by the remote CoreDNS.
    rewrite name substring cluster.B svc.cluster.local
    # We exposed the CoreDNS service as a NodePort (30053).
    # These are example static node IPs, but you could also use
    # a DNS record, a load balancer, etc...
    forward . 192.168.2.1:30053 192.168.2.2:30053 {
        expire 10s
        policy round_robin
    }
    cache 10
}
...
:53 {
    ...
    # We also add the rewrite for cluster.A (self).
    # Add this line before the default "kubernetes" block.
    rewrite name substring cluster.A svc.cluster.local
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        ...
    }
    ...
}
```
The same configuration, but with the forwarding pointing in the opposite direction, is applied to Cluster B:
```
...
cluster.A:53 {
    rewrite name substring cluster.A svc.cluster.local
    # Note that these IPs are different: they are the node IPs of Cluster A!
    forward . 192.168.1.1:30053 192.168.1.2:30053 {
        expire 10s
        policy round_robin
    }
    cache 10
}
...
:53 {
    ...
    rewrite name substring cluster.B svc.cluster.local
    ...
}
```
This is what the configuration does:
- We defined the zone `cluster.B` that will answer queries on port 53 (UDP by default, but TCP can also be used with the `force_tcp` option).
- The zone will rewrite the DNS query before forwarding it to the other cluster's CoreDNS: it simply substitutes the substring `cluster.B` with `svc.cluster.local`, so that the query that reaches Cluster B is still a valid query.
- We forward the query to the NodePort CoreDNS service, which will get us the IP of the requested pod.
- We also have to define a rule for rewriting `cluster.A` into `svc.cluster.local`: remember we specified in our ideal StatefulSet the members `member-1.cluster.A`, `member-2.cluster.A`, etc., so pods in Cluster A still need to be able to contact pods in the same cluster via this name, not only pods in Cluster B.
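Putting it all together, a lookup from Cluster A for a pod living in Cluster B goes through something like the following (the service name `foo`, the namespace `default` and the returned IP are made up for illustration):

```
# A pod in Cluster A asks its local CoreDNS for:
member-4.foo.default.cluster.B

# Cluster A's CoreDNS matches the cluster.B zone, rewrites the name and
# forwards it to Cluster B's CoreDNS (NodePort 30053) as:
member-4.foo.default.svc.cluster.local

# Cluster B's CoreDNS answers with the pod IP (e.g. 10.255.2.34),
# which is directly routable thanks to Cilium Cluster Mesh.
```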
At this point, everything is set up, and we can restart the CoreDNS pods in the `kube-system` namespace. If the ConfigMap is correct, the pods will be ready in a matter of seconds. You can test that everything is working by simply doing a `curl` or an `nslookup` from a pod inside Cluster A to a headless service in Cluster B: you should see that the IP of the pod in Cluster B is correctly resolved and routable!
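As a rough sketch of that test (the pod, service and namespace names are placeholders, and the CoreDNS Deployment is assumed to be called `coredns`, as in most standard installations):

```bash
# Restart CoreDNS in Cluster A so it picks up the new Corefile.
kubectl -n kube-system rollout restart deployment coredns

# From a pod in Cluster A, resolve and call a member that lives in Cluster B.
kubectl exec -it member-1 -- nslookup member-4.foo.default.cluster.B
kubectl exec -it member-1 -- curl http://member-4.foo.default.cluster.B:8080/
```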
## Conclusion
We saw how Cilium Cluster Mesh can be enhanced by simply combining it with CoreDNS, so that you are able to call specific services in specific clusters. Not only that: since the solution is essentially based only on Cilium plus a small CoreDNS tweak, you can still leverage all the features that Cilium offers for multi-cluster traffic, such as Hubble and network policies, which remain fully valid.