Solving the Kubernetes Node Readiness Problem with Vigil
I ran into a problem that I suspect most teams running Kubernetes at scale have seen but few have a clean answer for: when a new node joins the cluster, workloads start scheduling immediately — before the node is actually ready to serve them. DaemonSets haven’t come up yet, critical services are still initializing, and your application pods land on a half-baked node. This has bitten us enough times that we finally built a solution: Vigil.
The Problem
Kubernetes has a concept of node readiness — the kubelet reports Ready once it can run pods. But “can run pods” and “should run pods” are very different things.
In any production cluster, your nodes run DaemonSets: monitoring agents, CNI plugins, CSI drivers, log shippers, security agents. These are the infrastructure layer that your application pods depend on. The problem is that Kubernetes doesn’t wait for any of them before scheduling workloads.
Here’s what the failure looks like in practice. A new node comes up and your application pod gets scheduled onto it:
$ kubectl get events --field-selector reason=Unhealthy -n app-team
LAST SEEN   TYPE      REASON      OBJECT                 MESSAGE
12s         Warning   Unhealthy   pod/api-server-7f8b2   Readiness probe failed: dial tcp 10.0.47.12:8125: connect: connection refused
12s         Warning   Unhealthy   pod/api-server-7f8b2   Readiness probe failed: dial tcp 10.0.47.12:8125: connect: connection refused
18s         Warning   Unhealthy   pod/api-server-7f8b2   Readiness probe failed: dial tcp 10.0.47.12:8125: connect: connection refused
The app starts up and immediately tries to push metrics to the local Datadog agent — which hasn’t started yet. Depending on how the app handles that failure, you get anything from lost metrics to outright crashes.
Or worse — a critical node-level service has a bug in its latest version and never reaches Ready on new nodes. Your existing nodes are fine because they’re running the old version, but every newly launched application pod landing on fresh nodes fails immediately:
$ kubectl get pods -n kube-system -l app=critical-service --field-selector spec.nodeName=ip-10-0-47-12
NAME                     READY   STATUS             RESTARTS   AGE
critical-service-x9k2f   0/1     CrashLoopBackOff   4          3m12s
$ kubectl get pods -n app-team --field-selector spec.nodeName=ip-10-0-47-12
NAME                     READY   STATUS    RESTARTS   AGE
api-server-7f8b2         0/1     Running   0          2m45s
worker-processor-m3k9d   0/1     Running   0          2m38s
Your applications are running on the node, but they’re broken because the infrastructure they depend on never came up. The scheduler has no idea — it sees available CPU and memory and keeps packing pods onto a node that can’t actually serve them.
There’s also the resource accounting problem. The scheduler doesn’t factor in DaemonSet resource consumption when placing workloads. A node with 4 CPU cores looks like it has 4 cores available, even though DaemonSets will eventually claim 1.5 of them. Your application pod and the DaemonSets both get scheduled onto the new node at roughly the same time — but the higher-priority DaemonSets start first and claim their resources. Your application pod gets squeezed out before it ever runs:
$ kubectl get events --field-selector involvedObject.name=api-server-7f8b2 -n app-team
LAST SEEN   TYPE      REASON     OBJECT                 MESSAGE
5s          Warning   OutOfcpu   pod/api-server-7f8b2   Node ip-10-0-47-12 is out of cpu resources
$ kubectl get pods -n app-team --field-selector spec.nodeName=ip-10-0-47-12
NAME                     READY   STATUS     RESTARTS   AGE
api-server-7f8b2         0/1     OutOfcpu   0          2m45s
worker-processor-m3k9d   0/1     OutOfcpu   0          2m38s
The scheduler thought the node had plenty of room, but by the time the application pod tries to start, the DaemonSets have already claimed the resources it was counting on.
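The accounting gap above can be shown with a few lines of arithmetic. This is an illustrative sketch with made-up request values (matching the 1.5-core example earlier), not output from a real node: the scheduler checks a pod’s request against allocatable minus the requests of pods already bound to the node, and DaemonSet pods that haven’t landed yet simply don’t count.

```go
package main

import "fmt"

// freeAfterDaemonSets returns the millicores actually left for application
// pods once the given DaemonSet CPU requests are claimed on the node.
func freeAfterDaemonSets(allocatable int, daemonSetRequests []int) int {
	free := allocatable
	for _, r := range daemonSetRequests {
		free -= r
	}
	return free
}

func main() {
	allocatable := 4000                          // 4 cores, in millicores
	dsRequests := []int{500, 400, 300, 200, 100} // the 1.5 cores from above
	appRequest := 3000                           // fits against 4000m on paper

	free := freeAfterDaemonSets(allocatable, dsRequests)
	fmt.Printf("scheduler sees %dm free; after DaemonSets: %dm\n", allocatable, free)
	fmt.Printf("app pod (requesting %dm) fits: %v\n", appRequest, appRequest <= free)
	// scheduler sees 4000m free; after DaemonSets: 2500m
	// app pod (requesting 3000m) fits: false
}
```

The pod that “fit” at scheduling time is exactly the one the kubelet later rejects with OutOfcpu.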
What We Tried First
Before building anything, we spent a lot of time trying to make the problem smaller.
Optimizing DaemonSet startup time. If DaemonSets come up fast enough, the race condition window shrinks. We set up per-region ECR image caching so pulls were nearly instant. We tuned kubelet settings aggressively — adjusting API rate limits, registration timing, and pod startup parallelism. We dug into Kubernetes API server throttling configs to make sure the control plane wasn’t the bottleneck.
This helped. It shrank the window from “a couple of minutes” to “tens of seconds.” But tens of seconds is still enough for a fast-starting application pod to land on a node and fail. And it did nothing for the scenario where a DaemonSet has a bug and never comes up.
Searching for existing solutions. We looked hard. We searched for open source projects, Kubernetes-native features, KEPs in progress — anything that addressed “don’t schedule workloads until DaemonSets are ready.” We couldn’t find anything that solved it cleanly.
Kubernetes has PriorityClasses (critical DaemonSets typically already run at high priority, e.g. system-node-critical), PodDisruptionBudgets (wrong problem), and various admission webhooks people have cobbled together (fragile and high-maintenance). None of them actually answer the question: is this node ready for workloads?
The Idea
The building block was already there: startup taints.
Kubernetes supports taints and tolerations — you can taint a node so that only pods with a matching toleration will schedule on it. Node provisioners like Karpenter, Cluster Autoscaler, and others support applying taints to nodes at launch time. The idea is straightforward:
- Apply a startup taint to every new node (e.g., node.example.com/initializing:NoSchedule)
- Something watches the node and waits for DaemonSets to be ready
- Remove the taint once the node is actually ready for workloads
Steps 1 and 3 are easy. Step 2 is where it gets interesting.
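Step 1, for instance, is a one-time provisioner config change. A sketch assuming Karpenter’s NodePool API, using the example taint key from above (other provisioners have equivalent launch-time taint mechanisms):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      # Karpenter applies startupTaints to nodes at launch and expects
      # something else (here, Vigil) to remove them once the node is ready.
      startupTaints:
        - key: node.example.com/initializing
          effect: NoSchedule
```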
You could maintain a manual list of “these DaemonSets must be running before the node is ready” — but that’s fragile. Teams add and remove DaemonSets constantly. Someone forgets to update the list and you’re back to square one, or worse, nodes are stuck forever because the list references a DaemonSet that no longer exists.
We wanted something that could figure it out automatically.
Vigil
Vigil is a Kubernetes controller that solves this. It watches for nodes with the startup taint, auto-discovers which DaemonSets should be running on each node, waits for them all to reach Ready, and then removes the taint.
The key insight is the auto-discovery. Vigil uses the same scheduling predicates that the Kubernetes scheduler uses — node selectors, affinities, tolerations — to determine which DaemonSets belong on a given node. No manual allowlist. If you add a new DaemonSet with a node selector that matches, Vigil automatically includes it in its readiness checks.
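The discovery logic boils down to two checks per DaemonSet: does its pod template’s node selector match the node’s labels, and do its tolerations cover the node’s taints. Here is a simplified, self-contained sketch of that predicate — the types and function names are hypothetical stand-ins (a real controller would use the upstream scheduling helpers against the actual API objects), but the matching rules follow how Kubernetes evaluates selectors and tolerations:

```go
package main

import "fmt"

// Taint mirrors the fields that matter for scheduling.
type Taint struct{ Key, Value, Effect string }

// Toleration follows Kubernetes semantics: an empty Key with operator
// "Exists" tolerates every taint; an empty Effect matches all effects.
type Toleration struct {
	Key      string
	Operator string // "Exists" or "Equal"
	Value    string
	Effect   string
}

// DaemonSetSpec holds the slice of a DaemonSet's pod template we need.
type DaemonSetSpec struct {
	Name         string
	NodeSelector map[string]string
	Tolerations  []Toleration
}

// selectorMatches reports whether every selector key/value is present
// in the node's labels (an empty selector matches every node).
func selectorMatches(selector, labels map[string]string) bool {
	for k, v := range selector {
		if labels[k] != v {
			return false
		}
	}
	return true
}

// tolerates reports whether a single toleration covers a single taint.
func tolerates(t Toleration, taint Taint) bool {
	if t.Effect != "" && t.Effect != taint.Effect {
		return false
	}
	switch t.Operator {
	case "Exists":
		return t.Key == "" || t.Key == taint.Key
	default: // "Equal"
		return t.Key == taint.Key && t.Value == taint.Value
	}
}

// toleratesAll requires every taint to be covered by some toleration.
func toleratesAll(tols []Toleration, taints []Taint) bool {
	for _, taint := range taints {
		covered := false
		for _, t := range tols {
			if tolerates(t, taint) {
				covered = true
				break
			}
		}
		if !covered {
			return false
		}
	}
	return true
}

// BelongsOnNode reports whether ds should count toward the node's
// readiness checks.
func BelongsOnNode(ds DaemonSetSpec, nodeLabels map[string]string, nodeTaints []Taint) bool {
	return selectorMatches(ds.NodeSelector, nodeLabels) && toleratesAll(ds.Tolerations, nodeTaints)
}

func main() {
	labels := map[string]string{"kubernetes.io/os": "linux"}
	taints := []Taint{{Key: "node.example.com/initializing", Effect: "NoSchedule"}}

	agent := DaemonSetSpec{
		Name:         "datadog-agent",
		NodeSelector: map[string]string{"kubernetes.io/os": "linux"},
		Tolerations:  []Toleration{{Operator: "Exists"}}, // blanket toleration
	}
	fmt.Println(BelongsOnNode(agent, labels, taints)) // true
}
```

Because the check is derived from the DaemonSet specs themselves, adding or removing a DaemonSet changes the readiness set with no list to maintain.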
Here’s what the flow looks like:
New node launches
→ Node has taint: node.example.com/initializing:NoSchedule
→ Vigil detects the tainted node
→ Vigil discovers 8 DaemonSets should run on this node
→ Vigil watches: 5/8 Ready... 7/8 Ready... 8/8 Ready
→ Vigil removes the taint
→ Workload scheduling begins
DaemonSet pods themselves tolerate the startup taint: most infrastructure DaemonSets already carry a blanket operator: Exists toleration, and Kubernetes automatically adds tolerations for the built-in node-condition taints. So they schedule and start normally, while regular workload pods, which lack the toleration, wait in Pending until Vigil clears the node.
There’s a safety valve too — a configurable timeout (default 120 seconds). If DaemonSets don’t come up in time, Vigil removes the taint anyway so the node isn’t stuck forever. This is a tradeoff: you’d rather have the node available with a degraded DaemonSet than have it completely unusable. The timeout gives your alerting time to fire while keeping the cluster functional.
One Critical Requirement
Vigil itself has to be running somewhere that doesn’t depend on the taints it manages. Think about it: if Vigil runs on nodes with the startup taint, and Vigil is what removes that taint, you have a deadlock.
There are two clean options:
- A dedicated node pool without startup taints. A small set of nodes (even just one or two) reserved for cluster infrastructure like Vigil, where the taint isn’t applied.
- Fargate (or equivalent serverless compute). If you’re on EKS, running Vigil on Fargate profiles means it doesn’t need a node at all — it runs in its own isolated compute environment.
Either way, Vigil must be able to come up independently of the nodes it’s managing.
Final Thoughts
The underlying problem here is that Kubernetes treats node readiness as a binary — the kubelet says “I’m ready” and the scheduler starts packing pods. In reality, readiness is a spectrum. A node isn’t truly ready until its infrastructure layer is running.
Vigil gives you a clean way to express that. It’s a small controller with a single purpose: make sure new nodes are actually ready before they receive workloads. No manual lists to maintain, no fragile webhooks, no crossing your fingers that DaemonSets will win the race.
If this is a problem you’ve hit, check out Vigil on GitHub. It’s open source and ready to use.