Solving the Kubernetes Node Readiness Problem with Vigil
I ran into a problem that I suspect most teams running Kubernetes at scale have seen but few have a clean answer for: when a new node joins the cluster, workloads start scheduling immediately — before the node is actually ready to serve them. DaemonSets haven’t come up yet, critical services are still initializing, and your application pods land on a half-baked node. This has bitten us enough times that we finally built a solution: Vigil.
The Problem
Kubernetes has a concept of node readiness — the kubelet reports Ready once it can run pods. But “can run pods” and “should run pods” are very different things.
In any production cluster, your nodes run DaemonSets: monitoring agents, CNI plugins, CSI drivers, log shippers, security agents. These are the infrastructure layer that your application pods depend on. The problem is that Kubernetes doesn’t wait for any of them before scheduling workloads.
Here’s what the failure looks like in practice. A new node comes up and your application pod gets scheduled onto it:
$ kubectl get events --field-selector reason=Unhealthy -n app-team
LAST SEEN   TYPE      REASON      OBJECT                 MESSAGE
12s         Warning   Unhealthy   pod/api-server-7f8b2   Readiness probe failed: dial tcp 10.0.47.12:8125: connect: connection refused
12s         Warning   Unhealthy   pod/api-server-7f8b2   Readiness probe failed: dial tcp 10.0.47.12:8125: connect: connection refused
18s         Warning   Unhealthy   pod/api-server-7f8b2   Readiness probe failed: dial tcp 10.0.47.12:8125: connect: connection refused
The app starts up and immediately tries to push metrics to the local Datadog agent — which hasn’t started yet. Depending on how the app handles that failure, you get anything from lost metrics to outright crashes.
Or worse — a critical node-level service has a bug in its latest version and never reaches Ready on new nodes. Your existing nodes are fine because they’re running the old version, but every newly launched application pod landing on fresh nodes fails immediately:
$ kubectl get pods -n kube-system -l app=critical-service --field-selector spec.nodeName=ip-10-0-47-12
NAME                     READY   STATUS             RESTARTS   AGE
critical-service-x9k2f   0/1     CrashLoopBackOff   4          3m12s
$ kubectl get pods -n app-team --field-selector spec.nodeName=ip-10-0-47-12
NAME                     READY   STATUS    RESTARTS   AGE
api-server-7f8b2         0/1     Running   0          2m45s
worker-processor-m3k9d   0/1     Running   0          2m38s
Your applications are running on the node, but they’re broken because the infrastructure they depend on never came up. The scheduler has no idea — it sees available CPU and memory and keeps packing pods onto a node that can’t actually serve them.
There’s also the resource accounting problem. The scheduler doesn’t factor in DaemonSet resource consumption when placing workloads. A node with 4 CPU cores looks like it has 4 cores available, even though DaemonSets will eventually claim 1.5 of them. Your application pod and the DaemonSets both get scheduled onto the new node at roughly the same time — but the higher-priority DaemonSets start first and claim their resources. Your application pod gets squeezed out before it ever runs:
$ kubectl get events --field-selector involvedObject.name=api-server-7f8b2 -n app-team
LAST SEEN   TYPE      REASON     OBJECT                 MESSAGE
5s          Warning   OutOfcpu   pod/api-server-7f8b2   Node ip-10-0-47-12 is out of cpu resources
$ kubectl get pods -n app-team --field-selector spec.nodeName=ip-10-0-47-12
NAME                     READY   STATUS     RESTARTS   AGE
api-server-7f8b2         0/1     OutOfcpu   0          2m45s
worker-processor-m3k9d   0/1     OutOfcpu   0          2m38s
The scheduler thought the node had plenty of room, but by the time the application pod tries to start, the DaemonSets have already claimed the resources it was counting on.
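The accounting gap above can be shown with a few lines of arithmetic. This is an illustrative sketch with made-up request values (matching the 1.5-core example earlier), not output from a real node: the scheduler checks a pod’s request against allocatable minus the requests of pods already bound to the node, and DaemonSet pods that haven’t landed yet simply don’t count.

```go
package main

import "fmt"

// freeAfterDaemonSets returns the millicores actually left for application
// pods once the given DaemonSet CPU requests are claimed on the node.
func freeAfterDaemonSets(allocatable int, daemonSetRequests []int) int {
	free := allocatable
	for _, r := range daemonSetRequests {
		free -= r
	}
	return free
}

func main() {
	allocatable := 4000                          // 4 cores, in millicores
	dsRequests := []int{500, 400, 300, 200, 100} // the 1.5 cores from above
	appRequest := 3000                           // fits against 4000m on paper

	free := freeAfterDaemonSets(allocatable, dsRequests)
	fmt.Printf("scheduler sees %dm free; after DaemonSets: %dm\n", allocatable, free)
	fmt.Printf("app pod (requesting %dm) fits: %v\n", appRequest, appRequest <= free)
	// scheduler sees 4000m free; after DaemonSets: 2500m
	// app pod (requesting 3000m) fits: false
}
```

The pod that “fit” at scheduling time is exactly the one the kubelet later rejects with OutOfcpu.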
What We Tried First
Before building anything, we spent a lot of time trying to make the problem smaller.
Optimizing DaemonSet startup time. If DaemonSets come up fast enough, the race condition window shrinks. We set up per-region ECR image caching so pulls were nearly instant. We tuned kubelet settings aggressively — adjusting API rate limits, registration timing, and pod startup parallelism. We dug into Kubernetes API server throttling configs to make sure the control plane wasn’t the bottleneck.
This helped. It shrank the window from “a couple of minutes” to “tens of seconds.” But tens of seconds is still enough for a fast-starting application pod to land on a node and fail. And it did nothing for the scenario where a DaemonSet has a bug and never comes up.
Searching for existing solutions. We looked hard. We searched for open source projects, Kubernetes-native features, KEPs in progress — anything that addressed “don’t schedule workloads until DaemonSets are ready.” We couldn’t find anything that solved it cleanly.
Kubernetes has PriorityClasses (critical DaemonSets typically already run at high priority, e.g. system-node-critical), PodDisruptionBudgets (wrong problem), and various admission webhooks people have cobbled together (fragile and high-maintenance). None of them actually answer the question: is this node ready for workloads?
The Idea
The building block was already there: startup taints.
Kubernetes supports taints and tolerations — you can taint a node so that only pods with a matching toleration will schedule on it. Node provisioners like Karpenter, Cluster Autoscaler, and others support applying taints to nodes at launch time. The idea is straightforward:
- Apply a startup taint to every new node (e.g., node.example.com/initializing:NoSchedule)
- Something watches the node and waits for DaemonSets to be ready
- Remove the taint once the node is actually ready for workloads
Steps 1 and 3 are easy. Step 2 is where it gets interesting.
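Step 1, for instance, is a one-time provisioner config change. A sketch assuming Karpenter’s NodePool API, using the example taint key from above (other provisioners have equivalent launch-time taint mechanisms):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      # Karpenter applies startupTaints to nodes at launch and expects
      # something else (here, Vigil) to remove them once the node is ready.
      startupTaints:
        - key: node.example.com/initializing
          effect: NoSchedule
```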
You could maintain a manual list of “these DaemonSets must be running before the node is ready” — but that’s fragile. Teams add and remove DaemonSets constantly. Someone forgets to update the list and you’re back to square one, or worse, nodes are stuck forever because the list references a DaemonSet that no longer exists.
We wanted something that could figure it out automatically.
Vigil
Vigil is a Kubernetes controller that solves this. It watches for nodes with the startup taint, auto-discovers which DaemonSets should be running on each node, waits for them all to reach Ready, and then removes the taint.
The key insight is the auto-discovery. Vigil uses the same scheduling predicates that the Kubernetes scheduler uses — node selectors, affinities, tolerations — to determine which DaemonSets belong on a given node. No manual allowlist. If you add a new DaemonSet with a node selector that matches, Vigil automatically includes it in its readiness checks.
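The discovery logic boils down to two checks per DaemonSet: does its pod template’s node selector match the node’s labels, and do its tolerations cover the node’s taints. Here is a simplified, self-contained sketch of that predicate — the types and function names are hypothetical stand-ins (a real controller would use the upstream scheduling helpers against the actual API objects), but the matching rules follow how Kubernetes evaluates selectors and tolerations:

```go
package main

import "fmt"

// Taint mirrors the fields that matter for scheduling.
type Taint struct{ Key, Value, Effect string }

// Toleration follows Kubernetes semantics: an empty Key with operator
// "Exists" tolerates every taint; an empty Effect matches all effects.
type Toleration struct {
	Key      string
	Operator string // "Exists" or "Equal"
	Value    string
	Effect   string
}

// DaemonSetSpec holds the slice of a DaemonSet's pod template we need.
type DaemonSetSpec struct {
	Name         string
	NodeSelector map[string]string
	Tolerations  []Toleration
}

// selectorMatches reports whether every selector key/value is present
// in the node's labels (an empty selector matches every node).
func selectorMatches(selector, labels map[string]string) bool {
	for k, v := range selector {
		if labels[k] != v {
			return false
		}
	}
	return true
}

// tolerates reports whether a single toleration covers a single taint.
func tolerates(t Toleration, taint Taint) bool {
	if t.Effect != "" && t.Effect != taint.Effect {
		return false
	}
	switch t.Operator {
	case "Exists":
		return t.Key == "" || t.Key == taint.Key
	default: // "Equal"
		return t.Key == taint.Key && t.Value == taint.Value
	}
}

// toleratesAll requires every taint to be covered by some toleration.
func toleratesAll(tols []Toleration, taints []Taint) bool {
	for _, taint := range taints {
		covered := false
		for _, t := range tols {
			if tolerates(t, taint) {
				covered = true
				break
			}
		}
		if !covered {
			return false
		}
	}
	return true
}

// BelongsOnNode reports whether ds should count toward the node's
// readiness checks.
func BelongsOnNode(ds DaemonSetSpec, nodeLabels map[string]string, nodeTaints []Taint) bool {
	return selectorMatches(ds.NodeSelector, nodeLabels) && toleratesAll(ds.Tolerations, nodeTaints)
}

func main() {
	labels := map[string]string{"kubernetes.io/os": "linux"}
	taints := []Taint{{Key: "node.example.com/initializing", Effect: "NoSchedule"}}

	agent := DaemonSetSpec{
		Name:         "datadog-agent",
		NodeSelector: map[string]string{"kubernetes.io/os": "linux"},
		Tolerations:  []Toleration{{Operator: "Exists"}}, // blanket toleration
	}
	fmt.Println(BelongsOnNode(agent, labels, taints)) // true
}
```

Because the check is derived from the DaemonSet specs themselves, adding or removing a DaemonSet changes the readiness set with no list to maintain.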
Here’s what the flow looks like:
New node launches
→ Node has taint: node.example.com/initializing:NoSchedule
→ Vigil detects the tainted node
→ Vigil discovers 8 DaemonSets should run on this node
→ Vigil watches: 5/8 Ready... 7/8 Ready... 8/8 Ready
→ Vigil removes the taint
→ Workload scheduling begins
DaemonSet pods themselves tolerate the startup taint: most infrastructure DaemonSets already carry a blanket operator: Exists toleration, and Kubernetes automatically adds tolerations for the built-in node-condition taints. So they schedule and start normally, while regular workload pods, which lack the toleration, wait in Pending until Vigil clears the node.
There’s a safety valve too — a configurable timeout (default 120 seconds). If DaemonSets don’t come up in time, Vigil removes the taint anyway so the node isn’t stuck forever. This is a tradeoff: you’d rather have the node available with a degraded DaemonSet than have it completely unusable. The timeout gives your alerting time to fire while keeping the cluster functional.
One Critical Requirement
Vigil itself has to be running somewhere that doesn’t depend on the taints it manages. Think about it: if Vigil runs on nodes with the startup taint, and Vigil is what removes that taint, you have a deadlock.
There are two clean options:
- A dedicated node pool without startup taints. A small set of nodes (even just one or two) reserved for cluster infrastructure like Vigil, where the taint isn’t applied.
- Fargate (or equivalent serverless compute). If you’re on EKS, running Vigil on Fargate profiles means it doesn’t need a node at all — it runs in its own isolated compute environment.
Either way, Vigil must be able to come up independently of the nodes it’s managing.
Final Thoughts
The underlying problem here is that Kubernetes treats node readiness as a binary — the kubelet says “I’m ready” and the scheduler starts packing pods. In reality, readiness is a spectrum. A node isn’t truly ready until its infrastructure layer is running.
Vigil gives you a clean way to express that. It’s a small controller with a single purpose: make sure new nodes are actually ready before they receive workloads. No manual lists to maintain, no fragile webhooks, no crossing your fingers that DaemonSets will win the race.
If this is a problem you’ve hit, check out Vigil on GitHub. It’s open source and ready to use.