Kubernetes Resilience: Agents, Backups, and Outages

The Reality Check
It is 2 AM. Your phone buzzes, shattering the quiet of the night. You open your laptop to a dashboard that looks like a Christmas tree on fire. Your cluster is throwing out-of-memory errors, a critical CVE just dropped, and when you try to pull the patch, the upstream package mirror is timing out.
This is the reality of modern cloud infrastructure. We spend years building elegant, highly abstracted microservice architectures, only to be brought to our knees because a single dependency we don't control decided to fail. The industry loves to talk about the latest shiny orchestration tools, but true Kubernetes resilience isn't about adding more complexity to your stack. It is about understanding the fundamentals of how your systems fail and ensuring you can recover when they do.
Today, we are looking at three distinct events that highlight the fragility of our ecosystems: the security nightmares of highly dynamic workloads, the shift of Velero to community governance, and a massive DDoS attack on Ubuntu's infrastructure.
The Core Problem: Unpredictable Dependencies
The real bottleneck in our infrastructure is not the technology itself; it is our assumption of predictability. We build systems assuming our workloads will behave in a linear fashion, our state will remain intact, and our external supply chains will always be available.
When we introduce highly dynamic workloads—like the autonomous agents currently flooding enterprise environments—we break the static security models of Kubernetes. When we rely solely on hypervisor snapshots, we ignore the complex state of our cluster's API. And when we assume apt-get update will always work, we leave our security posture at the mercy of the public internet.
Let's break down how these components actually interact under the hood, and how we can apply pragmatic, minimalist solutions to protect our sleep schedules.
Under the Hood: Securing Dynamic Workloads
Traditional Kubernetes workloads are predictable. A web server pod spins up, listens on port 8080, talks to a specific database, and writes to stdout. You can lock this down with strict Role-Based Access Control (RBAC) and network policies because you know exactly what the pod is going to do before it does it.
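As a sketch of what that predictability buys you, the NetworkPolicy below pins a web pod's egress to a single database. The names (web-server, postgres, the shop namespace, port 5432) are placeholders for illustration, not anything from a specific cluster.

```yaml
# Minimal sketch: a predictable web pod is only allowed to reach its database.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: web-server-egress
  namespace: shop                # placeholder namespace
spec:
  podSelector:
    matchLabels:
      app: web-server            # placeholder label
  policyTypes:
    - Egress
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: postgres      # the one backend it is supposed to talk to
      ports:
        - protocol: TCP
          port: 5432
```

You can write a policy this tight only because the workload's behavior is known in advance, which is exactly the assumption that dynamic agents break.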
Dynamic workloads—often referred to as autonomous agents—throw this predictability out the window. According to a recent InfoQ deep dive, these workloads make runtime decisions on external service calls, hold multi-domain credentials, and consume resources unpredictably.
Think of a traditional microservice like a line cook in a restaurant kitchen. The cook stays at their station, receives tickets, cooks burgers, and puts them on the pass. You only need to give them access to the grill and the fridge.
A dynamic agent is like a roaming manager who might need to cook a burger, audit the cash register, call the delivery supplier, or fix the plumbing, all depending on the situation. If you give this roaming manager the keys to everything all the time, a single compromised manager brings down the entire restaurant.
The Pragmatic Solution: Jobs and Vault
Before we reach for a complex new security mesh, let's use the primitives Kubernetes already provides.
Instead of running these dynamic tasks as long-lived Deployments, run them as Kubernetes Jobs. A Job gives each execution its own isolated container, memory space, and, most importantly, a finite lifecycle. If a workload goes rogue and starts consuming massive resources, the Job completes or fails, and the resources are reclaimed.
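A minimal Job sketch follows. The image reference and the fifteen-minute deadline are assumptions for illustration; the parts that matter are the hard resource ceiling and the finite lifecycle, after which Kubernetes reclaims everything.

```yaml
# Sketch of a bounded agent run: finite lifecycle, hard resource limits.
apiVersion: batch/v1
kind: Job
metadata:
  name: agent-task
spec:
  backoffLimit: 1                  # do not retry a rogue run forever
  activeDeadlineSeconds: 900       # kill the Job if it runs past 15 minutes
  ttlSecondsAfterFinished: 3600    # garbage-collect the pods an hour after completion
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: agent
          image: registry.example.com/agent-runner:latest   # placeholder image
          resources:
            requests:
              cpu: "250m"
              memory: 256Mi
            limits:
              cpu: "1"
              memory: 1Gi          # a runaway agent hits this ceiling, not your node
```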
For secrets, do not mount long-lived API keys into the pod. Use a tool like HashiCorp Vault to issue short-lived, dynamically generated credentials that expire the moment the Job finishes.
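As a sketch, assuming the Vault Agent Injector is installed in the cluster, the fragment below shows the annotations you would add to the Job's pod template. The role name and secret path are placeholders; the idea is that Vault's database secrets engine hands the pod credentials with their own short TTL instead of a static key baked into a Secret.

```yaml
# Fragment of the Job's pod template, assuming the Vault Agent Injector is deployed.
spec:
  template:
    metadata:
      annotations:
        vault.hashicorp.com/agent-inject: "true"
        vault.hashicorp.com/agent-pre-populate-only: "true"   # init-container only, so the Job can still complete
        vault.hashicorp.com/role: "agent-task"                 # placeholder Vault role
        vault.hashicorp.com/agent-inject-secret-db-creds: "database/creds/agent-task"   # placeholder secret path
```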
Workload Comparison
| Characteristic | Traditional Microservice | Dynamic Agent Workload |
|---|---|---|
| Lifecycle | Long-running (Deployment) | Ephemeral (Job) |
| Resource Usage | Predictable, static limits | Highly variable, bursty |
| Network Access | Static, predefined egress | Dynamic, multi-domain egress |
| Credential Strategy | Long-lived service accounts | Short-lived, scoped tokens |
Under the Hood: State and Harbor Logistics
Broadcom recently announced the donation of Velero to the Cloud Native Computing Foundation (CNCF). If you are not familiar with Velero, it is the industry standard for Kubernetes backup and disaster recovery. But why do we need a specific tool for Kubernetes backups? Can't we just snapshot the underlying virtual machines?
To understand why Velero is critical, think of a busy shipping harbor. The harbor has physical shipping containers (your persistent data/volumes), but it also has a Harbor Master's office filled with manifests, routing rules, and access logs (the Kubernetes API state, Custom Resource Definitions, and RBAC policies).
If a hurricane hits the harbor, taking a snapshot of the shipping containers isn't enough. When you rebuild the harbor, you have a pile of boxes but no idea where they go, who owns them, or which trucks are allowed to pick them up.
Velero operates at the Kubernetes API layer. Rather than reading disks or etcd directly, it queries the API server for the cluster state and backs up the manifests—the YAML definitions of your cluster state—alongside the physical volume snapshots.
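A hedged sketch of what that looks like in practice: a Velero Schedule that captures the API objects and volume snapshots every night. The cron expression, namespace selection, and retention window are illustrative defaults, not recommendations.

```yaml
# Sketch: nightly backup of API objects plus volume snapshots.
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: nightly-cluster-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"        # every night at 02:00 (placeholder)
  template:
    includedNamespaces:
      - "*"                    # everything, for illustration
    snapshotVolumes: true      # volumes and manifests, not one or the other
    ttl: 720h                  # keep backups for roughly 30 days
```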
The Pragmatic Solution: Vendor-Neutral Disaster Recovery
The donation of Velero to the CNCF as a Sandbox project is a massive win for operators. It ensures that the tool responsible for our disaster recovery is not locked behind a single vendor's roadmap.
The pragmatic approach here is simple: back up your state out-of-band. Ensure your Velero backups are shipped to an object storage bucket (like S3) that exists entirely outside your primary cloud region or provider. Test your restores. A backup you have never restored is just a theoretical concept.
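As a sketch of the out-of-band part, the BackupStorageLocation below points Velero at a bucket in a different region from the cluster. The bucket name and region are placeholders; the point is that the backup target does not share a failure domain with the thing it protects.

```yaml
# Sketch: ship backups to object storage outside the cluster's own region.
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: offsite
  namespace: velero
spec:
  provider: aws
  objectStorage:
    bucket: dr-velero-backups   # placeholder bucket name
  config:
    region: eu-west-1           # deliberately not the region the cluster runs in
```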
Under the Hood: Upstream Outages and Broken Pipes
While we are busy securing our dynamic workloads and backing up our state, we often forget the fragile supply chain our infrastructure relies on.
Recently, servers operated by Ubuntu and Canonical were knocked offline by a sustained DDoS attack. This outage prevented users from downloading OS updates from the primary Ubuntu servers. The timing was brutal: it happened hours after researchers released exploit code for a critical vulnerability that allowed untrusted users to gain root control of Linux servers.
Think of this like your house's plumbing bursting. You know exactly how to fix it, and you have the money to buy the parts. But when you drive to the hardware store, the main bridge is washed out. You are stranded with a flooding basement.
When security.ubuntu.com goes down, your automated CI/CD pipelines that build Docker images will fail. Your configuration management tools trying to patch servers will hang. Your infrastructure becomes frozen in a vulnerable state.
The Pragmatic Solution: Local Mirrors and Immutable Infrastructure
Relying on the public internet as a critical path for your infrastructure is a recipe for 2 AM alerts.
First, decouple your build processes from public endpoints. Set up a pull-through cache or a local package mirror (like Artifactory or Nexus) inside your network. When your build pipelines run apt-get install, they should talk to your local cache. If the upstream Ubuntu servers go down, your cache still serves the packages you need to build your images.
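One way to sketch this inside the cluster itself: run apt-cacher-ng (or whatever caching proxy you prefer) as a small Deployment and Service, then point apt in your build containers at it. The namespace, image reference, and service name below are placeholders, not a specific published image.

```yaml
# Sketch: an in-cluster apt caching proxy for build pipelines.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: apt-cache
  namespace: infra
spec:
  replicas: 1
  selector:
    matchLabels:
      app: apt-cache
  template:
    metadata:
      labels:
        app: apt-cache
    spec:
      containers:
        - name: apt-cacher-ng
          image: registry.example.com/apt-cacher-ng:latest   # placeholder image
          ports:
            - containerPort: 3142     # apt-cacher-ng's default port
---
apiVersion: v1
kind: Service
metadata:
  name: apt-cache
  namespace: infra
spec:
  selector:
    app: apt-cache
  ports:
    - port: 3142
      targetPort: 3142
```

In a Dockerfile, that typically means setting apt's Acquire::http::Proxy option to the cache's address before any apt-get install runs, so a dead upstream mirror only means stale cache hits rather than failed builds.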
Second, embrace immutable infrastructure. Do not patch running servers in production. Build a new machine image (AMI) or container image in your secure, cached build environment, and roll that out. If the upstream goes down, your running servers aren't stuck halfway through an update script.
What You Should Do Next
Stop chasing the newest orchestration tool and spend this week auditing your failure domains.
1. Audit Long-Lived Credentials: Look at the service accounts attached to your pods. If a pod is executing dynamic, unpredictable tasks, move it to a Kubernetes Job and implement short-lived tokens.
2. Verify Your Backups: Check your Velero schedules. More importantly, schedule a fire drill this Friday to restore a non-production namespace from scratch using only your Velero backup (a sketch of the restore object follows this list).
3. Cache Your Dependencies: Review your Dockerfiles and CI/CD pipelines. If you see hardcoded calls to archive.ubuntu.com or public NPM registries without a fallback or cache, you are one DDoS attack away from a blocked deployment pipeline.
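For step 2, here is a sketch of the restore object, assuming a backup produced by a nightly schedule. The backup name and namespaces are placeholders, and namespaceMapping keeps the drill in a scratch namespace instead of overwriting the original.

```yaml
# Sketch: fire-drill restore into a scratch namespace.
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: fire-drill-restore
  namespace: velero
spec:
  backupName: nightly-cluster-backup-20250101020000   # placeholder backup name
  includedNamespaces:
    - staging                        # placeholder source namespace
  namespaceMapping:
    staging: staging-restore-test    # restore somewhere safe, not on top of the original
```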
Takeaway
There is no perfect system. There are only recoverable systems.