No Kubespray, no heavy tools. Just K3s — a single binary, 1 command per node. Build a lightweight Kubernetes cluster on 4 GPU VMs and deploy NVIDIA Dynamo for LLM inference.
K3s is a lightweight, production-ready Kubernetes distribution by Rancher (SUSE). The entire control plane is a single ~70MB binary. No Ansible, no playbooks, no 20-minute install. One curl command on the server, one on each worker, done.
If Kubespray is like hiring a general contractor with blueprints and a 20-person crew, K3s is like snapping together prefab walls. The result is the same — a functional building (a working K8s cluster) — but K3s gets there in 2 minutes instead of 20. Fewer moving parts, less to debug.
Kubespray: Ansible-based. Needs a deployer machine. ~20 min install. 500+ tasks. Great for complex, customized clusters. Overkill for 4 nodes.
K3s: Single binary. 1 curl command per node. ~2 min install. Built-in containerd, CoreDNS, Flannel, Traefik. Perfect for GPU inference clusters.
| Feature | K3s | Kubespray (kubeadm) |
|---|---|---|
| Install time | ~2 min | ~20 min |
| Install method | Single curl \| sh | Ansible playbook |
| Binary size | ~70 MB | ~500 MB+ (multiple components) |
| Container runtime | Built-in containerd | Configurable (containerd/cri-o) |
| Networking | Built-in Flannel (or Calico/Cilium) | Calico (configurable) |
| Datastore | Built-in SQLite or embedded etcd | Separate etcd cluster |
| GPU support | Via GPU Operator (same as full K8s) | Via GPU Operator (same) |
| Production ready | Yes (CNCF certified) | Yes |
| Best for | Small-medium clusters, GPU inference | Large, complex, customized clusters |
| VM | Hostname | IP | GPU | RAM | K3s Role |
|---|---|---|---|---|---|
| VM 1 | k3s-server | 10.0.0.10 | H100 80GB | 256 GB | Server (control plane) |
| VM 2 | gpu-node-1 | 10.0.0.11 | H100 80GB | 256 GB | Agent (GPU worker) |
| VM 3 | gpu-node-2 | 10.0.0.12 | H100 80GB | 256 GB | Agent (GPU worker) |
| VM 4 | gpu-node-3 | 10.0.0.13 | H100 80GB | 256 GB | Agent (GPU worker) |
Network Layout:
┌──────────────────────────────────────────────────────────────┐
│ Your Network (10.0.0.0/24) │
│ │
│ ┌─────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐
│ │ k3s-server │ │ gpu-node-1 │ │ gpu-node-2 │ │ gpu-node-3 │
│ │ 10.0.0.10 │ │ 10.0.0.11 │ │ 10.0.0.12 │ │ 10.0.0.13 │
│ │ [H100] │ │ [H100] │ │ [H100] │ │ [H100] │
│ │ │ │ │ │ │ │ │
│ │ K3s Server │ │ K3s Agent │ │ K3s Agent │ │ K3s Agent │
│ │ (control │ │ (worker) │ │ (worker) │ │ (worker) │
│ │ plane) │ │ │ │ │ │ │
│ └─────────────┘ └────────────┘ └────────────┘ └────────────┘
│ │
└──────────────────────────────────────────────────────────────┘
K3s terminology:
"Server" = control plane node (runs API server, scheduler, etcd)
"Agent" = worker node (runs pods, including GPU workloads)
This is the entire Kubernetes installation. No Ansible, no playbooks. Just curl.
Same basics on each VM — Ubuntu 24.04, swap off, forwarding on:
# Run on ALL 4 VMs:
sudo swapoff -a
sudo sed -i '/ swap / s/^/#/' /etc/fstab
echo "net.ipv4.ip_forward=1" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
One command. This installs the full control plane.
# SSH into k3s-server (10.0.0.10):
curl -sfL https://get.k3s.io | sh -

# That's it. K3s is running. Verify:
sudo kubectl get nodes
# NAME         STATUS   ROLES                  AGE   VERSION
# k3s-server   Ready    control-plane,master   30s   v1.31.4+k3s1

# Grab the join token (agents need this to join):
sudo cat /var/lib/rancher/k3s/server/node-token
# → K10abc123...::server:xyz789...
What just happened? That single curl | sh installed: containerd, kubelet, kube-proxy, kube-apiserver, kube-scheduler, kube-controller-manager, a SQLite-backed datastore (embedded etcd is the HA option), CoreDNS, Flannel networking, metrics-server, and Traefik ingress. All in one binary. All running.
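You can see those built-ins for yourself; the bundled add-ons run as regular pods (names carry generated suffixes, so the list below is indicative):

# The bundled add-ons live in kube-system:
sudo kubectl get pods -n kube-system
# Expect coredns, local-path-provisioner, metrics-server, and traefik
# (plus a completed helm-install job or two), all Running or Completed.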
One command per worker. They auto-join the cluster.
# SSH into gpu-node-1 (10.0.0.11):
curl -sfL https://get.k3s.io | K3S_URL=https://10.0.0.10:6443 \
  K3S_TOKEN="K10abc123...::server:xyz789..." sh -

# SSH into gpu-node-2 (10.0.0.12): same command
curl -sfL https://get.k3s.io | K3S_URL=https://10.0.0.10:6443 \
  K3S_TOKEN="K10abc123...::server:xyz789..." sh -

# SSH into gpu-node-3 (10.0.0.13): same command
curl -sfL https://get.k3s.io | K3S_URL=https://10.0.0.10:6443 \
  K3S_TOKEN="K10abc123...::server:xyz789..." sh -
# Back on k3s-server:
sudo kubectl get nodes -o wide
# NAME         STATUS   ROLES                  AGE   VERSION        INTERNAL-IP
# k3s-server   Ready    control-plane,master   2m    v1.31.4+k3s1   10.0.0.10
# gpu-node-1   Ready    <none>                 45s   v1.31.4+k3s1   10.0.0.11
# gpu-node-2   Ready    <none>                 40s   v1.31.4+k3s1   10.0.0.12
# gpu-node-3   Ready    <none>                 35s   v1.31.4+k3s1   10.0.0.13

# 4 nodes, all Ready. Kubernetes cluster is running.
# Total time: ~2 minutes.
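Optional: to run kubectl from your own machine instead of SSHing into the server, copy the kubeconfig out and point it at the server's IP. A minimal sketch; the SSH user and local paths are placeholders for your environment:

# On your workstation (needs kubectl installed and SSH access to k3s-server):
mkdir -p ~/.kube
ssh ubuntu@10.0.0.10 "sudo cat /etc/rancher/k3s/k3s.yaml" > ~/.kube/k3s.yaml
sed -i 's/127.0.0.1/10.0.0.10/' ~/.kube/k3s.yaml   # on macOS: sed -i ''
export KUBECONFIG=~/.kube/k3s.yaml
kubectl get nodes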
That's it for Kubernetes. With Kubespray you'd still be watching Ansible tasks scroll by. K3s is like plugging in a power strip vs wiring a house from scratch. Both give you outlets (a working K8s cluster), but one takes 2 minutes.
The cluster is running but Kubernetes doesn't know about the GPUs yet. The GPU Operator installs drivers, container toolkit, and device plugins so pods can request nvidia.com/gpu resources.
# On k3s-server, install Helm:
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash

# K3s stores its kubeconfig at a different path:
export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
# Add NVIDIA Helm repo
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install GPU Operator
# Note: K3s uses containerd at /run/k3s/containerd/containerd.sock
helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace \
  --set toolkit.env[0].name=CONTAINERD_CONFIG \
  --set toolkit.env[0].value=/var/lib/rancher/k3s/agent/etc/containerd/config.toml \
  --set toolkit.env[1].name=CONTAINERD_SOCKET \
  --set toolkit.env[1].value=/run/k3s/containerd/containerd.sock \
  --set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS \
  --set toolkit.env[2].value=nvidia

# Wait for all pods to be running (~3-5 minutes):
kubectl -n gpu-operator get pods -w
K3s-specific config: K3s uses containerd at a non-standard path (/run/k3s/containerd/). The --set toolkit.env flags above tell the GPU Operator where to find it. Without these, the NVIDIA container toolkit can't configure the runtime and GPU pods will fail.
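Once the operator pods settle, a quick sanity check of that wiring (exact file contents and resource names vary by GPU Operator version):

# On a GPU node: the toolkit should have added an nvidia runtime to K3s's containerd config
sudo grep -A2 nvidia /var/lib/rancher/k3s/agent/etc/containerd/config.toml

# In the cluster: the operator normally registers an "nvidia" RuntimeClass
kubectl get runtimeclass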
# Check each node has a GPU:
kubectl get nodes -o json | jq -r \
  '.items[] | "\(.metadata.name): \(.status.allocatable["nvidia.com/gpu"] // "no-gpu")"'
# → k3s-server: 1
# → gpu-node-1: 1
# → gpu-node-2: 1
# → gpu-node-3: 1

# Run a quick GPU test pod (a manifest avoids kubectl run's deprecated --limits flag):
cat << 'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  runtimeClassName: nvidia   # explicit; harmless if nvidia is already the default runtime
  containers:
  - name: gpu-test
    image: nvidia/cuda:12.6.0-base-ubuntu24.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: "1"
EOF
kubectl logs gpu-test        # once it has run: should show your H100 and CUDA version
kubectl delete pod gpu-test
Cluster state after GPU Operator:
k3s-server (10.0.0.10):
┌──────────────────────────────────┐
│ K3s Server (control plane) │
│ + API Server + etcd + scheduler │
│ + GPU Operator controller pods │
│ [H100 — nvidia.com/gpu: 1] │
└──────────────────────────────────┘
│
┌─────┼──────────────┐
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│gpu-node-1│ │gpu-node-2│ │gpu-node-3│
│[H100] │ │[H100] │ │[H100] │
│nvidia/gpu│ │nvidia/gpu│ │nvidia/gpu│
│: 1 │ │: 1 │ │: 1 │
│ │ │ │ │ │
│driver │ │driver │ │driver │
│toolkit │ │toolkit │ │toolkit │
│plugin │ │plugin │ │plugin │
└──────────┘ └──────────┘ └──────────┘
Every node: NVIDIA driver + container toolkit + device plugin
All installed automatically by the GPU Operator DaemonSet
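To see those per-node pieces, list the DaemonSets the operator manages (names vary slightly between GPU Operator releases):

kubectl -n gpu-operator get daemonsets
# Typically includes the driver, container-toolkit, device-plugin,
# dcgm-exporter, and gpu-feature-discovery DaemonSets.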
With the GPUs visible to Kubernetes, install NVIDIA Dynamo from NVIDIA's Helm registry:

export VERSION=0.9.0
# CRDs (Custom Resource Definitions)
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-${VERSION}.tgz
helm install dynamo-crds dynamo-crds-${VERSION}.tgz -n default
# Platform (operator + controllers)
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${VERSION}.tgz
helm install dynamo-platform dynamo-platform-${VERSION}.tgz \
-n dynamo-system --create-namespace
# HuggingFace token for model download
kubectl create secret generic hf-secret \
-n dynamo-system \
--from-literal=HF_TOKEN=hf_your_token_here
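Before deploying a model, confirm the platform came up (pod names vary by chart version):

kubectl -n dynamo-system get pods
# Expect the Dynamo operator/controller pods plus their supporting services
# (e.g. etcd and NATS) to reach Running before applying a DynamoGraphDeployment.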
Deploy Llama-3-70B with disaggregated serving: 1 prefill GPU + 2 decode GPUs across the 3 worker nodes:
cat > llama70b-disagg.yaml << 'EOF'
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
name: llama-70b
namespace: dynamo-system
spec:
services:
Frontend:
replicas: 1
PrefillWorker:
replicas: 1
resources:
limits:
nvidia.com/gpu: "1"
env:
- name: MODEL_PATH
value: "meta-llama/Llama-3-70B"
DecodeWorker:
replicas: 2
resources:
limits:
nvidia.com/gpu: "1"
env:
- name: MODEL_PATH
value: "meta-llama/Llama-3-70B"
EOF
kubectl apply -f llama70b-disagg.yaml
# Watch pods come up:
kubectl -n dynamo-system get pods -o wide -w
NAME READY STATUS NODE
llama-70b-frontend-xxxxx 1/1 Running k3s-server
llama-70b-prefill-worker-xxxxx 1/1 Running gpu-node-1
llama-70b-decode-worker-0-xxxxx 1/1 Running gpu-node-2
llama-70b-decode-worker-1-xxxxx 1/1 Running gpu-node-3
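# The curl below assumes the Frontend is reachable on the server IP at port 8000.
# If it's only exposed as a ClusterIP Service, port-forward it first.
# (Service name is assumed from the deployment name; confirm with:
#  kubectl -n dynamo-system get svc)
kubectl -n dynamo-system port-forward svc/llama-70b-frontend 8000:8000 --address 0.0.0.0 &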
# Test the API:
curl http://10.0.0.10:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3-70B",
"messages": [{"role": "user", "content": "What is K3s?"}],
"stream": true,
"max_tokens": 200
}'
Request flow through the cluster:
You → curl http://10.0.0.10:8000/v1/chat/completions
│
▼
┌──────────────────────────────────────────────────┐
│ k3s-server (10.0.0.10) │
│ [Frontend Pod] receives HTTP, tokenizes │
│ [Router Pod] checks KV cache → picks best GPU │
└──────┬───────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────┐
│ gpu-node-1 (10.0.0.11) — PREFILL │
│ Processes full prompt through 80 layers │
│ Generates KV cache (~1.3 GB) │
│ GPU compute: ~85% utilized │
│ Time: ~45ms │
└──────┬───────────────────────────────────────────┘
│ NIXL KV cache transfer
▼
┌──────────────────────────────────────────────────┐
│ gpu-node-2 (10.0.0.12) — DECODE │
│ Receives KV cache, generates tokens 1-by-1 │
│ GPU bandwidth: ~88% utilized │
│ ~40ms per token → streams back to you │
└──────────────────────────────────────────────────┘
gpu-node-3 handles decode for other concurrent users (batched)
# On the new VM (10.0.0.14):
curl -sfL https://get.k3s.io | K3S_URL=https://10.0.0.10:6443 \
  K3S_TOKEN="K10abc123...::server:xyz789..." sh -

# That's it. GPU Operator auto-installs drivers via DaemonSet.
# Dynamo Planner auto-discovers the new GPU and schedules work on it.

# Verify:
sudo kubectl get nodes
# → gpu-node-4   Ready   <none>   30s   v1.31.4+k3s1
# On the node you want to remove:
sudo /usr/local/bin/k3s-agent-uninstall.sh

# On k3s-server:
kubectl delete node gpu-node-4
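To remove a node more gracefully, drain it first so its pods are rescheduled before you uninstall:

# On k3s-server, before running the uninstall script on the node:
kubectl drain gpu-node-4 --ignore-daemonsets --delete-emptydir-data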
# On each node (server first, then agents):
curl -sfL https://get.k3s.io | INSTALL_K3S_CHANNEL=latest sh -

# K3s upgrades in-place. No drain, no cordon for minor versions.
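To control exactly what version lands on each node, pin a release instead of tracking a channel (INSTALL_K3S_VERSION is the installer's env var for this; the version string is just an example):

# Pin an explicit K3s release rather than following "latest":
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.31.4+k3s1 sh -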
# Quick check — GPU usage across all nodes:
kubectl exec -n gpu-operator $(kubectl get pods -n gpu-operator \
  -l app=nvidia-dcgm-exporter -o name | head -1) \
  -- dcgm-dmon -e 203,204 -d 1 -c 5

# For full observability: deploy Prometheus + Grafana
# DCGM exporter exposes GPU metrics at :9400/metrics
# Dynamo exposes inference metrics at :8000/metrics
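One way to get that Prometheus + Grafana stack, sketched with the community kube-prometheus-stack chart (the release name "monitoring" and the derived secret name are assumptions; dashboards and scrape configs for the DCGM and Dynamo endpoints are left to you):

# Prometheus + Grafana via the community Helm chart:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install monitoring prometheus-community/kube-prometheus-stack \
  -n monitoring --create-namespace

# Grafana admin password (secret name follows the release name):
kubectl -n monitoring get secret monitoring-grafana \
  -o jsonpath='{.data.admin-password}' | base64 -d; echo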
If you want to skip Dynamo and just serve a model directly on K3s (no disaggregated serving, no KV routing — just basic inference):
# Option A: vLLM directly in a pod
cat > vllm-simple.yaml << 'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
name: vllm-server
spec:
replicas: 1
selector:
matchLabels:
app: vllm
template:
metadata:
labels:
app: vllm
spec:
containers:
- name: vllm
image: vllm/vllm-openai:latest
args:
- "--model"
- "meta-llama/Llama-3-8B"
- "--port"
- "8000"
resources:
limits:
nvidia.com/gpu: "1"
ports:
- containerPort: 8000
---
apiVersion: v1
kind: Service
metadata:
name: vllm-service
spec:
type: NodePort
selector:
app: vllm
ports:
- port: 8000
targetPort: 8000
nodePort: 30080
EOF
kubectl apply -f vllm-simple.yaml
# Access at: http://10.0.0.10:30080/v1/chat/completions
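A quick smoke test against that NodePort (same OpenAI-style API; the model name must match the --model flag above):

curl http://10.0.0.10:30080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3-8B",
    "messages": [{"role": "user", "content": "What is K3s?"}],
    "max_tokens": 100
  }'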
When to use Dynamo vs direct serving: Direct vLLM/SGLang is fine for small models (<13B) on 1 GPU. Use Dynamo when you need disaggregated serving across multiple GPUs, KV-aware routing, autoscaling, or you're running models that don't fit on a single GPU (70B+).
| Task | Command |
|---|---|
| Install server | curl -sfL https://get.k3s.io \| sh - |
| Get join token | sudo cat /var/lib/rancher/k3s/server/node-token |
| Join agent | curl -sfL https://get.k3s.io \| K3S_URL=https://SERVER:6443 K3S_TOKEN=xxx sh - |
| Kubeconfig path | /etc/rancher/k3s/k3s.yaml |
| kubectl (on server) | sudo kubectl get nodes (or sudo k3s kubectl) |
| Check K3s status | sudo systemctl status k3s |
| K3s logs | sudo journalctl -u k3s -f |
| Uninstall server | sudo /usr/local/bin/k3s-uninstall.sh |
| Uninstall agent | sudo /usr/local/bin/k3s-agent-uninstall.sh |
| Containerd socket | /run/k3s/containerd/containerd.sock |
| Restart K3s | sudo systemctl restart k3s |
| Phase | What | Time |
|---|---|---|
| 0 | 4 Ubuntu 24.04 VMs with GPUs, swap off | ~2 min |
| 1 | Install K3s server (curl \| sh) | ~30 sec |
| 2 | Join 3 agents (curl \| sh × 3) | ~90 sec |
| 3 | Install GPU Operator (Helm) | ~5 min |
| 4 | Install Dynamo Platform (Helm) | ~3 min |
| 5 | Deploy Llama-3-70B (kubectl apply) | ~10 min (model DL) |
| 6 | Serve inference | immediate |
| Total | Bare VMs → serving LLM | ~22 min |
The Stack:

┌─────────────────────────────────────────┐
│  Your App / OpenAI SDK                  │
├─────────────────────────────────────────┤
│  Dynamo Frontend + Router + Workers     │
│  NIXL (KV transfer)                     │
├─────────────────────────────────────────┤
│  NVIDIA GPU Operator                    │
│  (drivers + toolkit + device plugin)    │
├─────────────────────────────────────────┤
│  K3s (lightweight Kubernetes)           │
│  containerd + Flannel + CoreDNS         │
├─────────────────────────────────────────┤
│  Ubuntu 24.04                           │
├─────────────────────────────────────────┤
│  4× VMs with H100 GPUs                  │
└─────────────────────────────────────────┘

Kubespray version: ~50 minutes
K3s version:       ~22 minutes  ← you are here
Built as an educational resource.
K3s · NVIDIA Dynamo