K3s + NVIDIA GPU Operator + Dynamo

K3s GPU Cluster: 4 VMs to Serving an LLM

No Kubespray, no heavy tools. Just K3s — a single binary, 1 command per node. Build a lightweight Kubernetes cluster on 4 GPU VMs and deploy NVIDIA Dynamo for LLM inference.

Why K3s

K3s: Kubernetes in a single binary

K3s is a lightweight, production-ready Kubernetes distribution by Rancher (SUSE). The entire control plane is a single ~70MB binary. No Ansible, no playbooks, no 20-minute install. One curl command on the server, one on each worker, done.

If Kubespray is like hiring a general contractor with blueprints and a 20-person crew, K3s is like snapping together prefab walls. The result is the same — a functional building (a working K8s cluster) — but K3s gets there in 2 minutes instead of 20. Fewer moving parts, less to debug.

Kubespray

Ansible-based. Needs a deployer machine. ~20 min install. 500+ tasks. Great for complex, customized clusters. Overkill for 4 nodes.

K3s (what we're using)

Single binary. 1 curl command per node. ~2 min install. Built-in containerd, CoreDNS, Flannel, Traefik. Perfect for GPU inference clusters.

  Feature              K3s                                      Kubespray (kubeadm)
  Install time         ~2 min                                   ~20 min
  Install method       Single curl | sh                         Ansible playbook
  Binary size          ~70 MB                                   ~500 MB+ (multiple components)
  Container runtime    Built-in containerd                      Configurable (containerd/cri-o)
  Networking           Built-in Flannel (or Calico/Cilium)      Calico (configurable)
  Datastore            Built-in SQLite or embedded etcd         Separate etcd cluster
  GPU support          Via GPU Operator (same as full K8s)      Via GPU Operator (same)
  Production ready     Yes (CNCF certified)                     Yes
  Best for             Small-medium clusters, GPU inference     Large, complex, customized clusters

The Setup

Our 4-VM cluster

  VM     Hostname      IP           GPU          RAM      K3s Role
  VM 1   k3s-server    10.0.0.10    H100 80GB    256 GB   Server (control plane)
  VM 2   gpu-node-1    10.0.0.11    H100 80GB    256 GB   Agent (GPU worker)
  VM 3   gpu-node-2    10.0.0.12    H100 80GB    256 GB   Agent (GPU worker)
  VM 4   gpu-node-3    10.0.0.13    H100 80GB    256 GB   Agent (GPU worker)

Network Layout:

  Your Network (10.0.0.0/24)

  ┌─────────────┐  ┌────────────┐  ┌────────────┐  ┌────────────┐
  │ k3s-server  │  │ gpu-node-1 │  │ gpu-node-2 │  │ gpu-node-3 │
  │  10.0.0.10  │  │  10.0.0.11 │  │  10.0.0.12 │  │  10.0.0.13 │
  │  [H100]     │  │  [H100]    │  │  [H100]    │  │  [H100]    │
  │             │  │            │  │            │  │            │
  │  K3s Server │  │  K3s Agent │  │  K3s Agent │  │  K3s Agent │
  │  (control   │  │  (worker)  │  │  (worker)  │  │  (worker)  │
  │   plane)    │  │            │  │            │  │            │
  └─────────────┘  └────────────┘  └────────────┘  └────────────┘

  K3s terminology:
    "Server" = control plane node (runs API server, scheduler, etcd)
    "Agent"  = worker node (runs pods, including GPU workloads)
Phase 1 — ~3 min

Install K3s cluster

This is the entire Kubernetes installation. No Ansible, no playbooks. Just curl.

1. Prep all 4 VMs (30 seconds each)

Same basics on each VM — Ubuntu 24.04, swap off, forwarding on:

# Run on ALL 4 VMs:
sudo swapoff -a
sudo sed -i '/ swap / s/^/#/' /etc/fstab
echo "net.ipv4.ip_forward=1" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
2. Install K3s server (VM 1)

One command. This installs the full control plane.

# SSH into k3s-server (10.0.0.10):
curl -sfL https://get.k3s.io | sh -

# That's it. K3s is running. Verify:
sudo kubectl get nodes
# NAME         STATUS   ROLES                  AGE   VERSION
# k3s-server   Ready    control-plane,master   30s   v1.31.4+k3s1

# Grab the join token (agents need this to join):
sudo cat /var/lib/rancher/k3s/server/node-token
# → K10abc123...::server:xyz789...

What just happened? That single curl | sh installed: containerd, kubelet, kube-proxy, kube-apiserver, kube-scheduler, kube-controller-manager, a SQLite-backed datastore (embedded etcd is available for HA setups), CoreDNS, Flannel networking, metrics-server, and Traefik ingress. All in one binary. All running.
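If you want to see those pieces for yourself: the control-plane components run inside the single k3s process, and the bundled add-ons show up as ordinary pods in kube-system.

# The whole control plane is one systemd unit:
sudo systemctl status k3s --no-pager

# The bundled add-ons (CoreDNS, metrics-server, Traefik) run as regular pods:
sudo kubectl get pods -n kube-system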

3. Join 3 GPU agents (VMs 2, 3, 4)

One command per worker. They auto-join the cluster.

# SSH into gpu-node-1 (10.0.0.11):
curl -sfL https://get.k3s.io | K3S_URL=https://10.0.0.10:6443 \
  K3S_TOKEN="K10abc123...::server:xyz789..." sh -

# SSH into gpu-node-2 (10.0.0.12): same command
curl -sfL https://get.k3s.io | K3S_URL=https://10.0.0.10:6443 \
  K3S_TOKEN="K10abc123...::server:xyz789..." sh -

# SSH into gpu-node-3 (10.0.0.13): same command
curl -sfL https://get.k3s.io | K3S_URL=https://10.0.0.10:6443 \
  K3S_TOKEN="K10abc123...::server:xyz789..." sh -
4. Verify the cluster
# Back on k3s-server:
sudo kubectl get nodes -o wide

NAME         STATUS   ROLES                  AGE   VERSION        INTERNAL-IP
k3s-server   Ready    control-plane,master   2m    v1.31.4+k3s1   10.0.0.10
gpu-node-1   Ready    <none>                 45s   v1.31.4+k3s1   10.0.0.11
gpu-node-2   Ready    <none>                 40s   v1.31.4+k3s1   10.0.0.12
gpu-node-3   Ready    <none>                 35s   v1.31.4+k3s1   10.0.0.13

# 4 nodes, all Ready. Kubernetes cluster is running.
# Total time: ~2 minutes.

That's it for Kubernetes. With Kubespray you'd still be watching Ansible tasks scroll by. K3s is like plugging in a power strip vs wiring a house from scratch. Both give you outlets (a working K8s cluster), but one takes 2 minutes.

Phase 2 — ~5 min

Install NVIDIA GPU Operator

The cluster is running but Kubernetes doesn't know about the GPUs yet. The GPU Operator installs drivers, container toolkit, and device plugins so pods can request nvidia.com/gpu resources.

5. Install Helm (if not already present)
# On k3s-server:
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash

# K3s stores its kubeconfig at a different path:
export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
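
If you are not running everything as root, note that /etc/rancher/k3s/k3s.yaml is readable only by root, so helm and kubectl will fail to load it as a normal user. A common workaround (a sketch, adjust the path to taste):

# Copy the K3s kubeconfig somewhere your user can read:
mkdir -p ~/.kube
sudo cp /etc/rancher/k3s/k3s.yaml ~/.kube/config
sudo chown $(id -u):$(id -g) ~/.kube/config
export KUBECONFIG=~/.kube/config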
6. Install the GPU Operator
# Add NVIDIA Helm repo
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install GPU Operator
# Note: K3s uses containerd at /run/k3s/containerd/containerd.sock
helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace \
  --set toolkit.env[0].name=CONTAINERD_CONFIG \
  --set toolkit.env[0].value=/var/lib/rancher/k3s/agent/etc/containerd/config.toml \
  --set toolkit.env[1].name=CONTAINERD_SOCKET \
  --set toolkit.env[1].value=/run/k3s/containerd/containerd.sock \
  --set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS \
  --set toolkit.env[2].value=nvidia

# Wait for all pods to be running (~3-5 minutes):
kubectl -n gpu-operator get pods -w

K3s-specific config: K3s uses containerd at a non-standard path (/run/k3s/containerd/). The --set toolkit.env flags above tell the GPU Operator where to find it. Without these, the NVIDIA container toolkit can't configure the runtime and GPU pods will fail.
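
A quick sanity check once the toolkit rolls out (hedged: the RuntimeClass and label names below are GPU Operator defaults and may differ between chart versions):

# The operator should register an "nvidia" RuntimeClass...
kubectl get runtimeclass

# ...and its validator pods should reach Running/Completed on every GPU node:
kubectl -n gpu-operator get pods -l app=nvidia-operator-validator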

7. Verify GPUs are visible
# Check each node has a GPU:
kubectl get nodes -o json | jq -r \
  '.items[] | "\(.metadata.name): \(.status.allocatable["nvidia.com/gpu"] // "no-gpu")"'

# → k3s-server: 1
# → gpu-node-1: 1
# → gpu-node-2: 1
# → gpu-node-3: 1

# Run a quick GPU test pod (recent kubectl removed `kubectl run --limits`,
# so apply a minimal pod spec instead):
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  runtimeClassName: nvidia   # registered by the GPU Operator; harmless if nvidia is already the default runtime
  containers:
  - name: gpu-test
    image: nvidia/cuda:12.6.0-base-ubuntu24.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: "1"
EOF

# Once the pod completes, its log should show your H100 details and CUDA version:
kubectl logs gpu-test
kubectl delete pod gpu-test
Cluster state after GPU Operator:

  k3s-server (10.0.0.10):
  ┌──────────────────────────────────┐
  │ K3s Server (control plane)       │
  │ + API Server + etcd + scheduler  │
  │ + GPU Operator controller pods   │
  │ [H100 — nvidia.com/gpu: 1]       │
  └──────────────────────────────────┘
           │
     ┌─────┼──────────────┐
     ▼     ▼              ▼
  ┌──────────┐ ┌──────────┐ ┌──────────┐
  │gpu-node-1│ │gpu-node-2│ │gpu-node-3│
  │[H100]    │ │[H100]    │ │[H100]    │
  │nvidia/gpu│ │nvidia/gpu│ │nvidia/gpu│
  │: 1       │ │: 1       │ │: 1       │
  │          │ │          │ │          │
  │driver    │ │driver    │ │driver    │
  │toolkit   │ │toolkit   │ │toolkit   │
  │plugin    │ │plugin    │ │plugin    │
  └──────────┘ └──────────┘ └──────────┘

  Every node: NVIDIA driver + container toolkit + device plugin
  All installed automatically by the GPU Operator DaemonSet
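
Those per-node pieces are plain DaemonSets, so you can inspect them directly (exact pod and DaemonSet names vary slightly between GPU Operator releases):

# Driver, container toolkit, device plugin, DCGM exporter: one pod each per node
kubectl -n gpu-operator get daemonsets
kubectl -n gpu-operator get pods -o wide | grep gpu-node-1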
Phase 3 — ~10 min

Deploy NVIDIA Dynamo for LLM serving

8. Install Dynamo Platform
export VERSION=0.9.0

# CRDs (Custom Resource Definitions)
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-${VERSION}.tgz
helm install dynamo-crds dynamo-crds-${VERSION}.tgz -n default

# Platform (operator + controllers)
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${VERSION}.tgz
helm install dynamo-platform dynamo-platform-${VERSION}.tgz \
  -n dynamo-system --create-namespace

# HuggingFace token for model download
kubectl create secret generic hf-secret \
  -n dynamo-system \
  --from-literal=HF_TOKEN=hf_your_token_here
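
Before deploying a model, confirm the platform controllers came up cleanly (pod names vary by release, so just look for everything Running):

kubectl -n dynamo-system get pods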
9. Deploy Llama-3-70B with disaggregated serving

1 prefill GPU + 2 decode GPUs across the 3 worker nodes:

cat > llama70b-disagg.yaml << 'EOF'
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: llama-70b
  namespace: dynamo-system
spec:
  services:
    Frontend:
      replicas: 1
    PrefillWorker:
      replicas: 1
      resources:
        limits:
          nvidia.com/gpu: "1"
      env:
        - name: MODEL_PATH
          value: "meta-llama/Llama-3-70B"
    DecodeWorker:
      replicas: 2
      resources:
        limits:
          nvidia.com/gpu: "1"
      env:
        - name: MODEL_PATH
          value: "meta-llama/Llama-3-70B"
EOF

kubectl apply -f llama70b-disagg.yaml
10. Verify and test
# Watch pods come up:
kubectl -n dynamo-system get pods -o wide -w

NAME                                 READY   STATUS    NODE
llama-70b-frontend-xxxxx             1/1     Running   k3s-server
llama-70b-prefill-worker-xxxxx       1/1     Running   gpu-node-1
llama-70b-decode-worker-0-xxxxx      1/1     Running   gpu-node-2
llama-70b-decode-worker-1-xxxxx      1/1     Running   gpu-node-3

# Test the API:
curl http://10.0.0.10:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3-70B",
    "messages": [{"role": "user", "content": "What is K3s?"}],
    "stream": true,
    "max_tokens": 200
  }'
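
If port 8000 isn't reachable on the node IP (how the frontend is exposed depends on the Service the operator creates), a port-forward is a reliable fallback. The service name below is an assumption based on the deployment name; check kubectl -n dynamo-system get svc for the real one.

# Forward the frontend locally, then re-run the curl above against localhost:8000
kubectl -n dynamo-system port-forward svc/llama-70b-frontend 8000:8000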
Request flow through the cluster:

  You → curl http://10.0.0.10:8000/v1/chat/completions
         │
         ▼
  ┌──────────────────────────────────────────────────┐
  │ k3s-server (10.0.0.10)                           │
  │ [Frontend Pod] receives HTTP, tokenizes          │
  │ [Router Pod] checks KV cache → picks best GPU    │
  └──────┬───────────────────────────────────────────┘
         │
         ▼
  ┌──────────────────────────────────────────────────┐
  │ gpu-node-1 (10.0.0.11) — PREFILL                 │
  │ Processes full prompt through 80 layers          │
  │ Generates KV cache (~1.3 GB)                     │
  │ GPU compute: ~85% utilized                       │
  │ Time: ~45ms                                      │
  └──────┬───────────────────────────────────────────┘
         │ NIXL KV cache transfer
         ▼
  ┌──────────────────────────────────────────────────┐
  │ gpu-node-2 (10.0.0.12) — DECODE                  │
  │ Receives KV cache, generates tokens 1-by-1       │
  │ GPU bandwidth: ~88% utilized                     │
  │ ~40ms per token → streams back to you            │
  └──────────────────────────────────────────────────┘

  gpu-node-3 handles decode for other concurrent users (batched)
Day 2 Operations

Scaling, adding nodes, and monitoring

Adding a 5th GPU node

# On the new VM (10.0.0.14):
curl -sfL https://get.k3s.io | K3S_URL=https://10.0.0.10:6443 \
  K3S_TOKEN="K10abc123...::server:xyz789..." sh -

# That's it. GPU Operator auto-installs drivers via DaemonSet.
# Dynamo Planner auto-discovers the new GPU and schedules work on it.

# Verify:
sudo kubectl get nodes
# → gpu-node-4   Ready   <none>   30s   v1.31.4+k3s1

Removing a node

# On k3s-server, drain the node first so its pods get rescheduled:
kubectl drain gpu-node-4 --ignore-daemonsets --delete-emptydir-data

# On the node you want to remove:
sudo /usr/local/bin/k3s-agent-uninstall.sh

# Back on k3s-server:
kubectl delete node gpu-node-4

Upgrading K3s

# On each node (server first, then agents):
curl -sfL https://get.k3s.io | INSTALL_K3S_CHANNEL=latest sh -

# K3s upgrades in-place. No drain, no cordon for minor versions.
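
To pin an exact release instead of tracking a channel, the same install script supports INSTALL_K3S_VERSION (a sketch; substitute the version you actually want):

# Upgrade (or reinstall) to a specific K3s release:
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.31.4+k3s1 sh -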

Monitoring GPU utilization

# Quick check — GPU usage across all nodes:
kubectl exec -n gpu-operator $(kubectl get pods -n gpu-operator \
  -l app=nvidia-dcgm-exporter -o name | head -1) \
  -- dcgm-dmon -e 203,204 -d 1 -c 5

# For full observability: deploy Prometheus + Grafana
# DCGM exporter exposes GPU metrics at :9400/metrics
# Dynamo exposes inference metrics at :8000/metrics
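
For a quick look at the raw metrics without standing up Prometheus, port-forward the DCGM exporter (the service name is the GPU Operator default and may differ in your install):

# Spot-check GPU utilization metrics on :9400
kubectl -n gpu-operator port-forward svc/nvidia-dcgm-exporter 9400:9400 &
curl -s localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL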
Alternative

Quick serve without Dynamo (simpler, no disagg)

If you want to skip Dynamo and just serve a model directly on K3s (no disaggregated serving, no KV routing — just basic inference):

# Option A: vLLM directly in a pod
cat > vllm-simple.yaml << 'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
          - "--model"
          - "meta-llama/Llama-3-8B"
          - "--port"
          - "8000"
        resources:
          limits:
            nvidia.com/gpu: "1"
        ports:
        - containerPort: 8000
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  type: NodePort
  selector:
    app: vllm
  ports:
  - port: 8000
    targetPort: 8000
    nodePort: 30080
EOF

kubectl apply -f vllm-simple.yaml

# Access at: http://10.0.0.10:30080/v1/chat/completions
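
Once the pod is Running, a quick smoke test against the NodePort (the model name matches the manifest above):

curl http://10.0.0.10:30080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3-8B",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 50
  }'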

When to use Dynamo vs direct serving: Direct vLLM/SGLang is fine for small models (<13B) on 1 GPU. Use Dynamo when you need disaggregated serving across multiple GPUs, KV-aware routing, autoscaling, or you're running models that don't fit on a single GPU (70B+).

Cheat Sheet

K3s commands you'll actually use

  Task                   Command
  Install server         curl -sfL https://get.k3s.io | sh -
  Get join token         sudo cat /var/lib/rancher/k3s/server/node-token
  Join agent             curl -sfL https://get.k3s.io | K3S_URL=https://SERVER:6443 K3S_TOKEN=xxx sh -
  Kubeconfig path        /etc/rancher/k3s/k3s.yaml
  kubectl (on server)    sudo kubectl get nodes (or sudo k3s kubectl)
  Check K3s status       sudo systemctl status k3s
  K3s logs               sudo journalctl -u k3s -f
  Uninstall server       sudo /usr/local/bin/k3s-uninstall.sh
  Uninstall agent        sudo /usr/local/bin/k3s-agent-uninstall.sh
  Containerd socket      /run/k3s/containerd/containerd.sock
  Restart K3s            sudo systemctl restart k3s

Summary

The complete timeline

  Phase   What                                       Time
  0       4 Ubuntu 24.04 VMs with GPUs, swap off     ~2 min
  1       Install K3s server (curl | sh)             ~30 sec
  2       Join 3 agents (curl | sh × 3)              ~90 sec
  3       Install GPU Operator (Helm)                ~5 min
  4       Install Dynamo Platform (Helm)             ~3 min
  5       Deploy Llama-3-70B (kubectl apply)         ~10 min (model DL)
  6       Serve inference                            immediate
  Total   Bare VMs → serving LLM                     ~22 min

The Stack:

  ┌─────────────────────────────────────────┐
  │    Your App / OpenAI SDK                │
  ├─────────────────────────────────────────┤
  │    Dynamo Frontend + Router + Workers   │
  │    NIXL (KV transfer)                   │
  ├─────────────────────────────────────────┤
  │    NVIDIA GPU Operator                  │
  │    (drivers + toolkit + device plugin)  │
  ├─────────────────────────────────────────┤
  │    K3s (lightweight Kubernetes)         │
  │    containerd + Flannel + CoreDNS       │
  ├─────────────────────────────────────────┤
  │    Ubuntu 24.04                         │
  ├─────────────────────────────────────────┤
  │    4× VMs with H100 GPUs                │
  └─────────────────────────────────────────┘

  Kubespray version: ~50 minutes
  K3s version:       ~22 minutes  ← you are here
Sources

References

  1. K3s Official Docs: docs.k3s.io
  2. K3s GitHub: github.com/k3s-io/k3s
  3. K3s + NVIDIA GPU Operator (Vultr): docs.vultr.com
  4. NVIDIA GPU Operator Docs: docs.nvidia.com
  5. NVIDIA K8s Device Plugin: github.com/NVIDIA/k8s-device-plugin
  6. NVIDIA Dynamo: github.com/ai-dynamo/dynamo
  7. Kubespray (for comparison): github.com/kubernetes-sigs/kubespray

Built as an educational resource.
K3s · NVIDIA Dynamo