No Kubespray, no heavy tools. Just K3s — a single binary, 1 command per node. Build a lightweight Kubernetes cluster on 4 GPU VMs and deploy NVIDIA Dynamo for LLM inference.
K3s is a lightweight, production-ready Kubernetes distribution by Rancher (SUSE). The entire control plane is a single ~70MB binary. No Ansible, no playbooks, no 20-minute install. One curl command on the server, one on each worker, done.
If Kubespray is like hiring a general contractor with blueprints and a 20-person crew, K3s is like snapping together prefab walls. The result is the same — a functional building (a working K8s cluster) — but K3s gets there in 2 minutes instead of 20. Fewer moving parts, less to debug.
Kubespray: Ansible-based. Needs a deployer machine. ~20 min install. 500+ tasks. Great for complex, customized clusters. Overkill for 4 nodes.
K3s: Single binary. 1 curl command per node. ~2 min install. Built-in containerd, CoreDNS, Flannel, Traefik. Perfect for GPU inference clusters.
| Feature | K3s | Kubespray (kubeadm) |
|---|---|---|
| Install time | ~2 min | ~20 min |
| Install method | Single curl \| sh | Ansible playbook |
| Binary size | ~70 MB | ~500 MB+ (multiple components) |
| Container runtime | Built-in containerd | Configurable (containerd/cri-o) |
| Networking | Built-in Flannel (or Calico/Cilium) | Calico (configurable) |
| Datastore | Built-in SQLite or embedded etcd | Separate etcd cluster |
| GPU support | Via GPU Operator (same as full K8s) | Via GPU Operator (same) |
| Production ready | Yes (CNCF certified) | Yes |
| Best for | Small-medium clusters, GPU inference | Large, complex, customized clusters |
| VM | Hostname | IP | GPU | RAM | K3s Role |
|---|---|---|---|---|---|
| VM 1 | k3s-server | 10.0.0.10 | H100 80GB | 256 GB | Server (control plane) |
| VM 2 | gpu-node-1 | 10.0.0.11 | H100 80GB | 256 GB | Agent (GPU worker) |
| VM 3 | gpu-node-2 | 10.0.0.12 | H100 80GB | 256 GB | Agent (GPU worker) |
| VM 4 | gpu-node-3 | 10.0.0.13 | H100 80GB | 256 GB | Agent (GPU worker) |
Network Layout:
┌──────────────────────────────────────────────────────────────┐
│ Your Network (10.0.0.0/24) │
│ │
│ ┌─────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐
│ │ k3s-server │ │ gpu-node-1 │ │ gpu-node-2 │ │ gpu-node-3 │
│ │ 10.0.0.10 │ │ 10.0.0.11 │ │ 10.0.0.12 │ │ 10.0.0.13 │
│ │ [H100] │ │ [H100] │ │ [H100] │ │ [H100] │
│ │ │ │ │ │ │ │ │
│ │ K3s Server │ │ K3s Agent │ │ K3s Agent │ │ K3s Agent │
│ │ (control │ │ (worker) │ │ (worker) │ │ (worker) │
│ │ plane) │ │ │ │ │ │ │
│ └─────────────┘ └────────────┘ └────────────┘ └────────────┘
│ │
└──────────────────────────────────────────────────────────────┘
K3s terminology:
"Server" = control plane node (runs API server, scheduler, etcd)
"Agent" = worker node (runs pods, including GPU workloads)
This is the entire Kubernetes installation. No Ansible, no playbooks. Just curl.
Same basics on each VM — Ubuntu 24.04, swap off, forwarding on:
# Run on ALL 4 VMs:
sudo swapoff -a
sudo sed -i '/ swap / s/^/#/' /etc/fstab
echo "net.ipv4.ip_forward=1" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
One command. This installs the full control plane.
# SSH into k3s-server (10.0.0.10):
curl -sfL https://get.k3s.io | sh -

# That's it. K3s is running. Verify:
sudo kubectl get nodes
# NAME         STATUS   ROLES                  AGE   VERSION
# k3s-server   Ready    control-plane,master   30s   v1.31.4+k3s1

# Grab the join token (agents need this to join):
sudo cat /var/lib/rancher/k3s/server/node-token
# → K10abc123...::server:xyz789...
What just happened? That single curl | sh installed: containerd, kubelet, kube-proxy, kube-apiserver, kube-scheduler, kube-controller-manager, a SQLite-backed datastore (embedded etcd is the HA option), CoreDNS, Flannel networking, metrics-server, and Traefik ingress. All in one binary. All running.
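You can see those built-ins for yourself; the bundled add-ons run as regular pods (names carry generated suffixes, so the list below is indicative):

# The bundled add-ons live in kube-system:
sudo kubectl get pods -n kube-system
# Expect coredns, local-path-provisioner, metrics-server, and traefik
# (plus a completed helm-install job or two), all Running or Completed.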
One command per worker. They auto-join the cluster.
# SSH into gpu-node-1 (10.0.0.11):
curl -sfL https://get.k3s.io | K3S_URL=https://10.0.0.10:6443 \
  K3S_TOKEN="K10abc123...::server:xyz789..." sh -

# SSH into gpu-node-2 (10.0.0.12): same command
curl -sfL https://get.k3s.io | K3S_URL=https://10.0.0.10:6443 \
  K3S_TOKEN="K10abc123...::server:xyz789..." sh -

# SSH into gpu-node-3 (10.0.0.13): same command
curl -sfL https://get.k3s.io | K3S_URL=https://10.0.0.10:6443 \
  K3S_TOKEN="K10abc123...::server:xyz789..." sh -
# Back on k3s-server:
sudo kubectl get nodes -o wide
# NAME         STATUS   ROLES                  AGE   VERSION        INTERNAL-IP
# k3s-server   Ready    control-plane,master   2m    v1.31.4+k3s1   10.0.0.10
# gpu-node-1   Ready    <none>                 45s   v1.31.4+k3s1   10.0.0.11
# gpu-node-2   Ready    <none>                 40s   v1.31.4+k3s1   10.0.0.12
# gpu-node-3   Ready    <none>                 35s   v1.31.4+k3s1   10.0.0.13

# 4 nodes, all Ready. Kubernetes cluster is running.
# Total time: ~2 minutes.
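Optional: to run kubectl from your own machine instead of SSHing into the server, copy the kubeconfig out and point it at the server's IP. A minimal sketch; the SSH user and local paths are placeholders for your environment:

# On your workstation (needs kubectl installed and SSH access to k3s-server):
mkdir -p ~/.kube
ssh ubuntu@10.0.0.10 "sudo cat /etc/rancher/k3s/k3s.yaml" > ~/.kube/k3s.yaml
sed -i 's/127.0.0.1/10.0.0.10/' ~/.kube/k3s.yaml   # on macOS: sed -i ''
export KUBECONFIG=~/.kube/k3s.yaml
kubectl get nodes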
That's it for Kubernetes. With Kubespray you'd still be watching Ansible tasks scroll by. K3s is like plugging in a power strip vs wiring a house from scratch. Both give you outlets (a working K8s cluster), but one takes 2 minutes.
The cluster is running but Kubernetes doesn't know about the GPUs yet. The GPU Operator installs drivers, container toolkit, and device plugins so pods can request nvidia.com/gpu resources.
# On k3s-server, install Helm:
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash

# K3s stores its kubeconfig at a different path:
export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
# Add NVIDIA Helm repo
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install GPU Operator
# Note: K3s uses containerd at /run/k3s/containerd/containerd.sock
helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace \
  --set toolkit.env[0].name=CONTAINERD_CONFIG \
  --set toolkit.env[0].value=/var/lib/rancher/k3s/agent/etc/containerd/config.toml \
  --set toolkit.env[1].name=CONTAINERD_SOCKET \
  --set toolkit.env[1].value=/run/k3s/containerd/containerd.sock \
  --set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS \
  --set toolkit.env[2].value=nvidia

# Wait for all pods to be running (~3-5 minutes):
kubectl -n gpu-operator get pods -w
K3s-specific config: K3s uses containerd at a non-standard path (/run/k3s/containerd/). The --set toolkit.env flags above tell the GPU Operator where to find it. Without these, the NVIDIA container toolkit can't configure the runtime and GPU pods will fail.
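Once the operator pods settle, a quick sanity check of that wiring (exact file contents and resource names vary by GPU Operator version):

# On a GPU node: the toolkit should have added an nvidia runtime to K3s's containerd config
sudo grep -A2 nvidia /var/lib/rancher/k3s/agent/etc/containerd/config.toml

# In the cluster: the operator normally registers an "nvidia" RuntimeClass
kubectl get runtimeclass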
# Check each node has a GPU:
kubectl get nodes -o json | jq -r \
  '.items[] | "\(.metadata.name): \(.status.allocatable["nvidia.com/gpu"] // "no-gpu")"'
# → k3s-server: 1
# → gpu-node-1: 1
# → gpu-node-2: 1
# → gpu-node-3: 1

# Run a quick GPU test pod (a manifest avoids kubectl run's deprecated --limits flag):
cat << 'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  runtimeClassName: nvidia   # explicit; harmless if nvidia is already the default runtime
  containers:
  - name: gpu-test
    image: nvidia/cuda:12.6.0-base-ubuntu24.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: "1"
EOF
kubectl logs gpu-test        # once it has run: should show your H100 and CUDA version
kubectl delete pod gpu-test
Cluster state after GPU Operator:
k3s-server (10.0.0.10):
┌──────────────────────────────────┐
│ K3s Server (control plane) │
│ + API Server + etcd + scheduler │
│ + GPU Operator controller pods │
│ [H100 — nvidia.com/gpu: 1] │
└──────────────────────────────────┘
│
┌─────┼──────────────┐
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│gpu-node-1│ │gpu-node-2│ │gpu-node-3│
│[H100] │ │[H100] │ │[H100] │
│nvidia/gpu│ │nvidia/gpu│ │nvidia/gpu│
│: 1 │ │: 1 │ │: 1 │
│ │ │ │ │ │
│driver │ │driver │ │driver │
│toolkit │ │toolkit │ │toolkit │
│plugin │ │plugin │ │plugin │
└──────────┘ └──────────┘ └──────────┘
Every node: NVIDIA driver + container toolkit + device plugin
All installed automatically by the GPU Operator DaemonSet
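To see those per-node pieces, list the DaemonSets the operator manages (names vary slightly between GPU Operator releases):

kubectl -n gpu-operator get daemonsets
# Typically includes the driver, container-toolkit, device-plugin,
# dcgm-exporter, and gpu-feature-discovery DaemonSets.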
With the GPUs visible to Kubernetes, install NVIDIA Dynamo from NVIDIA's Helm registry:

export VERSION=0.9.0
# CRDs (Custom Resource Definitions)
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-${VERSION}.tgz
helm install dynamo-crds dynamo-crds-${VERSION}.tgz -n default
# Platform (operator + controllers)
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${VERSION}.tgz
helm install dynamo-platform dynamo-platform-${VERSION}.tgz \
-n dynamo-system --create-namespace
# HuggingFace token for model download
kubectl create secret generic hf-secret \
-n dynamo-system \
--from-literal=HF_TOKEN=hf_your_token_here
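Before deploying a model, confirm the platform came up (pod names vary by chart version):

kubectl -n dynamo-system get pods
# Expect the Dynamo operator/controller pods plus their supporting services
# (e.g. etcd and NATS) to reach Running before applying a DynamoGraphDeployment.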
Deploy Llama-3-70B with disaggregated serving: 1 prefill GPU + 2 decode GPUs across the 3 worker nodes:
cat > llama70b-disagg.yaml << 'EOF'
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
name: llama-70b
namespace: dynamo-system
spec:
services:
Frontend:
replicas: 1
PrefillWorker:
replicas: 1
resources:
limits:
nvidia.com/gpu: "1"
env:
- name: MODEL_PATH
value: "meta-llama/Llama-3-70B"
DecodeWorker:
replicas: 2
resources:
limits:
nvidia.com/gpu: "1"
env:
- name: MODEL_PATH
value: "meta-llama/Llama-3-70B"
EOF
kubectl apply -f llama70b-disagg.yaml
# Watch pods come up:
kubectl -n dynamo-system get pods -o wide -w
NAME READY STATUS NODE
llama-70b-frontend-xxxxx 1/1 Running k3s-server
llama-70b-prefill-worker-xxxxx 1/1 Running gpu-node-1
llama-70b-decode-worker-0-xxxxx 1/1 Running gpu-node-2
llama-70b-decode-worker-1-xxxxx 1/1 Running gpu-node-3
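# The curl below assumes the Frontend is reachable on the server IP at port 8000.
# If it's only exposed as a ClusterIP Service, port-forward it first.
# (Service name is assumed from the deployment name; confirm with:
#  kubectl -n dynamo-system get svc)
kubectl -n dynamo-system port-forward svc/llama-70b-frontend 8000:8000 --address 0.0.0.0 &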
# Test the API:
curl http://10.0.0.10:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3-70B",
"messages": [{"role": "user", "content": "What is K3s?"}],
"stream": true,
"max_tokens": 200
}'
Request flow through the cluster:
You → curl http://10.0.0.10:8000/v1/chat/completions
│
▼
┌──────────────────────────────────────────────────┐
│ k3s-server (10.0.0.10) │
│ [Frontend Pod] receives HTTP, tokenizes │
│ [Router Pod] checks KV cache → picks best GPU │
└──────┬───────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────┐
│ gpu-node-1 (10.0.0.11) — PREFILL │
│ Processes full prompt through 80 layers │
│ Generates KV cache (~1.3 GB) │
│ GPU compute: ~85% utilized │
│ Time: ~45ms │
└──────┬───────────────────────────────────────────┘
│ NIXL KV cache transfer
▼
┌──────────────────────────────────────────────────┐
│ gpu-node-2 (10.0.0.12) — DECODE │
│ Receives KV cache, generates tokens 1-by-1 │
│ GPU bandwidth: ~88% utilized │
│ ~40ms per token → streams back to you │
└──────────────────────────────────────────────────┘
gpu-node-3 handles decode for other concurrent users (batched)
# On the new VM (10.0.0.14):
curl -sfL https://get.k3s.io | K3S_URL=https://10.0.0.10:6443 \
  K3S_TOKEN="K10abc123...::server:xyz789..." sh -

# That's it. GPU Operator auto-installs drivers via DaemonSet.
# Dynamo Planner auto-discovers the new GPU and schedules work on it.

# Verify:
sudo kubectl get nodes
# → gpu-node-4   Ready   <none>   30s   v1.31.4+k3s1
# On the node you want to remove:
sudo /usr/local/bin/k3s-agent-uninstall.sh

# On k3s-server:
kubectl delete node gpu-node-4
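To remove a node more gracefully, drain it first so its pods are rescheduled before you uninstall:

# On k3s-server, before running the uninstall script on the node:
kubectl drain gpu-node-4 --ignore-daemonsets --delete-emptydir-data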
# On each node (server first, then agents):
curl -sfL https://get.k3s.io | INSTALL_K3S_CHANNEL=latest sh -

# K3s upgrades in-place. No drain, no cordon for minor versions.
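To control exactly what version lands on each node, pin a release instead of tracking a channel (INSTALL_K3S_VERSION is the installer's env var for this; the version string is just an example):

# Pin an explicit K3s release rather than following "latest":
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.31.4+k3s1 sh -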
# Quick check — GPU usage across all nodes:
kubectl exec -n gpu-operator $(kubectl get pods -n gpu-operator \
  -l app=nvidia-dcgm-exporter -o name | head -1) \
  -- dcgm-dmon -e 203,204 -d 1 -c 5

# For full observability: deploy Prometheus + Grafana
# DCGM exporter exposes GPU metrics at :9400/metrics
# Dynamo exposes inference metrics at :8000/metrics
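One way to get that Prometheus + Grafana stack, sketched with the community kube-prometheus-stack chart (the release name "monitoring" and the derived secret name are assumptions; dashboards and scrape configs for the DCGM and Dynamo endpoints are left to you):

# Prometheus + Grafana via the community Helm chart:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install monitoring prometheus-community/kube-prometheus-stack \
  -n monitoring --create-namespace

# Grafana admin password (secret name follows the release name):
kubectl -n monitoring get secret monitoring-grafana \
  -o jsonpath='{.data.admin-password}' | base64 -d; echo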
If you want to skip Dynamo and just serve a model directly on K3s (no disaggregated serving, no KV routing — just basic inference):
# Option A: vLLM directly in a pod
cat > vllm-simple.yaml << 'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
name: vllm-server
spec:
replicas: 1
selector:
matchLabels:
app: vllm
template:
metadata:
labels:
app: vllm
spec:
containers:
- name: vllm
image: vllm/vllm-openai:latest
args:
- "--model"
- "meta-llama/Llama-3-8B"
- "--port"
- "8000"
resources:
limits:
nvidia.com/gpu: "1"
ports:
- containerPort: 8000
---
apiVersion: v1
kind: Service
metadata:
name: vllm-service
spec:
type: NodePort
selector:
app: vllm
ports:
- port: 8000
targetPort: 8000
nodePort: 30080
EOF
kubectl apply -f vllm-simple.yaml
# Access at: http://10.0.0.10:30080/v1/chat/completions
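A quick smoke test against that NodePort (same OpenAI-style API; the model name must match the --model flag above):

curl http://10.0.0.10:30080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3-8B",
    "messages": [{"role": "user", "content": "What is K3s?"}],
    "max_tokens": 100
  }'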
When to use Dynamo vs direct serving: Direct vLLM/SGLang is fine for small models (<13B) on 1 GPU. Use Dynamo when you need disaggregated serving across multiple GPUs, KV-aware routing, autoscaling, or you're running models that don't fit on a single GPU (70B+).
| Task | Command |
|---|---|
| Install server | curl -sfL https://get.k3s.io \| sh - |
| Get join token | sudo cat /var/lib/rancher/k3s/server/node-token |
| Join agent | curl -sfL https://get.k3s.io \| K3S_URL=https://SERVER:6443 K3S_TOKEN=xxx sh - |
| Kubeconfig path | /etc/rancher/k3s/k3s.yaml |
| kubectl (on server) | sudo kubectl get nodes (or sudo k3s kubectl) |
| Check K3s status | sudo systemctl status k3s |
| K3s logs | sudo journalctl -u k3s -f |
| Uninstall server | sudo /usr/local/bin/k3s-uninstall.sh |
| Uninstall agent | sudo /usr/local/bin/k3s-agent-uninstall.sh |
| Containerd socket | /run/k3s/containerd/containerd.sock |
| Restart K3s | sudo systemctl restart k3s |
| Phase | What | Time |
|---|---|---|
| 0 | 4 Ubuntu 24.04 VMs with GPUs, swap off | ~2 min |
| 1 | Install K3s server (curl \| sh) | ~30 sec |
| 2 | Join 3 agents (curl \| sh × 3) | ~90 sec |
| 3 | Install GPU Operator (Helm) | ~5 min |
| 4 | Install Dynamo Platform (Helm) | ~3 min |
| 5 | Deploy Llama-3-70B (kubectl apply) | ~10 min (model DL) |
| 6 | Serve inference | immediate |
| Total | Bare VMs → serving LLM | ~22 min |
The Stack:

┌─────────────────────────────────────────┐
│  Your App / OpenAI SDK                  │
├─────────────────────────────────────────┤
│  Dynamo Frontend + Router + Workers     │
│  NIXL (KV transfer)                     │
├─────────────────────────────────────────┤
│  NVIDIA GPU Operator                    │
│  (drivers + toolkit + device plugin)    │
├─────────────────────────────────────────┤
│  K3s (lightweight Kubernetes)           │
│  containerd + Flannel + CoreDNS         │
├─────────────────────────────────────────┤
│  Ubuntu 24.04                           │
├─────────────────────────────────────────┤
│  4× VMs with H100 GPUs                  │
└─────────────────────────────────────────┘

Kubespray version: ~50 minutes
K3s version:       ~22 minutes  ← you are here
Built as an educational resource.
K3s · NVIDIA Dynamo