Kubernetes Cluster Setup¶
Note
This is a guide for cluster administrators on how to set up Kubernetes clusters for use with SkyPilot.
If you are a SkyPilot user and your cluster administrator has already set up a cluster and shared a kubeconfig file with you, Submitting tasks to Kubernetes explains how to submit tasks to your cluster.
SkyPilot’s Kubernetes support is designed to work with most Kubernetes distributions and deployment environments.
To connect to a Kubernetes cluster, SkyPilot needs:
An existing Kubernetes cluster running Kubernetes v1.20 or later.
A kubeconfig file containing the access credentials and namespace to be used.
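You can confirm the version of the cluster your kubeconfig points to with kubectl (a quick check; assumes kubectl is installed and configured):
$ kubectl version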
Deployment Guides¶
Below we show minimal examples for setting up a new Kubernetes cluster in different environments, including hosted services on the cloud, and for generating kubeconfig files that can be used with SkyPilot.
Deploying locally on your laptop¶
To try out SkyPilot on Kubernetes on your laptop, or to run SkyPilot tasks locally without requiring any cloud access, we provide the sky local up CLI to create a 1-node Kubernetes cluster locally.
Under the hood, sky local up uses kind, a tool for creating Kubernetes clusters on your local machine. It runs the Kubernetes cluster inside a Docker container, so no additional setup is required.
Run sky local up to launch a Kubernetes cluster and automatically configure your kubeconfig file:
$ sky local up
Run sky check and verify that Kubernetes is enabled in SkyPilot. You can now run SkyPilot tasks on this locally hosted Kubernetes cluster using sky launch.
After you are done using the cluster, you can remove it with sky local down. This will terminate the kind container and switch your kubeconfig back to its original context:
$ sky local down
Note
We recommend allocating at least 4 CPUs to your Docker runtime to ensure kind has enough resources. See instructions here.
Note
kind does not support multiple nodes or GPUs. It is not recommended for use in a production environment. If you want to run a private on-prem cluster, see the section on on-prem deployment for more.
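To inspect the local cluster directly, you can point kubectl at the kind context (a sketch; sky local up typically creates a context named kind-skypilot, but run kubectl config get-contexts to confirm the name on your machine):
$ kubectl get nodes --context kind-skypilot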
Deploying on Google Cloud GKE¶
Create a GKE standard cluster with at least 1 node. We recommend creating nodes with at least 4 vCPUs.
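For example, a minimal zonal cluster with a single 4-vCPU node can be created with gcloud (a sketch; the zone and machine type are illustrative):
$ gcloud container clusters create testcluster \
    --zone us-central1-c --num-nodes 1 --machine-type e2-standard-4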
Get the kubeconfig for your cluster. The following command will automatically update ~/.kube/config with a new kubecontext for the GKE cluster:
$ gcloud container clusters get-credentials <cluster-name> --region <region>
# Example:
# gcloud container clusters get-credentials testcluster --region us-central1-c
[If using GPUs] If your GKE nodes have GPUs, you may need to manually install Nvidia drivers. You can do so by deploying the daemonset corresponding to the OS of your nodes:
# For Container Optimized OS (COS) based nodes:
$ kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml

# For Ubuntu based nodes:
$ kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/ubuntu/daemonset-preloaded.yaml
To verify that the GPU drivers are set up, run kubectl describe nodes and check that nvidia.com/gpu is listed under the Capacity section.
Verify that your kubeconfig (and GPU support, if available) is correctly set up by running sky check:
$ sky check
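If GPUs do not show up, a quick way to list per-node GPU capacity is the following one-liner (a sketch; the \. escaping is required for label-style resource keys in kubectl's custom-columns output):
$ kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.capacity.nvidia\.com/gpu'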
Note
GKE Autopilot clusters are currently not supported. Only GKE Standard clusters are supported.
Deploying on Amazon EKS¶
Create an EKS cluster with at least 1 node. We recommend creating nodes with at least 4 vCPUs.
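For example, using eksctl (a sketch; assumes eksctl is installed, and the node type is illustrative):
$ eksctl create cluster --name testcluster --region us-west-2 \
    --nodes 1 --node-type m5.xlarge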
Get the kubeconfig for your cluster. The following command will automatically update ~/.kube/config with a new kubecontext for the EKS cluster:
$ aws eks update-kubeconfig --name <cluster-name> --region <region>
# Example:
# aws eks update-kubeconfig --name testcluster --region us-west-2
[If using GPUs] EKS clusters already come with Nvidia drivers set up. However, you will need to label the nodes with the GPU type. Use the SkyPilot node labeling tool to do so:
$ python -m sky.utils.kubernetes.gpu_labeler
This will create a job on each node to read the GPU type from nvidia-smi and assign a skypilot.co/accelerator label to the node. You can check the status of these jobs by running:
$ kubectl get jobs -n kube-system
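Once the jobs complete, you can confirm that the labels were applied (the -L flag prints the label value as an extra column):
$ kubectl get nodes -L skypilot.co/accelerator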
Verify your kubeconfig (and GPU support, if available) is correctly set up by running sky check:
$ sky check
Deploying on on-prem clusters¶
You can also deploy Kubernetes on your on-prem clusters using off-the-shelf tools, such as kubeadm, k3s or Rancher. Please follow their respective guides to deploy your Kubernetes cluster.
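For example, a single-node cluster can be brought up with k3s's install script (a sketch; see the k3s documentation for production-grade settings):
$ curl -sfL https://get.k3s.io | sh -
# k3s writes its kubeconfig to /etc/rancher/k3s/k3s.yaml; copy it for SkyPilot to use:
$ sudo cp /etc/rancher/k3s/k3s.yaml ~/.kube/config
$ sudo chown $USER ~/.kube/config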
Setting up GPU support¶
If your Kubernetes cluster has Nvidia GPUs, make sure you have the Nvidia device plugin installed (i.e., the nvidia.com/gpu resource is available on each node).
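One common way to install the device plugin is to apply Nvidia's daemonset manifest (a sketch; the version pinned below is an assumption, so check the NVIDIA/k8s-device-plugin repository for the latest release):
$ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml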
Additionally, you will need to label each node in your cluster with the GPU type. For example, a node with V100 GPUs must have the label skypilot.co/accelerator: v100.
We provide a convenience script that automatically detects GPU types and labels each node. You can run it with:
$ python -m sky.utils.kubernetes.gpu_labeler
Created GPU labeler job for node ip-192-168-54-76.us-west-2.compute.internal
Created GPU labeler job for node ip-192-168-93-215.us-west-2.compute.internal
GPU labeling started - this may take a few minutes to complete.
To check the status of GPU labeling jobs, run `kubectl get jobs --namespace=kube-system -l job=sky-gpu-labeler`
You can check if nodes have been labeled by running `kubectl describe nodes` and looking for labels of the format `skypilot.co/accelerator: <gpu_name>`.
Note
GPU labeling is not required on GKE clusters - SkyPilot will automatically use GKE-provided labels. However, you will still need to install drivers.
Note
If the GPU labeling process fails, you can run python -m sky.utils.kubernetes.gpu_labeler --cleanup to clean up the failed jobs.
Once the cluster is deployed and you have placed your kubeconfig at ~/.kube/config, verify your setup by running sky check:
$ sky check
Observability for Administrators¶
All SkyPilot tasks are run in pods inside a Kubernetes cluster. As a cluster administrator,
you can inspect running pods (e.g., with kubectl get pods -n <namespace>) to check which
tasks are running and how many resources they are consuming on the cluster.
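For example, to list the pods in a namespace and their current resource usage (the second command assumes metrics-server is installed in the cluster):
$ kubectl get pods -n <namespace>
$ kubectl top pod -n <namespace>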
Additionally, you can also deploy tools such as the Kubernetes dashboard for easily viewing and managing SkyPilot tasks running on your cluster.

As a demo, we provide a sample Kubernetes dashboard deployment manifest that you can deploy with:
$ kubectl apply -f https://raw.githubusercontent.com/skypilot-org/skypilot/master/tests/kubernetes/scripts/dashboard.yaml
To access the dashboard, run:
$ kubectl proxy
In a browser, open http://localhost:8001/api/v1/namespaces/kubernetes-dashboard/services/https:kubernetes-dashboard:/proxy/ and click on Skip when prompted for credentials.
Note that this dashboard can only be accessed from the machine where the kubectl proxy
command is executed.
Note
The demo dashboard is not secure and should not be used in production. Please refer to the Kubernetes documentation for more information on how to set up access control for the dashboard.