SkyPilot’s job queue allows multiple jobs to be scheduled on a cluster.
Each task submitted by
sky exec is automatically queued and scheduled
for execution on an existing cluster:
    # Launch the job 5 times.
    sky exec mycluster task.yaml -d
    sky exec mycluster task.yaml -d
    sky exec mycluster task.yaml -d
    sky exec mycluster task.yaml -d
    sky exec mycluster task.yaml -d
The -d / --detach flag detaches logging from the terminal, which is useful for launching many long-running jobs concurrently.
To show a cluster’s jobs and their statuses:
    # Show a cluster's jobs (job IDs, statuses).
    sky queue mycluster
To show the output for each job:
    # Stream the outputs of a job.
    sky logs mycluster JOB_ID
To cancel a job:
    # Cancel a job.
    sky cancel mycluster JOB_ID

    # Cancel all jobs on a cluster.
    sky cancel mycluster --all
Jobs that run on multiple nodes are also supported by the job queue.
First, create a
cluster.yaml to specify the desired cluster:
    num_nodes: 4

    resources:
      accelerators: V100:8

    workdir: ...

    setup: |
      # Install dependencies.
      ...
Use sky launch -c mycluster cluster.yaml to provision a 4-node cluster (each node having 8 V100 GPUs).
The num_nodes field specifies how many nodes are required.
Next, create a
task.yaml to specify each task:
    num_nodes: 2

    resources:
      accelerators: V100:4

    run: |
      # Run training script.
      ...
This specifies a task that needs to be run on 2 nodes, each of which must have 4 free V100s.
Use sky exec mycluster task.yaml to submit this task; the job queue will schedule it onto 2 nodes that each have 4 free V100s.
See Distributed Jobs on Many VMs for more details.
The environment variable
CUDA_VISIBLE_DEVICES will be automatically set to
the devices allocated to each task on each node. This variable is set
when a task’s
run commands are invoked.
For example, the task.yaml above launches a 4-GPU task on each node that has 8
GPUs, so the task’s
run commands will be invoked with
CUDA_VISIBLE_DEVICES populated with 4 device IDs.
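If your run commands use Docker/docker run, simply pass the allocated GPUs
through to the container (for example, with docker run’s --gpus option);
the correct environment variable will then be set inside the container (only the
allocated device IDs will be set).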
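A training script can check which devices it was given by reading this variable. The following is a minimal sketch, not part of SkyPilot: the helper name allocated_gpu_ids is hypothetical, and the IDs are kept as strings because CUDA_VISIBLE_DEVICES may contain either indices or GPU UUIDs.

```python
import os


def allocated_gpu_ids(env=None):
    """Return the entries of CUDA_VISIBLE_DEVICES as a list of strings.

    `env` defaults to the process environment; passing a dict makes the
    helper easy to try out without a GPU.
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES", "")
    # Entries are comma-separated; skip empty pieces and stray whitespace.
    return [part.strip() for part in raw.split(",") if part.strip()]


if __name__ == "__main__":
    ids = allocated_gpu_ids()
    print(f"This task sees {len(ids)} GPU(s): {ids}")
```

For the task.yaml above, a script calling this helper from its run commands would be expected to see 4 entries on each node.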
Example: Grid Search
To submit multiple trials with different hyperparameters to a cluster:
    $ sky exec mycluster --gpus V100:1 -d -- python train.py --lr 1e-3
    $ sky exec mycluster --gpus V100:1 -d -- python train.py --lr 3e-3
    $ sky exec mycluster --gpus V100:1 -d -- python train.py --lr 1e-4
    $ sky exec mycluster --gpus V100:1 -d -- python train.py --lr 1e-2
    $ sky exec mycluster --gpus V100:1 -d -- python train.py --lr 1e-6
--gpus: specify the resource requirement for each job.
--detach: detach the run and logging from the terminal, allowing multiple trials to run concurrently.
If there are only 4 V100 GPUs on the cluster, SkyPilot will queue 1 job while the other 4 run in parallel. Once a job finishes, the next job will begin executing immediately. See below for more details on SkyPilot’s scheduling behavior.
You can also use environment variables to set different arguments for each trial.
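For larger grids, the submission commands can be generated instead of typed by hand. The sketch below is one possible approach, assuming the same train.py script and mycluster name used above; the function grid_search_commands is a hypothetical helper, and it only uses the --gpus, -d, and -- arguments already shown. It prints the commands rather than running them.

```python
import shlex


def grid_search_commands(cluster, learning_rates, gpus="V100:1"):
    """Build one detached `sky exec` argument list per learning rate."""
    commands = []
    for lr in learning_rates:
        argv = [
            "sky", "exec", cluster,
            "--gpus", gpus,   # resource requirement for each trial
            "-d",             # detach so trials can run concurrently
            "--", "python", "train.py", "--lr", lr,
        ]
        commands.append(argv)
    return commands


if __name__ == "__main__":
    rates = ["1e-3", "3e-3", "1e-4", "1e-2", "1e-6"]
    for argv in grid_search_commands("mycluster", rates):
        # Printed for inspection; to submit for real, hand each argv
        # to subprocess.run(argv, check=True).
        print(shlex.join(argv))
```

Keeping each command as an argument list avoids shell-quoting mistakes if a hyperparameter value ever contains spaces or special characters.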
Example: Fractional GPUs
To run multiple trials per GPU, use fractional GPUs in the resource requirement.
For example, use
--gpus V100:0.5 to make 2 trials share 1 GPU:
    $ sky exec mycluster --gpus V100:0.5 -d -- python train.py --lr 1e-3
    $ sky exec mycluster --gpus V100:0.5 -d -- python train.py --lr 3e-3
    ...
When sharing a GPU, ensure that the GPU’s memory is not oversubscribed (otherwise, out-of-memory errors could occur).
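A quick back-of-the-envelope check can help pick a fraction. The sketch below assumes that trials sharing a GPU split its memory evenly, which is a simplification for illustration and not something the source says is enforced; the helper names and the 16 GiB figure are placeholders, so substitute the memory size of your own GPUs.

```python
from fractions import Fraction


def trials_per_gpu(gpu_fraction):
    """Number of trials requesting `gpu_fraction` of a GPU that fit on one GPU."""
    # Going through str() avoids binary floating-point surprises (e.g. 0.1).
    frac = Fraction(str(gpu_fraction))
    if not 0 < frac <= 1:
        raise ValueError("gpu_fraction must be in (0, 1]")
    return int(1 / frac)  # round down: partial trials do not fit


def memory_budget_gib(gpu_memory_gib, gpu_fraction):
    """Per-trial memory budget, ASSUMING an even split between co-located trials."""
    return gpu_memory_gib / trials_per_gpu(gpu_fraction)


if __name__ == "__main__":
    # 16 GiB is a placeholder capacity, not a value taken from the text above.
    print(trials_per_gpu("0.5"), memory_budget_gib(16, "0.5"))
```

If a single trial's peak memory exceeds the computed budget, a larger fraction is the safer choice.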
SkyPilot’s scheduler serves two goals:
Preventing resource oversubscription: SkyPilot schedules jobs on a cluster using their resource requirements, specified either in a task YAML’s
resources field or via the
--gpus option of the
sky exec CLI command. SkyPilot honors these resource requirements while ensuring that no resource in the cluster is oversubscribed. For example, if a node has 4 GPUs, it cannot host a combination of tasks whose GPU requirements sum to more than 4.
Minimizing resource idleness: If a resource is idle, SkyPilot will schedule a queued job that can utilize that resource.
We illustrate the scheduling behavior by revisiting Tutorial: DNN Training. In that tutorial, we have a task YAML that specifies these resource requirements:
    # dnn.yaml
    ...
    resources:
      accelerators: V100:4
    ...
Since a new cluster was created when we ran
sky launch -c lm-cluster
dnn.yaml, SkyPilot provisioned the cluster with exactly the same resources as those
required for the task. Thus,
lm-cluster has 4 V100 GPUs.
While this initial job is running, let us submit more tasks:
    $ # Launch 4 jobs, perhaps with different hyperparameters.
    $ # You can override the task name with `-n` (optional) and
    $ # the resource requirement with `--gpus` (optional).
    $ sky exec lm-cluster dnn.yaml -d -n job2 --gpus=V100:1
    $ sky exec lm-cluster dnn.yaml -d -n job3 --gpus=V100:1
    $ sky exec lm-cluster dnn.yaml -d -n job4 --gpus=V100:4
    $ sky exec lm-cluster dnn.yaml -d -n job5 --gpus=V100:2
Because the cluster has only 4 V100 GPUs, we will see the following sequence of events:
The initial sky launch job is running and occupies 4 GPUs; all other jobs are pending (no free GPUs).
Once the initial job finishes, the first two
sky exec jobs (job2, job3) start running and occupy 1 GPU each.
The third job (job4) will be pending, since it requires 4 GPUs and there are only 2 free GPUs left.
The fourth job (job5) will start running, since its requirement is fulfilled with the 2 free GPUs.
Once job2, job3, and job5 finish, the cluster’s 4 GPUs become free again and job4 will transition from pending to running.
Thus, we may see the following job statuses on this cluster:
    $ sky queue lm-cluster

     ID  NAME         USER  SUBMITTED    STARTED      STATUS
     5   job5         user  10 mins ago  10 mins ago  RUNNING
     4   job4         user  10 mins ago  -            PENDING
     3   job3         user  10 mins ago  9 mins ago   RUNNING
     2   job2         user  10 mins ago  9 mins ago   RUNNING
     1   huggingface  user  10 mins ago  1 min ago    SUCCEEDED
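The order in which job2 through job5 start can be reproduced with a toy model. This is only an illustrative sketch, not SkyPilot's actual scheduler: it assumes a single 4-GPU node, scans pending jobs in submission order, and starts every job that fits in the currently free GPUs even when an earlier job is still blocked (which matches job5 starting ahead of job4). The function start_runnable is a hypothetical name.

```python
def start_runnable(free_gpus, pending):
    """Start every pending job that fits, scanning in submission order.

    `pending` is a list of (name, gpus_required) tuples.
    Returns (started_names, still_pending, free_gpus_left).
    """
    started, still_pending = [], []
    for name, need in pending:
        if need <= free_gpus:
            free_gpus -= need          # reserve GPUs so nothing is oversubscribed
            started.append(name)
        else:
            still_pending.append((name, need))
    return started, still_pending, free_gpus


queue = [("job2", 1), ("job3", 1), ("job4", 4), ("job5", 2)]

# The initial job holds all 4 GPUs, so nothing can start yet.
started, queue, free = start_runnable(0, queue)
print(started)               # []

# The initial job finishes and frees 4 GPUs.
started, queue, free = start_runnable(4, queue)
print(started, queue, free)  # ['job2', 'job3', 'job5'] [('job4', 4)] 0

# job2, job3 and job5 finish, so all 4 GPUs are free again.
started, queue, free = start_runnable(4, queue)
print(started, queue, free)  # ['job4'] [] 0
```

Both scheduling goals are visible here: the running total never exceeds the 4 available GPUs, and the 2 GPUs left over after job2 and job3 go to job5 rather than sitting idle while job4 waits.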