Ollama: Run quantized LLMs on CPUs and GPUs

Ollama is a popular library for running LLMs on both CPUs and GPUs. It supports a wide range of models, including quantized versions of llama2, llama2:70b, mistral, phi, gemma:7b and many more. You can use SkyPilot to run these models on CPU instances on any cloud provider, Kubernetes cluster, or even on your local machine. If your instance has GPUs, Ollama automatically uses them for faster inference.

In this example, you will run a quantized version of Llama2 on 4 CPUs with 8GB of memory, and then scale it up to more replicas with SkyServe.


To get started, install the latest version of SkyPilot:

pip install "skypilot-nightly[all]"

For detailed installation instructions, please refer to the installation guide.

Once installed, run sky check to verify you have cloud access.

[Optional] Running locally on your machine

If you do not have cloud access, you can also run this recipe on your local machine by creating a local Kubernetes cluster with sky local up.

Make sure you have KinD installed and Docker running with 5 or more CPUs and 10GB or more of memory allocated to the Docker runtime.
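
The resource requirement above can be checked before creating the cluster. The following is a minimal sketch, not part of SkyPilot: `meets_local_requirements` is a hypothetical helper that takes a CPU count and a memory size in GB and compares them with the 5 CPU and 10 GB minimums stated above. How you obtain the two numbers depends on your Docker setup; `docker info --format '{{.NCPU}}'` is one way to read the CPU count.

```shell
# Hypothetical helper (not part of SkyPilot): succeed only if the Docker
# runtime has at least 5 CPUs and 10 GB of memory, the minimums given above.
# Usage: meets_local_requirements <cpus> <memory_gb>
meets_local_requirements() {
  cpus="$1"
  memory_gb="$2"
  [ "$cpus" -ge 5 ] && [ "$memory_gb" -ge 10 ]
}

# Example values; replace them with what your Docker runtime reports.
if meets_local_requirements 8 16; then
  echo "Docker has enough resources for sky local up"
else
  echo "Increase the CPUs or memory allocated to Docker"
fi
```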

To create a local Kubernetes cluster, run:

sky local up
Example outputs:
$ sky local up
Creating local cluster...
To view detailed progress: tail -n100 -f ~/sky_logs/sky-2024-04-09-19-14-03-599730/local_up.log
I 04-09 19:14:33 log_utils.py:79] Kubernetes is running.
I 04-09 19:15:33 log_utils.py:117] SkyPilot CPU image pulled.
I 04-09 19:15:49 log_utils.py:123] Nginx Ingress Controller installed.
⠸ Running sky check...
Local Kubernetes cluster created successfully with 16 CPUs.
`sky launch` can now run tasks locally.
Hint: To change the number of CPUs, change your docker runtime settings. See https://kind.sigs.k8s.io/docs/user/quick-start/#settings-for-docker-desktop for more info.

After running this, sky check should show that you have access to a Kubernetes cluster.

SkyPilot YAML

To run Ollama with SkyPilot, create a YAML file with the following content:

envs:
  MODEL_NAME: llama2  # mistral, phi, other ollama supported models
  OLLAMA_HOST: 0.0.0.0:8888  # Host and port for Ollama to listen on

resources:
  cpus: 4+
  memory: 8+  # 8 GB+ for 7B models, 16 GB+ for 13B models, 32 GB+ for 33B models
  # accelerators: L4:1  # No GPUs necessary for Ollama, but you can use them to run inference faster
  ports: 8888

service:
  replicas: 2
  # An actual request for readiness probe.
  readiness_probe:
    path: /v1/chat/completions
    post_data:
      model: $MODEL_NAME
      messages:
        - role: user
          content: Hello! What is your name?
      max_tokens: 1

setup: |
  # Install Ollama
  if [ "$(uname -m)" == "aarch64" ]; then
    # For apple silicon support
    sudo curl -L https://ollama.com/download/ollama-linux-arm64 -o /usr/bin/ollama
  else
    sudo curl -L https://ollama.com/download/ollama-linux-amd64 -o /usr/bin/ollama
  fi
  sudo chmod +x /usr/bin/ollama

  # Start `ollama serve` and capture PID to kill it after pull is done
  ollama serve &
  OLLAMA_PID=$!

  # Wait for ollama to be ready
  IS_READY=false
  for i in {1..20};
    do ollama list && IS_READY=true && break;
    sleep 5;
  done
  if [ "$IS_READY" = false ]; then
      echo "Ollama was not ready after 100 seconds. Exiting."
      exit 1
  fi

  # Pull the model
  ollama pull $MODEL_NAME
  echo "Model $MODEL_NAME pulled successfully."

  # Kill `ollama serve` after pull is done
  kill $OLLAMA_PID

run: |
  # Run `ollama serve` in the foreground
  echo "Serving model $MODEL_NAME"
  ollama serve

You can also get the full YAML here.
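
The readiness loop in the setup step retries `ollama list` up to 20 times, sleeping 5 seconds between attempts. The same pattern can be sketched as a reusable function. `wait_until_ready` is a hypothetical name, not part of Ollama or SkyPilot, and the sketch sticks to POSIX shell so it does not depend on bash features.

```shell
# Hypothetical helper (not part of Ollama or SkyPilot): retry a command
# until it succeeds or the attempt budget runs out.
# Usage: wait_until_ready <max_attempts> <sleep_seconds> <command...>
wait_until_ready() {
  max_attempts="$1"
  delay="$2"
  shift 2
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # Run the probe quietly and return as soon as it succeeds.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # Skip the sleep after the final failed attempt.
    if [ "$attempt" -lt "$max_attempts" ]; then
      sleep "$delay"
    fi
    attempt=$((attempt + 1))
  done
  return 1
}

# Demo with a probe that always succeeds. In the recipe, the probe would be
# `ollama list`, for example: wait_until_ready 20 5 ollama list
wait_until_ready 3 0 true && echo "ready"
```

Returning a status code instead of setting a flag lets the caller decide how to fail, for example by chaining `|| exit 1` after the call.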

Serving Llama2 with a CPU instance

Start serving Llama2 on a 4 CPU instance with the following command:

sky launch ollama.yaml -c ollama --detach-run

Wait until the sky launch command returns successfully.

Example outputs:
== Optimizer ==
Target: minimizing cost
Estimated cost: $0.0 / hour

Considered resources (1 node):
 CLOUD        INSTANCE            vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE     COST ($)   CHOSEN   
 Kubernetes   4CPU--8GB           4       8         -              kubernetes      0.00          ✔     
 AWS          c6i.xlarge          4       8         -              us-east-1       0.17                
 Azure        Standard_F4s_v2     4       8         -              eastus          0.17                
 GCP          n2-standard-4       4       16        -              us-central1-a   0.19                
 Fluidstack   rec3pUyh6pNkIjCaL   6       24        RTXA4000:1     norway_4_eu     0.64                

💡Tip: You can further reduce costs by using the --use-spot flag to run on spot instances.

To launch a different model, use the MODEL_NAME environment variable:

sky launch ollama.yaml -c ollama --detach-run --env MODEL_NAME=mistral

Ollama supports llama2, llama2:70b, mistral, phi, gemma:7b and many more models. See the full list here.
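
Larger models need more memory than the default `memory: 8+`. As a rough sketch, the sizing comment in the YAML (8 GB+ for 7B, 16 GB+ for 13B, 32 GB+ for 33B models) can be turned into a lookup. `recommended_memory` is a hypothetical helper, and rounding in-between sizes up to the next tier is an assumption rather than guidance from Ollama or SkyPilot.

```shell
# Hypothetical helper (not part of Ollama or SkyPilot): print a value for the
# `memory` field based on the sizing comment in the recipe YAML.
# Usage: recommended_memory <parameter count in billions>
recommended_memory() {
  params_b="$1"
  if [ "$params_b" -le 7 ]; then
    echo "8+"
  elif [ "$params_b" -le 13 ]; then
    echo "16+"
  elif [ "$params_b" -le 33 ]; then
    echo "32+"
  else
    # The recipe gives no guidance beyond 33B models.
    echo "unknown"
    return 1
  fi
}

recommended_memory 13   # prints 16+
```

The printed value can then be used for the `memory` field in the YAML before launching the larger model.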

Once the sky launch command returns successfully, you can interact with the model via:

  • Standard OpenAI-compatible endpoints (e.g., /v1/chat/completions)

  • Ollama API
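
For the second option, Ollama's native API is served on the same port. The following is a minimal sketch that assumes the `/api/generate` endpoint and its `model`, `prompt`, and `stream` fields as described in the Ollama API documentation. `generate_payload` is a hypothetical helper that only builds the request body, so it can be tried without a running cluster.

```shell
# Hypothetical helper: print the JSON body for Ollama's /api/generate endpoint.
# Assumes the prompt contains no double quotes, backslashes, or newlines,
# since no JSON escaping is done here.
# Usage: generate_payload <model> <prompt>
generate_payload() {
  printf '{"model": "%s", "prompt": "%s", "stream": false}\n' "$1" "$2"
}

generate_payload llama2 "Why is the sky blue?"

# To send it to the server launched above (requires a running cluster):
#   ENDPOINT=$(sky status --endpoint 8888 ollama)
#   curl "$ENDPOINT/api/generate" -d "$(generate_payload llama2 'Why is the sky blue?')"
```

Setting `stream` to false asks for a single JSON response instead of a stream of partial results, which is easier to handle in scripts.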

To curl /v1/chat/completions:

ENDPOINT=$(sky status --endpoint 8888 ollama)
curl $ENDPOINT/v1/chat/completions \
 -H "Content-Type: application/json" \
 -d '{
       "model": "llama2",
       "messages": [
         {
           "role": "system",
           "content": "You are a helpful assistant."
         },
         {
           "role": "user",
           "content": "Who are you?"
         }
       ]
     }'
Example curl response:
{
  "id": "chatcmpl-322",
  "object": "chat.completion",
  "created": 1712015174,
  "model": "llama2",
  "system_fingerprint": "fp_ollama",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello there! *adjusts glasses* I am Assistant, your friendly and helpful AI companion. My purpose is to assist you in any way possible, from answering questions to providing information on a wide range of topics. Is there something specific you would like to know or discuss? Feel free to ask me anything!"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 29,
    "completion_tokens": 68,
    "total_tokens": 97
  }
}

💡Tip: To speed up inference, you can use GPUs by specifying the accelerators field in the YAML.

To stop the instance:

sky stop ollama

To shut down all resources:

sky down ollama

If you are using a local Kubernetes cluster created with sky local up, shut it down with:

sky local down

Serving LLMs on CPUs at scale with SkyServe

After experimenting with the model, you can deploy multiple replicas of the model with autoscaling and load-balancing using SkyServe.

With no change to the YAML, launch a fully managed service on your infra:

sky serve up ollama.yaml -n ollama

Wait until the service is ready:

watch -n10 sky serve status ollama
Example outputs:
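Services
NAME    VERSION  UPTIME  STATUS  REPLICAS
ollama  1        3m 15s  READY   2/2

Service Replicas
SERVICE_NAME  ID  VERSION  LAUNCHED    RESOURCES       STATUS  REGION
ollama        1   1        4 mins ago  1x GCP(vCPU=4)  READY   us-central1
ollama        2   1        4 mins ago  1x GCP(vCPU=4)  READY   us-central1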

Get a single endpoint that load-balances across replicas:

ENDPOINT=$(sky serve status --endpoint ollama)

💡Tip: SkyServe fully manages the lifecycle of your replicas. For example, if a spot replica is preempted, the controller will automatically replace it. This significantly reduces the operational burden while saving costs.

To curl the endpoint:

curl -L $ENDPOINT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
       "model": "llama2",
       "messages": [
         {
           "role": "system",
           "content": "You are a helpful assistant."
         },
         {
           "role": "user",
           "content": "Who are you?"
         }
       ]
     }'

To shut down all resources:

sky serve down ollama

See more details in the SkyServe docs.