Code Llama: Serve Your Private Code Model

Code Llama is a code-specialized version of Llama 2, created by further training Llama 2 on its code-specific datasets and sampling more data from those same datasets for longer. On Jan 29th, 2024, Meta released Code Llama 70B, the largest and best-performing model in the Code Llama family.

The following demos show Code Llama 70B hosted by SkyPilot Serve (aka SkyServe); the setup is described in later sections:


Coding Assistant: Connect to hosted Code Llama with Tabby in VS Code
Chat: Connect to hosted Code Llama with FastChat


Why use SkyPilot to deploy over commercial hosted solutions?

  • Get the best GPU availability by using multiple resource pools across multiple regions and clouds.

  • Pay the absolute minimum: SkyPilot picks the cheapest resources across regions and clouds, with no managed-solution markups.

  • Scale up to multiple replicas across different locations and accelerators, all served behind a single endpoint.

  • Everything stays in your cloud account (your VMs & buckets).

  • Completely private: no one else sees your chat history.

Running your own Code Llama with SkyPilot

After installing SkyPilot, run your own Code Llama on vLLM with a single command:

  1. Start serving Code Llama 70B on a single instance, using any available GPU from the list specified in endpoint.yaml, behind a vLLM-powered OpenAI-compatible endpoint:

sky launch -c code-llama -s endpoint.yaml

CLOUD   INSTANCE                    vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE     COST ($)   CHOSEN   
 Azure   Standard_NC48ads_A100_v4    48      440       A100-80GB:2    eastus          7.35          ✔     
 GCP     g2-standard-96              96      384       L4:8           us-east4-a      7.98                
 GCP     a2-ultragpu-2g              24      340       A100-80GB:2    us-central1-a   10.06               
 Azure   Standard_NC96ads_A100_v4    96      880       A100-80GB:4    eastus          14.69               
 GCP     a2-highgpu-4g               48      340       A100:4         us-central1-a   14.69               
 AWS     g5.48xlarge                 192     768       A10G:8         us-east-1       16.29               
 GCP     a2-ultragpu-4g              48      680       A100-80GB:4    us-central1-a   20.11               
 Azure   Standard_ND96asr_v4         96      900       A100:8         eastus          27.20               
 GCP     a2-highgpu-8g               96      680       A100:8         us-central1-a   29.39               
 AWS     p4d.24xlarge                96      1152      A100:8         us-east-1       32.77               
 Azure   Standard_ND96amsr_A100_v4   96      1924      A100-80GB:8    eastus          32.77               
 GCP     a2-ultragpu-8g              96      1360      A100-80GB:8    us-central1-a   40.22               
 AWS     p4de.24xlarge               96      1152      A100-80GB:8    us-east-1       40.97               

Launching a cluster 'code-llama'. Proceed? [Y/n]: 

  2. Send a request to the endpoint for code completion:

IP=$(sky status --ip code-llama)

curl -L http://$IP:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "codellama/CodeLlama-70b-Instruct-hf",
      "prompt": "def quick_sort(a: List[int]):",
      "max_tokens": 512
  }' | jq -r '.choices[0].text'

This returns the following completion:

    if len(a) <= 1:
        return a
    pivot = a.pop(len(a)//2)
    b = []
    c = []
    for i in a:
        if i > pivot:
            b.append(i)
        else:
            c.append(i)
    b = quick_sort(b)
    c = quick_sort(c)
    res = []
    res.extend(c)
    res.append(pivot)
    res.extend(b)
    return res
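The same completion request can be issued from Python instead of curl. The following is a minimal sketch using only the standard library; the helper names `build_completion_request` and `extract_text` are illustrative and not part of SkyPilot or vLLM, and the IP address is a placeholder for the output of `sky status --ip code-llama`.

```python
import json
import urllib.request

MODEL = "codellama/CodeLlama-70b-Instruct-hf"


def build_completion_request(ip: str, prompt: str, max_tokens: int = 512) -> urllib.request.Request:
    """Build (without sending) a POST request for the /v1/completions route."""
    payload = {"model": MODEL, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"http://{ip}:8000/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_text(response_body: bytes) -> str:
    """Mirror the jq filter '.choices[0].text' used in the curl example."""
    return json.loads(response_body)["choices"][0]["text"]


# 203.0.113.10 is a placeholder address, not a real endpoint.
req = build_completion_request("203.0.113.10", "def quick_sort(a: List[int]):")
print(req.full_url)

# Against a running cluster, the request would be sent like this:
# with urllib.request.urlopen(req) as resp:
#     print(extract_text(resp.read()))
```

Keeping request construction separate from sending makes the payload easy to inspect before any GPU time is spent.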

Scale up the service with SkyServe

  1. With SkyServe, a serving library built on top of SkyPilot, scaling up the Code Llama service is as simple as running:

sky serve up -n code-llama ./endpoint.yaml

This will start the service with multiple replicas on the cheapest available locations and accelerators. SkyServe will automatically manage the replicas, monitor their health, autoscale based on load, and restart them when needed.

A single endpoint will be returned and any request sent to the endpoint will be routed to the ready replicas.

  2. To check the status of the service, run:

sky serve status code-llama

After a while, you will see the following output:

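Services
NAME        VERSION  UPTIME  STATUS        REPLICAS
code-llama  1        -       READY         2/2

Service Replicas
SERVICE_NAME  ID  VERSION  IP  LAUNCHED    RESOURCES                   STATUS  REGION
code-llama    1   1        -   2 mins ago  1x Azure({'A100-80GB': 2})  READY   eastus
code-llama    2   1        -   2 mins ago  1x GCP({'L4': 8})           READY   us-east4-a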

As shown, the service is now backed by 2 replicas, one on Azure and one on GCP, and each replica uses the cheapest accelerator type available on its cloud. In this way, SkyServe maximizes the availability of the service while minimizing the cost.

  3. To access the model, use the same curl command as before, this time sending the request to the service endpoint:

ENDPOINT=$(sky serve status --endpoint code-llama)

curl -L http://$ENDPOINT/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "codellama/CodeLlama-70b-Instruct-hf",
      "prompt": "def quick_sort(a: List[int]):",
      "max_tokens": 512
  }' | jq -r '.choices[0].text'
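The placement behavior described in this section (take the cheapest candidate, fall back to the next one when it is unavailable) can be pictured with a toy model. This is only an illustration and not SkyServe's actual implementation: `Candidate`, `pick_cheapest`, and the `unavailable` set are hypothetical names, and the prices are copied from a few rows of the candidate table earlier in this page.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class Candidate:
    cloud: str
    instance: str
    accelerators: str
    hourly_cost: float  # USD per hour, taken from the table above


CANDIDATES = [
    Candidate("Azure", "Standard_NC48ads_A100_v4", "A100-80GB:2", 7.35),
    Candidate("GCP", "g2-standard-96", "L4:8", 7.98),
    Candidate("GCP", "a2-ultragpu-2g", "A100-80GB:2", 10.06),
    Candidate("AWS", "g5.48xlarge", "A10G:8", 16.29),
]


def pick_cheapest(candidates: Iterable[Candidate], unavailable: Set[str]) -> Optional[Candidate]:
    """Return the cheapest candidate whose instance type is not marked unavailable."""
    available = [c for c in candidates if c.instance not in unavailable]
    return min(available, key=lambda c: c.hourly_cost, default=None)


first = pick_cheapest(CANDIDATES, unavailable=set())
# Pretend the cheapest instance type has no capacity left.
second = pick_cheapest(CANDIDATES, unavailable={first.instance})

print(first.cloud, first.accelerators)    # Azure A100-80GB:2
print(second.cloud, second.accelerators)  # GCP L4:8
```

In this toy model availability is just a set of instance names; a real scheduler would learn it from failed provisioning attempts.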

Optional: Accessing Code Llama with Chat API

We can also access the Code Llama service with the OpenAI Chat API.

ENDPOINT=$(sky serve status --endpoint code-llama)

curl -L http://$ENDPOINT/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "codellama/CodeLlama-70b-Instruct-hf",
      "messages": [
          {
              "role": "system",
              "content": "You are a helpful and honest code assistant expert in Python."
          },
          {
              "role": "user",
              "content": "Show me the python code for quick sorting a list of integers."
          }
      ],
      "max_tokens": 512
  }' | jq -r '.choices[0].message.content'

You should see output similar to the following:

def quicksort(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[len(arr) // 2]
    left = [x for x in arr if x < pivot]
    middle = [x for x in arr if x == pivot]
    right = [x for x in arr if x > pivot]
    return quicksort(left) + middle + quicksort(right)

# Example usage:
numbers = [10, 2, 44, 15, 30, 11, 50]
sorted_numbers = quicksort(numbers)

This code defines a function `quicksort` that takes a list of integers as input. It divides the list into three parts based on the pivot element, which is the middle element of the list. It then recursively sorts the left and right partitions and combines them with the middle partition.

Alternatively, we could access the model through python with OpenAI’s API (see
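As a hedged illustration of Python access, the sketch below talks to the same `/v1/chat/completions` route using only the standard library, so it does not assume any client package is installed. The helper names `build_chat_payload` and `chat` are made up for this example, and the endpoint value must come from `sky serve status --endpoint code-llama`.

```python
import json
import urllib.request

MODEL = "codellama/CodeLlama-70b-Instruct-hf"


def build_chat_payload(user_prompt: str, system_prompt: str, max_tokens: int = 512) -> dict:
    """Assemble the JSON body: each message is an object with a role and content."""
    return {
        "model": MODEL,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "max_tokens": max_tokens,
    }


def chat(endpoint: str, payload: dict) -> str:
    """POST the payload and return choices[0].message.content from the reply."""
    req = urllib.request.Request(
        f"http://{endpoint}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


payload = build_chat_payload(
    user_prompt="Show me the python code for quick sorting a list of integers.",
    system_prompt="You are a helpful and honest code assistant expert in Python.",
)
print(json.dumps(payload, indent=2))

# With a running service (the endpoint string is a placeholder you must fill in):
# print(chat(endpoint, payload))
```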


Optional: Accessing Code Llama with Chat GUI

It is also possible to access the Code Llama service with a GUI using FastChat. Please check the demo.

  1. Start the chat web UI:

sky launch -c code-llama-gui ./gui.yaml --env ENDPOINT=$(sky serve status --endpoint code-llama)

  2. Then, access the GUI at the returned gradio link:

| INFO | stdout | Running on public URL:

Note that you may get better results by using higher temperature and top_p values.

Optional: Using Code Llama as Coding Assistant in VS Code

Tabby is an open-source, self-hosted AI coding assistant. It allows you to connect to your own AI models and use them as a coding assistant in VS Code. Please check the demo at the top.

To start a Tabby server that connects to the Code Llama service, run:

sky launch -c tabby ./tabby.yaml --env ENDPOINT=$(sky serve status --endpoint code-llama)

To get the endpoint for Tabby server, run:

IP=$(sky status --ip tabby)
echo Endpoint: http://$IP:8080

Then, you can connect to the Tabby server from VS Code by installing the Tabby extension and configuring the API Endpoint under Tabby settings.

Note that Code Llama 70B does not have the full infilling functionality [1], so the performance of Tabby with Code Llama may be limited.

To get infilling functionality, you can use the smaller Code Llama models, e.g., Code Llama 7B and 13B, and replace prompt_template with "<PRE> {prefix} <SUF>{suffix} <MID>" in the yaml or the command above.
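To make the role of prompt_template concrete, here is a small sketch of how an infilling prompt is assembled from the code before and after the cursor. The template string below uses the `<PRE>`, `<SUF>`, and `<MID>` sentinel layout of Code Llama's infilling format and should be treated as an assumption; substitute whatever prompt_template your yaml actually sets. `build_infill_prompt` is a hypothetical helper, not a Tabby API.

```python
# Assumed Code Llama infilling layout; replace with your configured prompt_template.
PROMPT_TEMPLATE = "<PRE> {prefix} <SUF>{suffix} <MID>"


def build_infill_prompt(prefix: str, suffix: str, template: str = PROMPT_TEMPLATE) -> str:
    """Fill the template with the text before (prefix) and after (suffix) the cursor."""
    return template.format(prefix=prefix, suffix=suffix)


# The editor cursor sits inside the function body, between prefix and suffix.
prefix = "def add(a, b):\n    "
suffix = "\n\nprint(add(1, 2))\n"

prompt = build_infill_prompt(prefix, suffix)
print(repr(prompt))
```

The model is then expected to generate the missing middle (here, something like a return statement), which the editor splices in between the prefix and suffix.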

For better performance, we recommend using Tabby with the recommended models in the Tabby documentation and our Tabby example.