Gemma: Open-source Gemini


Google's release of Gemma has made a big wave in the AI community. It opens the opportunity for the open-source community to serve and fine-tune a private "Gemini" of its own.

Serve Gemma on any Cloud

Serving Gemma on any cloud is easy with SkyPilot. With the serve.yaml in this directory, you can host the model on any cloud with a single command.
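For reference, here is a minimal sketch of what such a serve.yaml might look like, assuming vLLM's OpenAI-compatible server; the accelerator choice and setup commands below are illustrative, and the actual serve.yaml in this directory is authoritative:

envs:
  MODEL_NAME: google/gemma-7b-it
  HF_TOKEN: ""  # passed in with --env HF_TOKEN at launch time

resources:
  accelerators: A100:1  # illustrative; any GPU with enough memory works
  ports: 8000

setup: |
  pip install vllm
  python -c "import huggingface_hub; huggingface_hub.login('$HF_TOKEN')"

run: |
  python -m vllm.entrypoints.openai.api_server \
    --model $MODEL_NAME --host 0.0.0.0 --port 8000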

Prerequisites

  1. Apply for access to the Gemma model

Go to the application page and click Acknowledge license to apply for access to the model weights.

  2. Get an access token from Hugging Face

Generate a read-only access token on Hugging Face here, and make sure your Hugging Face account can access the Gemma models here. (A quick way to verify access is sketched after this list.)

  3. Install SkyPilot

pip install "skypilot-nightly[all]"

For detailed installation instructions, please refer to the installation guide.
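Before launching, you can optionally sanity-check that the token can actually see the gated repository. One illustrative way is to query the Hugging Face Hub HTTP API directly, which returns the model metadata on success and an error body if the token or license acknowledgement is missing:

HF_TOKEN="xxx"
curl -s -H "Authorization: Bearer $HF_TOKEN" \
  https://huggingface.co/api/models/google/gemma-7b-it | jq .id

If this prints "google/gemma-7b-it", the token has access.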

Host on a Single Instance

We can host the model on a single instance:

HF_TOKEN="xxx" sky launch -c gemma serve.yaml --env HF_TOKEN

After the cluster is launched, we can access the model with the following command:

IP=$(sky status --ip gemma)

curl http://$IP:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
      "model": "google/gemma-7b-it",
      "prompt": "My favourite condiment is",
      "max_tokens": 25
  }' | jq .

The chat API is also supported:

IP=$(sky status --ip gemma)

curl http://$IP:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
      "model": "google/gemma-7b-it",
      "messages": [
        {
          "role": "user",
          "content": "Hello! What is your name?"
        }
      ],
      "max_tokens": 25
  }'

Scale the Serving with SkyServe

Using the same YAML, we can easily scale the model serving across multiple instances, regions and clouds with SkyServe:

HF_TOKEN="xxx" sky serve up -n gemma serve.yaml --env HF_TOKEN

Notice that the only change is from sky launch to sky serve up; the same YAML works for both.
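For sky serve up to work, the YAML needs a service section telling SkyServe how many replicas to run and how to probe their health (sky launch simply ignores this section). A minimal sketch of what that section might look like, with illustrative values rather than the exact contents of serve.yaml:

service:
  # Endpoint SkyServe polls to decide whether a replica is ready.
  readiness_probe: /v1/models
  replicas: 2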

After the service is up, we can access the model with the following command:

ENDPOINT=$(sky serve status --endpoint gemma)

curl http://$ENDPOINT/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
      "model": "google/gemma-7b-it",
      "prompt": "My favourite condiment is",
      "max_tokens": 25
  }' | jq .

The chat API is also supported:

curl http://$ENDPOINT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
      "model": "google/gemma-7b-it",
      "messages": [
        {
          "role": "user",
          "content": "Hello! What is your name?"
        }
      ],
      "max_tokens": 25
  }'
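
When you are done, remember to clean up to avoid idle charges: sky serve down tears down the service, and sky down terminates the single-instance cluster from the earlier section.

sky serve down gemma
sky down gemma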