LoRAX: Multi-LoRA Inference Server

LoRAX (LoRA eXchange) is a framework that allows users to serve thousands of fine-tuned LLMs on a single GPU, dramatically reducing the cost of serving without compromising on throughput or latency. It works by dynamically loading multiple fine-tuned “adapters” (LoRAs, etc.) on top of a single base model at runtime. Concurrent requests for different adapters can be processed together in a single batch, allowing LoRAX to maintain near-linear throughput scaling as the number of adapters increases.
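
To make the idea of one shared base model with per-request adapters concrete, here is a toy sketch in plain Python. It is not LoRAX's implementation (a real server uses batched GPU kernels); it only illustrates the general LoRA idea that each request's output is the base projection plus a small low-rank correction chosen by that request's adapter. The names vec_mat and batched_forward, and the tiny matrices in the usage note, are made up for illustration.

```python
def vec_mat(x, m):
    """Multiply a row vector x (length n) by a matrix m with n rows."""
    return [sum(x[i] * m[i][j] for i in range(len(x))) for j in range(len(m[0]))]


def batched_forward(batch, base_weight, adapters):
    """Apply one shared base weight to every request in the batch.

    batch:       list of (input_vector, adapter_id or None)
    base_weight: matrix shared by all requests
    adapters:    dict mapping adapter_id -> (A, B) low-rank factors
    """
    outputs = []
    for x, adapter_id in batch:
        # Every request goes through the same base weights.
        y = vec_mat(x, base_weight)
        if adapter_id is not None:
            # Requests that name an adapter add its low-rank update (x @ A) @ B.
            a, b = adapters[adapter_id]
            delta = vec_mat(vec_mat(x, a), b)
            y = [yi + di for yi, di in zip(y, delta)]
        outputs.append(y)
    return outputs
```

For example, with base_weight = [[1, 0], [0, 1]] and two hypothetical rank-1 adapters, a batch such as [([1, 2], None), ([1, 2], "a"), ([1, 2], "b")] mixes a base-model request with two different adapters while the base weights are stored only once.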

Launch a deployment

Create a YAML configuration file called lorax.yaml:

resources:
  accelerators: {A10G, A10, L4, A100, A100-80GB}
  memory: 32+
  ports: 
    - 8080

envs:
  MODEL_ID: mistralai/Mistral-7B-Instruct-v0.1

run: |
  docker run --gpus all --shm-size 1g -p 8080:80 -v ~/data:/data \
    ghcr.io/predibase/lorax:latest \
    --model-id $MODEL_ID

In the above example, we’re asking SkyPilot to provision an instance with one of the listed GPU types (A10G, A10, L4, A100, or A100-80GB) and at least 32 GB of RAM. Once the node is provisioned, SkyPilot will launch the LoRAX server using our latest pre-built Docker image.

Let’s launch our LoRAX job:

sky launch -c lorax-cluster lorax.yaml

By default, this config will deploy Mistral-7B-Instruct, but this can be overridden by running sky launch with the argument --env MODEL_ID=<my_model>.

NOTE: This config will launch the instance on a public IP. It’s highly recommended to secure the instance within a private subnet. See the Advanced Configurations section of the SkyPilot docs for options to run within a VPC and set up private IPs.

Prompt LoRAX w/ base model

In a separate window, obtain the IP address of the newly created instance:

sky status --ip lorax-cluster

Now we can prompt the base model deployment using a simple REST API:

IP=$(sky status --ip lorax-cluster)

curl http://$IP:8080/generate \
    -X POST \
    -d '{
        "inputs": "[INST] Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? [/INST]",
        "parameters": {
            "max_new_tokens": 64
        }
    }' \
    -H 'Content-Type: application/json'
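
The same request can be sent from Python. The following is a minimal sketch using only the standard library; the helper names build_generate_request and generate are made up for this example, and the response is returned as parsed JSON without assuming any particular fields.

```python
import json
import urllib.request


def build_generate_request(ip, prompt, max_new_tokens=64):
    """Build a POST request equivalent to the curl command above."""
    payload = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens},
    }
    return urllib.request.Request(
        f"http://{ip}:8080/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(ip, prompt, max_new_tokens=64, timeout=60):
    """Send the request and return the server's JSON response as a dict."""
    req = build_generate_request(ip, prompt, max_new_tokens)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Call generate(ip, "[INST] ... [/INST]") with the address printed by sky status --ip lorax-cluster.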

Prompt LoRAX w/ adapter

To improve the quality of the response, we can add a single parameter, adapter_id, pointing to a valid LoRA adapter from the Hugging Face Hub.

In this example, we’ll use the adapter vineetsharma/qlora-adapter-Mistral-7B-Instruct-v0.1-gsm8k, which was fine-tuned from the base model to improve its math reasoning:

curl http://$IP:8080/generate \
    -X POST \
    -d '{
        "inputs": "[INST] Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? [/INST]",
        "parameters": {
            "max_new_tokens": 64,
            "adapter_id": "vineetsharma/qlora-adapter-Mistral-7B-Instruct-v0.1-gsm8k"
        }
    }' \
    -H 'Content-Type: application/json'
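
Because the adapter is chosen per request, one deployment can answer requests for the base model and for several adapters at the same time. The sketch below, again standard-library Python with made-up helper names (build_payload, post_generate, generate_many), sends the same prompt concurrently with different adapter_id values; passing None omits the parameter and targets the base model.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def build_payload(prompt, adapter_id=None, max_new_tokens=64):
    """Build the JSON body; adapter_id is only included when one is given."""
    parameters = {"max_new_tokens": max_new_tokens}
    if adapter_id is not None:
        parameters["adapter_id"] = adapter_id
    return {"inputs": prompt, "parameters": parameters}


def post_generate(ip, payload, timeout=60):
    """POST one payload to the /generate endpoint and return the parsed JSON."""
    req = urllib.request.Request(
        f"http://{ip}:8080/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def generate_many(ip, prompt, adapter_ids, send=post_generate):
    """Send one request per adapter concurrently; results keep input order."""
    payloads = [build_payload(prompt, a) for a in adapter_ids]
    with ThreadPoolExecutor(max_workers=max(1, len(payloads))) as pool:
        return list(pool.map(lambda p: send(ip, p), payloads))
```

For instance, generate_many(ip, prompt, [None, "vineetsharma/qlora-adapter-Mistral-7B-Instruct-v0.1-gsm8k"]) returns the base model's answer and the adapter's answer side by side for comparison.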

Here are some other interesting Mistral-7B fine-tuned models to test out:

You can find more LoRA adapters here, or try fine-tuning your own with PEFT or Ludwig.

Stop the deployment

Stopping the deployment will shut down the instance, but keep the storage volume:

sky stop lorax-cluster

Because our config runs docker run ... -v ~/data:/data, any model weights or adapters we downloaded will persist the next time we run sky launch. The LoRAX Docker image will also be cached, so tags like latest won’t be updated on restart unless you add docker pull to your run configuration.

Delete the deployment

To completely delete the deployment, including the storage volume:

sky down lorax-cluster

The next time you run sky launch, the deployment will be recreated from scratch.