Serve multiple generative AI models and multiple LoRAs for the base AI models¶
A company may need to deploy multiple large language models (LLMs) in a cluster to support different workloads. For example, a Llama model could power a chatbot interface, while a DeepSeek model might serve a recommendation application. One approach is to expose these models on separate Layer 7 (L7) URL paths and follow the steps in the Getting Started guide for each model.
However, you may also need to serve multiple models from the same L7 URL path. To achieve this, the system must extract information (such as the model name) from the request body, i.e., the LLM prompt. Serving multiple models behind a single endpoint is a common pattern among providers and is generally expected by clients; notably, the OpenAI API format requires the model name to be specified in the request body. For such model-aware routing, use the Body-Based Routing (BBR) feature described in this guide.
Additionally, each base AI model can have multiple Low-Rank Adaptations (LoRAs). LoRAs associated with the same base model are served by the same backend inference server that hosts the base model. A LoRA name is also provided as the model name in the request body.
How¶
The Body-Based Router (BBR) extracts the model name from the request body and injects it into the request as the X-Gateway-Model-Name header. This header is then used to match and route the request to the appropriate InferencePool and its associated Endpoint Picker Extension (EPP) instances.
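For example, for an OpenAI-style request whose body names the food-review-1 LoRA (one of the example LoRAs deployed later in this guide), BBR adds the corresponding header before routing takes place. The snippet below is illustrative only; ${IP} and ${PORT} refer to the Inference Gateway address used throughout this guide:

# Client request: the "model" field may name a base model or a LoRA.
curl -X POST ${IP}:${PORT}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "food-review-1", "messages": [{"role": "user", "content": "Hi"}]}'
# BBR extracts "food-review-1" from the body and injects the header
#   X-Gateway-Model-Name: food-review-1
# which the HTTPRoute rules defined later in this guide match on to select the
# correct InferencePool and its EPP.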
Example Model-Aware Routing using Body-Based Routing (BBR)¶
This guide assumes you have already set up the cluster for basic model serving as described in the Getting Started guide. The sections below describe the additional steps required to deploy and test routing across multiple models and multiple LoRAs, where several LoRAs may be associated with a single base model.
Deploy Body-Based Routing Extension¶
To enable body-based routing, deploy the BBR ext_proc server using Helm. This server is independent of EPP. Once installed, it is automatically added as the first filter in the gateway’s filter chain, ahead of other ext_proc servers such as EPP.
Select the instructions that match your Gateway provider:

GKE:

helm install body-based-router \
  --set provider.name=gke \
  --version v0 \
  oci://us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/charts/body-based-routing

Istio:

helm install body-based-router \
  --set provider.name=istio \
  --version v0 \
  oci://us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/charts/body-based-routing

Kgateway:

Kgateway does not require the Body-Based Routing extension; it implements body-based routing natively. To use it, apply an AgentgatewayPolicy:

apiVersion: gateway.kgateway.dev/v1alpha1
kind: AgentgatewayPolicy
metadata:
  name: bbr
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: inference-gateway
  traffic:
    phase: PreRouting
    transformation:
      request:
        set:
        - name: X-Gateway-Model-Name
          value: 'json(request.body).model'
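Save the policy manifest to a file and apply it. The commands below are a sketch: the filename is hypothetical, and they assume the AgentgatewayPolicy CRD from the kgateway v1alpha1 API is installed in your cluster:

# Apply the policy (assumes you saved the manifest above as agentgatewaypolicy-bbr.yaml) and confirm it exists.
kubectl apply -f agentgatewaypolicy-bbr.yaml
kubectl get agentgatewaypolicy bbr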
Other providers:

helm install body-based-router \
  --version v0 \
  oci://us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/charts/body-based-routing
After the installation, verify that the BBR pod is running without errors:
kubectl get pods
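If other workloads run in the same namespace, you can narrow the check to the BBR pod and inspect its logs. This sketch assumes the pod name contains the Helm release name, which is typical for Helm charts but not guaranteed:

# Pick the first pod whose name contains the release name and check its logs for startup errors.
BBR_POD=$(kubectl get pods -o name | grep body-based-router | head -n 1)
kubectl logs "$BBR_POD"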
Serving a Second Base Model¶
The example uses a vLLM simulator, since this is the lowest-common-denominator configuration that can run in any environment. The model, deepseek/vllm-deepseek-r1, will be served from the same / L7 path as in the previous example from the Getting Started guide.
Deploy the second base model:
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/vllm/sim-deployment-1.yaml
The overall setup is as follows. Two base models are deployed: meta-llama/Llama-3.1-8B-Instruct and deepseek/vllm-deepseek-r1. Additionally, the food-review-1 LoRA is associated with meta-llama/Llama-3.1-8B-Instruct, while the ski-resorts and movie-critique LoRAs are associated with deepseek/vllm-deepseek-r1.
⚠️ Note: LoRA names must be unique across the base AI models (i.e., across the backend inference server deployments), because requests are routed solely by the model name carried in the X-Gateway-Model-Name header.
Review the YAML definition.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-deepseek-r1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-deepseek-r1
  template:
    metadata:
      labels:
        app: vllm-deepseek-r1
    spec:
      containers:
      - name: vllm-sim
        image: ghcr.io/llm-d/llm-d-inference-sim:v0.4.0
        imagePullPolicy: Always
        args:
        - --model
        - deepseek/vllm-deepseek-r1
        - --port
        - "8000"
        - --max-loras
        - "2"
        - --lora-modules
        - '{"name": "ski-resorts"}'
        - '{"name": "movie-critique"}'
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        ports:
        - containerPort: 8000
          name: http
          protocol: TCP
        resources:
          requests:
            cpu: 10m
Verify that the second base model pod is running without errors:
kubectl get pods
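Optionally, query the simulator directly to confirm that the base model and both LoRAs are registered. This sketch assumes the simulator exposes the OpenAI-compatible /v1/models endpoint:

# Port-forward to the DeepSeek simulator and list the models it serves.
kubectl port-forward deploy/vllm-deepseek-r1 8000:8000 &
PF_PID=$!
sleep 2
curl -s localhost:8000/v1/models   # expect deepseek/vllm-deepseek-r1, ski-resorts and movie-critique
kill $PF_PID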
Deploy the 2nd InferencePool and Endpoint Picker Extension¶
Set the Helm chart version (unless already set).
export IGW_CHART_VERSION=v0
Select the instructions that match your Gateway provider.

GKE:

export GATEWAY_PROVIDER=gke
helm install vllm-deepseek-r1 \
  --set inferencePool.modelServers.matchLabels.app=vllm-deepseek-r1 \
  --set provider.name=$GATEWAY_PROVIDER \
  --version $IGW_CHART_VERSION \
  oci://us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/charts/inferencepool

Istio:

export GATEWAY_PROVIDER=istio
helm install vllm-deepseek-r1 \
  --set inferencePool.modelServers.matchLabels.app=vllm-deepseek-r1 \
  --set provider.name=$GATEWAY_PROVIDER \
  --version $IGW_CHART_VERSION \
  oci://us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/charts/inferencepool

Other providers:

export GATEWAY_PROVIDER=none
helm install vllm-deepseek-r1 \
  --set inferencePool.modelServers.matchLabels.app=vllm-deepseek-r1 \
  --set provider.name=$GATEWAY_PROVIDER \
  --version $IGW_CHART_VERSION \
  oci://us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/charts/inferencepool
After the installation, verify that you have two InferencePools and two EPP pods (one per base model), all running without errors:
kubectl get inferencepools
kubectl get pods
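Optionally, inspect the new InferencePool to confirm that it selects the DeepSeek pods. This assumes the chart names the InferencePool after the Helm release, as in the Getting Started guide:

# The selector should match the app=vllm-deepseek-r1 label used by the Deployment above.
kubectl describe inferencepool vllm-deepseek-r1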
Configure HTTPRoutes¶
Before configuring the HTTPRoutes for the models and their LoRAs, delete the existing HTTPRoute for the meta-llama/Llama-3.1-8B-Instruct model. The new routes will match the model name in the X-Gateway-Model-Name HTTP header, which is inserted by the BBR extension after parsing the model name from the LLM request body.
kubectl delete httproute llm-route
Now configure HTTPRoutes for the two simulated models and their LoRAs, to be served via BBR. The following command applies both routes:
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/bbr-example/httproute_bbr_lora.yaml
Examine the manifest (shown below) to see how the X-Gateway-Model-Name header match in each rule routes requests to the correct backend InferencePool based on the model name.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-llama-route
spec:
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: inference-gateway
  rules:
  - backendRefs:
    - group: inference.networking.k8s.io
      kind: InferencePool
      name: vllm-llama3-8b-instruct
    matches:
    - path:
        type: PathPrefix
        value: /
      headers:
      - type: Exact
        name: X-Gateway-Model-Name
        value: 'meta-llama/Llama-3.1-8B-Instruct'
    - path:
        type: PathPrefix
        value: /
      headers:
      - type: Exact
        name: X-Gateway-Model-Name
        value: 'food-review-1'
    timeouts:
      request: 300s
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-deepseek-route # give this HTTPRoute any name that helps you group and track the matchers
spec:
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: inference-gateway
  rules:
  - backendRefs:
    - group: inference.networking.k8s.io
      kind: InferencePool
      name: vllm-deepseek-r1
    matches:
    - path:
        type: PathPrefix
        value: /
      headers:
      - type: Exact
        name: X-Gateway-Model-Name
        value: 'deepseek/vllm-deepseek-r1'
    - path:
        type: PathPrefix
        value: /
      headers:
      - type: Exact
        name: X-Gateway-Model-Name
        value: 'ski-resorts'
    - path:
        type: PathPrefix
        value: /
      headers:
      - type: Exact
        name: X-Gateway-Model-Name
        value: 'movie-critique'
    timeouts:
      request: 300s
⚠️ Note: The Kubernetes Gateway API limits the total number of matchers per HTTPRoute to fewer than 128.
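If you add many LoRAs, you can check how many matchers a route currently defines before running into this limit. A minimal sketch, assuming jq is installed:

# Count the total number of matches defined across all rules of the route.
kubectl get httproute llm-deepseek-route -o json | jq '[.spec.rules[].matches[]] | length'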
Before testing the setup, confirm that the HTTPRoute status conditions include Accepted=True and ResolvedRefs=True for both routes using the following commands.
kubectl get httproute llm-llama-route -o yaml
kubectl get httproute llm-deepseek-route -o yaml
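Instead of scanning the full YAML output, you can print just the status conditions with kubectl's jsonpath support (field names as defined by the Gateway API status; repeat for llm-deepseek-route):

# Print each condition type and status reported for the parent Gateway.
kubectl get httproute llm-llama-route \
  -o jsonpath='{range .status.parents[*].conditions[*]}{.type}={.status}{"\n"}{end}'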
Try the setup¶
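The tests below reuse the ${IP} and ${PORT} variables from the Getting Started guide. If they are not set in your current shell, one typical way to populate them is shown below; the jsonpath expression assumes your Gateway publishes its address in status.addresses, and the port should match your Gateway listener:

# Look up the Inference Gateway address and pick the listener port (adjust as needed).
IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')
PORT=80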
Send a few requests to the Llama base model to confirm that it works as before:

curl -X POST -i ${IP}:${PORT}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "max_tokens": 100,
    "temperature": 0,
    "messages": [
      {"role": "developer", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Linux is said to be an open source kernel because "}
    ]
  }'

Send a few requests to the DeepSeek base model to confirm that it works:

curl -X POST -i ${IP}:${PORT}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek/vllm-deepseek-r1",
    "max_tokens": 100,
    "temperature": 0,
    "messages": [
      {"role": "developer", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Linux is said to be an open source kernel because "}
    ]
  }'

Send a few requests to the LoRA of the Llama model:

curl -X POST -i ${IP}:${PORT}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "food-review-1",
    "max_tokens": 100,
    "temperature": 0,
    "messages": [
      {"role": "reviewer", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Write a review of the best restaurants in San Francisco"}
    ]
  }'

Send a few requests to one LoRA of the DeepSeek model:

curl -X POST -i ${IP}:${PORT}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "movie-critique",
    "max_tokens": 100,
    "temperature": 0,
    "messages": [
      {"role": "reviewer", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What are the best movies of 2025?"}
    ]
  }'

Send a few requests to the other LoRA of the DeepSeek model:

curl -X POST -i ${IP}:${PORT}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ski-resorts",
    "max_tokens": 100,
    "temperature": 0,
    "messages": [
      {"role": "reviewer", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Tell me about ski deals"}
    ]
  }'
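To see which pool handled a request, you can tail the logs of the EPP serving the DeepSeek InferencePool. The Deployment name below is an assumption based on the common <release>-epp naming convention; list your deployments first and adjust the name if needed:

# Find the EPP Deployment created by the vllm-deepseek-r1 Helm release, then tail its logs.
kubectl get deployments
kubectl logs deploy/vllm-deepseek-r1-epp --tail=20   # adjust if your EPP Deployment is named differently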
You can also exercise the /v1/completions endpoint. Send a few requests to the Llama model's LoRA:

curl -X POST -i ${IP}:${PORT}/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "food-review-1",
    "prompt": "Write as if you were a critic: San Francisco ",
    "max_tokens": 100,
    "temperature": 0
  }'

Send a few requests to the first DeepSeek LoRA:

curl -X POST -i ${IP}:${PORT}/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ski-resorts",
    "prompt": "What is the best ski resort in Austria?",
    "max_tokens": 20,
    "temperature": 0
  }'

Send a few requests to the second DeepSeek LoRA:

curl -X POST -i ${IP}:${PORT}/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "movie-critique",
    "prompt": "Tell me about movies",
    "max_tokens": 20,
    "temperature": 0
  }'
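As a final check, you can confirm that each response echoes the model or LoRA name that was requested; OpenAI-style responses carry a model field. A minimal sketch, assuming jq is installed:

# The returned "model" should match the requested LoRA, confirming the request reached the right pool.
curl -s ${IP}:${PORT}/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "ski-resorts", "prompt": "Hello", "max_tokens": 5, "temperature": 0}' | jq .model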