For the answer to “How do I serve Qwen3-VL-Embedding-2B (or 8B)?” see the last section.
The rest of this post is how I got there.
I needed to deploy Qwen3-VL-Embedding-8B for a customer recently. I knew of Qwen’s models: very popular, open-weights, and competitively performant. They also play nicely with model serving frameworks like vLLM and SGLang, thanks to the Qwen developers’ active participation in those frameworks’ repos.
However, I was unaware of this particular offering from them. A vision-language (hence VL) embedding model? Well, their model card on Hugging Face boasts more:
Qwen3-VL-Embedding-8B has the following features:
- Model Type: MultiModal Embedding
- Supported Languages: 30+ Languages
- Supported Input Modalities: Text, images, screenshots, videos, and arbitrary multimodal combinations (e.g., text + image, text + video)
- …
It’s a pretty killer set of features for an embedding model.
I’ve successfully set up past Qwen models with vLLM. I’ve also deployed embedding models. I figured my past experience would carry me through once again.
For a cost-effective exploration run, I went with the smaller 2B variant; the 8B model has higher GPU VRAM requirements.
“Straightforward”
“This should be pretty straightforward,” said I, naively. So, I ran it using the good old blue whale:
docker run \
--rm \
--gpus all \
-p 8000:8000 \
vllm/vllm-openai:v0.15.1 \
--model Qwen/Qwen3-VL-Embedding-2B \
--runner pooling \
--host 0.0.0.0
Then I POST-ed to /v1/embeddings to get embeddings:
$ curl http://localhost:8000/v1/embeddings \
-H 'Content-Type: application/json' \
-d '{
"input": [
"Hello, world",
"Dayman, fighter of the Nightman"
],
"dimensions": 64
}'
{
"id": "embd-8821835b5f55fced",
"object": "list",
"created": 1771737628,
"model": "Qwen/Qwen3-VL-Embedding-2B",
"data": [
{
"index": 0,
"object": "embedding",
"embedding": [
-0.15565800666809082,
-0.10350227355957031,
...
]
},
{
"index": 1,
"object": "embedding",
"embedding": [
0.0031883527990430593,
-0.45133176445961,
...
]
}
],
"usage": {
"prompt_tokens": 13,
"total_tokens": 13,
"completion_tokens": 0,
"prompt_tokens_details": null
}
}
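Once you have the vectors back, comparing them is plain cosine similarity. A minimal stdlib sketch, with toy three-element vectors standing in for the 64-dimensional ones above:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Cosine similarity = dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins for the two response vectors above.
hello = [0.2, -0.1, 0.4]
dayman = [0.1, 0.3, -0.2]
print(cosine_similarity(hello, hello))   # ~1.0 for identical vectors
print(cosine_similarity(hello, dayman))  # lower for dissimilar ones
```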
Great, it works! But hang on, how do I pass in an image URL? The OpenAI spec has nothing to say about it.
These are not the docs you’re hoping for
After some Googling, the vLLM OpenAI-compatible docs chimed in:
You can pass multi-modal inputs to embedding models by defining a custom chat template for the server and passing a list of messages in the request. Refer to the examples below for illustration.
Whoa, whoa. Slow down, Hoss. Who said anything about custom chat templates? List of messages… what? I want embeddings!
Let’s look at this code…
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="EMPTY",
)
image_url = "https://vllm-public-assets.s3.us-west-2.amazonaws.com/vision_model_images/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
response = create_chat_embeddings(
client,
model="TIGER-Lab/VLM2Vec-Full",
messages=[
{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": image_url}},
{"type": "text", "text": "Represent the given image."},
],
}
],
encoding_format="float",
)
print("Image embedding output:", response.data[0].embedding)
This doesn’t look like an embedding request; in fact, it looks like chat completions. What does create_chat_embeddings() do?
The link at the bottom of the section points to something rather interesting: a cobbled-together script that screams “this is not production code”. This is where we find the definition of create_chat_embeddings():
def create_chat_embeddings(
client: OpenAI,
*,
messages: list[ChatCompletionMessageParam],
model: str,
encoding_format: Literal["base64", "float"] | NotGiven = NOT_GIVEN,
continue_final_message: bool = False,
add_special_tokens: bool = False,
) -> CreateEmbeddingResponse:
"""
Convenience function for accessing vLLM's Chat Embeddings API,
which is an extension of OpenAI's existing Embeddings API.
"""
return client.post(
"/embeddings",
cast_to=CreateEmbeddingResponse,
body={
"messages": messages,
"model": model,
"encoding_format": encoding_format,
"continue_final_message": continue_final_message,
"add_special_tokens": add_special_tokens,
},
)
If it looks weird, you’re in good company. My brain clocked out while parsing this. Twice.
I’ll spell the code out for you:
- Use the OpenAI client object… (Okay, fine.)
- to POST to /embeddings (Wait, what? Why not client.embeddings.create?)
- The inputs are (among other keys): messages, a list of ChatCompletionMessageParam (??? Never mind, I give up.)
Essentially, you are sending a templated chat message to get embeddings back. I’m talking about a system prompt, user messages, and all the bells and whistles, like so:
<|im_start|>system
Represent the user input.<|im_end|>
<|im_start|>user
Picture 1: <|vision_start|><|image_pad|><|vision_end|>
A cat in the snow.<|im_end|>
<|im_start|>assistant
<|im_end|>
to generate an embedding. Confused yet?
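To demystify the template a bit, here’s a hand-rolled sketch of roughly how a messages list renders into that ChatML-style prompt. This is an approximation for illustration only, not the model’s actual chat template (the tokenizer applies the real one server-side):

```python
def render_chatml(messages: list[dict]) -> str:
    # Approximate the rendering shown above: each message becomes an
    # <|im_start|>role ... <|im_end|> block; images become vision pad tokens.
    blocks = []
    for msg in messages:
        parts = []
        for item in msg["content"]:
            if item["type"] == "image_url":
                parts.append("<|vision_start|><|image_pad|><|vision_end|>")
            else:
                parts.append(item["text"])
        blocks.append(f"<|im_start|>{msg['role']}\n" + "\n".join(parts) + "<|im_end|>")
    return "\n".join(blocks)

messages = [
    {"role": "system", "content": [{"type": "text", "text": "Represent the user input."}]},
    {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
        {"type": "text", "text": "A cat in the snow."},
    ]},
]
print(render_chatml(messages))
```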
You don’t have to be an ML engineer to see that this is the kind of jank that deserves its own extensive docs; I’m talking about tables of JSON input keys, types, descriptions, and examples.
Sadly, no: this one Python file is it. It’s the documentation you’re looking for, but it’s far from the docs you’re hoping for.
The missing schema
Though battered, I still had high hopes. This is vLLM, the model serving framework that’s been crushing it left and right. I knew the complete schema for this endpoint had to be tucked away in some obscure page. “Maybe the openapi.json that vLLM serves,” exclaimed I.
I cURL-ed http://localhost:8000/openapi.json, opened it in Swagger, scrolled down to /v1/embeddings, clicked on the schema, and there it was:
#1 EmbeddingChatRequest object
encoding_format string
embed_dtype string
endianness string
dimensions (integer | null)
normalize (boolean | null)
messages array<(object | object | object | object | object | object | object | object)>
add_generation_prompt boolean
continue_final_message boolean
add_special_tokens boolean
chat_template (string | null)
chat_template_kwargs (object | null)
model (string | null)
user (string | null)
truncate_prompt_tokens (integer | null)
request_id string
priority integer
mm_processor_kwargs (object | null)
Additional properties allowed
The second request object type for the /v1/embeddings endpoint, tucked away in the openapi.json.
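If you’d rather skip Swagger, the same schema can be dug out with a few lines of Python. The navigation below assumes the standard OpenAPI 3 layout that FastAPI (and hence vLLM) emits; the localhost URL assumes the server from earlier is running:

```python
import json
from urllib.request import urlopen

def embedding_request_schema(openapi: dict) -> dict:
    # Standard OpenAPI 3 path: paths -> route -> method -> requestBody
    # -> media type -> schema.
    post = openapi["paths"]["/v1/embeddings"]["post"]
    return post["requestBody"]["content"]["application/json"]["schema"]

if __name__ == "__main__":
    # Assumes the vLLM server from earlier is listening locally.
    with urlopen("http://localhost:8000/openapi.json") as resp:
        print(json.dumps(embedding_request_schema(json.load(resp)), indent=2))
```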
This was a pain to discover. The request structure is convoluted and non-trivial, and the docs required a side quest to acquire. There is no excuse. Dear vLLM team, please improve your embedding docs.
Serving it at last
After a round with Gemini 3 Fast for bonus points, here’s my take on a feature-complete deployment of Qwen3-VL-Embedding-2B via vllm serve. The same commands work for the 8B variant too.
Server
The --hf-overrides argument below makes the dimensions input key work (to generate different embedding sizes on the fly). Not to mention, this vllm serve argument isn’t well-documented either.
docker run \
--gpus all \
--rm \
-p 8000:8000 \
vllm/vllm-openai:v0.15.1 \
--model Qwen/Qwen3-VL-Embedding-2B \
--runner pooling \
--max-model-len 32768 \
--hf-overrides '{"is_matryoshka": true}'
cURL
Caveat: you cannot batch multimodal inputs (as of vLLM v0.15.1). You can batch text-only embeddings via the input key in the EmbeddingCompletionRequest object.
curl http://localhost:8000/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-VL-Embedding-2B",
"messages": [
{
"role": "system",
"content": [{"type": "text", "text": "Represent the user input."}]
},
{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": "https://vllm-public-assets.s3.us-west-2.amazonaws.com/multimodal_asset/cat_snow.jpg"}},
{"type": "text", "text": "A cat in the snow."}
]
},
{
"role": "assistant",
"content": [{"type": "text", "text": ""}]
}
],
"continue_final_message": true,
"add_special_tokens": true,
"dimensions": 256
}'
OpenAI python client
If you’d like to use the OpenAI Python client, the create_chat_embeddings example above is a good start. No better options, AFAIK.
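For concreteness, here’s a sketch that adapts that pattern to the multimodal curl request above. The builder function (my own, hypothetical) mirrors the JSON body exactly; the client.post call is the same escape hatch vLLM’s own example uses, and it assumes the server from the previous section is up on localhost:8000:

```python
def build_chat_embedding_body(image_url: str, text: str,
                              model: str = "Qwen/Qwen3-VL-Embedding-2B",
                              dimensions: int = 256) -> dict:
    # Mirrors the curl request above: system instruction, multimodal user
    # turn, empty assistant turn to continue, plus matryoshka dimensions.
    return {
        "model": model,
        "messages": [
            {"role": "system",
             "content": [{"type": "text", "text": "Represent the user input."}]},
            {"role": "user",
             "content": [
                 {"type": "image_url", "image_url": {"url": image_url}},
                 {"type": "text", "text": text},
             ]},
            {"role": "assistant",
             "content": [{"type": "text", "text": ""}]},
        ],
        "continue_final_message": True,
        "add_special_tokens": True,
        "dimensions": dimensions,
    }

if __name__ == "__main__":
    # Imported here so the builder above stays dependency-free.
    from openai import OpenAI
    from openai.types import CreateEmbeddingResponse

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    body = build_chat_embedding_body(
        "https://vllm-public-assets.s3.us-west-2.amazonaws.com/multimodal_asset/cat_snow.jpg",
        "A cat in the snow.",
    )
    response = client.post("/embeddings", cast_to=CreateEmbeddingResponse, body=body)
    print(len(response.data[0].embedding))  # should match "dimensions"
```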
vLLM should really consider building a Python client along the lines of the widely used OpenAI Python client. vLLM is much richer in features and model support, and they should double down: fork the OpenAI Python client codebase, or whatever it takes, but ship a fully-featured client for vLLM-served models.
Updates:
- Minor typo fixes and rephrasing.