syaffers.xyz

Serving Qwen3-VL-Embedding-2B on vLLM: a case of bad docs

#vllm #mlops #rant

I’m about to rant about bad documentation. If you’re just looking for how to serve Qwen3-VL-Embedding-2B (or 8B), skip to the last section.


I needed to deploy Qwen3-VL-Embedding-8B for a customer recently. I knew of Qwen’s models: very popular, open-weights, and competitively performant. They also play nicely with model serving frameworks like vLLM and SGLang, thanks to the Qwen developers’ very active participation in those frameworks’ repos.

However, I was unaware of this particular offering from them. A vision-language (hence VL) embedding model? Well, their model card on Hugging Face boasts more:

Qwen3-VL-Embedding-8B has the following features:

  • Model Type: MultiModal Embedding
  • Supported Languages: 30+ Languages
  • Supported Input Modalities: Text, images, screenshots, videos, and arbitrary multimodal combinations (e.g., text + image, text + video)

It’s a pretty killer set of features for an embedding model.

Having successfully set up past Qwen models with vLLM, I thought I’d set this one up as I had before.

“Straightforward”

There’s a 2B model and there’s an 8B model. For a cost-effective test run, I went with the smaller 2B variant. “This should be pretty straightforward,” said I, naively.

I ran it using the good old blue whale:

docker run \
    --rm \
    --gpus all \
    -p 8000:8000 \
    vllm/vllm-openai:v0.15.1 \
    --model Qwen/Qwen3-VL-Embedding-2B \
    --runner pooling \
    --host 0.0.0.0

Then I POSTed to /v1/embeddings to get embeddings:

curl http://localhost:8000/v1/embeddings \
    -H 'Content-Type: application/json' \
    -d '{
    "input": [
        "Hello, world",
        "Dayman, fighter of the Nightman"
    ],
    "dimensions": 64
}'

But, hang on, how do I pass in an image URL? The OpenAI embeddings spec has nothing to say about it.

These are not the docs you’re hoping for

After some Googling, the vLLM OpenAI-compatible docs chimed in:

You can pass multi-modal inputs to embedding models by defining a custom chat template for the server and passing a list of messages in the request. Refer to the examples below for illustration.

Whoa, whoa. Slow down, Hoss. Who said anything about custom chat templates? List of messages… what? I want embeddings! Show me the code.

from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",
)
image_url = "https://vllm-public-assets.s3.us-west-2.amazonaws.com/vision_model_images/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"

response = create_chat_embeddings(
    client,
    model="TIGER-Lab/VLM2Vec-Full",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": "Represent the given image."},
            ],
        }
    ],
    encoding_format="float",
)

print("Image embedding output:", response.data[0].embedding)

This doesn’t look like an embedding request; in fact, it looks like chat completions. What does create_chat_embeddings() do?

The link at the bottom of the section points to something rather interesting: a cobbled-together script that screams “this is not production code”. This is where we find the definition of create_chat_embeddings():

def create_chat_embeddings(
    client: OpenAI,
    *,
    messages: list[ChatCompletionMessageParam],
    model: str,
    encoding_format: Literal["base64", "float"] | NotGiven = NOT_GIVEN,
    continue_final_message: bool = False,
    add_special_tokens: bool = False,
) -> CreateEmbeddingResponse:
    """
    Convenience function for accessing vLLM's Chat Embeddings API,
    which is an extension of OpenAI's existing Embeddings API.
    """
    return client.post(
        "/embeddings",
        cast_to=CreateEmbeddingResponse,
        body={
            "messages": messages,
            "model": model,
            "encoding_format": encoding_format,
            "continue_final_message": continue_final_message,
            "add_special_tokens": add_special_tokens,
        },
    )

If it looks weird, you’re in good company. My brain clocked out while parsing this. Twice.

I’ll spell the code out for you:

  1. Use the OpenAI client object… (Okay, fine.)
  2. to POST to /embeddings (Wait, what? Why not client.embeddings.create?)
  3. The inputs are (among other keys):
    • messages, a list of ChatCompletionMessageParam (Never mind, I give up.)
    • continue_final_message, add_special_tokens (??? See the raw request sketched after this list.)
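
Stripped of the OpenAI client wrapper, all of that is one JSON POST to /v1/embeddings with a messages key where you’d expect input. Here’s a minimal sketch with plain requests, pointed at the 2B server from earlier; the image URL and prompt text are borrowed from the vLLM docs example above, and whether this bare-bones message shape is the right prompt for Qwen3-VL-Embedding is a separate question (the last section has the full recipe):

import requests

image_url = (
    "https://vllm-public-assets.s3.us-west-2.amazonaws.com/"
    "vision_model_images/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
)

# The raw request that create_chat_embeddings() builds: chat-style messages
# sent to the /v1/embeddings endpoint, plus vLLM's extra keys.
resp = requests.post(
    "http://localhost:8000/v1/embeddings",
    json={
        "model": "Qwen/Qwen3-VL-Embedding-2B",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": "Represent the given image."},
                ],
            }
        ],
        "encoding_format": "float",
        "continue_final_message": False,
        "add_special_tokens": False,
    },
    timeout=120,
)
resp.raise_for_status()
print("Embedding length:", len(resp.json()["data"][0]["embedding"]))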

Essentially, you are sending a templated chat message to get embeddings back. Your client sends chat-completion-style messages, with a system prompt, user messages, and all the bells and whistles, and the server renders them through the model’s chat template into a prompt like this:

<|im_start|>system
Represent the user input.<|im_end|>
<|im_start|>user
Picture 1: <|vision_start|><|image_pad|><|vision_end|>
A cat in the snow.<|im_end|>
<|im_start|>assistant
<|im_end|>

Confused yet?

You don’t have to be an ML engineer to see that this is the kind of jank that deserves its own extensive docs; I’m talking about tables of JSON input keys, types, descriptions, and examples.

Sadly, no. This is it: this one Python file. These are the docs you’re looking for, except they’re far from the docs you’re hoping for.

Usability over freshness

I still had high hopes. This is vLLM, the model serving framework that’s been crushing it left and right. Surely the complete schema for this endpoint is tucked away on some obscure page. “Maybe in the OpenAPI JSON that vLLM serves,” exclaimed I, naively.

But it’s not there; I checked. You get the plain OpenAI-compliant input model. The team at vLLM should pump the brakes on the next minor release and focus on usability and documentation a bit more.

Grow up, vLLM

Look, being OpenAI-compliant is great for users who want to use the OpenAI client; it’s flexibility in the right direction. But, at the same time, vLLM is growing far beyond the reach of OpenAI’s API with its diverse model support, and that growth shouldn’t come at the expense of good documentation and a solid schema to really make those fringe models shine.

Is there a fear that breaking API conventions established by a pioneering company will be poorly received? Everyone in the model serving domain is at the precipice of groundbreaking tech. Out with the old and in with the new already. Grow up, vLLM, and pave your own API paths (you can still pay homage to the old guard).

Serving it at last

After going to Gemini 3 Fast for help, here’s my take on a feature-complete deployment of Qwen3-VL-Embedding-2B via vllm serve.

You can swap in the 8B model with this too.

Server

If you want to make the dimensions input key work (to generate different embedding sizes on-the-fly), you need the --hf-overrides argument below.

docker run \
    --gpus all \
    --rm \
    -p 8000:8000 \
    vllm/vllm-openai:v0.15.1 \
    --model Qwen/Qwen3-VL-Embedding-2B \
    --runner pooling \
    --max-model-len 32768 \
    --hf-overrides '{"is_matryoshka": true}'

Client

Caveat: you cannot batch multimodal inputs (as of vLLM v0.15.1). You can batch text-only embeddings via the input key.
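
For text-only batches, the stock OpenAI client works unchanged; a minimal sketch, assuming the server above (the is_matryoshka override is what makes dimensions stick):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Plain OpenAI-style embeddings: text-only inputs can be batched via "input".
response = client.embeddings.create(
    model="Qwen/Qwen3-VL-Embedding-2B",
    input=["Hello, world", "Dayman, fighter of the Nightman"],
    dimensions=64,
)
print([len(item.embedding) for item in response.data])

Multimodal inputs, on the other hand, go one request at a time through the messages extension: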

curl http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-VL-Embedding-2B",
    "messages": [
      {
        "role": "system",
        "content": [{"type": "text", "text": "Represent the user input."}]
      },
      {
        "role": "user",
        "content": [
          {"type": "image_url", "image_url": {"url": "https://vllm-public-assets.s3.us-west-2.amazonaws.com/multimodal_asset/cat_snow.jpg"}},
          {"type": "text", "text": "A cat in the snow."}
        ]
      },
      {
        "role": "assistant",
        "content": [{"type": "text", "text": ""}]
      }
    ],
    "continue_final_message": true,
    "add_special_tokens": true,
    "dimensions": 256
  }'
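
And if you’d rather stay in Python, here’s a sketch of the same multimodal request, reusing the client.post trick from the vLLM example script quoted earlier (the body mirrors the curl above, nothing more):

from openai import OpenAI
from openai.types import CreateEmbeddingResponse

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
image_url = (
    "https://vllm-public-assets.s3.us-west-2.amazonaws.com/"
    "multimodal_asset/cat_snow.jpg"
)

# Same body as the curl above, sent through the OpenAI client's generic
# post() so vLLM's extra keys (messages, continue_final_message, ...) get through.
response = client.post(
    "/embeddings",
    cast_to=CreateEmbeddingResponse,
    body={
        "model": "Qwen/Qwen3-VL-Embedding-2B",
        "messages": [
            {
                "role": "system",
                "content": [{"type": "text", "text": "Represent the user input."}],
            },
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": "A cat in the snow."},
                ],
            },
            {
                "role": "assistant",
                "content": [{"type": "text", "text": ""}],
            },
        ],
        "continue_final_message": True,
        "add_special_tokens": True,
        "dimensions": 256,
    },
)
print("Embedding length:", len(response.data[0].embedding))

If the Matryoshka override is in place, that last print should report 256.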