iApp vLLM Gateway - API Guide

How to Call

Base URL:

http://api3-siamai.aieat.or.th/v1

Pick a model alias from the table below and use it as the model field in your request:

from openai import OpenAI

client = OpenAI(base_url="http://api3-siamai.aieat.or.th/v1", api_key="none")

response = client.chat.completions.create(
    model="qwen3.6-35b-multi",    # use alias from table below
    messages=[{"role": "user", "content": "Hello"}]
)
print(response.choices[0].message.content)

Choose a Model

Use this alias	Model	Type	GPUs
qwen3.6-35b-multi	Qwen3.6-35B-A3B-FP8	Base	6,7
qwen3.5-35b-multi	Qwen3.5-35B-A3B-FP8	Base	3
qwen3-reranker-8b	Qwen3-Reranker-8B	Base	5
qwen3-embedding-8b	Qwen3-Embedding-8B	Base	4

Base models and LoRA adapters that share a GPU run on the same vLLM instance. Call each by its alias — routing is automatic.

Basic Request

curl

curl http://api3-siamai.aieat.or.th/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.6-35b-multi",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

Python

from openai import OpenAI

client = OpenAI(base_url="http://api3-siamai.aieat.or.th/v1", api_key="none")

response = client.chat.completions.create(
    model="qwen3.6-35b-multi",
    messages=[{"role": "user", "content": "Hello"}]
)
print(response.choices[0].message.content)

Base Model vs LoRA Adapter

A base model and its LoRA adapters share the same GPU. The only difference in calling them is the model field — everything else stays the same.

curl

# Call the BASE model
curl http://api3-siamai.aieat.or.th/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<base-model-alias>",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

# Call a LORA adapter (fine-tuned from same base)
curl http://api3-siamai.aieat.or.th/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<lora-adapter-alias>",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

# Same endpoint, same format; only "model" changes.
# Use aliases from the "Choose a Model" table above.

Python

from openai import OpenAI

client = OpenAI(base_url="http://api3-siamai.aieat.or.th/v1", api_key="none")

# Call the BASE model
base_response = client.chat.completions.create(
    model="<base-model-alias>",
    messages=[{"role": "user", "content": "Hello"}]
)

# Call a LORA adapter (fine-tuned from same base)
lora_response = client.chat.completions.create(
    model="<lora-adapter-alias>",
    messages=[{"role": "user", "content": "Hello"}]
)

# Both use the same client, same endpoint.
# The gateway routes to the correct backend automatically.
# Base and LoRA share the same GPU — no extra cost.

Streaming

curl

curl http://api3-siamai.aieat.or.th/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.6-35b-multi",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": true
  }'

Python

# `client` is the OpenAI client created in the Basic Request example.
stream = client.chat.completions.create(
    model="qwen3.6-35b-multi",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
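
When the full reply is needed after streaming, the pieces can be joined as they arrive. The following is a minimal sketch, assuming chunks have the shape the loop above reads (choices[0].delta.content). The names collect_stream and fake_chunk are hypothetical; fake_chunk only stands in for real chunks so the helper can be tried without calling the gateway.

```python
from types import SimpleNamespace


def collect_stream(stream):
    """Join the text pieces of a chat completion stream into one string."""
    parts = []
    for chunk in stream:
        # Skip chunks that carry no choices (assumption: some servers send
        # such chunks, for example a trailing usage report).
        if not chunk.choices:
            continue
        piece = chunk.choices[0].delta.content
        if piece:  # content can be None or "" on role-only or final chunks
            parts.append(piece)
    return "".join(parts)


def fake_chunk(text):
    """Offline stand-in for one streamed chunk (illustration only)."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


demo = [fake_chunk("Hel"), fake_chunk(None), fake_chunk("lo")]
print(collect_stream(demo))  # prints: Hello
```

Against the gateway, the same helper would take the object returned by client.chat.completions.create(..., stream=True).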

Thinking Mode

The Qwen3.5 and Qwen3.6 chat models think by default, which improves reasoning. To disable thinking, set enable_thinking to false in chat_template_kwargs or append /no_think to the prompt.

Default (on)

# Thinking is enabled by default
response = client.chat.completions.create(
    model="qwen3.6-35b-multi",
    messages=[{"role": "user", "content": "Solve: 2x + 3 = 7"}]
)
# Response includes reasoning in <think>...</think> tags

Disable

# Option 1: API parameter
response = client.chat.completions.create(
    model="qwen3.6-35b-multi",
    messages=[{"role": "user", "content": "Hello"}],
    extra_body={"chat_template_kwargs": {"enable_thinking": False}}
)

# Option 2: Prompt suffix
response = client.chat.completions.create(
    model="qwen3.6-35b-multi",
    messages=[{"role": "user", "content": "Hello /no_think"}]
)
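
With thinking left on, callers often want the final answer without the reasoning. The sketch below assumes the reasoning arrives inline in the message content inside <think>...</think> tags, as described above; if the gateway is configured to return reasoning in a separate field, this step would not be needed. The name split_thinking is hypothetical.

```python
import re

# Non-greedy match so only the first <think> block is captured.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(content):
    """Return (reasoning, answer) from a reply that may contain <think> tags."""
    match = THINK_BLOCK.search(content)
    if match is None:
        return "", content.strip()
    reasoning = match.group(1).strip()
    answer = (content[:match.start()] + content[match.end():]).strip()
    return reasoning, answer


reasoning, answer = split_thinking("<think>2x = 4, so x = 2</think>\nx = 2")
print(answer)  # prints: x = 2
```

In practice the input would be response.choices[0].message.content.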

Structured JSON Output

Force valid JSON output using response_format.

JSON mode

response = client.chat.completions.create(
    model="qwen3.6-35b-multi",
    messages=[{"role": "user", "content": "List 3 countries as JSON"}],
    response_format={"type": "json_object"}
)

JSON schema

response = client.chat.completions.create(
    model="qwen3.6-35b-multi",
    messages=[{
        "role": "user",
        "content": "Extract: John is 30 and lives in Tokyo"
    }],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "person",
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"},
                    "city": {"type": "string"}
                },
                "required": ["name", "age", "city"]
            }
        }
    }
)
# {"name": "John", "age": 30, "city": "Tokyo"}
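
The reply is still delivered as a string in response.choices[0].message.content, so it has to be parsed before use. The sketch below parses it and re-checks the three fields from the schema above; a reply cut short by max_tokens can still fail to parse, which is why the check is kept. The name parse_person and the REQUIRED mapping are hypothetical.

```python
import json

# Field names and types mirror the "person" schema in the request above.
REQUIRED = {"name": str, "age": int, "city": str}


def parse_person(content):
    """Parse the reply text and check it against the expected fields."""
    data = json.loads(content)  # raises json.JSONDecodeError on invalid JSON
    for field, expected_type in REQUIRED.items():
        if not isinstance(data.get(field), expected_type):
            raise ValueError(
                f"field {field!r} is missing or not {expected_type.__name__}"
            )
    return data


person = parse_person('{"name": "John", "age": 30, "city": "Tokyo"}')
print(person["city"])  # prints: Tokyo
```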

Best Practices (avoid rate-limit / WAF blocks)

A WAF sits in front of the public endpoint and trips on high request rates or many short-lived connections. Shape your traffic to fit:

Reuse the TCP connection. Create one httpx.Client(), requests.Session(), or OpenAI SDK client and reuse it for every request. Don't open a new connection per request.
Batch in one call where the endpoint supports it. Embeddings: pass many texts in input. Reranker: pass many documents in documents. This cuts the request count 10–100×.
Stream long chat responses. With "stream": true, tokens are sent as they are generated, so the connection is not left idle while the full reply is produced.
Cap client-side concurrency. Use a semaphore (≈16 for chat, ≈32 for embeddings). Firing hundreds of requests in parallel both triggers the WAF and exceeds the gateway's max_concurrent limit (the gateway responds 429).
Handle 429 with backoff. If the gateway returns 429 or the WAF returns 403/406, wait 1–2 s and retry; double the wait on repeated failures.
Accept gzip. The gateway gzips responses ≥1 KB; SDKs send Accept-Encoding: gzip by default, so keep it on.
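
The concurrency cap and the backoff rule above can be sketched together as follows. This is an illustration under stated assumptions, not the gateway's own client: RetryableError, call_with_backoff, and run_capped are hypothetical names, a bounded thread pool stands in for the semaphore, and the limit of 5 attempts is an arbitrary choice.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class RetryableError(Exception):
    """Hypothetical marker for a 429 from the gateway or a 403/406 from the WAF."""


def call_with_backoff(fn, max_attempts=5, first_wait=1.0, sleep=time.sleep):
    """Call fn(); on RetryableError wait, double the wait, and try again."""
    wait = first_wait
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except RetryableError:
            if attempt == max_attempts:
                raise  # give up after the last attempt
            sleep(wait)
            wait *= 2  # 1 s, 2 s, 4 s, ...


def run_capped(tasks, max_workers=16):
    """Run zero-argument callables with at most max_workers in flight.

    Results are returned in the same order as the input tasks.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(call_with_backoff, tasks))
```

Each task would wrap one request on a shared client and translate the failure into RetryableError. With the OpenAI SDK, for example, openai.RateLimitError is raised on 429 and openai.PermissionDeniedError on 403. The sleep argument exists only so the waits can be observed or skipped in tests.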

Embeddings

Qwen3-Embedding uses the /v1/embeddings endpoint. Vectors come back L2-normalized (4096 dim by default; request smaller dims with dimensions — matryoshka is enabled).

curl

curl http://api3-siamai.aieat.or.th/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-embedding-8b",
    "input": ["first text", "second text"],
    "dimensions": 1024
  }'

# `input` accepts a string or a list of strings (batched).
# `dimensions` is optional; omit it to get the full 4096-dim vector.

Python

from openai import OpenAI

client = OpenAI(base_url="http://api3-siamai.aieat.or.th/v1", api_key="none")

resp = client.embeddings.create(
    model="qwen3-embedding-8b",
    input=["first text", "second text"],
    dimensions=1024,   # optional; default 4096
)
for item in resp.data:
    print(item.index, len(item.embedding), item.embedding[:3])

# Tip — for asymmetric retrieval (query vs. passage), use the same model and
# prepend an instruction to the QUERY only (passages stay raw):
#   "Instruct: Given a search query, retrieve the most relevant passage\n"
#   "Query: "

Reranker

Qwen3-Reranker uses a different endpoint (/v1/rerank) and expects the query in <Instruct>/<Query> format. Without the instruction prefix the scores are noisy.

curl

curl http://api3-siamai.aieat.or.th/v1/rerank \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-reranker-8b",
    "query": "<Instruct>: Given a search query, retrieve the most relevant passage that answers the query\n<Query>: your actual question here",
    "documents": ["doc 1 text", "doc 2 text", "doc 3 text"]
  }'

# Response: results pre-sorted by relevance_score (high to low),
# each item has the original `index` so you can map back to your list.

Python

import httpx

INSTRUCTION = "Given a search query, retrieve the most relevant passage that answers the query"

# One shared client so repeated calls reuse the same connection.
http = httpx.Client(timeout=60)

def rerank(query: str, documents: list[str]):
    r = http.post(
        "http://api3-siamai.aieat.or.th/v1/rerank",
        json={
            "model": "qwen3-reranker-8b",
            "query": f"<Instruct>: {INSTRUCTION}\n<Query>: {query}",
            "documents": documents,
        },
    )
    r.raise_for_status()  # surface 4xx/5xx instead of failing on a missing "results" key
    return r.json()["results"]

# Tips:
# - Generic instruction works; domain-specific is better.
# - Works with any language (Thai, English, etc.) — model is multilingual.
# - Without the <Instruct>/<Query> prefix, scoring is much noisier.
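
Since each result carries the original index, mapping scores back to the input list is a short step. The sketch below assumes the result items have the index and relevance_score fields described above; top_documents is a hypothetical name and the scores in sample_results are made up for illustration.

```python
def top_documents(results, documents, k=3):
    """Pair the k best results with their source text via the original index."""
    # Sorting again is harmless if the results are already ordered.
    ordered = sorted(results, key=lambda r: r["relevance_score"], reverse=True)
    return [(documents[r["index"]], r["relevance_score"]) for r in ordered[:k]]


docs = ["doc 1 text", "doc 2 text", "doc 3 text"]
# Made-up scores for illustration; real values come from rerank(query, docs).
sample_results = [
    {"index": 2, "relevance_score": 0.91},
    {"index": 0, "relevance_score": 0.40},
    {"index": 1, "relevance_score": 0.05},
]
for text, score in top_documents(sample_results, docs, k=2):
    print(score, text)
```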

Parameters

Parameter	Default	Description
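temperature	0.7	Sampling randomness (0.0 - 2.0); lower values are more deterministic
max_tokens	-	Maximum number of tokens to generate
top_p	0.9	Nucleus sampling threshold (0.0 - 1.0)
top_k	20	Sample only from the k most likely tokens
frequency_penalty	0.0	Penalizes tokens in proportion to how often they have already appeared
stop	-	Stop sequences; generation ends when one is produced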
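
With the OpenAI Python SDK, most of these parameters are ordinary keyword arguments, but top_k is not part of the SDK's create() signature and has to travel in extra_body, the same mechanism the Thinking Mode example uses. The sketch below builds the keyword arguments under that assumption; build_chat_kwargs and SDK_PARAMS are hypothetical names.

```python
# Parameters from the table that the OpenAI SDK accepts directly.
SDK_PARAMS = {"temperature", "max_tokens", "top_p", "frequency_penalty", "stop"}


def build_chat_kwargs(model, prompt, **sampling):
    """Split sampling options into SDK keyword arguments and extra_body."""
    kwargs = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    extra = {}
    for name, value in sampling.items():
        if name in SDK_PARAMS:
            kwargs[name] = value
        else:
            extra[name] = value  # e.g. top_k, forwarded in the request body
    if extra:
        kwargs["extra_body"] = extra
    return kwargs


kwargs = build_chat_kwargs(
    "qwen3.6-35b-multi", "Hello", temperature=0.2, max_tokens=256, top_k=40
)
print(kwargs["extra_body"])  # prints: {'top_k': 40}
```

The result would be passed as client.chat.completions.create(**kwargs).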