How to Call
Base URL: http://api3-siamai.aieat.or.th/v1
Pick a model alias from the table below and use it as the model field in your request:
from openai import OpenAI
client = OpenAI(base_url="http://api3-siamai.aieat.or.th/v1", api_key="none")
response = client.chat.completions.create(
    model="qwen3.6-35b-multi",  # use alias from table below
    messages=[{"role": "user", "content": "Hello"}]
)
print(response.choices[0].message.content)
Choose a Model
| Use this alias | Model | Type | GPUs |
|---|---|---|---|
| qwen3.6-35b-multi | Qwen3.6-35B-A3B-FP8 | Base | 6,7 |
| qwen3.5-35b-multi | Qwen3.5-35B-A3B-FP8 | Base | 3 |
| qwen3-reranker-8b | Qwen3-Reranker-8B | Base | 5 |
| qwen3-embedding-8b | Qwen3-Embedding-8B | Base | 4 |
Base models and LoRA adapters that share a GPU run on the same vLLM instance. Call each by its alias — routing is automatic.
Basic Request
curl http://api3-siamai.aieat.or.th/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3.6-35b-multi",
"messages": [{"role": "user", "content": "Hello"}]
}'
from openai import OpenAI
client = OpenAI(base_url="http://api3-siamai.aieat.or.th/v1", api_key="none")
response = client.chat.completions.create(
    model="qwen3.6-35b-multi",
    messages=[{"role": "user", "content": "Hello"}]
)
print(response.choices[0].message.content)
Base Model vs LoRA Adapter
A base model and its LoRA adapters share the same GPU. The only difference in calling them is the model field — everything else stays the same.
# Call the BASE model
curl http://api3-siamai.aieat.or.th/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "<base-model-alias>",
"messages": [{"role": "user", "content": "Hello"}]
}'
# Call a LORA adapter (fine-tuned from same base)
curl http://api3-siamai.aieat.or.th/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "<lora-adapter-alias>",
"messages": [{"role": "user", "content": "Hello"}]
}'
# Same endpoint, same format — only "model" changes.
# Use aliases from the "Choose a Model" table above.
from openai import OpenAI
client = OpenAI(base_url="http://api3-siamai.aieat.or.th/v1", api_key="none")
# Call the BASE model
base_response = client.chat.completions.create(
    model="<base-model-alias>",
    messages=[{"role": "user", "content": "Hello"}]
)
# Call a LORA adapter (fine-tuned from same base)
lora_response = client.chat.completions.create(
    model="<lora-adapter-alias>",
    messages=[{"role": "user", "content": "Hello"}]
)
# Both use the same client, same endpoint.
# The gateway routes to the correct backend automatically.
# Base and LoRA share the same GPU — no extra cost.
Streaming
curl http://api3-siamai.aieat.or.th/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3.6-35b-multi",
"messages": [{"role": "user", "content": "Hello"}],
"stream": true
}'
stream = client.chat.completions.create(
    model="qwen3.6-35b-multi",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
Thinking Mode
The Qwen chat models listed above think by default for better reasoning. Disable thinking by setting enable_thinking to false through chat_template_kwargs, or by appending /no_think to the prompt.
# Thinking is enabled by default
response = client.chat.completions.create(
    model="qwen3.6-35b-multi",
    messages=[{"role": "user", "content": "Solve: 2x + 3 = 7"}]
)
# Response includes reasoning in <think>...</think> tags
# Option 1: API parameter
response = client.chat.completions.create(
    model="qwen3.6-35b-multi",
    messages=[{"role": "user", "content": "Hello"}],
    extra_body={"chat_template_kwargs": {"enable_thinking": False}}
)
# Option 2: Prompt suffix
messages=[{"role": "user", "content": "Hello /no_think"}]
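If thinking stays enabled, the caller usually wants the final answer separated from the reasoning. The following is a minimal sketch, assuming the reasoning arrives inline in the message content wrapped in `<think>...</think>` tags as described above. `split_thinking` is a hypothetical helper name and `sample` is a made-up reply, not real model output.

```python
import re

# Matches the first <think>...</think> block, including newlines inside it.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(content):
    """Return (reasoning, answer) from a reply that may contain <think> tags."""
    match = THINK_RE.search(content)
    if match is None:
        # No reasoning block: the whole reply is the answer.
        return "", content.strip()
    reasoning = match.group(1).strip()
    answer = (content[:match.start()] + content[match.end():]).strip()
    return reasoning, answer


sample = "<think>\n2x = 4, so x = 2\n</think>\n\nx = 2"
reasoning, answer = split_thinking(sample)
print(answer)  # x = 2
```

In practice the input would be `response.choices[0].message.content`. If the gateway is configured to return reasoning in a separate field instead, this helper simply returns the content unchanged as the answer.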
Structured JSON Output
Force valid JSON output using response_format.
response = client.chat.completions.create(
    model="qwen3.6-35b-multi",
    messages=[{"role": "user", "content": "List 3 countries as JSON"}],
    response_format={"type": "json_object"}
)
response = client.chat.completions.create(
    model="qwen3.6-35b-multi",
    messages=[{
        "role": "user",
        "content": "Extract: John is 30 and lives in Tokyo"
    }],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "person",
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"},
                    "city": {"type": "string"}
                },
                "required": ["name", "age", "city"]
            }
        }
    }
)
# {"name": "John", "age": 30, "city": "Tokyo"}
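The reply still arrives as a string in `message.content`, so it has to be parsed before use. Below is a small sketch of parsing and sanity-checking the `person` object from the schema example. `parse_person` is a hypothetical helper, and the checks mirror only the `required` list and the integer `age` from that schema.

```python
import json

REQUIRED_KEYS = ("name", "age", "city")


def parse_person(content):
    """Parse a JSON reply and check it against the 'person' schema fields."""
    data = json.loads(content)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(data["age"], int) or isinstance(data["age"], bool):
        raise ValueError("age must be an integer")
    return data


person = parse_person('{"name": "John", "age": 30, "city": "Tokyo"}')
print(person["city"])  # Tokyo
```

With a live response, call `parse_person(response.choices[0].message.content)`.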
Best Practices (avoid rate-limit / WAF blocks)
A WAF sits in front of the public endpoint and trips on high request rates or many short-lived connections. Shape your traffic to fit:
- Reuse the TCP connection. Create one httpx.Client() / requests.Session() / OpenAI SDK client and reuse it. Don't open a new connection per request.
- Batch in one call where the endpoint supports it. Embeddings: pass many texts in input. Reranker: pass many documents in documents. Cuts request count 10–100×.
- Stream long chat responses. "stream": true releases the connection as tokens arrive instead of holding it for the full reply.
- Cap client-side concurrency. Use a semaphore (≈16 for chat, ≈32 for embeddings). Firing hundreds in parallel both triggers the WAF and exceeds max_concurrent here (gateway responds 429).
- Handle 429 with backoff. If the gateway returns 429 or the WAF returns 403/406, wait 1–2 s and retry; double the wait on repeated failures.
- Accept gzip. The gateway now gzips responses ≥1 KB; SDKs send Accept-Encoding: gzip by default, so keep it on.
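The concurrency cap and the backoff rule can be sketched in a few lines. This is an illustration under stated assumptions, not the gateway's prescribed client: `call_with_backoff` and `run_capped` are hypothetical helper names, `send` is any zero-argument callable returning an object with a `status_code` attribute (as httpx and requests responses have), and a bounded thread pool stands in for the semaphore.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from types import SimpleNamespace

# 429 from the gateway, 403/406 from the WAF, as listed above.
RETRYABLE_STATUSES = {429, 403, 406}


def call_with_backoff(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send() until the status is not retryable or attempts run out."""
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        response = send()
        if response.status_code not in RETRYABLE_STATUSES:
            return response
        if attempt < max_attempts:
            sleep(delay)
            delay *= 2  # double the wait on repeated failures
    return response  # still failing: hand the last response to the caller


def run_capped(jobs, limit=16):
    """Run zero-argument callables with at most `limit` in flight at once."""
    with ThreadPoolExecutor(max_workers=limit) as pool:
        return list(pool.map(lambda job: job(), jobs))  # keeps input order


# Offline demonstration with fake responses and a recorded sleep.
statuses = iter([429, 403, 200])
waits = []
result = call_with_backoff(
    lambda: SimpleNamespace(status_code=next(statuses)),
    sleep=waits.append,
)
print(result.status_code, waits)  # 200 [1.0, 2.0]
```

In real use, each job passed to `run_capped` would wrap one request in `call_with_backoff`, sharing a single client so the TCP connection is reused.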
Embeddings
Qwen3-Embedding uses the /v1/embeddings endpoint. Vectors come back L2-normalized (4096 dim by default; request smaller dims with dimensions — matryoshka is enabled).
curl http://api3-siamai.aieat.or.th/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3-embedding-8b",
"input": ["first text", "second text"],
"dimensions": 1024
}'
# `input` accepts a string or a list of strings (batched).
# `dimensions` is optional; omit it to get the full 4096-dim vector.
from openai import OpenAI
client = OpenAI(base_url="http://api3-siamai.aieat.or.th/v1", api_key="none")
resp = client.embeddings.create(
    model="qwen3-embedding-8b",
    input=["first text", "second text"],
    dimensions=1024,  # optional; default 4096
)
for item in resp.data:
    print(item.index, len(item.embedding), item.embedding[:3])
# Tip — for asymmetric retrieval (query vs. passage), use the same model and
# prepend an instruction to the QUERY only (passages stay raw):
# "Instruct: Given a search query, retrieve the most relevant passage\n"
# "Query: "
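Because the vectors are L2-normalized, cosine similarity reduces to a plain dot product. The sketch below shows the query-side instruction prefix from the tip and a dot-product ranking. `format_query`, `dot`, and `rank_passages` are hypothetical helper names, and the two-dimensional vectors are toy values chosen for illustration rather than real embeddings.

```python
QUERY_INSTRUCTION = "Given a search query, retrieve the most relevant passage"


def format_query(query, instruction=QUERY_INSTRUCTION):
    """Prefix the instruction to a QUERY; passages are embedded raw."""
    return f"Instruct: {instruction}\nQuery: {query}"


def dot(a, b):
    """Dot product; equals cosine similarity for L2-normalized vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


def rank_passages(query_vec, passage_vecs):
    """Return passage indices ordered from most to least similar."""
    scores = [dot(query_vec, vec) for vec in passage_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy unit vectors standing in for real embeddings.
query_vec = [1.0, 0.0]
passage_vecs = [[0.0, 1.0], [0.6, 0.8], [1.0, 0.0]]
print(rank_passages(query_vec, passage_vecs))  # [2, 1, 0]
```

With live data, embed `format_query(user_query)` for the query and the raw passage texts for the corpus, requesting the same `dimensions` value for both.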
Reranker
Qwen3-Reranker uses a different endpoint (/v1/rerank) and expects the query in <Instruct>/<Query> format. Without the instruction prefix the scores are noisy.
curl http://api3-siamai.aieat.or.th/v1/rerank \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3-reranker-8b",
"query": "<Instruct>: Given a search query, retrieve the most relevant passage that answers the query\n<Query>: your actual question here",
"documents": ["doc 1 text", "doc 2 text", "doc 3 text"]
}'
# Response: results pre-sorted by relevance_score (high to low),
# each item has the original `index` so you can map back to your list.
import httpx
INSTRUCTION = "Given a search query, retrieve the most relevant passage that answers the query"
def rerank(query: str, documents: list[str]):
    # The reranker expects the query wrapped in the <Instruct>/<Query> format.
    r = httpx.post(
        "http://api3-siamai.aieat.or.th/v1/rerank",
        json={
            "model": "qwen3-reranker-8b",
            "query": f"<Instruct>: {INSTRUCTION}\n<Query>: {query}",
            "documents": documents,
        },
        timeout=60,
    )
    r.raise_for_status()  # surface HTTP errors instead of failing on a missing "results" key
    return r.json()["results"]
# Tips:
# - Generic instruction works; domain-specific is better.
# - Works with any language (Thai, English, etc.) — model is multilingual.
# - Without the <Instruct>/<Query> prefix, scoring is much noisier.
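Since each result carries the original `index`, mapping scores back to the document texts is a short step. The sketch below assumes only the two fields described above (`index` and `relevance_score`). `attach_documents` is a hypothetical helper, and `sample_results` is invented data in that shape, not real reranker output.

```python
def attach_documents(results, documents, top_n=None):
    """Pair each result with its source text, keeping the returned order."""
    ranked = [
        (documents[item["index"]], item["relevance_score"])
        for item in results
    ]
    return ranked if top_n is None else ranked[:top_n]


documents = ["doc 1 text", "doc 2 text", "doc 3 text"]
# Invented example in the described shape: pre-sorted, high to low.
sample_results = [
    {"index": 2, "relevance_score": 0.91},
    {"index": 0, "relevance_score": 0.40},
    {"index": 1, "relevance_score": 0.05},
]
for text, score in attach_documents(sample_results, documents, top_n=2):
    print(f"{score:.2f}  {text}")
```

With a live call, pass the list returned by the reranker as `results` together with the same `documents` list that was sent in the request.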
Parameters
| Parameter | Default | Description |
|---|---|---|
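| temperature | 0.7 | Sampling randomness (0.0 to 2.0); lower is more deterministic |
| max_tokens | - | Maximum number of tokens to generate |
| top_p | 0.9 | Nucleus sampling threshold (0.0 to 1.0) |
| top_k | 20 | Sample only from the k most likely tokens |
| frequency_penalty | 0.0 | Penalizes tokens by how often they have already appeared |
| stop | - | Sequences at which generation stops |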
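These parameters can be collected into one set of keyword arguments. The sketch below uses the defaults from the table. `build_sampling_kwargs` is a hypothetical helper. `top_k` is not a named argument of the OpenAI SDK's `chat.completions.create`, so it is passed through `extra_body`. That the gateway accepts it there is an assumption based on the vLLM backend mentioned earlier.

```python
def build_sampling_kwargs(temperature=0.7, top_p=0.9, top_k=20,
                          frequency_penalty=0.0, max_tokens=None, stop=None):
    """Build keyword arguments for client.chat.completions.create()."""
    if not 0.0 <= temperature <= 2.0:
        raise ValueError("temperature must be between 0.0 and 2.0")
    if not 0.0 <= top_p <= 1.0:
        raise ValueError("top_p must be between 0.0 and 1.0")
    kwargs = {
        "temperature": temperature,
        "top_p": top_p,
        "frequency_penalty": frequency_penalty,
        # Assumption: the gateway reads top_k from the request body (extra_body).
        "extra_body": {"top_k": top_k},
    }
    # Optional parameters are only sent when set.
    if max_tokens is not None:
        kwargs["max_tokens"] = max_tokens
    if stop is not None:
        kwargs["stop"] = stop
    return kwargs


kwargs = build_sampling_kwargs(temperature=0.2, max_tokens=256, stop=["\n\n"])
print(sorted(kwargs))
# ['extra_body', 'frequency_penalty', 'max_tokens', 'stop', 'temperature', 'top_p']
```

A request would then look like `client.chat.completions.create(model="qwen3.6-35b-multi", messages=messages, **kwargs)`.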