Serve Qwen 3 with vLLM

Run Qwen3-8B as an OpenAI-compatible API. Query it from your local machine or plug it into any app that speaks the OpenAI format.
cassian.yaml
name: qwen-server

gpu:
  count: 1
  type: a100-sxm

disk: 50G

storage: true

ports:
  - "8000:8000"
cassian up
cassian exec -d "vllm serve Qwen/Qwen3-8B \
  --port 8000 --host 0.0.0.0 \
  --enable-reasoning --reasoning-parser deepseek_r1"
cassian forward
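vLLM needs time to download and load the weights, so requests sent right after forwarding the port may be refused. Below is a minimal sketch of a readiness poll run from your machine. It assumes the server answers on the OpenAI-style /v1/models route at the forwarded port; the function names and the timeout values are illustrative and not part of Cassian or vLLM.

```python
import time
import urllib.request


def server_is_up(url: str = "http://localhost:8000/v1/models") -> bool:
    """Return True if the forwarded port answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError and connection resets are all OSError subclasses.
        return False


def wait_until_ready(probe=server_is_up, timeout=600.0, interval=5.0, sleep=time.sleep) -> bool:
    """Call probe() every `interval` seconds until it succeeds or `timeout` elapses."""
    waited = 0.0
    while True:
        if probe():
            return True
        if waited >= timeout:
            return False
        sleep(interval)
        waited += interval


# Usage (blocks until the server responds or ten minutes pass):
#   if not wait_until_ready():
#       raise SystemExit("vLLM did not come up in time")
```

The probe and sleep functions are injectable so the loop can be exercised without a running server.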
Now from your machine:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-8B",
    "messages": [{"role": "user", "content": "Explain attention mechanisms in one paragraph"}],
    "max_tokens": 512
  }'
Or point any OpenAI SDK at it:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
resp = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[{"role": "user", "content": "Write a Python quicksort in 10 lines"}],
)
print(resp.choices[0].message.content)
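With a reasoning parser enabled, the server separates the model's thinking from its final answer. Here is a small sketch of pulling the two apart from the JSON body that the curl call returns. It assumes the thinking text arrives in a reasoning_content field on the message, which depends on the vLLM version; the sample response is hand-written for illustration and is not real model output.

```python
def split_reasoning(response: dict) -> tuple[str, str]:
    """Return (reasoning, answer) from a chat-completions response body."""
    message = response["choices"][0]["message"]
    # ASSUMPTION: the reasoning parser stores the thinking text under
    # `reasoning_content`; if the field is absent we fall back to "".
    reasoning = message.get("reasoning_content") or ""
    answer = message.get("content") or ""
    return reasoning.strip(), answer.strip()


# Hand-written example body (illustrative only).
sample = {
    "choices": [
        {
            "message": {
                "role": "assistant",
                "reasoning_content": "The user wants a short summary.\n",
                "content": "Attention weighs tokens by relevance.",
            }
        }
    ]
}

reasoning, answer = split_reasoning(sample)
print("answer:", answer)
```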
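For a larger model such as Qwen3-32B (the Qwen 3 family has no 72B checkpoint), use 4 GPUs and shard the model with tensor parallelism:
gpu:
  count: 4
  type: a100-sxm
cassian exec -d "vllm serve Qwen/Qwen3-32B --tensor-parallel-size 4 --port 8000 --host 0.0.0.0"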

Text-to-Speech with Kokoro

Run Kokoro TTS as an HTTP service and generate speech from text. Useful for building voice apps or generating training data.
cassian.yaml
name: tts-server

gpu:
  count: 1
  type: rtx3090

disk: 50G

storage: true

ports:
  - "8080:8080"
cassian up
cassian exec "pip install --break-system-packages -q kokoro-onnx soundfile fastapi uvicorn"
cassian exec "wget -q https://github.com/thewh1teagle/kokoro-onnx/releases/download/model-files-v1.0/kokoro-v1.0.onnx \
  && wget -q https://github.com/thewh1teagle/kokoro-onnx/releases/download/model-files-v1.0/voices-v1.0.bin"
cassian exec -d "python serve_tts.py"
cassian forward
Example serve_tts.py:
from fastapi import FastAPI
from fastapi.responses import Response
from kokoro_onnx import Kokoro
import soundfile as sf
import io

app = FastAPI()
kokoro = Kokoro("kokoro-v1.0.onnx", "voices-v1.0.bin")

@app.post("/tts")
async def tts(text: str, voice: str = "af_heart"):
    samples, sr = kokoro.create(text, voice=voice, speed=1.0)
    buf = io.BytesIO()
    sf.write(buf, samples, sr, format="WAV")
    return Response(content=buf.getvalue(), media_type="audio/wav")

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8080)
Generate speech from your machine:
curl -X POST "http://localhost:8080/tts?text=Hello%20from%20Cassian&voice=af_heart" \
  --output speech.wav
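The same request can be made from Python with only the standard library. This is a sketch against the /tts endpoint defined in serve_tts.py above; the helper names, the timeout, and the WAV header check are illustrative additions.

```python
import urllib.parse
import urllib.request


def tts_url(text: str, voice: str = "af_heart", base: str = "http://localhost:8080") -> str:
    """Build the /tts URL; text and voice travel as query parameters."""
    query = urllib.parse.urlencode({"text": text, "voice": voice})
    return f"{base}/tts?{query}"


def looks_like_wav(data: bytes) -> bool:
    """Cheap sanity check: a WAV file starts with a RIFF header tagged WAVE."""
    return len(data) >= 12 and data[:4] == b"RIFF" and data[8:12] == b"WAVE"


def synthesize(text: str, voice: str = "af_heart", out_path: str = "speech.wav") -> str:
    """POST to the forwarded server and write the returned audio to disk."""
    req = urllib.request.Request(tts_url(text, voice), method="POST")
    with urllib.request.urlopen(req, timeout=120) as resp:
        audio = resp.read()
    if not looks_like_wav(audio):
        raise ValueError("server did not return WAV audio")
    with open(out_path, "wb") as f:
        f.write(audio)
    return out_path


# Usage, with the port forwarded:
#   synthesize("Hello from Cassian")
```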

Transcribe audio with Whisper

Run Whisper large-v3-turbo for fast audio transcription.
cassian.yaml
name: whisper

gpu:
  count: 1
  type: rtx3090

disk: 50G

ports:
  - "9000:9000"
cassian up
cassian exec "pip install --break-system-packages -q faster-whisper fastapi uvicorn python-multipart"
cassian exec -d "python serve_whisper.py"
cassian forward
Example serve_whisper.py:
from fastapi import FastAPI, UploadFile
from faster_whisper import WhisperModel
import tempfile

app = FastAPI()
model = WhisperModel("large-v3-turbo", device="cuda", compute_type="float16")

@app.post("/transcribe")
async def transcribe(file: UploadFile):
    with tempfile.NamedTemporaryFile(suffix=".wav") as f:
        f.write(await file.read())
        f.flush()
        segments, info = model.transcribe(f.name, beam_size=5)
        text = " ".join(s.text for s in segments)
    return {"text": text.strip(), "language": info.language, "duration": info.duration}

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=9000)
Transcribe from your machine:
curl -X POST http://localhost:9000/transcribe \
  -F "file=@recording.wav"
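For callers without curl, the upload can be sketched in Python using only the standard library. The form field must be named file to match the transcribe handler above; the helper names and the timeout are assumptions made for illustration.

```python
import json
import os
import urllib.request
import uuid


def encode_multipart(field: str, filename: str, data: bytes, boundary: str) -> tuple[bytes, str]:
    """Encode one file as multipart/form-data; returns (body, content_type)."""
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def transcribe(path: str, url: str = "http://localhost:9000/transcribe") -> dict:
    """Upload an audio file and return the parsed JSON response."""
    with open(path, "rb") as f:
        audio = f.read()
    body, content_type = encode_multipart(
        "file", os.path.basename(path), audio, uuid.uuid4().hex
    )
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(req, timeout=600) as resp:
        return json.load(resp)


# Usage, with the port forwarded:
#   print(transcribe("recording.wav")["text"])
```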

Fine-tune Qwen with LoRA

Train a LoRA adapter on your own dataset. Model weights are cached in cloud storage, so you don’t re-download them on restart.
cassian.yaml
name: finetune

gpu:
  count: 1
  type: a100-sxm

disk: 50G

storage: true

workspace:
  no_sync:
    - "checkpoints/"
  exclude:
    - "__pycache__/"
    - "wandb/"
cassian up
cassian exec "HF_HOME=/workspace/storage/hf python finetune.py \
  --model Qwen/Qwen3-8B \
  --dataset ./data/train.jsonl \
  --output /workspace/checkpoints \
  --epochs 3 \
  --lora-rank 16"
cassian down
  • checkpoints/ persists across sessions but doesn’t sync locally
  • Model weights in /workspace/storage survive restarts without eating disk
  • wandb/ is excluded since W&B syncs to its own cloud
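finetune.py is your own script; the command above only fixes the flags it must accept. Here is a minimal sketch of an argument parser for those flags. The defaults and help strings are assumptions, and the training loop itself is left out.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Command-line interface matching the flags used in the exec command."""
    parser = argparse.ArgumentParser(description="LoRA fine-tuning entry point (sketch)")
    parser.add_argument("--model", required=True, help="Hugging Face model ID")
    parser.add_argument("--dataset", required=True, help="path to the training JSONL file")
    parser.add_argument("--output", required=True, help="directory for checkpoints")
    parser.add_argument("--epochs", type=int, default=3)
    # argparse exposes --lora-rank as args.lora_rank
    parser.add_argument("--lora-rank", type=int, default=16)
    return parser


# In finetune.py:
#   args = build_parser().parse_args()
#   ... load args.model, train, save the adapter to args.output ...
```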

Image generation with FLUX

Serve FLUX.1-schnell for fast image generation.
cassian.yaml
name: flux-server

gpu:
  count: 1
  type: a100-sxm

disk: 50G

storage: true

ports:
  - "8080:8080"
cassian up
cassian exec "pip install --break-system-packages -q diffusers transformers accelerate sentencepiece protobuf torch fastapi uvicorn"
cassian exec -d "HF_HOME=/workspace/storage/hf python serve_flux.py"
cassian forward
Example serve_flux.py:
from fastapi import FastAPI
from fastapi.responses import Response
from diffusers import FluxPipeline
import torch, io

app = FastAPI()
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
).to("cuda")

@app.post("/generate")
async def generate(prompt: str, steps: int = 4):
    image = pipe(prompt, num_inference_steps=steps, guidance_scale=0.0).images[0]
    buf = io.BytesIO()
    image.save(buf, format="PNG")
    return Response(content=buf.getvalue(), media_type="image/png")

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8080)
Generate an image from your machine:
curl -X POST "http://localhost:8080/generate?prompt=a+cat+astronaut+on+mars" \
  --output image.png
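The generate handler also accepts a steps query parameter, which the curl call leaves at its default of 4. Below is a sketch of a Python client that sets it explicitly and checks that the reply is a PNG; the helper names and the timeout are illustrative.

```python
import urllib.parse
import urllib.request

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first eight bytes of every PNG file


def generate_url(prompt: str, steps: int = 4, base: str = "http://localhost:8080") -> str:
    """Build the /generate URL with prompt and steps as query parameters."""
    query = urllib.parse.urlencode({"prompt": prompt, "steps": steps})
    return f"{base}/generate?{query}"


def generate(prompt: str, steps: int = 4, out_path: str = "image.png") -> str:
    """POST to the forwarded server and write the returned image to disk."""
    req = urllib.request.Request(generate_url(prompt, steps), method="POST")
    with urllib.request.urlopen(req, timeout=300) as resp:
        image = resp.read()
    if not image.startswith(PNG_SIGNATURE):
        raise ValueError("server did not return a PNG image")
    with open(out_path, "wb") as f:
        f.write(image)
    return out_path


# Usage, with the port forwarded:
#   generate("a cat astronaut on mars", steps=4)
```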

Multi-GPU distributed training

Scale to multiple GPUs with torchrun.
cassian.yaml
name: distributed

gpu:
  count: 4
  type: a100-sxm

disk: 100G

storage: true
cassian up
cassian exec "torchrun --nproc_per_node=4 train.py \
  --batch_size 128 \
  --output /workspace/storage/checkpoints"
cassian down
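torchrun starts one process per GPU and tells each its position through the RANK, LOCAL_RANK and WORLD_SIZE environment variables. Here is a sketch of how train.py might read them. Whether --batch_size is a global or per-GPU value is up to your script; this example assumes it is global and splits it evenly, and the helper names are hypothetical.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DistEnv:
    rank: int        # global index of this process
    local_rank: int  # index of this process on its node, used to pick the GPU
    world_size: int  # total number of processes


def read_dist_env(environ=os.environ) -> DistEnv:
    """Read the variables torchrun sets; fall back to single-process values."""
    return DistEnv(
        rank=int(environ.get("RANK", 0)),
        local_rank=int(environ.get("LOCAL_RANK", 0)),
        world_size=int(environ.get("WORLD_SIZE", 1)),
    )


def per_rank_batch_size(global_batch: int, world_size: int) -> int:
    """ASSUMPTION: --batch_size is global, so each rank gets an equal share."""
    if global_batch % world_size:
        raise ValueError("global batch size must be divisible by world size")
    return global_batch // world_size


env = read_dist_env()
print(f"rank {env.rank}/{env.world_size}, batch {per_rank_batch_size(128, env.world_size)}")
```

With --nproc_per_node=4 and --batch_size 128, each process would handle 32 samples per step under this assumption.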

Jupyter on a GPU

Run notebooks with full CUDA access.
cassian.yaml
name: notebook

gpu:
  count: 1
  type: rtx3090

disk: 50G

ports:
  - "8888:8888"
cassian up
cassian exec -d "jupyter lab --ip 0.0.0.0 --port 8888 \
  --no-browser --allow-root \
  --NotebookApp.token='cassian'"
cassian forward
Open http://localhost:8888/?token=cassian in your browser.
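A quick first cell can confirm that the notebook sees the GPU. This sketch shells out to nvidia-smi and assumes the tool is on the PATH inside the instance; it returns an empty list otherwise. The function name is illustrative.

```python
import shutil
import subprocess


def gpu_summary() -> list[str]:
    """Return one 'name, total memory' line per visible GPU, or [] if none found."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


for line in gpu_summary():
    print(line)
```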