The vLLM template lets you spin up an inference server in minutes. Pick a model from the catalog or provide your own weights.
Available models
- Falcon3-3B
- Falcon3-7B
- Falcon3-10B
- Falcon3-Mamba-7B
These ship as ready-to-run choices in the vLLM template. You can also mount your own models from storage or pull from public registries.
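If you bring your own weights, vLLM can load them from a local directory the same way it loads catalog models. Here is a minimal sketch using vLLM's offline Python API; the mount path `/models/my-custom-model` is a hypothetical placeholder for wherever your storage is attached.

```python
from vllm import LLM, SamplingParams

# Point vLLM at a local directory of weights (hypothetical mount path),
# or at a Hugging Face model id such as "tiiuae/Falcon3-7B-Instruct".
llm = LLM(model="/models/my-custom-model")

# Generate a short completion to verify the model loaded correctly.
params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["Explain what an inference server does."], params)
print(outputs[0].outputs[0].text)
```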
Coming soon
- Llama family
- Mistral family
- Qwen family
- GPT-OSS
How to deploy (quick)
- Create instance → choose vLLM inference template.
- Select a model and size (GPU or CPU as needed).
- Launch. The server exposes an HTTP endpoint by default (see the request example after these steps); add ports/SSH if required.
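Once the instance is running, the endpoint speaks vLLM's OpenAI-compatible API, so any OpenAI-style client can talk to it. Below is a minimal sketch using `requests`; the address `http://your-instance:8000` is a hypothetical placeholder (vLLM's server listens on port 8000 by default).

```python
import requests

# Placeholder address; substitute your instance's hostname and exposed port.
BASE_URL = "http://your-instance:8000"

# List the models the server is currently serving.
models = requests.get(f"{BASE_URL}/v1/models").json()
print([m["id"] for m in models["data"]])

# Send a chat completion request to the first served model.
resp = requests.post(
    f"{BASE_URL}/v1/chat/completions",
    json={
        "model": models["data"][0]["id"],
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 32,
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```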
Host hardware: AMD EPYC 7713 CPUs; available GPU sizes are listed in GPU types and sizes.