Default setup
When you create an inference server in the Hivenet console, vLLM is configured automatically with sensible defaults, so you can pick a model and get started right away without adjusting low-level parameters unless you want to. By default, vLLM applies:
- Standard memory allocation (GPU-first, minimal CPU offload).
- Context length up to 8192 tokens.
- Batch scheduling for multiple concurrent requests.
- Balanced sampling settings for deterministic yet flexible outputs.
- HTTPS endpoint access, with optional TCP/UDP or Jupyter Notebook.
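Because the server exposes vLLM's standard OpenAI-compatible API over the HTTPS endpoint, a quick sanity check is to list the served model. This is a minimal sketch; the URL is a placeholder, and the authentication token (covered under Basic settings below) is omitted here:

```python
import requests

# Placeholder URL -- use the endpoint shown in your Hivenet console.
BASE_URL = "https://your-server.example"

# vLLM's OpenAI-compatible server exposes /v1/models; calling it confirms
# the endpoint is up and shows the served model's name.
resp = requests.get(f"{BASE_URL}/v1/models", timeout=30)
resp.raise_for_status()
for model in resp.json()["data"]:
    print(model["id"])
```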
Optional configuration
If you want more control, you can open the vLLM configuration panel during setup. Here you’ll find three groups of settings:
1. Basic
- Authentication token: Used to securely call your inference endpoint (see the example after this list).
- Context length: Maximum tokens processed per request.
- Concurrent requests: How many clients can connect at once.
- Tensor parallel size & precision: Let vLLM auto-configure these, or set them manually.
- Prefix caching: Speeds up repeated prompts.
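As a sketch of how the authentication token and context length come into play, here is a hedged example of a chat request. It assumes the endpoint accepts the token as a standard Bearer credential, as vLLM's OpenAI-compatible server does; the URL, token, and model name are placeholders:

```python
import requests

# Placeholders -- substitute the endpoint URL, token, and model name
# shown in your Hivenet console.
ENDPOINT = "https://your-server.example/v1/chat/completions"
API_TOKEN = "your-authentication-token"  # hypothetical value

resp = requests.post(
    ENDPOINT,
    # The token travels as a standard Bearer credential.
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json={
        "model": "your-model-name",
        "messages": [{"role": "user", "content": "Summarize vLLM in one sentence."}],
        # Prompt tokens plus max_tokens must fit within the configured
        # context length (8192 by default).
        "max_tokens": 256,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```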
2. Sampling
Controls how the model generates responses (a request sketch follows this list):
- Temperature: Higher values make output more random; lower values make it more deterministic.
- Top-p (nucleus sampling): Probability cutoff for token selection.
- Top-k: Limits selection to top-k tokens at each step.
- Repetition penalty: Discourages repeated outputs.
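These settings map directly onto request parameters. A minimal sketch, assuming vLLM's OpenAI-compatible API: temperature and top_p are standard OpenAI fields, while top_k and repetition_penalty are vLLM extensions accepted in the same JSON body. Endpoint and model name are placeholders, and the values are illustrative:

```python
import requests

ENDPOINT = "https://your-server.example/v1/chat/completions"  # placeholder

payload = {
    "model": "your-model-name",  # placeholder
    "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
    # Standard OpenAI-style sampling parameters:
    "temperature": 0.7,  # 0 makes generation greedy and effectively deterministic
    "top_p": 0.9,        # nucleus sampling: keep tokens within the top 90% of probability mass
    # vLLM extensions to the OpenAI schema:
    "top_k": 40,                # consider only the 40 most likely tokens per step
    "repetition_penalty": 1.1,  # values > 1.0 discourage repeated tokens
}
resp = requests.post(ENDPOINT, json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```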
3. Advanced
For fine-tuning performance and stability (see the sketch after this list):
- GPU memory fraction: How much GPU memory vLLM can use.
- CPU offload: Offload part of the memory footprint to CPU when GPU memory is full.
- Max batched tokens: Total tokens across parallel requests.
- Quantization and KV cache settings: Optimize memory and speed.
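These panel settings correspond roughly to vLLM engine arguments of the same names. As a point of reference only, here is a sketch of the equivalent configuration through vLLM's offline Python API; the values are illustrative, not recommendations, and the exact mapping to the Hivenet panel fields is an assumption:

```python
from vllm import LLM

# Illustrative values only -- in the Hivenet console you set the same
# options through the configuration panel rather than in code.
llm = LLM(
    model="your-model-name",       # placeholder
    gpu_memory_utilization=0.90,   # GPU memory fraction
    cpu_offload_gb=4,              # offload up to 4 GiB of weights to CPU memory
    max_num_batched_tokens=8192,   # total token budget across parallel requests
    quantization="awq",            # requires a model published with AWQ weights
    kv_cache_dtype="fp8",          # smaller KV cache at a slight accuracy cost
)
```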
When to adjust settings
Most users won’t need to touch these options. The defaults are optimized for balanced performance. Adjustments are useful if you need:
- Longer sequences (increase context length).
- Higher concurrency (more clients at once).
- Specific generation behavior (sampling parameters).
- Memory optimization (quantization or CPU offload).
Misconfigured advanced settings can reduce stability or performance. Only adjust them if you know your workload’s needs.