# Configuration
KTransformers uses YAML configuration files to customize inference behavior.
## Basic Configuration
```yaml
# config.yaml
backend: torch
device: cuda:0

model:
  name: deepseek-ai/DeepSeek-R1-671B
  quantization: Q4_K_M

inference:
  max_tokens: 2048
  temperature: 0.7
  top_p: 0.9
```
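A file like this can be parsed with PyYAML and sanity-checked before use. The `load_config` helper below is a hypothetical sketch, not part of the KTransformers API:

```python
import yaml

CONFIG = """
backend: torch
device: cuda:0
model:
  name: deepseek-ai/DeepSeek-R1-671B
  quantization: Q4_K_M
inference:
  max_tokens: 2048
  temperature: 0.7
  top_p: 0.9
"""

def load_config(text):
    """Parse a YAML config string and apply basic sanity checks."""
    cfg = yaml.safe_load(text)
    assert cfg["inference"]["max_tokens"] > 0, "max_tokens must be positive"
    assert 0.0 <= cfg["inference"]["top_p"] <= 1.0, "top_p must be in [0, 1]"
    return cfg

cfg = load_config(CONFIG)
```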
## Offloading
Enable MoE offloading for large models:
```yaml
offload:
  enabled: true
  ratio: 0.8  # 80% of experts on CPU
  device: cpu
```
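To see what `ratio: 0.8` implies in practice, here is a rough sketch of the resulting expert placement. The helper is hypothetical; the expert count of 256 per MoE layer matches DeepSeek-R1's routed experts but is used here only for illustration:

```python
def split_experts(num_experts, cpu_ratio):
    """Split a layer's experts between CPU and GPU by the offload ratio."""
    num_cpu = int(num_experts * cpu_ratio)
    return num_cpu, num_experts - num_cpu

# With 256 routed experts per layer and ratio 0.8:
cpu_experts, gpu_experts = split_experts(256, 0.8)
print(cpu_experts, gpu_experts)  # 204 on CPU, 52 on GPU
```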
## Memory Optimization

Options that reduce the memory footprint of inference:
```yaml
memory:
  kv_cache_dtype: float16
  attention_backend: flash_attention_2
  gradient_checkpointing: false
```
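To see why `kv_cache_dtype` matters, a back-of-the-envelope KV-cache size estimate for a standard attention layout. The shapes below are illustrative only (they do not describe DeepSeek-R1's actual MLA cache):

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem):
    """Bytes for K and V caches: 2 tensors per layer of [seq_len, heads, dim]."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Illustrative shapes; float16 uses 2 bytes per element, float32 uses 4.
fp16 = kv_cache_bytes(num_layers=61, num_kv_heads=8, head_dim=128,
                      seq_len=2048, bytes_per_elem=2)
fp32 = kv_cache_bytes(61, 8, 128, 2048, 4)
print(f"{fp16 / 2**30:.2f} GiB vs {fp32 / 2**30:.2f} GiB")
```

Halving the element size halves the cache, which is why `float16` is the usual default.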
## Multi-GPU

Shard the model across multiple GPUs:
```yaml
distributed:
  enabled: true
  devices: [0, 1, 2, 3]
  strategy: tensor_parallel
```
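With `tensor_parallel`, each weight matrix is split across the listed devices rather than assigning whole layers to each one. A minimal column-split sketch, independent of any KTransformers internals:

```python
def shard_columns(matrix, num_devices):
    """Split a weight matrix's columns evenly across devices (column parallelism)."""
    cols = len(matrix[0])
    per_device = cols // num_devices
    return [
        [row[d * per_device:(d + 1) * per_device] for row in matrix]
        for d in range(num_devices)
    ]

# A 2x4 weight split across 4 devices: one column each.
w = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
shards = shard_columns(w, 4)
```

Each device then multiplies the input by its shard, and the partial outputs are concatenated (or reduced, for a row split) to form the full result.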
## Environment Variables
| Variable | Description | Default |
|---|---|---|
| `KT_CACHE_DIR` | Model cache directory | `~/.cache/ktransformers` |
| `KT_LOG_LEVEL` | Logging level | `INFO` |
| `KT_NUM_THREADS` | CPU threads | Auto-detected |
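Variables like these are conventionally read with stdlib `os.environ`, falling back to the table's defaults when unset. The `kt_env` helper is a hypothetical sketch, not KTransformers code:

```python
import os

def kt_env(name, default):
    """Read an environment variable, falling back to a documented default."""
    return os.environ.get(name, default)

cache_dir = kt_env("KT_CACHE_DIR", os.path.expanduser("~/.cache/ktransformers"))
log_level = kt_env("KT_LOG_LEVEL", "INFO")
```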