# Configuration
KTransformers uses YAML configuration files to customize inference behavior.
## Basic Configuration
```yaml
# config.yaml
backend: torch
device: cuda:0

model:
  name: deepseek-ai/DeepSeek-R1-671B
  quantization: Q4_K_M

inference:
  max_tokens: 2048
  temperature: 0.7
  top_p: 0.9
```
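A file like this can be parsed with PyYAML and sanity-checked before use. The `load_config` helper below is a hypothetical sketch, not part of the KTransformers API:

```python
import yaml

CONFIG = """
backend: torch
device: cuda:0
model:
  name: deepseek-ai/DeepSeek-R1-671B
  quantization: Q4_K_M
inference:
  max_tokens: 2048
  temperature: 0.7
  top_p: 0.9
"""

def load_config(text):
    """Parse a YAML config string and apply basic sanity checks."""
    cfg = yaml.safe_load(text)
    assert cfg["inference"]["max_tokens"] > 0, "max_tokens must be positive"
    assert 0.0 <= cfg["inference"]["top_p"] <= 1.0, "top_p must be in [0, 1]"
    return cfg

cfg = load_config(CONFIG)
```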
## Offloading
Enable MoE offloading for large models:
```yaml
offload:
  enabled: true
  ratio: 0.8  # 80% of experts on CPU
  device: cpu
```
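To see what `ratio: 0.8` implies in practice, here is a rough sketch of the resulting expert placement. The helper is hypothetical; the expert count of 256 per MoE layer matches DeepSeek-R1's routed experts but is used here only for illustration:

```python
def split_experts(num_experts, cpu_ratio):
    """Split a layer's experts between CPU and GPU by the offload ratio."""
    num_cpu = int(num_experts * cpu_ratio)
    return num_cpu, num_experts - num_cpu

# With 256 routed experts per layer and ratio 0.8:
cpu_experts, gpu_experts = split_experts(256, 0.8)
print(cpu_experts, gpu_experts)  # 204 on CPU, 52 on GPU
```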
## Memory Optimization

Options that reduce the memory footprint of inference:
```yaml
memory:
  kv_cache_dtype: float16
  attention_backend: flash_attention_2
  gradient_checkpointing: false
```
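To see why `kv_cache_dtype` matters, a back-of-the-envelope KV-cache size estimate for a standard attention layout. The shapes below are illustrative only (they do not describe DeepSeek-R1's actual MLA cache):

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem):
    """Bytes for K and V caches: 2 tensors per layer of [seq_len, heads, dim]."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Illustrative shapes; float16 uses 2 bytes per element, float32 uses 4.
fp16 = kv_cache_bytes(num_layers=61, num_kv_heads=8, head_dim=128,
                      seq_len=2048, bytes_per_elem=2)
fp32 = kv_cache_bytes(61, 8, 128, 2048, 4)
print(f"{fp16 / 2**30:.2f} GiB vs {fp32 / 2**30:.2f} GiB")
```

Halving the element size halves the cache, which is why `float16` is the usual default.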
## Multi-GPU

Shard the model across multiple GPUs:
```yaml
distributed:
  enabled: true
  devices: [0, 1, 2, 3]
  strategy: tensor_parallel
```
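With `tensor_parallel`, each weight matrix is split across the listed devices rather than assigning whole layers to each one. A minimal column-split sketch, independent of any KTransformers internals:

```python
def shard_columns(matrix, num_devices):
    """Split a weight matrix's columns evenly across devices (column parallelism)."""
    cols = len(matrix[0])
    per_device = cols // num_devices
    return [
        [row[d * per_device:(d + 1) * per_device] for row in matrix]
        for d in range(num_devices)
    ]

# A 2x4 weight split across 4 devices: one column each.
w = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
shards = shard_columns(w, 4)
```

Each device then multiplies the input by its shard, and the partial outputs are concatenated (or reduced, for a row split) to form the full result.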
## Environment Variables
| Variable | Description | Default |
|---|---|---|
| `KT_CACHE_DIR` | Model cache directory | `~/.cache/ktransformers` |
| `KT_LOG_LEVEL` | Logging level | `INFO` |
| `KT_NUM_THREADS` | CPU threads | Auto-detected |
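Variables like these are conventionally read with stdlib `os.environ`, falling back to the table's defaults when unset. The `kt_env` helper is a hypothetical sketch, not KTransformers code:

```python
import os

def kt_env(name, default):
    """Read an environment variable, falling back to a documented default."""
    return os.environ.get(name, default)

cache_dir = kt_env("KT_CACHE_DIR", os.path.expanduser("~/.cache/ktransformers"))
log_level = kt_env("KT_LOG_LEVEL", "INFO")
```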