KTransformers

Configuration

KTransformers uses YAML configuration files to customize inference behavior.

Basic Configuration

# config.yaml
backend: torch
device: cuda:0

model:
  name: deepseek-ai/DeepSeek-R1-671B
  quantization: Q4_K_M

inference:
  max_tokens: 2048
  temperature: 0.7
  top_p: 0.9
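
Before handing a parsed config to the runtime, it can help to sanity-check it. The sketch below is illustrative, not part of KTransformers: the key names mirror the YAML above, but the validator itself and its range checks are assumptions.

```python
# Hypothetical validation sketch for a parsed config dict.
# Key names follow the YAML example; the checks are illustrative.

REQUIRED_TOP_LEVEL = {"backend", "device", "model", "inference"}

def validate(config: dict) -> list[str]:
    """Return a list of problems; an empty list means the config looks sane."""
    problems = [f"missing key: {k}" for k in sorted(REQUIRED_TOP_LEVEL - config.keys())]
    temp = config.get("inference", {}).get("temperature", 0.7)
    if not 0.0 <= temp <= 2.0:
        problems.append(f"temperature out of range: {temp}")
    return problems

config = {
    "backend": "torch",
    "device": "cuda:0",
    "model": {"name": "deepseek-ai/DeepSeek-R1-671B", "quantization": "Q4_K_M"},
    "inference": {"max_tokens": 2048, "temperature": 0.7, "top_p": 0.9},
}
print(validate(config))  # [] when everything checks out
```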

Offloading

Offload Mixture-of-Experts (MoE) layers to CPU for models that do not fit in GPU memory:

offload:
  enabled: true
  ratio: 0.8  # 80% of experts on CPU
  device: cpu
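
To make the `ratio` setting concrete, here is the arithmetic it implies. The expert count of 256 per MoE layer is a hypothetical example, not a property of any particular model:

```python
# Illustrative split for offload ratio 0.8: what fraction of experts
# lands on CPU versus GPU. The expert count is hypothetical.

def split_experts(num_experts: int, ratio: float) -> tuple[int, int]:
    """Return (cpu_experts, gpu_experts) for a given offload ratio."""
    cpu = int(num_experts * ratio)
    return cpu, num_experts - cpu

print(split_experts(256, 0.8))  # (204, 52)
```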

Memory Optimization

memory:
  kv_cache_dtype: float16
  attention_backend: flash_attention_2
  gradient_checkpointing: false
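
The `kv_cache_dtype` setting matters because KV-cache size scales linearly with the element width. A back-of-the-envelope sizing, with hypothetical layer and head counts (not DeepSeek-R1's actual architecture):

```python
# Rough KV-cache sizing to motivate kv_cache_dtype: float16 (2 bytes)
# halves the cache relative to float32 (4 bytes). Model dimensions
# below are hypothetical.

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, dtype_bytes):
    # Two tensors (K and V) per layer, one head_dim vector per
    # position per KV head.
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes

fp32 = kv_cache_bytes(32, 8, 128, 2048, 4)
fp16 = kv_cache_bytes(32, 8, 128, 2048, 2)
print(fp32 // fp16)  # prints 2: float16 halves the cache
```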

Multi-GPU

distributed:
  enabled: true
  devices: [0, 1, 2, 3]
  strategy: tensor_parallel
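
A minimal sketch of the `tensor_parallel` idea: each device in `devices` holds one column shard of a weight matrix, and the full output is the concatenation of per-shard results. This pure-Python stand-in only illustrates the sharding scheme; it is not how KTransformers implements it.

```python
# Toy column sharding for tensor parallelism: device d holds columns
# [d*per, (d+1)*per) of the weight matrix. No GPUs involved.

def shard_columns(matrix, num_devices):
    """Split each row of `matrix` into `num_devices` contiguous column shards."""
    per = len(matrix[0]) // num_devices
    return [
        [row[d * per:(d + 1) * per] for row in matrix]
        for d in range(num_devices)
    ]

weight = [[1, 2, 3, 4], [5, 6, 7, 8]]  # toy 2x4 matrix
shards = shard_columns(weight, 4)       # one 2x1 shard per device 0..3
print(len(shards), shards[0])  # 4 [[1], [5]]
```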

Environment Variables

Variable          Description              Default
KT_CACHE_DIR      Model cache directory    ~/.cache/ktransformers
KT_LOG_LEVEL      Logging level            INFO
KT_NUM_THREADS    CPU threads              Auto
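
A hedged sketch of how such variables are typically consumed; the defaults mirror the table above, but the reader function itself is illustrative, not KTransformers' actual code:

```python
import os
from pathlib import Path

# Illustrative settings reader. Defaults follow the table above;
# an unset KT_NUM_THREADS is left as None to mean "auto-detect".

def read_settings(env=os.environ):
    return {
        "cache_dir": env.get(
            "KT_CACHE_DIR", str(Path.home() / ".cache" / "ktransformers")
        ),
        "log_level": env.get("KT_LOG_LEVEL", "INFO"),
        "num_threads": env.get("KT_NUM_THREADS"),  # None -> auto
    }

print(read_settings({"KT_LOG_LEVEL": "DEBUG"})["log_level"])  # DEBUG
```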