Low-VRAM, Full-Precision Inference
KTransformers uses CPU/GPU heterogeneous computing, leveraging CPU memory and compute
to deploy top-tier 100B+ parameter models locally with just a single RTX 5090 (32GB VRAM).
Low-VRAM Full-Parameter Fine-Tuning
Fine-tune 100B+ parameter models with full parameters on consumer GPUs — no expensive multi-GPU clusters needed.
Why KTransformers?
Built for developers who want to run large models on accessible hardware without sacrificing performance.
Heterogeneous Computing
Optimize inference using CPU, GPU, and other accelerators together. Run large models on consumer hardware.
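The idea behind heterogeneous computing can be illustrated with a minimal PyTorch-style sketch. This is not the KTransformers API, just the underlying technique: keep small, latency-critical weights on the GPU, place large weights in CPU RAM and run their compute on CPU cores, and move only the (small) activations between devices.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of CPU/GPU heterogeneous placement (assumed names,
# not the KTransformers API). Falls back to CPU if no GPU is present.
gpu = "cuda" if torch.cuda.is_available() else "cpu"

class HeteroBlock(nn.Module):
    def __init__(self, dim=64, hidden=256):
        super().__init__()
        # Small, latency-critical projection stays on the accelerator.
        self.attn_proj = nn.Linear(dim, dim).to(gpu)
        # Large MLP weights live in CPU memory; their matmuls run on CPU.
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim)
        ).to("cpu")

    def forward(self, x):
        x = self.attn_proj(x.to(gpu))
        # Only the small activation tensor crosses the bus, not the weights.
        return self.mlp(x.to("cpu")).to(gpu)

block = HeteroBlock()
out = block(torch.randn(2, 64))
print(out.shape)  # torch.Size([2, 64])
```

Because the big weight matrices never leave host memory, GPU VRAM only has to hold the hot path, which is what makes 100B+ models feasible on a single consumer card.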
Full-Precision Inference
No quantization needed. Preserve the original model precision to ensure uncompromised inference quality.
Full-Stack Inference & Fine-Tuning
A complete local deployment toolchain covering both inference and fine-tuning in one package.
Multi-Model Support
Supports DeepSeek, Kimi, GLM, Qwen, MiniMax, and other mainstream large models for diverse use cases.
Powered by SGLang
GPU inference is powered by SGLang, combining the strengths of both projects for high inference performance.
Active Community
Join thousands of users sharing benchmarks, configurations, and best practices.
Performance Highlights
MiniMax-M2.1, FP8 full precision, single-GPU benchmark (32K-token input)
Prefill speed: 2,540 tokens/s (1x RTX 5090 32GB + 2x AMD EPYC 9355)
Decode speed: 27.6 tokens/s (1x RTX 5090 32GB + 2x AMD EPYC 9355)
Prefill speedup: 4.5x vs. llama.cpp (Q8_0 quantization)
Fine-Tuning Performance
Low-VRAM full-parameter fine-tuning benchmarks
Training throughput (tokens/s): coming soon
VRAM usage: coming soon
vs. full-GPU training: coming soon
Ready to get started?
Join the community and start running large models on your hardware today.