Low-VRAM, Full-Precision Inference

KTransformers uses CPU/GPU heterogeneous computing, leveraging CPU memory and compute
to deploy top-tier 100B+ parameter models locally with just a single RTX 5090 (32GB VRAM).
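The VRAM math behind this claim can be sketched with a quick back-of-envelope calculation. The numbers below are illustrative assumptions (FP8 weights at 1 byte per parameter, a 100B-parameter model, a 32GB RTX 5090), not measured figures: the weights alone exceed VRAM by a wide margin, which is why the remainder must live in CPU memory.

```python
# Back-of-envelope: why CPU offload is needed for full-precision 100B models.
# Assumptions (illustrative): FP8 weights = 1 byte/parameter; 32 GB of VRAM.

def weight_footprint_gb(n_params: float, bytes_per_param: float = 1.0) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9

model_gb = weight_footprint_gb(100e9)   # ~100 GB of FP8 weights
vram_gb = 32                            # single RTX 5090
offload_gb = max(0.0, model_gb - vram_gb)

print(f"model weights: {model_gb:.0f} GB")
print(f"VRAM: {vram_gb} GB -> offload ~{offload_gb:.0f} GB to CPU RAM")
```

This is only a sizing sketch; activations, KV cache, and runtime buffers add further memory on top of the weights.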

Low-VRAM Full-Parameter Fine-Tuning

Fine-tune 100B+ parameter models with full parameters on consumer GPUs — no expensive multi-GPU clusters needed.

Why KTransformers?

Built for developers who want to run large models on accessible hardware without sacrificing performance.

Heterogeneous Computing
Optimize inference using CPU, GPU, and other accelerators together. Run large models on consumer hardware.
Full-Precision Inference
No quantization needed. Preserve the original model precision to ensure uncompromised inference quality.
Full-Stack Inference & Fine-Tuning
A complete local deployment toolchain from inference to fine-tuning, all-in-one for your development needs.
Multi-Model Support
Supports DeepSeek, Kimi, GLM, Qwen, MiniMax and more mainstream large models for diverse use cases.
Powered by SGLang
GPU-side inference runs on SGLang, pairing KTransformers' CPU offloading with SGLang's high-performance GPU runtime.
Active Community
Join thousands of users sharing benchmarks, configurations, and best practices.

Performance Highlights

MiniMax-M2.1 FP8 full precision, single GPU benchmark (32K tokens input)

Prefill Speed: 2,540 tokens/s (1x RTX 5090 32GB + 2x AMD EPYC 9355)
Decode Speed: 27.6 tokens/s (1x RTX 5090 32GB + 2x AMD EPYC 9355)
Prefill Speedup: 4.5x vs. llama.cpp (Q8_0 quantization)
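To put the throughput figures above into end-to-end terms, a short calculation converts them into rough latencies. The 500-token response length is an assumption for illustration; the prefill and decode rates come from the benchmark above.

```python
# Rough latency implied by the benchmark numbers (illustrative only).

prefill_tps = 2540.0   # tokens/s prefill, 32K-token input
decode_tps = 27.6      # tokens/s decode

prompt_tokens = 32 * 1024
output_tokens = 500    # assumed response length for illustration

prefill_s = prompt_tokens / prefill_tps   # ~12.9 s to ingest the prompt
decode_s = output_tokens / decode_tps     # ~18.1 s to generate 500 tokens

print(f"prefill: {prefill_s:.1f} s, decode: {decode_s:.1f} s")
```

In other words, at these rates a 32K-token prompt is ingested in about 13 seconds, after which generation proceeds at interactive speed.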

Fine-Tuning Performance

Low-VRAM full-parameter fine-tuning benchmarks

Training Throughput (tokens/s): coming soon
VRAM Usage: coming soon
vs. Full-GPU Training: coming soon

Ready to get started?

Join the community and start running large models on your hardware today.