Low-VRAM, Full-Precision Inference

KTransformers uses CPU/GPU heterogeneous computing, leveraging CPU memory and compute
to deploy top-tier 100B+ parameter models locally with just a single RTX 5090 (32GB VRAM).
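The VRAM math behind this claim can be sketched with a quick back-of-envelope calculation. The numbers below are illustrative assumptions (FP8 weights at 1 byte per parameter, a 100B-parameter model, a 32GB RTX 5090), not measured figures: the weights alone exceed VRAM by a wide margin, which is why the remainder must live in CPU memory.

```python
# Back-of-envelope: why CPU offload is needed for full-precision 100B models.
# Assumptions (illustrative): FP8 weights = 1 byte/parameter; 32 GB of VRAM.

def weight_footprint_gb(n_params: float, bytes_per_param: float = 1.0) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9

model_gb = weight_footprint_gb(100e9)   # ~100 GB of FP8 weights
vram_gb = 32                            # single RTX 5090
offload_gb = max(0.0, model_gb - vram_gb)

print(f"model weights: {model_gb:.0f} GB")
print(f"VRAM: {vram_gb} GB -> offload ~{offload_gb:.0f} GB to CPU RAM")
```

This is only a sizing sketch; activations, KV cache, and runtime buffers add further memory on top of the weights.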

Low-VRAM Full-Parameter Fine-Tuning

Fine-tune 100B+ parameter models with full parameters on consumer GPUs — no expensive multi-GPU clusters needed.

Why KTransformers?

Built for developers who want to run large models on accessible hardware without sacrificing performance.

Heterogeneous Computing
Optimize inference using CPU, GPU, and other accelerators together. Run large models on consumer hardware.
Full-Precision Inference
No quantization needed. Preserve the original model precision to ensure uncompromised inference quality.
Full-Stack Inference & Fine-Tuning
A complete local deployment toolchain from inference to fine-tuning, all-in-one for your development needs.
Multi-Model Support
Supports DeepSeek, Kimi, GLM, Qwen, MiniMax and more mainstream large models for diverse use cases.
Powered by SGLang
GPU-side inference runs on SGLang, pairing KTransformers' CPU offloading with SGLang's high-performance GPU runtime.
Active Community
Join thousands of users sharing benchmarks, configurations, and best practices.

Performance Highlights

MiniMax-M2.1 FP8 full precision, single GPU benchmark (32K tokens input)

Prefill Speed: 2,540 tokens/s (1x RTX 5090 32GB + 2x AMD EPYC 9355)
Decode Speed: 27.6 tokens/s (1x RTX 5090 32GB + 2x AMD EPYC 9355)
Prefill Speedup: 4.5x vs. llama.cpp (Q8_0 quantization)
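To put the throughput figures above into end-to-end terms, a short calculation converts them into rough latencies. The 500-token response length is an assumption for illustration; the prefill and decode rates come from the benchmark above.

```python
# Rough latency implied by the benchmark numbers (illustrative only).

prefill_tps = 2540.0   # tokens/s prefill, 32K-token input
decode_tps = 27.6      # tokens/s decode

prompt_tokens = 32 * 1024
output_tokens = 500    # assumed response length for illustration

prefill_s = prompt_tokens / prefill_tps   # ~12.9 s to ingest the prompt
decode_s = output_tokens / decode_tps     # ~18.1 s to generate 500 tokens

print(f"prefill: {prefill_s:.1f} s, decode: {decode_s:.1f} s")
```

In other words, at these rates a 32K-token prompt is ingested in about 13 seconds, after which generation proceeds at interactive speed.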

Fine-Tuning Performance

Low-VRAM full-parameter fine-tuning benchmarks

Training Throughput (tokens/s): coming soon
VRAM Usage: coming soon
vs. Full-GPU Training: coming soon

Ready to get started?

Join the community and start running large models on your hardware today.