Model: paste model name here
Quantization:
FP32 — 4 bytes/param
BF16 — 2 bytes/param
FP16 — 2 bytes/param
FP8 (E4M3/E5M2) — 1 byte/param
INT8 — 1 byte/param
Q4_K_M (llama.cpp) — ~0.55 bytes/param
INT4 — 0.5 bytes/param
Q3_K_M (llama.cpp) — ~0.45 bytes/param
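These bytes-per-param figures convert directly into weight memory. A minimal sketch of the conversion (the helper name and the 70B example are illustrative assumptions; the Q4_K_M/Q3_K_M values are the approximate averages quoted above):

```python
# Hypothetical lookup mirroring the quantization options above.
BYTES_PER_PARAM = {
    "FP32": 4.0,
    "BF16": 2.0,
    "FP16": 2.0,
    "FP8": 1.0,       # E4M3 or E5M2
    "INT8": 1.0,
    "Q4_K_M": 0.55,   # llama.cpp k-quant, approximate average
    "INT4": 0.5,
    "Q3_K_M": 0.45,   # llama.cpp k-quant, approximate average
}

def weight_gib(n_params: float, quant: str) -> float:
    """Weight memory in GiB: parameter count times bytes per parameter."""
    return n_params * BYTES_PER_PARAM[quant] / 2**30

# Illustrative: a 70B-parameter model in Q4_K_M needs ~35.9 GiB of weights.
print(f"{weight_gib(70e9, 'Q4_K_M'):.1f} GiB")
```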
Context length:
Enable extended context (256k / 512k / 1M)
Batch size: concurrent sequences; the KV cache scales linearly with batch size. Use 1 for single-user inference.
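Because the KV cache grows linearly in both context length and batch size, its footprint follows from the attention shape alone. A sketch, assuming a transformer whose layer count, KV-head count, and head dimension are known (the 7B-scale shapes in the example are illustrative):

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_len: int, batch_size: int = 1,
                 bytes_per_elem: int = 2) -> float:
    """KV cache in GiB: two tensors (K and V) per layer per token,
    each n_kv_heads * head_dim wide, stored at bytes_per_elem
    (2 for FP16/BF16)."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * context_len * batch_size / 2**30

# Illustrative: 32 layers, 32 KV heads, head_dim 128, 8k context, batch 1.
print(f"{kv_cache_gib(32, 32, 128, 8192):.2f} GiB")  # 4.00 GiB
```

Doubling either the context length or the batch size doubles this number, which is why the extended-context options above come to dominate the budget long before 1M tokens.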
LoRA fine-tuning (rank 16, adds adapter overhead)
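The rank-16 adapter overhead is small next to the base weights: each adapted matrix of shape d_out x d_in gains two low-rank factors, A (r x d_in) and B (d_out x r). A sketch under the assumption that only the four attention projections are adapted (the 7B-scale shapes are illustrative):

```python
def lora_params(d_in: int, d_out: int, rank: int = 16) -> int:
    """Adapter params for one matrix: A (rank x d_in) + B (d_out x rank)."""
    return rank * d_in + d_out * rank

# Illustrative: q/k/v/o projections at hidden size 4096 across 32 layers.
hidden, layers = 4096, 32
per_layer = 4 * lora_params(hidden, hidden)   # four 4096x4096 matrices
total = per_layer * layers
print(f"{total / 1e6:.1f}M adapter params")   # ~16.8M, ~34 MB in BF16
```

Note that during fine-tuning the adapters also need gradients and optimizer states, and activations must be kept for the backward pass, so the training-time overhead is larger than the raw adapter size.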
Total usage ≈ —
Weights: —
KV cache: —
Available: —
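The total line is just these pieces summed and compared against device memory. A minimal self-contained sketch (the ~10% framework/activation overhead factor is an assumption, and the input numbers come from the illustrative examples above):

```python
def total_usage_gib(weights_gib: float, kv_gib: float,
                    overhead_frac: float = 0.10) -> float:
    """Weights + KV cache, padded by an assumed ~10% framework overhead."""
    return (weights_gib + kv_gib) * (1 + overhead_frac)

# Illustrative: 70B Q4_K_M weights (~35.9 GiB) plus an 8k-context KV cache (~4 GiB).
total = total_usage_gib(35.9, 4.0)
device = 48.0  # hypothetical 48 GiB card
print(f"total ≈ {total:.1f} GiB, available: {device - total:.1f} GiB")
```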
MoE: GPU fit is computed against total weights; every expert must be resident even though only a few activate per token.
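The distinction matters because active parameters set compute per token while total parameters set memory. A sketch with roughly Mixtral-8x7B-scale numbers (illustrative; about 47B total, 13B active):

```python
# MoE fit is computed against TOTAL parameters, not active ones.
total_params  = 46.7e9   # all experts resident in memory (illustrative)
active_params = 12.9e9   # experts actually used per token (e.g. 2 of 8)
bytes_per_param = 0.55   # Q4_K_M, from the table above

weights_gib = total_params * bytes_per_param / 2**30
print(f"weights: {weights_gib:.1f} GiB; "
      f"only {active_params / total_params:.0%} of params run per token")
```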
Hardware: NVIDIA Enterprise / NVIDIA Consumer / Apple
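The hardware presets reduce to a memory-capacity lookup against the total above. A sketch with a few well-known capacities (the preset keys and the flat-lookup structure are assumptions about how the selector works):

```python
# Assumed preset table: device -> memory in GiB.
HARDWARE_GIB = {
    "NVIDIA Enterprise / H100": 80,
    "NVIDIA Enterprise / A100": 80,   # also sold as a 40 GiB variant
    "NVIDIA Consumer / RTX 4090": 24,
    "NVIDIA Consumer / RTX 3090": 24,
    "Apple / M2 Ultra": 192,          # unified memory, shared with the OS
}

def headroom_gib(device: str, total_usage: float) -> float:
    """Memory left after loading the model; negative means it won't fit."""
    return HARDWARE_GIB[device] - total_usage

# Illustrative: the 43.9 GiB total from above on a 24 GiB card.
print(f"{headroom_gib('NVIDIA Consumer / RTX 4090', 43.9):.1f}")  # -19.9: won't fit
```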