JANG
The GGUF for MLX
Better quality at every model size.
GGUF gave llama.cpp K-quants — smart bit allocation that protects important layers. MLX has no equivalent. JANG fills that gap. It assigns more bits to attention and fewer to MLP, so models stay coherent at 2–3 bits where standard quantization produces garbage.
Same model, same size, same speed — just better output. Models stay quantized in GPU memory using MLX’s native kernels. No float16 expansion, no speed penalty. Open source under Apache 2.0.
Variable bit widths based on layer sensitivity
Standard quantization applies the same bit width to every tensor. Attention layers (~12% of parameters) are more sensitive to precision loss than MLP layers — when quantized too aggressively, attention scores flatten, positional encoding degrades, and output degenerates.
JANG classifies tensors into sensitivity tiers and assigns bit widths accordingly. Attention layers get 5–8 bits while MLP compresses to 2–4 bits. The overhead is ~0.3 extra bits on average.
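As a sketch, tier assignment can be as simple as matching tensor names against sensitivity classes. The function and profile below are illustrative, not JANG's actual internals; the bit widths mirror a JANG_2M-style profile:

```python
# Illustrative name-based sensitivity tiers (not JANG's real code).
ATTN_KEYWORDS = ("q_proj", "k_proj", "v_proj", "o_proj")
MLP_KEYWORDS = ("gate_proj", "up_proj", "down_proj")

def assign_bits(tensor_name: str, profile: dict) -> int:
    """Map a tensor name to a bit width by sensitivity tier."""
    if any(k in tensor_name for k in ATTN_KEYWORDS):
        return profile["attention"]   # sensitive: keep precision
    if any(k in tensor_name for k in MLP_KEYWORDS):
        return profile["mlp"]         # bulk of parameters: compress hard
    if "embed" in tensor_name:
        return profile["embed"]
    if "lm_head" in tensor_name:
        return profile["lm_head"]
    return profile["default"]

# Profile shaped like JANG_2M: 2-bit MLP, 8-bit attention.
jang_2m = {"mlp": 2, "attention": 8, "embed": 4, "lm_head": 8, "default": 6}

assign_bits("model.layers.0.self_attn.q_proj.weight", jang_2m)  # 8
assign_bits("model.layers.0.mlp.down_proj.weight", jang_2m)     # 2
```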
JANG vs MLX — side by side
Each JANG model compared against the closest MLX method by size. 200-question MMLU (20 per subject × 10 subjects), thinking disabled, temp 0.0. Apple M4 Max 128 GB.
MiniMax-M2.5 (230B) — JANG vs MLX
MLX is completely broken on MiniMax at every bit level — 4-bit (26.5%), 3-bit (24.5%), and 2-bit (25%) all score near random. JANG_2L at just 2.10 bits is the only way to run MiniMax quantized on Apple Silicon.
Qwen3.5-122B-A10B — ~4 bits
Qwen3.5-122B-A10B — ~2 bits
Qwen3.5-35B-A3B — ~4 bits
Qwen3.5-35B-A3B — ~2 bits
Download: JANG_4K 122B · JANG_2S 122B · JANG_4K 35B · JANG_2S 35B · JANG_1L 122B
Three-way comparison on basic prompts
Side-by-side on 6 factual prompts. All methods use MLX’s native Metal kernels. Temperature 0.0, max 80 tokens. M4 Max 128 GB.
MLX’s mixed_2_6 mode protects select v_proj and down_proj layers at 6-bit, but does not account for GatedDeltaNet linear attention layers, MoE expert routing tensors, or hybrid architecture components. JANG’s tier system classifies these architecture-specific tensors explicitly.
JANG_2L: 74% MMLU (200q) at 82.5 GB RAM — nearly 3x the score of MLX 4-bit at 120 GB
On this hybrid MoE model, MLX mixed_2_6 does not improve over 2-bit. The mixed_2_6 heuristic targets v_proj and down_proj in standard transformer layers but misses GatedDeltaNet attention and MoE routing tensors that are critical for this architecture.
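A hedged sketch of the difference: a heuristic that only matches `v_proj`/`down_proj` never sees hybrid-architecture tensors, while an explicit tier list can name them. The tensor names below follow common Qwen3-style hybrid checkpoints and are illustrative:

```python
# Patterns an architecture-aware tier system can protect explicitly.
SENSITIVE_PATTERNS = (
    "self_attn",     # full-attention blocks
    "linear_attn",   # GatedDeltaNet linear-attention blocks
    ".mlp.gate.",    # MoE expert router (distinct from gate_proj)
)

def is_sensitive(name: str) -> bool:
    return any(p in name for p in SENSITIVE_PATTERNS)

# A v_proj/down_proj-only heuristic would leave both of these at 2-bit:
is_sensitive("model.layers.5.linear_attn.in_proj_qkvz.weight")  # True
is_sensitive("model.layers.7.mlp.gate.weight")                  # True (router)
```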
`<think>` reasoning preserved at 2.19 bits
Size, speed, and scores at ~2 bits
| Model | Method | Bits | RAM | MMLU |
|---|---|---|---|---|
| Qwen3.5-122B-A10B | JANG_2M | 2.14 | 44.7 GB | 79% |
| | JANG_1L | 2.24 | 46.0 GB | 73% |
| | JANG_2L | 2.19 | 45.3 GB | — |
| | MLX mixed_2_6 | ~2.5 | 45 GB | 46% |
| | MLX 2-bit | 2.0 | 36 GB | 56.5% |
| Qwen3.5-35B-A3B | JANG_4K | 3.99 | 20.1 GB | 77.5% |
| | MLX 4-bit | 4.0 | 18.2 GB | 75.5% |
| | JANG_4S | 4.04 | 20.4 GB | 82% |
| | JANG_2S | 2.17 | 12.8 GB | 65.5% |
| | JANG_2L v2 | 2.28 | 13.3 GB | 56% |
| | MLX mixed_2_6 | ~2.5 | 12.8 GB | ~40% |
| MiniMax-M2.5 (230B) | JANG_2S | 2.06 | 81.6 GB | — |
| | JANG_2L | 2.10 | 82.5 GB | 74% |
| | MLX 4-bit | 4.0 | 119.8 GB | 26.5% |
| | MLX 2-bit | 2.0 | 66.6 GB | 25.0% |
All on Apple M4 Max 128 GB · MMLU: 200-question MMLU (10 subjects × 20), thinking disabled · Experiment 055, 2026-03-16
Pipeline verification: JANG_4S matches MLX 4-bit exactly on 35B MMLU (82% vs 82%), confirming the quantization pipeline adds no loss of its own at matched bit widths.
JANG_2M uses similar RAM to MLX mixed_2_6 (44.7 GB vs 45 GB) while scoring 79% vs 46% on MMLU (200q).
MiniMax-M2.5 (230B): JANG_2L scores 74% MMLU (200q) at 82.5 GB RAM vs MLX 4-bit at 26.5% (119.8 GB) and MLX 2-bit at 25.0% (66.6 GB). Nearly 3x accuracy at 37 GB less RAM.
On 35B, JANG_2S at 12.8 GB RAM scores 65.5% vs mixed_2_6 at 12.8 GB scoring ~40%. JANG_2L v2 at the same ~2.3 bits scores 56% MMLU. At 4-bit, JANG_4S and MLX match exactly (82% MMLU).
Dense model comparisons (1B–7B)
Comparisons at the degradation boundary — the bit width where standard quantization starts producing degenerate output. Same prompts, same temperature, same model. All on M4 Max.
At 2.5 effective bits, JANG_2S gets 6/6 correct while 2-bit gets 0/6. JANG protects the 8 critical full-attention layers at 6-bit while compressing the 24 linear-attention layers and all MLP at 2-bit.
Highlights — 7B models
JANG_3M (3.4 bits)
3-bit (3.5 bits)
JANG_3L (3.6 bits)
3-bit (3.5 bits)
JANG_4S (4.1 bits)
4-bit (4.5 bits)
JANG_2S (2.5 bits)
2-bit (2.5 bits)
More 7B results
JANG_3L (3.6 bits)
3-bit
JANG_3M (3.4 bits)
3-bit
JANG_3L (3.6 bits)
3-bit
JANG_2M (2.7 bits)
2-bit
JANG_4L (4.5 bits)
4-bit
JANG_2S (2.5 bits)
2-bit
Smaller models (1B–3B)
JANG_3M (3.4 bits)
3-bit
JANG_2S (2.5 bits)
2-bit
JANG_4S (4.1 bits)
4-bit
JANG_4L (4.5 bits)
4-bit
JANG (4.12 bits)
4-bit
JANG_4S (4.1 bits)
4-bit
JANG at 3.37 bits beats 4-bit
Qwen2.5-3B · “What is 2+2?” · Logit MSE vs bf16 reference · Lower is better
Apple M4 Max · 107 GB unified memory · affine quantization · group_size=64
JANG at 3.37 bits (MSE 11.10) beats 4.00 bits (MSE 11.31) — 16% fewer bits with better quality.
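The logit-MSE metric itself is straightforward; a minimal sketch, assuming you already have the output logits from a quantized run and a bf16 reference run on the same prompt:

```python
import numpy as np

def logit_mse(quant_logits, ref_logits):
    """Mean squared error between the quantized model's logits and
    the bf16 reference logits for the same prompt. Lower is better."""
    q = np.asarray(quant_logits, dtype=np.float64)
    r = np.asarray(ref_logits, dtype=np.float64)
    return float(np.mean((q - r) ** 2))
```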
All models tested
| Model | Params | Architecture | Tests | Failure mode |
|---|---|---|---|---|
| Mistral-7B | 7B | Mistral GQA 4:1, sliding window | 13 | 3-bit → number sequences, 4-bit → loops |
| TinyLlama-1.1B | 1.1B | Llama GQA 8:1 | 11 | 4-bit → topic derail |
| SmolLM2-1.7B | 1.7B | Llama MHA | 11 | 3-bit → number sequences |
| Phi-2 | 2.7B | Phi MHA, GELU MLP | 9 | 2-bit → empty output |
| Qwen2.5-7B | 7B | Qwen GQA 4:1 | 9 | 3-bit → repetition loop |
| Qwen2.5-3B | 3B | Qwen GQA 8:1 | 6 | 4-bit → echo/loop |
| Qwen3.5-4B | 4B | Hybrid: 24 linear + 8 full attn | 6 | 2-bit → 0/6 correct |
All tests: Apple M4 Max · 107 GB unified memory · MLX affine quantization · group_size=64 · same tokenizer · same prompt template · 45 experiments · 8 models · Qwen3.5-9B downloaded, testing pending
JANG_{bits}{size}
11 predefined profiles from ultra-compressed to near-lossless. S = Small (most compression), M = Medium (balanced), L = Large (best quality).
| Profile | MLP | Attention | Embed | lm_head | Avg Bits |
|---|---|---|---|---|---|
| JANG_1L | 2-bit | 8-bit | 8-bit | 8-bit | ~2.2 |
| JANG_2S | 2-bit | 6-bit | 4-bit | 6-bit | ~2.5 |
| JANG_2M | 2-bit | 8-bit | 4-bit | 8-bit | ~2.7 |
| JANG_2L | 2-bit | 8-bit | 6-bit | 8-bit | ~2.9 |
| JANG_3S | 3-bit | 4-bit | 4-bit | 6-bit | ~3.1 |
| JANG_3M | 3-bit | 6-bit | 4-bit | 6-bit | ~3.4 |
| JANG_3L | 3-bit | 8-bit | 4-bit | 8-bit | ~3.6 |
| JANG_4S | 4-bit | 5-bit | 4-bit | 6-bit | ~4.1 |
| JANG_4M | 4-bit | 6-bit | 4-bit | 6-bit | ~4.2 |
| JANG_4L | 4-bit | 8-bit | 4-bit | 8-bit | ~4.5 |
| JANG_6M | 6-bit | 8-bit | 6-bit | 8-bit | ~6.2 |
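The Avg Bits column is approximately the parameter-weighted mean of the per-tier widths. A rough check — the parameter shares below are illustrative for a typical dense transformer, not exact JANG figures:

```python
def avg_bits(profile, shares):
    """Parameter-weighted average bit width across tiers."""
    return sum(profile[tier] * shares[tier] for tier in shares)

# JANG_2M widths from the table above.
jang_2m = {"mlp": 2, "attention": 8, "embed": 4, "lm_head": 8}
# Illustrative shares: MLP dominates, attention ~12% (see above).
shares = {"mlp": 0.85, "attention": 0.12, "embed": 0.02, "lm_head": 0.01}

avg = avg_bits(jang_2m, shares)   # ~2.8, in the ballpark of the listed ~2.7
```

The small gap to the listed ~2.7 comes from per-model parameter shares and group-wise scale/bias overhead; the point is that an 8-bit attention tier costs little because attention is a small fraction of parameters.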
Swift + Metal inference engine
14 custom Metal GPU kernels. Zero-copy mmap loading. Fused dequantization for decode and prefill.
Dequant + GEMV
Fused dequantization + matrix-vector multiply for single-token decode. All bit widths (2, 3, 4, 5, 6, 8) in one kernel.
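As a reference for what the fused kernel computes — not the Metal implementation itself — here is the math in NumPy, assuming MLX-style affine groups where each run of `group_size` weights shares a scale and bias:

```python
import numpy as np

def dequant_gemv(q, scales, biases, x, group_size=64):
    """y = W @ x, where W is reconstructed group-wise as
    W = scale * q + bias. A fused kernel dequantizes in registers
    and never materializes W; this reference does, for clarity.
    q: (out, in) integer codes; scales/biases: (out, in // group_size)."""
    s = np.repeat(scales, group_size, axis=1)   # broadcast per group
    b = np.repeat(biases, group_size, axis=1)
    return (s * q + b) @ x

# Tiny example with group_size=2 for readability.
q = np.array([[1.0, 3.0, 0.0, 2.0]])
scales = np.array([[0.5, 1.0]])
biases = np.array([[0.0, -1.0]])
x = np.ones(4)
y = dequant_gemv(q, scales, biases, x, group_size=2)
# Reconstructed row is [0.5, 1.5, -1.0, 1.0]; dot with ones gives 2.0.
```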
Dequant + GEMM
Fused dequantization + matrix-matrix multiply for prompt prefill. Tiled for Apple GPU threadgroup memory.
GQA Attention
Grouped-query attention decode + causal prefill. Supports standard, sliding window, and hybrid architectures.
RMSNorm + RoPE
Fused normalization and rotary position embedding. Traditional and non-traditional RoPE variants.
SwiGLU
Fused SiLU activation + element-wise multiply for gated feed-forward networks.
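The fused operation is small enough to state exactly; a NumPy reference of what the kernel computes on the already-projected gate and up activations:

```python
import numpy as np

def swiglu(gate, up):
    """SiLU(gate) * up in one pass, matching the fused kernel:
    the intermediate activation never round-trips through memory."""
    return gate / (1.0 + np.exp(-gate)) * up   # SiLU(x) = x * sigmoid(x)
```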
Quantized Embedding
Direct embedding lookup from quantized weights. No full-table dequantization needed.
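A sketch of the idea, assuming the group-wise affine layout used elsewhere on this page: gather only the rows the current tokens index, then dequantize just those rows, never the whole table.

```python
import numpy as np

def embedding_lookup(token_ids, q, scales, biases, group_size=2):
    """Dequantize only the gathered rows. q: (vocab, dim) codes;
    scales/biases: (vocab, dim // group_size). Full-table expansion
    would cost vocab * dim; this costs len(token_ids) * dim."""
    rows = q[token_ids]                                  # gather codes
    s = np.repeat(scales[token_ids], group_size, axis=1)
    b = np.repeat(biases[token_ids], group_size, axis=1)
    return s * rows + b

# Two-row toy table, group_size=2 for readability.
q = np.array([[1.0, 3.0], [2.0, 0.0]])
scales = np.array([[0.5], [1.0]])
biases = np.array([[0.0], [1.0]])
out = embedding_lookup([1], q, scales, biases)   # [[3.0, 1.0]]
```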
Convert any model
Python tooling to convert HuggingFace models to .jang format. Pick a profile, choose your quantization method, and go. Supports RTN, MSE-optimal grid search, and GPTQ (Hessian-guided) quantization.
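Of the three methods, RTN is simple enough to sketch in a few lines. Assuming the same group-wise affine layout (`w ≈ scale * q + bias`, `group_size=64`) used throughout this page:

```python
import numpy as np

def rtn_quantize(w, bits=4, group_size=64):
    """Round-to-nearest affine quantization: each group of
    `group_size` weights gets its own scale and bias so that
    w ≈ scale * q + bias with integer codes q in [0, 2^bits - 1]."""
    g = w.reshape(-1, group_size)
    lo = g.min(axis=1, keepdims=True)
    hi = g.max(axis=1, keepdims=True)
    levels = (1 << bits) - 1
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)
    q = np.clip(np.round((g - lo) / scale), 0, levels)
    return q, scale, lo   # bias is the group minimum

def rtn_dequantize(q, scale, bias):
    return scale * q + bias

w = np.linspace(-1.0, 1.0, 128)
q, s, b = rtn_quantize(w, bits=8, group_size=64)
err = np.abs(rtn_dequantize(q, s, b) - w.reshape(-1, 64)).max()
```

MSE-optimal grid search refines the per-group (scale, bias) pair beyond min/max, and GPTQ additionally uses second-order (Hessian) information to compensate rounding error across columns.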
6+ architecture families: Llama, Qwen, Gemma, Phi, Mistral, Mamba/SSM, MoE, and hybrid models including Qwen 3.5.
Run bigger models on less RAM
JANG_3M saves 25% vs 4-bit with comparable quality on 7B+ models. Fit models in unified memory that wouldn't fit before.
Pre-quantized models on HuggingFace
Ready to download. Compatible with vMLX Engine / MLX Studio via the JANG loader.
Run JANG models in MLX Studio
MLX Studio has native JANG support with OpenAI-compatible API,
prefix caching, paged KV cache, KV quantization (q4/q8), continuous batching,
and 20+ agentic coding tools. Load any .jang model and serve it locally —
works with Cursor, Continue, Aider, and any OpenAI API client.
Powered by vMLX Engine,
now open source — pip install vmlx.