JANG
Smaller than MLX. Sharper output.
Only the cases where JANG wins by a long shot while using less memory or fewer bits.
The benchmark page is now filtered hard: no close wins, no larger JANG configurations, no estimates. If the comparison is listed here, JANG is smaller than the MLX baseline and the quality gap is obvious.
The strongest proof points are MiniMax-M2.5 at 82.5 GB beating MLX 4-bit at 119.8 GB by +47.5 MMLU points, and Qwen3.5-122B at 44.7 GB beating MLX mixed_2_6 at 45 GB by +33 points.
Variable bit widths based on layer sensitivity
Standard quantization applies the same bit width to every tensor. Attention layers (~12% of parameters) are more sensitive to precision loss than MLP layers — when quantized too aggressively, attention scores flatten, positional encoding degrades, and output degenerates.
JANG classifies tensors into sensitivity tiers and assigns bit widths accordingly. Attention layers get 5–8 bits while MLP compresses to 2–4 bits. Keeping attention at higher precision adds only ~0.3 bits to the overall average.
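A minimal sketch of how tier assignment can work, assuming typical HuggingFace tensor names (`q_proj`, `gate_proj`, and so on); JANG's actual classifier is not shown here and may differ:

```python
# Sensitivity-tier assignment sketch. The name patterns below are
# assumptions based on common HuggingFace checkpoints, not JANG's code.
ATTN_KEYS = ("q_proj", "k_proj", "v_proj", "o_proj")
MLP_KEYS = ("gate_proj", "up_proj", "down_proj")

def bits_for_tensor(name: str, profile: dict) -> int:
    """Map a tensor to a bit width based on its sensitivity tier."""
    if any(k in name for k in ATTN_KEYS):
        return profile["attention"]   # sensitive: keep 5-8 bits
    if any(k in name for k in MLP_KEYS):
        return profile["mlp"]         # tolerant: compress to 2-4 bits
    if "embed" in name:
        return profile["embed"]
    if "lm_head" in name:
        return profile["lm_head"]
    return profile["attention"]       # unknown tensors default to the safe tier

jang_3m = {"mlp": 3, "attention": 6, "embed": 4, "lm_head": 6}
print(bits_for_tensor("model.layers.0.self_attn.q_proj.weight", jang_3m))  # 6
print(bits_for_tensor("model.layers.0.mlp.down_proj.weight", jang_3m))     # 3
```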
Only blowout wins at smaller sizes.
Filtered to proven comparisons where JANG is smaller than the MLX baseline and wins by a large MMLU or coherency margin. Close wins, larger JANG configs, and untested estimates were removed.
MiniMax-M2.5 (230B) — MMLU blowout, smaller than MLX 4-bit
Kept because JANG is both dramatically better and dramatically smaller: 82.5 GB vs 119.8 GB, with MLX 4-bit, 3-bit, and 2-bit all scoring near random chance on MMLU.
Qwen3.5-122B-A10B — same-size MLX mixed baseline, JANG still smaller
Kept because this is the closest same-memory MLX comparison: JANG is slightly smaller and still lands a +33 point MMLU gap.
[Benchmark chart legends: JANG_3M vs MLX 3-bit, JANG_4S vs MLX 4-bit, and JANG vs MLX 4-bit]
Qwen2.5-3B: 3.37 bits beats 4.00-bit MSE
Lower is better. JANG with MLP=3 / attention=6 reaches 11.10 MSE at 3.37 average bits vs MLX 4-bit at 11.31 MSE.
Shown models: decisive smaller wins only
| Model | JANG | MLX baseline | Why shown |
|---|---|---|---|
| MiniMax-M2.5 | JANG_2L · 82.5 GB · 74% | 4-bit · 119.8 GB · 26.5% | +47.5 MMLU, 37.3 GB smaller |
| Qwen3.5-122B-A10B | JANG_2M · 44.7 GB · 79% | mixed_2_6 · 45 GB · 46% | +33 MMLU, slightly smaller |
| Mistral-7B | JANG_3M / JANG_4S | 3-bit / 4-bit MLX | Fewer bits, coherent output |
| Qwen2.5-3B | JANG_4S / 3.37-bit proof | 4-bit MLX | Fewer bits, better MSE/coherency |
| SmolLM2-1.7B | JANG_3M · 3.4 bits | 3-bit MLX · 3.5 bits | Smaller and answers directly |
| TinyLlama-1.1B | JANG_4S · 4.1 bits | 4-bit MLX · 4.5 bits | Smaller and avoids topic derail |
JANG_{bits}{size}
11 predefined profiles from ultra-compressed to near-lossless. S = Small (most compression), M = Medium (balanced), L = Large (best quality). A back-of-envelope check on the Avg Bits column follows the table.
| Profile | MLP | Attention | Embed | lm_head | Avg Bits |
|---|---|---|---|---|---|
| JANG_1L | 1-bit | 8-bit | 8-bit | 8-bit | ~2.2 |
| JANG_2S | 2-bit | 6-bit | 4-bit | 6-bit | ~2.5 |
| JANG_2M | 2-bit | 8-bit | 4-bit | 8-bit | ~2.7 |
| JANG_2L | 2-bit | 8-bit | 6-bit | 8-bit | ~2.9 |
| JANG_3S | 3-bit | 4-bit | 4-bit | 6-bit | ~3.1 |
| JANG_3M | 3-bit | 6-bit | 4-bit | 6-bit | ~3.4 |
| JANG_3L | 3-bit | 8-bit | 4-bit | 8-bit | ~3.6 |
| JANG_4S | 4-bit | 5-bit | 4-bit | 6-bit | ~4.1 |
| JANG_4M | 4-bit | 6-bit | 4-bit | 6-bit | ~4.2 |
| JANG_4L | 4-bit | 8-bit | 4-bit | 8-bit | ~4.5 |
| JANG_6M | 6-bit | 8-bit | 6-bit | 8-bit | ~6.2 |
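As a sanity check on the Avg Bits column, here is a back-of-envelope estimator assuming the ~12% attention / ~88% MLP parameter split noted above, and ignoring embeddings, lm_head, and per-group scale overhead (so it lands slightly under the table's values). The JANG_3M estimate also lines up with the 3.37-bit figure quoted for Qwen2.5-3B:

```python
# Illustrative subset of the profile table above.
PROFILES = {
    "JANG_2M": {"mlp": 2, "attention": 8},
    "JANG_3M": {"mlp": 3, "attention": 6},
    "JANG_4S": {"mlp": 4, "attention": 5},
}

def approx_avg_bits(profile: dict, attn_frac: float = 0.12) -> float:
    """Weight-average the tier bit widths by parameter share."""
    return attn_frac * profile["attention"] + (1 - attn_frac) * profile["mlp"]

for name, p in PROFILES.items():
    print(name, round(approx_avg_bits(p), 2))
# JANG_2M 2.72, JANG_3M 3.36, JANG_4S 4.12 -- close to the table's ~2.7 / ~3.4 / ~4.1
```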
Swift + Metal inference engine
14 custom Metal GPU kernels. Zero-copy mmap loading. Fused dequantization for decode and prefill.
Dequant + GEMV
Fused dequantization + matrix-vector multiply for single-token decode. All bit widths (2, 3, 4, 5, 6, 8) in one kernel.
Dequant + GEMM
Fused dequantization + matrix-matrix multiply for prompt prefill. Tiled for Apple GPU threadgroup memory.
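The Metal source is not reproduced here, but the semantics of the fused decode kernel can be sketched in NumPy. Assuming per-group scales and zero points with a group size of 32 (the group size is an assumption), weights are dequantized group by group inside the multiply-accumulate, so the full float weight matrix never materializes; the GEMM prefill kernel generalizes the same loop to a batch of activation vectors:

```python
import numpy as np

GROUP = 32  # assumed quantization group size

def dequant_gemv(q, scales, zeros, x):
    """Reference semantics for the fused kernel: dequantize a (rows, cols)
    integer-coded weight matrix one group at a time and accumulate the
    dot product with x, never materializing the float weights."""
    rows, cols = q.shape
    y = np.zeros(rows, dtype=np.float32)
    for g in range(cols // GROUP):
        s = slice(g * GROUP, (g + 1) * GROUP)
        # w = (q - zero) * scale, fused into the multiply-accumulate
        w = (q[:, s].astype(np.float32) - zeros[:, g:g + 1]) * scales[:, g:g + 1]
        y += w @ x[s]
    return y

q = np.random.randint(0, 16, (8, 64))                     # 4-bit codes
scales = np.random.rand(8, 2).astype(np.float32)          # one scale per group
zeros = np.full((8, 2), 8.0, dtype=np.float32)            # one zero per group
x = np.random.rand(64).astype(np.float32)
print(dequant_gemv(q, scales, zeros, x).shape)            # (8,)
```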
GQA Attention
Grouped-query attention decode + causal prefill. Supports standard, sliding window, and hybrid architectures.
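As a reference for what the decode path computes, here is a NumPy sketch of one GQA decode step; head counts and cache layout are illustrative, and the causal prefill and sliding-window variants are omitted:

```python
import numpy as np

def gqa_decode(q, k_cache, v_cache, n_kv_heads):
    """One decode step of grouped-query attention: several query heads
    share each KV head. q: (n_heads, d), caches: (n_kv_heads, t, d)."""
    n_heads, d = q.shape
    group = n_heads // n_kv_heads              # query heads per KV head
    out = np.empty_like(q)
    for h in range(n_heads):
        kv = h // group                        # map query head -> shared KV head
        scores = k_cache[kv] @ q[h] / np.sqrt(d)   # (t,)
        probs = np.exp(scores - scores.max())      # stable softmax
        probs /= probs.sum()
        out[h] = probs @ v_cache[kv]
    return out
```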
RMSNorm + RoPE
Fused normalization and rotary position embedding. Traditional and non-traditional RoPE variants.
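A NumPy sketch of both ops, assuming the "traditional" RoPE variant that rotates adjacent element pairs; the non-traditional (GPT-NeoX-style) variant pairs element i with element i + d/2 instead:

```python
import numpy as np

def rmsnorm(x, weight, eps=1e-6):
    """Scale x by the reciprocal of its root-mean-square, then by weight."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps) * weight

def rope_traditional(x, pos, base=10000.0):
    """Rotate adjacent pairs (x[2i], x[2i+1]) by position-dependent angles."""
    d = x.shape[-1]
    theta = pos / base ** (np.arange(0, d, 2) / d)
    cos, sin = np.cos(theta), np.sin(theta)
    out = np.empty_like(x)
    out[..., 0::2] = x[..., 0::2] * cos - x[..., 1::2] * sin
    out[..., 1::2] = x[..., 0::2] * sin + x[..., 1::2] * cos
    return out
```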
SwiGLU
Fused SiLU activation + element-wise multiply for gated feed-forward networks.
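What the fused op computes, in a few lines of NumPy (weight names are illustrative):

```python
import numpy as np

def swiglu(x, w_gate, w_up, w_down):
    """Gated FFN: silu(x @ W_gate) * (x @ W_up), then project down.
    The kernel fuses the activation and the elementwise multiply."""
    gate = x @ w_gate
    return (gate / (1.0 + np.exp(-gate)) * (x @ w_up)) @ w_down
```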
Quantized Embedding
Direct embedding lookup from quantized weights. No full-table dequantization needed.
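The idea in NumPy, assuming per-row quantization groups of 32 columns (the layout is an assumption): only the rows actually looked up are dequantized, not the whole table.

```python
import numpy as np

def embed_lookup(q_table, scales, zeros, token_ids, group=32):
    """Dequantize only the looked-up rows of a quantized embedding table.
    q_table: (vocab, d) int codes; scales/zeros: (vocab, d // group)."""
    rows = q_table[token_ids].astype(np.float32)        # (n, d) int codes
    s = np.repeat(scales[token_ids], group, axis=-1)    # per-group -> per-column
    z = np.repeat(zeros[token_ids], group, axis=-1)
    return (rows - z) * s
```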
Convert any model
Python tooling to convert HuggingFace models to .jang format. Pick a profile, choose your quantization method, and go. Supports RTN, MSE-optimal grid search, and GPTQ (Hessian-guided) quantization.
6+ architecture families: Llama, Qwen, Gemma, Phi, Mistral, Mamba/SSM, MoE, and hybrid models including Qwen 3.5.
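RTN and the MSE-optimal grid search are easy to sketch; the function below is illustrative rather than the converter's actual API, and GPTQ (which additionally weights errors by a Hessian estimate) is omitted:

```python
import numpy as np

def quantize_group(w, bits, method="rtn"):
    """Symmetrically quantize one weight group to `bits`. RTN derives the
    scale from the max magnitude; MSE search tries shrunken scales and
    keeps whichever minimizes reconstruction error."""
    qmax = 2 ** (bits - 1) - 1
    base = np.abs(w).max() / qmax
    candidates = [base] if method == "rtn" else [base * s for s in np.linspace(0.5, 1.0, 20)]
    best, best_err = None, np.inf
    for scale in candidates:
        q = np.clip(np.round(w / scale), -qmax - 1, qmax)
        err = np.mean((q * scale - w) ** 2)
        if err < best_err:
            best, best_err = q * scale, err
    return best, best_err

w = np.random.randn(32).astype(np.float32)
_, rtn_err = quantize_group(w, 3, "rtn")
_, mse_err = quantize_group(w, 3, "mse")
assert mse_err <= rtn_err  # the grid includes the RTN scale, so it can only match or beat it
```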
Run bigger models on less RAM
JANG_3M saves 25% vs 4-bit with comparable quality on 7B+ models. Fit models in unified memory that wouldn't fit before.
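The saving falls out of the average bit widths: the per-model table above lists MLX 4-bit at ~4.5 effective bits (including scales) and JANG_3M at ~3.4. A rough check:

```python
def model_gb(n_params, avg_bits):
    """Approximate weight memory only; KV cache and activations excluded."""
    return n_params * avg_bits / 8 / 1e9

p = 7e9
print(model_gb(p, 3.4), model_gb(p, 4.5))   # ~2.98 GB vs ~3.94 GB for a 7B model
print(1 - 3.4 / 4.5)                        # ~0.24 -> the quoted ~25% saving
```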
Proven smaller-win releases
This homepage now surfaces only model releases tied to the curated smaller-win evidence above. The full Hugging Face account is still linked, but the on-page list no longer shows unrelated recent models.
Run JANG models in MLX Studio
MLX Studio has native JANG support with OpenAI-compatible API,
prefix caching, paged KV cache, KV quantization (q4/q8), continuous batching,
and 20+ agentic coding tools. Load any .jang model and serve it locally —
works with Cursor, Continue, Aider, and any OpenAI API client.
Powered by vMLX Engine, now open source: `pip install vmlx`.
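Because the server speaks the OpenAI API, the stock `openai` Python client works unchanged; the port and model name below are placeholders, not documented defaults:

```python
from openai import OpenAI

# Point the standard OpenAI client at the local MLX Studio server.
# The base URL and model name are assumptions -- use whatever host,
# port, and .jang model you actually serve.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="qwen2.5-3b-jang_4s",
    messages=[{"role": "user", "content": "Say hello in five words."}],
)
print(resp.choices[0].message.content)
```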