Open Source

JANG

The GGUF for MLX

Better quality at every model size.

GGUF gave llama.cpp K-quants — smart bit allocation that protects important layers. MLX has no equivalent. JANG fills that gap. It assigns more bits to attention and fewer to MLP, so models stay coherent at 2–3 bits where standard quantization produces garbage.

Same model, same size, same speed — just better output. Models stay quantized in GPU memory using MLX’s native kernels. No float16 expansion, no speed penalty. Open source under Apache 2.0.

Importance-aware bit allocation · 2-bit to 8-bit mixed precision · 14 custom Metal GPU kernels · Swift + Metal runtime · Per-block variable bit widths · Open source · Apache 2.0
86%
MMLU 200q on 122B (JANG_4K)
79%
MMLU 200q on 122B at 2 bits (JANG_2S)
+22.5
MMLU 200q points vs MLX 2-bit
Apache 2.0
Open source license
How It Works

Variable bit widths based on layer sensitivity

Standard quantization applies the same bit width to every tensor. Attention layers (~12% of parameters) are more sensitive to precision loss than MLP layers — when quantized too aggressively, attention scores flatten, positional encoding degrades, and output degenerates.

JANG classifies tensors into sensitivity tiers and assigns bit widths accordingly. Attention layers get 5–8 bits while MLP compresses to 2–4 bits. The overhead is ~0.3 extra bits on average.

Attention
8-bit — protected
MLP
2-bit — compressed
Embed
4-bit
lm_head
6-bit
Result
JANG_2M → 2.7 avg bits → coherent output
3-bit → 3.0 avg bits → repetition loops
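The tier logic above can be sketched in a few lines of Python. This is an illustrative classifier, not JANG's actual implementation; the substring patterns and the profile dict are assumptions based on common transformer weight naming.

```python
# Illustrative sketch of importance-aware bit allocation: classify each
# tensor by its path into a sensitivity tier, then look up the tier's
# bit width. Tier widths mirror the JANG_2M example above (assumed).

PROFILE_2M = {"attention": 8, "mlp": 2, "embed": 4, "lm_head": 8}

def classify(path: str) -> str:
    """Map a tensor path to a sensitivity tier by substring match."""
    if "lm_head" in path:
        return "lm_head"
    if "embed" in path:
        return "embed"
    if any(k in path for k in ("q_proj", "k_proj", "v_proj", "o_proj")):
        return "attention"
    return "mlp"  # everything else is compressed hardest

def bits_for(path: str, profile: dict = PROFILE_2M) -> int:
    return profile[classify(path)]
```

For example, `bits_for("model.layers.0.self_attn.q_proj.weight")` returns 8 under this profile, while an MLP projection in the same layer gets 2.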
MMLU Benchmark

JANG vs MLX — side by side

Each JANG model is compared against the closest MLX method by size. 200-question MMLU (20 per subject × 10 subjects), thinking disabled, temperature 0.0. Apple M4 Max, 128 GB.

MiniMax-M2.5 (230B) — JANG vs MLX

JANG
JANG_2L
82.5 GB · 2.10 bits · 0.9s per question
74.0%
MMLU (200q) · 148/200
+47.5 points · MLX broken at ALL bit levels
MLX
4-bit
119.8 GB · 4.0 bits · 0.9s per question
26.5%
MMLU (200q) · 53/200

MLX is completely broken on MiniMax at every bit level — 4-bit (26.5%), 3-bit (24.5%), and 2-bit (25%) all score near random. JANG_2L at just 2.10 bits is the only way to run MiniMax quantized on Apple Silicon.

Per-subject breakdown — MiniMax-M2.5 (230B) — all methods
Subject | JANG_2L | MLX 4-bit | MLX 3-bit | MLX 2-bit
Abstract Algebra | 10/20 | 3/20 | 2/20 | 5/20
Anatomy | 15/20 | 7/20 | 5/20 | 5/20
Astronomy | 20/20 | 7/20 | 6/20 | 4/20
College CS | 13/20 | 4/20 | 5/20 | 6/20
College Physics | 13/20 | 8/20 | 6/20 | 6/20
HS Biology | 18/20 | 4/20 | 5/20 | 6/20
HS Chemistry | 18/20 | 4/20 | 5/20 | 5/20
HS Mathematics | 8/20 | 6/20 | 6/20 | 3/20
Logical Fallacies | 18/20 | 5/20 | 4/20 | 5/20
World Religions | 15/20 | 5/20 | 5/20 | 5/20
Total | 148/200 (74%) | 53/200 (26.5%) | 49/200 (24.5%) | 50/200 (25%)

JANG wins all 10 subjects against all MLX methods. MLX 4-bit, 3-bit, and 2-bit all score near random (25%). Root cause: MLX generates meta-commentary instead of direct answers on this model.

Qwen3.5-122B-A10B — ~4 bits

JANG
JANG_4K
71 GB · 3.99 bits · ~40 tok/s
86%
MMLU (200q) · 172/200
+1 point vs MLX 4-bit
MLX
4-bit
64 GB · 4.0 bits · ~50 tok/s
85%
MMLU (200q) · 170/200
Per-subject breakdown — 122B ~4 bits
Subject | JANG_4K | MLX 4-bit
Abstract Algebra | 16/20 | 15/20
Anatomy | 19/20 | 18/20
Astronomy | 19/20 | 19/20
College CS | 15/20 | 15/20
College Physics | 14/20 | 14/20
HS Biology | 19/20 | 19/20
HS Chemistry | 18/20 | 18/20
HS Mathematics | 14/20 | 14/20
Logical Fallacies | 19/20 | 19/20
World Religions | 19/20 | 19/20
Total | 172/200 (86%) | 170/200 (85%)

JANG wins 2 subjects, ties 8. Neck-and-neck at ~4 bits.

Qwen3.5-122B-A10B — ~2 bits

JANG
JANG_2S
44 GB · 2.11 bits · ~45 tok/s
79%
MMLU (200q) · 158/200
+22.5 points
MLX
2-bit
36 GB · 2.0 bits · ~52 tok/s
56.5%
MMLU (200q) · 113/200
Per-subject breakdown — 122B ~2 bits
Subject | JANG_2S | MLX 2-bit
Abstract Algebra | 9/20 | 9/20
Anatomy | 18/20 | 11/20
Astronomy | 20/20 | 16/20
College CS | 14/20 | 8/20
College Physics | 15/20 | 10/20
HS Biology | 19/20 | 15/20
HS Chemistry | 18/20 | 13/20
HS Mathematics | 11/20 | 4/20
Logical Fallacies | 16/20 | 13/20
World Religions | 18/20 | 14/20
Total | 158/200 (79%) | 113/200 (56.5%)

JANG wins 9 of 10 subjects, ties 1 (Abstract Algebra).

Qwen3.5-35B-A3B — ~4 bits

JANG
JANG_4K
20.1 GB · 3.99 bits · ~100 tok/s
77.5%
MMLU (200q) · 155/200
+2 points
MLX
4-bit
18.2 GB · 4.0 bits · ~110 tok/s
75.5%
MMLU (200q) · 151/200
Per-subject breakdown — 35B ~4 bits
Subject | JANG_4K | MLX 4-bit
Abstract Algebra | 12/20 | 10/20
Anatomy | 17/20 | 16/20
Astronomy | 18/20 | 18/20
College CS | 14/20 | 15/20
College Physics | 14/20 | 13/20
HS Biology | 18/20 | 18/20
HS Chemistry | 17/20 | 17/20
HS Mathematics | 10/20 | 8/20
Logical Fallacies | 18/20 | 19/20
World Religions | 17/20 | 17/20
Total | 155/200 (77.5%) | 151/200 (75.5%)

JANG wins 4 subjects, loses 2 (College CS, Logical Fallacies), ties 4.

Qwen3.5-35B-A3B — ~2 bits

JANG
JANG_2S
12.8 GB · 2.17 bits · fits 16 GB RAM
65.5%
MMLU (200q) · 131/200
+25 points
MLX
2-bit
12.8 GB · ~2.5 bits
~40%
MMLU (est. from 34% at 50q)
Per-subject breakdown — 35B ~2 bits (JANG only)
Subject | JANG_2S | MLX 2-bit
Abstract Algebra | 8/20 | —
Anatomy | 14/20 | —
Astronomy | 19/20 | —
College CS | 14/20 | —
College Physics | 11/20 | —
HS Biology | 16/20 | —
HS Chemistry | 14/20 | —
HS Mathematics | 5/20 | —
Logical Fallacies | 14/20 | —
World Religions | 16/20 | —
Total | 131/200 (65.5%) | ~40% (est.)

MLX 2-bit 200q not yet tested. Estimate based on 34% at 50 questions.

Test methodology & conditions
MMLU: 200-question subset (10 subjects × 20 questions each), thinking disabled, temperature 0.0.
Hardware: Apple M4 Max 128 GB unified memory.
Quantization: MLX affine quantization, group_size=64. JANG uses variable bit widths via quant_predicate.
Models: All methods use the same base model weights. JANG stays quantized in GPU memory using MLX’s native quantized_matmul — no float16 expansion.
Reproducibility: All scores verified from HuggingFace model cards. Code at github.com/jjang-ai/jangq.
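For reference, affine group quantization as described (group_size=64, with a per-group scale and offset) can be reproduced in a few lines of NumPy. A minimal sketch of the math, not MLX's actual kernel code:

```python
import numpy as np

def affine_quantize(w, bits=4, group_size=64):
    """Affine quantization per group of `group_size` weights, as in the
    benchmarks above: each group stores uint codes plus a scale/offset."""
    levels = 2 ** bits - 1
    g = w.reshape(-1, group_size)
    lo = g.min(axis=1, keepdims=True)
    hi = g.max(axis=1, keepdims=True)
    scale = (hi - lo) / levels
    scale[scale == 0] = 1.0  # flat groups quantize exactly to their offset
    q = np.round((g - lo) / scale).astype(np.uint8)
    return q, scale, lo

def affine_dequantize(q, scale, lo, shape):
    return (q * scale + lo).reshape(shape)

w = np.random.randn(128, 64).astype(np.float32)
q, s, b = affine_quantize(w, bits=4)
w_hat = affine_dequantize(q, s, b, w.shape)
```

Round-to-nearest bounds the per-weight error by half a quantization step, which is why the group min/max range (not outliers alone) determines quality.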

Download: JANG_4K 122B · JANG_2S 122B · JANG_4K 35B · JANG_2S 35B · JANG_1L 122B

QA Prompt Tests

Three-way comparison on basic prompts

Side-by-side on 6 factual prompts. All methods use MLX’s native Metal kernels. Temperature 0.0, max 80 tokens. M4 Max 128 GB.

JANG vs MLX mixed_2_6 vs 2-bit

Size, speed, and scores at ~2 bits

Model | Method | Bits | RAM | MMLU
Qwen3.5-122B-A10B | JANG_2M | 2.14 | 44.7 GB | 79%
 | JANG_1L | 2.24 | 46.0 GB | 73%
 | JANG_2L | 2.19 | 45.3 GB | —
 | MLX mixed_2_6 | ~2.5 | 45 GB | 46%
 | MLX 2-bit | 2.0 | 36 GB | 56.5%
Qwen3.5-35B-A3B | JANG_4K | 3.99 | 20.1 GB | 77.5%
 | MLX 4-bit | 4.0 | 18.2 GB | 75.5%
 | JANG_4S | 4.04 | 20.4 GB | 82%
 | JANG_2S | 2.17 | 12.8 GB | 65.5%
 | JANG_2L v2 | 2.28 | 13.3 GB | 56%
 | MLX mixed_2_6 | ~2.5 | 12.8 GB | ~40%
MiniMax-M2.5 (230B) | JANG_2S | 2.06 | 81.6 GB | —
 | JANG_2L | 2.10 | 82.5 GB | 74%
 | MLX 4-bit | 4.0 | 119.8 GB | 26.5%
 | MLX 2-bit | 2.0 | 66.6 GB | 25.0%

All on Apple M4 Max 128 GB · MMLU: 200-question MMLU (10 subjects × 20), thinking disabled · Experiment 055, 2026-03-16

Pipeline verification: JANG_4S matches MLX 4-bit exactly on 35B MMLU (82% = 82%), confirming the quantization pipeline is lossless at matched bit widths.

JANG_2M uses similar RAM to MLX mixed_2_6 (44.7 GB vs 45 GB) while scoring 79% vs 56.5% on MMLU (200q).

MiniMax-M2.5 (230B): JANG_2L scores 74% MMLU (200q) at 82.5 GB RAM vs MLX 4-bit at 26.5% (119.8 GB) and MLX 2-bit at 25.0% (66.6 GB). Nearly 3x accuracy at 37 GB less RAM.

On 35B, JANG_2S at 12.8 GB RAM scores 65.5% vs mixed_2_6 at 12.8 GB scoring ~40%. JANG_2L v2 at the same ~2.3 bits scores 56% MMLU. At 4-bit, JANG_4S and MLX match exactly (82% MMLU).

230B
Largest model tested
7
Architecture families tested
106
tok/s (35B MoE, M4 Max)
0.3s
Load time (3B model, mmap)
Earlier Results

Dense model comparisons (1B–7B)

Comparisons at the degradation boundary — the bit width where standard quantization starts producing degenerate output. Same prompts, same temperature, same model. All on M4 Max.

Highlights — 7B models

Mistral-7B-v0.3
Mistral GQA 4:1 JANG_3M 3.4 bits M4 Max
“What is photosynthesis?”
JANG_3M (3.4 bits)
“Photosynthesis is the process by which plants and some other organisms...”
3-bit (3.5 bits)
10000000000000000000000000000...
JANG_3M at 3.4 bits produces correct output. 3-bit (3.5 bits) outputs number sequences.
Qwen2.5-7B
Qwen GQA 4:1 JANG_3L 3.6 bits M4 Max
“What is 2+2?”
JANG_3L (3.6 bits)
“The answer is 4.”
3-bit (3.5 bits)
Assistant Assistant Assistant Assistant Assistant...
JANG_3L (3.6 bits) answers correctly. 3-bit (3.5 bits) enters repetition loop.
Mistral-7B — 4-bit
Mistral GQA 4:1 JANG_4S 4.1 bits M4 Max
“What is 2+2?”
JANG_4S (4.1 bits)
“The answer is 4. But what if...”
4-bit (4.5 bits)
4. What is 2+2? 4. What is 2+2? 4...
JANG_4S (4.1 bits) answers correctly. 4-bit (4.5 bits) loops the question.
Mistral-7B — 2-bit
Mistral GQA 4:1 JANG_2S 2.5 bits M4 Max
“Name three planets in our solar system.”
JANG_2S (2.5 bits)
“1. Jupiter 2. Mars 3. Saturn”
2-bit (2.5 bits)
is a new planet, and it is a new planet...
JANG_2S (2.5 bits) lists three planets. 2-bit (2.5 bits) enters repetition loop.

More 7B results

Qwen2.5-7B
JANG_3L 3.6 bits M4 Max
“Who wrote Romeo and Juliet?”
JANG_3L (3.6 bits)
“The play Romeo and Juliet was written by William Shakespeare”
3-bit
Who wrote Romeo and Juliet?
JANG answers correctly. Standard echoes the prompt back.
Mistral-7B
JANG_3M 3.4 bits M4 Max
“Name a famous painting by Leonardo da Vinci.”
JANG_3M (3.4 bits)
“The Last Supper, The Mona Lisa, The Vitruvian Man...”
3-bit
Name a famous painting by Michelangelo. Name a famous painting by Rap...
JANG lists correct works. Standard hallucinates different artists.
Mistral-7B
JANG_3L 3.6 bits M4 Max
“How many legs does a spider have?”
JANG_3L (3.6 bits)
“Spiders have eight legs.”
3-bit
TDM 10000000000000000000000...
JANG answers correctly. Standard outputs number sequences.
Mistral-7B — 2.7 bits
JANG_2M 2.7 bits M4 Max
“What is the largest ocean on Earth?”
JANG_2M (2.7 bits)
“The Pacific Ocean, The Atlantic Ocean, The Indian...”
2-bit
## 1000000000000000000000000...
JANG_2M (2.7 bits) answers correctly. 2-bit outputs number sequences.
Mistral-7B — 4-bit
JANG_4L 4.5 bits M4 Max
“Who wrote Romeo and Juliet?”
JANG_4L (4.5 bits)
“William Shakespeare. What is the name of the play...”
4-bit
William Shakespeare 1564-1616 1564-1616 1564-1616...
JANG answers correctly. Standard loops the dates repeatedly.
Mistral-7B — 2-bit
JANG_2S 2.5 bits M4 Max
“Is a tomato a fruit or vegetable?”
JANG_2S (2.5 bits)
“The tomato is a fruit, not a vegetable”
2-bit
The tomato is a fruit or a vegetable?...
Same average bit width. JANG answers correctly, standard loops.

Smaller models (1B–3B)

SmolLM2-1.7B
Llama MHA JANG_3M 3.4 bits M4 Max
“How many legs does a spider have?”
JANG_3M (3.4 bits)
“8. How many arms does a spider have? Answer: 8”
3-bit
2 1/2 1/2 1/2 1/2 1/2 1/2 1/2
JANG_3M (3.4 bits) answers “8”. 3-bit (3.5 bits) outputs number sequences.
Phi-2 (2.7B)
Phi MHA JANG_2S 2.5 bits M4 Max
“What is photosynthesis?”
JANG_2S (2.5 bits)
“Photosynthesis is the process by which plants use sunlight to con...”
2-bit
(empty output)
Same average bits. JANG produces correct answer. Standard produces empty output.
Qwen2.5-3B
Qwen GQA 8:1 JANG_4S 4.1 bits M4 Max
“Translate 'thank you' to Spanish.”
JANG_4S (4.1 bits)
“Thank you in Spanish is 'gracias'.”
4-bit
“Translate 'thank you' to Spanish.”
JANG_4S (4.1 bits) translates correctly. 4-bit (4.5 bits) echoes the prompt.
Qwen2.5-3B — Creative
JANG_4L 4.5 bits M4 Max
“Write a haiku about the moon.”
JANG_4L (4.5 bits)
“The moon’s glow, a tranquil sight...”
4-bit
ancient sky, lunar glow, ancient sky, lunar glow...
JANG generates coherent text. Standard loops a phrase.
Qwen2.5-3B — Tomato
JANG (4.12 bits) M4 Max
“Is a tomato a fruit or vegetable?”
JANG (4.12 bits)
“A tomato is a fruit. It is botanically classified...”
4-bit
Is a tomato a fruit or vegetable? Is it a vegetabl...
JANG answers correctly. Standard enters repetition loop.
TinyLlama-1.1B
Llama GQA 8:1 JANG_4S 4.1 bits M4 Max
“What is the chemical formula for water?”
JANG_4S (4.1 bits)
“What is the chemical formula for water? Answers: 1. H...”
4-bit
“What is the chemical formula for hydrogen peroxide?...”
JANG_4S (4.1 bits) stays on topic. 4-bit (4.5 bits) derails to a different question.
Logit MSE Proof

JANG at 3.37 bits beats 4-bit

Qwen2.5-3B · “What is 2+2?” · Logit MSE vs bf16 reference · Lower is better

Apple M4 Max · 107 GB unified memory · affine quantization · group_size=64

Config | Avg bits | Logit MSE
MLP=4, attn=8 | 4.49 | 7.13
MLP=4, attn=6 | 4.24 | 8.70
4-bit | 4.00 | 11.31
MLP=3, attn=6 | 3.37 | 11.10 ✔

JANG at 3.37 bits (MSE 11.10) beats 4.00 bits (MSE 11.31) — 16% fewer bits with better quality.
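The metric itself is simple: mean squared error between the quantized model's logits and a bf16 reference on the same prompt. A sketch of the computation, with model loading and the forward passes elided:

```python
import numpy as np

def logit_mse(logits_ref, logits_q):
    """MSE between a quantized model's logits and the full-precision
    reference logits for the same prompt. Lower is better."""
    ref = np.asarray(logits_ref, dtype=np.float64)
    quant = np.asarray(logits_q, dtype=np.float64)
    return float(np.mean((ref - quant) ** 2))
```

Because the comparison is against the same base weights, this isolates quantization error from every other source of variance.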

Summary

All models tested

Model | Params | Architecture | Tests | Failure mode
Mistral-7B | 7B | Mistral GQA 4:1, sliding window | 13 | 3-bit → number sequences, 4-bit → loops
TinyLlama-1.1B | 1.1B | Llama GQA 8:1 | 11 | 4-bit → topic derail
SmolLM2-1.7B | 1.7B | Llama MHA | 11 | 3-bit → number sequences
Phi-2 | 2.7B | Phi MHA, GELU MLP | 9 | 2-bit → empty output
Qwen2.5-7B | 7B | Qwen GQA 4:1 | 9 | 3-bit → repetition loop
Qwen2.5-3B | 3B | Qwen GQA 8:1 | 6 | 4-bit → echo/loop
Qwen3.5-4B | 4B | Hybrid: 24 linear + 8 full attn | 6 | 2-bit → 0/6 correct
All tests: Apple M4 Max · 107 GB unified memory · MLX affine quantization · group_size=64 · same tokenizer · same prompt template · 45 experiments · 8 models · Qwen3.5-9B downloaded, testing pending

Profiles

JANG_{bits}{size}

11 predefined profiles from ultra-compressed to near-lossless. S = Small (most compression), M = Medium (balanced), L = Large (best quality).

Profile | MLP | Attention | Embed | lm_head | Avg Bits
JANG_1L | 2-bit | 8-bit | 8-bit | 8-bit | ~2.2
JANG_2S | 2-bit | 6-bit | 4-bit | 6-bit | ~2.5
JANG_2M | 2-bit | 8-bit | 4-bit | 8-bit | ~2.7
JANG_2L | 2-bit | 8-bit | 6-bit | 8-bit | ~2.9
JANG_3S | 3-bit | 4-bit | 4-bit | 6-bit | ~3.1
JANG_3M | 3-bit | 6-bit | 4-bit | 6-bit | ~3.4
JANG_3L | 3-bit | 8-bit | 4-bit | 8-bit | ~3.6
JANG_4S | 4-bit | 5-bit | 4-bit | 6-bit | ~4.1
JANG_4M | 4-bit | 6-bit | 4-bit | 6-bit | ~4.2
JANG_4L | 4-bit | 8-bit | 4-bit | 8-bit | ~4.5
JANG_6M | 6-bit | 8-bit | 6-bit | 8-bit | ~6.2
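A rough way to see where such averages come from: weight each tier's bit width by its share of parameters. The shares below are illustrative guesses (the text above puts attention at roughly 12% of parameters; the others vary by model), and real averages also depend on model shape and per-group metadata, so this only approximates the table.

```python
# Illustrative parameter shares by tier (assumptions, not JANG's
# exact accounting): MLP dominates, attention ~12%, the rest small.
SHARES = {"mlp": 0.80, "attention": 0.12, "embed": 0.07, "lm_head": 0.01}

def avg_bits(profile: dict, shares: dict = SHARES) -> float:
    """Parameter-weighted average bit width for a profile."""
    return sum(profile[tier] * shares[tier] for tier in shares)

# JANG_2M per the table: MLP=2, attention=8, embed=4, lm_head=8
jang_2m = {"mlp": 2, "attention": 8, "embed": 4, "lm_head": 8}
```

With these shares `avg_bits(jang_2m)` lands near 2.9, in the same ballpark as the listed ~2.7; the gap shows why the real shares matter.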
Runtime

Swift + Metal inference engine

14 custom Metal GPU kernels. Zero-copy mmap loading. Fused dequantization for decode and prefill.

jang — Terminal
$ jang run --model Qwen2.5-3B-JANG_4L.jang
# Loading model (zero-copy mmap)...
# Profile: JANG_4L (MLP=4, attn=8, avg=4.5 bits)
# Size: 1.8 GB — loaded in 0.39s
> What is photosynthesis?
Photosynthesis is the process by which green plants and some other organisms use sunlight to synthesize foods from carbon dioxide and water. It generally involves the green pigment chlorophyll and generates oxygen as a byproduct.

Dequant + GEMV

Fused dequantization + matrix-vector multiply for single-token decode. All bit widths (2, 3, 4, 5, 6, 8) in one kernel.

Dequant + GEMM

Fused dequantization + matrix-matrix multiply for prompt prefill. Tiled for Apple GPU threadgroup memory.

GQA Attention

Grouped-query attention decode + causal prefill. Supports standard, sliding window, and hybrid architectures.

RMSNorm + RoPE

Fused normalization and rotary position embedding. Traditional and non-traditional RoPE variants.

SwiGLU

Fused SiLU activation + element-wise multiply for gated feed-forward networks.

Quantized Embedding

Direct embedding lookup from quantized weights. No full-table dequantization needed.

Quantize

Convert any model

Python tooling to convert HuggingFace models to .jang format. Pick a profile, choose your quantization method, and go. Supports RTN, MSE-optimal grid search, and GPTQ (Hessian-guided) quantization.

6+ architecture families: Llama, Qwen, Gemma, Phi, Mistral, Mamba/SSM, MoE, and hybrid models including Qwen 3.5.

Open source — Apache 2.0 License
jang-tools
$ pip install jang-tools
$ jang convert --model Qwen/Qwen2.5-7B \
    --profile JANG_4L \
    --method gptq \
    --output ./Qwen2.5-7B-JANG_4L/
# Quantizing with GPTQ (Hessian-guided)...
# Attention layers: 8-bit | MLP: 4-bit
# Average bits: 4.5 | Size: 4.1 GB
# Done ✔
MLX Studio — JANG Converter
Memory

Run bigger models on less RAM

JANG_3M saves 25% vs 4-bit with comparable quality on 7B+ models. Fit models in unified memory that wouldn't fit before.

~4.1 GB
7B at JANG_4S (vs 4.5 GB 4-bit)
~8.2 GB
14B at JANG_4S (vs 9 GB 4-bit)
~41 GB
70B at JANG_4S (vs 45 GB 4-bit)
25%
Savings at JANG_3M vs 4-bit
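These figures can be sanity-checked with a back-of-envelope size formula. Assuming an fp16 scale plus fp16 offset per 64-weight group (about 0.5 extra bits per weight; an assumption about the layout, not a documented JANG detail):

```python
def model_size_gb(params: float, avg_bits: float,
                  group_size: int = 64, meta_bits: int = 32) -> float:
    """Rough quantized model size: weight bits plus per-group metadata
    (assumed fp16 scale + fp16 offset = 32 bits per group of 64)."""
    bits_per_weight = avg_bits + meta_bits / group_size
    return params * bits_per_weight / 8 / 1e9

# e.g. a 7B model at JANG_4S (~4.1 avg bits)
size_7b = model_size_gb(7e9, 4.1)
```

This comes out near 4 GB for the 7B case, in line with the ~4.1 GB figure above; embeddings and exact parameter counts account for the remainder.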
Native Integration

Run JANG models in MLX Studio

MLX Studio has native JANG support with OpenAI-compatible API, prefix caching, paged KV cache, KV quantization (q4/q8), continuous batching, and 20+ agentic coding tools. Load any .jang model and serve it locally — works with Cursor, Continue, Aider, and any OpenAI API client. Powered by vMLX Engine, now open source — pip install vmlx.
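A request to the OpenAI-compatible endpoint might look like the following. The model name is an assumption (use whichever .jang model you have loaded), and the exact host and port come from your MLX Studio configuration:

```python
import json

# Hypothetical request body for MLX Studio's OpenAI-compatible API.
# POST this to <server>/v1/chat/completions with any OpenAI client.
payload = {
    "model": "Qwen2.5-7B-JANG_4L",   # assumed model name
    "messages": [
        {"role": "user", "content": "What is photosynthesis?"},
    ],
    "temperature": 0.0,
    "max_tokens": 80,
}
body = json.dumps(payload)
```

Any OpenAI client works the same way: point its base URL at the local server and pass the loaded .jang model's name.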
