JANG
The GGUF for MLX
Better quality at every model size.
GGUF gave llama.cpp K-quants — smart bit allocation that protects important layers. MLX has no equivalent. JANG fills that gap. It assigns more bits to attention and fewer to MLP, so models stay coherent at 2–3 bits where standard quantization produces garbage.
Same model, same size, same speed — just better output. Models stay quantized in GPU memory using MLX’s native kernels. No float16 expansion, no speed penalty. Open source under Apache 2.0.
Variable bit widths based on layer sensitivity
Standard quantization applies the same bit width to every tensor. Attention layers (~12% of parameters) are more sensitive to precision loss than MLP layers — when quantized too aggressively, attention scores flatten, positional encoding degrades, and output degenerates.
JANG classifies tensors into sensitivity tiers and assigns bit widths accordingly. Attention layers get 5–8 bits while MLP compresses to 2–4 bits. The overhead is ~0.3 extra bits on average.
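As a sketch, tier assignment can be as simple as matching tensor names against sensitivity classes. The function and profile below are illustrative, not JANG's actual internals; the bit widths mirror a JANG_2M-style profile:

```python
# Illustrative name-based sensitivity tiers (not JANG's real code).
ATTN_KEYWORDS = ("q_proj", "k_proj", "v_proj", "o_proj")
MLP_KEYWORDS = ("gate_proj", "up_proj", "down_proj")

def assign_bits(tensor_name: str, profile: dict) -> int:
    """Map a tensor name to a bit width by sensitivity tier."""
    if any(k in tensor_name for k in ATTN_KEYWORDS):
        return profile["attention"]   # sensitive: keep precision
    if any(k in tensor_name for k in MLP_KEYWORDS):
        return profile["mlp"]         # bulk of parameters: compress hard
    if "embed" in tensor_name:
        return profile["embed"]
    if "lm_head" in tensor_name:
        return profile["lm_head"]
    return profile["default"]

# Profile shaped like JANG_2M: 2-bit MLP, 8-bit attention.
jang_2m = {"mlp": 2, "attention": 8, "embed": 4, "lm_head": 8, "default": 6}

assign_bits("model.layers.0.self_attn.q_proj.weight", jang_2m)  # 8
assign_bits("model.layers.0.mlp.down_proj.weight", jang_2m)     # 2
```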
JANG vs MLX — side by side
Each JANG model compared against the closest MLX method by size. 200-question MMLU (20 per subject × 10 subjects), thinking disabled, temp 0.0. Apple M4 Max 128 GB.
MiniMax-M2.5 (230B) — JANG vs MLX
MLX is completely broken on MiniMax at every bit level — 4-bit (26.5%), 3-bit (24.5%), and 2-bit (25%) all score near random. JANG_2L at just 2.10 bits is the only way to run MiniMax quantized on Apple Silicon.
Qwen3.5-122B-A10B — ~4 bits
Qwen3.5-122B-A10B — ~2 bits
Qwen3.5-35B-A3B — ~4 bits
Qwen3.5-35B-A3B — ~2 bits
Download: JANG_4K 122B · JANG_2S 122B · JANG_4K 35B · JANG_2S 35B · JANG_1L 122B
Three-way comparison on basic prompts
Side-by-side on 6 factual prompts. All methods use MLX’s native Metal kernels. Temperature 0.0, max 80 tokens. M4 Max 128 GB.
MLX’s mixed_2_6 mode protects select v_proj and down_proj layers at 6-bit, but does not account for GatedDeltaNet linear attention layers, MoE expert routing tensors, or hybrid architecture components. JANG’s tier system classifies these architecture-specific tensors explicitly.
JANG_2L: 74% MMLU (200q) at 82.5 GB RAM — nearly 3x the score of MLX 4-bit at 120 GB
On this hybrid MoE model, MLX mixed_2_6 does not improve over 2-bit. The mixed_2_6 heuristic targets v_proj and down_proj in standard transformer layers but misses GatedDeltaNet attention and MoE routing tensors that are critical for this architecture.
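A hedged sketch of the difference: a heuristic that only matches `v_proj`/`down_proj` never sees hybrid-architecture tensors, while an explicit tier list can name them. The tensor names below follow common Qwen3-style hybrid checkpoints and are illustrative:

```python
# Patterns an architecture-aware tier system can protect explicitly.
SENSITIVE_PATTERNS = (
    "self_attn",     # full-attention blocks
    "linear_attn",   # GatedDeltaNet linear-attention blocks
    ".mlp.gate.",    # MoE expert router (distinct from gate_proj)
)

def is_sensitive(name: str) -> bool:
    return any(p in name for p in SENSITIVE_PATTERNS)

# A v_proj/down_proj-only heuristic would leave both of these at 2-bit:
is_sensitive("model.layers.5.linear_attn.in_proj_qkvz.weight")  # True
is_sensitive("model.layers.7.mlp.gate.weight")                  # True (router)
```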
`<think>` reasoning preserved at 2.19 bits
Size, speed, and scores at ~2 bits
| Model | Method | Bits | RAM | MMLU |
|---|---|---|---|---|
| Qwen3.5-122B-A10B | JANG_2M | 2.14 | 44.7 GB | 79% |
| | JANG_1L | 2.24 | 46.0 GB | 73% |
| | JANG_2L | 2.19 | 45.3 GB | — |
| | MLX mixed_2_6 | ~2.5 | 45 GB | 46% |
| | MLX 2-bit | 2.0 | 36 GB | 56.5% |
| Qwen3.5-35B-A3B | JANG_4K | 3.99 | 20.1 GB | 77.5% |
| | MLX 4-bit | 4.0 | 18.2 GB | 75.5% |
| | JANG_4S | 4.04 | 20.4 GB | 82% |
| | JANG_2S | 2.17 | 12.8 GB | 65.5% |
| | JANG_2L v2 | 2.28 | 13.3 GB | 56% |
| | MLX mixed_2_6 | ~2.5 | 12.8 GB | ~40% |
| MiniMax-M2.5 (230B) | JANG_2S | 2.06 | 81.6 GB | — |
| | JANG_2L | 2.10 | 82.5 GB | 74% |
| | MLX 4-bit | 4.0 | 119.8 GB | 26.5% |
| | MLX 2-bit | 2.0 | 66.6 GB | 25.0% |
All on Apple M4 Max 128 GB · MMLU: 200-question MMLU (10 subjects × 20), thinking disabled · Experiment 055, 2026-03-16
Pipeline verification: JANG_4S matches MLX 4-bit exactly on 35B MMLU (82% vs 82%), confirming the quantization pipeline adds no loss of its own at matched bit widths.
JANG_2M uses similar RAM to MLX mixed_2_6 (44.7 GB vs 45 GB) while scoring 79% vs 46% on MMLU (200q).
MiniMax-M2.5 (230B): JANG_2L scores 74% MMLU (200q) at 82.5 GB RAM vs MLX 4-bit at 26.5% (119.8 GB) and MLX 2-bit at 25.0% (66.6 GB). Nearly 3x accuracy at 37 GB less RAM.
On 35B, JANG_2S at 12.8 GB RAM scores 65.5% vs mixed_2_6 at 12.8 GB scoring ~40%. JANG_2L v2 at the same ~2.3 bits scores 56% MMLU. At 4-bit, JANG_4S and MLX match exactly (82% MMLU).
Dense model comparisons (1B–7B)
Comparisons at the degradation boundary — the bit width where standard quantization starts producing degenerate output. Same prompts, same temperature, same model. All on M4 Max.
At 2.5 effective bits, JANG_2S gets 6/6 correct while 2-bit gets 0/6. JANG protects the 8 critical full-attention layers at 6-bit while compressing the 24 linear-attention layers and all MLP at 2-bit.
Highlights — 7B models
JANG_3M (3.4 bits)
3-bit (3.5 bits)
JANG_3L (3.6 bits)
3-bit (3.5 bits)
JANG_4S (4.1 bits)
4-bit (4.5 bits)
JANG_2S (2.5 bits)
2-bit (2.5 bits)
More 7B results
JANG_3L (3.6 bits)
3-bit
JANG_3M (3.4 bits)
3-bit
JANG_3L (3.6 bits)
3-bit
JANG_2M (2.7 bits)
2-bit
JANG_4L (4.5 bits)
4-bit
JANG_2S (2.5 bits)
2-bit
Smaller models (1B–3B)
JANG_3M (3.4 bits)
3-bit
JANG_2S (2.5 bits)
2-bit
JANG_4S (4.1 bits)
4-bit
JANG_4L (4.5 bits)
4-bit
JANG (4.12 bits)
4-bit
JANG_4S (4.1 bits)
4-bit
JANG at 3.37 bits beats 4-bit
Qwen2.5-3B · “What is 2+2?” · Logit MSE vs bf16 reference · Lower is better
Apple M4 Max · 107 GB unified memory · affine quantization · group_size=64
JANG at 3.37 bits (MSE 11.10) beats 4.00 bits (MSE 11.31) — 16% fewer bits with better quality.
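The logit-MSE metric itself is straightforward; a minimal sketch, assuming you already have the output logits from a quantized run and a bf16 reference run on the same prompt:

```python
import numpy as np

def logit_mse(quant_logits, ref_logits):
    """Mean squared error between the quantized model's logits and
    the bf16 reference logits for the same prompt. Lower is better."""
    q = np.asarray(quant_logits, dtype=np.float64)
    r = np.asarray(ref_logits, dtype=np.float64)
    return float(np.mean((q - r) ** 2))
```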
All models tested
| Model | Params | Architecture | Tests | Failure mode |
|---|---|---|---|---|
| Mistral-7B | 7B | Mistral GQA 4:1, sliding window | 13 | 3-bit → number sequences, 4-bit → loops |
| TinyLlama-1.1B | 1.1B | Llama GQA 8:1 | 11 | 4-bit → topic derail |
| SmolLM2-1.7B | 1.7B | Llama MHA | 11 | 3-bit → number sequences |
| Phi-2 | 2.7B | Phi MHA, GELU MLP | 9 | 2-bit → empty output |
| Qwen2.5-7B | 7B | Qwen GQA 4:1 | 9 | 3-bit → repetition loop |
| Qwen2.5-3B | 3B | Qwen GQA 8:1 | 6 | 4-bit → echo/loop |
| Qwen3.5-4B | 4B | Hybrid: 24 linear + 8 full attn | 6 | 2-bit → 0/6 correct |
All tests: Apple M4 Max · 107 GB unified memory · MLX affine quantization · group_size=64 · same tokenizer · same prompt template · 45 experiments · 8 models · Qwen3.5-9B downloaded, testing pending
JANG_{bits}{size}
11 predefined profiles from ultra-compressed to near-lossless. S = Small (most compression), M = Medium (balanced), L = Large (best quality).
| Profile | MLP | Attention | Embed | lm_head | Avg Bits |
|---|---|---|---|---|---|
| JANG_1L | 2-bit | 8-bit | 8-bit | 8-bit | ~2.2 |
| JANG_2S | 2-bit | 6-bit | 4-bit | 6-bit | ~2.5 |
| JANG_2M | 2-bit | 8-bit | 4-bit | 8-bit | ~2.7 |
| JANG_2L | 2-bit | 8-bit | 6-bit | 8-bit | ~2.9 |
| JANG_3S | 3-bit | 4-bit | 4-bit | 6-bit | ~3.1 |
| JANG_3M | 3-bit | 6-bit | 4-bit | 6-bit | ~3.4 |
| JANG_3L | 3-bit | 8-bit | 4-bit | 8-bit | ~3.6 |
| JANG_4S | 4-bit | 5-bit | 4-bit | 6-bit | ~4.1 |
| JANG_4M | 4-bit | 6-bit | 4-bit | 6-bit | ~4.2 |
| JANG_4L | 4-bit | 8-bit | 4-bit | 8-bit | ~4.5 |
| JANG_6M | 6-bit | 8-bit | 6-bit | 8-bit | ~6.2 |
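The Avg Bits column is approximately the parameter-weighted mean of the per-tier widths. A rough check — the parameter shares below are illustrative for a typical dense transformer, not exact JANG figures:

```python
def avg_bits(profile, shares):
    """Parameter-weighted average bit width across tiers."""
    return sum(profile[tier] * shares[tier] for tier in shares)

# JANG_2M widths from the table above.
jang_2m = {"mlp": 2, "attention": 8, "embed": 4, "lm_head": 8}
# Illustrative shares: MLP dominates, attention ~12% (see above).
shares = {"mlp": 0.85, "attention": 0.12, "embed": 0.02, "lm_head": 0.01}

avg = avg_bits(jang_2m, shares)   # ~2.8, in the ballpark of the listed ~2.7
```

The small gap to the listed ~2.7 comes from per-model parameter shares and group-wise scale/bias overhead; the point is that an 8-bit attention tier costs little because attention is a small fraction of parameters.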
Swift + Metal inference engine
14 custom Metal GPU kernels. Zero-copy mmap loading. Fused dequantization for decode and prefill.
Dequant + GEMV
Fused dequantization + matrix-vector multiply for single-token decode. All bit widths (2, 3, 4, 5, 6, 8) in one kernel.
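As a reference for what the fused kernel computes — not the Metal implementation itself — here is the math in NumPy, assuming MLX-style affine groups where each run of `group_size` weights shares a scale and bias:

```python
import numpy as np

def dequant_gemv(q, scales, biases, x, group_size=64):
    """y = W @ x, where W is reconstructed group-wise as
    W = scale * q + bias. A fused kernel dequantizes in registers
    and never materializes W; this reference does, for clarity.
    q: (out, in) integer codes; scales/biases: (out, in // group_size)."""
    s = np.repeat(scales, group_size, axis=1)   # broadcast per group
    b = np.repeat(biases, group_size, axis=1)
    return (s * q + b) @ x

# Tiny example with group_size=2 for readability.
q = np.array([[1.0, 3.0, 0.0, 2.0]])
scales = np.array([[0.5, 1.0]])
biases = np.array([[0.0, -1.0]])
x = np.ones(4)
y = dequant_gemv(q, scales, biases, x, group_size=2)
# Reconstructed row is [0.5, 1.5, -1.0, 1.0]; dot with ones gives 2.0.
```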
Dequant + GEMM
Fused dequantization + matrix-matrix multiply for prompt prefill. Tiled for Apple GPU threadgroup memory.
GQA Attention
Grouped-query attention decode + causal prefill. Supports standard, sliding window, and hybrid architectures.
RMSNorm + RoPE
Fused normalization and rotary position embedding. Traditional and non-traditional RoPE variants.
SwiGLU
Fused SiLU activation + element-wise multiply for gated feed-forward networks.
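The fused operation is small enough to state exactly; a NumPy reference of what the kernel computes on the already-projected gate and up activations:

```python
import numpy as np

def swiglu(gate, up):
    """SiLU(gate) * up in one pass, matching the fused kernel:
    the intermediate activation never round-trips through memory."""
    return gate / (1.0 + np.exp(-gate)) * up   # SiLU(x) = x * sigmoid(x)
```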
Quantized Embedding
Direct embedding lookup from quantized weights. No full-table dequantization needed.
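A sketch of the idea, assuming the group-wise affine layout used elsewhere on this page: gather only the rows the current tokens index, then dequantize just those rows, never the whole table.

```python
import numpy as np

def embedding_lookup(token_ids, q, scales, biases, group_size=2):
    """Dequantize only the gathered rows. q: (vocab, dim) codes;
    scales/biases: (vocab, dim // group_size). Full-table expansion
    would cost vocab * dim; this costs len(token_ids) * dim."""
    rows = q[token_ids]                                  # gather codes
    s = np.repeat(scales[token_ids], group_size, axis=1)
    b = np.repeat(biases[token_ids], group_size, axis=1)
    return s * rows + b

# Two-row toy table, group_size=2 for readability.
q = np.array([[1.0, 3.0], [2.0, 0.0]])
scales = np.array([[0.5], [1.0]])
biases = np.array([[0.0], [1.0]])
out = embedding_lookup([1], q, scales, biases)   # [[3.0, 1.0]]
```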
Convert any model
Python tooling to convert HuggingFace models to .jang format. Pick a profile, choose your quantization method, and go. Supports RTN, MSE-optimal grid search, and GPTQ (Hessian-guided) quantization.
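Of the three methods, RTN is simple enough to sketch in a few lines. Assuming the same group-wise affine layout (`w ≈ scale * q + bias`, `group_size=64`) used throughout this page:

```python
import numpy as np

def rtn_quantize(w, bits=4, group_size=64):
    """Round-to-nearest affine quantization: each group of
    `group_size` weights gets its own scale and bias so that
    w ≈ scale * q + bias with integer codes q in [0, 2^bits - 1]."""
    g = w.reshape(-1, group_size)
    lo = g.min(axis=1, keepdims=True)
    hi = g.max(axis=1, keepdims=True)
    levels = (1 << bits) - 1
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)
    q = np.clip(np.round((g - lo) / scale), 0, levels)
    return q, scale, lo   # bias is the group minimum

def rtn_dequantize(q, scale, bias):
    return scale * q + bias

w = np.linspace(-1.0, 1.0, 128)
q, s, b = rtn_quantize(w, bits=8, group_size=64)
err = np.abs(rtn_dequantize(q, s, b) - w.reshape(-1, 64)).max()
```

MSE-optimal grid search refines the per-group (scale, bias) pair beyond min/max, and GPTQ additionally uses second-order (Hessian) information to compensate rounding error across columns.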
6+ architecture families: Llama, Qwen, Gemma, Phi, Mistral, Mamba/SSM, MoE, and hybrid models including Qwen 3.5.
Run bigger models on less RAM
JANG_3M saves 25% vs 4-bit with comparable quality on 7B+ models. Fit models in unified memory that wouldn't fit before.
Pre-quantized models on HuggingFace
Ready to download. Compatible with vMLX Engine / MLX Studio via the JANG loader.
Run JANG models in MLX Studio
MLX Studio has native JANG support with OpenAI-compatible API,
prefix caching, paged KV cache, KV quantization (q4/q8), continuous batching,
and 20+ agentic coding tools. Load any .jang model and serve it locally —
works with Cursor, Continue, Aider, and any OpenAI API client.
Powered by vMLX Engine,
now open source — pip install vmlx.