Open Source · Curated Wins Only

JANG

Smaller than MLX. Sharper output.

Only the cases where JANG wins by a long shot while using less memory or fewer bits.

The benchmark page is now filtered hard: no close wins, no larger JANG configurations, no estimates. If the comparison is listed here, JANG is smaller than the MLX baseline and the quality gap is obvious.

The strongest proof points are MiniMax-M2.5 at 82.5 GB beating MLX 4-bit at 119.8 GB by +47.5 MMLU points, and Qwen3.5-122B at 44.7 GB beating MLX mixed_2_6 at 45 GB by +33 points.

Smaller-than-MLX proof set · MMLU blowouts only · Coherency failures filtered · No close wins · No larger JANG configs · Open source · Apache 2.0

+47.5 · MMLU over MLX 4-bit
37.3 · GB smaller on MiniMax
+33 · MMLU over MLX mixed_2_6
3.37-bit · Fewer bits, better MSE
How It Works

Variable bit widths based on layer sensitivity

Standard quantization applies the same bit width to every tensor. Attention layers (~12% of parameters) are more sensitive to precision loss than MLP layers — when quantized too aggressively, attention scores flatten, positional encoding degrades, and output degenerates.

JANG classifies tensors into sensitivity tiers and assigns bit widths accordingly. Attention layers get 5–8 bits while MLP compresses to 2–4 bits. The overhead is ~0.3 extra bits on average.

Attention
8-bit — protected
MLP
2-bit — compressed
Embed
4-bit
lm_head
6-bit
Result
JANG_2M → 2.7 avg bits → coherent output
3-bit → 3.0 avg bits → repetition loops
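
How a profile turns into per-tensor bit widths, as a minimal sketch (the function, profile dictionary, and tensor-name patterns below are illustrative assumptions, not the jang-tools API): classify each tensor by name into a sensitivity tier, then look up that tier's width.

tier_assignment.py
# Illustrative tier-based bit assignment; tensor-name patterns follow
# common HuggingFace Llama-style naming and are assumptions.
JANG_2M = {"mlp": 2, "attention": 8, "embed": 4, "lm_head": 8}

def bits_for(tensor_name: str, profile: dict) -> int:
    """Classify a tensor into a sensitivity tier and return its bit width."""
    if "lm_head" in tensor_name:
        return profile["lm_head"]
    if "embed" in tensor_name:
        return profile["embed"]
    if any(k in tensor_name for k in ("q_proj", "k_proj", "v_proj", "o_proj")):
        return profile["attention"]   # sensitive tier: keep high precision
    return profile["mlp"]             # bulk of parameters: compress hardest

print(bits_for("model.layers.0.self_attn.q_proj.weight", JANG_2M))  # 8
print(bits_for("model.layers.0.mlp.gate_proj.weight", JANG_2M))     # 2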
Curated Proof Set

Only the smaller blowout wins.

Filtered to proven comparisons where JANG is smaller than the MLX baseline and wins by a large MMLU or coherency margin. Close wins, larger JANG configs, and untested estimates were removed.

MiniMax-M2.5 (230B) — MMLU blowout, smaller than MLX 4-bit

JANG
JANG_2L
82.5 GB · 2.10 bits · 0.9s/question
74.0%
MMLU (200q) · 148/200
+47.5 points · 37.3 GB smaller
MLX baseline
4-bit
119.8 GB · 4.0 bits · 0.9s/question
26.5%
MMLU (200q) · 53/200

Kept because JANG is both dramatically better and dramatically smaller: 82.5 GB vs 119.8 GB, with MLX 4-bit, 3-bit, and 2-bit all scoring near random chance.

Per-subject proof — MiniMax-M2.5
Subject | JANG_2L | MLX 4-bit | MLX 3-bit | MLX 2-bit
Abstract Algebra | 10/20 | 3/20 | 2/20 | 5/20
Anatomy | 15/20 | 7/20 | 5/20 | 5/20
Astronomy | 20/20 | 7/20 | 6/20 | 4/20
College CS | 13/20 | 4/20 | 5/20 | 6/20
College Physics | 13/20 | 8/20 | 6/20 | 6/20
HS Biology | 18/20 | 4/20 | 5/20 | 6/20
HS Chemistry | 18/20 | 4/20 | 5/20 | 5/20
HS Mathematics | 8/20 | 6/20 | 6/20 | 3/20
Logical Fallacies | 18/20 | 5/20 | 4/20 | 5/20
World Religions | 15/20 | 5/20 | 5/20 | 5/20
Total | 148/200 (74%) | 53/200 (26.5%) | 49/200 (24.5%) | 50/200 (25%)

Qwen3.5-122B-A10B — same-size MLX mixed baseline, JANG still smaller

JANG
JANG_2M
44.7 GB · 2.14 bits
79%
MMLU (200q) · 158/200
+33 points · 0.3 GB smaller
MLX baseline
mixed_2_6
45 GB · ~2.5 bits
46%
MMLU (200q) · 92/200

Kept because this is the closest same-memory MLX comparison: JANG is slightly smaller and still lands a +33 point MMLU gap.

Mistral-7B-v0.3 — photosynthesis
JANG_3M · 3.4 bits vs 3.5-bit MLX
“What is photosynthesis?”
JANG_3M
Correct explanation of plants using sunlight.
MLX 3-bit
Number-sequence degeneration.
Kept: JANG uses fewer bits and stays coherent.
Mistral-7B — arithmetic
JANG_4S · 4.1 bits vs 4.5-bit MLX
“What is 2+2?”
JANG_4S
“4”
MLX 4-bit
Loops the question.
Kept: smaller bit width and decisive coherency win.
Qwen2.5-3B — translation / factual QA
JANG_4S · 4.1–4.12 bits vs 4.5-bit MLX
“Translate 'thank you' to Spanish.” / “Is a tomato a fruit or vegetable?”
JANG
Answers directly: “gracias”; tomato is a fruit.
MLX 4-bit
Echoes or repeats the prompt.
Kept: smaller than 4-bit with clear coherency wins.
SmolLM2-1.7B — spider legs
JANG_3M · 3.4 bits vs 3.5-bit MLX
“How many legs does a spider have?”
JANG_3M
Answers 8.
MLX 3-bit
Number-sequence output.
Kept: fewer bits and direct answer.
TinyLlama-1.1B — water formula
JANG_4S · 4.1 bits vs 4.5-bit MLX
“What is the chemical formula for water?”
JANG_4S
Stays on topic.
MLX 4-bit
Derails to a different chemistry question.
Kept: smaller than 4-bit and more coherent.
Logit MSE Proof

Qwen2.5-3B: 3.37 bits beats 4.00-bit MSE

Lower is better. JANG MLP=3 / attention=6 reaches 11.10 MSE at 3.37 bits vs MLX 4-bit at 11.31 MSE.

MLX 4-bit
11.31 MSE — 4.00 bits
JANG
11.10 MSE — 3.37 bits
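
The reported figure is the mean squared difference between the quantized model's output logits and the full-precision reference on the same prompts. A minimal sketch of that metric (not the exact harness behind the numbers above), assuming the logits have been exported as arrays:

logit_mse.py
import numpy as np

def logit_mse(ref_logits: np.ndarray, quant_logits: np.ndarray) -> float:
    # Both arrays are assumed to be shaped (tokens, vocab) for the same prompts.
    return float(np.mean((ref_logits - quant_logits) ** 2))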
Summary

Shown models: decisive smaller wins only

Model | JANG | MLX baseline | Why shown
MiniMax-M2.5 | JANG_2L · 82.5 GB · 74% | 4-bit · 119.8 GB · 26.5% | +47.5 MMLU, 37.3 GB smaller
Qwen3.5-122B-A10B | JANG_2M · 44.7 GB · 79% | mixed_2_6 · 45 GB · 46% | +33 MMLU, slightly smaller
Mistral-7B | JANG_3M / JANG_4S | 3-bit / 4-bit MLX | Fewer bits, coherent output
Qwen2.5-3B | JANG_4S / 3.37-bit proof | 4-bit MLX | Fewer bits, better MSE/coherency
SmolLM2-1.7B | JANG_3M · 3.4 bits | 3-bit MLX · 3.5 bits | Smaller and answers directly
TinyLlama-1.1B | JANG_4S · 4.1 bits | 4-bit MLX · 4.5 bits | Smaller and avoids topic derail
Profiles

JANG_{bits}{size}

11 predefined profiles from ultra-compressed to near-lossless. S = Small (most compression), M = Medium (balanced), L = Large (best quality).

Profile | MLP | Attention | Embed | lm_head | Avg Bits
JANG_1L | 2-bit | 8-bit | 8-bit | 8-bit | ~2.2
JANG_2S | 2-bit | 6-bit | 4-bit | 6-bit | ~2.5
JANG_2M | 2-bit | 8-bit | 4-bit | 8-bit | ~2.7
JANG_2L | 2-bit | 8-bit | 6-bit | 8-bit | ~2.9
JANG_3S | 3-bit | 4-bit | 4-bit | 6-bit | ~3.1
JANG_3M | 3-bit | 6-bit | 4-bit | 6-bit | ~3.4
JANG_3L | 3-bit | 8-bit | 4-bit | 8-bit | ~3.6
JANG_4S | 4-bit | 5-bit | 4-bit | 6-bit | ~4.1
JANG_4M | 4-bit | 6-bit | 4-bit | 6-bit | ~4.2
JANG_4L | 4-bit | 8-bit | 4-bit | 8-bit | ~4.5
JANG_6M | 6-bit | 8-bit | 6-bit | 8-bit | ~6.2
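
The Avg Bits column is roughly the parameter-weighted mean of the tier widths. A back-of-envelope check, assuming attention holds ~12% of parameters (per the section above) and treating the embedding/lm_head share as negligible, which is an illustrative simplification; real averages are measured per model:

avg_bits.py
# Approximate average bits as a parameter-weighted mean of tier widths.
# The 88/12 split is an assumption and ignores per-group scale overhead.
MLP_FRAC, ATTN_FRAC = 0.88, 0.12

def approx_avg_bits(mlp_bits: int, attn_bits: int) -> float:
    return MLP_FRAC * mlp_bits + ATTN_FRAC * attn_bits

print(round(approx_avg_bits(2, 8), 2))  # 2.72, close to the ~2.7 listed for JANG_2M
print(round(approx_avg_bits(4, 5), 2))  # 4.12, close to the ~4.1 listed for JANG_4S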
Runtime

Swift + Metal inference engine

14 custom Metal GPU kernels. Zero-copy mmap loading. Fused dequantization for decode and prefill.

jang — Terminal
$ jang run --model Qwen2.5-3B-JANG_4L.jang
# Loading model (zero-copy mmap)...
# Profile: JANG_4L (MLP=4, attn=8, avg=4.5 bits)
# Size: 1.8 GB — loaded in 0.39s
> What is photosynthesis?
Photosynthesis is the process by which green plants and some other organisms use sunlight to synthesize foods from carbon dioxide and water. It generally involves the green pigment chlorophyll and generates oxygen as a byproduct.

Dequant + GEMV

Fused dequantization + matrix-vector multiply for single-token decode. All bit widths (2, 3, 4, 5, 6, 8) in one kernel.
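
A NumPy reference for what the fused decode kernel computes (the Metal kernel never materializes the dequantized matrix; the group size, scale/zero layout, and already-unpacked integer codes here are illustrative assumptions):

dequant_gemv.py
import numpy as np

def dequant_gemv(q, scales, zeros, x, group_size=64):
    """q: (rows, cols) integer codes; scales/zeros: (rows, cols // group_size)."""
    rows, cols = q.shape
    w = q.astype(np.float32).reshape(rows, cols // group_size, group_size)
    w = w * scales[..., None] + zeros[..., None]   # per-group affine dequant
    return w.reshape(rows, cols) @ x               # matrix-vector product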

Dequant + GEMM

Fused dequantization + matrix-matrix multiply for prompt prefill. Tiled for Apple GPU threadgroup memory.

GQA Attention

Grouped-query attention decode + causal prefill. Supports standard, sliding window, and hybrid architectures.

RMSNorm + RoPE

Fused normalization and rotary position embedding. Traditional and non-traditional RoPE variants.

SwiGLU

Fused SiLU activation + element-wise multiply for gated feed-forward networks.
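
What the fused kernel computes, as a NumPy reference:

swiglu.py
import numpy as np

def swiglu(gate: np.ndarray, up: np.ndarray) -> np.ndarray:
    silu = gate / (1.0 + np.exp(-gate))   # SiLU(x) = x * sigmoid(x)
    return silu * up                      # element-wise gate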

Quantized Embedding

Direct embedding lookup from quantized weights. No full-table dequantization needed.

Quantize

Convert any model

Python tooling to convert HuggingFace models to .jang format. Pick a profile, choose your quantization method, and go. Supports RTN, MSE-optimal grid search, and GPTQ (Hessian-guided) quantization.

6+ architecture families: Llama, Qwen, Gemma, Phi, Mistral, Mamba/SSM, MoE, and hybrid models including Qwen 3.5.
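
The methods differ in how they pick the quantization grid for each weight group. A rough sketch of RTN versus MSE-optimal grid search on a single group (illustrative only, not the jang-tools implementation; GPTQ additionally applies Hessian-guided error compensation and is omitted here):

quant_methods.py
import numpy as np

def rtn(w, bits):
    # Round-to-nearest over the full observed range of the group.
    qmax = 2 ** bits - 1
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / qmax + 1e-12
    q = np.clip(np.round((w - lo) / scale), 0, qmax)
    return q * scale + lo                      # dequantized weights

def mse_grid(w, bits, steps=20):
    # Search over scaled-down candidate ranges and keep the lowest-MSE grid.
    qmax = 2 ** bits - 1
    best, best_err = None, np.inf
    for k in np.linspace(1.0, 0.5, steps):
        lo, hi = float(w.min()) * k, float(w.max()) * k
        scale = (hi - lo) / qmax + 1e-12
        q = np.clip(np.round((w - lo) / scale), 0, qmax)
        deq = q * scale + lo
        err = float(np.mean((w - deq) ** 2))
        if err < best_err:
            best, best_err = deq, err
    return best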

Open source — Apache 2.0 License
jang-tools
$ pip install jang-tools
$ jang convert --model Qwen/Qwen2.5-3B \
    --profile JANG_4S \
    --method gptq \
    --output ./Qwen2.5-3B-JANG_4S/
# Quantizing with GPTQ (Hessian-guided)...
# Attention layers: 5-bit | MLP: 4-bit
# Average bits: 4.1 | Smaller than MLX 4-bit
# Done ✔
MLX Studio — JANG Converter
JANG Model Converter showing all quantization profiles
Memory

Run bigger models on less RAM

JANG_3M saves 25% vs 4-bit with comparable quality on 7B+ models. Fit models in unified memory that wouldn't fit before.

~4.1 GB · 7B at JANG_4S (vs 4.5 GB 4-bit)
~8.2 GB · 14B at JANG_4S (vs 9 GB 4-bit)
~41 GB · 70B at JANG_4S (vs 45 GB 4-bit)
25% · Savings at JANG_3M vs 4-bit
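
A back-of-envelope way to get these footprints: parameters times average bits divided by eight, plus metadata and runtime overhead (the ~15% factor below is an illustrative assumption that roughly matches the figures above, not a measured constant):

memory_estimate.py
def weight_gb(params_billion: float, avg_bits: float, overhead: float = 1.15) -> float:
    # Raw weight GB (params * bits / 8) scaled by assumed metadata/runtime overhead.
    return params_billion * avg_bits / 8 * overhead

print(round(weight_gb(7, 4.1), 1))    # ~4.1 GB for 7B at JANG_4S
print(round(weight_gb(70, 4.1), 1))   # ~41 GB for 70B at JANG_4S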
Models

Proven smaller-win releases

This homepage now only surfaces model releases tied to the curated smaller-win evidence above. The full Hugging Face account is still linked, but the on-page list no longer shows unrelated recent models.

Open curated MiniMax release · Open full JANGQ-AI account
Native Integration

Run JANG models in MLX Studio

MLX Studio has native JANG support with OpenAI-compatible API, prefix caching, paged KV cache, KV quantization (q4/q8), continuous batching, and 20+ agentic coding tools. Load any .jang model and serve it locally — works with Cursor, Continue, Aider, and any OpenAI API client. Powered by vMLX Engine, now open source — pip install vmlx.
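
Pointing the standard openai Python client at the local server looks roughly like this (the port and model name below are placeholders; use whatever MLX Studio shows for your server):

openai_client.py
from openai import OpenAI

# Placeholder base URL and model name; no real API key is needed locally.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="Qwen2.5-3B-JANG_4S",
    messages=[{"role": "user", "content": "What is photosynthesis?"}],
)
print(resp.choices[0].message.content)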

MLX Studio · vMLX Engine