UNIVERSAL IN-BROWSER GGUF RUNTIME • LOAD ANY MODEL • OPEN SOURCE (MIT)
Existing in-browser runtimes like WebLLM (MLC) require every model to be TVM-compiled into WebGPU shaders before it can run. New base models (Qwen 3.5, DeepSeek-R1, Mistral Nemo, Llama 4) sit in the MLC team's compile queue before they become available. Amni-LLM uses GGUF, the format with the broadest community coverage on HuggingFace: most popular models have GGUF builds within days of release. The tradeoff is honest: llama.cpp WASM is currently CPU-SIMD-only, so per-token throughput is lower than on WebGPU stacks. The win is universal model availability and arbitrary URL/file loading.
Search HuggingFace live from inside the app. Pick a model, pick a quantization, click Load or Install. Works for anything with a GGUF build — including Qwen 3.5, DeepSeek-R1, Mistral Nemo, Phi-4, plus thousands of community fine-tunes. Models too large for browser memory (e.g. Llama 4 Scout's 29 GB minimum quant) appear in search but won't load on most machines.
Drop a `.gguf` file straight from your disk. Bytes never leave your machine. Useful for proprietary fine-tunes you can't host publicly, or large models you've already downloaded for desktop tools like Ollama or LM Studio.
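A rough sketch of that flow, assuming a hypothetical `AmniLLM.loadModelFromFile` entry point (the published loader name and package name may differ):

```js
import { AmniLLM } from 'amni-llm'; // illustrative package name and export

// Wire a plain file <input> to a local GGUF load; nothing is uploaded anywhere.
const input = document.querySelector('#gguf-file');

input.addEventListener('change', async () => {
  const file = input.files[0];
  if (!file || !file.name.endsWith('.gguf')) return;

  // `loadModelFromFile` is an assumed name for the local-file loader.
  const engine = await AmniLLM.loadModelFromFile(file);
  console.log(`Loaded ${file.name} (${(file.size / 1e9).toFixed(1)} GB)`);
});
```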
Inference runs in your browser via WebAssembly. Your prompts, your weights, your replies — nothing transits a server. The HuggingFace search hits HF's public API; the model download is a direct browser-to-HF transfer.
Curated short list of current Qwen 3.5 GGUF builds that fit a typical browser memory budget: 0.8B (mobile, 508 MB), 4B (balanced, 2.6 GB), 9B (desktop, 5.4 GB). One click to load. For math/engineering or any other niche, use the built-in HF search.
Built-in live search of `huggingface.co/api/models?library=gguf`. Type any keyword, expand a result to see its GGUF files with quantization labels and sizes, then click Load or Install. The Installed tab keeps your picks in `localStorage`.
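For reference, a minimal sketch of the same Hub query the in-app search issues; the `search` and `limit` parameters are standard HF API options rather than anything Amni-LLM adds:

```js
// Query the public HuggingFace Hub API for GGUF repos matching a keyword.
async function searchGgufModels(keyword) {
  const url = new URL('https://huggingface.co/api/models');
  url.searchParams.set('library', 'gguf'); // same filter the built-in search uses
  url.searchParams.set('search', keyword); // free-text keyword match
  url.searchParams.set('limit', '20');

  const res = await fetch(url);
  if (!res.ok) throw new Error(`HF API returned ${res.status}`);
  const models = await res.json(); // array of { id, downloads, tags, ... }
  return models.map((m) => m.id);  // e.g. "unsloth/...-GGUF"
}
```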
Drop-in `chatCompletions.create({messages, temperature, max_tokens})`, plus async iterator streaming via `chatStream()`. Migrating an existing WebLLM integration is mostly a one-line import change.
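A sketch of what that looks like in code. `chatCompletions.create` and `chatStream` are the method names described above; the package name, engine constructor, and response shape are assumptions for illustration:

```js
import { createEngine } from 'amni-llm'; // illustrative package name and entry point

const engine = await createEngine('unsloth/Qwen3.5-4B-GGUF'); // model id illustrative

// One-shot completion with an OpenAI-style request shape.
const reply = await engine.chatCompletions.create({
  messages: [{ role: 'user', content: 'Explain GGUF quantization in two sentences.' }],
  temperature: 0.7,
  max_tokens: 256,
});
console.log(reply.choices[0].message.content); // assumes an OpenAI-shaped response

// Token-by-token output via the async iterator.
let text = '';
for await (const chunk of engine.chatStream({
  messages: [{ role: 'user', content: 'Now stream the same answer.' }],
})) {
  text += chunk; // append to the DOM as tokens arrive
}
```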
| Feature | Amni-LLM | WebLLM (MLC) |
|---|---|---|
| Load arbitrary HuggingFace model | Yes — any GGUF | No — only MLC pre-compiled |
| Qwen 3.5 today | Yes (verified GGUFs from unsloth) | Not until MLC publishes bundles |
| Llama 4 Scout/Maverick today | Available in HF search; smallest quant is ~29 GB so does not fit in browser memory on typical machines | Not in registry |
| Local `.gguf` upload | Yes | No |
| WebGPU acceleration | Experimental | Mature |
| CPU SIMD fallback | Yes (default) | No |
| Tokens/sec on a 7B Q4 (desktop) | ~3-8 t/s | ~15-30 t/s |
| Bundle / runtime size | ~3 MB WASM | ~600 KB JS + per-model WASM |
| OpenAI-style chat API | Yes | Yes |
| Streaming output | Yes | Yes |
| Browser cache for weights | Yes (IndexedDB) | Yes |
| License | MIT | Apache 2.0 |
The transformer math runs through wllama (an MIT-licensed WASM port of llama.cpp). Amni-LLM owns the loader, registry, browser UI, HuggingFace search and install, arbitrary URL/file loading, and the WebLLM-compatible API surface on top. We list the dependency by name in the README and on the demo page; the value-add is the integration and developer experience, not a from-scratch inference engine.
Performance honesty: per-token throughput is lower than MLC's WebGPU stack because llama.cpp WASM hasn't shipped mature WebGPU support yet. For interactive chat with small models, the difference is barely noticeable. For high-throughput production workloads where MLC supports the model you need, MLC remains faster.
Memory note: in-browser models are bounded by the browser's allocator. Mobile browsers cap out around 1 GB; desktop typically allows 2-4 GB without cross-origin isolation, more with it. Amni-LLM ships a service worker that injects COOP/COEP headers on static hosts (GitHub Pages, etc.), so SharedArrayBuffer becomes available and multi-threaded WASM with a larger memory ceiling works.
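The header trick is the familiar service-worker injection pattern; a condensed sketch is below, though the bundled worker may differ in detail:

```js
// sw.js — add the cross-origin-isolation headers that static hosts
// like GitHub Pages cannot set themselves.
self.addEventListener('install', () => self.skipWaiting());
self.addEventListener('activate', (event) => event.waitUntil(self.clients.claim()));

self.addEventListener('fetch', (event) => {
  event.respondWith(
    fetch(event.request).then((response) => {
      const headers = new Headers(response.headers);
      headers.set('Cross-Origin-Embedder-Policy', 'require-corp');
      headers.set('Cross-Origin-Opener-Policy', 'same-origin');
      return new Response(response.body, {
        status: response.status,
        statusText: response.statusText,
        headers,
      });
    })
  );
});
```

In practice the worker also has to be registered and the page reloaded once it takes control before the headers apply.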
No install, no sign-up. Pick a model and start chatting.