Local Models
Rune can run language models on your own machine, with no API key and no data leaving your computer. A GGUF inference engine, powered by the excellent llama.cpp project, is built right into Rune, so there is nothing extra to install: download a model, point Rune at it, and your GPU (where supported) or CPU does the rest.
This guide covers what local models can do, how to download and run them, and how to tune them for your hardware.
Why run a model locally
- Privacy: prompts and code never leave your machine.
- No cost or rate limits: once a model is downloaded, you can use it as much as you like, offline.
- Your GPU does the work: on macOS, Rune offloads inference to your GPU automatically (Apple Silicon and Intel via Metal). On Linux, inference currently runs on the CPU; NVIDIA CUDA acceleration is coming soon. See GPU acceleration for the details.
Local models are first-class in Rune: they show up in the model list alongside hosted providers, and they support streaming, tool calling, and (for models that ship a vision projector) image inputs.
What you need
- A model in GGUF format. GGUF is the standard single-file format for quantized local models, and it is what nearly every community model on the Hugging Face Hub publishes.
- Enough RAM (or VRAM) to hold the model. A rough rule of thumb: a model quantized to Q4_K_M needs a little more memory than its file size on disk. A 7B model at Q4_K_M is around 4.5 GB; a 3B model is closer to 2 GB.
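The rule of thumb above can be turned into a quick check before downloading. This is a minimal sketch, not part of Rune: the function names and the 1.2 overhead multiplier are assumptions chosen for illustration, standing in for "a little more memory than its file size".

```python
def estimated_memory_gb(file_size_gb: float, overhead: float = 1.2) -> float:
    """Rough memory needed to load a GGUF file of the given size.

    `overhead` is an assumed multiplier (not a Rune setting) that stands in
    for the extra memory needed beyond the weights themselves.
    """
    if file_size_gb <= 0:
        raise ValueError("file_size_gb must be positive")
    return file_size_gb * overhead


def fits_in_memory(file_size_gb: float, available_gb: float) -> bool:
    """True if the estimated footprint fits in the memory you have free."""
    return estimated_memory_gb(file_size_gb) <= available_gb
```

For example, `fits_in_memory(4.5, 8.0)` checks whether a 4.5 GB file is likely to fit in 8 GB of free memory under this assumed overhead.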
You do not need to install any separate runtime, server, or toolchain; the inference engine is embedded directly in Rune.
Downloading a model
Use the models local command from the Rune shell. To pull
a model, give it a reference in the familiar owner/repo:tag form:
models local download unsloth/gemma-3n-E2B-it-GGUF:Q4_K_M
A progress bar tracks the download. When it finishes, Rune prints where the files landed and reminds you to list them.
A few things worth knowing about references:
- The part after : is the quantization tag (here Q4_K_M). Lower numbers mean smaller files and lower memory use at some cost to quality. Q4_K_M is a good default; try Q5_K_M or Q6_K if you have the memory to spare, or a Q3/Q2 variant if you are tight.
- When you omit the host, Rune assumes Hugging Face (huggingface.co). These are all equivalent:
  models local download unsloth/gemma-3n-E2B-it-GGUF:Q4_K_M
  models local download hf.co/unsloth/gemma-3n-E2B-it-GGUF:Q4_K_M
  models local download huggingface.co/unsloth/gemma-3n-E2B-it-GGUF:Q4_K_M
- You can pin an exact build by digest instead of a tag:
  models local download huggingface.co/owner/repo@sha256:abc123…
- Any OCI v2 registry works, not just Hugging Face, as long as the repository is packaged as a GGUF model (the same layout Ollama uses). Point the reference at the registry's host:
  models local download my-registry.example.com/team/qwen2.5-coder-GGUF:Q4_K_M
  A generic container image (an app image, a base OS image, and so on) is not a model; even though it lives in an OCI registry, it has no GGUF weights, so Rune cannot run it.
Run the help models local download command to see this reference from
inside Rune.
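To make the reference rules concrete, here is a small sketch of how such references could be normalized to their full form. It is an illustration only, not Rune's parser: `ModelRef`, `parse_ref`, and the rule "the first segment is a host if it contains a dot or a colon" are assumptions borrowed from common container-registry conventions.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_HOST = "huggingface.co"
HOST_ALIASES = {"hf.co": "huggingface.co"}  # short form mentioned in this guide


@dataclass(frozen=True)
class ModelRef:
    host: str
    repo: str               # e.g. "unsloth/gemma-3n-E2B-it-GGUF"
    tag: Optional[str]      # quantization tag, e.g. "Q4_K_M"
    digest: Optional[str]   # e.g. "sha256:..." when pinned by digest

    def __str__(self) -> str:
        if self.digest:
            return f"{self.host}/{self.repo}@{self.digest}"
        if self.tag:
            return f"{self.host}/{self.repo}:{self.tag}"
        return f"{self.host}/{self.repo}"


def parse_ref(ref: str) -> ModelRef:
    """Split a reference into host, repo, and tag or digest (hypothetical helper)."""
    tag = digest = None
    if "@" in ref:
        ref, digest = ref.split("@", 1)
    else:
        # Look for the tag only in the last path segment, so a "host:port"
        # prefix is never mistaken for a tag.
        head, sep, last = ref.rpartition("/")
        if ":" in last:
            last, tag = last.split(":", 1)
            ref = f"{head}{sep}{last}"
    parts = ref.split("/")
    # Assumption: a first segment containing "." or ":" names a registry host.
    if len(parts) > 1 and ("." in parts[0] or ":" in parts[0]):
        host, parts = parts[0], parts[1:]
    else:
        host = DEFAULT_HOST
    host = HOST_ALIASES.get(host, host)
    repo = "/".join(parts)
    if not repo:
        raise ValueError("reference has no repository part")
    return ModelRef(host, repo, tag, digest)
```

Under these assumptions, the three equivalent Hugging Face spellings above all normalize to the same string, which matches the full-reference form this guide uses as the model's name.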
Whatever the registry, the repository must contain GGUF weights; that is the one universal requirement.
On Hugging Face specifically, those weights are served through an OCI
endpoint that is only opened for GGUF repositories. If a Hugging Face pull
is rejected as not GGUF-compatible, look for a sibling repo whose name ends
in -GGUF; those are the community-converted, ready-to-run builds.
Listing and deleting models
See everything you have downloaded:
models local list
Each entry is shown by its full reference (for example
huggingface.co/unsloth/gemma-3n-E2B-it-GGUF:Q4_K_M). That reference is
also the model's name everywhere else in Rune.
Remove a model and reclaim disk space:
models local delete huggingface.co/unsloth/gemma-3n-E2B-it-GGUF:Q4_K_M
Models are stored in an Ollama-compatible cache under
$RUNE_DATADIR/models by default. You can change that location with the
models.local.models_cache_dir setting (see Configuration).
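The lookup order described above (an explicit setting wins, otherwise the data directory) can be sketched as follows. This is a hypothetical helper for scripts that need to find the cache, not Rune's own code, and it assumes `RUNE_DATADIR` is readable as an environment variable.

```python
import os
from pathlib import Path


def resolve_models_cache_dir(models_cache_dir: str = "", env=None) -> Path:
    """Return the directory where downloaded models would be stored.

    `models_cache_dir` mirrors the models.local.models_cache_dir setting;
    an empty string means "use the default under RUNE_DATADIR".
    """
    env = os.environ if env is None else env
    if models_cache_dir:
        return Path(models_cache_dir).expanduser()
    data_dir = env.get("RUNE_DATADIR")  # assumption: exposed as an env var
    if not data_dir:
        raise RuntimeError("RUNE_DATADIR is not set and no cache dir was given")
    return Path(data_dir) / "models"
```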
Using a local model
A downloaded model behaves like any other model in Rune. Its name is the
full reference you see in models local list. To make it the default
model for new chats, set models.default in your config:
config.yaml:

models:
  default: "huggingface.co/unsloth/gemma-3n-E2B-it-GGUF:Q4_K_M"

config.star:

"models": {
    "default": "huggingface.co/unsloth/gemma-3n-E2B-it-GGUF:Q4_K_M",
},
To confirm the model is available and see its context window, run the
models command from the Rune shell:
models local list
Local models appear in that list next to your hosted providers, so you can mix and match: a local model for quick, private edits and a hosted one for heavier work.
Configuration
All local-model settings live under models.local. The defaults are
chosen to "just work" on most machines; you usually only need to touch
these when tuning for your hardware or a specific model.
config.yaml:

models:
  default: "huggingface.co/unsloth/gemma-3n-E2B-it-GGUF:Q4_K_M"
  local:
    # Where downloaded models are stored. Empty uses $RUNE_DATADIR/models.
    models_cache_dir: ""
    # Layers to offload to the GPU. -1 offloads as many as the GPU fits.
    # Set to 0 to run purely on the CPU.
    n_gpu_layers: -1
    # Generation threads. 0 lets the engine pick a good value.
    threads: 0
    # Enable flash-attention when your GPU supports it (saves memory).
    flash_attention: false
    # Cap tokens generated per reply. 0 means "until the model stops".
    max_output_tokens: 0

config.star:

"models": {
    "default": "huggingface.co/unsloth/gemma-3n-E2B-it-GGUF:Q4_K_M",
    "local": {
        "models_cache_dir": "",
        "n_gpu_layers": -1,
        "threads": 0,
        "flash_attention": False,
        "max_output_tokens": 0,
    },
},
The most useful knobs:
- n_gpu_layers: controls GPU offloading. The default -1 offloads the whole model to the GPU, which is what you want for speed. If a model is too large for your VRAM, lower this number to offload only some layers (the rest run on the CPU), or set it to 0 to run entirely on the CPU.
- flash_attention: turn this on if your GPU supports it to reduce memory use, especially at large context sizes.
- max_output_tokens: a hard cap on reply length. Leave it at 0 unless you want to keep responses short.
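If you need a starting value for n_gpu_layers on a model that does not fit in VRAM, a back-of-the-envelope estimate can help. The sketch below is a rough heuristic, not something Rune computes: it assumes every layer is the same size, that you supply the layer count yourself (for example from the model card), and that 1 GB of VRAM should be held back for everything else.

```python
def suggest_n_gpu_layers(model_gb: float, n_layers: int,
                         free_vram_gb: float, reserve_gb: float = 1.0) -> int:
    """Suggest a value for n_gpu_layers (hypothetical helper).

    Returns -1 when the whole model should fit, otherwise the number of
    equally sized layers that fit in the usable VRAM (0 means CPU only).
    """
    if model_gb <= 0 or n_layers <= 0:
        raise ValueError("model_gb and n_layers must be positive")
    usable_gb = max(free_vram_gb - reserve_gb, 0.0)
    if usable_gb >= model_gb:
        return -1  # everything fits: keep the default
    per_layer_gb = model_gb / n_layers  # assumption: uniform layer size
    return int(usable_gb // per_layer_gb)
```

Treat the result as a first guess and adjust it up or down based on how the model actually loads.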
Sampling
Sampling controls how the model picks its next token: higher
temperature is more creative, lower is more deterministic. Sensible
defaults are applied automatically; override them under
models.local.sampling only if you know you want to:
config.yaml:

models:
  local:
    sampling:
      temperature: 0.8
      top_k: 40
      top_p: 0.95
      min_p: 0.05

config.star:

"models": {
    "local": {
        "sampling": {
            "temperature": 0.8,
            "top_k": 40,
            "top_p": 0.95,
            "min_p": 0.05,
        },
    },
},
For coding and tool use, a lower temperature (for example 0.2–0.4)
usually gives steadier, more predictable output. Any sampling field you
leave out keeps its built-in default.
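To see what these four settings do, here is a self-contained sketch of the standard temperature, top-k, top-p, and min-p filters applied to a list of raw scores (logits). It illustrates the general techniques only; the function names are made up for this example, and the order in which the bundled engine applies its filters may differ.

```python
import math
import random


def filter_distribution(logits, temperature=0.8, top_k=40, top_p=0.95, min_p=0.05):
    """Return {token_index: probability} after applying the four filters."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    # Temperature: divide logits before softmax; lower values sharpen the peak.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]  # subtract max for stability
    total = sum(weights)
    ranked = sorted(((w / total, i) for i, w in enumerate(weights)), reverse=True)

    # top_k: keep only the k most likely tokens (0 disables the filter here).
    if top_k > 0:
        ranked = ranked[:top_k]

    # top_p: keep the smallest set of top tokens whose probabilities reach top_p.
    kept, cumulative = [], 0.0
    for p, i in ranked:
        kept.append((p, i))
        cumulative += p
        if cumulative >= top_p:
            break

    # min_p: drop tokens less likely than min_p times the most likely token.
    threshold = min_p * kept[0][0]
    kept = [(p, i) for p, i in kept if p >= threshold]

    norm = sum(p for p, _ in kept)  # renormalize the survivors
    return {i: p / norm for p, i in kept}


def sample(logits, rng=random, **settings):
    """Draw one token index from the filtered distribution."""
    dist = filter_distribution(logits, **settings)
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]
```

Calling `filter_distribution` on the same logits with temperature 0.2 and then 0.8 shows the effect described above: the lower temperature concentrates probability on the top token, which is why it tends to give steadier output.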
GPU acceleration
You control how much of a model runs on the GPU the same way on every platform; what differs is which GPU backend the bundled engine ships with:
- macOS (Apple Silicon and Intel): Metal acceleration is built in and used automatically. No configuration or extra install needed.
- Linux: the published Linux build runs local models on the CPU today. NVIDIA CUDA acceleration is coming soon; once it ships, a compatible NVIDIA GPU with an up-to-date driver will be used automatically, with no CUDA toolkit install required.
In the meantime, if you have a GPU-backed machine and want hardware
acceleration for local inference on Linux, run a dedicated
OpenAI-compatible inference server (for example a CUDA-enabled
llama.cpp server) and point Rune at it as a
custom model provider.
You do not configure the backend directly; you control how much of the
model lives on the GPU with n_gpu_layers. With the default -1, Rune
offloads everything that fits (on a platform with GPU support). The
startup log notes how the model was loaded if you want to confirm GPU
usage.
Capabilities and limits
- Tool calling works with instruction-tuned models whose chat template declares tools (Qwen, Gemma, DeepSeek, Hermes, and other ChatML-family models). Base, non-instruction-tuned models will ignore tool definitions; pick an -it/-Instruct variant for agent use.
- Image inputs are supported when a model ships a paired vision projector (an mmproj GGUF). Without one, images are dropped and only the text of a message is sent.
- One model, one context: a loaded model serves requests one at a time. Rune keeps a cache of the previous turn so follow-up messages that share a prefix respond faster.
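The benefit of the previous-turn cache comes from the shared prefix: only the tokens after the point where the new request diverges need fresh work. The sketch below illustrates that general idea with plain token lists; it is not Rune's cache implementation, and both function names are hypothetical.

```python
def shared_prefix_len(cached_tokens, new_tokens):
    """Count how many leading tokens the two sequences have in common."""
    n = 0
    for cached, new in zip(cached_tokens, new_tokens):
        if cached != new:
            break
        n += 1
    return n


def tokens_to_process(cached_tokens, new_tokens):
    """Return only the tokens that are not covered by the cached prefix."""
    return new_tokens[shared_prefix_len(cached_tokens, new_tokens):]
```

A follow-up message that appends to the same conversation shares a long prefix with the previous turn, so most of the prompt is already covered; editing an early message shortens the shared prefix and more of the prompt has to be processed again.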
Choosing a model
Some starting points that run well locally:
- Qwen2.5-Coder (3B / 7B): strong coding and tool use; great default for agent work on a laptop.
- Gemma instruction-tuned builds: solid general-purpose chat with tool support.
Look for the -GGUF repositories on the Hugging Face Hub, pick an
instruction-tuned (-it or -Instruct) variant for agent and chat use,
and start with the Q4_K_M tag. Move up to Q5_K_M or Q6_K if you
have the memory and want a quality bump.
Troubleshooting
- "not compatible with the llama.cpp OCI endpoint": the repository does not serve GGUF weights through the OCI endpoint. Use a community-converted -GGUF repository instead.
- Out of memory or very slow: the model is too large for your GPU. Lower n_gpu_layers, turn on flash_attention, or download a smaller quantization (a lower Q tag or a smaller parameter count).
- The model ignores tools: you are likely running a base model. Pull the instruction-tuned (-it/-Instruct) variant.
Acknowledgements
Local inference in Rune stands on the shoulders of llama.cpp and the GGML community. Our heartfelt thanks to Georgi Gerganov and the entire community of contributors whose work on high-performance kernels, the GGUF model ecosystem, samplers, and chat templating makes fast, private, on-device models possible. Rune's local model support would not exist without their engineering and generosity.