
Local Models

Rune can run language models on your own machine, with no API key and no data leaving your computer. A GGUF inference engine, powered by the excellent llama.cpp project, is built right into Rune, so there is nothing extra to install: download a model, point Rune at it, and inference runs on your own hardware.

This guide covers what local models can do, how to download and run them, and how to tune them for your hardware.

Why run a model locally

  • Privacy: prompts and code never leave your machine.
  • No cost or rate limits: once a model is downloaded, you can use it as much as you like, offline.
  • Your GPU does the work: on macOS, Rune offloads inference to your GPU automatically (Apple Silicon and Intel via Metal). On Linux, inference currently runs on the CPU; NVIDIA CUDA acceleration is coming soon. See GPU acceleration for the details.

Local models are first-class in Rune: they show up in the model list alongside hosted providers, support streaming, tool calling, and (for models that ship a vision projector) image inputs.

What you need

  • A model in GGUF format. GGUF is the standard single-file format for quantized local models, and nearly every community model on the Hugging Face Hub is published in it.
  • Enough RAM (or VRAM) to hold the model. A rough rule of thumb: a model quantized to Q4_K_M needs a little more memory than its file size on disk. A 7B model at Q4_K_M is around 4.5 GB; a 3B model is closer to 2 GB.
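
The rule of thumb above can be turned into a quick check. This is only a sketch: the `overhead` factor of 1.2 is an assumed margin for the context cache and runtime buffers, not a figure published by Rune or llama.cpp, and both helper names are hypothetical.

```python
def estimated_memory_gb(file_size_gb: float, overhead: float = 1.2) -> float:
    """Estimate the memory needed to load a GGUF file of the given size.

    `overhead` is an assumed safety margin, not a measured value.
    """
    return file_size_gb * overhead


def fits_in_memory(file_size_gb: float, available_gb: float,
                   overhead: float = 1.2) -> bool:
    """Return True if the estimated footprint fits in the available memory."""
    return estimated_memory_gb(file_size_gb, overhead) <= available_gb


# A 7B model at Q4_K_M is around 4.5 GB on disk.
print(fits_in_memory(4.5, 8.0))  # True
print(fits_in_memory(4.5, 5.0))  # False
```

If the check fails, a smaller quantization or a smaller parameter count is the usual way out.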

You do not need to install any separate runtime, server, or toolchain; the inference engine is embedded directly in Rune.

Downloading a model

Use the models local command from the Rune shell. To pull a model, give it a reference in the familiar owner/repo:tag form:

models local download unsloth/gemma-3n-E2B-it-GGUF:Q4_K_M

A progress bar tracks the download. When it finishes, Rune prints where the files landed and reminds you to list them.

A few things worth knowing about references:

  • The part after : is the quantization tag (here Q4_K_M). Lower numbers mean smaller files and lower memory use at some cost to quality. Q4_K_M is a good default; try Q5_K_M or Q6_K if you have the memory to spare, or a Q3 / Q2 variant if you are tight.

  • When you omit the host, Rune assumes Hugging Face (huggingface.co). These are all equivalent:

    models local download unsloth/gemma-3n-E2B-it-GGUF:Q4_K_M
    models local download hf.co/unsloth/gemma-3n-E2B-it-GGUF:Q4_K_M
    models local download huggingface.co/unsloth/gemma-3n-E2B-it-GGUF:Q4_K_M

  • You can pin an exact build by digest instead of a tag:

    models local download huggingface.co/owner/repo@sha256:abc123…

  • Any OCI v2 registry works, not just Hugging Face, as long as the repository is packaged as a GGUF model (the same layout Ollama uses). Point the reference at the registry's host:

    models local download my-registry.example.com/team/qwen2.5-coder-GGUF:Q4_K_M

    A generic container image (an app image, a base OS image, and so on) is not a model; even though it lives in an OCI registry, it has no GGUF weights, so Rune cannot run it.
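
To make the reference rules concrete, here is a minimal Python sketch of how such a reference could be split into host, repository, tag, and digest. It is not Rune's parser: `ModelRef` and `parse_ref` are hypothetical names, the "first component containing a dot or port is a host" heuristic is an assumption, and so is the choice to normalize `hf.co` to `huggingface.co`.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_HOST = "huggingface.co"
HOST_ALIASES = {"hf.co": "huggingface.co"}  # assumed normalization


@dataclass(frozen=True)
class ModelRef:
    host: str
    repo: str
    tag: Optional[str] = None
    digest: Optional[str] = None


def parse_ref(ref: str) -> ModelRef:
    """Split [host/]owner/repo[:tag | @digest] into its parts."""
    tag = None
    digest = None
    if "@" in ref:
        # Everything after "@" is the digest, e.g. "sha256:...".
        ref, digest = ref.split("@", 1)
    else:
        # A trailing ":tag" counts only if it follows the last "/",
        # so a port such as "localhost:5000/..." is not taken as a tag.
        head, sep, tail = ref.rpartition(":")
        if sep and "/" not in tail:
            ref, tag = head, tail
    parts = ref.split("/")
    first = parts[0]
    if len(parts) >= 3 and ("." in first or ":" in first or first == "localhost"):
        host = HOST_ALIASES.get(first, first)
        repo = "/".join(parts[1:])
    else:
        # No host given: assume Hugging Face.
        host = DEFAULT_HOST
        repo = ref
    return ModelRef(host=host, repo=repo, tag=tag, digest=digest)


short = parse_ref("unsloth/gemma-3n-E2B-it-GGUF:Q4_K_M")
full = parse_ref("huggingface.co/unsloth/gemma-3n-E2B-it-GGUF:Q4_K_M")
print(short == full)  # True
```

Under these assumptions the three equivalent spellings shown earlier all parse to the same value, and a digest reference yields a `digest` with no `tag`.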

Run the help models local download command to see this reference from inside Rune.

Tip: Whatever the registry, the repository must contain GGUF weights; that is the one universal requirement.

On Hugging Face specifically, those weights are served through an OCI endpoint that is only opened for GGUF repositories. If a Hugging Face pull is rejected as not GGUF-compatible, look for a sibling repo whose name ends in -GGUF; those are the community-converted, ready-to-run builds.

Listing and deleting models

See everything you have downloaded:

models local list

Each entry is shown by its full reference (for example huggingface.co/unsloth/gemma-3n-E2B-it-GGUF:Q4_K_M). That reference is also the model's name everywhere else in Rune.

Remove a model and reclaim disk space:

models local delete huggingface.co/unsloth/gemma-3n-E2B-it-GGUF:Q4_K_M

Models are stored in an Ollama-compatible cache under $RUNE_DATADIR/models by default. You can change that location with the models.local.models_cache_dir setting (see Configuration).
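
The lookup order described above can be sketched as a small function. This is an illustration rather than Rune's code: `resolve_cache_dir` is a hypothetical helper, and the behavior when `RUNE_DATADIR` is unset is not documented here, so the sketch simply raises `KeyError` in that case.

```python
import os
from pathlib import Path
from typing import Mapping


def resolve_cache_dir(models_cache_dir: str,
                      env: Mapping[str, str] = os.environ) -> Path:
    """Return the directory where downloaded models would be stored.

    A non-empty models.local.models_cache_dir wins; otherwise fall back
    to $RUNE_DATADIR/models. Raises KeyError if RUNE_DATADIR is missing
    (an assumption of this sketch, not documented behavior).
    """
    if models_cache_dir:
        return Path(models_cache_dir)
    return Path(env["RUNE_DATADIR"]) / "models"


print(resolve_cache_dir("", {"RUNE_DATADIR": "/data/rune"}))
print(resolve_cache_dir("/mnt/models", {}))
```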

Using a local model

A downloaded model behaves like any other model in Rune. Its name is the full reference you see in models local list. To make it the default model for new chats, set models.default in your config:

models:
  default: "huggingface.co/unsloth/gemma-3n-E2B-it-GGUF:Q4_K_M"

To confirm the model is available and see its context window, run the models command from the Rune shell:

models

Local models appear in that list next to your hosted providers, so you can mix and match: a local model for quick, private edits and a hosted one for heavier work.

Configuration

All local-model settings live under models.local. The defaults are chosen to "just work" on most machines; you usually only need to touch these when tuning for your hardware or a specific model.

models:
  default: "huggingface.co/unsloth/gemma-3n-E2B-it-GGUF:Q4_K_M"
  local:
    # Where downloaded models are stored. Empty uses $RUNE_DATADIR/models.
    models_cache_dir: ""
    # Layers to offload to the GPU. -1 offloads as many as the GPU fits.
    # Set to 0 to run purely on the CPU.
    n_gpu_layers: -1
    # Generation threads. 0 lets the engine pick a good value.
    threads: 0
    # Enable flash-attention when your GPU supports it (saves memory).
    flash_attention: false
    # Cap tokens generated per reply. 0 means "until the model stops".
    max_output_tokens: 0

The most useful knobs:

  • n_gpu_layers: controls GPU offloading. The default -1 offloads as many layers as fit on the GPU (the whole model, when it fits), which is what you want for speed. If a model is too large for your VRAM, lower this number to offload only some layers (the rest run on the CPU), or set it to 0 to run entirely on the CPU.
  • flash_attention: turn this on if your GPU supports it to reduce memory use, especially at large context sizes.
  • max_output_tokens: a hard cap on reply length. Leave it at 0 unless you want to keep responses short.
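
The three cases for n_gpu_layers can be summarized in a few lines of Python. This is a sketch of the semantics described above, not the engine's actual logic: `split_layers` is a hypothetical helper, and `layers_that_fit` stands in for whatever the engine determines about available VRAM. The layer counts in the example are made up.

```python
from typing import Tuple


def split_layers(total_layers: int, n_gpu_layers: int,
                 layers_that_fit: int) -> Tuple[int, int]:
    """Return (layers on GPU, layers on CPU) for a given setting.

    -1 (or any negative value) means "as many as fit"; 0 means CPU only;
    a positive value is capped at the model's layer count.
    """
    if n_gpu_layers < 0:
        on_gpu = min(total_layers, layers_that_fit)
    else:
        on_gpu = min(n_gpu_layers, total_layers)
    return on_gpu, total_layers - on_gpu


print(split_layers(32, -1, 32))  # (32, 0)
print(split_layers(32, -1, 20))  # (20, 12)
print(split_layers(32, 0, 32))   # (0, 32)
print(split_layers(32, 16, 32))  # (16, 16)
```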

Sampling

Sampling controls how the model picks its next token: higher temperature is more creative, lower is more deterministic. Sensible defaults are applied automatically; override them under models.local.sampling only if you know you want to:

models:
  local:
    sampling:
      temperature: 0.8
      top_k: 40
      top_p: 0.95
      min_p: 0.05

For coding and tool use, a lower temperature (for example 0.2 to 0.4) usually gives steadier, more predictable output. Any sampling field you leave out keeps its built-in default.
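
For intuition about what these four settings do, here is a toy Python sampler over a handful of logits. It is a simplified illustration, not the engine's implementation: the function names are hypothetical, the order in which the filters are applied is an assumption of this sketch, and it requires a temperature greater than zero.

```python
import math
import random
from typing import List, Sequence, Tuple


def filter_candidates(logits: Sequence[float], temperature: float = 0.8,
                      top_k: int = 40, top_p: float = 0.95,
                      min_p: float = 0.05) -> List[Tuple[int, float]]:
    """Return (token_id, probability) pairs that survive the filters."""
    # Temperature: divide logits, then softmax. Lower values sharpen
    # the distribution toward the most likely token.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    ranked = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)
    # top_k: keep only the k most likely tokens.
    ranked = ranked[:top_k]
    # top_p: keep the smallest prefix whose cumulative probability
    # reaches top_p.
    kept, cumulative = [], 0.0
    for p, i in ranked:
        kept.append((p, i))
        cumulative += p
        if cumulative >= top_p:
            break
    # min_p: drop tokens less likely than min_p times the best token.
    floor = min_p * kept[0][0]
    return [(i, p) for p, i in kept if p >= floor]


def sample_next(logits: Sequence[float], rng=random, **settings) -> int:
    """Draw one token id from the filtered candidates."""
    candidates = filter_candidates(logits, **settings)
    ids = [i for i, _ in candidates]
    weights = [p for _, p in candidates]
    return rng.choices(ids, weights=weights, k=1)[0]


# At a very low temperature the top token wins essentially every time.
print(sample_next([1.0, 5.0, 2.0], temperature=0.01))  # 1
```

`random.choices` renormalizes the surviving weights, so the filtered probabilities do not need to sum to one.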

GPU acceleration

How you control GPU offloading is the same on every platform; what differs is which GPU backend the bundled engine ships with:

  • macOS (Apple Silicon and Intel): Metal acceleration is built in and used automatically. No configuration or extra install needed.
  • Linux: the published Linux build runs local models on the CPU today. NVIDIA CUDA acceleration is coming soon; once it ships, a compatible NVIDIA GPU with an up-to-date driver will be used automatically, with no CUDA toolkit install required.

In the meantime, if you have a GPU-backed machine and want hardware acceleration for local inference on Linux, run a dedicated OpenAI-compatible inference server (for example a CUDA-enabled llama.cpp server) and point Rune at it as a custom model provider.

You do not configure the backend directly; you control how much of the model lives on the GPU with n_gpu_layers. With the default -1, Rune offloads everything that fits (on a platform with GPU support). The startup log notes how the model was loaded if you want to confirm GPU usage.

Capabilities and limits

  • Tool calling works with instruction-tuned models whose chat template declares tools (Qwen, Gemma, DeepSeek, Hermes, and other ChatML-family models). Base, non-instruction-tuned models will ignore tool definitions; pick an -it / -Instruct variant for agent use.
  • Image inputs are supported when a model ships a paired vision projector (an mmproj GGUF). Without one, images are dropped and only the text of a message is sent.
  • One model, one context: a loaded model serves requests one at a time. Rune keeps a cache of the previous turn so follow-up messages that share a prefix respond faster.
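
The benefit of the previous-turn cache comes from the shared prefix between the cached tokens and the new request. A minimal sketch of that idea, assuming prompts are compared as token-id sequences (`shared_prefix_len` is a hypothetical helper, not a Rune API):

```python
from typing import Sequence


def shared_prefix_len(cached: Sequence[int], incoming: Sequence[int]) -> int:
    """Count the leading tokens that the two sequences have in common."""
    n = 0
    for a, b in zip(cached, incoming):
        if a != b:
            break
        n += 1
    return n


previous_turn = [1, 2, 3, 4]
follow_up = [1, 2, 3, 4, 5, 6]
reused = shared_prefix_len(previous_turn, follow_up)
print(reused)                   # 4
print(len(follow_up) - reused)  # 2
```

In this picture only the tokens after the shared prefix need fresh processing, which is why a follow-up that extends the same conversation responds faster than one that rewrites an early message.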

Choosing a model

Some starting points that run well locally:

  • Qwen2.5-Coder (3B / 7B): strong coding and tool use; great default for agent work on a laptop.
  • Gemma instruction-tuned builds: solid general-purpose chat with tool support.

Look for the -GGUF repositories on the Hugging Face Hub, pick an instruction-tuned (-it or -Instruct) variant for agent and chat use, and start with the Q4_K_M tag. Move up to Q5_K_M or Q6_K if you have the memory and want a quality bump.
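
The "start at Q4_K_M, move up if you have the memory" advice can be written as a small selection routine. This is a sketch under stated assumptions: `pick_quant` is a hypothetical helper, the 1.2 `overhead` margin is assumed, and apart from the 4.5 GB figure for a 7B model at Q4_K_M, the file sizes in the example are made-up placeholders; read the real sizes from the repository you are pulling from.

```python
from typing import Mapping, Optional

# Highest quality first, per the tags suggested above.
PREFERENCE = ["Q6_K", "Q5_K_M", "Q4_K_M"]


def pick_quant(available_gb: float, sizes_gb: Mapping[str, float],
               overhead: float = 1.2) -> Optional[str]:
    """Return the highest-quality tag whose estimated footprint fits."""
    for tag in PREFERENCE:
        size = sizes_gb.get(tag)
        if size is not None and size * overhead <= available_gb:
            return tag
    return None  # nothing fits: try a smaller model or a Q3 / Q2 variant


# Placeholder sizes for illustration only (Q4_K_M from the text above).
sizes = {"Q4_K_M": 4.5, "Q5_K_M": 5.3, "Q6_K": 6.2}
print(pick_quant(8.0, sizes))  # Q6_K
print(pick_quant(6.0, sizes))  # Q4_K_M
print(pick_quant(4.0, sizes))  # None
```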

Troubleshooting

  • "not compatible with the llama.cpp OCI endpoint": the repository does not serve GGUF weights through the OCI endpoint. Use a community-converted -GGUF repository instead.
  • Out of memory or very slow: the model is likely too large for your available memory (VRAM, when offloading to the GPU). Lower n_gpu_layers, turn on flash_attention, or download a smaller quantization (a lower Q tag or a smaller parameter count).
  • The model ignores tools: you are likely running a base model. Pull the instruction-tuned (-it / -Instruct) variant.

Acknowledgements

Local inference in Rune stands on the shoulders of llama.cpp and the GGML community. Our heartfelt thanks to Georgi Gerganov and the entire community of contributors whose work on high-performance kernels, the GGUF model ecosystem, samplers, and chat templating makes fast, private, on-device models possible. Rune's local model support would not exist without their engineering and generosity.
