Learn/Local AI & Privacy/Ollama: Running Models Locally
Local AI & Privacy

Ollama: Running Models Locally

Ollama is the tool that made local AI accessible to people who aren't machine learning engineers. It wraps the complexity of running large language models into a simple command-line interface and a cl

Ollama: Running Models Locally

Ollama is the tool that made local AI accessible to people who aren't machine learning engineers. It wraps the complexity of running large language models into a simple command-line interface and a clean local API.

What Ollama Is

Ollama is an open-source runtime for large language models. It handles model downloading, memory management, GPU acceleration, and serving — all automatically. Running a state-of-the-art open model locally can be as simple as:

bash ollama run llama4

That single command downloads the model if needed and opens an interactive chat session in your terminal.

Installation

  • macOS/Linux: curl -fsSL https://ollama.com/install.sh | sh
  • Windows: Download the installer from ollama.com

Ollama installs as a background service. It uses your GPU when available (Apple Silicon, NVIDIA, AMD) and falls back to CPU otherwise.

The Model Library

Notable models available in 2026: - Llama 4 (Meta) — strong general-purpose, multiple size variants - Mistral / Mistral Nemo — efficient, fast, European-built - Gemma 3 (Google) — compact and surprisingly capable - Phi-4 (Microsoft) — punches above its weight at small sizes - DeepSeek-R1 — strong reasoning - Qwen 2.5 (Alibaba) — excellent multilingual support

Pull any model: ollama pull modelname. List installed: ollama list.

The REST API

Ollama exposes an OpenAI-compatible REST API at http://localhost:11434. Code written for the OpenAI API works with Ollama by changing one line:

python client = openai.OpenAI( base_url="http://localhost:11434/v1", api_key="ollama" # required but unused )

Memory Requirements

| Model Size | RAM/VRAM Needed | |---|---| | 3B–7B | ~6–8 GB | | 13B | ~16 GB | | 30B–34B | ~24 GB | | 70B | 40 GB+ |

Quantized models (Q4_K_M format) cut memory requirements significantly while preserving most quality.

Open WebUI

For a ChatGPT-like browser interface connected to Ollama, install Open WebUI via Docker:

bash docker run -d -p 3000:80 --add-host=host.docker.internal:host-gateway \ -v open-webui:/app/backend/data \ --name open-webui ghcr.io/open-webui/open-webui:main

Open http://localhost:3000 — conversation history, model switching, system prompt configuration, all running entirely on your machine.

Have a follow-up question about this topic?

Ask AI