# PhotoPrism — Ollama Engine Integration

*Last Updated: November 14, 2025*
## Overview

This package provides PhotoPrism's native adapter for Ollama-compatible multimodal models. It lets Caption, Labels, and future Generate workflows call locally hosted models without changing worker logic, reusing the shared API client (`internal/ai/vision/api_client.go`) and result types (`LabelResult`, `CaptionResult`). Requests stay inside your infrastructure, rely on base64 thumbnails, and honor the same ACL, timeout, and logging hooks as the default TensorFlow engines.
## Context & Constraints

- Engine defaults live in `internal/ai/vision/ollama` and are applied whenever a model sets `Engine: ollama`. Aliases map to `ApiFormatOllama`, `scheme.Base64`, and a default 720 px thumbnail.
- Responses may arrive as newline-delimited JSON chunks. `decodeOllamaResponse` keeps the most recent chunk, while `parseOllamaLabels` replays plain JSON strings found in `response` (see the decoding sketch after this list).
- Structured JSON is optional for captions but enforced for labels when `Format: json` is set (the default for label models targeting the Ollama engine).
- The adapter never overwrites TensorFlow defaults. If an Ollama call fails, downstream code still has the Nasnet, NSFW, and Face models available.
- Workers assume a single-image payload per request. Run `photoprism vision run` to validate multi-image prompts before changing that invariant.
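The chunked-response handling can be pictured with a short sketch. This is illustrative only, assuming a minimal `chunk` struct; the actual `decodeOllamaResponse` implementation lives in the adapter and may differ in detail:

```go
package ollamasketch

import (
	"bufio"
	"encoding/json"
	"io"
	"strings"
)

// chunk holds the streamed response fields this sketch cares about.
type chunk struct {
	Response string `json:"response"`
	Done     bool   `json:"done"`
}

// decodeLastChunk scans newline-delimited JSON and keeps the most recent
// chunk that parses cleanly, mirroring the "keep the latest chunk"
// behavior described above.
func decodeLastChunk(r io.Reader) (chunk, error) {
	var last chunk
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue
		}
		var c chunk
		if err := json.Unmarshal([]byte(line), &c); err != nil {
			continue // skip partial or malformed lines
		}
		last = c
	}
	return last, sc.Err()
}
```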
## Goals

- Let operators opt into local, private LLMs for captions and labels via `vision.yml`.
- Provide safe defaults (prompts, schema, sampling) so most deployments only need to specify `Name`, `Engine`, and `Service.Uri`.
- Surface reproducible logs, metrics, and CLI commands that make it easy to compare Ollama output against the TensorFlow/OpenAI engines.
## Non-Goals

- Managing Ollama itself (model downloads, GPU scheduling, or authentication). Use the Compose profiles provided in the repository.
- Adding new HTTP endpoints or bypassing the existing `photoprism vision` CLI.
- Replacing TensorFlow workers; Ollama engines are additive and opt-in.
## Architecture & Request Flow

- **Model Selection** — `Config.Model(ModelType)` returns the top-most enabled entry. When `Engine: ollama`, `ApplyEngineDefaults()` fills in the request/response format, the base64 file scheme, and a 720 px resolution unless overridden.
- **Request Build** — `ollamaBuilder.Build` wraps thumbnails with `NewApiRequestOllama`, which encodes them as base64 strings. `Model.Model()` resolves the exact Ollama tag (`gemma3:4b`, `qwen2.5vl:7b`, etc.).
- **Transport** — `PerformApiRequest` issues a single HTTP POST (default timeout 10 min). Authentication is optional; provide `Service.Key` if you proxy through an API gateway (see the request sketch after this list).
- **Parsing** — `ollamaParser.Parse` converts payloads into `ApiResponse`. It normalizes confidences (`LabelConfidenceDefault = 0.5` when missing), copies NSFW scores, and canonicalizes label names via `normalizeLabelResult`.
- **Persistence** — `entity.SrcOllama` is stamped on labels and captions so UI badges and audits reflect the new source.
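For orientation, here is a minimal sketch of the request build and transport steps against the public Ollama `/api/generate` endpoint. The struct and function names are illustrative assumptions; the adapter's real builder is `NewApiRequestOllama` in `api_ollama.go`:

```go
package ollamasketch

import (
	"bytes"
	"encoding/base64"
	"encoding/json"
	"net/http"
	"os"
)

// generateRequest mirrors the Ollama /api/generate payload shape.
type generateRequest struct {
	Model   string         `json:"model"`            // exact tag, e.g. "qwen2.5vl:7b"
	Prompt  string         `json:"prompt"`
	Images  []string       `json:"images"`           // base64-encoded thumbnails
	Format  string         `json:"format,omitempty"` // "json" requests structured output
	Stream  bool           `json:"stream"`
	Options map[string]any `json:"options,omitempty"`
}

// labelImage posts a single base64 thumbnail, matching the one-image-per-request
// invariant noted earlier. Error handling is kept minimal for brevity.
func labelImage(uri, tag, thumbPath string) (*http.Response, error) {
	thumb, err := os.ReadFile(thumbPath)
	if err != nil {
		return nil, err
	}
	body, err := json.Marshal(generateRequest{
		Model:   tag,
		Prompt:  "Return labels for this photo as JSON.",
		Images:  []string{base64.StdEncoding.EncodeToString(thumb)},
		Format:  "json",
		Stream:  false,
		Options: map[string]any{"temperature": 0.1, "top_p": 0.9},
	})
	if err != nil {
		return nil, err
	}
	return http.Post(uri, "application/json", bytes.NewReader(body))
}
```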
## Prompt, Schema, & Options Guidance

- **System Prompts**
  - Labels: `LabelSystem` enforces single-word nouns. Set `System` to override; assign `LabelSystemSimple` when you need descriptive phrases.
  - Captions: no system prompt by default; rely on the user prompt or set one explicitly for stylistic needs.
- **User Prompts**
  - Captions use `CaptionPrompt`, which requests one sentence in active voice.
  - Labels default to `LabelPromptDefault`; when `DetectNSFWLabels` is true, the adapter swaps in `LabelPromptNSFW`.
  - For stricter noun enforcement, set `Prompt` to `LabelPromptStrict`.
- **Schemas**
  - Labels rely on `schema.LabelsJson(nsfw)` (a simple JSON template). Setting `Format: json` auto-attaches a reminder (`model.SchemaInstructions()`).
  - Override via `Schema` (inline YAML) or `SchemaFile`. `PHOTOPRISM_VISION_LABEL_SCHEMA_FILE` always wins if present.
- **Options**
  - Labels: the default `Temperature` equals `DefaultTemperature` (0.1 unless configured), `TopP=0.9`, `Stop=["\n\n"]`.
  - Captions: only `Temperature` is set; other parameters inherit global defaults.
  - Custom `Options` merge with engine defaults. Leave `ForceJson=true` for labels so PhotoPrism can reject malformed payloads early (see the parsing sketch after this list).
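To make the parsing rules concrete, the sketch below applies confidence defaulting to a schema-compliant label payload. The struct and constant names here are assumptions for this example; the adapter's actual types are `LabelResult` and `LabelConfidenceDefault`:

```go
package ollamasketch

import "encoding/json"

// labelConfidenceDefault is applied when the model omits a confidence value,
// matching the 0.5 fallback described under "Parsing" above.
const labelConfidenceDefault = 0.5

// label is an illustrative stand-in for the adapter's LabelResult type.
type label struct {
	Name       string   `json:"name"`
	Confidence *float64 `json:"confidence,omitempty"`
	NSFW       float64  `json:"nsfw,omitempty"` // assumed numeric score field
}

// parseLabels decodes a JSON array of labels and fills in the default
// confidence where the model left it out.
func parseLabels(raw []byte) ([]label, error) {
	var labels []label
	if err := json.Unmarshal(raw, &labels); err != nil {
		return nil, err
	}
	for i := range labels {
		if labels[i].Confidence == nil {
			c := labelConfidenceDefault
			labels[i].Confidence = &c
		}
	}
	return labels, nil
}
```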
## Supported Ollama Vision Models

| Model (Ollama Tag) | Size & Footprint | Strengths | JSON & Language Notes | When To Use |
|---|---|---|---|---|
| `gemma3:4b` / `12b` / `27b` | 4B/12B/27B parameters, ~3.3 GB → 17 GB downloads, 128 K context | Multimodal text+image reasoning with SigLIP encoder, handles OCR/long documents, supports tool/function calling | Emits structured JSON reliably; >140 languages with strong default English output | High-quality captions + multilingual labels when you have ≥12 GB VRAM (4B works on 8 GB with Q4_K_M) |
| `qwen2.5vl:7b` | 8.29 B params (Q4_K_M), ≈6 GB download, 125 K context | Excellent charts, GUI grounding, DocVQA, multi-image reasoning, agentic tool use | JSON mode tuned for schema compliance; supports 20+ languages with strong Chinese/English parity | Label extraction for mixed-language archives or UI/diagram analysis |
| `qwen3-vl:2b` / `4b` / `8b` | Dense 2B/4B/8B tiers (~3 GB, ~3.5 GB, ~6 GB downloads) with native 256 K context extendable to 1 M; fits single 12–24 GB GPUs or high-end CPUs (2B) | Spatial + video reasoning upgrades (Interleaved-MRoPE, DeepStack), 32-language OCR, GUI/agent control, long-document ingest | Emits JSON reliably when prompts specify a schema; multilingual captions/labels, with Thinking variants boosting STEM reasoning | General-purpose captions/labels when you need long-context doc/video support without cloud APIs; 2B for CPU/edge, 4B as the balanced default, 8B when accuracy outweighs latency |
| `llama3.2-vision:11b` | 11 B params, ~7.8 GB download, requires ≥8 GB VRAM; the 90 B variant needs ≥64 GB | Strong general reasoning, captioning, and OCR; supported by Meta ecosystem tooling | Vision tasks officially supported in English; text-only tasks cover eight major languages | Keeping captions consistent with Meta-compatible prompts, or when teams already standardize on Llama 3.x |
| `minicpm-v:8b-2.6` | 8 B params, ~5.5 GB download, 32 K context | Optimized for edge GPUs, high OCR accuracy, multi-image/video support, low token count (≈640 tokens for 1.8 MP) | Multilingual (EN/ZH/DE/FR/IT/KR); emits concise JSON but may need stricter stop sequences | Memory-constrained deployments that still require NSFW/OCR-aware label output |
**Tip:** Pull models inside the dev container with `docker compose --profile ollama up -d` and then `docker compose exec ollama ollama pull gemma3:4b`. Keep the profile stopped when you do not need the extra GPU/CPU load.
Qwen3-VL models stream their JSON payload via the `thinking` field. PhotoPrism v2025.11+ captures this automatically; if you run older builds, upgrade before enabling these models, or responses will appear empty.
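As a hedged illustration of that fallback, a decoder might prefer `response` and only fall back to `thinking` when it is empty; the real adapter's handling may differ:

```go
package ollamasketch

// qwenChunk models the two fields involved in the Qwen3-VL quirk above.
type qwenChunk struct {
	Response string `json:"response"`
	Thinking string `json:"thinking"`
}

// payload prefers the regular response text and falls back to the thinking
// field when the model streams its JSON there instead.
func payload(c qwenChunk) string {
	if c.Response != "" {
		return c.Response
	}
	return c.Thinking
}
```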
## Configuration

### Environment Variables

- `PHOTOPRISM_VISION_LABEL_SCHEMA_FILE` — Absolute path to a JSON snippet that overrides the default label schema (applies to every Ollama label model).
- `PHOTOPRISM_VISION_YAML` — Custom `vision.yml` path. Keep it synced in Git if you automate deployments.
- `OLLAMA_HOST`, `OLLAMA_MODELS`, `OLLAMA_MAX_QUEUE`, `OLLAMA_NUM_PARALLEL`, etc. — Provided in `compose*.yaml` to tune the Ollama daemon. Adjust `OLLAMA_KEEP_ALIVE` if you want models to stay loaded between worker batches.
- `PHOTOPRISM_LOG_LEVEL=trace` — Enables verbose request/response previews (truncated to avoid leaking images). Use temporarily when debugging parsing issues.
### `vision.yml` Example

```yaml
Models:
  - Type: labels
    Name: qwen2.5vl:7b
    Engine: ollama
    Run: newly-indexed
    Resolution: 720
    Format: json
    Options:
      Temperature: 0.05
      Stop: ["\n\n"]
      ForceJson: true
    Service:
      Uri: http://ollama:11434/api/generate
      RequestFormat: ollama
      ResponseFormat: ollama
      FileScheme: base64
  - Type: caption
    Name: gemma3:4b
    Engine: ollama
    Disabled: false
    Options:
      Temperature: 0.2
    Service:
      Uri: http://ollama:11434/api/generate
```
Guidelines:

- Place new entries after the default TensorFlow models so they take precedence while Nasnet/NSFW remain as fallbacks.
- Always specify the exact Ollama tag (`model:version`) so upgrades are deliberate.
- Keep option flags before positional arguments in CLI snippets (`photoprism vision run -m labels --count 1`).
- If you proxy requests (e.g., through Traefik), set `Service.Key` to `Bearer <token>` and configure the proxy to inject/validate it (see the header sketch after this list).
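The proxy guideline can be illustrated with a small helper that forwards the configured `Service.Key` as an `Authorization` header; the function name is hypothetical:

```go
package ollamasketch

import (
	"io"
	"net/http"
)

// authorizedPost sends a JSON POST and, when a key is configured, forwards
// it verbatim as the Authorization header (e.g. "Bearer <token>") so the
// proxy can validate it before passing the request to Ollama.
func authorizedPost(uri, key string, body io.Reader) (*http.Response, error) {
	req, err := http.NewRequest(http.MethodPost, uri, body)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	if key != "" {
		req.Header.Set("Authorization", key)
	}
	return http.DefaultClient.Do(req)
}
```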
## Operational Checklist

- **Scheduling** — Use `Run: newly-indexed` for incremental runs, `Run: manual` for ad-hoc CLI calls, or `Run: on-schedule` when paired with the scheduler. Leave `Run: auto` if you want the worker to decide based on other model states.
- **Timeouts & Retries** — The default timeout is 10 minutes (`ServiceTimeout`). Ollama streaming responses usually complete faster in practice; if you need stricter SLAs, wrap `photoprism vision run` in a job runner and retry failed batches manually (see the sketch after this list).
- **Fallbacks** — Keep Nasnet configured even when Ollama labels are primary. `labels.go` stops at the first successful engine, so duplicates are avoided.
- **Security** — When exposing Ollama beyond localhost, terminate TLS at Traefik and enable API keys. Never return full JSON payloads in logs; rely on trace mode only for debugging and sanitize before sharing.
- **Model Storage** — Bind-mount `./storage/services/ollama:/root/.ollama` (see Compose) so pulled models survive container restarts. Run `docker compose exec ollama ollama list` during deployments to verify availability.
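For the retry recommendation above, here is a sketch of an operator-side job runner that shells out to the CLI. The flags come from this document; the batch size and linear backoff are assumptions:

```go
package main

import (
	"log"
	"os/exec"
	"time"
)

// runBatch invokes the vision worker for one labels batch via the CLI,
// keeping option flags before positional arguments as recommended above.
func runBatch() error {
	cmd := exec.Command("photoprism", "vision", "run", "-m", "labels", "--count", "100")
	cmd.Stdout = log.Writer()
	cmd.Stderr = log.Writer()
	return cmd.Run()
}

func main() {
	const attempts = 3
	for i := 1; i <= attempts; i++ {
		if err := runBatch(); err == nil {
			return
		} else {
			log.Printf("batch attempt %d/%d failed: %v", i, attempts, err)
		}
		time.Sleep(time.Duration(i) * time.Minute) // simple linear backoff
	}
	log.Fatal("all retries exhausted; inspect worker logs")
}
```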
## Observability & Testing

- **CLI Smoke Tests**
  - Captions: `photoprism vision run -m caption --count 5 --force`.
  - Labels: `photoprism vision run -m labels --count 5 --force`.
  - After each run, check `photoprism vision ls` for `source=ollama`.
- **Unit Tests**
  - `go test ./internal/ai/vision/ollama ./internal/ai/vision -run Ollama -count=1` covers transport parsing and model defaults.
  - Add fixtures under `internal/ai/vision/testdata` when capturing new response shapes; keep files small and anonymized (see the test sketch after this list).
- **Logging**
  - Set `PHOTOPRISM_LOG_LEVEL=debug` to watch summary lines ("processed labels/caption via ollama").
  - Use `log.Trace` sparingly; it prints truncated JSON blobs for troubleshooting.
- **Metrics**
  - `/api/v1/metrics` exposes counts per label source; scrape after a batch to compare throughput with TensorFlow/OpenAI runs.
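A possible shape for new fixture-driven tests, with hypothetical fixture names and the parser call left as a placeholder:

```go
package ollama_test

import (
	"os"
	"path/filepath"
	"testing"
)

// TestParseFixtures walks a small set of anonymized response fixtures and
// fails fast on unreadable or empty files; real assertions on the parsed
// labels would go where the trailing comment indicates.
func TestParseFixtures(t *testing.T) {
	fixtures := []string{"labels_basic.json", "labels_nsfw.json"} // hypothetical names
	for _, name := range fixtures {
		t.Run(name, func(t *testing.T) {
			raw, err := os.ReadFile(filepath.Join("testdata", name))
			if err != nil {
				t.Fatalf("read fixture: %v", err)
			}
			if len(raw) == 0 {
				t.Fatal("fixture is empty")
			}
			// Feed raw into the engine parser under test and assert on
			// normalized label names and confidences here.
		})
	}
}
```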
## Code Map

- `internal/ai/vision/ollama/*.go` — Engine defaults, schema helpers, transport structs.
- `internal/ai/vision/engine_ollama.go` — Builder/parser glue plus label/caption normalization.
- `internal/ai/vision/api_ollama.go` — Base64 payload builder.
- `internal/ai/vision/api_client.go` — Streaming decoder shared among engines.
- `internal/ai/vision/models.go` — Default caption model definition (gemma3).
- `compose*.yaml` — Ollama service profile, Traefik labels, and persistent volume wiring.
- `frontend/src/common/util.js` — Maps `src="ollama"` to the correct badge; keep it updated when adding new source strings.
## Next Steps
- Add formal schema validation (JSON Schema or JTD) so malformed label responses fail fast before normalization.
- Support multiple thumbnails per request once core workflows confirm the API contract (requires worker + UI changes).
- Emit per-model latency and success metrics from the vision worker to simplify tuning when several Ollama engines run side-by-side.
- Mirror any loader changes into PhotoPrism Plus/Pro templates to keep splash + browser checks consistent after enabling external engines.