
VideoMemory Technical Report

Author
Mark Ogata
AI and Robotics Undergraduate Researcher

Created: 2026-05-15
Purpose: preserve a detailed account of the VideoMemory project so it can later be incorporated into a master’s thesis.

This document describes VideoMemory as it exists today: the problem it is solving, the system architecture, the implemented components, the operational workflows, the known engineering evidence, and the current limitations. It is not written as a final thesis chapter. It is written as a report that should make the future thesis easier to write because the concrete project record is already captured.

LaTeX source for this report is stored next to this page.

Executive Summary
#

VideoMemory is an agent-agnostic video ingestion and monitoring layer. Its core purpose is to let any capable agent consume large amounts of video from multiple streams without each agent needing to implement camera discovery, stream management, frame buffering, local filtering, model selection, evidence capture, or hardware-specific recovery logic.

The project is motivated by a practical bottleneck in agent systems. Agents can read text, call tools, and reason over screenshots, but they do not naturally watch the world over time. A normal request-response vision call can answer “what is in this image?” but it does not solve “watch these streams and tell my agent when a condition becomes true.” VideoMemory addresses that gap by turning video streams into task-scoped events.

The long-term technical emphasis is not only vision-language modeling. The hard part is cost-aware monitoring under limited hardware:

  • many streams may be active at once;
  • most frames are uninteresting;
  • cloud VLM calls are expensive relative to cheap local filtering;
  • edge hardware may not have a strong GPU;
  • cameras can be stale, unavailable, or locked by another process;
  • an external agent needs structured readiness and event evidence, not just a human-facing video preview.

The current system implements the basic architecture for this: a Flask-based core service, input device discovery, network and local camera handling, task lifecycle APIs, one-off frame captioning, binary and general monitor types, semantic filtering, saved evidence frames and clips, readiness diagnostics, usage/cost accounting, OpenAPI documentation, Android snapshot streaming, a generic webhook contract, OpenClaw-compatible hooks, and a Claude Code channel adapter. Claude Code is an important integration surface, but it is not the definition of the project. It is one adapter over the more general VideoMemory core.

System Goal
#

The intended system shape is:

many video streams
  -> VideoMemory stream ingestion
  -> local sampling, filtering, readiness checks, and buffering
  -> selective binary or VLM evaluation
  -> task state and aligned evidence
  -> generic event delivery
  -> any external agent decides what to do next

The essential abstraction is a natural-language monitoring task attached to a video source. The agent says what visual condition matters. VideoMemory handles the stream and produces state changes or events when the condition is observed.

The project should be evaluated by whether it makes video streams agent-addressable in a practical way. That means the system is successful only if it can support more than a single demo camera, avoid naive model-per-frame processing, expose usable readiness diagnostics, and deliver event evidence that an external agent can trust.

Design Principles
#

Agent Agnostic By Default
#

VideoMemory core should not be tied to Claude, OpenClaw, or any single agent runtime. The core interface is HTTP plus event delivery. Agent-specific integrations should be adapters that translate their runtime conventions into the same underlying task and event contract.

This matters because the reusable project contribution is the video ingestion and eventing layer, not a one-off Claude plugin. Claude Code is useful because it demonstrates push-style wakeups into a real agent session, but the same core should work with a generic webhook receiver, OpenClaw, a local desktop agent, or a future research agent.

Perception And Action Are Separate
#

VideoMemory owns the visual condition. The external agent owns the follow-up action.

For a request like “tell me when I hold my phone up,” VideoMemory stores and monitors the condition “a phone is visibly held up.” The agent stores the action “tell the user.” This prevents the perception service from becoming an unbounded automation system and keeps its correctness boundary clear: device state, frame state, model results, task state, evidence, and event delivery.

Cost And Hardware Constraints Are First-Class
#

Multiple streams make naive VLM processing unrealistic. If ten cameras are sampled continuously and every frame is sent to a cloud model, the system will hit cost, latency, and throughput limits almost immediately. VideoMemory’s architecture therefore needs cheap local triage before expensive reasoning:

  • sample streams rather than process every frame;
  • drop stale or redundant frames;
  • use binary monitors for simple conditions;
  • use semantic filtering before richer VLM calls;
  • reserve general VLM note generation for tasks that need context;
  • expose model-call counts, latency, and estimated cost;
  • keep low-power deployments such as Raspberry Pi in scope.
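The scale of the problem can be made concrete with simple arithmetic. The sketch below is illustrative only: the function name, the sampling rate, and the 2% pass rate are assumptions chosen for the example, not measurements from VideoMemory.

```python
def vlm_calls_per_hour(streams: int, sampled_fps: float, pass_rate: float = 1.0) -> float:
    """Model calls per hour if every sampled frame that survives local triage goes to a VLM."""
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError("pass_rate must be between 0 and 1")
    return streams * sampled_fps * 3600 * pass_rate


# Illustrative numbers only (assumed, not measured).
naive = vlm_calls_per_hour(streams=10, sampled_fps=1.0)                    # every sampled frame
triaged = vlm_calls_per_hour(streams=10, sampled_fps=1.0, pass_rate=0.02)  # 2% pass local triage
print(f"naive: {naive:.0f} calls/hour, triaged: {triaged:.0f} calls/hour")
```

Even at one sampled frame per second, ten streams produce tens of thousands of candidate model calls per hour, which is why the triage steps listed above come before any expensive reasoning.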

Readiness Is Not Task Creation
#

A task row existing in the database does not prove that a monitor is working. A useful monitor also needs a registered device, a running ingestor, fresh frames, valid camera permission, model readiness when required, and a configured event path if the agent expects a wakeup.

VideoMemory therefore exposes device readiness separately from task creation. The readiness payload reports whether the device exists, whether the ingestor is running, whether a frame is available, the frame age, browser camera freshness, binary monitor status, semantic filter status, and warnings. This is important because agents otherwise tend to report success after an API call even when the camera is not actually producing usable frames.

Evidence Must Match The Event
#

If a task note says that a visual condition occurred, the saved frame and video clip must correspond to the same event. A later live snapshot is not adequate evidence. In a slow VLM path, the callback may finish after the live rolling buffer has already moved forward. VideoMemory’s evidence path therefore needs to snapshot frames at queue/evaluation time so that saved evidence remains aligned with the trigger.

Implemented Architecture
#

Core Service
#

The core service is a Flask application in flask_app/app.py. It serves the web UI, JSON APIs, OpenAPI schema, device pages, task pages, settings pages, ingestor debug surfaces, semantic preview routes, capture endpoints, and readiness endpoints. It can be started directly with:

uv run flask_app/app.py

It can also be run in Docker using the core compose stack:

docker compose -f docker-compose.core.yml up --build

The default local service address is:

http://localhost:5050

The API exposes both human-facing UI surfaces and machine-facing endpoints. The machine-facing endpoints are the important part for agent integration.

Input Sources
#

The repository currently supports or documents these input modes:

  • USB and built-in cameras through local device discovery and OpenCV-style access.
  • RTSP and HTTP video streams as network sources.
  • Browser camera frames posted into VideoMemory through a browser camera bridge.
  • Android phone snapshot streams through an Android app that serves /snapshot.jpg.
  • Raspberry Pi USB camera deployments, including fixes for camera contention between preview/capture and long-running ingestion.

This mix is important because agent-facing video memory cannot assume one clean camera environment. Laptop webcams, browser permission flows, Android devices, network cameras, and low-power Linux hosts all fail differently.

Task Model
#

The central application object is a task. A task has an io_id, a natural language task_description, optional bot_id, optional monitor type, optional semantic filter configuration, status, notes, and optional evidence. Agents create tasks with:

POST /api/tasks

The task creation body requires:

{
  "io_id": "0",
  "task_description": "Watch for a phone visibly held up in the user's hand."
}

Optional fields include:

  • bot_id, for identifying the creating agent;
  • monitor_type, either general or binary;
  • save_note_frames;
  • save_note_videos;
  • semantic_filter_keywords or the alias required_keywords;
  • semantic filter backend, threshold, threshold mode, reduce method, smoothing, and ensemble mode.

The task API supports listing tasks, reading details, editing the description, stopping a task, and deleting a task.
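As a minimal sketch of how an agent might assemble this call, the helper below builds a POST /api/tasks request with the two required fields plus any optional fields. The helper name build_task_request is hypothetical, and the response format is not shown because it is not described here. The request is built but not sent.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5050"  # default local service address from this report


def build_task_request(io_id: str, task_description: str, **optional) -> urllib.request.Request:
    """Build (but do not send) a POST /api/tasks request."""
    body = {"io_id": io_id, "task_description": task_description}
    # Optional fields such as monitor_type or bot_id are passed through unchanged.
    body.update({key: value for key, value in optional.items() if value is not None})
    return urllib.request.Request(
        f"{BASE_URL}/api/tasks",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_task_request(
    "0",
    "Watch for a phone visibly held up in the user's hand.",
    monitor_type="binary",
)
# Against a running service this could be sent with: urllib.request.urlopen(req)
```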

Monitor Types
#

VideoMemory currently separates simple visual triggers from richer note generation.

monitor_type: "binary" is intended for fast done/not-done conditions. It is the correct shape for “wake me when a human is visible” or “tell me when the dog appears.” The documented binary monitor uses a local true/false path, defaults to a threshold of 0.5, and requires 2 true votes out of the last 3 evaluated frames before marking the task done. This mode is valuable because it can avoid a cloud model dependency for simple triggers.
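The voting rule can be sketched in a few lines. This is an illustration of the documented 2-of-3 behavior, not the repository's implementation: the class name VoteWindow is hypothetical, and treating a score equal to the threshold as a true vote is an assumption.

```python
from collections import deque


class VoteWindow:
    """Latches done once `required` of the last `window` scores reach `threshold`."""

    def __init__(self, threshold: float = 0.5, required: int = 2, window: int = 3):
        self.threshold = threshold
        self.required = required
        self.votes = deque(maxlen=window)  # keeps only the most recent votes
        self.done = False

    def update(self, score: float) -> bool:
        # Assumption: a score exactly at the threshold counts as a true vote.
        self.votes.append(score >= self.threshold)
        if sum(self.votes) >= self.required:
            self.done = True  # stays done once triggered
        return self.done


monitor = VoteWindow()
results = [monitor.update(score) for score in [0.9, 0.1, 0.2, 0.8, 0.7]]
```

In this run the single high score at the start does not trigger the task; it becomes done only when two of the last three frames vote true, which is the point of requiring more than one vote.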

monitor_type: "general" is the older chunked VLM monitor. It writes richer task notes and is useful when the user needs explanation, description, or more context than a binary trigger. It may require a configured cloud or local model provider.

The system also exposes a one-off caption endpoint:

POST /api/caption_frame

That endpoint is for immediate questions about the current frame, not for long-running monitoring.

Semantic Filtering
#

Semantic filtering is a cost and hardware control. The task API can accept semantic keywords so that frames are gated before being sent to the richer VLM path. The current API describes backends including dino_clip_adapter and semantic_autogaze, a default threshold, threshold modes, reduce strategies, smoothing, and simple ensemble modes.
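A minimal sketch of the gating decision follows, assuming the backend has already produced one similarity score per keyword. The function name, the score values, and the mean reduce option are assumptions for illustration; the actual backends and their options live in the repository.

```python
def passes_semantic_gate(keyword_scores: dict, threshold: float, reduce: str = "max") -> bool:
    """Decide whether a frame is worth forwarding to the richer VLM path."""
    if not keyword_scores:
        return False  # nothing to match against, so do not spend a model call
    values = list(keyword_scores.values())
    if reduce == "max":
        score = max(values)
    elif reduce == "mean":  # assumed alternative reduce strategy
        score = sum(values) / len(values)
    else:
        raise ValueError(f"unknown reduce strategy: {reduce}")
    # Absolute threshold mode: compare the reduced score directly to the threshold.
    return score >= threshold


frame_scores = {"phone": 0.41, "hand": 0.12}  # made-up scores
forward_with_max = passes_semantic_gate(frame_scores, threshold=0.3)
forward_with_mean = passes_semantic_gate(frame_scores, threshold=0.3, reduce="mean")
```

With max reduction a single strong keyword match is enough to pass the frame, while an averaging strategy would be stricter on the same scores.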

The debug API also exposes semantic filter status, semantic preview streams, semantic pass streams, queue size, dropped semantic frames, pass-frame age, and latest evaluation timestamp. These details are thesis-relevant because they show the system moving toward explicit accounting for the local filtering stage instead of treating the VLM as the whole system.

Evidence Model
#

VideoMemory can save frames and video clips associated with task notes. Event payloads can include:

  • note_frame_api_url;
  • note_video_api_url.

The important rule is that agents should use those saved URLs when responding to a detected event. They should not take a fresh snapshot later and pretend it is evidence for the earlier event.

An observed evidence bug made this rule concrete. The saved still image was tied to the model input frame, but the evidence video was built later from the live rolling evidence buffer. When VLM processing was slow, the saved video could drift ahead and no longer match the picture or the description. The durable fix was to snapshot the evidence buffer when a chunk is queued, then build note videos from that queued snapshot instead of reading the live buffer at callback time.
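The difference between reading the live buffer late and snapshotting at queue time can be shown with a small rolling buffer. This is a simplified illustration with hypothetical names and string placeholders for frames; it is not the code in the repository's evidence module.

```python
from collections import deque


class EvidenceBuffer:
    """Rolling buffer of recent frames; old frames fall off as new ones arrive."""

    def __init__(self, max_frames: int):
        self._frames = deque(maxlen=max_frames)

    def push(self, frame) -> None:
        self._frames.append(frame)

    def snapshot(self) -> list:
        # Copy the current contents so later pushes cannot change the result.
        return list(self._frames)


buffer = EvidenceBuffer(max_frames=3)
for frame in ["f1", "f2", "f3"]:
    buffer.push(frame)

queued_clip = buffer.snapshot()  # taken when the chunk is queued for evaluation

# While a slow model call is in flight, the live buffer keeps rolling forward.
for frame in ["f4", "f5", "f6"]:
    buffer.push(frame)

late_clip = buffer.snapshot()  # what a callback-time read of the live buffer would see
```

Here queued_clip still holds the frames the model was asked about, while late_clip has drifted to entirely different frames, which is the mismatch the fix removes.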

Readiness Diagnostics
#

The endpoint:

GET /api/device/{io_id}/readiness

returns structured readiness information. It distinguishes registered devices from working devices. It reports:

  • whether the device exists;
  • whether an ingestor exists;
  • whether the ingestor is running;
  • whether it has a frame;
  • frame age;
  • browser camera freshness for browser sources;
  • binary monitor status;
  • semantic filter status;
  • warnings.

Common warnings include no fresh browser frames, no active ingestor, no captured frame, unregistered device ID, and local camera permission problems on macOS.

This readiness endpoint is one of the key agent-facing improvements. Without it, an agent can easily confuse “task created” with “visual monitoring is actually armed.”
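An agent-side check might look like the sketch below. The JSON field names (device_exists, ingestor_running, has_frame, frame_age_seconds, warnings) and the five-second freshness limit are assumptions made for illustration; the real payload keys should be taken from the OpenAPI schema.

```python
def is_armed(readiness: dict, max_frame_age_s: float = 5.0) -> tuple:
    """Return (armed, reasons) from a readiness payload with assumed field names."""
    reasons = []
    if not readiness.get("device_exists"):
        reasons.append("device is not registered")
    if not readiness.get("ingestor_running"):
        reasons.append("ingestor is not running")
    if not readiness.get("has_frame"):
        reasons.append("no frame captured yet")
    age = readiness.get("frame_age_seconds")
    if age is None or age > max_frame_age_s:
        reasons.append("latest frame is stale or missing")
    # Any warning reported by the service also blocks an "armed" report.
    reasons.extend(readiness.get("warnings") or [])
    return (not reasons, reasons)


healthy = {"device_exists": True, "ingestor_running": True, "has_frame": True,
           "frame_age_seconds": 0.4, "warnings": []}
stale = dict(healthy, frame_age_seconds=42.0)

armed_ok, _ = is_armed(healthy)
armed_stale, stale_reasons = is_armed(stale)
```

The design choice is that the agent reports success to the user only when the reasons list is empty, not when task creation returns.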

Model Providers And Usage Accounting
#

The repository contains provider adapters for Google, OpenAI, Anthropic, OpenRouter, Mistral, and local vLLM-style usage. The settings surface supports keys such as:

  • GOOGLE_API_KEY;
  • OPENAI_API_KEY;
  • OPENROUTER_API_KEY;
  • ANTHROPIC_API_KEY;
  • VIDEO_INGESTOR_MODEL.

Usage accounting is implemented in videomemory/system/usage.py. It normalizes model names, tracks input tokens, output tokens, total tokens, latency, success status, and estimated cost. The pricing table includes entries for Gemini, OpenAI, Anthropic, OpenRouter/free models, and local-vllm at zero external API cost.

This is important for the multi-stream thesis framing. If the system is going to claim cost-aware ingestion, model-call accounting cannot be optional. The existing usage layer is a starting point for measuring dollars per useful event and model calls avoided by local filtering.
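The dollars-per-useful-event metric can be sketched as follows. The record fields mirror the ones listed above, but the class, the function names, the example-cloud-model entry, and its prices are placeholders invented for this example; they are not the contents of videomemory/system/usage.py.

```python
from dataclasses import dataclass

# Placeholder prices in USD per million tokens (hypothetical values).
PRICE_PER_MTOK = {
    "example-cloud-model": {"input": 1.00, "output": 4.00},
    "local-vllm": {"input": 0.0, "output": 0.0},  # zero external API cost
}


@dataclass
class UsageRecord:
    model: str
    input_tokens: int
    output_tokens: int
    latency_s: float
    success: bool


def estimated_cost(record: UsageRecord) -> float:
    price = PRICE_PER_MTOK[record.model]
    return (record.input_tokens * price["input"]
            + record.output_tokens * price["output"]) / 1_000_000


def dollars_per_useful_event(records, useful_events: int):
    """Total estimated spend divided by events that actually mattered."""
    if useful_events <= 0:
        return None  # undefined when nothing useful was detected
    return sum(estimated_cost(r) for r in records) / useful_events


records = [
    UsageRecord("example-cloud-model", 1000, 100, 0.8, True),
    UsageRecord("example-cloud-model", 1000, 100, 0.9, True),
    UsageRecord("local-vllm", 1000, 100, 0.3, True),
]
cost_per_event = dollars_per_useful_event(records, useful_events=2)
```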

Agent Integration
#

Generic Contract
#

The generic contract is documented in docs/agent-integration-contract.md. VideoMemory core owns:

  • device discovery and stream registration;
  • task lifecycle and video ingestion;
  • VLM-based detection loop and task notes.

The external agent owns:

  • user conversation and orchestration;
  • policy, memory, authentication, authorization, and tool planning;
  • translating user intent into VideoMemory API calls;
  • follow-up actions and delivery routing.

The stable agent-to-core endpoints include health checks, devices, captures, preview, readiness, network device registration, tasks, settings, and OpenAPI/schema access. This makes VideoMemory a service that agents can call rather than an agent runtime itself.

Event Delivery
#

For “when X happens, do Y” workflows, the visual condition belongs in VideoMemory and the follow-up action belongs in the external agent. Event delivery can happen through a generic webhook receiver, OpenClaw-compatible hooks, or a Claude Code channel.

The generic webhook configuration uses settings such as:

VIDEOMEMORY_OPENCLAW_WEBHOOK_URL
VIDEOMEMORY_OPENCLAW_WEBHOOK_TOKEN
VIDEOMEMORY_SELF_BASE_URL

The setting name still contains OPENCLAW because it originated in that integration, but the value can point at any compatible receiver.
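On the receiving side, a generic agent mainly needs to authenticate the call and pull out the saved evidence URLs. The sketch below makes several assumptions that should be checked against docs/agent-integration-contract.md: that the token arrives as a bearer token in an Authorization header, that the event body is JSON, and that evidence URLs may be relative to the VideoMemory base URL. The example paths are hypothetical.

```python
import hmac
import json
from urllib.parse import urljoin


def is_authorized(authorization_header, expected_token: str) -> bool:
    """Constant-time comparison against an assumed 'Bearer <token>' header."""
    expected = f"Bearer {expected_token}".encode()
    return hmac.compare_digest((authorization_header or "").encode(), expected)


def evidence_urls(raw_body: bytes, base_url: str) -> dict:
    """Extract saved evidence URLs from an event body, resolving relative paths."""
    event = json.loads(raw_body)
    urls = {}
    for key in ("note_frame_api_url", "note_video_api_url"):
        value = event.get(key)
        if value:
            urls[key] = urljoin(base_url, value)
    return urls


body = json.dumps({
    "note_frame_api_url": "/api/example/frame/7",  # hypothetical path
    "note_video_api_url": "/api/example/video/7",  # hypothetical path
}).encode()
urls = evidence_urls(body, "http://127.0.0.1:5050")
```

A receiver built this way responds to the event using only the URLs that arrived with it, rather than capturing a fresh snapshot afterwards.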

Claude Code Adapter
#

The Claude Code integration is a useful proof point because it supports push-style wakeups into a running Claude session. The documented flow is:

VideoMemory monitor task
  -> task note / detection
  -> POST http://127.0.0.1:8791/videomemory-event
  -> Claude Code channel
  -> running Claude Code session receives a videomemory channel event

The public plugin install path is:

claude auth login
claude plugin marketplace add https://github.com/Clamepending/videomemory
claude plugin install videomemory@videomemory

The friend-facing test prompt used for this path is:

Use VideoMemory to watch my pet dog from my FaceTime camera. Use a binary monitor and wake me when the dog is visible.

The important interpretation is that Claude is an adapter and a strong demo surface. It should not be treated as the whole project. The future thesis should frame Claude as evidence that VideoMemory can deliver events into a real agent, while still emphasizing that the core interface is agent-agnostic.

OpenClaw And Package Distribution
#

The package under videomemory-package/ is published as @clamepending/videomemory and currently declares version 0.1.9. It provides CLI binaries named videomemory and videomemory-openclaw, includes bundled scripts, hooks, skills, and plugin metadata, and has a prepack script that synchronizes bundled scripts before packaging.

The public GitHub release/tag path previously referenced v0.1.6, while the npm package was @clamepending/videomemory@0.1.9. This mismatch is not necessarily wrong, but it is a release-management fact that should be recorded because clean installation is part of making an agent tool usable by people outside the original development machine.

Concrete Engineering Milestones
#

Core HTTP Service And API
#

The core service provides the durable interface for agents. It exposes health, device discovery, capture, preview, caption, task lifecycle, readiness, settings, and OpenAPI endpoints. The API is explicit enough that an external agent can discover devices, create a task, check readiness, inspect task notes, and deliver follow-up actions without scraping a UI.

Status: implemented.

Evidence:

  • README.md documents core startup and API calls.
  • docs/agent-integration-contract.md defines the external agent boundary.
  • flask_app/app.py implements task creation, readiness, semantic filter debug routes, and OpenAPI descriptions.
  • Tests cover devices, task creation, task lifecycle, settings reload, caption frame, usage API, and URL utilities.

Browser Camera Bridge
#

The browser camera bridge exists because direct local camera access is not always the best agent path on macOS. Browser permission prompts are visible to the user, and browser frames can be posted into VideoMemory under a stable browser_* device ID.

Status: implemented and used by the Claude flow.

Evidence:

  • The friend-facing path uses http://127.0.0.1:5050/browser-camera/facetime.
  • Readiness diagnostics include browser-camera freshness.
  • The README instructs users to keep the opened camera tab running while a monitor is active.

Android Snapshot Stream
#

The Android app turns a phone into a network camera by running a small HTTP server that serves the latest camera frame as GET /snapshot.jpg. It uses CameraX, exposes a single screen with a snapshot URL and controls, and requires Android 7+ plus camera permission. The VideoMemory server and phone must be on the same LAN or mutually reachable through Tailscale.

Status: implemented as a developer utility.

Evidence:

  • android/README.md documents build, run, and usage.
  • The app can be added in VideoMemory as a network camera by entering the snapshot URL.

Binary Monitor
#

The binary monitor provides a fast local path for simple visual conditions. It is designed for done/not-done triggers, not descriptive notes. It reduces the need for cloud model calls on simple events and is therefore central to the limited-hardware, cost-aware framing.

Status: implemented.

Evidence:

  • The agent contract documents monitor_type: "binary".
  • The OpenAPI description says binary uses a local done/not-done monitor and does not require a cloud model key.
  • The Claude integration can create binary monitors through its MCP tool.

Semantic Filtering
#

Semantic filtering gates frames before the VLM path. It supports keyword configuration on task creation and debug/status routes for tuning. In local monitoring work, a phone-held-up monitor used keywords phone, smartphone, hand, and person with dino_clip_adapter, threshold 0.3, absolute mode, max reduction, and ensemble off.

Status: implemented, but needs more systematic multi-stream evaluation.

Evidence:

  • Task creation accepts semantic_filter_keywords and aliases.
  • Debug endpoints expose semantic filter status and semantic preview streams.
  • Prior local monitor setup used semantic keywords for the phone-held-up task.

Evidence Alignment Fix
#

The evidence alignment bug was a concrete correctness issue. Saved note videos did not always match the saved image and description because the video clip was built from the live rolling buffer too late. The fix was to snapshot evidence when a chunk is queued and build note videos from that queued snapshot.

Status: fixed and pushed to main in commit 57eb342.

Evidence:

  • Key files: videomemory/system/stream_ingestors/evidence.py, videomemory/system/stream_ingestors/video_stream_ingestor.py, and tests/test_video_stream_ingestor_detection_callbacks.py.
  • Targeted verification used uv run python -m unittest.
  • The fix directly supports the thesis claim that event evidence must be auditable and event-aligned.

Local Status And Phone-Held-Up Monitor
#

A local status check found that http://127.0.0.1:5050 was the effective VideoMemory base URL. The observed devices were FaceTime HD Camera, OBS Virtual Camera, and Browser FaceTime Camera. A phone-held-up monitor was created with semantic filtering, but it remained active with no task note, which meant the event had not yet fired. This distinction is important: a completed setup is not the same as a successful detection.

Status: partially successful operational test.

Evidence:

  • Service status can report launchd installed but unloaded while a direct Python process is serving the API.
  • A monitor status of active with done: false and an empty task note means the monitor is armed but no qualifying event has been detected.

Raspberry Pi Camera Contention Fix
#

On a Raspberry Pi deployment, /dev/video0 existed and standalone OpenCV could read frames once the Flask app was stopped. The failure was not missing hardware. The issue was contention: preview/capture tried to open the local USB camera independently instead of reusing the active ingestor. The durable fix was to reuse a shared ingestor lease and keep the local ingestor warm for preview and capture without changing default network-camera behavior.
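The ownership idea behind the fix can be illustrated with a reference-counted lease: the camera is opened once and every consumer shares the same handle. This is a simplified sketch with hypothetical names. It closes the camera when the last holder releases, so it does not reproduce the keep-warm policy described above.

```python
import threading


class SharedCameraLease:
    """Open the camera once and hand the same handle to every holder."""

    def __init__(self, open_camera, close_camera):
        self._open = open_camera
        self._close = close_camera
        self._lock = threading.Lock()
        self._holders = 0
        self._handle = None

    def acquire(self):
        with self._lock:
            if self._holders == 0:
                self._handle = self._open()  # only the first holder opens the device
            self._holders += 1
            return self._handle

    def release(self) -> None:
        with self._lock:
            if self._holders == 0:
                raise RuntimeError("release without matching acquire")
            self._holders -= 1
            if self._holders == 0:
                self._close(self._handle)  # last holder closes the device
                self._handle = None


calls = []
lease = SharedCameraLease(
    open_camera=lambda: calls.append("open") or "camera-0",  # stand-in for a real capture handle
    close_camera=lambda handle: calls.append("close"),
)
ingestor_handle = lease.acquire()  # long-running ingestion
preview_handle = lease.acquire()   # preview reuses the same handle instead of reopening
lease.release()
lease.release()
```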

Status: fixed on the Pi deployment.

Evidence:

  • Useful smoke checks were GET /api/device/0/preview, which should return a JPEG, and POST /api/device/0/capture?format=json, which should report source: "ingestor_live" or source: "shared_ingestor_warm".
  • Logs should not contain Local camera open failed or Could not open local camera during successful preview/capture.
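These smoke checks are easy to automate once the responses and logs have been fetched. The helper names below are hypothetical; the source values and log messages are the ones quoted above, and the JPEG check relies on the standard start-of-image marker.

```python
HEALTHY_SOURCES = {"ingestor_live", "shared_ingestor_warm"}
FAILURE_MARKERS = ("Local camera open failed", "Could not open local camera")


def looks_like_jpeg(data: bytes) -> bool:
    """JPEG files begin with the start-of-image marker FF D8."""
    return data[:2] == b"\xff\xd8"


def capture_is_healthy(capture_json: dict) -> bool:
    """True when the capture was served by the active or warm shared ingestor."""
    return capture_json.get("source") in HEALTHY_SOURCES


def log_has_camera_failure(log_text: str) -> bool:
    return any(marker in log_text for marker in FAILURE_MARKERS)


good_capture = capture_is_healthy({"source": "shared_ingestor_warm"})
bad_capture = capture_is_healthy({"source": "some_other_path"})  # made-up value
```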

This incident is important for the limited-hardware framing. It shows that multi-stream and edge deployment problems are often not model problems; they are resource ownership, camera lifecycle, and process coordination problems.

Public Plugin Packaging
#

VideoMemory has been packaged for public-style installation through a Claude plugin path and npm package. The high-signal release verification included JavaScript syntax checks, channel package checks, npm dry-run packaging, Python tests, and a CLI install check from the repo.

Status: implemented enough for friend-facing smoke testing.

Evidence:

  • Public install commands are documented in the README.
  • videomemory-package/package.json declares package metadata and CLI binaries.
  • Fresh Claude-home verification should show videomemory@videomemory enabled in claude plugin list.

Current Test Coverage
#

The test suite covers a broad set of system surfaces. Existing test files include:

  • provider adapters: Anthropic, Mistral, OpenRouter;
  • API endpoints: devices, caption frame, task creation, task lifecycle, usage, settings reload, update check;
  • readiness and device detection: device readiness, macOS device detector, local camera preview ingestor, network camera keepalive;
  • evidence: task note frames, task note videos, task note video API, task manager note frame/video settings, video stream ingestor callbacks;
  • integration and packaging: OpenClaw integration, OpenClaw plugin scaffold, alert transform, task helper original request;
  • semantic filtering: DINO backend and debug threshold API;
  • URL and snapshot utilities.

The evidence alignment work used uv run python -m unittest because pytest was not installed in the project virtual environment and a mismatched fallback interpreter caused noise. That should be recorded because reproducible verification matters for the future thesis.

Cost And Scaling Model
#

The system’s central scaling challenge is that video streams produce far more frames than an agent can afford to reason about with large models. A useful future evaluation should report:

  • number of streams;
  • sampled frames per second per stream;
  • frames dropped before model evaluation;
  • semantic filter pass rate;
  • binary monitor evaluations per minute;
  • general VLM calls per minute;
  • model latency distribution;
  • estimated model cost per hour;
  • CPU, GPU, memory, and network usage;
  • event latency from visual occurrence to agent event;
  • false positive and false negative rates.

The repository already contains some building blocks for this evaluation:

  • model usage records with latency and estimated cost;
  • semantic frame queue status and dropped-frame counters;
  • readiness payloads with frame age;
  • binary monitor status;
  • saved evidence URLs for event inspection.

What is not yet complete is a systematic multi-stream benchmark. The current system has architecture and instrumentation pointing in the right direction, but the strong claim “large amounts of multiple video streams on limited hardware” still needs measured evidence.
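As a starting point for such a benchmark, the raw counters from one run could be reduced to a few headline numbers. The function, its field names, and the input values below are assumptions for illustration; no measured results are implied.

```python
from statistics import median


def summarize_run(sampled_frames: int, frames_passed_filter: int,
                  vlm_calls: int, event_latencies_s: list) -> dict:
    """Reduce raw counters from one benchmark run to headline metrics."""
    if sampled_frames <= 0:
        raise ValueError("sampled_frames must be positive")
    return {
        "filter_pass_rate": frames_passed_filter / sampled_frames,
        # Calls avoided relative to sending every sampled frame to the model.
        "vlm_calls_avoided": sampled_frames - vlm_calls,
        "median_event_latency_s": median(event_latencies_s) if event_latencies_s else None,
    }


# Made-up inputs, not measurements.
summary = summarize_run(sampled_frames=1000, frames_passed_filter=50,
                        vlm_calls=40, event_latencies_s=[1.0, 3.0, 2.0])
```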

Strengths Of The Current System
#

VideoMemory has a clear and useful service boundary. It does not require every agent to know how to open cameras, keep streams alive, filter frames, call vision models, or package evidence. That is the right architectural direction.

The system also captures the right distinction between one-off perception and long-running monitoring. POST /api/caption_frame is for immediate questions. Tasks and monitors are for persistent watch conditions. That separation should remain explicit.

The readiness endpoint is a strong agent-facing design feature. It addresses a real failure mode where an agent reports success after a task is created even though the camera is stale, permission is missing, or no ingestor is running.

The evidence alignment fix is also important. It turns evidence from a UI convenience into a correctness property. For a future thesis, this is one of the clearest examples of an implementation bug revealing a deeper system principle.

Finally, the project has real deployment contact with messy environments: browser camera permission, Android snapshot serving, Raspberry Pi USB camera contention, package publishing, and Claude channel wakeups. These are practical systems issues that make the project more than a toy VLM demo.

Current Limitations
#

The biggest limitation is that multi-stream scale is not yet proven by a repeatable benchmark. The project is framed around many streams and limited hardware, but the report currently has more architectural evidence than quantitative multi-stream evidence.

Cost efficiency is partially implemented through usage accounting, binary monitors, and semantic filters, but it still needs end-to-end numbers. The future thesis should not merely claim that filtering saves money; it should measure model calls avoided and dollars saved under controlled workloads.

The agent-agnostic claim is architecturally plausible because the HTTP API and webhook contract are generic, but the most polished onboarding path currently points at Claude Code. A generic receiver demo should be elevated so the project is not perceived as only a Claude plugin.

The monitor types also need comparative evaluation. Binary monitors, semantic filters, and general VLM notes should be compared on the same tasks using the same streams. Without that, it is hard to justify when each mode should be used.

The evidence system has a clear correctness target, but retention, export, and dataset-style reuse remain underdeveloped. For thesis writing, it would be valuable to export per-task evidence bundles with task metadata, frame/video URLs, timestamps, model outputs, and readiness snapshots.

Thesis-Relevant Interpretation
#

The likely future thesis contribution is not “I built a video app.” It is:

Agents need a resource-constrained, agent-agnostic video ingestion layer that turns many unreliable live streams into structured, evidence-backed events.

VideoMemory is a concrete implementation of that idea. It contributes:

  • an agent-facing task abstraction for natural-language visual monitoring;
  • a separation between perception state and agent action state;
  • support for multiple stream types;
  • cheap local paths through binary monitors and semantic filtering;
  • readiness diagnostics that prevent false success reports;
  • event-aligned evidence for auditability;
  • generic webhook delivery plus concrete adapters such as Claude Code;
  • operational evidence from local Mac, browser, Android, and Raspberry Pi environments.

The future thesis should use this report as raw material for system design and implementation chapters. The evaluation chapter should add quantitative multi-stream, cost, latency, and accuracy results. The discussion chapter should use the evidence alignment bug, Pi camera contention bug, and readiness/task distinction as concrete failure-driven design lessons.

Recommended Next Engineering Steps
#

The next work should be chosen to strengthen the central claim.

First, build a repeatable multi-stream benchmark. It should run several streams at once, measure frame sampling, filter pass rate, model-call rate, latency, resource usage, and cost. This benchmark is more important than adding another agent adapter.

Second, add a generic webhook receiver demo and document it as a first-class path. Claude should remain a polished integration, but the report and thesis need an obviously agent-agnostic demonstration.

Third, produce evidence bundles for completed tasks. A bundle should include task metadata, model settings, readiness state, note text, frame evidence, video evidence, timestamps, and usage/cost records. This would make future thesis figures and tables much easier.

Fourth, compare monitor modes on the same workload. The system needs measured answers for when to use binary monitoring, semantic-filtered VLM monitoring, and general VLM notes.

Fifth, run a limited-hardware benchmark on Raspberry Pi or equivalent hardware. The current Pi work identified and fixed an important camera lifecycle issue, but the thesis claim needs numbers: how many streams, what sampling rate, what latency, and what model path can run reliably.

Report Artifacts
#

Source locations:

  • VideoMemory repository: /Users/mark/Desktop/projects/videomemory
  • Hugo report page: /Users/mark/Desktop/projects/markogata/content/videomemory/report/index.md
  • LaTeX report source: /Users/mark/Desktop/projects/markogata/content/videomemory/report/videomemory-project-report.tex

Local preview URL:

http://127.0.0.1:1313/videomemory/report/
