The METR study found developers believe AI makes them 20% faster. Measured: 19% slower. I wanted data on my own usage, so I built a local effectiveness tool.

Three months after shipping this, I realized aide was one node in a broader pattern of independent engineers assembling the OS layer for coding agents. That follow-up: We Are the LLM’s Memory.

aide dashboard overview

What Is This

aide ingests Claude Code and Codex session logs (JSONL) into SQLite and shows what is happening across projects: cost, token usage, session patterns, effectiveness trends, review queues, and repeat workflow friction.

It also turns repeated findings into proposed project knowledge. You review and accept those artifacts before they become runbooks or start-session briefs for future human or agent work.

Zero LLM calls. Zero cost to run. All data stays local.

~/.claude/projects/**/*.jsonl → Claude parser → SQLite → aide
~/.codex/sessions/**/*.jsonl  → Codex parser  → SQLite → aide

The Problem

Claude Code and Codex generate detailed local logs: messages, tool calls, token counts, timestamps, commands, and metadata. The files sit under ~/.claude/projects/ and ~/.codex/sessions/. Nobody looks at them.

Everyone has opinions about whether AI coding tools are worth it. Nobody has data. Am I getting more efficient over time? Which projects eat the most tokens? Where do sessions repeatedly go wrong? Did a better instruction or runbook actually reduce friction?

The useful view is personal trends across all your sessions, plus a feedback loop that converts repeat mistakes into better future context. A single session transcript doesn’t show either.

How It Works

Data Pipeline

  1. Discovery - finds configured Claude and Codex JSONL sources
  2. Parsing - normalizes provider-specific events into one session model
  3. Work Blocks - splits each session into continuous coding periods at 30-minute idle gaps
  4. Ingestion - upserts into SQLite with incremental ingest, tracking file mtime
  5. Review Loop - ranks suspicious sessions, groups repeat causes, and proposes reviewable artifacts

Cost Estimation

All costs estimated at current API rates. For subscription users (Pro/Max), a toggle shows token-based metrics instead of dollar amounts.

Key Features

Overview Dashboard

Summary cards, effectiveness metrics (cache hit rate, edit ratio, compaction rate, error rate), trend charts, work blocks per week.

Overview page with effectiveness metrics and trend charts

Effectiveness

Project and provider rollups compare the last 30 days with the previous 30 days: average cost per session, active time, review queue rate, edit attribution, and tool error rate.

Action summaries group repeated investigation signals like no-edit work, weak project attribution, missing prefix rules, and edit attribution gaps. Each action links back to the sessions that caused it.

Session Detail

Drill into any session: token breakdown, tool usage, files touched with read/edit/write counts, work block timeline, error categorization.

Session detail showing tokens, tools, and files

Insights

First-prompt effectiveness, cost concentration, time patterns, model usage, tool sequences, thinking block analysis, permission friction, and investigation actions.

Insights page with first-prompt analysis and time patterns

More

  • Session Autopsy - aide autopsy <session-id> generates a per-session diagnostic report: cost breakdown, context window analysis, compaction detection, and project-instruction improvement suggestions.
  • Artifact Review - repeated findings become proposed semantic artifacts. Accepted artifacts feed runbooks and task briefs.
  • Subscription Mode - toggle between API cost view and token-based metrics for Pro/Max subscribers.
  • CLI Stats - aide stats prints a quick summary to the terminal without opening the browser.

Technical Details

Work Blocks

A JSONL “session” is just one terminal window staying open. A session spanning Mon-Wed with sleep in between reads as a 48-hour session, making “duration” useless.

The gap distribution between messages is bimodal: most gaps are under 5 minutes (active work), a small cluster is over 30 minutes (away). A 30-minute threshold cleanly separates the two modes.

Each session splits into work blocks, continuous coding periods. “119 work blocks across 35 sessions” tells you more than “35 sessions.”

Error Categorization

Tool errors are categorized automatically: Test (pytest, jest), Lint (ruff, eslint), Build (pip, npm), Git, Edit Mismatch, File Access. Most “errors” are normal iteration (test failures during edit-test-fix cycles). The dashboard separates iteration from actual mistakes.

Effectiveness Metrics

  • Cache Hit Rate - % of input context served from cache (higher = better reuse)
  • Edit Ratio - % of tool calls that are file edits (higher = more productive)
  • Compaction Rate - % of sessions hitting context limits
  • Read-to-Edit Ratio - reads per edit (lower = less searching)
  • Iteration Rate - sessions with files edited 3+ times
  • Review Rate - % of recent sessions that deserve manual review
  • Edit Attribution - % of edit/write calls that attach to a file path

AI coding logs are a gold mine. Every tool call, token count, timestamp, and command shape is there. The hard part was deciding which metrics actually matter.

Work blocks changed everything. Raw session duration was misleading for every chart. Splitting at idle gaps made the data honest. Data cleaning matters more than fancy visualizations.

Zero LLM calls was the right constraint. Every metric is heuristic. No API calls, no marginal cost. Re-ingest and rebuild as many times as you want.

Tech Stack

  • Python 3.12+ with Click CLI
  • Flask + Jinja2 templates
  • Chart.js (CDN) for interactive charts
  • Tailwind CSS (CDN) for styling
  • SQLite (stdlib) for storage
  • uv for package management