KOLens team · AI · Engineering · Cost · Cache

Why we cache every LLM call — the economics under the dossier

Operator feedback was direct: "The overview returned by a request must be persisted to the database, so we don't regenerate it on every request and waste money." We audited every call_llm site, fixed the two that weren't cached, and added a refresh flag everywhere. Per-KOL LLM spend dropped to whatever the operator deliberately chose.

The principle

Every LLM call that could be repeated for the same input must be cached. The Refresh button is the only path that re-bills the operator.

The audit that started it

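Operator feedback came in two sentences: "The overview returned by a request must be persisted to the database, so we don't regenerate it on every request and waste money. All other copy that is produced by calling an LLM must be persisted to disk as well." So we grepped the repo for call_llm and generateText. Eight LLM-driven features. Six were already cached. Two weren't.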

Already cached (no change)

  • kol_overview: DBKolOverviewSummary, 30 days
  • kol_background: DBKolBackgroundResearch, 30 days
  • kol_deep_overview: DBKolDeepOverview, 7 days
  • kol_activity: DBKolActivitySummary, 24 hours
  • translate: writes bio_zh / display_name_zh / category_zh directly to author columns; only runs once per author per field
  • The 4 insight cards: DBSearchInsight via the insight_* services

Newly cached

  • kol_outreach — new kol_outreach_drafts table (migration 013), 30 days, locale-gated
  • kol_audience_insights — new kol_audience_insights table (migration 013), 30 days, locale-gated + snapshot-hash gate
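
The TTLs listed above can be written down as a single lookup table. This is a minimal sketch, not the production code: the feature keys and durations come from this post, while the CACHE_TTL name and the is_stale helper are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Durations as stated in this post; the dict name is an assumption.
CACHE_TTL = {
    "kol_overview": timedelta(days=30),
    "kol_background": timedelta(days=30),
    "kol_deep_overview": timedelta(days=7),
    "kol_activity": timedelta(hours=24),
    "kol_outreach": timedelta(days=30),
    "kol_audience_insights": timedelta(days=30),
}


def is_stale(feature: str, generated_at: datetime,
             now: Optional[datetime] = None) -> bool:
    """Return True when a cached row is older than its feature's TTL."""
    now = now or datetime.now(timezone.utc)
    return now - generated_at > CACHE_TTL[feature]
```

Keeping the windows in one place makes it easy to see that only the activity summary and the deep overview expire faster than 30 days.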

Three call shapes per feature

Each cached LLM feature exposes the same three entry points so the calling layers (HTTP route, MCP tool, scheduler) can choose:

  1. read_cached(...): never calls the LLM. Returns None on miss. Used for cache-only paths like the watchlist's brand-fit star rating.
  2. get_or_generate(...): cache-first. LLM only on miss or stale.
  3. regenerate(...): always calls. Maps to the UI's Refresh button + the ?refresh=true query.
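
The three entry points can be sketched with an in-memory store standing in for the database table. Everything except the three method names is an assumption made for illustration: the CachedFeature class, the string key, and the choice to have read_cached treat a stale row as a miss.

```python
import time
from typing import Callable, Optional


class CachedFeature:
    """Hypothetical in-memory stand-in for one cached LLM feature."""

    def __init__(self, generate: Callable[[str], str], ttl_seconds: float):
        self._generate = generate   # the only function that calls the LLM
        self._ttl = ttl_seconds
        self._rows = {}             # key -> (written_at, content)

    def read_cached(self, key: str) -> Optional[str]:
        """Never calls the LLM. Returns None on a miss or a stale row."""
        row = self._rows.get(key)
        if row is None:
            return None
        written_at, content = row
        if time.time() - written_at > self._ttl:
            return None
        return content

    def get_or_generate(self, key: str) -> str:
        """Cache-first: the LLM runs only when read_cached finds nothing usable."""
        cached = self.read_cached(key)
        return cached if cached is not None else self.regenerate(key)

    def regenerate(self, key: str) -> str:
        """Always calls the LLM and overwrites the cached row."""
        content = self._generate(key)
        self._rows[key] = (time.time(), content)
        return content
```

In this shape a caller such as a watchlist view would use read_cached and render an empty state on None, while a Refresh handler would call regenerate directly.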

Locale as a first-class cache key dimension

Adding zh-CN to the dossier in May 2026 surfaced a subtle cache bug: an existing English row served the zh-CN caller until they hit Refresh. So locale moved into the cache row's JSON payload, and _read_cached now treats a locale mismatch as a miss.

Backwards-compatible: pre-feature rows have no locale field and default to "en". Existing English caches keep serving English users — only the new languages need a fresh generation pass.
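
A minimal sketch of the locale gate, assuming the cache row stores its payload as a JSON string with an optional "locale" key. The function name read_cached_payload and the payload fields other than the locale are hypothetical.

```python
import json
from typing import Optional

# Rows written before the locale gate have no locale field and count as English.
DEFAULT_LOCALE = "en"


def read_cached_payload(raw_json: Optional[str], caller_locale: str) -> Optional[dict]:
    """Return the cached payload, or None if the row is missing or in another locale."""
    if raw_json is None:
        return None
    payload = json.loads(raw_json)
    row_locale = payload.get("locale", DEFAULT_LOCALE)
    if row_locale != caller_locale:
        return None  # a locale mismatch is treated as a cache miss
    return payload
```

With this default, a legacy row keeps serving an "en" caller, while a "zh-CN" caller sees a miss and triggers a fresh generation.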

Snapshot-hash gate (audience insights only)

Audience insights is the odd one out: its input is a sampled-followers snapshot the operator can re-run. We SHA-256 the snapshot dict on write; on read, if the caller supplies a snapshot whose hash differs from the one stored on the cached row, the row is invalidated automatically, with no manual refresh required.
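
The hash gate might look like the sketch below. The post only says the snapshot dict is hashed with SHA-256; serialising it as JSON with sorted keys (so that key order does not change the hash), the "snapshot_hash" field name, and both function names are assumptions.

```python
import hashlib
import json
from typing import Optional


def snapshot_hash(snapshot: dict) -> str:
    """SHA-256 over a deterministic JSON encoding of the snapshot dict."""
    canonical = json.dumps(snapshot, sort_keys=True,
                           separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_hit(cached_row: dict, caller_snapshot: Optional[dict]) -> bool:
    """Hit unless the caller supplies a snapshot whose hash differs from the row's."""
    if caller_snapshot is None:
        return True  # no snapshot supplied, so the hash gate does not apply
    return cached_row.get("snapshot_hash") == snapshot_hash(caller_snapshot)
```

Canonicalising before hashing matters: without sort_keys, two dicts with the same content but different insertion order would produce different hashes and cause spurious invalidations.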

Charging happens at the route, not the service

The service layer doesn't know about credits. The route layer calls charge() first (workspace pool → personal fallback), and only on success invokes the regenerate path. On AI Gateway 503 after the debit, the route refunds the workspace half immediately.

Why this split: the cache row should only exist when the operator has been billed and the LLM has returned valid content. Two-step commits would deadlock on retries; the debit-then-refund pattern is simpler and matches the way /api/scrape already worked.
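
The debit-then-refund flow can be sketched as follows. This is an illustration under stated assumptions, not the route code: the Ledger class, the exception names, handle_refresh, and the simplification that a charge comes entirely from one pool are all hypothetical. Only the ordering (charge, then regenerate, then refund the workspace side on a gateway failure) is taken from the post.

```python
class GatewayUnavailable(Exception):
    """Stand-in for an AI Gateway 503."""


class InsufficientCredits(Exception):
    pass


class Ledger:
    """Hypothetical credit ledger: workspace pool first, personal balance as fallback."""

    def __init__(self, workspace: int, personal: int):
        self.workspace = workspace
        self.personal = personal

    def charge(self, cost: int) -> str:
        if self.workspace >= cost:
            self.workspace -= cost
            return "workspace"
        if self.personal >= cost:
            self.personal -= cost
            return "personal"
        raise InsufficientCredits()

    def refund_workspace(self, cost: int) -> None:
        self.workspace += cost


def handle_refresh(ledger: Ledger, regenerate, key: str, cost: int = 1) -> str:
    source = ledger.charge(cost)   # debit first; raises before any LLM call
    try:
        # The service layer knows nothing about credits; it writes the
        # cache row only after the LLM has returned content.
        return regenerate(key)
    except GatewayUnavailable:
        if source == "workspace":
            ledger.refund_workspace(cost)  # the workspace side is refunded at once
        raise
```

Because the service is passed in as a plain callable, it can be tested and reused by the scheduler or an MCP tool without pulling in any billing logic.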

Effective LLM spend per workspace

Order-of-magnitude estimate, for a workspace that touches 50 KOL dossiers per week:

  • Pre-cache: 50 dossier opens × 2 LLM-driven cards on the page = ~100 LLM calls per week
  • Post-cache: ~100 first-time generations the first week (every card misses), then ~5-10 refreshes per week thereafter
  • Roughly 10x-20x fewer calls in the steady state (~100 vs ~5-10 per week); the first week is identical because everything misses

The deep-analysis tier (5 credits, opt-in button) doesn't change — that one is meant to be deliberate.


Frequently asked

Which features were uncached before this fix?
Two: cold outreach drafts (kolens-web/app/api/kols/[u]/outreach) and audience-insight interpretation (kolens-web/app/api/kols/[u]/audience/insights). Both lived on Vercel BFFs that called the AI Gateway directly via generateText() — no DB persistence layer. Operators hitting the 'Generate' button twice paid twice.
How is the locale gate implemented?
Every cache table stashes the generation locale inside the JSON payload. read_cached compares that string to the caller's locale; a mismatch counts as a cache miss. Rows that predate the locale-gate feature default to 'en' so existing English caches keep serving English users unchanged.
What about the audience-snapshot hash gate?
Audience insights is special — the input is a sampled-followers snapshot that can change. We SHA-256 the snapshot dict at write time and re-check at read time. Same locale + same snapshot bytes → cache hit. Same locale + different snapshot (e.g. you re-sampled) → automatic invalidation, no manual refresh.
Do you ever charge twice for the same cache row?
Never on cache hits. Only the regenerate / refresh paths charge. Refunds fire on the failure path (AI Gateway 503 after the debit) for the workspace side; personal-side refunds need the operator to claim because Supabase write paths from Railway are limited.
How is the cache row invalidated when I update my brand profile?
Honestly: it isn't yet — that's an explicit out-of-scope on PRs #255 and #256. For now the operator clicks Refresh on the dossier after a brand-profile edit. Cache-key-includes-profile-hash is a clean follow-up.
What's the 30-day TTL story?
  • Creator Overview: 30d. Themes are stable.
  • Background Research: 30d. Identity facts drift slowly.
  • Cold Outreach: 30d. The creator's profile doesn't change in a week.
  • Audience Insights: 30d. The hash gate handles the case where the snapshot did change.
  • Activity Summary: 24h. It explicitly tracks recent posts, so the window is shorter.
  • Deep Analysis: 7d. Negotiation / partnership insights drift faster.
