Multi-Agent Orchestration Platform

Architecture

Passive Pipeline — Email Ingest

Job Alert Emails (Gmail triggers) → Processor (Node.js on Hetzner) → Dedup (Notion DB query) → Create Entry (Notion API)

Active Pipeline — ATS Discovery

SearXNG (Layer 1 · self-hosted) → Brave Search (Layer 2 · fallback) → URL Match (any ATS platform) → Dedup (Halt-on-failure) → Create Entry (Notion API)
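The URL Match step can be illustrated with a minimal sketch. It assumes that "matching" means the normalized company name appears in the hostname or path of a search result. The helper names (`normalize`, `urlMatchesCompany`, `pickJobUrl`) and the normalization rule are hypothetical, not taken from the processor.

```javascript
// Hypothetical helper: lower-case and strip everything except letters and digits.
function normalize(text) {
  return text.toLowerCase().replace(/[^a-z0-9]/g, '');
}

// Assumption: a search result "matches" when the normalized company name
// appears in the hostname or path of the result URL.
function urlMatchesCompany(resultUrl, company) {
  let parsed;
  try {
    parsed = new URL(resultUrl);
  } catch {
    return false; // not a valid absolute URL
  }
  const needle = normalize(company);
  if (needle.length === 0) return false;
  return normalize(parsed.hostname + parsed.pathname).includes(needle);
}

// Pick the first matching URL from an ordered list of search results.
function pickJobUrl(results, company) {
  return results.find((url) => urlMatchesCompany(url, company)) ?? null;
}
```

Because the check does not depend on a platform whitelist, a result from any ATS host is accepted as long as the company name shows up in the URL.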

JD Enrichment — Multi-Fallback Fetch

markdown.new (Primary) → defuddle.md (Fallback 1) → r.jina.ai (Fallback 2) → Scrapling (Fallback 3) → Write JD (Notion page)
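A fallback chain like this one can be sketched as a loop over ordered fetchers. The function name `fetchWithFallbacks`, the `{ name, run }` shape, and the minimum-length heuristic are assumptions for illustration; the real calls to each service are left as stubs the caller supplies.

```javascript
// Try each fetcher in order and return the first usable result.
// `fetchers` is an ordered list of { name, run } objects, where `run`
// is an async function (url) => string. All names here are hypothetical.
async function fetchWithFallbacks(url, fetchers, minLength = 200) {
  const errors = [];
  for (const { name, run } of fetchers) {
    try {
      const text = await run(url);
      // Assumption: very short responses are block pages or stubs, not a JD.
      if (typeof text === 'string' && text.trim().length >= minLength) {
        return { ok: true, source: name, text };
      }
      errors.push(`${name}: response too short`);
    } catch (err) {
      errors.push(`${name}: ${err.message}`);
    }
  }
  // Every layer failed: report why, so the caller can leave the item unenriched.
  return { ok: false, errors };
}
```

Returning `{ ok, ... }` instead of throwing keeps the caller's decision explicit: only a successful result should lead to the Write JD step.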

5-Stage Cron Pipeline — Per Batch (2 active: Noon + Evening)

Intake (SKILL-intake.md) → 30 min → Enrich (SKILL-enrich.md) → 15 min → Triage-1 (Filter · hard rules) → 15 min → Triage-2 (Assess · rules 1–13) → 15 min → Triage-3 (Decide · Telegram)
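To make the stagger concrete, here is a small sketch that derives stage start times from a batch start time. It assumes the durations in the pipeline are the gaps before the next stage begins; `STAGES` and `stageSchedule` are hypothetical names, and the real cron expressions are not shown in the source.

```javascript
// Stage order and the assumed gap (in minutes) before the next stage starts.
const STAGES = [
  { name: 'Intake', gapAfterMin: 30 },
  { name: 'Enrich', gapAfterMin: 15 },
  { name: 'Triage-1', gapAfterMin: 15 },
  { name: 'Triage-2', gapAfterMin: 15 },
  { name: 'Triage-3', gapAfterMin: 0 },
];

// Given a batch start time as "HH:MM", return the start time of each stage.
function stageSchedule(batchStart) {
  const [h, m] = batchStart.split(':').map(Number);
  let minutes = h * 60 + m;
  return STAGES.map(({ name, gapAfterMin }) => {
    const hh = String(Math.floor(minutes / 60) % 24).padStart(2, '0');
    const mm = String(minutes % 60).padStart(2, '0');
    minutes += gapAfterMin;
    return { stage: name, startsAt: `${hh}:${mm}` };
  });
}

console.log(stageSchedule('12:00').map((s) => `${s.stage} ${s.startsAt}`).join(', '));
// Intake 12:00, Enrich 12:30, Triage-1 12:45, Triage-2 13:00, Triage-3 13:15
```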

Agents

Agent 1 — OpenClaw (Kimi K2.5)

Kimi K2.5 (Moonshot AI) · OpenAI fallback · Self-hosted via OpenClaw gateway

Autonomous triage agent. Reads per-stage reference files (SKILL-intake.md, SKILL-enrich.md, SKILL-triage-1-filter.md, SKILL-triage-2-assess.md, SKILL-triage-3-decide.md), which were split from the original 2,300-line SKILL.md to prevent context window overflow. Applies 13 calibration rules, scores incoming roles, and updates the Notion job tracker. Four batches are defined (5:30, 12:00, 16:00, 20:00); two are currently active (Noon, Evening). Each batch runs 5 pipeline stages staggered 15–30 min apart. Triage refuses to process any item where JD Enriched ≠ true.
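The JD Enriched gate can be expressed as a strict filter in front of triage. This is a sketch under the assumption that each item is a plain object with a boolean `jdEnriched` field mirroring the Notion property; the field and function names are hypothetical.

```javascript
// Split candidate items into those triage may process and those it must skip.
// Assumption: `jdEnriched` mirrors the "JD Enriched" property in Notion.
function gateByEnrichment(items) {
  const ready = [];
  const skipped = [];
  for (const item of items) {
    // Strict comparison: missing, null, or the string "true" all count as not enriched.
    if (item.jdEnriched === true) ready.push(item);
    else skipped.push(item);
  }
  return { ready, skipped };
}
```

The strict `=== true` comparison is deliberate: anything ambiguous is treated as unenriched, so the agent never scores a role on partial data.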

Agent 2 — Claude

Claude Code · Anthropic API

Handles strategic work — CV adaptation, cover letter writing, interview prep, pipeline decisions. Reads the Agent Coordination Board at session start for context handoff.

Agent 3 — Claude Code on Server

Claude Code CLI · Anthropic API · CatClaw VPS

Infrastructure agent. Debugs OpenClaw config, maintains SKILL.md and the split reference files, monitors service health, and runs pipeline diagnostics. Uses OPENCLAW-PIPELINE.md as the shared state document across all three agents.

Coordination: Shared Notion board for async communication. Structured message format (agent name, timestamp, task/question). No human relay needed for routine operations. Luka intervenes only for judgment calls tagged 🔴. OPENCLAW-PIPELINE.md serves as shared state across all three agents — architecture map, key IDs, known-good states, and fixes log. All agents can read and update it. It references SKILL.md sections rather than duplicating them.
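As an illustration of the structured message format, the sketch below builds and parses one board message carrying the agent name, a timestamp, the task or question, and the 🔴 flag. The single-line layout and the names `formatBoardMessage` and `parseBoardMessage` are assumptions; the actual board format is not shown in the source.

```javascript
// Build one board message. Field names and the line layout are assumptions.
function formatBoardMessage({ agent, body, needsHuman = false, at = new Date() }) {
  if (!agent || !body) throw new Error('agent and body are required');
  const flag = needsHuman ? '🔴 ' : '';
  return `${flag}[${agent}] ${at.toISOString()} | ${body}`;
}

// Parse a message produced by formatBoardMessage back into its fields.
function parseBoardMessage(line) {
  const match = /^(🔴 )?\[([^\]]+)\] (\S+) \| ([\s\S]*)$/u.exec(line);
  if (!match) return null; // not in the expected format
  return {
    needsHuman: Boolean(match[1]),
    agent: match[2],
    at: new Date(match[3]),
    body: match[4],
  };
}
```

A fixed, parseable layout lets each agent filter the board at session start, for example to surface only the 🔴 items that need a human decision.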

Production incident: zombie cron race condition (April 23, 2026)

🔴 INCIDENT — 2026-04-23 Zombie Cron Race Condition

WHAT HAPPENED
When triage was decomposed from SKILL-triage.md into 3
sub-stages, new cron jobs were created and the legacy monolithic
triage job was set to enabled: false in jobs.json. It was not
deleted.

OpenClaw's scheduler doesn't reliably respect enabled: false
when state.nextRunAtMs is still set in the job definition. The
scheduler saw a valid next-run timestamp and fired the job anyway.

RESULT
Legacy triage ran IN PARALLEL with the new 3-stage pipeline
every evening. The legacy job shortlisted a MYTRAFFIC "GTM
Engineer M/F" role at ⭐⭐⭐⭐⭐. The new pipeline would have
eliminated it on the hard language filter ("fluent in French"),
but the legacy job wrote its verdict to Notion first. The item was
already marked Decided before triage-1 (filter) reached it.

A second zombie (legacy enrichment) was also firing, duplicating
work with the active enrichment stage.

DETECTION
Manual audit: MYTRAFFIC shortlisted with no language filter note.
Cross-referencing cron logs showed two triage processes writing
to the same Notion entries within the same time window.

ROOT CAUSE
// ❌ What we did (unreliable):
{
  "name": "[LEGACY] Evening Triage (9:00 PM)",
  "enabled": false,           // scheduler ignores this
  "state": {
    "nextRunAtMs": 1745528400000  // ...but honors this
  }
}

// ✅ What works:
// Job deleted from jobs.json entirely.
// No entry = no execution. No ambiguity.

FIX
Deleted both legacy jobs from jobs.json. Set MYTRAFFIC to
Eliminated with explanation of the false positive.

LESSON
"Delete, don't disable." In cron systems where job state persists
alongside config, a disabled flag can be silently overridden by
stale scheduler state. The only reliable decommission is removal.
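A simple audit can catch this class of problem before it fires. The sketch below flags jobs that are disabled but still carry a next-run timestamp. It assumes the parsed jobs.json yields an array of job objects shaped like the snippet above; the function name `findZombieCandidates` is hypothetical and is not part of OpenClaw.

```javascript
// Return the names of jobs that are disabled but still hold a next-run timestamp.
// Assumption: each job looks like { name, enabled, state: { nextRunAtMs } }.
function findZombieCandidates(jobs) {
  return jobs
    .filter((job) => job.enabled === false && Number.isFinite(job.state?.nextRunAtMs))
    .map((job) => job.name);
}
```

Running a check like this after every config change turns "delete, don't disable" from a convention into something that can fail loudly.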

Key Engineering Decisions

Halt-on-failure dedup: If the Notion query fails, skip entry creation entirely — never assume no duplicates exist

Halt-on-failure dedup (agentmail-job-processor.js)

// Load dedup data — HALT if either query fails
const urlResult = await loadNotionUrls();
const comboResult = await loadNotionCompanyTitles();

if (!urlResult.ok || !comboResult.ok) {
  const errors = [];
  if (!urlResult.ok) errors.push(`URL dedup: ${urlResult.error}`);
  if (!comboResult.ok) errors.push(`Company+Title dedup: ${comboResult.error}`);
  console.error(`\n🛑 DEDUP QUERY FAILED — HALTING PIPELINE`);
  console.error(`Errors: ${errors.join('; ')}`);
  console.error(`Reason: Cannot proceed without dedup — would create duplicate entries.`);
  process.exit(1);
}
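The halt above only guards the loading step. Once both lookups succeed, the duplicate check itself might look like the following sketch. The key-building rules and the names `urlKey`, `comboKey`, and `isDuplicate` are assumptions, not taken from agentmail-job-processor.js.

```javascript
// Hypothetical key builder for URLs. Query string, fragment, and trailing
// slashes are dropped so tracking parameters do not defeat dedup.
function urlKey(url) {
  try {
    const u = new URL(url);
    return (u.hostname + u.pathname).toLowerCase().replace(/\/+$/, '');
  } catch {
    return url.trim().toLowerCase(); // fall back to the raw string
  }
}

// Hypothetical key builder for the Company+Title pair.
function comboKey(company, title) {
  const clean = (s) => s.toLowerCase().replace(/\s+/g, ' ').trim();
  return `${clean(company)}|${clean(title)}`;
}

// True when the entry matches a known URL or a known Company+Title pair.
// `knownUrls` and `knownCombos` are Sets of keys built with the helpers above.
function isDuplicate(entry, knownUrls, knownCombos) {
  return knownUrls.has(urlKey(entry.url)) || knownCombos.has(comboKey(entry.company, entry.title));
}
```

Dropping the query string is a trade-off: it catches the same posting shared with different tracking parameters, but it would merge distinct jobs on sites that carry the job ID in a query parameter, so a real implementation may need per-site rules.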

Multi-fallback JD fetching: 4-layer chain plus alternative source search for LinkedIn-blocked URLs
5-stage pipeline split: The monolithic 2,300-line SKILL.md caused context window overflow, and Kimi lost critical rules mid-run. It was split into 5 focused files: intake, enrich, and 3 triage sub-stages (filter, assess, decide). Each stage runs independently; state handoff uses Notion, not memory. Triage was further decomposed so that an individual phase can be retried without re-running all of triage.
JD Enriched checkpoint: Triage refuses to process any item where JD Enriched ≠ true in Notion. Enrichment sets this flag only after successfully writing the full JD to the Notion page body. Ensures the agent never scores a role it hasn't read.
Title blocklist (pre-Notion filter): The email processor applies a categorised TITLE_BLOCKLIST ({ pattern, label }) at extraction time, before entries ever reach Notion. Categories: "too senior", "engineering", "French title", "wrong domain". This acts as a fourth dedup/filter layer, upstream of the entire pipeline.
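A categorised blocklist of this shape could be sketched as follows. The four labels come from the text above; the regular expressions are illustrative assumptions only, and `blockReason` is a hypothetical helper name.

```javascript
// Illustrative entries only: the labels come from the text, the patterns are assumptions.
const TITLE_BLOCKLIST = [
  { pattern: /\b(vp|vice president|director|head of)\b/i, label: 'too senior' },
  { pattern: /\b(software|backend|frontend|devops) engineer\b/i, label: 'engineering' },
  { pattern: /\b(responsable|stagiaire|alternance)\b/i, label: 'French title' },
  { pattern: /\b(nurse|accountant)\b/i, label: 'wrong domain' },
];

// Return the label of the first matching rule, or null when the title passes.
function blockReason(title) {
  const hit = TITLE_BLOCKLIST.find(({ pattern }) => pattern.test(title));
  return hit ? hit.label : null;
}
```

Keeping a `label` next to each `pattern` means every rejection can be logged with its category, which makes it easier to spot a rule that is blocking too much.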

Evolving calibration rules: Started with 4 rules; there are now 13, each added in response to a real triage error (a false positive or a false negative)

Example calibration rule from SKILL.md

**Rule 6: Apply the "What Would Luka Actually DO All Day?" Test**

Before triaging, mentally simulate a typical day in the role.
If the answer is "build automations, connect APIs, deploy AI
workflows, improve internal tools" — that's a match regardless
of title. If the answer is "write SQL queries all day, build
dashboards in Looker, run A/B tests in Amplitude" — that's a
miss. The signal is in the verbs and tools in the
responsibilities section, not in the title or team name.

Provider-agnostic: Agent 1 uses Kimi with OpenAI fallback; Agent 2 uses Claude — designed to work across providers
Self-hosted gateway: OpenClaw on Hetzner VPS provides unified API routing with provider failover

Metrics

20+ items processed / day
700+ items processed, 3 duplicates
13 calibration rules, evolved from 4
3 coordinating agents
21 cron jobs defined (11 active)
Autonomous with human-in-the-loop

Tech Stack

Claude Code · Anthropic API · Kimi K2.5 (Moonshot) · OpenAI API · OpenClaw · Node.js · Bash · Python · Notion API · Brave Search API · Hetzner VPS · Cloudflare Tunnel · Cron Scheduling · SearXNG · Docker · systemd

Engineering Patterns

The domain is personal automation, but the architecture patterns are general-purpose:

Signal ingestion + dedup pipeline — customer support routing, lead processing, content moderation
Multi-agent async coordination — any system where specialized agents share state without blocking
Calibration rules with human-in-the-loop — fraud detection tuning, lead scoring, content recommendation
Multi-fallback data enrichment — CRM enrichment, competitive intelligence, news monitoring
Provider-agnostic AI routing — production resilience across LLM providers
Context-aware pipeline decomposition — splitting monolithic instruction files into stage-specific reference docs to fit within LLM context windows, with state handoff via database rather than memory

Changelog

The system evolves through real failures — each rule and safeguard exists because something broke in production.

Date	Change	Why
2026-04-26	Rule 13: Function-Modified Ops cap	CX Ops role shortlisted despite automation being secondary to departmental outcomes. Roles where "Ops" follows a department name now capped at ⭐⭐⭐ unless automation IS the function.
2026-04-23	Triage decomposed into 3 sub-stages	Monolithic triage caused context overflow and made it impossible to retry individual phases. Splitting into filter → assess → decide allows targeted re-runs and cleaner failure modes.
2026-04-23	Legacy cron zombies deleted	Disabled jobs still executing due to stale nextRunAtMs timestamps — caused a race condition that shortlisted a role violating language rules.
2026-03-30	SearXNG replaces Brave as primary JD search	Brave-only + ATS-only filter caused 100% failure when companies weren't on 5 whitelisted platforms. SearXNG (self-hosted) is now Layer 1; Brave is fallback. Any URL matching company name is accepted.
2026-03-30	Halt-on-failure dedup	Notion query failures silently let duplicates through. Processor now exits entirely if dedup queries can't be verified — never assumes the pipeline is clean.
2026-03-27	Full autonomy — human-in-the-loop only for 🔴 decisions	Moved from semi-manual triage to 4-batch cron with Telegram summaries. Luka reviews only items escalated with 🔴 urgency.