<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Kestrel Labs</title>
    <link>https://kestrellabshq.com</link>
    <description>Notes, writeups, and updates from Kestrel Labs.</description>
    <language>en</language>
    
<item>
  <title>Start here</title>
  <link>https://kestrellabshq.com/blog/start-here</link>
  <guid>https://kestrellabshq.com/blog/start-here</guid>
  <pubDate>Thu, 19 Mar 2026 00:00:00 GMT</pubDate>
  <description>What this blog is for, who it’s for, and the kind of work Kestrel Labs tries to do.</description>
  <content:encoded><![CDATA[<p>This is a place for practical notes from real projects—what held up, what didn’t, and what I’d do again.</p>
<p>If you’re responsible for a business website, an internal tool, or the systems behind them, the goal is to share patterns that reduce friction and help things run more calmly.</p>
<h2>What you’ll find here</h2>
<ul>
<li>Short after-action notes (what changed, what improved)</li>
<li>Tradeoffs explained plainly (not as ideology)</li>
<li>Checklists and small playbooks you can reuse</li>
</ul>
<h2>What you won’t find</h2>
<ul>
<li>Inflated claims</li>
<li>Cargo-cult best practices</li>
<li>Process for its own sake</li>
</ul>]]></content:encoded>
</item>

<item>
  <title>Deterministic LLM gating (Part 2): a harness for contract-bound outputs</title>
  <link>https://kestrellabshq.com/blog/kestrel-evals-deterministic-gating</link>
  <guid>https://kestrellabshq.com/blog/kestrel-evals-deterministic-gating</guid>
  <pubDate>Thu, 19 Mar 2026 00:00:00 GMT</pubDate>
  <description>How the initial kestrel-evals implementation works: YAML suites, provider calls, deterministic checks, baseline reports, and CI gating.</description>
  <content:encoded><![CDATA[<p>Part 1 made the sequencing argument: when an LLM output is contract-bound, deterministic checks should come before subjective judging. This post is the practical follow-through: what the current <code>kestrel-evals</code> implementation looks like, where it is strongest, and what it does not try to be.</p>
<h2>BLUF</h2>
<p>If an LLM feature depends on a structured output contract, that contract should be testable in the same operational way as ordinary software: with deterministic checks, repeatable suites, and CI failure when the output breaks.</p>
<p>That is the purpose of <code>kestrel-evals</code>.</p>
<p>The project is intentionally narrow. It is not trying to solve the entire LLM evaluation problem. It is trying to solve one common and operationally important part of it well: making contract-bound LLM behaviors observable, repeatable, and gateable.</p>
<h2>Abstract</h2>
<p><code>kestrel-evals</code> is a small evaluation harness for LLM-powered systems where correctness is partly or largely mechanical. Rather than beginning with rubric scoring, model-vs-model judging, or broad subjective grading, it starts from a simpler premise: if an LLM feature is expected to produce machine-consumable output, then the first layer of evaluation should verify that the output actually obeys its contract.</p>
<p>In practice, that means checking whether the output is valid JSON, whether required keys exist, whether values remain inside a controlled vocabulary, and whether formatting guarantees are preserved. The harness takes YAML-defined suites, executes prompts against a provider, applies deterministic checks, emits a JSON report, and exits non-zero on failure so the result can gate CI.</p>
<p>The current implementation is intentionally small. It is not presented as a universal benchmark platform or a general theory of LLM quality. It is a working implementation of a narrower claim: for contract-bound LLM outputs, deterministic gating is often the most useful place to start.</p>
<h2>System design and implementation</h2>
<p>At a high level, <code>kestrel-evals</code> has four moving parts: a suite definition, a provider call, deterministic checks, and a report plus exit status. That may sound almost trivial, but that simplicity is part of the point. The harness is meant to be easy to inspect, easy to reason about, and easy to run.</p>
<p>The suite itself is authored in YAML. Each suite defines a name, description, and list of cases. Each case defines a system prompt, user prompt, and one or more deterministic checks. This keeps the test specification close to the behavior being evaluated. The suite file acts as both the test definition and the documentation of the expected contract.</p>
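<p>For illustration, a suite in that format might look roughly like the following. The field names are a plausible sketch based on the description above, not a verbatim copy of the project’s schema:</p>
<pre><code class="language-yaml"># Hypothetical suite file -- key names are illustrative, not the exact schema
name: Sales lead intake extraction
description: Extract a strict JSON payload from inbound lead emails.
cases:
  - id: basic_new_site
    system: |
      Extract lead details as raw JSON only. Use empty strings for unknown
      values. Do not wrap the output in markdown or code fences.
    user: |
      Hi, we need a redesign of our marketing site. -- Dana, dana@example.com
    checks:
      - type: json_schema
        schema:
          type: object
          required: [name, email, services]
      - type: allowed_values
        field: services
        values: [website_design, website_maintenance]
      - type: regex
        pattern: "dana@example\\.com"
</code></pre>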
<p>On execution, the runner loads the suite, iterates through the cases, calls the configured provider, captures raw output, applies the requested checks, and accumulates a JSON report. The provider abstraction is intentionally minimal: it exposes a simple <code>generate(model, system, user) -> str</code> interface.</p>
<p>The current implementation is <code>OpenAIProvider</code>, but the abstraction already isolates vendor-specific API code from suite logic and check execution. That separation matters if the project later expands to additional providers.</p>
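<p>In sketch form, that abstraction can be as small as an abstract base class. The <code>LLMProvider</code> name and <code>generate</code> signature come from the description above; the echo stub is a hypothetical addition for exercising the harness offline:</p>
<pre><code class="language-python">from abc import ABC, abstractmethod

class LLMProvider(ABC):
    """Minimal provider interface: vendor-specific code stays behind generate()."""

    @abstractmethod
    def generate(self, model: str, system: str, user: str) -> str:
        """Return the raw output text for one system/user prompt pair."""

class EchoProvider(LLMProvider):
    """A stand-in provider: echoes the user prompt back without any API call."""

    def generate(self, model: str, system: str, user: str) -> str:
        return user  # no network; useful for testing suite and check logic

provider = EchoProvider()
raw_output = provider.generate("dummy-model", "You are an extractor.", '{"ok": true}')
</code></pre>
<p>Because the runner only ever sees <code>generate</code>, adding a second backend later means writing one new class, not touching suite or check logic.</p>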
<p>The checks themselves are deliberately limited. The current implementation supports a small subset of JSON Schema validation, an <code>allowed_values</code> check for controlled vocabularies, and a <code>regex</code> check against raw output text. These are enough to validate a surprisingly wide range of contract-bound behaviors without pulling in heavier dependencies or turning the project into a large framework too early.</p>
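<p>A minimal sketch of the two simpler checks, using only the standard library. This illustrates the idea rather than reproducing the project’s <code>checks.py</code>:</p>
<pre><code class="language-python">import json
import re

def check_allowed_values(output: str, field: str, allowed: set) -> bool:
    """Parse the raw output as JSON and verify every entry in the given
    field stays inside the controlled vocabulary."""
    try:
        payload = json.loads(output)
    except json.JSONDecodeError:
        return False  # invalid JSON already breaks the contract
    values = payload.get(field)
    return isinstance(values, list) and all(v in allowed for v in values)

def check_regex(output: str, pattern: str) -> bool:
    """Match a regular expression against the raw output text."""
    return re.search(pattern, output) is not None

vocab = {"website_design", "website_maintenance", "devops"}
raw = '{"services": ["website_design", "devops"]}'
passed = check_allowed_values(raw, "services", vocab)  # True for this payload
</code></pre>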
<p>The result is a system that is simple enough to act like a normal test runner. The CLI loads a suite, runs it, writes a report, and exits with code <code>1</code> if any case fails. That final behavior is what makes the project useful as a gate rather than merely as a reporting utility.</p>
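<p>The report-and-exit step is the least exotic part of the design. A sketch of the idea (illustrative, not the project’s actual <code>cli.py</code>):</p>
<pre><code class="language-python">import json

def write_report_and_exit_code(results, report_path="report.json") -> int:
    """Write the accumulated case results as a JSON report and choose the
    exit code: 0 when every case passed, 1 as soon as any case failed."""
    failed = sum(1 for r in results if not r["passed"])
    report = {
        "total": len(results),
        "passed": len(results) - failed,
        "failed": failed,
        "cases": results,
    }
    with open(report_path, "w") as f:
        json.dump(report, f, indent=2)
    return 0 if failed == 0 else 1

# In a CLI, sys.exit(write_report_and_exit_code(results)) is what turns
# the harness into a gate rather than merely a reporting utility.
</code></pre>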
<h3>Core files</h3>
<ul>
<li><code>src/kestrel_evals/suite_loader.py</code></li>
<li><code>src/kestrel_evals/models.py</code></li>
<li><code>src/kestrel_evals/providers/base.py</code></li>
<li><code>src/kestrel_evals/providers/openai_provider.py</code></li>
<li><code>src/kestrel_evals/checks.py</code></li>
<li><code>src/kestrel_evals/runner.py</code></li>
<li><code>src/kestrel_evals/cli.py</code></li>
</ul>
<h3>Architecture diagram</h3>
<pre><code class="language-mermaid">flowchart TD
    A[Suite YAML\nexamples/structured_extraction.yaml] --> B[suite_loader.py]
    B --> C[Validated EvalSuite models]
    C --> D[runner.py]
    D --> E[Provider abstraction\nLLMProvider]
    E --> F[OpenAIProvider]
    F --> G[LLM output_text]
    G --> H[checks.py]
    H --> H1[json_schema]
    H --> H2[allowed_values]
    H --> H3[regex]
    H1 --> I[Case results]
    H2 --> I
    H3 --> I
    I --> J[JSON report]
    J --> K[reports/report.json]
    J --> L[CI artifact upload]
    I --> M{Any failed?}
    M -- Yes --> N[CLI exits code 1]
    M -- No --> O[CLI exits code 0]
</code></pre>
<h2>Example case study: structured extraction</h2>
<p>The included example suite, <code>examples/structured_extraction.yaml</code>, models a common operational problem: extracting a consistent structured payload from messy inbound lead emails. This is a good representative case because it combines several typical LLM failure modes in a compact form. The inputs are semi-structured, some information is missing, wording varies, service names can drift into synonyms, and the downstream system expects a strict JSON payload rather than prose.</p>
<p>The contract enforced by the suite is intentionally strict. The output must be raw JSON only, must contain all required top-level fields, must use empty strings for unknown values, must emit <code>services</code> as an array, and must restrict that array to an approved vocabulary. In other words, the suite is not testing whether the model “roughly understood the email.” It is testing whether the model produced something the pipeline could safely trust.</p>
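<p>Concretely, a passing output under such a contract is a bare JSON object along these lines. Only <code>services</code> is named explicitly above; the other field names are illustrative:</p>
<pre><code class="language-json">{
  "name": "Dana Reyes",
  "email": "dana@example.com",
  "company": "",
  "services": ["website_design", "website_maintenance"],
  "notes": "Wants a redesign before Q3"
}
</code></pre>
<p>Wrapping this same object in code fences, or preceding it with a sentence of commentary, fails the suite even though the payload inside is identical.</p>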
<p>The current example suite includes ten cases. Together they cover clear extraction, sparse signature-block style emails, multi-service requests, budget/timeline extraction, AI/R&#x26;D language, devops terminology, security-related mapping, advisory work, and very short low-context inputs.</p>
<h3>Example suite summary table</h3>
<table>
<thead>
<tr>
<th>Item</th>
<th>Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>Suite name</td>
<td><code>Sales lead intake extraction</code></td>
</tr>
<tr>
<td>Purpose</td>
<td>Extract a consistent JSON payload from messy inbound lead emails</td>
</tr>
<tr>
<td>Cases</td>
<td>10</td>
</tr>
<tr>
<td>Model in baseline</td>
<td><code>gpt-4.1-mini</code></td>
</tr>
<tr>
<td>Contract style</td>
<td>Raw JSON + fixed top-level schema + controlled vocabulary</td>
</tr>
<tr>
<td>Primary checks</td>
<td><code>json_schema</code>, <code>allowed_values</code>, <code>regex</code></td>
</tr>
<tr>
<td>CI behavior</td>
<td>Fail on any case failure</td>
</tr>
</tbody>
</table>
<h3>Test case coverage table</h3>
<table>
<thead>
<tr>
<th>Case ID</th>
<th>Scenario covered</th>
<th>Main contract risk being tested</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>basic_new_site</code></td>
<td>clear website redesign request</td>
<td>base extraction + email pattern</td>
</tr>
<tr>
<td><code>signature_block</code></td>
<td>sparse request with contact info in signature</td>
<td>name/email extraction from loose formatting</td>
</tr>
<tr>
<td><code>website_upgrade_and_maintenance</code></td>
<td>multi-service request</td>
<td>multiple allowed service labels</td>
</tr>
<tr>
<td><code>cms_setup</code></td>
<td>field-like email with company name</td>
<td>CMS mapping + structured extraction</td>
</tr>
<tr>
<td><code>budget_and_timeline</code></td>
<td>request with explicit budget/timeline</td>
<td>numeric-ish text capture without schema drift</td>
</tr>
<tr>
<td><code>ai_rnd_and_evals</code></td>
<td>LLM feature + CI evaluation ask</td>
<td>AI/R&#x26;D label mapping</td>
</tr>
<tr>
<td><code>devops_platform</code></td>
<td>CI/CD + observability</td>
<td>controlled vocab for platform work</td>
</tr>
<tr>
<td><code>security_review</code></td>
<td>security review + auth hardening</td>
<td>multi-label security mapping</td>
</tr>
<tr>
<td><code>fractional_cto</code></td>
<td>fractional CTO + architecture review</td>
<td>advisory/strategy label mapping</td>
</tr>
<tr>
<td><code>very_short</code></td>
<td>extremely terse lead</td>
<td>minimum viable extraction under low context</td>
</tr>
</tbody>
</table>
<p>One practical lesson from this case study is that strong models still need clear contract reinforcement when the output format is strict. Reliability improved not through exotic techniques, but through careful prompt and suite design: restating the exact JSON shape, insisting on no markdown or code fences, clarifying that unknowns should be empty strings rather than <code>null</code>, and adding explicit mapping guidance for controlled-vocabulary fields. That is exactly why deterministic gating is useful. It does not assume the model will “probably do the right thing.” It makes the contract visible and then tests whether the model actually honored it.</p>
<h2>Baseline results</h2>
<p>A checked-in baseline report is included at <code>baselines/structured_extraction-2026-03-19_gpt-4.1-mini.json</code>. That baseline matters because it does two things at once. First, it shows that the suite is not hypothetical: it was executed against a real model and produced a passing result. Second, it creates a stable reference point for future comparison. Prompt changes, suite changes, provider changes, and model changes can all be evaluated against the same baseline.</p>
<h3>Baseline results table</h3>
<table>
<thead>
<tr>
<th>Metric</th>
<th>Result</th>
</tr>
</thead>
<tbody>
<tr>
<td>Baseline artifact</td>
<td><code>baselines/structured_extraction-2026-03-19_gpt-4.1-mini.json</code></td>
</tr>
<tr>
<td>Model</td>
<td><code>gpt-4.1-mini</code></td>
</tr>
<tr>
<td>Total cases</td>
<td>10</td>
</tr>
<tr>
<td>Passed</td>
<td>10</td>
</tr>
<tr>
<td>Failed</td>
<td>0</td>
</tr>
<tr>
<td>Pass rate</td>
<td>100%</td>
</tr>
<tr>
<td>Checks used</td>
<td><code>json_schema</code>, <code>allowed_values</code>, <code>regex</code></td>
</tr>
<tr>
<td>Intended use</td>
<td>local verification + CI gating</td>
</tr>
</tbody>
</table>
<p>This baseline should be read narrowly. It is one successful reference run for one suite on one model, not evidence of broad model-agnostic reliability. What it establishes is that the harness pattern works, that the suite is executable, and that a contract-bound workflow can be gated deterministically in practice.</p>
<h2>Design tradeoffs and limitations</h2>
<p>The project’s strongest qualities are also the source of its clearest limitations. By keeping the dependency surface small, the harness remains easy to inspect and easy to extend, but the current schema support is intentionally incomplete: enough for common contract enforcement, not a substitute for the full <code>jsonschema</code> ecosystem.</p>
<p>By enforcing strict raw-JSON output, the suite reflects real production constraints, but that strictness also means seemingly “close enough” outputs will still fail. A payload wrapped in code fences or preceded by commentary may be human-legible and still be unacceptable to the pipeline. This is not accidental harshness; it is a deliberate alignment between the evaluation and the downstream system’s expectations.</p>
<p>The current report format is similarly pragmatic rather than polished. JSON is excellent for CI artifacts and downstream tooling, but less ideal for human review than Markdown or HTML summaries would be. The current writeup can point to the baseline artifact and the suite definitions, but it should still be read as evidence of an initial implementation, not as proof that the harness has already solved broad LLM evaluation reliability.</p>
<p>There is also an evidentiary limit worth keeping explicit: the current repo most directly validates one pattern well, namely structured extraction with controlled vocabularies. The architecture is suitable for adjacent contract-bound problems, but the strongest direct support in the repository today is still this example case.</p>
<h2>Where the current implementation is strongest</h2>
<p>Today, <code>kestrel-evals</code> is strongest in settings where the output contract is explicit and the failure modes are mechanical. That includes structured extraction outputs, controlled-vocabulary classification, prompt regressions on machine-consumable responses, and CI gating for LLM features with hard output constraints.</p>
<p>The narrower and more contract-bound the output, the more credible this harness is today. That makes it a strong fit for things like lead intake extraction, support ticket routing labels, metadata extraction, internal categorization workflows, and JSON response formatting for downstream automations.</p>
<p>It is less compelling, at least in its current form, for open-ended conversational evaluation or nuanced quality grading, which require different methods and a broader evidence base.</p>
<h2>Future directions</h2>
<p>The next sensible expansions are concrete rather than abstract. The provider abstraction already makes room for additional backends, so the most obvious next step is to add Anthropic, Azure OpenAI, or local OpenAI-compatible endpoints and run the same suites across them without rewriting the evaluation logic.</p>
<p>Reporting is another clear next layer. The current JSON artifact is useful, but Markdown or HTML summaries would make inspection easier, especially if they include pass/fail diffs against a checked-in baseline and per-case change summaries between runs.</p>
<p>Dataset and baseline management is also an obvious place to deepen the project. Named baselines, versioned datasets, and explicit “current run versus baseline” views would make it easier to track reliability over time rather than treating each run as an isolated event.</p>
<p>Finally, rubric scoring may still be worth adding later, but only on top of a solid deterministic gate. That order matters: first confirm that the contract holds, and only then ask whether the output quality is good.</p>
<h2>Conclusion</h2>
<p><code>kestrel-evals</code> is a deliberately small tool solving a real problem: making a class of LLM regressions testable in the same operational style as conventional software tests.</p>
<p>Its central argument is simple. If an LLM feature depends on output structure, deterministic validation should be the first evaluation layer, and CI should be able to fail when that contract breaks.</p>
<p>That is the contribution of this initial implementation. It is not a grand unified evaluation platform. It is a compact, production-minded harness for one of the most common and immediately useful evaluation patterns: deterministic gating of contract-bound LLM outputs.</p>
<h3>So what?</h3>
<p>The practical implication is that teams do not have to choose between prompt engineering in the dark and building a giant evaluation platform. There is a useful middle ground.</p>
<p><code>kestrel-evals</code> shows that for contract-bound LLM features, a modest amount of structure goes a long way. Define the contract, codify the checks, run them locally, and gate CI on regressions. That does not make LLM behavior magically stable or universally reliable. What it does do is move one important class of prompt-dependent behavior closer to an ordinary software component: observable, testable, and easier to trust in production workflows.</p>]]></content:encoded>
</item>

<item>
  <title>Deterministic LLM gating (Part 1): why contract tests come first</title>
  <link>https://kestrellabshq.com/blog/deterministic-llm-gating</link>
  <guid>https://kestrellabshq.com/blog/deterministic-llm-gating</guid>
  <pubDate>Thu, 19 Mar 2026 00:00:00 GMT</pubDate>
  <description>If an LLM feature depends on structured output, validate the contract deterministically before you argue about quality.</description>
  <content:encoded><![CDATA[<h2>BLUF</h2>
<p>If an LLM feature depends on a structured output contract, that contract should be testable in the same operational way as ordinary software: with deterministic checks, repeatable suites, and CI failure when the output breaks.</p>
<p>That is the motivation behind <code>kestrel-evals</code>.</p>
<p>The project is intentionally narrow. It is not trying to solve the entire LLM evaluation problem. It is trying to solve one common and operationally important part of it well: making contract-bound LLM behaviors observable, repeatable, and gateable.</p>
<hr>
<h2>Introduction</h2>
<p>A large share of practical LLM failures do not first appear as subtle questions of quality. They appear as broken software behavior.</p>
<p>A downstream system may expect valid JSON, a fixed set of top-level keys, empty strings rather than <code>null</code>, or a list field that contains only approved labels. When those assumptions are violated, the integration fails long before anyone has a chance to debate whether the response was insightful or well written.</p>
<p>That observation is the starting point for this series. If an LLM output is part of a production workflow, then the output contract should be testable. And if it is testable, it should be possible to run those tests locally, in CI, and in a form that fails fast when the contract breaks.</p>
<p>This framing matters because the industry conversation around LLM evaluation often broadens too quickly. It is common to jump to questions such as whether an answer is helpful, whether one model outperforms another, or whether a judge model would rate the output highly.</p>
<p>Those are legitimate evaluation problems, but they are not always the first problems that matter. In many real systems, the first question is much less glamorous: did the model produce something the pipeline can safely consume?</p>
<hr>
<h2>Why deterministic gating comes first</h2>
<p>For a broad class of workflows, deterministic checks deserve to be the first evaluation layer because the earliest and most expensive failures are mechanical.</p>
<p>A model that returns markdown-wrapped JSON when the system expects raw JSON has failed, even if the payload inside is otherwise sensible. A classifier that emits a near-synonym instead of a valid controlled label has failed, even if a human would understand what it meant. A field that should be an array but arrives as a string has failed, regardless of whether the prose around it sounds competent.</p>
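<p>The first of those failures is trivially mechanical to catch. A sketch of such a check (an illustration of the idea, not code from <code>kestrel-evals</code>):</p>
<pre><code class="language-python">import json

def is_raw_json_object(output: str) -> bool:
    """Accept only a bare JSON object: no code fences, no leading
    commentary, no trailing text after the closing brace."""
    text = output.strip()
    if not text.startswith("{"):
        return False  # fenced or prefixed output fails immediately
    try:
        json.loads(text)  # also raises on trailing junk ("Extra data")
    except json.JSONDecodeError:
        return False
    return True

is_raw_json_object('{"label": "billing"}')        # contract holds
is_raw_json_object('Sure! {"label": "billing"}')  # contract broken
</code></pre>
<p>A failure here is unambiguous, which is exactly what makes it worth testing before any question of quality.</p>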
<p>These are not cosmetic issues. They are contract violations. If the LLM sits inside a pipeline for parsing, routing, storage, or automation, they are often the difference between a working system and a broken one.</p>
<p>Deterministic checks are well suited to these problems because they are both cheap and legible. A failed regex, missing key, or disallowed value is easy to understand. That clarity matters. Teams are far more likely to trust an evaluation harness when the failure mode is concrete and inspectable rather than buried inside an opaque scoring system.</p>
<p>It also makes CI integration natural. If the question is whether the contract held, then the harness can behave like any other test runner and return a failing exit code when the answer is no.</p>
<p>This does not make subjective evaluation unimportant. Rubric scoring, pairwise comparison, and model-as-judge approaches all have their place. The point is one of sequencing. It is usually a mistake to ask whether an answer is “good” before confirming that it is structurally valid.</p>
<p>Deterministic gating is therefore not a replacement for all other evaluation methods. It is the right first layer for systems whose outputs must obey explicit mechanical constraints.</p>
<hr>
<h2>Design goals and scope</h2>
<p>The design goal of <code>kestrel-evals</code> is deliberately constrained: define suites in a small, reviewable format; run them locally and in CI; check output contracts with deterministic logic; and fail quickly when those contracts are violated.</p>
<p>The project is not trying to solve all evaluation problems at once. It is trying to make one common and useful pattern reliable enough to support development and deployment workflows.</p>
<p>That focus is reflected in the current scope. The implementation supports YAML-defined suites, local execution, CI execution, deterministic checks, provider-backed generation through an OpenAI implementation, JSON report output, and non-zero exit behavior on failure.</p>
<p>It does not yet attempt to provide a full benchmarking platform, a dataset registry, a judge-model workflow, comprehensive JSON Schema support, or broad provider parity. Those limitations are intentional. The project is small on purpose.</p>
<hr>
<h2>Conclusion</h2>
<p><code>kestrel-evals</code> is a deliberately small tool solving a real problem: making a class of LLM regressions testable in the same operational style as conventional software tests.</p>
<p>Its central argument is simple. If an LLM feature depends on output structure, deterministic validation should be the first evaluation layer, and CI should be able to fail when that contract breaks.</p>
<h3>So what?</h3>
<p>The practical implication is that teams do not have to choose between prompt engineering in the dark and building a giant evaluation platform. There is a useful middle ground.</p>
<p>For contract-bound LLM features, a modest amount of structure goes a long way. Define the contract, codify the checks, run them locally, and gate CI on regressions.</p>
<p>In Part 2, I’ll walk through the current implementation: the suite format, the runner, the checks, the example structured extraction suite, and baseline results.</p>
<p><strong>Next:</strong> <a href="/blog/kestrel-evals-deterministic-gating">Deterministic LLM gating (Part 2): a harness for contract-bound outputs</a></p>
</item>

<item>
  <title>Hello, world</title>
  <link>https://kestrellabshq.com/blog/hello-world</link>
  <guid>https://kestrellabshq.com/blog/hello-world</guid>
  <pubDate>Thu, 19 Mar 2026 00:00:00 GMT</pubDate>
  <description>First post on the Kestrel Labs site.</description>
  <content:encoded><![CDATA[<p>This is the first <strong>Kestrel Labs</strong> post.</p>
<h2>Why a blog</h2>
<p>A place for:</p>
<ul>
<li>short technical notes</li>
<li>longer essays</li>
<li>project updates</li>
</ul>
<pre><code class="language-ts">export function hello(name: string) {
  return `Hello, ${name}`;
}
</code></pre>]]></content:encoded>
</item>
  </channel>
</rss>