Production AI Agents in Laravel — Reliable Loops with Prism and Laravel Workflow

Ship reliable Laravel AI agents in production. Wire Prism, Laravel Workflow, verifier loops, idempotent tools, Langfuse tracing, and a CI eval harness.

Steven Richardson · 18 min read

Building a demo agent is easy. Shipping one that survives a flaky model, a network blip, and a cold-deployed worker is the hard part. The Laravel ecosystem now has every piece you need — Prism for the LLM boundary, Laravel Workflow for durable orchestration, Langfuse for tracing — but nobody has wired them together end-to-end. This is that playbook for Laravel AI agents in production.

What you'll learn

  • Why a single Prism::text() call is not a production agent, and what to add around it
  • The verifier-loop pattern that converts a stochastic model into a deployable component
  • How to make tools idempotent so retries do not corrupt state
  • How Laravel Workflow gives you durable agent state that survives crashes and redeploys
  • How to add Langfuse tracing, structured-output validation, and a CI eval harness
  • Cost control patterns that keep token spend predictable

Why an agent isn't just a Prism call

When most teams ship their first AI feature, the code looks like this: a controller, a Prism::text() call, a tool or two, a return statement. It works in the happy path. Then production happens. The model returns malformed JSON. The OpenAI dashboard shows 503s. A worker SIGTERMs in the middle of a six-step plan. The user retries and a duplicate Stripe refund goes out.

A production agent is not a chat reply. It is a long-running, stateful, side-effecting process that needs to be correct under partial failure. That changes the design surface in five ways:

  1. Output is stochastic. Two identical prompts produce different responses. You cannot trust a single response — you must validate it.
  2. Tools have side effects. Calling refund_order twice is not the same as calling it once. Retries demand idempotency.
  3. Loops can run for minutes. A multi-step agent may take longer than a request timeout, longer than a queue worker's --max-time, or longer than a deploy window.
  4. Failures are silent. A model that "thinks" it succeeded but produced nonsense will not throw. You need verifiers and observability.
  5. Costs compound. Five steps cost five times more than one. Without budgeting, a runaway loop becomes an invoice.

If you have read the Laravel Prism getting-started guide and the follow-up on building tool-calling agents with Prism, you have the LLM-call primitive. This article is everything that wraps it. We will be intentionally opinionated: Prism for the model boundary, Laravel Workflow for the orchestration, Langfuse for traces, Pest for evaluation. Each of those choices has alternatives, but together they form a coherent stack with no glue code.

Two clarifying notes before we dive in. First, this is the Prism package — prism-php/prism — not Laravel's own first-party AI SDK. The two solve overlapping problems and the complete Laravel AI SDK guide covers when to pick which. For agents that need durable workflow state, Prism's lower-level loop control is currently easier to wrap. Second, by "agent" I mean a Prism loop with tools and a goal — not a multi-agent crew. Multi-agent orchestration is a strict superset of everything here.

The verifier loop — generate, validate, retry, surface

The single most useful pattern for reliable agents is the verifier loop. The shape is:

  1. Generate. Prism produces structured output describing the action it wants to take.
  2. Validate. A second, cheaper LLM call (or pure PHP rules) inspects the output against acceptance criteria.
  3. Retry. If validation fails, feed the failure reason back to the planner with a bounded retry counter.
  4. Surface. If retries exhaust, raise a typed exception that the workflow can handle deterministically.

The verifier is doing the job a unit test cannot do at runtime: scoring a non-deterministic output. It works because the validation problem is almost always easier than the generation problem. Asking a model "does this JSON satisfy the schema and reference real entities from the context?" is far cheaper and more reliable than asking it to produce that JSON in one shot.

Crucially, the verifier sees the full context — the user request, the proposed action, and any reference data — and returns a structured verdict, not free text. That verdict is the only thing your workflow acts on. You never act on the planner's first response directly.
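
Step 4 is the only part of the loop the later code samples do not show, so here is a minimal sketch of the typed exception; the class name and payload are our own choices:

namespace App\Agents\Exceptions;

use RuntimeException;

// Thrown when the planner/verifier loop exhausts its retries. The workflow
// catches this and routes to a deterministic fallback such as escalation.
class VerificationExhausted extends RuntimeException
{
    public function __construct(
        public readonly string $lastReason,
        public readonly int $attempts,
    ) {
        parent::__construct("Verifier rejected the plan {$attempts} times: {$lastReason}");
    }
}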

This pattern alone moves agents from "demo" to "deployable". The remaining sections are about making the loop durable, observable, and cheap.

Anatomy of a Prism agent in Laravel

Let's build a concrete example: a customer-service agent that updates an order's status when a customer reports a delivery issue. It must call the right tool, idempotently, with structured arguments.

Install the packages once:

composer require prism-php/prism
composer require laravel-workflow/laravel-workflow
composer require langfuse/langfuse-php

Then publish the Workflow migrations:

php artisan vendor:publish --tag=workflows
php artisan migrate

The minimal Prism call with a structured-output schema looks like this:

use Prism\Prism\Prism;
use Prism\Prism\Schema\ObjectSchema;
use Prism\Prism\Schema\StringSchema;
use Prism\Prism\Schema\EnumSchema;

$schema = new ObjectSchema(
    name: 'order_action',
    description: 'The action the agent wants to take on an order.',
    properties: [
        new EnumSchema('action', 'What to do', ['mark_delayed', 'request_proof', 'escalate']),
        new StringSchema('order_id', 'The Stripe-style order id, e.g. ord_abc'),
        new StringSchema('reason', 'Short justification, max 280 chars'),
    ],
    requiredFields: ['action', 'order_id', 'reason'],
);

$response = Prism::structured()
    ->using('anthropic', 'claude-sonnet-4-5')
    ->withSchema($schema)
    ->withSystemPrompt(view('prompts.order-agent')->render())
    ->withPrompt($customerMessage)
    ->withMaxSteps(5)
    ->withClientOptions(['timeout' => 30])
    ->asStructured();

$plan = $response->structured; // ['action' => ..., 'order_id' => ..., 'reason' => ...]

This is the generate step. Three things make it production-ready: the schema constrains the output, the timeout caps a hung provider, and withMaxSteps puts an upper bound on tool-calling depth. Without those three, you have a demo, not a building block.

The next thing the agent needs is a verifier. Run a second Prism call that only sees the proposed action and the source data, and returns a verdict object:

$verifierSchema = new ObjectSchema(
    name: 'verdict',
    description: 'Whether the proposed action is safe to execute.',
    properties: [
        new EnumSchema('decision', 'Verdict', ['approve', 'revise', 'reject']),
        new StringSchema('reason', 'Why'),
    ],
    requiredFields: ['decision', 'reason'],
);

$verdict = Prism::structured()
    ->using('anthropic', 'claude-haiku-4-5') // cheap, fast verifier
    ->withSchema($verifierSchema)
    ->withSystemPrompt('You are a verifier. Approve only if the action matches the customer message and the order id is present in the provided context.')
    ->withPrompt(json_encode([
        'message' => $customerMessage,
        'plan' => $plan,
        'order_context' => $orderContext,
    ]))
    ->asStructured();

The planner is a smart, expensive model. The verifier is a cheap one with a tightly scoped job. This asymmetry is deliberate — you can run the verifier multiple times for the cost of a single planner call.

Tool design — idempotent, side-effect-aware, deterministic where possible

If the verifier approves, the agent calls a tool. Tool design is where most agents fail at 3am, because the wrong abstraction makes retries dangerous. Three rules.

Tools must be idempotent on the caller's request id. Every tool that mutates state takes an idempotency_key parameter. The implementation looks up the key first; if the action has already been performed, it returns the prior result instead of repeating it.

use App\Models\AgentAction;
use App\Models\Order;
use Illuminate\Support\Facades\DB;
use Prism\Prism\Facades\Tool;

$markDelayed = Tool::as('mark_order_delayed')
    ->for('Marks an order as delayed in shipping. Idempotent on idempotency_key.')
    ->withStringParameter('order_id', 'Order id, e.g. ord_abc')
    ->withStringParameter('idempotency_key', 'Unique key per agent decision, opaque to the LLM')
    ->withStringParameter('reason', 'Customer-facing reason')
    ->using(function (string $order_id, string $idempotency_key, string $reason): string {
        $existing = AgentAction::where('idempotency_key', $idempotency_key)->first();

        if ($existing) {
            return "Already applied: {$existing->result}";
        }

        return DB::transaction(function () use ($order_id, $idempotency_key, $reason) {
            $order = Order::lockForUpdate()->findOrFail($order_id);
            $order->update(['status' => 'delayed', 'delay_reason' => $reason]);

            AgentAction::create([
                'idempotency_key' => $idempotency_key,
                'tool' => 'mark_order_delayed',
                'order_id' => $order_id,
                'result' => "marked {$order_id} delayed",
            ]);

            return "Order {$order_id} marked delayed.";
        });
    });

The idempotency key is generated by the workflow, not by the LLM. The model never picks the key — that would defeat the purpose. The workflow generates a deterministic key per decision (e.g. agent:{workflow_id}:step:{n}) and injects it into the tool itself rather than trusting the model to echo back the idempotency_key parameter declared above. The simplest way to do that is to capture the key in the tool's closure, as sketched below.
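
A sketch of that variant, assuming the transaction body from the tool above has been extracted into a hypothetical invokable MarkOrderDelayed class so it can be called with a key the model never sees:

use App\Agents\Tools\MarkOrderDelayed; // hypothetical invokable holding the transaction body above
use Prism\Prism\Facades\Tool;

// Generated inside the workflow/activity, never by the model.
$idempotencyKey = "agent:{$workflowId}:attempt:{$attempt}";

$markDelayed = Tool::as('mark_order_delayed')
    ->for('Marks an order as delayed in shipping.')
    ->withStringParameter('order_id', 'Order id, e.g. ord_abc')
    ->withStringParameter('reason', 'Customer-facing reason')
    ->using(function (string $order_id, string $reason) use ($idempotencyKey): string {
        // The model supplies order_id and reason; the key is captured from the closure.
        return (new MarkOrderDelayed())($order_id, $idempotencyKey, $reason);
    });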

Tools must be side-effect-aware. If a tool sends an email, charges a card, or fires a webhook, that has to be visible in the trace. Wrap external calls in their own logged event so a failed tool does not leave the system in a half-applied state.

Push determinism out where possible. The LLM should not multiply numbers. It should not pick a random tax rate. It should call calculate_tax on a deterministic PHP function. The fewer numerical decisions the LLM makes, the smaller your bug surface.
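
A sketch of what a deterministic tool can look like; the tool name, the numeric parameter helper, and the flat 20% rate are all illustrative:

use Prism\Prism\Facades\Tool;

$calculateTax = Tool::as('calculate_tax')
    ->for('Calculates VAT for an order subtotal. Same inputs always produce the same output.')
    ->withNumberParameter('subtotal_pence', 'Order subtotal in pence')
    ->using(function (int|float $subtotal_pence): string {
        // The arithmetic lives in PHP, not in the model. 20% is illustrative.
        $tax = (int) round($subtotal_pence * 0.20);

        return "VAT is {$tax} pence on a subtotal of {$subtotal_pence} pence.";
    });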

I have written more on the deterministic-tool philosophy in the piece on building tool-calling agents with Prism — that article is the foundation; this section adds the production hardening.

Orchestrating with Laravel Workflow — durable state across crashes

A single Prism call lives inside a single PHP process. As soon as your agent has multiple steps, you need state that survives:

  • a queue worker --max-time recycle
  • a deploy-triggered worker restart
  • a transient model 503 that you want to retry tomorrow, not now
  • a human approval that may take hours

This is what laravel-workflow/laravel-workflow is for. It models long-running processes as durable state machines. Each Activity is a single step that is checkpointed to the database; if the worker dies mid-step, another worker resumes from the last checkpoint.

A workflow that wraps the verifier loop looks like this:

namespace App\Workflows;

use Workflow\Workflow;
use Workflow\ActivityStub;

class CustomerIssueAgent extends Workflow
{
    public function execute(string $threadId, string $customerMessage): \Generator
    {
        $context = yield ActivityStub::make(LoadOrderContext::class, $threadId);

        for ($attempt = 1; $attempt <= 3; $attempt++) {
            $plan = yield ActivityStub::make(
                PlanAction::class,
                $customerMessage,
                $context,
                $previousFailure ?? null,
            );

            $verdict = yield ActivityStub::make(VerifyAction::class, $plan, $context);

            if ($verdict['decision'] === 'approve') {
                $idempotencyKey = "agent:{$this->workflowId()}:attempt:{$attempt}";

                return yield ActivityStub::make(
                    ExecuteAction::class,
                    $plan,
                    $idempotencyKey,
                );
            }

            $previousFailure = $verdict['reason'];
        }

        return yield ActivityStub::make(EscalateToHuman::class, $threadId, $previousFailure);
    }
}

Each ActivityStub::make() call is a checkpoint. The framework persists the inputs and outputs to the workflow_logs table. If the worker crashes after PlanAction but before VerifyAction, a new worker reads the logs, sees the plan was already produced, and resumes at the verify step — without re-running the planner. That single property is what makes durable workflow state non-negotiable for production agents.

The activities themselves are thin wrappers around your Prism calls and tool dispatches. Keep them small — one external side effect per activity — so checkpoints have meaningful boundaries. If you are new to background-processing fundamentals, the scaling Laravel queues in production guide covers the worker tuning that Workflow runs on top of.
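
To make "thin wrapper" concrete, here is a sketch of the VerifyAction activity from the workflow above, assuming laravel-workflow's Activity base class and reusing the verifier call from earlier:

namespace App\Workflows;

use Prism\Prism\Prism;
use Prism\Prism\Schema\EnumSchema;
use Prism\Prism\Schema\ObjectSchema;
use Prism\Prism\Schema\StringSchema;
use Workflow\Activity;

// One activity, one external call, one checkpointable boundary.
class VerifyAction extends Activity
{
    public function execute(array $plan, array $context): array
    {
        $schema = new ObjectSchema(
            name: 'verdict',
            description: 'Whether the proposed action is safe to execute.',
            properties: [
                new EnumSchema('decision', 'Verdict', ['approve', 'revise', 'reject']),
                new StringSchema('reason', 'Why'),
            ],
            requiredFields: ['decision', 'reason'],
        );

        $response = Prism::structured()
            ->using('anthropic', 'claude-haiku-4-5')
            ->withSchema($schema)
            ->withSystemPrompt('You are a verifier. Approve only if the action matches the provided context.')
            ->withPrompt(json_encode(['plan' => $plan, 'order_context' => $context]))
            ->withClientOptions(['timeout' => 30])
            ->asStructured();

        return $response->structured; // ['decision' => ..., 'reason' => ...]
    }
}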

Bounded retries and circuit breakers around the LLM

The model and its provider are the most failure-prone components in the system. Three layers of protection:

Per-activity retries with exponential backoff. Laravel Workflow lets each activity declare its retry policy:

use Workflow\Activity;

class PlanAction extends Activity
{
    public int $tries = 4;
    public array $backoff = [1, 5, 30, 120]; // seconds

    public function execute(string $customerMessage, array $context, ?string $previousFailure): array
    {
        // Prism call here
    }
}

That gives you retries on transient provider errors without touching the workflow itself.

A circuit breaker on the provider. When OpenAI returns 503s for two minutes straight, you want every agent to immediately fail over (to a backup provider, to a queued retry tomorrow, to a human). A cache-backed failure counter per provider is enough — the implementation is unimportant; the policy is. Count failures at the request level, not per workflow, so a single bad workflow does not trip the breaker for everyone.
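
A minimal sketch of such a breaker using Laravel's cache; the class name, threshold, and window are illustrative:

use Illuminate\Support\Facades\Cache;

class ProviderCircuitBreaker
{
    public function __construct(
        private string $provider,
        private int $maxFailures = 10,
        private int $windowSeconds = 120,
    ) {}

    public function isOpen(): bool
    {
        return Cache::get("breaker:{$this->provider}:failures", 0) >= $this->maxFailures;
    }

    public function recordFailure(): void
    {
        // The first failure seeds the counter with a TTL; later ones increment it.
        Cache::add("breaker:{$this->provider}:failures", 0, $this->windowSeconds);
        Cache::increment("breaker:{$this->provider}:failures");
    }

    public function recordSuccess(): void
    {
        Cache::forget("breaker:{$this->provider}:failures");
    }
}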

A per-workflow budget. Every workflow declares a maximum LLM cost. Each Prism call increments a counter on the workflow's stored state; if the counter exceeds the budget, the workflow short-circuits to escalation. This prevents one runaway customer from costing $40 in tokens.
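
A sketch of the budget check as it might run after each Prism call inside an activity; the per-million-token prices, the $2 cap, and the assumption that the response exposes promptTokens and completionTokens on its usage object are all illustrative:

use Illuminate\Support\Facades\Cache;

// Prices ($3 / $15 per million tokens) and the budget are assumptions.
$usage = $response->usage;
$costUsd = ($usage->promptTokens * 3 + $usage->completionTokens * 15) / 1_000_000;

$key = "agent:{$workflowId}:cost_microdollars";
Cache::add($key, 0, now()->addDay());
$spent = Cache::increment($key, (int) round($costUsd * 1_000_000));

if ($spent > 2_000_000) { // $2.00 cap for this workflow
    throw new \RuntimeException("LLM budget exceeded for workflow {$workflowId}");
}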

Together these three give you a deployable system. Without them, a provider blip in the early hours can mean either thousands of failed agent runs (no retries) or a $400 token bill (no breaker, no budget).

Structured output validation with Prism's schema-driven mode

LLMs hallucinate. Schemas catch that. Prism's structured() mode enforces a JSON Schema on the model's output and validates it before returning. If validation fails, Prism either retries internally (when the provider supports tool-style structured outputs) or throws a typed exception you can catch.

The pattern for production: define schemas as PHP classes, not inline. That gives you reuse across the planner, the verifier, and the test suite, and gives PHPStan something to type-check.

namespace App\Agents\Schemas;

use Prism\Prism\Schema\ObjectSchema;
use Prism\Prism\Schema\StringSchema;
use Prism\Prism\Schema\EnumSchema;

final class OrderActionSchema
{
    public static function make(): ObjectSchema
    {
        return new ObjectSchema(
            name: 'order_action',
            description: 'An action the agent wants to take on an order.',
            properties: [
                new EnumSchema('action', 'What to do', ['mark_delayed', 'request_proof', 'escalate']),
                new StringSchema('order_id', 'Order id'),
                new StringSchema('reason', 'Short justification'),
            ],
            requiredFields: ['action', 'order_id', 'reason'],
        );
    }
}

Two production tips. First, validate twice: the schema validates the shape, and a follow-up PHP rule validates the semantics (the order_id must exist in the database, the reason must be under 280 characters, etc.). Schema-only validation will happily accept a syntactically perfect plan that references an order that does not exist. Second, log every validation failure with the raw response — those logs are gold for prompt iteration.
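
A sketch of that second, semantic pass in plain PHP; the rule set is illustrative:

use App\Models\Order;
use Illuminate\Validation\ValidationException;

function assertPlanIsSane(array $plan): void
{
    // Shape is already guaranteed by the schema; these checks are about meaning.
    if (! Order::whereKey($plan['order_id'])->exists()) {
        throw ValidationException::withMessages([
            'order_id' => "Order {$plan['order_id']} does not exist.",
        ]);
    }

    if (mb_strlen($plan['reason']) > 280) {
        throw ValidationException::withMessages([
            'reason' => 'Reason must be 280 characters or fewer.',
        ]);
    }
}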

Observability — tracing every step with Langfuse and Prism

Without traces, agents are unobservable. The model takes a request and produces a response, the tool fires, something happens. When it goes wrong, you need to see the prompt, the raw model response, the tool inputs and outputs, the verifier's verdict, and the wall-clock duration of each step.

Langfuse is purpose-built for this and has first-class Prism integration. Wire it up by setting environment variables and registering a Prism middleware:

LANGFUSE_PUBLIC_KEY=pk-lf-...
LANGFUSE_SECRET_KEY=sk-lf-...
LANGFUSE_HOST=https://cloud.langfuse.com

// app/Providers/AppServiceProvider.php
use Langfuse\Prism\LangfuseObserver;
use Prism\Prism\Facades\Prism;

public function boot(): void
{
    Prism::observe(new LangfuseObserver(
        traceFactory: fn () => [
            'workflow_id' => app('current.workflow.id'),
            'user_id' => auth()->id(),
        ],
    ));
}

Every Prism call now emits a span: prompt, response, tool calls, token usage, latency, model name. Group spans by workflow_id and you have a full timeline of an agent run, complete with the inputs and outputs of every tool. Langfuse also computes per-trace cost in dollars, which feeds directly into the budget enforcement we set up earlier.

Pair this with one of Laravel's in-process profilers — see the Telescope, Debugbar, and Pulse comparison for which to choose — and you can correlate the Langfuse trace with database queries, queue dispatches, and Eloquent N+1s in the same span.

A nice consequence of doing this properly: when a customer says "the agent gave me the wrong refund", you can pull up the trace and see exactly which prompt, which response, which tool call, and which row in the database. That is the difference between debugging an LLM bug in twenty minutes and debugging it in two days.

Evaluating agents in CI — golden datasets and LLM-as-judge

Unit tests cannot tell you if your agent gives a good answer. They can only tell you that the code did not throw. To know whether your agent is correct, you need an eval harness — a set of fixtures (golden datasets) that exercises the agent, plus an automatic judge that scores each output.

A minimal Pest 3 setup looks like this. Put fixtures in tests/Fixtures/Agent/ — each one a JSON file with a prompt, context, and expected block.

// tests/Feature/AgentEvalTest.php
use App\Workflows\CustomerIssueAgent;
use Workflow\WorkflowStub;

dataset('agent_fixtures', function () {
    return collect(glob(base_path('tests/Fixtures/Agent/*.json')))
        ->map(fn ($path) => [json_decode(file_get_contents($path), true), basename($path)]);
});

it('produces an approved plan for golden fixtures', function (array $fixture, string $name) {
    $stub = WorkflowStub::make(CustomerIssueAgent::class);
    $stub->start($fixture['thread_id'], $fixture['customer_message']);

    while ($stub->running()) {
        usleep(100_000);
    }

    $output = $stub->output();
    expect($output['action'])->toBe($fixture['expected']['action']);
    expect($output['order_id'])->toBe($fixture['expected']['order_id']);
})->with('agent_fixtures');
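
A matching fixture, e.g. tests/Fixtures/Agent/delayed_parcel.json, can be as small as this; the field names are the ones the test reads, and the values are illustrative:

{
    "thread_id": "thread_123",
    "customer_message": "My parcel was due Tuesday and still has not arrived.",
    "expected": {
        "action": "mark_delayed",
        "order_id": "ord_abc"
    }
}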

For the qualitative part — "is this customer-facing reason actually empathetic and accurate?" — use an LLM-as-judge. The judge gets the input, the agent's output, and a rubric, and returns a 1-5 score plus a critique. You assert against the score.

it('writes a customer-facing reason that scores >= 4', function (array $fixture, string $name) {
    $output = runAgent($fixture); // helper

    $score = LlmJudge::score($fixture['customer_message'], $output['reason'], rubric: 'empathy_and_accuracy');

    expect($score)->toBeGreaterThanOrEqual(4);
})->with('agent_fixtures');

Run this as a nightly job, not on every push — judge calls cost real money. The GitHub Actions matrix-testing guide shows how to schedule and constrain this kind of long-running CI workflow. Architectural rules for the agent code itself — no model imports in activities, no tool registrations outside the agent module — are enforceable with Pest architecture testing.

The eval harness is what gives you the confidence to change a system prompt. Without it, every prompt edit is a leap of faith.

Cost control patterns

Token costs compound. Five-step agents cost five times more than one-shot calls, and the asymmetric pricing of input vs output tokens means output-heavy steps are disproportionately expensive. Five concrete patterns:

The cheapest model that works wins. Run the planner on a frontier model, the verifier on a small fast model. A Haiku-class verifier costs roughly a tenth of a Sonnet-class planner per call.

Trim context aggressively. Your system prompt may contain ten examples; if eight of them are never matched, drop them. Use vector similarity search to retrieve only the few most-relevant examples per request. This single change has cut 30-50% off token bills in real systems I have run.

Cache identical prompts. Laravel's cache layer is fine for this — key by a hash of (system, user, schema, model) and TTL by minutes-to-hours depending on data freshness. Anthropic's prompt caching is a bigger lever still and Prism supports it natively when the provider does.
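
A sketch of that cache wrapper, keyed on a hash of everything that affects the output; the key shape and TTL are illustrative:

use Illuminate\Support\Facades\Cache;
use Prism\Prism\Prism;

$cacheKey = 'prism:' . sha1(json_encode([
    'system' => $systemPrompt,
    'user' => $customerMessage,
    'schema' => 'order_action',
    'model' => 'claude-sonnet-4-5',
]));

$plan = Cache::remember($cacheKey, now()->addMinutes(30), function () use ($schema, $systemPrompt, $customerMessage) {
    return Prism::structured()
        ->using('anthropic', 'claude-sonnet-4-5')
        ->withSchema($schema)
        ->withSystemPrompt($systemPrompt)
        ->withPrompt($customerMessage)
        ->asStructured()
        ->structured;
});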

Set a per-trace budget in dollars and short-circuit when exceeded. We covered this above; it is the only safety net that catches a bug during an outage rather than on the next invoice.

Track cost in the same span as latency. If a step is slow and expensive, that is the first place to optimise.

Testing this

The testing pyramid for an agent has four layers, each with a different signal.

Unit tests target the activities and the tools — pure PHP, mockable, fast. Test idempotency by calling the tool twice with the same key and asserting the side effect happens exactly once. Test schema validation by feeding deliberately malformed responses through the parser.
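
A sketch of that idempotency test, assuming an Order model factory and the hypothetical invokable MarkOrderDelayed class from the tool section so the body can be called without going through Prism:

use App\Agents\Tools\MarkOrderDelayed; // hypothetical invokable wrapping the closure body above
use App\Models\AgentAction;
use App\Models\Order;

it('applies the side effect exactly once for a repeated idempotency key', function () {
    $order = Order::factory()->create(['status' => 'shipped']);
    $tool = new MarkOrderDelayed();

    $tool($order->id, 'agent:wf_1:attempt:1', 'Courier reported a delay');
    $second = $tool($order->id, 'agent:wf_1:attempt:1', 'Courier reported a delay');

    expect($order->fresh()->status)->toBe('delayed')
        ->and(AgentAction::where('idempotency_key', 'agent:wf_1:attempt:1')->count())->toBe(1)
        ->and($second)->toContain('Already applied');
});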

Workflow tests stub Prism calls and verify the orchestration is correct. Laravel Workflow's testing helpers let you replay a workflow with deterministic activity outputs and assert on the path taken — including verifying that EscalateToHuman is reached after three failed verifications.

use Prism\Prism\Prism;
use Workflow\WorkflowStub;

it('escalates after three failed verifications', function () {
    // structuredFake() is a small project-level helper that builds a fake structured response.
    Prism::fake([
        // planner returns the same plan three times, the verifier rejects each one
        structuredFake(['action' => 'mark_delayed', ...]),
        structuredFake(['decision' => 'reject', 'reason' => '...']),
        // ... repeated three times
    ]);

    $stub = WorkflowStub::make(CustomerIssueAgent::class);
    $stub->start('thread_1', 'where is my order?');

    while ($stub->running()) {
        usleep(100_000);
    }

    expect($stub->output()['escalated'])->toBeTrue();
});

Integration tests hit a real model on a tiny fixture set. These run in a separate Pest group so they do not slow the main suite. Mark them @group integration and run on PR, not on every push.

Eval tests are the harness from the previous section. They are slow, expensive, and the only thing that catches "the agent is technically working but the answers are bad" regressions.

The combination of these layers is what gives a team the courage to ship prompt edits without a senior engineer's approval.

Common mistakes

A handful of the most expensive ones I have seen, in roughly the order they tend to bite teams.

Treating the LLM as a function instead of a stochastic process. The same prompt produces different outputs. Asserting on exact strings is a recipe for flaky CI. Assert on structure and semantics, not on text.

Letting the LLM pick the idempotency key. Models will gladly produce a fresh UUID on every retry, and your "idempotent" tool will charge the card twice. The key is workflow-deterministic — generate it in PHP, pass it in.

Using Prism::text() when Prism::structured() is what you want. Free-text outputs need parsers, parsers fail, fallbacks pile up. Always use structured output for actions; use free text only for the customer-facing message.

Skipping the verifier "just for now". Every system that ships without verification ends up with a verifier within three months — usually after a public incident. Build it in from the start; it is one extra Prism call.

Storing the full transcript in a session. A six-step agent with a 4000-token context produces 100KB of conversation. Multiply by users and you have a database table that grows without bound. Store the result, not the trace; rely on Langfuse for the trace.

Forgetting to budget. The first time you have a runaway loop without a budget cap, you will know.

Treating the agent's output as authoritative. The agent proposes; the system disposes. A second-line PHP rule should still validate the action against business invariants before any side effect fires.

Wrapping up

Production AI agents in Laravel are a stack, not a feature. Prism gives you a clean LLM/tool boundary. Laravel Workflow gives you durable state and bounded retries. A verifier loop with structured output gives you reliability. Idempotent tools give you safety under retry. Langfuse gives you visibility. A Pest-driven eval harness gives you the confidence to change prompts without breaking customers. Take any one of those away and the others get harder.

If this is the first AI work in your codebase, start with getting started with Laravel Prism and building tool-calling agents before stitching the production stack on top. If you are deciding between Prism and Laravel's first-party AI SDK, the complete Laravel AI SDK guide is the right next read. And if you want your agent to be reachable from external AI clients (Claude Desktop, Cursor, etc.), the Laravel MCP server guide shows you how to expose its tools over MCP.

Build for the failure modes from the start. Demo agents and production agents look almost identical on the happy path; everything that matters is in the unhappy paths.

FAQ

What is the difference between Prism and the Laravel AI SDK?

Prism is a community package (prism-php/prism) that has been the de facto LLM client for Laravel since 2024. The Laravel AI SDK is the first-party package launched in early 2026 and ships with Laravel 13. They overlap heavily — both abstract OpenAI/Anthropic/Ollama behind a fluent API and both support tool calling and structured output. Prism currently has more granular control over the loop and a richer middleware ecosystem (including Langfuse integration), which makes it the easier choice for production agents today. The first-party SDK is improving fast and is a fine choice for simpler features.

How do I make a Laravel AI agent reliable in production?

Wrap the LLM call in a verifier loop, run the orchestration inside Laravel Workflow so state survives worker restarts, make every tool idempotent on a workflow-generated key, set bounded retries on each activity, and trace every step with Langfuse. Each piece is small; their combination converts a stochastic model into a deployable component. Skip any one and the next outage will tell you which one you needed.

Can I use Laravel Workflow with Prism tool calls?

Yes, and it is the recommended pattern for any agent with more than two steps. Wrap the Prism call in a Laravel Workflow Activity and dispatch it from the workflow with ActivityStub::make(). Prism manages the inner tool-calling loop within a single LLM round-trip; the workflow checkpoints state between rounds. The two layers compose cleanly because Prism is synchronous within an activity and the workflow is durable across activities.

How do I observe LLM calls in a Laravel app?

Register a Langfuse observer on Prism (Prism::observe(new LangfuseObserver(...))) and set LANGFUSE_* environment variables. Every call then emits a span with the prompt, response, tool calls, tokens, cost, and latency. Group spans by a workflow id or trace id to see a full agent run. For non-LLM concerns — database queries, queue jobs, HTTP — pair Langfuse with Telescope, Pulse, or Debugbar so you have a single timeline.

What's the right way to retry a failed LLM tool call?

Retry at the activity level with exponential backoff (e.g. [1, 5, 30, 120] seconds), not inside the tool body. The activity is a checkpoint, the tool is not. Combine activity retries with idempotency on the tool — the same idempotency key on the second call returns the prior result instead of re-running the side effect. Add a circuit breaker on the provider so a sustained outage triggers escalation instead of hammering a dead endpoint.

How do I keep agent state consistent across crashes?

Run the agent inside a Laravel Workflow. Each Activity writes its inputs and outputs to a workflow_logs table; if the worker dies, another worker reads the logs and resumes from the last checkpoint without re-running prior activities. Combine that with idempotent tools — even if a checkpoint races, side effects fire at most once. Avoid storing agent state in queue payloads or session — both are too short-lived for a multi-step process.

How do I evaluate agent output quality in CI?

Maintain a golden dataset of input fixtures with expected outputs and run them as Pest tests. For deterministic fields (action, order id), assert exact matches; for qualitative fields (customer-facing prose), use an LLM-as-judge that scores 1-5 against a rubric and assert on the score. Run eval tests on a nightly schedule — not every push — because judge calls cost money. The eval harness is what makes prompt iteration safe.

When should I use a state machine vs Laravel Workflow for an agent?

Use a state machine (e.g. spatie/laravel-model-states) when the workflow is short, fits inside a single request or job, and never needs to span hours. Use Laravel Workflow when steps can pause for human approval, span deploys, retry across days, or need durable replay. The rule of thumb: if any step might block longer than a queue worker's --max-time, you need Workflow. State machines are simpler when you do not need that durability — and simpler is usually better.

Steven Richardson

CTO at Digitonic. Writing about Laravel, architecture, and the craft of leading software teams from the west coast of Scotland.