How to Run a 1-Week AI Agent Pilot That Actually Ships to Production
Most AI pilots stall in evaluation purgatory. The teams that actually ship use a tight 5-day playbook with a non-negotiable scope, an eval set written before the prompt, and a production-readiness gate baked in from day one.
Quick answer: A 1-week pilot ships when scope is reduced to one workflow, success criteria are quantitative and pre-agreed, the eval set is written before any prompt, and production-readiness work (observability, guardrails, on-call) is included in week one rather than punted to "phase 2." Skip any of those and the pilot becomes a 6-month research project.
The 5-day shape that ships
Day 1: Scope (afternoon)
Pick exactly one workflow. Document its current shape: input format, decision points, output format, edge cases, current owner, current cycle time, current cost-to-serve. Get a one-line quantitative success criterion ("resolve at least 70% of inbound support tickets without escalation") signed by the operating function — not the innovation team.
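As a concrete sketch, the Day-1 scope doc fits in one typed record. Here it is in TypeScript; every field name and value below is illustrative, not a prescribed format:

```ts
// Illustrative Day-1 scope document. Adapt the fields to your workflow.
interface PilotScope {
  workflow: string;                  // exactly one
  inputFormat: string;
  decisionPoints: string[];
  outputFormat: string;
  knownEdgeCases: string[];
  currentOwner: string;              // operating function, not the innovation team
  currentCycleTimeHours: number;
  currentCostToServeUsd: number;
  successCriterion: {
    metric: string;                  // one line, quantitative
    threshold: number;               // e.g. 0.70
    signedOffBy: string;             // a name in the operating function
  };
}

const scope: PilotScope = {
  workflow: "inbound support ticket triage",
  inputFormat: "Zendesk ticket JSON",
  decisionPoints: ["resolve", "escalate", "request more info"],
  outputFormat: "reply draft + disposition",
  knownEdgeCases: ["multi-language tickets", "billing disputes"],
  currentOwner: "Support Ops",
  currentCycleTimeHours: 4,
  currentCostToServeUsd: 12,
  successCriterion: {
    metric: "tickets resolved without escalation",
    threshold: 0.7,
    signedOffBy: "Head of Support",
  },
};
```

If you can't fill in every field on Day 1, the workflow isn't understood well enough to pilot.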
Day 1-2: Eval set
Build 50-200 examples from real production data. Include common cases (60%), edge cases you've actually seen (30%), and adversarial examples (10%). Score the current human process on the eval set first to establish your floor: anything the agent ships at must beat that floor on the same set.
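A minimal sketch of the eval-case shape and the baseline calculation, assuming a simple pass/fail grade per case. The field names are ours, not a standard:

```ts
// Illustrative eval-case shape; adapt fields to your workflow.
type CaseKind = "common" | "edge" | "adversarial";

interface EvalCase {
  id: string;
  kind: CaseKind;       // target mix: ~60% common, ~30% edge, ~10% adversarial
  input: string;        // real production input, redacted as needed
  expected: string;     // the disposition a correct answer must contain
  humanPassed: boolean; // did the current human process handle this correctly?
}

// The human baseline is the floor the agent must beat on the same set.
function humanBaseline(cases: EvalCase[]): number {
  return cases.filter((c) => c.humanPassed).length / cases.length;
}
```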
Day 2-3: First version
Build the simplest agent that could plausibly work. One model (Claude Sonnet or GPT-4o mini), one or two tools, no fancy retrieval if you can avoid it. Run against the eval set. Iterate on the prompt and tool definitions. Stop when the eval pass rate exceeds the human baseline by a meaningful margin.
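A sketch of the simplest possible loop, using the Anthropic TypeScript SDK and the EvalCase shape from the sketch above. The model id, system prompt, and string-match grader are placeholders; substitute whatever current model and grading logic you actually use:

```ts
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

const SYSTEM_PROMPT = "You are a support-triage agent..."; // your real prompt here

// One model, no retrieval: the simplest agent that could plausibly work.
async function runAgent(input: string): Promise<string> {
  const msg = await client.messages.create({
    model: "claude-sonnet-4-20250514", // substitute the current Sonnet id
    max_tokens: 1024,
    system: SYSTEM_PROMPT,
    messages: [{ role: "user", content: input }],
  });
  const block = msg.content[0];
  return block.type === "text" ? block.text : "";
}

// Crude pass/fail grader over the eval set; swap in a stricter grader later.
async function evalPassRate(cases: EvalCase[]): Promise<number> {
  let passed = 0;
  for (const c of cases) {
    const out = await runAgent(c.input);
    if (out.includes(c.expected)) passed++;
  }
  return passed / cases.length;
}
```

A grader this crude is fine on Day 2. The point is a number you can move against the baseline, not a perfect judge.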
Day 3-4: Production wrapper
Wire up, in order (a minimal sketch follows this list):
- Per-input token budget cap.
- Structured logging: every prompt, completion, and tool call.
- Output validation: a Zod schema for hard constraints, plus an LLM judge for soft validation.
- Error handling and retries with backoff.
- Fallback to human escalation when confidence is low.
- A kill switch: an env var that turns the agent off without a redeploy.
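A minimal TypeScript sketch of that wrapper, reusing runAgent from the Day 2-3 sketch. The Disposition fields, confidence threshold, escalateToHuman, execute, and the token heuristic are hypothetical stand-ins for your own plumbing, and the LLM-judge soft validation is omitted for brevity:

```ts
import { z } from "zod";

// Kill switch: setting AGENT_ENABLED=false disables the agent without a redeploy.
const agentEnabled = () => process.env.AGENT_ENABLED !== "false";

// Hard output validation; field names are illustrative.
const Disposition = z.object({
  action: z.enum(["resolve", "escalate", "request_info"]),
  reply: z.string().min(1),
  confidence: z.number().min(0).max(1),
});

const MAX_INPUT_TOKENS = 4_000;                             // per-input budget cap
const roughTokens = (s: string) => Math.ceil(s.length / 4); // ~4 chars/token heuristic
const log = (o: unknown) => console.log(JSON.stringify(o)); // swap in your logger

// Hypothetical plumbing; replace with your own implementations.
async function escalateToHuman(input: string, reason: string) { log({ input, reason }); }
async function execute(d: z.infer<typeof Disposition>) { /* act on the decision */ }
declare function runAgent(input: string): Promise<string>;  // from the Day 2-3 sketch

async function withRetries<T>(fn: () => Promise<T>, attempts = 3): Promise<T> {
  for (let i = 0; ; i++) {
    try {
      return await fn();
    } catch (err) {
      if (i >= attempts - 1) throw err;
      await new Promise((r) => setTimeout(r, 500 * 2 ** i)); // exponential backoff
    }
  }
}

async function handleTicket(input: string): Promise<void> {
  if (!agentEnabled()) return escalateToHuman(input, "kill switch engaged");
  if (roughTokens(input) > MAX_INPUT_TOKENS)
    return escalateToHuman(input, "over token budget");

  const raw = await withRetries(() => runAgent(input));
  log({ input, raw }); // structured log of every prompt and completion

  let json: unknown = null;
  try { json = JSON.parse(raw); } catch { /* leave null; validation will fail */ }
  const parsed = Disposition.safeParse(json);
  if (!parsed.success || parsed.data.confidence < 0.8)
    return escalateToHuman(input, "failed validation or low confidence");

  await execute(parsed.data);
}
```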
Day 4-5: Production deploy + handoff
Ship to production in shadow mode first (agent runs alongside humans, doesn't affect outcomes). Compare agent decisions against human decisions for 24-48 hours. If shadow mode looks good, flip to live mode for a small percentage of traffic. Document the runbook. Hand off the on-call rotation.
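A sketch of the shadow-mode wiring, again with hypothetical stand-ins (routeToHuman for your existing workflow, runAgent from the Day 2-3 sketch). The agent's decision is logged but never executed, and a shadow failure never blocks the human path:

```ts
// Hypothetical stand-ins; replace with your own implementations.
declare function routeToHuman(input: string): Promise<string>;
declare function runAgent(input: string): Promise<string>;
const log = (o: unknown) => console.log(JSON.stringify(o));

async function handleInShadowMode(input: string): Promise<string> {
  const humanDecision = await routeToHuman(input); // existing workflow, unchanged
  // Fire-and-forget: the agent's decision is logged, never executed.
  runAgent(input)
    .then((agentDecision) =>
      log({ input, humanDecision, agentDecision, match: agentDecision === humanDecision }))
    .catch((err) => log({ input, shadowError: String(err) }));
  return humanDecision;
}

// After 24-48 hours, agreement rate over the shadow log is the go/no-go signal.
function agreementRate(rows: { match: boolean }[]): number {
  return rows.filter((r) => r.match).length / rows.length;
}
```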
The non-negotiables
- One workflow. Two workflows means six weeks, not one.
- Operating-function sponsor. The team that owns the workflow must hold the budget for production. Nobody else can buy production-readiness.
- Eval set before prompt. Without an eval set you cannot iterate, regression-test, or argue for production.
- Production-readiness in week one. Token budget, logging, validation, kill switch. None are optional.
- Shadow mode before live mode. Always.
What we cut
To make 5 days work, we cut: pretty UIs, slick demos, custom dashboards, exotic LLM features (fine-tuning, advanced RAG), and any "phase 2" work. The pilot is the agent — not the wrapper around the agent.
Need help running a pilot? Our AI Kickstart is exactly this playbook, executed by us, embedded with your team. Or book a $425 strategy hour if you want a second opinion on your pilot scope before you start.
Frequently asked questions
Can you really build and ship an AI agent in one week?
Yes, for ONE narrowly scoped workflow with quantitative success criteria, an existing operating-function sponsor, and a small (50-200 example) eval set. Multi-workflow projects, exploratory builds without a sponsor, or anything requiring custom infrastructure cannot ship in a week.
What is shadow mode for an AI agent?
Shadow mode runs the agent alongside the existing human-driven workflow without affecting outcomes. Agent decisions are logged but not executed. You compare agent vs human decisions for 24-48 hours to validate quality before going live.
How big should an AI agent eval set be for a pilot?
50-200 examples for most workflow agents. Build it from real production data, not synthetic. Include 60% common cases, 30% edge cases you've seen, and 10% adversarial. Score it against your current human baseline first to establish the floor the agent must beat.
Why do pilots fail to reach production?
Most fail because production-readiness work (observability, guardrails, on-call, evals) is punted to a "phase 2" that never gets funded. Bake it into week one or accept that the project will stall. We have a separate post on the 5 recurring reasons AI pilots fail to ship.
What does it cost to ship a 1-week pilot?
Our AI Kickstart engagements run $7,000-$10,000 for a one-week pilot, including the production deploy. In-house teams typically spend 2-4 engineer-weeks plus product manager and SME time, roughly $15,000-$40,000 fully loaded. Vendor solutions vary widely.
Ready to ship an AI agent that actually works?
We embed with your team, build the agent, and ship it to production. Founder-led, no slide decks.