How We Built a Workflow Engine That Orchestrates 15+ AI Calls (And Never Loses Track)
TL;DR: We built a step orchestrator that manages complex AI workflows with 15+ steps. It resolves dependencies, retries failures, tracks progress, and runs independent steps in parallel. Built from scratch in TypeScript. In production, the failure rate dropped from 20% to 0.1%.
The Problem: AI Workflows Are Complex
Generating a website isn’t one AI call—it’s 15+ calls in a specific order:
- Research business
- Search competitors
- Generate strategy
- Create brand guidelines
- Design logo
- Generate hero image
- Create header
- Create footer
- Plan pages
- Generate page 1 (sections 1-5)
- Generate page 2 (sections 1-3)
- … (15+ total steps)
Each step depends on previous steps:
- Logo needs brand colors (from step 4)
- Header needs logo (from step 5)
- Pages need strategy (from step 3)
What happens when step 8 fails?
- Do we restart from step 1? (expensive, slow)
- Do we skip step 8? (incomplete website)
- Do we retry step 8? (how many times?)
Traditional approaches:
- Sequential scripts: Hard-coded order, no retry logic, brittle
- Workflow engines (Temporal, Airflow): Overkill for our needs, complex setup
- Hope and pray: Run steps, hope nothing fails (it will)
We needed something better: A workflow engine built for AI.
The Insight: Steps as First-Class Citizens
The breakthrough came when we stopped thinking about “functions” and started thinking about “steps.”
Bad: Functions that call other functions
async function generateWebsite(businessName: string) {
  const research = await researchBusiness(businessName);
  const strategy = await generateStrategy(research);
  const logo = await createLogo(strategy);
  const header = await createHeader(logo);
  // ... 10 more steps
}
Good: Steps that declare dependencies
class HeaderStep implements Step {
  id = 'header';
  requiredInputs = ['logo', 'brandStrategy'];
  async execute(context: StepContext) {
    const logo = context.get('logo');
    const brandStrategy = context.get('brandStrategy');
    return await createHeader(logo, brandStrategy);
  }
}
The difference? Steps are self-documenting, composable, and orchestratable.
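One thing the declarative form buys you is that the wiring can be checked before any AI call is made. Here is a minimal sketch of that idea, assuming each step stores its output under its own `id` (a convention the examples in this post appear to follow). `StepDecl` and `findUnmetInputs` are hypothetical names for illustration, not part of our engine:

```typescript
// Hypothetical, trimmed-down step declaration: only the fields needed here.
interface StepDecl {
  id: string;
  requiredInputs: string[];
}

// Returns, per step, the inputs that no earlier step in the list produces.
// Assumption: each step stores its output under its own id.
function findUnmetInputs(steps: StepDecl[]): Map<string, string[]> {
  const produced = new Set<string>();
  const unmet = new Map<string, string[]>();
  for (const step of steps) {
    const missing = step.requiredInputs.filter((key) => !produced.has(key));
    if (missing.length > 0) {
      unmet.set(step.id, missing);
    }
    produced.add(step.id);
  }
  return unmet;
}

// 'header' is listed before 'logo', so its 'logo' input is not available yet.
const unmet = findUnmetInputs([
  { id: 'brandStrategy', requiredInputs: [] },
  { id: 'header', requiredInputs: ['logo', 'brandStrategy'] },
  { id: 'logo', requiredInputs: ['brandStrategy'] },
]);
console.log(unmet);
```

With plain function calls, the same ordering mistake would only surface at runtime, after the earlier (paid) AI calls had already run.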
How It Works: The Technical Architecture
1. Step Interface
Every step implements this interface:
interface Step {
  // Identity
  id: string;
  name: string;
  description: string;
  // Dependencies
  requiredInputs: string[];
  // Execution
  execute(context: StepContext): Promise<StepResult>;
  // Configuration
  timeout: number; // Max execution time (ms)
  maxRetries: number; // Max retry attempts
  progressWeight: number; // Contribution to overall progress (0-100)
  // Validation
  validateInputs(context: StepContext): ValidationResult;
}
interface StepContext {
  // Get data from previous steps
  get<T>(key: string): T;
  // Store data for future steps
  set<T>(key: string, value: T): void;
  // Check if step completed
  isCompleted(stepId: string): boolean;
  // Business info
  businessId: string;
  businessName: string;
  versionId: string;
}
interface StepResult {
  success: boolean;
  data?: any;
  error?: string;
  shouldRetry?: boolean;
}
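To make the `StepContext` contract concrete, here is a small in-memory sketch. It is not our production context: the class name `InMemoryContext`, the choice to throw on a missing key, and the `markCompleted` method are all assumptions (the interface above does not say how a step gets recorded as completed).

```typescript
// Illustrative in-memory context. Names and behavior are assumptions.
class InMemoryContext {
  readonly businessId: string;
  readonly businessName: string;
  readonly versionId: string;
  private data = new Map<string, unknown>();
  private completed = new Set<string>();

  constructor(businessId: string, businessName: string, versionId: string) {
    this.businessId = businessId;
    this.businessName = businessName;
    this.versionId = versionId;
  }

  // Throws on a missing key so that a wiring mistake fails loudly.
  get<T>(key: string): T {
    if (!this.data.has(key)) {
      throw new Error(`Missing context key: ${key}`);
    }
    return this.data.get(key) as T;
  }

  set<T>(key: string, value: T): void {
    this.data.set(key, value);
  }

  // Not part of the interface above: something has to record completion.
  markCompleted(stepId: string): void {
    this.completed.add(stepId);
  }

  isCompleted(stepId: string): boolean {
    return this.completed.has(stepId);
  }
}

const ctx = new InMemoryContext('biz_1', 'Acme Bakery', 'v1');
ctx.set('brandStrategy', { colors: ['#aa3300'] });
ctx.markCompleted('brandStrategy');
const brandColors = ctx.get<{ colors: string[] }>('brandStrategy').colors;
console.log(brandColors, ctx.isCompleted('logo'));
```

Note that `get<T>` is an unchecked cast: the caller states the type it expects, and nothing verifies it at runtime.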
2. Step Orchestrator
The brain that manages step execution:
class StepOrchestrator {
  private steps: Step[];
  private context: StepContext;
  constructor(steps: Step[], context: StepContext) {
    this.steps = steps;
    this.context = context;
  }
  async execute(): Promise<WorkflowResult> {
    // Validate dependency graph
    this.validateDependencies();
    // Execute steps in order
    for (const step of this.steps) {
      console.log(`Executing step: ${step.name}`);
      // Validate inputs
      const validation = step.validateInputs(this.context);
      if (!validation.valid) {
        return {
          success: false,
          failedStep: step.id,
          error: validation.errorMessage
        };
      }
      // Execute with retries
      const result = await this.executeWithRetries(step);
      if (!result.success) {
        return {
          success: false,
          failedStep: step.id,
          error: result.error
        };
      }
      // Update progress
      await this.updateProgress(step);
    }
    return { success: true };
  }
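  private async executeWithRetries(step: Step): Promise<StepResult> {
    let lastError: string | undefined;
    // maxRetries counts total attempts, including the first one
    for (let attempt = 1; attempt <= step.maxRetries; attempt++) {
      try {
        // Execute with timeout
        const result = await Promise.race([
          step.execute(this.context),
          this.timeout(step.timeout)
        ]);
        if (result.success) {
          return result;
        }
        lastError = result.error;
        if (!result.shouldRetry) {
          break; // Don't retry
        }
      } catch (error) {
        // Thrown errors, including timeouts, are treated as retryable
        lastError = error instanceof Error ? error.message : String(error);
      }
      if (attempt < step.maxRetries) {
        console.log(`Step ${step.name} failed, retrying (${attempt}/${step.maxRetries})`);
        // Exponential backoff (2s, 4s, 8s, ...), applied to thrown errors as well
        await this.sleep(Math.pow(2, attempt) * 1000);
      }
    }
    return {
      success: false,
      error: lastError ?? 'Step failed after max retries'
    };
  }
  private sleep(ms: number): Promise<void> {
    return new Promise(resolve => setTimeout(resolve, ms));
  }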
  private timeout(ms: number): Promise<never> {
    return new Promise((_, reject) => {
      setTimeout(() => reject(new Error('Step timeout')), ms);
    });
  }
  private async updateProgress(step: Step) {
    const completedWeight = this.steps
      .filter(s => this.context.isCompleted(s.id))
      .reduce((sum, s) => sum + s.progressWeight, 0);
    const totalWeight = this.steps
      .reduce((sum, s) => sum + s.progressWeight, 0);
    const progress = Math.round((completedWeight / totalWeight) * 100);
    await publishProgress(this.context.versionId, {
      step: step.id,
      progress,
      message: `Completed ${step.name}`
    });
  }
  private validateDependencies() {
    // Build dependency graph (each required input is treated as the id of the step that produces it)
    const graph = new Map<string, Set<string>>();
    for (const step of this.steps) {
      graph.set(step.id, new Set(step.requiredInputs));
    }
    // Detect circular dependencies with a depth-first search
    const visited = new Set<string>();
    const recursionStack = new Set<string>();
    const hasCycle = (stepId: string): boolean => {
      visited.add(stepId);
      recursionStack.add(stepId);
      const deps = graph.get(stepId) || new Set();
      for (const dep of deps) {
        if (!visited.has(dep)) {
          if (hasCycle(dep)) return true;
        } else if (recursionStack.has(dep)) {
          return true; // Cycle detected!
        }
      }
      recursionStack.delete(stepId);
      return false;
    };
    for (const stepId of graph.keys()) {
      if (!visited.has(stepId)) {
        if (hasCycle(stepId)) {
          throw new Error(`Circular dependency detected involving ${stepId}`);
        }
      }
    }
  }
}
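The backoff in the retry loop is `Math.pow(2, attempts) * 1000`, so the waits are 2s, 4s, 8s, and so on. A small sketch of that schedule as a standalone function follows. The `capMs` and `jitter` parameters are illustrative additions on our part (a cap keeps long retry chains bounded, and jitter keeps parallel steps from retrying in lockstep); they are not values from our system.

```typescript
// Delay before the next attempt, in milliseconds.
// `attempt` is 1-based, matching the counter in the retry loop.
// capMs and jitter are hypothetical knobs, not taken from the orchestrator.
function backoffDelay(
  attempt: number,
  baseMs: number = 1000,
  capMs: number = 30000,
  jitter: number = 0,
): number {
  const exponential = Math.min(capMs, Math.pow(2, attempt) * baseMs);
  // With jitter = 0.2 the result falls between 80% and 100% of `exponential`.
  return Math.round(exponential * (1 - jitter * Math.random()));
}

const backoffSchedule = [1, 2, 3, 4, 5].map((attempt) => backoffDelay(attempt));
console.log(backoffSchedule); // [2000, 4000, 8000, 16000, 30000] when jitter is 0
```

With `maxRetries = 3` only the first two delays are ever used, so a step that fails every attempt spends about six seconds waiting on top of its own execution time.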
3. Example Step Implementation
Here’s a real step from our system:
// Shape inferred from the fields this step reads; the field types are assumptions
interface BrandStrategy {
  colors: string[];
  visualStyle: string;
  industry: string;
}
class LogoStep implements Step {
  id = 'logo';
  name = 'Logo Generation';
  description = 'Create a professional logo';
  requiredInputs = ['brandStrategy'];
  timeout = 60000; // 60 seconds
  maxRetries = 3;
  progressWeight = 10;
  validateInputs(context: StepContext): ValidationResult {
    try {
      const brandStrategy = context.get<BrandStrategy>('brandStrategy');
      if (!brandStrategy.colors || brandStrategy.colors.length === 0) {
        return {
          valid: false,
          errorMessage: 'Brand strategy missing colors'
        };
      }
      return { valid: true };
    } catch (error) {
      // Reached when the input is absent (get throws, or returns undefined)
      return {
        valid: false,
        errorMessage: 'Missing required input: brandStrategy'
      };
    }
  }
  async execute(context: StepContext): Promise<StepResult> {
    const brandStrategy = context.get<BrandStrategy>('brandStrategy');
    const businessName = context.businessName;
    try {
      // Generate logo with AI
      const logoUrl = await generateLogo({
        businessName,
        colors: brandStrategy.colors,
        style: brandStrategy.visualStyle,
        industry: brandStrategy.industry
      });
      // Store in context
      context.set('logo', {
        url: logoUrl,
        colors: brandStrategy.colors,
        generatedAt: new Date()
      });
      return {
        success: true,
        data: { logoUrl }
      };
    } catch (error) {
      return {
        success: false,
        error: error instanceof Error ? error.message : String(error),
        shouldRetry: true // Retry on failure
      };
    }
  }
}
4. Parallel Execution
Some steps can run in parallel:
class StepOrchestrator {
  async execute(): Promise<WorkflowResult> {
    // Group steps by dependencies
    const groups = this.groupByDependencies();
    for (const group of groups) {
      // Execute group in parallel
      const results = await Promise.all(
        group.map(step => this.executeWithRetries(step))
      );
      // Check for failures and report which step failed
      const failedIndex = results.findIndex(r => !r.success);
      if (failedIndex !== -1) {
        return {
          success: false,
          failedStep: group[failedIndex].id,
          error: results[failedIndex].error
        };
      }
    }
    return { success: true };
  }
  private groupByDependencies(): Step[][] {
    const groups: Step[][] = [];
    const completed = new Set<string>();
    // A dependency also counts as satisfied if an earlier run already completed it,
    // so a resumed workflow that receives only the remaining steps does not deadlock
    const isSatisfied = (dep: string) =>
      completed.has(dep) || this.context.isCompleted(dep);
    while (completed.size < this.steps.length) {
      // Find steps whose dependencies are all completed
      const ready = this.steps.filter(step =>
        !completed.has(step.id) &&
        step.requiredInputs.every(isSatisfied)
      );
      if (ready.length === 0) {
        throw new Error('Dependency deadlock detected');
      }
      groups.push(ready);
      ready.forEach(step => completed.add(step.id));
    }
    return groups;
  }
}
Example execution:
Group 1: [research]
Group 2: [webSearch, strategy] // Parallel
Group 3: [brandStrategy]
Group 4: [logo, heroImage] // Parallel
Group 5: [header, footer] // Parallel
Group 6: [planning]
Group 7: [page1, page2, page3] // Parallel
Group 8: [assembly]
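The grouping can be reproduced outside the orchestrator with a few lines. The sketch below applies the same "ready when all inputs are done" rule to the first five groups. The dependency lists are our guess at a plausible wiring for illustration (for example, that `footer` needs `logo`); `StepSpec` and `groupIntoWaves` are hypothetical names.

```typescript
interface StepSpec {
  id: string;
  requiredInputs: string[];
}

// Repeatedly collects every step whose inputs are all completed.
// Each pass yields one group of steps that can run in parallel.
function groupIntoWaves(steps: StepSpec[]): string[][] {
  const waves: string[][] = [];
  const completed = new Set<string>();
  while (completed.size < steps.length) {
    const ready = steps.filter(
      (s) => !completed.has(s.id) && s.requiredInputs.every((d) => completed.has(d)),
    );
    if (ready.length === 0) {
      throw new Error('Dependency deadlock detected');
    }
    waves.push(ready.map((s) => s.id));
    ready.forEach((s) => completed.add(s.id));
  }
  return waves;
}

// Assumed wiring, for illustration only.
const waves = groupIntoWaves([
  { id: 'research', requiredInputs: [] },
  { id: 'webSearch', requiredInputs: ['research'] },
  { id: 'strategy', requiredInputs: ['research'] },
  { id: 'brandStrategy', requiredInputs: ['strategy'] },
  { id: 'logo', requiredInputs: ['brandStrategy'] },
  { id: 'heroImage', requiredInputs: ['brandStrategy'] },
  { id: 'header', requiredInputs: ['logo'] },
  { id: 'footer', requiredInputs: ['logo'] },
]);
console.log(waves);
// [["research"], ["webSearch","strategy"], ["brandStrategy"], ["logo","heroImage"], ["header","footer"]]
```

A step always lands in the earliest group where its inputs are available. If you want a step to wait for something it does not strictly read from the context, that something has to appear in its `requiredInputs`.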
5. Dynamic Step Enqueueing
Some steps create more steps:
class PlanningStep implements Step {
  id = 'planning';
  name = 'Planning';
  requiredInputs = ['strategy'];
  // description, timeout, maxRetries, progressWeight and validateInputs omitted for brevity
  // The step needs a reference to the orchestrator so it can add steps at runtime
  constructor(private orchestrator: StepOrchestrator) {}
  async execute(context: StepContext): Promise<StepResult> {
    const strategy = context.get('strategy');
    // Generate page plan
    const pagePlan = await generatePagePlan(strategy);
    // Store plan
    context.set('pagePlan', pagePlan);
    // Enqueue page generation steps (enqueueStep is not shown in the orchestrator listing above)
    for (const page of pagePlan.pages) {
      const pageStep = new PageStep(page.id, page.name, page.sections);
      this.orchestrator.enqueueStep(pageStep);
    }
    return {
      success: true,
      data: { pageCount: pagePlan.pages.length }
    };
  }
}
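Dynamic enqueueing only works if the run loop reads from a queue that can grow while it is being drained. The following is a minimal sketch of such a loop, not our orchestrator: `QueuedStep`, `drainQueue`, and the stand-in planning step are hypothetical, and retries, timeouts, and dependencies are left out.

```typescript
// Hypothetical minimal step shape: a step may enqueue more steps while it runs.
interface QueuedStep {
  id: string;
  run(enqueue: (step: QueuedStep) => void): Promise<void>;
}

// Drains the queue by shifting from the front, so steps that are
// added during execution are picked up by the same loop.
async function drainQueue(initial: QueuedStep[]): Promise<string[]> {
  const queue: QueuedStep[] = [...initial];
  const executed: string[] = [];
  while (queue.length > 0) {
    const step = queue.shift() as QueuedStep;
    await step.run((next) => queue.push(next));
    executed.push(step.id);
  }
  return executed;
}

const makePageStep = (n: number): QueuedStep => ({
  id: `page${n}`,
  run: async () => {},
});

const planningStep: QueuedStep = {
  id: 'planning',
  run: async (enqueue) => {
    // Stand-in for the AI-generated page plan: three pages.
    [1, 2, 3].forEach((n) => enqueue(makePageStep(n)));
  },
};

drainQueue([planningStep]).then((order) => console.log(order));
// order is planning, page1, page2, page3
```

In this sketch, enqueued steps run after everything already in the queue. A final step such as assembly therefore has to be enqueued after the pages, or declare them as dependencies.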
The Challenges We Solved
Challenge 1: Progress Tracking
Problem: Users want to see progress, but steps take different amounts of time
Solution: Weighted progress
const steps = [
  { id: 'research', progressWeight: 10 },
  { id: 'strategy', progressWeight: 15 },
  { id: 'logo', progressWeight: 10 },
  { id: 'pages', progressWeight: 50 }, // Heaviest step
  { id: 'assembly', progressWeight: 15 }
];
// Progress = (completed weight / total weight) * 100
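As a worked example of the formula, here is a standalone version of the calculation using the weights above. `WeightedStep` and `progressPercent` are illustrative names, and the zero-total guard is our addition.

```typescript
interface WeightedStep {
  id: string;
  progressWeight: number;
}

// Percentage of total weight that belongs to completed steps, rounded.
function progressPercent(steps: WeightedStep[], completedIds: Set<string>): number {
  const total = steps.reduce((sum, s) => sum + s.progressWeight, 0);
  if (total === 0) {
    return 0; // avoid dividing by zero for an empty workflow
  }
  const done = steps
    .filter((s) => completedIds.has(s.id))
    .reduce((sum, s) => sum + s.progressWeight, 0);
  return Math.round((done / total) * 100);
}

const weightedSteps: WeightedStep[] = [
  { id: 'research', progressWeight: 10 },
  { id: 'strategy', progressWeight: 15 },
  { id: 'logo', progressWeight: 10 },
  { id: 'pages', progressWeight: 50 },
  { id: 'assembly', progressWeight: 15 },
];

const afterLogo = progressPercent(weightedSteps, new Set(['research', 'strategy', 'logo']));
console.log(afterLogo); // 35
```

Three of five steps are done, but the bar shows 35% rather than 60%, because the page generation step carries half of the total weight.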
Challenge 2: Partial Failures
Problem: Step 10 fails, but steps 1-9 succeeded. Don’t want to redo everything.
Solution: Resume from last successful step
async function resumeWorkflow(versionId: string) {
  // Load the persisted context from the database
  const context = await loadContext(versionId);
  // Collect the IDs of steps that already completed
  const completedSteps = new Set(context.completedSteps);
  // Filter out completed steps (allSteps is the full, ordered step list for this workflow)
  const remainingSteps = allSteps.filter(step =>
    !completedSteps.has(step.id)
  );
  // Resume execution
  const orchestrator = new StepOrchestrator(remainingSteps, context);
  return await orchestrator.execute();
}
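Resume depends on the context being persisted after every successful step. Here is a minimal sketch of that checkpointing, with an in-memory `Map` standing in for the database. `Checkpoint`, `saveCheckpoint`, `loadCheckpoint`, and `remainingStepIds` are hypothetical names, not our actual persistence layer.

```typescript
interface Checkpoint {
  versionId: string;
  completedSteps: string[];
  data: Record<string, unknown>;
}

// In-memory stand-in for a database table keyed by versionId.
const checkpointStore = new Map<string, string>();

function saveCheckpoint(checkpoint: Checkpoint): void {
  checkpointStore.set(checkpoint.versionId, JSON.stringify(checkpoint));
}

function loadCheckpoint(versionId: string): Checkpoint | undefined {
  const raw = checkpointStore.get(versionId);
  return raw === undefined ? undefined : (JSON.parse(raw) as Checkpoint);
}

// Steps still to run, in their original order.
function remainingStepIds(allStepIds: string[], checkpoint: Checkpoint | undefined): string[] {
  const done = new Set(checkpoint?.completedSteps ?? []);
  return allStepIds.filter((id) => !done.has(id));
}

saveCheckpoint({
  versionId: 'v1',
  completedSteps: ['research', 'strategy'],
  data: { strategy: { tone: 'friendly' } },
});

const toRun = remainingStepIds(
  ['research', 'strategy', 'logo', 'pages', 'assembly'],
  loadCheckpoint('v1'),
);
console.log(toRun); // ["logo", "pages", "assembly"]
```

One thing to watch with JSON-based checkpoints: values such as `Date` objects come back as strings after a round trip, so anything a later step reads from the context should be plain data.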
Challenge 3: Debugging Failures
Problem: When a step fails, it’s hard to know why
Solution: Detailed logging + step history
class StepOrchestrator {
  private async executeWithRetries(step: Step): Promise<StepResult> {
    const startTime = Date.now();
    try {
      const result = await step.execute(this.context);
      // Log the outcome; a step can also report failure without throwing
      await logStepExecution({
        stepId: step.id,
        versionId: this.context.versionId,
        status: result.success ? 'success' : 'failed',
        duration: Date.now() - startTime,
        result: result.data,
        error: result.error
      });
      return result;
    } catch (error) {
      // Log failure
      const err = error instanceof Error ? error : new Error(String(error));
      await logStepExecution({
        stepId: step.id,
        versionId: this.context.versionId,
        status: 'failed',
        duration: Date.now() - startTime,
        error: err.message,
        stack: err.stack
      });
      throw error;
    }
  }
}
The Results: Reliable AI Workflows
Before (no orchestrator):
- 20% of website generations failed mid-process
- No way to resume failed generations
- No progress tracking
- Hard-coded step order
- No parallel execution
After (step orchestrator):
- 0.1% failure rate (only unrecoverable errors)
- Automatic resume on failure
- Real-time progress updates
- Declarative step dependencies
- 3x faster with parallel execution
Additional benefits:
- Easy to add new steps: Just implement the interface
- Easy to modify workflow: Change step order, add dependencies
- Easy to debug: Detailed logs for every step
- Easy to test: Mock context, test steps in isolation
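To illustrate the last point, here is a sketch of testing a step in isolation. The context is a fake backed by a `Map`, and the AI call is injected as a function so the test never touches a real model. Everything here (`makeFakeContext`, `makeLogoStep`, the URL, the error message) is hypothetical and simplified compared to the `LogoStep` shown earlier.

```typescript
// Fake context: just enough of the StepContext surface for this step.
function makeFakeContext(initial: Record<string, unknown>) {
  const data = new Map<string, unknown>(Object.entries(initial));
  return {
    businessName: 'Test Bakery',
    get<T>(key: string): T {
      if (!data.has(key)) throw new Error(`Missing context key: ${key}`);
      return data.get(key) as T;
    },
    set<T>(key: string, value: T): void {
      data.set(key, value);
    },
  };
}
type FakeContext = ReturnType<typeof makeFakeContext>;

// Step under test. The generator is injected, so a test can pass a stub.
function makeLogoStep(generateLogo: (name: string, colors: string[]) => Promise<string>) {
  return {
    id: 'logo',
    async execute(context: FakeContext) {
      const brand = context.get<{ colors: string[] }>('brandStrategy');
      try {
        const url = await generateLogo(context.businessName, brand.colors);
        context.set('logo', { url });
        return { success: true as const };
      } catch (error) {
        const message = error instanceof Error ? error.message : String(error);
        return { success: false as const, error: message, shouldRetry: true };
      }
    },
  };
}

async function runLogoStepChecks(): Promise<string[]> {
  const outcomes: string[] = [];
  const brandStrategy = { colors: ['#123456'] };

  // Happy path: the stub returns a URL and the step stores it.
  const okContext = makeFakeContext({ brandStrategy });
  const ok = await makeLogoStep(async () => 'https://example.test/logo.png').execute(okContext);
  outcomes.push(ok.success ? `ok:${okContext.get<{ url: string }>('logo').url}` : 'failed');

  // Failure path: the stub throws and the step reports a retryable failure.
  const failing = makeLogoStep(async () => {
    throw new Error('model overloaded');
  });
  const bad = await failing.execute(makeFakeContext({ brandStrategy }));
  outcomes.push(bad.success ? 'ok' : `failed:${bad.error}`);
  return outcomes;
}

runLogoStepChecks().then((outcomes) => console.log(outcomes));
```

Because the step only talks to the context and to the injected generator, neither the orchestrator nor the network is needed to exercise both paths.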
Why This Matters for AI Applications
Most AI applications are single-shot: one prompt, one response. Ours is not, and we learned:
Bad: Chain AI calls in code → hope nothing fails
Good: Build an orchestrator → handle failures gracefully
The startup lesson: AI workflows need orchestration. Don’t hard-code your workflow—build a system that manages it.
Key Insights
- Steps > functions: Declarative dependencies beat imperative code
- Retries are essential: AI calls fail, plan for it
- Progress matters: Users need feedback on long-running tasks
- Parallel execution: Run independent steps simultaneously
What’s Next
We’re exploring:
- Conditional steps: Skip steps based on business type
- Step branching: Different workflows for different scenarios
- Step caching: Reuse results from previous executions
- Distributed execution: Run steps on different machines
But the core insight remains: Orchestration is infrastructure, not a feature.
Try it yourself: Generate a website with WebZum and watch the progress bar. Each step is managed by the orchestrator; if one fails, it retries automatically.
Building an AI workflow? Key takeaway: Don’t chain AI calls in code. Build a step orchestrator that manages dependencies, retries, progress, and parallelism.
The future of AI applications isn’t single-shot prompts—it’s orchestrated workflows.