Error Recovery Patterns
Comprehensive patterns and strategies for recovering from errors in BPMN processes.
Overview
This guide covers proven recovery patterns for:
- ✅ Retry strategies - Automatic retry with backoff
- ✅ Fallback mechanisms - Alternative data sources or services
- ✅ Circuit breaker - Protect against cascading failures
- ✅ Compensating actions - Undo partial work
- ✅ Human intervention - Escalate complex failures
Pattern 1: Exponential Backoff Retry
Automatically retry failed operations with increasing delays between attempts.
Implementation
<bpmn:serviceTask id="CallExternalAPI" name="Call External API">
<bpmn:extensionElements>
<custom:properties>
<custom:property name="ConnectorKey" value="CallPartnerAPICommand"/>
<custom:property name="ResultVariable" value="apiResult"/>
</custom:properties>
</bpmn:extensionElements>
</bpmn:serviceTask>
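<!-- Catches retryable errors. Standard BPMN 2.0 treats error boundary events as always interrupting, so confirm that your engine accepts cancelActivity="false" here. -->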
<bpmn:boundaryEvent id="RetryableError"
attachedToRef="CallExternalAPI"
cancelActivity="false">
<bpmn:errorEventDefinition errorRef="RetryableError" />
</bpmn:boundaryEvent>
<!-- Interrupting: catches API_RETRY_EXHAUSTED, which is thrown by the Calculate Retry script task -->
<bpmn:boundaryEvent id="RetriesExhausted"
attachedToRef="CalculateRetry"
cancelActivity="true">
<bpmn:errorEventDefinition errorRef="ExhaustedError" />
</bpmn:boundaryEvent>
<!-- Calculate retry delay -->
<bpmn:scriptTask id="CalculateRetry" name="Calculate Retry">
<bpmn:script>
var retryCount = (context.apiRetryCount || 0) + 1;
var maxRetries = 5;
if (retryCount > maxRetries) {
// Max retries reached - convert to permanent error
throw new BpmnError('API_RETRY_EXHAUSTED',
'Failed after ' + maxRetries + ' retry attempts');
}
context.apiRetryCount = retryCount;
// Exponential backoff: 2^retryCount * 1000ms
// Retry 1: 2s, Retry 2: 4s, Retry 3: 8s, Retry 4: 16s, Retry 5: 32s
var delayMs = Math.pow(2, retryCount) * 1000;
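// Add jitter (random 0-500ms) to prevent thundering herd.
// Note: the Wait timer rounds the delay to whole seconds, which discards
// sub-second jitter; widen the jitter range if the spread must survive rounding.
delayMs += Math.random() * 500;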
context.retryDelayMs = delayMs;
logger.info('API retry attempt ' + retryCount + ' of ' + maxRetries +
' after ' + Math.round(delayMs) + 'ms delay');
return {
retryCount: retryCount,
delayMs: Math.round(delayMs)
};
</bpmn:script>
</bpmn:scriptTask>
<!-- Wait before retry -->
<bpmn:intermediateCatchEvent id="WaitBeforeRetry" name="Wait">
<bpmn:timerEventDefinition>
<bpmn:timeDuration xsi:type="bpmn:tFormalExpression">
PT${Math.round(context.retryDelayMs / 1000)}S
</bpmn:timeDuration>
</bpmn:timerEventDefinition>
</bpmn:intermediateCatchEvent>
<!-- Loop back to retry -->
<bpmn:sequenceFlow sourceRef="WaitBeforeRetry" targetRef="CallExternalAPI" />
<!-- Handle permanent failure -->
<bpmn:scriptTask id="HandlePermanentFailure" name="Handle Permanent Failure">
<bpmn:script>
var error = context._lastError;
logger.error('API permanently failed after ' + context.apiRetryCount +
' retries: ' + error.errorMessage);
// Send alert
BankLingo.ExecuteCommand('SendAlert', {
severity: 'HIGH',
message: 'External API permanently unavailable',
context: {
errorCode: error.errorCode,
errorMessage: error.errorMessage,
retries: context.apiRetryCount
}
});
context.apiCallFailed = true;
context.useAlternativeMethod = true;
</bpmn:script>
</bpmn:scriptTask>
<bpmn:error id="RetryableError" errorCode="GATEWAY_TIMEOUT" />
<bpmn:error id="ExhaustedError" errorCode="API_RETRY_EXHAUSTED" />
Use Cases
- ✅ Network timeouts
- ✅ Temporary service unavailability
- ✅ Rate limit exceeded (with longer backoff)
- ✅ Database deadlocks
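The delay calculation from the Calculate Retry script can be pulled out into a pure function, which makes the schedule easy to check in isolation. The sketch below is an illustration only: `computeRetryDelayMs` and its options (`baseMs`, `maxDelayMs`, `jitterMs`, `random`) are hypothetical names, and the upper bound is an added assumption for cases such as rate limiting where a longer but bounded backoff is wanted.

```javascript
// Hypothetical helper: exponential backoff with a cap and optional jitter.
// retryCount is 1-based, matching context.apiRetryCount in the script above.
function computeRetryDelayMs(retryCount, options) {
  var opts = options || {};
  var baseMs = opts.baseMs || 1000;          // multiplier per step
  var maxDelayMs = opts.maxDelayMs || 60000; // assumed upper bound, not from the guide
  var jitterMs = opts.jitterMs || 0;         // 0 disables jitter
  var random = opts.random || Math.random;   // injectable so tests are deterministic

  var delayMs = Math.min(Math.pow(2, retryCount) * baseMs, maxDelayMs);
  return Math.round(delayMs + random() * jitterMs);
}

// Schedule for retries 1..5 without jitter
var schedule = [1, 2, 3, 4, 5].map(function (n) {
  return computeRetryDelayMs(n);
});
console.log(schedule.join(', ')); // prints "2000, 4000, 8000, 16000, 32000"
```

Passing `random` as an option keeps the function testable; in a process script you would leave it unset and let `Math.random` supply the jitter.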
Pattern 2: Fallback Cascade
Try multiple data sources or services in order of preference.
Implementation
<bpmn:process id="FallbackCascade">
<!-- Try primary service -->
<bpmn:serviceTask id="PrimaryService" name="Primary Service">
<bpmn:extensionElements>
<custom:properties>
<custom:property name="ConnectorKey" value="PrimaryServiceCommand"/>
<custom:property name="ResultVariable" value="serviceResult"/>
</custom:properties>
</bpmn:extensionElements>
</bpmn:serviceTask>
<!-- Primary failed -->
<bpmn:boundaryEvent id="PrimaryFailed"
attachedToRef="PrimaryService"
cancelActivity="true">
<bpmn:errorEventDefinition />
</bpmn:boundaryEvent>
<!-- Log and try fallback -->
<bpmn:scriptTask id="LogPrimaryFailure" name="Log Primary Failure">
<bpmn:script>
logger.warn('Primary service failed, trying fallback service');
context.primaryServiceFailed = true;
</bpmn:script>
</bpmn:scriptTask>
<!-- Try fallback service -->
<bpmn:serviceTask id="FallbackService" name="Fallback Service">
<bpmn:extensionElements>
<custom:properties>
<custom:property name="ConnectorKey" value="FallbackServiceCommand"/>
<custom:property name="ResultVariable" value="serviceResult"/>
</custom:properties>
</bpmn:extensionElements>
</bpmn:serviceTask>
<!-- Fallback failed -->
<bpmn:boundaryEvent id="FallbackFailed"
attachedToRef="FallbackService"
cancelActivity="true">
<bpmn:errorEventDefinition />
</bpmn:boundaryEvent>
<!-- Log and try cache -->
<bpmn:scriptTask id="LogFallbackFailure" name="Log Fallback Failure">
<bpmn:script>
logger.warn('Fallback service failed, trying cached data');
context.fallbackServiceFailed = true;
</bpmn:script>
</bpmn:scriptTask>
<!-- Try cached data -->
<bpmn:scriptTask id="UseCachedData" name="Use Cached Data">
<bpmn:script>
if (!context.cachedServiceData) {
throw new BpmnError('NO_DATA_AVAILABLE',
'All data sources failed and no cache available');
}
var cacheAge = Date.now() - new Date(context.cacheTimestamp).getTime();
// Floor (not round) so a cache aged 23h40m is not reported as 24 hours and marked STALE
var cacheAgeHours = Math.floor(cacheAge / 1000 / 60 / 60);
logger.info('Using cached data (age: ' + cacheAgeHours + ' hours)');
context.serviceResult = context.cachedServiceData;
context.dataSource = 'CACHE';
context.dataFreshness = cacheAgeHours < 24 ? 'RECENT' : 'STALE';
context.cacheAgeHours = cacheAgeHours;
return {
usedCache: true,
cacheAgeHours: cacheAgeHours
};
</bpmn:script>
</bpmn:scriptTask>
<!-- Cache also failed -->
<bpmn:boundaryEvent id="CacheFailed"
attachedToRef="UseCachedData"
cancelActivity="true">
<bpmn:errorEventDefinition errorRef="NoDataError" />
</bpmn:boundaryEvent>
<!-- All sources failed - final error -->
<bpmn:scriptTask id="HandleCompleteFailure" name="Handle Complete Failure">
<bpmn:script>
logger.error('All data sources failed (primary, fallback, cache)');
// Send critical alert
BankLingo.ExecuteCommand('SendCriticalAlert', {
message: 'All data sources unavailable',
impact: 'CRITICAL'
});
context.dataUnavailable = true;
</bpmn:script>
</bpmn:scriptTask>
<bpmn:error id="NoDataError" errorCode="NO_DATA_AVAILABLE" />
</bpmn:process>
Use Cases
- ✅ External API unavailability
- ✅ Database failover
- ✅ Content delivery networks (CDN fallback)
- ✅ Multi-region redundancy
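Outside of BPMN, the same cascade reduces to "try each source in order and remember what failed". The following is a minimal sketch under that assumption; `fetchWithFallback` and the stubbed sources are hypothetical and stand in for the primary service, fallback service, and cache tasks above.

```javascript
// Hypothetical helper: try each source in order and return the first result.
// Each source is { name: string, fetch: function }, where fetch throws on failure.
function fetchWithFallback(sources) {
  var failures = [];
  for (var i = 0; i < sources.length; i++) {
    try {
      var data = sources[i].fetch();
      // Record which source answered so callers can mark the data accordingly
      return { data: data, dataSource: sources[i].name, failures: failures };
    } catch (e) {
      failures.push(sources[i].name + ': ' + e.message);
    }
  }
  throw new Error('NO_DATA_AVAILABLE (' + failures.join('; ') + ')');
}

// Usage with stubbed sources: primary and fallback fail, cache succeeds.
var result = fetchWithFallback([
  { name: 'PRIMARY', fetch: function () { throw new Error('timeout'); } },
  { name: 'FALLBACK', fetch: function () { throw new Error('503'); } },
  { name: 'CACHE', fetch: function () { return { rate: 1.25 }; } }
]);
console.log(result.dataSource, result.failures.length); // prints "CACHE 2"
```

Returning `dataSource` alongside the data mirrors `context.dataSource` in the Use Cached Data script, so downstream steps can tell fresh data from cached data.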
Pattern 3: Circuit Breaker
Prevent cascading failures by "opening the circuit" after repeated failures.
Implementation
<bpmn:scriptTask id="CheckCircuitBreaker" name="Check Circuit Breaker">
<bpmn:script>
var serviceName = 'PaymentGateway';
var circuitKey = 'circuit_' + serviceName;
// Circuit state: CLOSED, OPEN, HALF_OPEN
var circuitState = context[circuitKey + '_state'] || 'CLOSED';
var failureCount = context[circuitKey + '_failures'] || 0;
var lastFailureTime = context[circuitKey + '_lastFailure'];
// Configuration
var failureThreshold = 10; // Open after 10 failures
var resetTimeout = 300000; // Try half-open after 5 minutes
var successThreshold = 3; // Close after 3 successes in half-open
context[circuitKey + '_config'] = {
failureThreshold: failureThreshold,
resetTimeout: resetTimeout,
successThreshold: successThreshold
};
if (circuitState === 'OPEN') {
// Circuit is open - check if we should try half-open
var now = Date.now();
var timeSinceFailure = now - new Date(lastFailureTime).getTime();
if (timeSinceFailure > resetTimeout) {
// Try half-open
context[circuitKey + '_state'] = 'HALF_OPEN';
context[circuitKey + '_halfOpenSuccesses'] = 0;
logger.info('Circuit breaker HALF_OPEN for ' + serviceName + ' (testing)');
} else {
// Still open - fail fast
var remainingMs = resetTimeout - timeSinceFailure;
var remainingMin = Math.ceil(remainingMs / 1000 / 60);
throw new BpmnError('CIRCUIT_BREAKER_OPEN',
serviceName + ' circuit breaker is OPEN. ' +
'Service unavailable for ' + remainingMin + ' more minutes.');
}
}
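return {
// Fall back to the local value: context has no state entry until the first transition
circuitState: context[circuitKey + '_state'] || circuitState,
failureCount: failureCount
};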
</bpmn:script>
</bpmn:scriptTask>
<!-- Circuit is closed or half-open - proceed with call -->
<bpmn:serviceTask id="CallProtectedService" name="Call Service">
<bpmn:extensionElements>
<custom:properties>
<custom:property name="ConnectorKey" value="PaymentGatewayCommand"/>
<custom:property name="ResultVariable" value="paymentResult"/>
</custom:properties>
</bpmn:extensionElements>
</bpmn:serviceTask>
<!-- Handle success -->
<bpmn:scriptTask id="HandleServiceSuccess" name="Handle Success">
<bpmn:script>
var serviceName = 'PaymentGateway';
var circuitKey = 'circuit_' + serviceName;
var circuitState = context[circuitKey + '_state'];
if (circuitState === 'HALF_OPEN') {
// Increment success count in half-open
var successCount = (context[circuitKey + '_halfOpenSuccesses'] || 0) + 1;
context[circuitKey + '_halfOpenSuccesses'] = successCount;
var successThreshold = context[circuitKey + '_config'].successThreshold;
if (successCount >= successThreshold) {
// Enough successes - close circuit
context[circuitKey + '_state'] = 'CLOSED';
context[circuitKey + '_failures'] = 0;
logger.info('Circuit breaker CLOSED for ' + serviceName + ' (recovered)');
} else {
logger.info('Circuit breaker HALF_OPEN: ' + successCount + ' of ' +
successThreshold + ' successes');
}
}
return { success: true };
</bpmn:script>
</bpmn:scriptTask>
<!-- Handle failure -->
<bpmn:boundaryEvent id="ServiceFailed"
attachedToRef="CallProtectedService"
cancelActivity="true">
<bpmn:errorEventDefinition />
</bpmn:boundaryEvent>
<bpmn:scriptTask id="HandleServiceFailure" name="Handle Failure">
<bpmn:script>
var serviceName = 'PaymentGateway';
var circuitKey = 'circuit_' + serviceName;
var circuitState = context[circuitKey + '_state'];
var error = context._lastError;
logger.error('Service call failed: ' + error.errorMessage);
if (circuitState === 'HALF_OPEN') {
// Failure in half-open - reopen circuit
context[circuitKey + '_state'] = 'OPEN';
context[circuitKey + '_lastFailure'] = new Date().toISOString();
logger.warn('Circuit breaker OPEN for ' + serviceName + ' (half-open test failed)');
} else {
// Increment failure count
var failureCount = (context[circuitKey + '_failures'] || 0) + 1;
context[circuitKey + '_failures'] = failureCount;
context[circuitKey + '_lastFailure'] = new Date().toISOString();
var failureThreshold = context[circuitKey + '_config'].failureThreshold;
if (failureCount >= failureThreshold) {
// Open circuit
context[circuitKey + '_state'] = 'OPEN';
logger.error('Circuit breaker OPEN for ' + serviceName +
' (threshold reached: ' + failureCount + ' failures)');
// Send alert
BankLingo.ExecuteCommand('SendAlert', {
severity: 'HIGH',
message: 'Circuit breaker opened for ' + serviceName,
failureCount: failureCount
});
} else {
logger.warn('Circuit breaker: ' + failureCount + ' of ' +
failureThreshold + ' failures');
}
}
// Re-throw error for process handling
throw new BpmnError(error.errorCode, error.errorMessage);
</bpmn:script>
</bpmn:scriptTask>
Circuit States
| State | Description | Behavior |
|---|---|---|
| CLOSED | Normal operation | Calls proceed normally |
| OPEN | Too many failures | Fail fast without calling service |
| HALF_OPEN | Testing recovery | Allow limited calls to test if service recovered |
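The transitions between these states can be summarized as one pure function, which is easier to test than three separate scripts. This is a sketch under assumptions: `nextBreaker`, the event names `CHECK`, `SUCCESS`, and `FAILURE`, and the shape of the `breaker` object are hypothetical, and timestamps are plain milliseconds rather than the ISO strings the scripts store.

```javascript
// Hypothetical pure transition function for the three states in the table.
// breaker: { state, failures, halfOpenSuccesses, lastFailureAt } (ms timestamps)
// event: 'CHECK' (before a call), 'SUCCESS', or 'FAILURE'
function nextBreaker(breaker, event, nowMs, config) {
  var b = {
    state: breaker.state || 'CLOSED',
    failures: breaker.failures || 0,
    halfOpenSuccesses: breaker.halfOpenSuccesses || 0,
    lastFailureAt: breaker.lastFailureAt || 0
  };
  if (event === 'CHECK') {
    // OPEN -> HALF_OPEN once the reset timeout has elapsed
    if (b.state === 'OPEN' && nowMs - b.lastFailureAt > config.resetTimeout) {
      b.state = 'HALF_OPEN';
      b.halfOpenSuccesses = 0;
    }
  } else if (event === 'SUCCESS') {
    if (b.state === 'HALF_OPEN') {
      b.halfOpenSuccesses += 1;
      if (b.halfOpenSuccesses >= config.successThreshold) {
        b.state = 'CLOSED';
        b.failures = 0;
      }
    }
  } else if (event === 'FAILURE') {
    b.lastFailureAt = nowMs;
    if (b.state === 'HALF_OPEN') {
      b.state = 'OPEN'; // a failed test call reopens the circuit
    } else {
      b.failures += 1;
      if (b.failures >= config.failureThreshold) {
        b.state = 'OPEN';
      }
    }
  }
  return b;
}

var config = { failureThreshold: 2, resetTimeout: 1000, successThreshold: 1 };
var breaker = {};
breaker = nextBreaker(breaker, 'FAILURE', 0, config);    // CLOSED, 1 failure
breaker = nextBreaker(breaker, 'FAILURE', 10, config);   // threshold reached
var afterFailures = breaker.state;
breaker = nextBreaker(breaker, 'CHECK', 500, config);    // timeout not elapsed
var beforeTimeout = breaker.state;
breaker = nextBreaker(breaker, 'CHECK', 2000, config);   // 2000 - 10 > 1000
var afterTimeout = breaker.state;
breaker = nextBreaker(breaker, 'SUCCESS', 2001, config); // one success closes it
console.log(afterFailures, beforeTimeout, afterTimeout, breaker.state);
// prints "OPEN OPEN HALF_OPEN CLOSED"
```

The small thresholds are chosen only to keep the walk-through short; the configuration in the Check Circuit Breaker script uses 10 failures, a 5 minute reset timeout, and 3 successes.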
Use Cases
- ✅ External payment gateways
- ✅ Third-party APIs
- ✅ Microservice dependencies
- ✅ Database connections
Pattern 4: Compensating Transactions
Undo partially completed work when an error occurs.
Implementation
<bpmn:process id="BookingWithCompensation">
<!-- Step 1: Reserve flight -->
<bpmn:serviceTask id="ReserveFlight" name="Reserve Flight">
<bpmn:extensionElements>
<custom:properties>
<custom:property name="ConnectorKey" value="ReserveFlightCommand"/>
<custom:property name="ResultVariable" value="flightReservation"/>
</custom:properties>
</bpmn:extensionElements>
</bpmn:serviceTask>
<!-- Step 2: Reserve hotel -->
<bpmn:serviceTask id="ReserveHotel" name="Reserve Hotel">
<bpmn:extensionElements>
<custom:properties>
<custom:property name="ConnectorKey" value="ReserveHotelCommand"/>
<custom:property name="ResultVariable" value="hotelReservation"/>
</custom:properties>
</bpmn:extensionElements>
</bpmn:serviceTask>
<!-- Hotel reservation failed -->
<bpmn:boundaryEvent id="HotelReservationFailed"
attachedToRef="ReserveHotel"
cancelActivity="true">
<bpmn:errorEventDefinition />
</bpmn:boundaryEvent>
<!-- Compensate: Cancel flight -->
<bpmn:scriptTask id="CancelFlight" name="Cancel Flight Reservation">
<bpmn:script>
var error = context._lastError;
logger.warn('Hotel reservation failed: ' + error.errorMessage);
logger.info('Compensating: Canceling flight reservation');
// Cancel the flight reservation
BankLingo.ExecuteCommand('CancelFlightReservation', {
reservationId: context.flightReservation.reservationId
});
context.compensationPerformed = true;
context.compensationReason = 'Hotel reservation failed';
return {
flightCanceled: true,
reason: error.errorMessage
};
</bpmn:script>
</bpmn:scriptTask>
<!-- Step 3: Charge payment -->
<bpmn:serviceTask id="ChargePayment" name="Charge Payment">
<bpmn:extensionElements>
<custom:properties>
<custom:property name="ConnectorKey" value="ChargePaymentCommand"/>
<custom:property name="ResultVariable" value="paymentResult"/>
</custom:properties>
</bpmn:extensionElements>
</bpmn:serviceTask>
<!-- Payment failed -->
<bpmn:boundaryEvent id="PaymentFailed"
attachedToRef="ChargePayment"
cancelActivity="true">
<bpmn:errorEventDefinition />
</bpmn:boundaryEvent>
<!-- Compensate: Cancel both flight and hotel -->
<bpmn:scriptTask id="CancelAllReservations" name="Cancel All Reservations">
<bpmn:script>
var error = context._lastError;
logger.warn('Payment failed: ' + error.errorMessage);
logger.info('Compensating: Canceling all reservations');
// Cancel flight
BankLingo.ExecuteCommand('CancelFlightReservation', {
reservationId: context.flightReservation.reservationId
});
// Cancel hotel
BankLingo.ExecuteCommand('CancelHotelReservation', {
reservationId: context.hotelReservation.reservationId
});
// Notify customer
BankLingo.ExecuteCommand('SendEmail', {
to: context.customerEmail,
subject: 'Booking Failed - Payment Issue',
body: 'Your booking could not be completed due to payment failure. ' +
'All reservations have been canceled. ' + error.errorMessage
});
context.compensationPerformed = true;
context.compensationReason = 'Payment failed';
return {
allCanceled: true,
customerNotified: true
};
</bpmn:script>
</bpmn:scriptTask>
</bpmn:process>
Use Cases
- ✅ Multi-step bookings/reservations
- ✅ Distributed transactions
- ✅ Saga pattern implementation
- ✅ Financial transactions with multiple steps
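When the number of steps grows, writing one compensation script per failure point becomes repetitive. One common alternative, sketched here as an assumption rather than as something the engine provides, is to register an undo action with each step and replay the completed ones in reverse order. `runSaga` and the step shape `{ name, run, compensate }` are hypothetical.

```javascript
// Hypothetical saga runner: each step is { name, run, compensate }.
// On failure, compensations for completed steps run in reverse order.
function runSaga(steps) {
  var completed = [];
  var log = [];
  for (var i = 0; i < steps.length; i++) {
    try {
      steps[i].run();
      completed.push(steps[i]);
      log.push('done:' + steps[i].name);
    } catch (e) {
      log.push('failed:' + steps[i].name);
      // Undo the most recent successful step first
      while (completed.length > 0) {
        var step = completed.pop();
        step.compensate();
        log.push('undone:' + step.name);
      }
      return { success: false, log: log };
    }
  }
  return { success: true, log: log };
}

function noop() {}
var outcome = runSaga([
  { name: 'flight', run: noop, compensate: noop },
  { name: 'hotel', run: noop, compensate: noop },
  { name: 'payment', run: function () { throw new Error('declined'); }, compensate: noop }
]);
console.log(outcome.log.join(' > '));
// prints "done:flight > done:hotel > failed:payment > undone:hotel > undone:flight"
```

The sketch does not handle a compensation that itself fails; in practice that case usually needs its own retry or an escalation to manual review.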
Pattern 5: Manual Intervention
Escalate complex errors to human operators.
Implementation
<bpmn:serviceTask id="AutomatedProcessing" name="Automated Processing">
<bpmn:extensionElements>
<custom:properties>
<custom:property name="ConnectorKey" value="ProcessDataCommand"/>
</custom:properties>
</bpmn:extensionElements>
</bpmn:serviceTask>
<!-- Catch all errors -->
<bpmn:boundaryEvent id="ProcessingError"
attachedToRef="AutomatedProcessing"
cancelActivity="true">
<bpmn:errorEventDefinition />
</bpmn:boundaryEvent>
<!-- Determine if manual intervention needed -->
<bpmn:scriptTask id="AnalyzeError" name="Analyze Error">
<bpmn:script>
var error = context._lastError;
// Categorize error
var errorCategory = 'UNKNOWN';
var requiresManualIntervention = false;
if (error.errorCode === 'DATA_INCONSISTENCY' ||
error.errorCode === 'MISSING_DATA' ||
error.errorCode === 'INVALID_DATA') {
errorCategory = 'DATA_QUALITY';
requiresManualIntervention = true;
} else if (error.errorCode === 'BUSINESS_RULE_VIOLATION') {
errorCategory = 'BUSINESS_RULE';
requiresManualIntervention = true;
} else if (error.errorCode === 'GATEWAY_TIMEOUT') {
errorCategory = 'TRANSIENT';
requiresManualIntervention = false; // Can retry
}
context.errorCategory = errorCategory;
context.requiresManualIntervention = requiresManualIntervention;
logger.info('Error categorized as: ' + errorCategory +
', Manual intervention: ' + requiresManualIntervention);
return {
category: errorCategory,
manual: requiresManualIntervention
};
</bpmn:script>
</bpmn:scriptTask>
<!-- Gateway: Manual intervention needed? -->
<bpmn:exclusiveGateway id="NeedsManualReview" name="Needs Manual Review?"/>
<bpmn:sequenceFlow sourceRef="NeedsManualReview" targetRef="CreateManualTask">
<bpmn:conditionExpression xsi:type="bpmn:tFormalExpression">
${context.requiresManualIntervention === true}
</bpmn:conditionExpression>
</bpmn:sequenceFlow>
<!-- Create manual review task -->
<bpmn:userTask id="CreateManualTask" name="Manual Error Review">
<bpmn:extensionElements>
<custom:properties>
<custom:property name="FormKey" value="error-review-form"/>
<custom:property name="ResponsibleTeams" value="Operations,Support"/>
<custom:property name="Priority" value="HIGH"/>
<custom:property name="Description"><![CDATA[
Automated processing failed with error that requires manual review.
Error Code: {{context._lastError.errorCode}}
Error Message: {{context._lastError.errorMessage}}
Error Category: {{context.errorCategory}}
Please review and take appropriate action.
]]></custom:property>
<custom:property name="PreScript"><![CDATA[
// Prepare context for form
formContext.error = context._lastError;
formContext.errorCategory = context.errorCategory;
formContext.processData = {
customerId: context.customerId,
transactionId: context.transactionId,
amount: context.amount
};
]]></custom:property>
<custom:property name="ServerScript"><![CDATA[
// Process operator's decision
context.operatorDecision = formData.decision; // 'RETRY', 'SKIP', 'CANCEL'
context.operatorComments = formData.comments;
context.reviewedBy = formData.userId;
context.reviewedAt = new Date().toISOString();
return {
decision: formData.decision,
reviewedBy: formData.userId
};
]]></custom:property>
</custom:properties>
</bpmn:extensionElements>
</bpmn:userTask>
<!-- Process operator decision -->
<bpmn:exclusiveGateway id="OperatorDecision" name="Operator Decision"/>
<!-- Retry -->
<bpmn:sequenceFlow sourceRef="OperatorDecision" targetRef="AutomatedProcessing">
<bpmn:conditionExpression xsi:type="bpmn:tFormalExpression">
${context.operatorDecision === 'RETRY'}
</bpmn:conditionExpression>
</bpmn:sequenceFlow>
<!-- Skip -->
<bpmn:sequenceFlow sourceRef="OperatorDecision" targetRef="SkipAndContinue">
<bpmn:conditionExpression xsi:type="bpmn:tFormalExpression">
${context.operatorDecision === 'SKIP'}
</bpmn:conditionExpression>
</bpmn:sequenceFlow>
<!-- Cancel -->
<bpmn:sequenceFlow sourceRef="OperatorDecision" targetRef="CancelProcess">
<bpmn:conditionExpression xsi:type="bpmn:tFormalExpression">
${context.operatorDecision === 'CANCEL'}
</bpmn:conditionExpression>
</bpmn:sequenceFlow>
Use Cases
- ✅ Complex data quality issues
- ✅ Business rule exceptions
- ✅ Ambiguous situations requiring judgment
- ✅ High-value transactions with errors
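The if/else chain in the Analyze Error script can also be expressed as a lookup table, which keeps the mapping in one place as new error codes are added. This is a sketch: `ERROR_CATEGORIES` and `categorizeError` are hypothetical names, and the entries simply mirror the codes used in the script above.

```javascript
// Hypothetical lookup table mirroring the Analyze Error script.
var ERROR_CATEGORIES = {
  DATA_INCONSISTENCY: { category: 'DATA_QUALITY', manual: true },
  MISSING_DATA: { category: 'DATA_QUALITY', manual: true },
  INVALID_DATA: { category: 'DATA_QUALITY', manual: true },
  BUSINESS_RULE_VIOLATION: { category: 'BUSINESS_RULE', manual: true },
  GATEWAY_TIMEOUT: { category: 'TRANSIENT', manual: false }
};

function categorizeError(errorCode) {
  if (Object.prototype.hasOwnProperty.call(ERROR_CATEGORIES, errorCode)) {
    return ERROR_CATEGORIES[errorCode];
  }
  // Same default as the script: unknown codes are not escalated
  return { category: 'UNKNOWN', manual: false };
}

console.log(categorizeError('MISSING_DATA').category); // prints "DATA_QUALITY"
console.log(categorizeError('SOMETHING_ELSE').manual); // prints "false"
```

Whether `UNKNOWN` errors should go to manual review is a policy choice; the default here follows the original script and leaves them unescalated.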
Best Practices
✅ Do This
// ✅ Track retry attempts
var maxRetries = 5;
var retryCount = (context.retryCount || 0) + 1;
context.retryCount = retryCount;
// ✅ Set max retries (check before scheduling another attempt)
if (retryCount > maxRetries) {
  throw new BpmnError('RETRY_EXHAUSTED', 'Failed after ' + maxRetries + ' attempts');
}
// ✅ Use exponential backoff
var delayMs = Math.pow(2, retryCount) * 1000;
// ✅ Add jitter to prevent thundering herd
delayMs += Math.random() * 500;
// ✅ Log all recovery attempts
logger.info('Recovery attempt ' + retryCount + ' of ' + maxRetries +
  ', delay ' + Math.round(delayMs) + 'ms');
// ✅ Send alerts on critical failures
BankLingo.ExecuteCommand('SendAlert', { severity: 'HIGH', message: '...' });
// ✅ Track circuit breaker state
context.circuitState = 'OPEN';
context.circuitLastFailure = new Date().toISOString();
❌ Don't Do This
// ❌ Infinite retries
while (true) {
try { /* ... */ } catch(e) { /* retry forever */ }
}
// ❌ No retry delays (thundering herd)
for (var i = 0; i < 10; i++) {
try { callAPI(); break; } catch(e) { /* immediate retry */ }
}
// ❌ No error logging
try { callAPI(); } catch (error) {
  // No logging - impossible to debug
}
// ❌ No alerting on critical failures
if (allSourcesFailed) {
return null; // Silent failure
}
// ❌ No compensation for partial work
// Transaction 1: Success
// Transaction 2: Success
// Transaction 3: Failed
// -> No rollback of 1 and 2
Error Recovery Decision Tree
Error Occurred
├─ Is it transient? (timeout, network, etc.)
│  ├─ YES → Retry with exponential backoff
│  │  └─ Max retries reached?
│  │     ├─ YES → Try fallback
│  │     └─ NO → Continue retrying
│  └─ NO → Continue
│
├─ Is alternative available? (fallback service, cache)
│  ├─ YES → Try fallback
│  │  └─ Fallback failed?
│  │     ├─ YES → Circuit breaker or manual intervention
│  │     └─ NO → Use fallback data (mark as such)
│  └─ NO → Continue
│
├─ Is partial work done?
│  ├─ YES → Compensate (undo partial work)
│  └─ NO → Continue
│
├─ Is circuit breaker applicable?
│  ├─ YES → Open circuit, fail fast for duration
│  └─ NO → Continue
│
└─ Requires human judgment?
   ├─ YES → Create manual review task
   └─ NO → Log and fail process
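One possible linear reading of the tree, checking each question in order and returning the first action that applies, can be sketched as a small function. The flag names and return values are hypothetical, and the sketch simplifies the tree: a failed fallback falls through to the later checks instead of jumping straight to the circuit breaker or manual review.

```javascript
// Hypothetical encoding of the decision tree. Flags describe the failure;
// the function returns the first recovery action that applies, top to bottom.
function chooseRecovery(f) {
  if (f.transient && !f.retriesExhausted) return 'RETRY_WITH_BACKOFF';
  if (f.fallbackAvailable && !f.fallbackFailed) return 'USE_FALLBACK';
  if (f.partialWorkDone) return 'COMPENSATE';
  if (f.circuitBreakerApplicable) return 'OPEN_CIRCUIT';
  if (f.needsHumanJudgment) return 'MANUAL_REVIEW';
  return 'LOG_AND_FAIL';
}

console.log(chooseRecovery({ transient: true }));
// prints "RETRY_WITH_BACKOFF"
console.log(chooseRecovery({ transient: true, retriesExhausted: true, fallbackAvailable: true }));
// prints "USE_FALLBACK"
console.log(chooseRecovery({}));
// prints "LOG_AND_FAIL"
```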
Related Documentation
- Error Handling Overview - Complete error handling guide
- Boundary Error Events - Error boundary patterns
- JavaScript Error Throwing - BpmnError constructor
- Timer Events - Timeout patterns for retry delays
- User Task - Manual intervention patterns
Features Used:
- Phase 5: Error Handling
- Phase 4: Timer Events (retry delays)
- Phase 2: Async Boundaries (background compensation)
Status: ✅ Production Ready
Version: 2.0
Last Updated: January 2026