The Challenge
When automating user workflows on third-party websites, we face several key challenges:
- State Consistency: User actions often depend on previous state (e.g., clicking a button that only appears after a form submission)
- Session Management: Browser sessions can expire or become invalid between recording and replay
- Dynamic Content: Modern web apps render content dynamically, making traditional DOM-based replay unreliable
- Error Recovery: Network issues or timing problems can cause actions to fail during replay
Our Approach: Event-Sourced Action Replay
Rather than treating user actions as a simple sequence of DOM events, we model them as an event-sourced stream of state transitions. Each action is recorded with its complete state context and preconditions.
Here's a simplified example of our action recording format:
interface ActionEvent {
  type: 'click' | 'input' | 'submit' | 'navigation';
  target: {
    selector: string;
    attributes: Record<string, string>;
    stateHash: string; // Hash of relevant DOM state
  };
  preconditions: {
    visible: boolean;
    enabled: boolean;
    stateMatchers: Array<StateMatcher>;
  };
  timestamp: number;
  sessionContext: SessionContext;
}

interface StateMatcher {
  selector: string;
  condition: 'exists' | 'contains' | 'matches';
  value: string;
}
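To make the matcher semantics concrete, here is a minimal sketch of how a single StateMatcher might be evaluated. It assumes the text content of the matched element has already been read from the page (or is null when the selector matches nothing), and it assumes that 'matches' means a regular-expression test. The recording format above does not specify either, and evaluateMatcher is a hypothetical helper name.

```typescript
interface StateMatcher {
  selector: string;
  condition: 'exists' | 'contains' | 'matches';
  value: string;
}

// Hypothetical helper, not part of the recording format itself.
// `elementText` is the text content of the first element matching
// `matcher.selector`, or null if no element matched.
function evaluateMatcher(matcher: StateMatcher, elementText: string | null): boolean {
  if (elementText === null) {
    return false; // nothing matched the selector, so no condition can hold
  }
  switch (matcher.condition) {
    case 'exists':
      return true;
    case 'contains':
      return elementText.indexOf(matcher.value) !== -1;
    case 'matches':
      // Assumption: `value` holds a regular-expression source string.
      // An invalid pattern would throw here.
      return new RegExp(matcher.value).test(elementText);
    default:
      return false;
  }
}
```

Keeping the evaluation separate from the DOM lookup means the same function can be exercised against recorded text in tests, without a live browser.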
The Replay Engine
Our replay engine uses a state machine approach to handle action replay. Rather than blindly executing actions in sequence, it:
- Validates preconditions before each action
- Maintains session state and handles re-authentication
- Implements exponential backoff and retry logic
- Records detailed telemetry for debugging
Here's a simplified version of our replay logic:
class ActionReplayEngine {
  async replayAction(action: ActionEvent): Promise<boolean> {
    // Verify session is still valid
    if (!await this.validateSession(action.sessionContext)) {
      await this.refreshSession();
    }

    // Wait for preconditions with exponential backoff
    await this.waitForPreconditions(action.preconditions, {
      maxAttempts: 3,
      baseDelay: 1000
    });

    // Verify state hash matches recording
    const currentHash = await this.computeStateHash(action.target.selector);
    if (currentHash !== action.target.stateHash) {
      throw new StateHashMismatchError();
    }

    // Execute the action
    await this.executeAction(action);

    return true;
  }
}
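The waiting step is not shown in the simplified engine, so the following is only a sketch of how exponential backoff could work with the maxAttempts and baseDelay options used above. The names waitWithBackoff, backoffDelay, and sleep are hypothetical, and the doubling factor is an assumption; the real engine may use a different multiplier or add jitter.

```typescript
interface BackoffOptions {
  maxAttempts: number;
  baseDelay: number; // milliseconds
}

// Delay before the retry that follows failed attempt `attempt` (0-based):
// baseDelay, then 2x, then 4x, and so on. Doubling is an assumption.
function backoffDelay(attempt: number, baseDelay: number): number {
  return baseDelay * Math.pow(2, attempt);
}

function sleep(ms: number): Promise<void> {
  return new Promise<void>((resolve) => setTimeout(resolve, ms));
}

// Hypothetical helper: polls `check` until it resolves to true or the
// attempts run out, in which case it throws.
async function waitWithBackoff(
  check: () => Promise<boolean>,
  options: BackoffOptions
): Promise<void> {
  for (let attempt = 0; attempt < options.maxAttempts; attempt++) {
    if (await check()) {
      return;
    }
    // No point sleeping after the final failed attempt.
    if (attempt < options.maxAttempts - 1) {
      await sleep(backoffDelay(attempt, options.baseDelay));
    }
  }
  throw new Error('Preconditions not met after ' + options.maxAttempts + ' attempts');
}
```

With the values from the replay logic above (maxAttempts: 3, baseDelay: 1000), this sketch would check up to three times, pausing 1 second and then 2 seconds in between, before giving up.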
State Synchronization
One key insight was that we needed to synchronize state at multiple levels:
- DOM State: The visible page structure and content
- JavaScript State: The application's internal state
- Network State: Active XHR requests and WebSocket connections
- Storage State: Cookies, localStorage, and sessionStorage
To handle this, we introduced what we call "state checkpoints": snapshots of all the relevant state that must be in sync before an action can proceed. A simplified checkpoint looks like this:
interface StateCheckpoint {
  dom: {
    snapshot: string;
    criticalSelectors: string[];
  };
  storage: {
    cookies: Record<string, string>;
    localStorage: Record<string, string>;
  };
  network: {
    activeRequests: string[];
    wsConnections: WebSocketState[];
  };
}
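As an illustration of what "synchronized" could mean for the storage portion of a checkpoint, here is a minimal sketch that compares a recorded key-value map (such as cookies or localStorage) with the current one. The StateDiff shape and the diffKeyValueState and isInSync helpers are hypothetical; the post does not describe how the comparison is actually performed.

```typescript
type KeyValueState = Record<string, string>;

interface StateDiff {
  added: string[];   // keys present only in the current state
  removed: string[]; // keys present only in the recorded state
  changed: string[]; // keys present in both, with different values
}

function hasKey(state: KeyValueState, key: string): boolean {
  return Object.prototype.hasOwnProperty.call(state, key);
}

// Hypothetical helper: compares recorded storage against current storage.
function diffKeyValueState(recorded: KeyValueState, current: KeyValueState): StateDiff {
  const diff: StateDiff = { added: [], removed: [], changed: [] };
  for (const key of Object.keys(recorded)) {
    if (!hasKey(current, key)) {
      diff.removed.push(key);
    } else if (recorded[key] !== current[key]) {
      diff.changed.push(key);
    }
  }
  for (const key of Object.keys(current)) {
    if (!hasKey(recorded, key)) {
      diff.added.push(key);
    }
  }
  return diff;
}

// The two states are in sync when the diff is empty.
function isInSync(diff: StateDiff): boolean {
  return diff.added.length === 0 && diff.removed.length === 0 && diff.changed.length === 0;
}
```

In practice some keys (for example, rotating session tokens) would probably need to be excluded from the comparison, which is one reason to return a structured diff rather than a single boolean.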
Error Recovery and Debugging
To make debugging easier when replays fail, we built comprehensive telemetry into our system. Each replay attempt generates a trace that includes:
- Timing data for each action and wait period
- Screenshots at key points
- Network request logs
- Console output
- State checkpoint diffs
This data is stored in a structured format that makes it easy to identify exactly where and why a replay failed.
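The post does not show the trace format, so the following is a purely hypothetical sketch of what a structured trace could look like and how the first failing step might be located. All type and field names (ReplayTrace, TraceStep, findFirstFailure, and so on) are assumptions made for illustration.

```typescript
// Hypothetical trace shape; field names are illustrative only.
interface TraceStep {
  actionType: 'click' | 'input' | 'submit' | 'navigation';
  selector: string;
  startedAt: number;       // epoch milliseconds
  durationMs: number;      // includes time spent waiting on preconditions
  attempts: number;        // how many tries the step needed
  error?: string;          // set only when the step failed
  screenshotPath?: string; // optional screenshot captured at this step
}

interface ReplayTrace {
  replayId: string;
  steps: TraceStep[];
}

// Returns the first failed step and its position, or null if every step succeeded.
function findFirstFailure(trace: ReplayTrace): { index: number; step: TraceStep } | null {
  for (let i = 0; i < trace.steps.length; i++) {
    const step = trace.steps[i];
    if (step && step.error !== undefined) {
      return { index: i, step: step };
    }
  }
  return null;
}
```

Recording the attempt count and duration per step alongside the error makes it possible to distinguish a step that failed immediately from one that exhausted its retries.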
Results and Lessons Learned
This approach has proven robust in production, with several key benefits:
- Reliability: Our action replay success rate improved from ~80% to >98%
- Debuggability: Mean time to resolve replay failures decreased by 65%
- Maintainability: The event-sourced model makes it easier to extend the system
Some key lessons learned:
- State synchronization is more important than perfect action replay
- Exponential backoff and retry logic are essential for reliability
- Comprehensive telemetry is worth the overhead
Future Work
We're currently working on several improvements:
- Machine learning for automatic retry strategies
- Parallel action replay for independent state changes
- Predictive state preloading to improve performance
The challenges of reliable action replay in modern web applications are complex, but our event-sourced approach with careful state management has proven effective at scale.
Note: This post was written by the Anon Engineering team. To learn more about building on Anon's platform, visit our Developer Docs.