Platform
Execution
What the edge agent does: supervising the control runtime near the machines, bridging industrial protocols, and forwarding telemetry reliably.
Design intent
Use this lens when implementing Execution across a fleet: define clear boundaries, make changes snapshot-based, and keep operational signals observable.
What it is
The edge agent runs near the machines, supervises the control runtime, bridges industrial protocols, and forwards telemetry upstream.
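A minimal sketch of that division of responsibility, using hypothetical component names (RuntimeSupervisor, ModbusAdapter, TelemetryForwarder are illustrative, not the agent's real API):

import subprocess

class RuntimeSupervisor:
    # Keeps the local control runtime process alive.
    def __init__(self, command):
        self.command = command            # e.g. ["/opt/runtime/bin/run"] (hypothetical path)
        self.process = None

    def ensure_running(self):
        if self.process is None or self.process.poll() is not None:
            self.process = subprocess.Popen(self.command)

class ModbusAdapter:
    # Placeholder protocol bridge: polls a field device and returns samples.
    def poll(self):
        return []                         # a real adapter returns (tag, value, timestamp) tuples

class TelemetryForwarder:
    # Buffers samples locally and forwards them when the uplink is available.
    def __init__(self):
        self.buffer = []
    def enqueue(self, sample):
        self.buffer.append(sample)
    def flush(self, uplink_ok):
        if uplink_ok:
            self.buffer.clear()           # a real forwarder sends upstream before clearing

def tick(supervisor, adapters, forwarder, uplink_ok):
    # One agent cycle: supervise the runtime, bridge protocols, forward telemetry.
    supervisor.ensure_running()
    for adapter in adapters:
        for sample in adapter.poll():
            forwarder.enqueue(sample)
    forwarder.flush(uplink_ok)

The point is the separation: local control keeps executing even if flush() never succeeds.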
Design constraints
- The edge should keep control executing even when the cloud is down
- Supervision + safe restart policies prevent unattended crash loops
- Store-and-forward telemetry preserves evidence across outages
Architecture at a glance
- Define a stable artifact boundary (what you deploy) and a stable signal boundary (what you observe)
- Treat changes as versioned, testable, rollbackable units (see the snapshot sketch after this list)
- Use health + telemetry gates to scale safely
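One way to make both boundaries concrete is to model them as small, versioned records. A sketch in Python; the field names are illustrative assumptions, not the platform's schema:

from dataclasses import dataclass

@dataclass(frozen=True)
class Snapshot:
    # Artifact boundary: a versioned, testable, rollbackable unit of change.
    version: str              # e.g. "2024.06.0"
    runtime_image: str        # what gets deployed
    adapter_config: dict      # site-configurable protocol settings
    accepted: bool = False    # flipped after acceptance tests pass on a canary

@dataclass(frozen=True)
class SignalBoundary:
    # Signal boundary: what you observe after deploying a snapshot.
    health_checks: tuple      # e.g. ("runtime_alive", "adapters_connected")
    telemetry_streams: tuple  # streams expected to stay fresh post-deploy
    events: tuple             # deploy/rollback events tied to Snapshot.version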
Typical workflow
- Define scope and success criteria (what should change, what must stay stable)
- Create or update a snapshot, then validate against a canary environment/site
- Deploy progressively with health/telemetry gates and explicit rollback criteria (sketched after this list)
- Confirm acceptance tests and operational dashboards before expanding
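A sketch of the gated expansion step, assuming the platform supplies the deploy, gate, and rollback operations (the function names here are placeholders):

def progressive_rollout(waves, deploy, gates_pass, rollback):
    # waves: ordered groups of sites, the first being the canary.
    for wave in waves:
        for site in wave:
            deploy(site)
        if not all(gates_pass(site) for site in wave):
            # Stop expansion and restore the known-good snapshot on this wave.
            for site in wave:
                rollback(site)
            return False
    return True

def gates_pass_example(status):
    # Example gate: healthy, telemetry fresh, and reported version matches intent.
    return (status["healthy"]
            and status["telemetry_fresh_seconds"] < 300
            and status["reported_version"] == status["intended_version"])

Expansion stops on the first wave that fails its gate, which is the behavior the rollout guidance below asks for.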
System boundary
Treat Execution as a repeatable interface between engineering intent (design) and runtime reality (deployments + signals). Keep site-specific details configurable so the same design scales across sites.
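A small illustration of that split, with invented site names and keys: the shared design stays identical everywhere, and only the site layer varies.

SHARED_DESIGN = {
    "runtime_image": "runtime:2024.06.0",
    "telemetry": {"batch_size": 500, "flush_interval_s": 10},
}

SITE_OVERRIDES = {
    "plant-a": {"adapters": [{"protocol": "modbus-tcp", "host": "10.0.0.5"}]},
    "plant-b": {"adapters": [{"protocol": "opc-ua", "endpoint": "opc.tcp://10.1.0.9:4840"}],
                "telemetry": {"flush_interval_s": 30}},  # slower uplink at this site
}

def effective_config(site):
    # Site values override shared defaults; everything else is identical across sites.
    overrides = SITE_OVERRIDES.get(site, {})
    config = {**SHARED_DESIGN, **overrides}
    config["telemetry"] = {**SHARED_DESIGN["telemetry"], **overrides.get("telemetry", {})}
    return config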
Example artifact
Implementation notes (conceptual)
topic: Execution
plan: define -> snapshot -> canary -> expand
signals: health + telemetry + events tied to version
rollback: select known-good snapshot
Why it matters
- Keeps control close to equipment (latency + resilience)
- Provides safe remote operations and updates
- Buffers and forwards data across flaky links
Common failure modes
- Drift between desired and actual running configuration
- Changes without clear rollback criteria
- Insufficient monitoring for acceptance after rollout
Acceptance tests
- Verify the deployed snapshot/version matches intent (no drift); see the checks sketched after this list
- Run a canary validation: behavior, health, and telemetry align with expectations
- Verify rollback works and restores known-good behavior
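The first and last of these can be expressed as simple checks against what the site reports; a sketch with hypothetical inputs (the backend's actual status fields may differ):

def check_no_drift(reported_version, intended_version, healthy):
    # Acceptance: the running snapshot is exactly the intended one and the site is healthy.
    if reported_version != intended_version:
        return False, f"drift: running {reported_version}, intended {intended_version}"
    if not healthy:
        return False, "site reports unhealthy"
    return True, "ok"

def check_rollback(rollback_fn, read_version_fn, known_good_version):
    # Acceptance: rolling back restores the known-good snapshot.
    rollback_fn()
    restored = read_version_fn()
    return restored == known_good_version, f"restored {restored}"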
In the platform
- Connects to field devices via protocol adapters
- Runs store-and-forward for telemetry
- Reports health/diagnostics to the backend
Implementation checklist
- Verify the agent can supervise runtime processes and restart safely
- Confirm store-and-forward buffering settings for expected outages
- Validate adapter lifecycle (connect/disconnect/backoff) under stress (see the backoff sketch after this list)
- Ensure the agent reports version + health signals to the backend
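For the adapter lifecycle item, the usual pattern is exponential backoff with jitter so a flapping device never turns into a tight reconnect loop. A minimal sketch (connect_fn stands in for whatever call the adapter uses to open its connection):

import random
import time

class AdapterConnection:
    # Reconnect with exponential backoff plus jitter, capped at max_delay.
    def __init__(self, connect_fn, base_delay=1.0, max_delay=60.0):
        self.connect_fn = connect_fn
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.failures = 0

    def connect_with_backoff(self):
        while True:
            try:
                conn = self.connect_fn()
                self.failures = 0          # healthy again; reset the backoff
                return conn
            except OSError:
                self.failures += 1
                delay = min(self.max_delay, self.base_delay * (2 ** self.failures))
                time.sleep(delay + random.uniform(0, delay / 2))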
Rollout guidance
- Start with a canary site that matches real conditions
- Use health + telemetry gates; stop expansion on regressions
- Keep rollback to a known-good snapshot fast and rehearsed
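Rollback stays fast when known-good snapshots are cached on the device, so restoring one never depends on the WAN. A sketch assuming a local manifest layout (the path and fields are invented):

import json
from pathlib import Path

SNAPSHOT_DIR = Path("/var/lib/edge-agent/snapshots")  # assumed local cache location

def rollback_to_known_good(current_version):
    # Pick the newest locally cached snapshot that already passed acceptance.
    candidates = []
    for manifest_path in SNAPSHOT_DIR.glob("*/manifest.json"):
        manifest = json.loads(manifest_path.read_text())
        if manifest.get("accepted") and manifest.get("version") != current_version:
            candidates.append(manifest)
    if not candidates:
        raise RuntimeError("no known-good snapshot cached locally")
    return max(candidates, key=lambda m: m["created_at"])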
Deep dive
Common questions
Quick answers that help during commissioning and operations.
What should continue working during a WAN outage?
Local control execution should continue. Telemetry should buffer locally and forward once links return; operations should show the site as offline/degraded, not broken.
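A minimal store-and-forward sketch of that behavior: samples are buffered locally (in a real agent, on disk) and drained oldest-first once the uplink returns.

import collections
import json

class StoreAndForward:
    # Bounded local buffer; when full, the oldest samples are dropped first.
    def __init__(self, max_items=100_000):
        self.queue = collections.deque(maxlen=max_items)

    def record(self, sample):
        self.queue.append(json.dumps(sample))

    def drain(self, send_fn):
        # send_fn returns True on success; stop on the first failure so
        # nothing is lost if the link flaps mid-drain.
        while self.queue:
            if not send_fn(self.queue[0]):
                break
            self.queue.popleft()
        return len(self.queue)             # 0 means fully forwarded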
How do we prevent crash loops?
Use backoff policies and safe-mode thresholds; surface the last error context and isolate failing adapters instead of repeatedly restarting everything.
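A sketch of a safe-mode threshold: allow a few restarts within a window, then stop restarting and surface the last error instead of looping.

import time

class RestartPolicy:
    def __init__(self, max_restarts=3, window_s=600):
        self.max_restarts = max_restarts
        self.window_s = window_s
        self.restarts = []
        self.last_error = None             # surfaced to operators/diagnostics

    def should_restart(self, error):
        now = time.monotonic()
        self.last_error = error
        self.restarts = [t for t in self.restarts if now - t < self.window_s]
        if len(self.restarts) >= self.max_restarts:
            return False                   # safe mode: stop the crash loop
        self.restarts.append(now)
        return True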
What’s the most common edge operational failure?
Connectivity problems and configuration drift. Reconcile desired versus actual state, and monitor adapter health and buffer depth.
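Reconciliation can be as simple as diffing desired configuration against what is actually running and applying only the differences; the diff itself is the drift signal worth monitoring. A sketch:

def reconcile(desired, actual, apply_fn):
    # Returns the drift (key -> (wanted, found)) and applies the wanted values.
    drift = {key: (desired[key], actual.get(key))
             for key in desired if actual.get(key) != desired[key]}
    for key, (wanted, _found) in drift.items():
        apply_fn(key, wanted)
    return drift                           # empty dict means no drift

# Example:
# reconcile({"runtime_image": "runtime:2024.06.0"},
#           {"runtime_image": "runtime:2024.05.2"},
#           apply_fn=lambda key, value: print("apply", key, value))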