Platform
Execution
What the edge agent does: supervising the control runtime near the machines, bridging industrial protocols, and forwarding telemetry reliably.
Design intent
Use this lens when implementing Execution across a fleet: define clear boundaries, make changes snapshot-based, and keep operational signals observable.
What it is
The edge agent runs near the machines, supervises the control runtime, bridges industrial protocols, and forwards telemetry upstream.
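A minimal sketch of that division of responsibility, using hypothetical component names (RuntimeSupervisor, ModbusAdapter, TelemetryForwarder are illustrative, not the agent's real API):

import subprocess

class RuntimeSupervisor:
    # Keeps the local control runtime process alive.
    def __init__(self, command):
        self.command = command            # e.g. ["/opt/runtime/bin/run"] (hypothetical path)
        self.process = None

    def ensure_running(self):
        if self.process is None or self.process.poll() is not None:
            self.process = subprocess.Popen(self.command)

class ModbusAdapter:
    # Placeholder protocol bridge: polls a field device and returns samples.
    def poll(self):
        return []                         # a real adapter returns (tag, value, timestamp) tuples

class TelemetryForwarder:
    # Buffers samples locally and forwards them when the uplink is available.
    def __init__(self):
        self.buffer = []
    def enqueue(self, sample):
        self.buffer.append(sample)
    def flush(self, uplink_ok):
        if uplink_ok:
            self.buffer.clear()           # a real forwarder sends upstream before clearing

def tick(supervisor, adapters, forwarder, uplink_ok):
    # One agent cycle: supervise the runtime, bridge protocols, forward telemetry.
    supervisor.ensure_running()
    for adapter in adapters:
        for sample in adapter.poll():
            forwarder.enqueue(sample)
    forwarder.flush(uplink_ok)

The point is the separation: local control keeps executing even if flush() never succeeds.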
Design constraints
- The edge should keep control executing even when the cloud is down
- Supervision + safe restart policies prevent unattended crash loops
- Store-and-forward telemetry preserves evidence across outages
Architecture at a glance
- Define a stable artifact boundary (what you deploy) and a stable signal boundary (what you observe)
- Treat changes as versioned, testable, rollbackable units (see the snapshot sketch after this list)
- Use health + telemetry gates to scale safely
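One way to make both boundaries concrete is to model them as small, versioned records. A sketch in Python; the field names are illustrative assumptions, not the platform's schema:

from dataclasses import dataclass

@dataclass(frozen=True)
class Snapshot:
    # Artifact boundary: a versioned, testable, rollbackable unit of change.
    version: str              # e.g. "2024.06.0"
    runtime_image: str        # what gets deployed
    adapter_config: dict      # site-configurable protocol settings
    accepted: bool = False    # flipped after acceptance tests pass on a canary

@dataclass(frozen=True)
class SignalBoundary:
    # Signal boundary: what you observe after deploying a snapshot.
    health_checks: tuple      # e.g. ("runtime_alive", "adapters_connected")
    telemetry_streams: tuple  # streams expected to stay fresh post-deploy
    events: tuple             # deploy/rollback events tied to Snapshot.version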
Typical workflow
- Define scope and success criteria (what should change, what must stay stable)
- Create or update a snapshot, then validate against a canary environment/site
- Deploy progressively with health/telemetry gates and explicit rollback criteria (sketched after this list)
- Confirm acceptance tests and operational dashboards before expanding
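A sketch of the gated expansion step, assuming the platform supplies the deploy, gate, and rollback operations (the function names here are placeholders):

def progressive_rollout(waves, deploy, gates_pass, rollback):
    # waves: ordered groups of sites, the first being the canary.
    for wave in waves:
        for site in wave:
            deploy(site)
        if not all(gates_pass(site) for site in wave):
            # Stop expansion and restore the known-good snapshot on this wave.
            for site in wave:
                rollback(site)
            return False
    return True

def gates_pass_example(status):
    # Example gate: healthy, telemetry fresh, and reported version matches intent.
    return (status["healthy"]
            and status["telemetry_fresh_seconds"] < 300
            and status["reported_version"] == status["intended_version"])

Expansion stops on the first wave that fails its gate, which is the behavior the rollout guidance below asks for.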
System boundary
Treat Execution as a repeatable interface between engineering intent (design) and runtime reality (deployments + signals). Keep site-specific details configurable so the same design scales across sites.
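A small illustration of that split, with invented site names and keys: the shared design stays identical everywhere, and only the site layer varies.

SHARED_DESIGN = {
    "runtime_image": "runtime:2024.06.0",
    "telemetry": {"batch_size": 500, "flush_interval_s": 10},
}

SITE_OVERRIDES = {
    "plant-a": {"adapters": [{"protocol": "modbus-tcp", "host": "10.0.0.5"}]},
    "plant-b": {"adapters": [{"protocol": "opc-ua", "endpoint": "opc.tcp://10.1.0.9:4840"}],
                "telemetry": {"flush_interval_s": 30}},  # slower uplink at this site
}

def effective_config(site):
    # Site values override shared defaults; everything else is identical across sites.
    overrides = SITE_OVERRIDES.get(site, {})
    config = {**SHARED_DESIGN, **overrides}
    config["telemetry"] = {**SHARED_DESIGN["telemetry"], **overrides.get("telemetry", {})}
    return config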
Example artifact
Implementation notes (conceptual)
topic: Execution
plan: define -> snapshot -> canary -> expand
signals: health + telemetry + events tied to version
rollback: select known-good snapshot
Why it matters
- Keeps control close to equipment (latency + resilience)
- Provides safe remote operations and updates
- Buffers and forwards data across flaky links
Common failure modes
- Drift between desired and actual running configuration
- Changes without clear rollback criteria
- Insufficient monitoring for acceptance after rollout
Acceptance tests
- Verify the deployed snapshot/version matches intent (no drift); see the checks sketched after this list
- Run a canary validation: behavior, health, and telemetry align with expectations
- Verify rollback works and restores known-good behavior
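The first and last of these can be expressed as simple checks against what the site reports; a sketch with hypothetical inputs (the backend's actual status fields may differ):

def check_no_drift(reported_version, intended_version, healthy):
    # Acceptance: the running snapshot is exactly the intended one and the site is healthy.
    if reported_version != intended_version:
        return False, f"drift: running {reported_version}, intended {intended_version}"
    if not healthy:
        return False, "site reports unhealthy"
    return True, "ok"

def check_rollback(rollback_fn, read_version_fn, known_good_version):
    # Acceptance: rolling back restores the known-good snapshot.
    rollback_fn()
    restored = read_version_fn()
    return restored == known_good_version, f"restored {restored}"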
In the platform
- Connects to field devices via protocol adapters
- Runs store-and-forward for telemetry
- Reports health/diagnostics to the backend
Implementation checklist
- Verify the agent can supervise runtime processes and restart safely
- Confirm store-and-forward buffering settings for expected outages
- Validate adapter lifecycle (connect/disconnect/backoff) under stress (see the backoff sketch after this list)
- Ensure the agent reports version + health signals to the backend
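For the adapter lifecycle item, the usual pattern is exponential backoff with jitter so a flapping device never turns into a tight reconnect loop. A minimal sketch (connect_fn stands in for whatever call the adapter uses to open its connection):

import random
import time

class AdapterConnection:
    # Reconnect with exponential backoff plus jitter, capped at max_delay.
    def __init__(self, connect_fn, base_delay=1.0, max_delay=60.0):
        self.connect_fn = connect_fn
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.failures = 0

    def connect_with_backoff(self):
        while True:
            try:
                conn = self.connect_fn()
                self.failures = 0          # healthy again; reset the backoff
                return conn
            except OSError:
                self.failures += 1
                delay = min(self.max_delay, self.base_delay * (2 ** self.failures))
                time.sleep(delay + random.uniform(0, delay / 2))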
Rollout guidance
- Start with a canary site that matches real conditions
- Use health + telemetry gates; stop expansion on regressions
- Keep rollback to a known-good snapshot fast and rehearsed
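Rollback stays fast when known-good snapshots are cached on the device, so restoring one never depends on the WAN. A sketch assuming a local manifest layout (the path and fields are invented):

import json
from pathlib import Path

SNAPSHOT_DIR = Path("/var/lib/edge-agent/snapshots")  # assumed local cache location

def rollback_to_known_good(current_version):
    # Pick the newest locally cached snapshot that already passed acceptance.
    candidates = []
    for manifest_path in SNAPSHOT_DIR.glob("*/manifest.json"):
        manifest = json.loads(manifest_path.read_text())
        if manifest.get("accepted") and manifest.get("version") != current_version:
            candidates.append(manifest)
    if not candidates:
        raise RuntimeError("no known-good snapshot cached locally")
    return max(candidates, key=lambda m: m["created_at"])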
Deep dive
Common questions
Quick answers that help during commissioning and operations.
What should continue working during a WAN outage?
Local control execution should continue. Telemetry should buffer locally and forward once links return; operations should show the site as offline/degraded, not broken.
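A minimal store-and-forward sketch of that behavior: samples are buffered locally (in a real agent, on disk) and drained oldest-first once the uplink returns.

import collections
import json

class StoreAndForward:
    # Bounded local buffer; when full, the oldest samples are dropped first.
    def __init__(self, max_items=100_000):
        self.queue = collections.deque(maxlen=max_items)

    def record(self, sample):
        self.queue.append(json.dumps(sample))

    def drain(self, send_fn):
        # send_fn returns True on success; stop on the first failure so
        # nothing is lost if the link flaps mid-drain.
        while self.queue:
            if not send_fn(self.queue[0]):
                break
            self.queue.popleft()
        return len(self.queue)             # 0 means fully forwarded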
How do we prevent crash loops?
Use backoff policies and safe-mode thresholds; surface the last error context and isolate failing adapters instead of repeatedly restarting everything.
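A sketch of a safe-mode threshold: allow a few restarts within a window, then stop restarting and surface the last error instead of looping.

import time

class RestartPolicy:
    def __init__(self, max_restarts=3, window_s=600):
        self.max_restarts = max_restarts
        self.window_s = window_s
        self.restarts = []
        self.last_error = None             # surfaced to operators/diagnostics

    def should_restart(self, error):
        now = time.monotonic()
        self.last_error = error
        self.restarts = [t for t in self.restarts if now - t < self.window_s]
        if len(self.restarts) >= self.max_restarts:
            return False                   # safe mode: stop the crash loop
        self.restarts.append(now)
        return True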
What’s the most common edge operational failure?
Connectivity problems and configuration drift. Reconcile desired versus actual state, and monitor adapter health and buffer depth.
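Reconciliation can be as simple as diffing desired configuration against what is actually running and applying only the differences; the diff itself is the drift signal worth monitoring. A sketch:

def reconcile(desired, actual, apply_fn):
    # Returns the drift (key -> (wanted, found)) and applies the wanted values.
    drift = {key: (desired[key], actual.get(key))
             for key in desired if actual.get(key) != desired[key]}
    for key, (wanted, _found) in drift.items():
        apply_fn(key, wanted)
    return drift                           # empty dict means no drift

# Example:
# reconcile({"runtime_image": "runtime:2024.06.0"},
#           {"runtime_image": "runtime:2024.05.2"},
#           apply_fn=lambda key, value: print("apply", key, value))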