Platform

Execution

What the edge agent does: supervise the control runtime near the machines, bridge industrial protocols, and forward telemetry reliably.

Bootctrl architecture overview

Design intent

Use this lens when implementing Execution across a fleet: define clear boundaries, make changes snapshot-based, and keep operational signals observable.

What it is

The edge agent runs near the machines, supervises the control runtime, bridges industrial protocols, and forwards telemetry upstream.

Design constraints

  • Edge should keep control executing even when cloud is down
  • Supervision + safe restart policies prevent unattended crash loops
  • Store-and-forward telemetry preserves evidence across outages

Architecture at a glance

  • Define a stable artifact boundary (what you deploy) and a stable signal boundary (what you observe)
  • Treat changes as versioned, testable, rollbackable units
  • Use health + telemetry gates to scale safely

Typical workflow

  • Define scope and success criteria (what should change, what must stay stable)
  • Create or update a snapshot, then validate against a canary environment/site
  • Deploy progressively with health/telemetry gates and explicit rollback criteria (see the sketch after this list)
  • Confirm acceptance tests and operational dashboards before expanding
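
A minimal sketch of that loop in Python, assuming hypothetical deploy() and gate_passes() helpers (illustrative names, not platform APIs):

  # Illustrative rollout loop: canary first, then expand while health/telemetry
  # gates pass; on a regression, roll back to the known-good snapshot and stop.
  from typing import Callable, Iterable

  def progressive_rollout(snapshot: str,
                          known_good: str,
                          sites: Iterable[str],                 # canary site first
                          deploy: Callable[[str, str], None],   # assumed helper
                          gate_passes: Callable[[str], bool]) -> bool:  # assumed helper
      for site in sites:
          deploy(snapshot, site)
          if not gate_passes(site):
              deploy(known_good, site)   # explicit, rehearsed rollback path
              return False               # stop expansion on regression
      return True                        # all gates passed; rollout complete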

System boundary

Treat Execution as a repeatable interface between engineering intent (design) and runtime reality (deployments + signals). Keep site-specific details configurable so the same design scales across sites.

Example artifact

Implementation notes (conceptual):

  topic: Execution
  plan: define -> snapshot -> canary -> expand
  signals: health + telemetry + events tied to version
  rollback: select known-good snapshot
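
The same intent can be captured as a small typed structure; the field names below simply mirror the conceptual artifact and are illustrative, not a prescribed schema:

  from dataclasses import dataclass

  @dataclass
  class ExecutionChange:
      """Illustrative container mirroring the conceptual artifact above."""
      topic: str = "Execution"
      plan: tuple = ("define", "snapshot", "canary", "expand")
      signals: tuple = ("health", "telemetry", "events")   # all tied to the version below
      snapshot_version: str = "unset"                      # what is being rolled out
      rollback_to: str = "last-known-good"                 # explicit rollback target

  change = ExecutionChange(snapshot_version="2024.06.1")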

Why it matters

  • Keeps control close to equipment (latency + resilience)
  • Provides safe remote operations and updates
  • Buffers and forwards data across flaky links

Quick acceptance checks

  • Verify the agent can supervise runtime processes and restart safely
  • Confirm store-and-forward buffering settings for expected outages

Common failure modes

  • Drift between desired and actual running configuration
  • Changes without clear rollback criteria
  • Insufficient monitoring for acceptance after rollout

Acceptance tests

  • Verify the deployed snapshot/version matches intent (no drift)
  • Run a canary validation: behavior, health, and telemetry align with expectations
  • Verify rollback works and restores known-good behavior
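
The checks above can be scripted against whatever interface the agent exposes; a sketch with assumed helpers (get_running_version, canary_healthy, rollback_to are placeholders, not real APIs):

  from typing import Callable

  def run_acceptance_checks(intended_version: str,
                            get_running_version: Callable[[], str],
                            canary_healthy: Callable[[], bool],
                            rollback_to: Callable[[str], None]) -> None:
      # No drift: the deployed snapshot matches intent.
      assert get_running_version() == intended_version
      # Canary validation: behavior, health, and telemetry look as expected.
      assert canary_healthy()
      # Rollback restores known-good behavior.
      rollback_to("last-known-good")
      assert canary_healthy()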

In the platform

  • Connects to field devices via protocol adapters
  • Runs store-and-forward for telemetry
  • Reports health/diagnostics to the backend
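
For the health/diagnostics reporting above, a sketch of the kind of payload an agent might send upstream; the field names are assumptions, not a fixed schema:

  import json
  import time

  def build_health_report(snapshot_version: str, adapters: dict, buffer_depth: int) -> str:
      """Illustrative health/diagnostics payload that ties signals to the running version."""
      return json.dumps({
          "timestamp": time.time(),
          "snapshot_version": snapshot_version,    # lets the backend detect drift
          "adapters": adapters,                    # e.g. {"opcua": "connected", "modbus": "backoff"}
          "telemetry_buffer_depth": buffer_depth,  # store-and-forward backlog
      })

  print(build_health_report("2024.06.1", {"opcua": "connected"}, buffer_depth=0))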

Implementation checklist

  • Verify the agent can supervise runtime processes and restart safely
  • Confirm store-and-forward buffering settings for expected outages
  • Validate adapter lifecycle (connect/disconnect/backoff) under stress (a reconnect sketch follows this list)
  • Ensure the agent reports version + health signals to the backend
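
For the adapter lifecycle item above, a minimal reconnect sketch with capped exponential backoff and jitter, assuming a hypothetical connect() callable:

  import random
  import time
  from typing import Callable

  def reconnect_with_backoff(connect: Callable[[], bool],
                             base_s: float = 1.0, cap_s: float = 60.0,
                             max_attempts: int = 10) -> bool:
      """Retry an adapter connection without hammering the device or the network."""
      delay = base_s
      for _ in range(max_attempts):
          if connect():
              return True                                    # connected; resume polling
          time.sleep(delay + random.uniform(0, delay / 10))  # jitter avoids reconnect storms
          delay = min(cap_s, delay * 2)                      # capped exponential backoff
      return False                                           # surface failure instead of looping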

Rollout guidance

  • Start with a canary site that matches real conditions
  • Use health + telemetry gates; stop expansion on regressions
  • Keep rollback to a known-good snapshot fast and rehearsed

Deep dive

Practical next steps

How teams typically apply this in real deployments.

Key takeaways

  • Edge should keep control executing even when cloud is down
  • Supervision + safe restart policies prevent unattended crash loops
  • Store-and-forward telemetry preserves evidence across outages

Deep dive

Common questions

Quick answers that help during commissioning and operations.

What should continue working during a WAN outage?

Local control execution should continue. Telemetry should buffer locally and forward once links return; operations should show the site as offline/degraded, not broken.
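
A minimal store-and-forward sketch, assuming a hypothetical send() that returns False while the uplink is down; real agents typically persist the queue to disk so evidence survives restarts as well as outages:

  from collections import deque
  from typing import Callable

  class StoreAndForward:
      """Buffer telemetry locally and drain oldest-first once the uplink returns."""
      def __init__(self, send: Callable[[dict], bool], max_items: int = 100_000):
          self.send = send
          self.buffer = deque(maxlen=max_items)   # bounded so local storage cannot fill up

      def publish(self, sample: dict) -> None:
          self.buffer.append(sample)
          self.drain()

      def drain(self) -> None:
          while self.buffer:
              if not self.send(self.buffer[0]):   # uplink still down: keep buffering
                  return
              self.buffer.popleft()               # confirmed upstream: safe to drop locally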

How do we prevent crash loops?

Use backoff policies and safe-mode thresholds; surface the last error context and isolate failing adapters instead of repeatedly restarting everything.
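
A sketch of such a restart policy, assuming a hypothetical start_runtime() helper that blocks until the process exits and returns its exit code; the thresholds are illustrative:

  import time
  from typing import Callable, List

  def supervise(start_runtime: Callable[[], int],
                max_restarts: int = 5, window_s: float = 300.0) -> None:
      """Restart a failed runtime with backoff; enter safe mode instead of crash-looping."""
      restarts: List[float] = []
      delay = 1.0
      while True:
          exit_code = start_runtime()          # blocks until the runtime exits
          if exit_code == 0:
              return                           # clean shutdown, nothing to do
          now = time.monotonic()
          restarts = [t for t in restarts if now - t < window_s] + [now]
          if len(restarts) > max_restarts:
              print("safe mode: too many restarts; holding with last error for operators")
              return                           # stop unattended restarts; wait for intervention
          time.sleep(delay)
          delay = min(60.0, delay * 2)         # exponential backoff between restarts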

What’s the most common edge operational failure?

Connectivity problems and configuration drift. Reconcile desired and actual state continuously, and monitor adapter health and buffer depth.
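
A minimal reconciliation sketch comparing desired and actual state; the keys are purely illustrative:

  def find_drift(desired: dict, actual: dict) -> dict:
      """Return every key where the running state differs from the desired snapshot."""
      keys = set(desired) | set(actual)
      return {k: {"desired": desired.get(k), "actual": actual.get(k)}
              for k in keys if desired.get(k) != actual.get(k)}

  # Example: the deployed snapshot no longer matches intent.
  print(find_drift({"snapshot": "2024.06.1", "adapters": 3},
                   {"snapshot": "2024.05.9", "adapters": 3}))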