Platform

Operations

How edge sites are operated day-to-day: lifecycle management, health reporting, diagnostics, and safe remote actions.

Design intent

Use this lens when implementing Operations across a fleet: define clear boundaries, make changes snapshot-based, and keep operational signals observable.

  • Operations scales when desired vs actual state is always visible
  • Safe-mode/backoff policies prevent “restart storms”
  • Runbooks + deterministic rollback cut MTTR dramatically

What it is

Edge operations covers everything around the runtime: process supervision, configuration sync, updates, and collecting the signals that tell you a site is healthy.

Operating model

  • Edge is treated as a managed node: declared desired state + observed actual state
  • Updates are delivered as versioned artifacts with explicit rollout control
  • Local autonomy is preserved: edge continues executing even during cloud outages

How it works (high level)

  • A supervisor layer watches the runtime and critical adapters, restarting when policy says it is safe
  • Health signals and diagnostics are emitted continuously and sent upstream when possible
  • The platform reconciles “what should be running” with “what is running” and flags drift
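
As a rough sketch of that reconcile step, assuming a snapshot is identified by an artifact version plus a config hash (the field names here are illustrative, not the platform's actual schema):

# Illustrative reconcile check: compare the declared snapshot against what the
# site agent last reported, and flag drift. Field names are hypothetical.
from dataclasses import dataclass

@dataclass(frozen=True)
class Snapshot:
    artifact_version: str   # versioned runtime/adapter bundle
    config_hash: str        # hash of the configuration snapshot

def reconcile(desired: Snapshot, observed: Snapshot) -> list[str]:
    """Return a list of drift findings; empty means desired == actual."""
    findings = []
    if desired.artifact_version != observed.artifact_version:
        findings.append(
            f"version drift: want {desired.artifact_version}, running {observed.artifact_version}")
    if desired.config_hash != observed.config_hash:
        findings.append("config drift: running config does not match the declared snapshot")
    return findings

# Example: a site still running an older artifact shows up as drift.
drift = reconcile(Snapshot("1.8.2", "abc123"), Snapshot("1.8.1", "abc123"))
print(drift)  # ['version drift: want 1.8.2, running 1.8.1']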

Architecture at a glance

  • Endpoints (protocol sessions) → points (signals) → mappings (typed bindings) → control app ports
  • Adapters isolate variable-latency protocol work from deterministic control execution paths
  • Validation and data-quality checks sit between “connected” and “correct” (see the binding-check sketch below)
  • This is a UI + backend + edge concern: changes affect real-world actuation
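
A minimal sketch of the “connected vs correct” boundary: a binding is only accepted when the point's type, unit, and writability match what the app port declares. The Point/Port shapes are assumptions made for illustration, not the platform's data model.

# Sketch of a typed binding check between a point (signal) and an app port.
# The Point/Port shapes are illustrative, not the platform's actual model.
from dataclasses import dataclass

@dataclass(frozen=True)
class Point:
    name: str
    dtype: str      # e.g. "REAL", "BOOL"
    unit: str       # e.g. "rpm"
    writable: bool

@dataclass(frozen=True)
class Port:
    name: str
    dtype: str
    unit: str
    direction: str  # "in" reads from a point, "out" writes to it

def validate_binding(point: Point, port: Port) -> list[str]:
    """Checks that keep a binding 'correct', not merely 'connected'."""
    errors = []
    if point.dtype != port.dtype:
        errors.append(f"type mismatch: {point.dtype} vs {port.dtype}")
    if point.unit != port.unit:
        errors.append(f"unit mismatch: {point.unit} vs {port.unit}")
    if port.direction == "out" and not point.writable:
        errors.append("port writes, but the point is read-only")
    return errors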

Typical workflow

  • Define endpoints and point templates (units, scaling, ownership)
  • Bind points to app ports and validate types/limits
  • Commission using a canary device and verify data quality (staleness/range); a sketch follows this list
  • Roll out with rate limits and monitoring for flaps and errors
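
A commissioning-time data-quality guard might look like the sketch below; the staleness window and range limits are invented example values.

# Commissioning-time data-quality guard: flag stale or out-of-range samples.
# The thresholds and field names are illustrative only.
import time

def check_sample(value: float, sample_ts: float, *,
                 lo: float, hi: float, max_age_s: float = 10.0, now=None) -> list[str]:
    issues = []
    now = time.time() if now is None else now
    if now - sample_ts > max_age_s:
        issues.append(f"stale: last update {now - sample_ts:.0f}s ago")
    if not (lo <= value <= hi):
        issues.append(f"out of range: {value} not in [{lo}, {hi}]")
    return issues

# Example: a pump speed that stopped updating 60s ago is flagged as stale.
print(check_sample(1450.0, sample_ts=time.time() - 60, lo=0, hi=3000))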

System boundary

Treat Operations as a repeatable interface between engineering intent (design) and runtime reality (deployments + signals). Keep site-specific details configurable so the same design scales across sites.

Example artifact

I/O mapping table (conceptual)

point_name, protocol, address, type, unit, scale, direction, owner
pump_speed, modbus,   40021,   REAL, rpm,  0.1,   read,      device:pump-1
valve_cmd,  modbus,   00013,   BOOL, -,    -,     write,     app:fb-network
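
Applying one row of the table above is mostly scaling plus a direction/ownership check; the helpers below are illustrative only.

# Applying one row of the conceptual mapping table above: scale the raw
# register value into engineering units and enforce the declared direction.
row = {"point_name": "pump_speed", "protocol": "modbus", "address": "40021",
       "type": "REAL", "unit": "rpm", "scale": 0.1, "direction": "read",
       "owner": "device:pump-1"}

def to_engineering_units(raw: int, mapping: dict) -> float:
    """Raw register counts -> engineering units (here: 0.1 rpm per count)."""
    return raw * mapping["scale"]

def allow_write(mapping: dict, writer: str) -> bool:
    """Single-writer rule: only the declared owner may write, and only to 'write' points."""
    return mapping["direction"] == "write" and writer == mapping["owner"]

print(to_engineering_units(14500, row))    # 1450.0 rpm
print(allow_write(row, "app:fb-network"))  # False: pump_speed is read-only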

Why it matters

  • Reduces the need for on-site interventions
  • Provides predictable upgrade and rollback mechanics
  • Creates actionable diagnostics instead of “it stopped”

Common failure modes

  • Units/scaling mismatch (values look “reasonable” but are wrong)
  • Swapped addresses/endianness/encoding issues that only show under load (illustrated below)
  • Staleness: values stop changing but connectivity stays “green”
  • Write conflicts from unclear single-writer ownership
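
The endianness case is easy to underestimate: the same two registers can decode to two plausible-looking values depending on word order. The register pair below is invented for the example.

# Why word-order mistakes can look "reasonable but wrong": the same two 16-bit
# registers decode to two plausible floats depending on word order.
import struct

registers = (0x42C8, 0x4120)   # made-up holding-register pair read from a device

def decode_float32(regs, word_swap=False):
    hi, lo = (regs[1], regs[0]) if word_swap else regs
    return struct.unpack(">f", struct.pack(">HH", hi, lo))[0]

print(round(decode_float32(registers), 2))                  # 100.13
print(round(decode_float32(registers, word_swap=True), 2))  # 10.02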

Acceptance tests

  • Step input values and verify expected output actuation (end-to-end)
  • Inject stale/noisy values and confirm guards flag or suppress them
  • Confirm single-writer ownership with a write-conflict test (sketched after this list)
  • Verify the deployed snapshot/version matches intent (no drift)
  • Run a canary validation: behavior, health, and telemetry align with expectations
  • Verify rollback works and restores known-good behavior
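
A write-conflict test can stay very small; the WritePoint class below is a hypothetical stand-in for whatever enforces ownership in your stack.

# Sketch of the single-writer acceptance test: the declared owner can write,
# any other writer is rejected. The WritePoint class is hypothetical.
class WritePoint:
    def __init__(self, name: str, owner: str):
        self.name, self.owner, self.value = name, owner, None

    def write(self, writer: str, value) -> bool:
        if writer != self.owner:
            return False          # reject: not the declared single writer
        self.value = value
        return True

def test_write_conflict():
    valve = WritePoint("valve_cmd", owner="app:fb-network")
    assert valve.write("app:fb-network", True) is True          # owner write accepted
    assert valve.write("app:manual-override", False) is False   # conflicting writer rejected
    assert valve.value is True                                   # last good value preserved

test_write_conflict()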

In the platform

  • Supervise runtime processes and restart strategies
  • Report health, version, and deployment status (example payload below)
  • Expose diagnostics for commissioning and support
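
An illustrative shape for the periodic status report follows; the field names are invented for the sketch, not a defined schema.

# Illustrative status report a site agent might emit periodically.
import json, time

status = {
    "site_id": "site-042",
    "reported_at": int(time.time()),
    "runtime": {"state": "running", "uptime_s": 86400, "restarts_24h": 0},
    "deployment": {"desired_version": "1.8.2", "running_version": "1.8.2", "drift": False},
    "adapters": [{"name": "modbus-primary", "healthy": True, "error_rate_per_min": 0.0}],
    "telemetry": {"buffer_depth": 12, "delivery_latency_s": 3.4},
}
print(json.dumps(status, indent=2))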

Failure modes and responses

  • Crash loops → backoff + safe-mode policy + surface the last error context (see the policy sketch below)
  • Partial degradation → isolate failing adapter while keeping the core runtime alive
  • Misconfiguration → detect drift from snapshot and recommend rollback or re-deploy
  • Connectivity loss → continue local control, queue telemetry, and mark site as offline/degraded
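
A restart policy with exponential backoff and a safe-mode threshold can be sketched as follows; the thresholds are example policy choices, and start_process stands in for whatever launches the supervised process.

# Sketch of a restart policy with exponential backoff and a safe-mode threshold.
import time

MAX_RESTARTS = 5          # crashes within the window before entering safe mode
WINDOW_S = 600
BASE_DELAY_S = 2

def supervise(start_process, now=time.time, sleep=time.sleep):
    crash_times = []
    while True:
        exit_code = start_process()          # blocks until the process exits
        crash_times = [t for t in crash_times if now() - t < WINDOW_S]
        crash_times.append(now())
        if len(crash_times) >= MAX_RESTARTS:
            print("entering safe mode: restart storm detected, last exit", exit_code)
            return "SAFE_MODE"               # stop restarting; surface last error context
        delay = BASE_DELAY_S * 2 ** (len(crash_times) - 1)
        print(f"restart #{len(crash_times)} in {delay}s (exit code {exit_code})")
        sleep(delay)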

What to monitor

  • Runtime process uptime and restart counts
  • Adapter health and protocol error rates
  • Deployment reconciliation state (desired vs actual)
  • Local telemetry buffer depth and delivery latency
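
Buffer depth and delivery latency fall out naturally from a store-and-forward queue like the sketch below; the class and its limits are illustrative.

# Store-and-forward sketch: queue telemetry locally during connectivity loss and
# expose buffer depth + delivery latency as the signals to monitor.
import collections, time

class TelemetryBuffer:
    def __init__(self, max_depth: int = 10_000):
        self.queue = collections.deque(maxlen=max_depth)   # oldest samples dropped when full

    def enqueue(self, sample: dict):
        self.queue.append({"enqueued_at": time.time(), **sample})

    def flush(self, send) -> float:
        """Try to deliver everything; return worst delivery latency seen (seconds)."""
        worst_latency = 0.0
        while self.queue:
            item = self.queue[0]
            if not send(item):               # upstream still unreachable, keep buffering
                break
            self.queue.popleft()
            worst_latency = max(worst_latency, time.time() - item["enqueued_at"])
        return worst_latency

    @property
    def depth(self) -> int:
        return len(self.queue)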

Implementation checklist

  • Define restart/backoff policies and a “safe mode” threshold
  • Track desired vs actual version/config and alert on drift
  • Document field runbooks: rollback, diagnostics capture, escalation
  • Monitor buffer depth and delivery latency during connectivity issues

Rollout guidance

  • Start with a canary site that matches real conditions
  • Use health + telemetry gates; stop expansion on regressions
  • Keep rollback to a known-good snapshot fast and rehearsed
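
A rollout loop with a health gate and rollback-on-regression is short to sketch; deploy, health_ok, and rollback are hypothetical callables for whatever your pipeline provides.

# Rollout loop sketch: expand site by site behind health/telemetry gates and
# stop (and roll back) on the first regression.
def roll_out(sites, deploy, health_ok, rollback):
    done = []
    for site in sites:                 # first entry should be the canary site
        deploy(site)
        if not health_ok(site):        # gate: behaviour, health, telemetry
            rollback(site)             # restore the known-good snapshot
            print(f"stopping rollout: regression at {site}, {len(done)} sites updated")
            return done
        done.append(site)
    return done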

Practical next steps

How teams typically apply this in real deployments.

Key takeaways

  • Operations scales when desired vs actual state is always visible
  • Safe-mode/backoff policies prevent “restart storms”
  • Runbooks + deterministic rollback cut MTTR dramatically

Common questions

Quick answers that help during commissioning and operations.

What is “drift” in edge operations?

Drift is when the running version/config differs from the intended snapshot/deployment plan. Detect it via reconciliation and fix by re-deploying or rolling back.

How do we keep edge autonomous but manageable?

Keep execution local, but keep desired state centralized and observable. The system should reconcile and report, not require constant remote control.

What should we alert on first?

Crash loops/restart bursts, adapter health degradation, deployment reconciliation failures, and telemetry buffer saturation.