Operations
How edge sites are operated day-to-day: lifecycle management, health reporting, diagnostics, and safe remote actions.
Design intent
Use this lens when implementing Operations across a fleet: define clear boundaries, make changes snapshot-based, and keep operational signals observable.
- Operations scales when desired vs actual state is always visible
- Safe-mode/backoff policies prevent “restart storms”
- Runbooks + deterministic rollback cut MTTR dramatically
What it is
Edge operations covers everything around the runtime: process supervision, configuration sync, updates, and collecting the signals that tell you a site is healthy.
Operating model
- Edge is treated as a managed node: declared desired state + observed actual state
- Updates are delivered as versioned artifacts with explicit rollout control
- Local autonomy is preserved: edge continues executing even during cloud outages
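To make the declared desired state concrete, it can be as small as a versioned record per site. The sketch below is conceptual Python; the field names (artifact_version, config_hash, rollout_wave) are illustrative, not a platform schema:

# Conceptual desired/actual state records for one edge site (field names are illustrative).
from dataclasses import dataclass

@dataclass(frozen=True)
class DesiredState:
    site_id: str            # which managed node this applies to
    artifact_version: str   # versioned runtime/app artifact that should be running
    config_hash: str        # hash of the configuration snapshot
    rollout_wave: int       # explicit rollout control: which wave this site belongs to

@dataclass(frozen=True)
class ActualState:
    site_id: str
    artifact_version: str   # what is actually running, as reported by the site
    config_hash: str
    healthy: bool

desired = DesiredState("site-042", "runtime-1.8.3", "a9f3c2", rollout_wave=2)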
How it works (high level)
- A supervisor layer watches the runtime and critical adapters, restarting when policy says it is safe
- Health signals and diagnostics are emitted continuously and sent upstream when possible
- The platform reconciles “what should be running” with “what is running” and flags drift
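The reconciliation step above can be sketched in a few lines of Python. The record shape (version, config_hash) is an assumption for illustration, not the platform's actual API:

# Minimal drift check: compare declared desired state with reported actual state.
def find_drift(desired: dict, actual: dict) -> list[str]:
    findings = []
    for site, want in desired.items():
        have = actual.get(site)
        if have is None:
            findings.append(f"{site}: no actual state reported (offline or never deployed)")
            continue
        if have["version"] != want["version"]:
            findings.append(f"{site}: version drift {have['version']} != {want['version']}")
        if have["config_hash"] != want["config_hash"]:
            findings.append(f"{site}: config drift {have['config_hash']} != {want['config_hash']}")
    return findings

desired = {"site-042": {"version": "1.8.3", "config_hash": "a9f3c2"}}
actual = {"site-042": {"version": "1.8.1", "config_hash": "a9f3c2"}}
print(find_drift(desired, actual))   # -> ['site-042: version drift 1.8.1 != 1.8.3']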
Architecture at a glance
- Endpoints (protocol sessions) → points (signals) → mappings (typed bindings) → control app ports
- Adapters isolate variable-latency protocol work from deterministic control execution paths
- Validation and data-quality checks sit between “connected” and “correct”
- This is a UI + backend + edge concern: changes affect real-world actuation
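One way to picture the endpoints → points → mappings → ports chain is as plain typed records. The Python types below (Endpoint, Point, Mapping) are illustrative names, not platform classes:

from dataclasses import dataclass

@dataclass
class Endpoint:          # a protocol session, e.g. one Modbus TCP connection
    name: str
    protocol: str
    host: str

@dataclass
class Point:             # a signal read from or written to an endpoint
    name: str
    endpoint: Endpoint
    address: str
    dtype: str           # e.g. "REAL", "BOOL"
    unit: str
    scale: float

@dataclass
class Mapping:           # typed binding from a point to a control app port
    point: Point
    app_port: str        # e.g. "app:fb-network/valve_cmd"
    direction: str       # "read" or "write"
    owner: str           # single-writer ownership, e.g. "app:fb-network"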
Typical workflow
- Define endpoints and point templates (units, scaling, ownership)
- Bind points to app ports and validate types/limits
- Commission using a canary device and verify data quality (staleness/range)
- Roll out with rate limits and monitoring for flaps and errors
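The "validate types/limits" and "verify data quality (staleness/range)" steps can be expressed as a small check per sampled value. This Python sketch uses made-up thresholds (limits, max_age_s):

import time

# Illustrative data-quality check for one sampled point value.
def check_point(value: float, sampled_at: float, limits=(0.0, 3000.0), max_age_s=5.0) -> list[str]:
    problems = []
    lo, hi = limits
    if not (lo <= value <= hi):
        problems.append(f"out of range: {value} not in [{lo}, {hi}]")
    age = time.time() - sampled_at
    if age > max_age_s:
        problems.append(f"stale: last sample {age:.1f}s old (max {max_age_s}s)")
    return problems

# Example: a pump speed sampled 12 s ago would be flagged as stale.
print(check_point(1450.0, time.time() - 12))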
System boundary
Treat Operations as a repeatable interface between engineering intent (design) and runtime reality (deployments + signals). Keep site-specific details configurable so the same design scales across sites.
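A minimal sketch of that split, assuming a hypothetical shared base design plus per-site overrides (all keys and the example host are illustrative):

# Shared design + per-site overrides; site values win over the base design.
BASE_DESIGN = {
    "scan_interval_s": 1.0,
    "telemetry_buffer_max": 10_000,
    "endpoints": {"pump-1": {"protocol": "modbus", "port": 502}},
}

SITE_OVERRIDES = {
    "site-042": {"endpoints": {"pump-1": {"host": "192.168.10.21"}}},
}

def site_config(site_id: str) -> dict:
    cfg = {**BASE_DESIGN, "endpoints": {}}
    overrides = SITE_OVERRIDES.get(site_id, {})
    for name, ep in BASE_DESIGN["endpoints"].items():
        cfg["endpoints"][name] = {**ep, **overrides.get("endpoints", {}).get(name, {})}
    for key, value in overrides.items():
        if key != "endpoints":
            cfg[key] = value
    return cfg

print(site_config("site-042")["endpoints"]["pump-1"])
# -> {'protocol': 'modbus', 'port': 502, 'host': '192.168.10.21'}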
Example artifact
I/O mapping table (conceptual)
point_name, protocol, address, type, unit, scale, direction, owner
pump_speed, modbus, 40021, REAL, rpm, 0.1, read, device:pump-1
valve_cmd, modbus, 00013, BOOL, -, -, write, app:fb-network
Why it matters
- Reduces the need for on-site interventions
- Provides predictable upgrade and rollback mechanics
- Creates actionable diagnostics instead of “it stopped”
Common failure modes
- Units/scaling mismatch (values look “reasonable” but are wrong)
- Swapped addresses/endianness/encoding issues that only show under load
- Staleness: values stop changing but connectivity stays “green”
- Write conflicts from unclear single-writer ownership
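The write-conflict failure is the one most worth preventing in code. A minimal single-writer guard, with an illustrative owner registry, might look like this:

# Illustrative single-writer guard: only the registered owner may write a point.
class WriteArbiter:
    def __init__(self, owners: dict[str, str]):
        self.owners = owners            # point name -> owning app/device

    def write(self, point: str, value, writer: str) -> None:
        owner = self.owners.get(point)
        if owner != writer:
            raise PermissionError(f"{writer} is not the owner of {point} (owner: {owner})")
        # ... hand off to the protocol adapter here ...
        print(f"{writer} wrote {value} to {point}")

arbiter = WriteArbiter({"valve_cmd": "app:fb-network"})
arbiter.write("valve_cmd", True, writer="app:fb-network")      # allowed
# arbiter.write("valve_cmd", False, writer="app:manual-hmi")   # would raise PermissionError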
Acceptance tests
- Step input values and verify expected output actuation (end-to-end)
- Inject stale/noisy values and confirm guards flag or suppress them
- Confirm single-writer ownership with a write-conflict test
- Verify the deployed snapshot/version matches intent (no drift)
- Run a canary validation: behavior, health, and telemetry align with expectations
- Verify rollback works and restores known-good behavior
In the platform
- Supervise runtime processes and restart strategies
- Report health, version, and deployment status
- Expose diagnostics for commissioning and support
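The health/version/deployment report can be a small self-describing payload. The shape and field names below are assumptions for illustration, not the platform's actual schema:

import json, time

# Illustrative status payload: health, versions, and deployment reconciliation state.
def build_status_report(site_id: str, restarts_last_hour: int, adapters: dict[str, str],
                        running_version: str, desired_version: str) -> str:
    report = {
        "site_id": site_id,
        "timestamp": time.time(),
        "runtime": {"restarts_last_hour": restarts_last_hour},
        "adapters": adapters,                         # e.g. {"modbus": "ok", "opcua": "degraded"}
        "deployment": {
            "running_version": running_version,
            "desired_version": desired_version,
            "in_sync": running_version == desired_version,
        },
    }
    return json.dumps(report)

print(build_status_report("site-042", 1, {"modbus": "ok"}, "1.8.3", "1.8.3"))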
Failure modes and responses
- Crash loops → backoff + safe-mode policy + surface the last error context
- Partial degradation → isolate failing adapter while keeping the core runtime alive
- Misconfiguration → detect drift from snapshot and recommend rollback or re-deploy
- Connectivity loss → continue local control, queue telemetry, and mark site as offline/degraded
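As a sketch of the crash-loop response, a supervisor can apply exponential backoff and drop into safe mode past a threshold. The numbers and class name here are illustrative policy choices, not platform defaults:

import time

# Illustrative restart policy: exponential backoff, then safe mode after repeated crashes.
class RestartPolicy:
    def __init__(self, base_delay_s=1.0, max_delay_s=300.0, safe_mode_after=5):
        self.base_delay_s = base_delay_s
        self.max_delay_s = max_delay_s
        self.safe_mode_after = safe_mode_after
        self.crash_count = 0

    def on_crash(self, last_error: str) -> str:
        self.crash_count += 1
        if self.crash_count >= self.safe_mode_after:
            # Stop restarting; surface the last error context for diagnostics.
            return f"SAFE_MODE (after {self.crash_count} crashes, last error: {last_error})"
        delay = min(self.base_delay_s * 2 ** (self.crash_count - 1), self.max_delay_s)
        time.sleep(delay)  # a real supervisor would schedule the restart, not block
        return f"RESTART (attempt {self.crash_count}, waited {delay:.0f}s)"

policy = RestartPolicy()
print(policy.on_crash("adapter timeout"))   # RESTART (attempt 1, waited 1s)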
What to monitor
- Runtime process uptime and restart counts
- Adapter health and protocol error rates
- Deployment reconciliation state (desired vs actual)
- Local telemetry buffer depth and delivery latency
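Buffer depth and delivery latency can be exposed by the local telemetry queue itself. This sketch uses a simple in-memory deque and a made-up drop-oldest policy:

import time
from collections import deque

# Illustrative local telemetry buffer that reports its own depth and delivery lag.
class TelemetryBuffer:
    def __init__(self, max_depth=10_000):
        self.queue = deque()
        self.max_depth = max_depth

    def enqueue(self, sample: dict) -> None:
        if len(self.queue) >= self.max_depth:
            self.queue.popleft()        # drop-oldest policy; could also drop-newest or block
        self.queue.append({"enqueued_at": time.time(), **sample})

    def metrics(self) -> dict:
        oldest_age = time.time() - self.queue[0]["enqueued_at"] if self.queue else 0.0
        return {
            "depth": len(self.queue),
            "fill_ratio": len(self.queue) / self.max_depth,
            "oldest_sample_age_s": oldest_age,   # proxy for delivery latency while offline
        }

buf = TelemetryBuffer()
buf.enqueue({"point": "pump_speed", "value": 1450.0})
print(buf.metrics())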
Implementation checklist
- Define restart/backoff policies and a “safe mode” threshold
- Track desired vs actual version/config and alert on drift
- Document field runbooks: rollback, diagnostics capture, escalation
- Monitor buffer depth and delivery latency during connectivity issues
Rollout guidance
- Start with a canary site that matches real conditions
- Use health + telemetry gates; stop expansion on regressions
- Keep rollback to a known-good snapshot fast and rehearsed
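A hedged sketch of the "health + telemetry gates" idea: expansion continues only while every gate passes. Gate names and thresholds are assumptions for illustration:

# Illustrative rollout gate: stop expansion if any canary metric regresses past its threshold.
GATES = {
    "restart_count_per_day": 2,       # max acceptable restarts at the canary site
    "adapter_error_rate": 0.01,       # max acceptable protocol error rate
    "telemetry_delivery_p95_s": 30.0, # max acceptable delivery latency
}

def rollout_may_continue(canary_metrics: dict) -> tuple[bool, list[str]]:
    failures = [
        f"{name}: {canary_metrics[name]} > {limit}"
        for name, limit in GATES.items()
        if canary_metrics.get(name, float("inf")) > limit
    ]
    return (not failures, failures)

ok, reasons = rollout_may_continue(
    {"restart_count_per_day": 0, "adapter_error_rate": 0.002, "telemetry_delivery_p95_s": 12.0}
)
print(ok, reasons)   # True, []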
Common questions
Quick answers that help during commissioning and operations.
What is “drift” in edge operations?
Drift is when the running version/config differs from the intended snapshot/deployment plan. Detect it via reconciliation and fix by re-deploying or rolling back.
How do we keep edge autonomous but manageable?
Keep execution local, but keep desired state centralized and observable. The system should reconcile and report, not require constant remote control.
What should we alert on first?
Crash loops/restart bursts, adapter health degradation, deployment reconciliation failures, and telemetry buffer saturation.