Operations
How edge sites are operated day-to-day: lifecycle management, health reporting, diagnostics, and safe remote actions.
Design intent
Use this lens when implementing Operations across a fleet: define clear boundaries, make changes snapshot-based, and keep operational signals observable.
- Operations scales when desired vs actual state is always visible
- Safe-mode/backoff policies prevent “restart storms”
- Runbooks + deterministic rollback cut MTTR dramatically
What it is
Edge operations covers everything around the runtime: process supervision, configuration sync, updates, and collecting the signals that tell you a site is healthy.
Operating model
- Edge is treated as a managed node: declared desired state + observed actual state
- Updates are delivered as versioned artifacts with explicit rollout control
- Local autonomy is preserved: edge continues executing even during cloud outages
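To make the declared desired state concrete, it can be as small as a versioned record per site. The sketch below is conceptual Python; the field names (artifact_version, config_hash, rollout_wave) are illustrative, not a platform schema:

# Conceptual desired/actual state records for one edge site (field names are illustrative).
from dataclasses import dataclass

@dataclass(frozen=True)
class DesiredState:
    site_id: str            # which managed node this applies to
    artifact_version: str   # versioned runtime/app artifact that should be running
    config_hash: str        # hash of the configuration snapshot
    rollout_wave: int       # explicit rollout control: which wave this site belongs to

@dataclass(frozen=True)
class ActualState:
    site_id: str
    artifact_version: str   # what is actually running, as reported by the site
    config_hash: str
    healthy: bool

desired = DesiredState("site-042", "runtime-1.8.3", "a9f3c2", rollout_wave=2)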
How it works (high level)
- A supervisor layer watches the runtime and critical adapters, restarting when policy says it is safe
- Health signals and diagnostics are emitted continuously and sent upstream when possible
- The platform reconciles “what should be running” with “what is running” and flags drift
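The reconciliation step above can be sketched in a few lines of Python. The record shape (version, config_hash) is an assumption for illustration, not the platform's actual API:

# Minimal drift check: compare declared desired state with reported actual state.
def find_drift(desired: dict, actual: dict) -> list[str]:
    findings = []
    for site, want in desired.items():
        have = actual.get(site)
        if have is None:
            findings.append(f"{site}: no actual state reported (offline or never deployed)")
            continue
        if have["version"] != want["version"]:
            findings.append(f"{site}: version drift {have['version']} != {want['version']}")
        if have["config_hash"] != want["config_hash"]:
            findings.append(f"{site}: config drift {have['config_hash']} != {want['config_hash']}")
    return findings

desired = {"site-042": {"version": "1.8.3", "config_hash": "a9f3c2"}}
actual = {"site-042": {"version": "1.8.1", "config_hash": "a9f3c2"}}
print(find_drift(desired, actual))   # -> ['site-042: version drift 1.8.1 != 1.8.3']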
Architecture at a glance
- Endpoints (protocol sessions) → points (signals) → mappings (typed bindings) → control app ports
- Adapters isolate variable-latency protocol work from deterministic control execution paths
- Validation and data-quality checks sit between “connected” and “correct”
- This is a UI + backend + edge concern: changes affect real-world actuation
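One way to picture the endpoints → points → mappings → ports chain is as plain typed records. The Python types below (Endpoint, Point, Mapping) are illustrative names, not platform classes:

from dataclasses import dataclass

@dataclass
class Endpoint:          # a protocol session, e.g. one Modbus TCP connection
    name: str
    protocol: str
    host: str

@dataclass
class Point:             # a signal read from or written to an endpoint
    name: str
    endpoint: Endpoint
    address: str
    dtype: str           # e.g. "REAL", "BOOL"
    unit: str
    scale: float

@dataclass
class Mapping:           # typed binding from a point to a control app port
    point: Point
    app_port: str        # e.g. "app:fb-network/valve_cmd"
    direction: str       # "read" or "write"
    owner: str           # single-writer ownership, e.g. "app:fb-network"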
Typical workflow
- Define endpoints and point templates (units, scaling, ownership)
- Bind points to app ports and validate types/limits
- Commission using a canary device and verify data quality (staleness/range)
- Roll out with rate limits and monitoring for flaps and errors
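The "validate types/limits" and "verify data quality (staleness/range)" steps can be expressed as a small check per sampled value. This Python sketch uses made-up thresholds (limits, max_age_s):

import time

# Illustrative data-quality check for one sampled point value.
def check_point(value: float, sampled_at: float, limits=(0.0, 3000.0), max_age_s=5.0) -> list[str]:
    problems = []
    lo, hi = limits
    if not (lo <= value <= hi):
        problems.append(f"out of range: {value} not in [{lo}, {hi}]")
    age = time.time() - sampled_at
    if age > max_age_s:
        problems.append(f"stale: last sample {age:.1f}s old (max {max_age_s}s)")
    return problems

# Example: a pump speed sampled 12 s ago would be flagged as stale.
print(check_point(1450.0, time.time() - 12))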
System boundary
Treat Operations as a repeatable interface between engineering intent (design) and runtime reality (deployments + signals). Keep site-specific details configurable so the same design scales across sites.
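A minimal sketch of that split, assuming a hypothetical shared base design plus per-site overrides (all keys and the example host are illustrative):

# Shared design + per-site overrides; site values win over the base design.
BASE_DESIGN = {
    "scan_interval_s": 1.0,
    "telemetry_buffer_max": 10_000,
    "endpoints": {"pump-1": {"protocol": "modbus", "port": 502}},
}

SITE_OVERRIDES = {
    "site-042": {"endpoints": {"pump-1": {"host": "192.168.10.21"}}},
}

def site_config(site_id: str) -> dict:
    cfg = {**BASE_DESIGN, "endpoints": {}}
    overrides = SITE_OVERRIDES.get(site_id, {})
    for name, ep in BASE_DESIGN["endpoints"].items():
        cfg["endpoints"][name] = {**ep, **overrides.get("endpoints", {}).get(name, {})}
    for key, value in overrides.items():
        if key != "endpoints":
            cfg[key] = value
    return cfg

print(site_config("site-042")["endpoints"]["pump-1"])
# -> {'protocol': 'modbus', 'port': 502, 'host': '192.168.10.21'}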
Example artifact
I/O mapping table (conceptual)
point_name, protocol, address, type, unit, scale, direction, owner
pump_speed, modbus, 40021, REAL, rpm, 0.1, read, device:pump-1
valve_cmd, modbus, 00013, BOOL, -, -, write, app:fb-network
Why it matters
- Reduces the need for on-site interventions
- Provides predictable upgrade and rollback mechanics
- Creates actionable diagnostics instead of “it stopped”
Common failure modes
- Units/scaling mismatch (values look “reasonable” but are wrong)
- Swapped addresses/endianness/encoding issues that only show under load
- Staleness: values stop changing but connectivity stays “green”
- Write conflicts from unclear single-writer ownership
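The write-conflict failure is the one most worth preventing in code. A minimal single-writer guard, with an illustrative owner registry, might look like this:

# Illustrative single-writer guard: only the registered owner may write a point.
class WriteArbiter:
    def __init__(self, owners: dict[str, str]):
        self.owners = owners            # point name -> owning app/device

    def write(self, point: str, value, writer: str) -> None:
        owner = self.owners.get(point)
        if owner != writer:
            raise PermissionError(f"{writer} is not the owner of {point} (owner: {owner})")
        # ... hand off to the protocol adapter here ...
        print(f"{writer} wrote {value} to {point}")

arbiter = WriteArbiter({"valve_cmd": "app:fb-network"})
arbiter.write("valve_cmd", True, writer="app:fb-network")      # allowed
# arbiter.write("valve_cmd", False, writer="app:manual-hmi")   # would raise PermissionError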
Acceptance tests
- Step input values and verify expected output actuation (end-to-end)
- Inject stale/noisy values and confirm guards flag or suppress them
- Confirm single-writer ownership with a write-conflict test
- Verify the deployed snapshot/version matches intent (no drift)
- Run a canary validation: behavior, health, and telemetry align with expectations
- Verify rollback works and restores known-good behavior
In the platform
- Supervise runtime processes and restart strategies
- Report health, version, and deployment status
- Expose diagnostics for commissioning and support
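The health/version/deployment report can be a small self-describing payload. The shape and field names below are assumptions for illustration, not the platform's actual schema:

import json, time

# Illustrative status payload: health, versions, and deployment reconciliation state.
def build_status_report(site_id: str, restarts_last_hour: int, adapters: dict[str, str],
                        running_version: str, desired_version: str) -> str:
    report = {
        "site_id": site_id,
        "timestamp": time.time(),
        "runtime": {"restarts_last_hour": restarts_last_hour},
        "adapters": adapters,                         # e.g. {"modbus": "ok", "opcua": "degraded"}
        "deployment": {
            "running_version": running_version,
            "desired_version": desired_version,
            "in_sync": running_version == desired_version,
        },
    }
    return json.dumps(report)

print(build_status_report("site-042", 1, {"modbus": "ok"}, "1.8.3", "1.8.3"))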
Failure modes and responses
- Crash loops → backoff + safe-mode policy + surface the last error context
- Partial degradation → isolate failing adapter while keeping the core runtime alive
- Misconfiguration → detect drift from snapshot and recommend rollback or re-deploy
- Connectivity loss → continue local control, queue telemetry, and mark site as offline/degraded
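As a sketch of the crash-loop response, a supervisor can apply exponential backoff and drop into safe mode past a threshold. The numbers and class name here are illustrative policy choices, not platform defaults:

import time

# Illustrative restart policy: exponential backoff, then safe mode after repeated crashes.
class RestartPolicy:
    def __init__(self, base_delay_s=1.0, max_delay_s=300.0, safe_mode_after=5):
        self.base_delay_s = base_delay_s
        self.max_delay_s = max_delay_s
        self.safe_mode_after = safe_mode_after
        self.crash_count = 0

    def on_crash(self, last_error: str) -> str:
        self.crash_count += 1
        if self.crash_count >= self.safe_mode_after:
            # Stop restarting; surface the last error context for diagnostics.
            return f"SAFE_MODE (after {self.crash_count} crashes, last error: {last_error})"
        delay = min(self.base_delay_s * 2 ** (self.crash_count - 1), self.max_delay_s)
        time.sleep(delay)  # a real supervisor would schedule the restart, not block
        return f"RESTART (attempt {self.crash_count}, waited {delay:.0f}s)"

policy = RestartPolicy()
print(policy.on_crash("adapter timeout"))   # RESTART (attempt 1, waited 1s)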
What to monitor
- Runtime process uptime and restart counts
- Adapter health and protocol error rates
- Deployment reconciliation state (desired vs actual)
- Local telemetry buffer depth and delivery latency
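Buffer depth and delivery latency can be exposed by the local telemetry queue itself. This sketch uses a simple in-memory deque and a made-up drop-oldest policy:

import time
from collections import deque

# Illustrative local telemetry buffer that reports its own depth and delivery lag.
class TelemetryBuffer:
    def __init__(self, max_depth=10_000):
        self.queue = deque()
        self.max_depth = max_depth

    def enqueue(self, sample: dict) -> None:
        if len(self.queue) >= self.max_depth:
            self.queue.popleft()        # drop-oldest policy; could also drop-newest or block
        self.queue.append({"enqueued_at": time.time(), **sample})

    def metrics(self) -> dict:
        oldest_age = time.time() - self.queue[0]["enqueued_at"] if self.queue else 0.0
        return {
            "depth": len(self.queue),
            "fill_ratio": len(self.queue) / self.max_depth,
            "oldest_sample_age_s": oldest_age,   # proxy for delivery latency while offline
        }

buf = TelemetryBuffer()
buf.enqueue({"point": "pump_speed", "value": 1450.0})
print(buf.metrics())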
Implementation checklist
- Define restart/backoff policies and a “safe mode” threshold
- Track desired vs actual version/config and alert on drift
- Document field runbooks: rollback, diagnostics capture, escalation
- Monitor buffer depth and delivery latency during connectivity issues
Rollout guidance
- Start with a canary site that matches real conditions
- Use health + telemetry gates; stop expansion on regressions
- Keep rollback to a known-good snapshot fast and rehearsed
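A hedged sketch of the "health + telemetry gates" idea: expansion continues only while every gate passes. Gate names and thresholds are assumptions for illustration:

# Illustrative rollout gate: stop expansion if any canary metric regresses past its threshold.
GATES = {
    "restart_count_per_day": 2,       # max acceptable restarts at the canary site
    "adapter_error_rate": 0.01,       # max acceptable protocol error rate
    "telemetry_delivery_p95_s": 30.0, # max acceptable delivery latency
}

def rollout_may_continue(canary_metrics: dict) -> tuple[bool, list[str]]:
    failures = [
        f"{name}: {canary_metrics[name]} > {limit}"
        for name, limit in GATES.items()
        if canary_metrics.get(name, float("inf")) > limit
    ]
    return (not failures, failures)

ok, reasons = rollout_may_continue(
    {"restart_count_per_day": 0, "adapter_error_rate": 0.002, "telemetry_delivery_p95_s": 12.0}
)
print(ok, reasons)   # True, []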
Common questions
Quick answers that help during commissioning and operations.
What is “drift” in edge operations?
Drift is when the running version/config differs from the intended snapshot/deployment plan. Detect it via reconciliation and fix by re-deploying or rolling back.
How do we keep edge autonomous but manageable?
Keep execution local, but keep desired state centralized and observable. The system should reconcile and report, not require constant remote control.
What should we alert on first?
Crash loops/restart bursts, adapter health degradation, deployment reconciliation failures, and telemetry buffer saturation.