Certification-Grade Reliability and Audit Plan

This playbook summarizes the timing, redundancy, verification, and compliance controls required to operate the station at certification-level reliability.

Timing and Signal Fidelity

  • Deterministic timing: Discipline system clocks with chrony using a GPS PPS reference; alert when measured drift exceeds the tolerance allowed for SAME frame spacing or attention tone cadence.
  • Test harness: Measure SAME header spacing, AFSK mark/space frequencies (2083.3/1562.5 Hz) at 520.83 bps, attention tone duration/levels, and EOM handling through both loopback and over-the-air captures.
  • SAME alignment: Validate header start offsets, inter-frame spacing, and bit timing against tolerances before releases.
  • Attention tone checks: Verify tone length, level matching between the 853 Hz and 960 Hz components, and stereo balance on each build.
  • EOM handling: Confirm three clean EOM bursts terminate playback and clear encoder/decoder states.
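
The timing checks above can be sketched as a few pure functions. This is a minimal illustration, not the station's actual harness: it assumes the standard SAME framing (16 preamble bytes, 520.83 bps, a nominal one-second pause between bursts), and the function names and the `tol_s` tolerance are hypothetical placeholders to be replaced with the values the certification target requires.

```python
from fractions import Fraction

# SAME constants expressed as exact fractions to avoid rounding drift.
BIT_RATE = Fraction(3125, 6)    # 520.83 bits per second
MARK_HZ = Fraction(6250, 3)     # 2083.3 Hz, logic one
SPACE_HZ = Fraction(3125, 2)    # 1562.5 Hz, logic zero
PREAMBLE_BYTES = 16             # preamble sent before every burst


def burst_duration_s(message: str) -> float:
    """Nominal on-air duration of one burst (preamble plus message bytes)."""
    bits = (PREAMBLE_BYTES + len(message)) * 8
    return float(bits / BIT_RATE)


def inter_burst_gaps(burst_starts: list[float], burst_len_s: float) -> list[float]:
    """Silence between the end of each burst and the start of the next."""
    return [
        nxt - (cur + burst_len_s)
        for cur, nxt in zip(burst_starts, burst_starts[1:])
    ]


def gaps_ok(gaps: list[float], nominal_s: float = 1.0, tol_s: float = 0.05) -> bool:
    """True when every gap is within tol_s of nominal (tol_s is an assumed value)."""
    return all(abs(g - nominal_s) <= tol_s for g in gaps)
```

For example, `burst_duration_s("NNNN")` gives the nominal length of one EOM burst, and feeding the measured burst start times from a loopback capture into `inter_burst_gaps` and `gaps_ok` yields a pass/fail result that can gate a release.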

Redundancy and Failover

  • Active/standby nodes: Run dual nodes with keepalived/VRRP for floating IP failover; alert on role flips and gratuitous ARP anomalies.
  • State durability: Use Postgres streaming replication with synchronous commit for alert state tables; keep Redis behind Sentinel with AOF persistence for transient queues.
  • Operational drills: Maintain a manual takeover playbook (power, network, services, DNS/IP steps) and rehearse regularly to validate staff readiness.
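
Alerting on role flips, as described above, amounts to counting VRRP state transitions inside a time window. The sketch below assumes role samples are already being collected from somewhere (for instance a keepalived notify script writing to a log); the `RoleSample` type, the role strings, and the thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoleSample:
    timestamp: float  # seconds since an arbitrary epoch
    role: str         # e.g. "MASTER" or "BACKUP"


def flip_times(samples: list[RoleSample]) -> list[float]:
    """Timestamps at which the observed role differs from the previous sample."""
    ordered = sorted(samples, key=lambda s: s.timestamp)
    return [
        cur.timestamp
        for prev, cur in zip(ordered, ordered[1:])
        if prev.role != cur.role
    ]


def is_flapping(samples: list[RoleSample], window_s: float, max_flips: int) -> bool:
    """True when more than max_flips role changes fall inside any window_s span."""
    flips = flip_times(samples)
    for i, start in enumerate(flips):
        in_window = sum(1 for t in flips[i:] if t - start <= window_s)
        if in_window > max_flips:
            return True
    return False
```

A single flip is an expected failover and should still raise an alert on its own; `is_flapping` is meant for the separate case where the floating IP bounces repeatedly, which usually points to a network or health-check problem rather than a node failure.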

SDR Verification and RF Hygiene

  • Calibration: Measure and record PPM per SDR stick; refuse service when calibration or SNR gates fail.
  • Quality gating: Require minimum SNR and clean constellation before decoding; log rejects with timestamps and stick serials.
  • Audit artifacts: Archive short IQ snippets immediately before and after alerts for later dispute resolution.
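
The calibration and SNR gates above reduce to two small calculations and a threshold check. The following is a sketch under assumptions: the limits `max_abs_ppm` and `min_snr_db` are hypothetical parameters, and how the measured frequency and power values are obtained from the SDR is outside its scope.

```python
import math


def ppm_error(measured_hz: float, reference_hz: float) -> float:
    """Frequency offset of a stick in parts per million relative to a known carrier."""
    return (measured_hz - reference_hz) / reference_hz * 1e6


def snr_db(signal_power: float, noise_power: float) -> float:
    """Signal-to-noise ratio in dB from linear power estimates."""
    return 10.0 * math.log10(signal_power / noise_power)


def gate(ppm: float, snr: float, max_abs_ppm: float, min_snr_db: float) -> tuple[bool, str]:
    """Return (accepted, reason); thresholds are deployment-specific assumptions."""
    if abs(ppm) > max_abs_ppm:
        return False, f"ppm {ppm:.2f} outside +/-{max_abs_ppm}"
    if snr < min_snr_db:
        return False, f"snr {snr:.1f} dB below {min_snr_db} dB"
    return True, "ok"
```

The reason string returned by `gate` is intended to be logged together with the timestamp and stick serial, so that each reject is traceable during an audit.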

Hardware Controls

  • Stable device naming: Enforce udev rules for persistent names and per-device serial whitelists to prevent accidental role swaps.
  • Portable GPIO: Use libgpiod for all GPIO access to remain portable across kernels and boards.

Ingestion Hygiene

  • Deduplication: De-duplicate CAP messages by identifier and sent time; prefer IPAWS primary feeds over mirrors.
  • Storm protection: Rate-limit ingestion bursts to prevent replay amplification or malformed floods.
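
Both ingestion controls above can be illustrated with standard-library Python. This sketch keys de-duplication on the CAP identifier and sent time as the bullet describes, and uses a token bucket for burst limiting; the class names, the in-memory set, and the rate and burst values are assumptions, and a production system would persist seen keys and expire old ones.

```python
class CapDeduplicator:
    """Remembers (identifier, sent) pairs and rejects repeats."""

    def __init__(self) -> None:
        self._seen: set[tuple[str, str]] = set()

    def is_new(self, identifier: str, sent: str) -> bool:
        key = (identifier, sent)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


class TokenBucket:
    """Allows up to `burst` messages at once, refilled at `rate_per_s`."""

    def __init__(self, rate_per_s: float, burst: float, now: float = 0.0) -> None:
        self.rate_per_s = rate_per_s
        self.burst = burst
        self._tokens = burst
        self._last = now

    def allow(self, now: float) -> bool:
        # Refill according to elapsed time; ignore clocks that step backwards.
        elapsed = max(0.0, now - self._last)
        self._tokens = min(self.burst, self._tokens + elapsed * self.rate_per_s)
        self._last = max(self._last, now)
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False
```

Passing the clock value in explicitly, rather than calling `time.monotonic()` inside `allow`, keeps the limiter deterministic under load tests; the caller supplies the monotonic time in production.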

Backups and Compliance

  • Tamper resistance: Store logs in WORM-style targets with signed hash chains for audit trails.
  • Access security: Require client mTLS for IPAWS COG access and track certificate expiration with proactive alarms.
  • Backups: Schedule periodic backups for databases and config, including replication metadata and udev/GPIO rules.
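
The signed hash chain mentioned under tamper resistance can be sketched as follows. Each entry commits to the hash of the previous one, so altering or removing any record invalidates everything after it. As a stand-in for a real signature, this example uses HMAC-SHA256 with a shared key; that is an assumption made to keep the sketch inside the standard library, and an asymmetric signature scheme would be the more likely choice for external audits. The entry layout and function names are hypothetical.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, body: str) -> str:
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(chain: list[dict], payload: dict, key: bytes) -> list[dict]:
    """Return a new chain with payload appended, linked to the prior entry."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    entry_hash = _digest(prev_hash, body)
    sig = hmac.new(key, entry_hash.encode("utf-8"), hashlib.sha256).hexdigest()
    return chain + [{"prev": prev_hash, "body": body, "hash": entry_hash, "sig": sig}]


def verify_chain(chain: list[dict], key: bytes) -> bool:
    """Recompute every link and tag; any edit, reorder, or wrong key fails."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev"] != prev_hash:
            return False
        expected_hash = _digest(prev_hash, entry["body"])
        if entry["hash"] != expected_hash:
            return False
        expected_sig = hmac.new(
            key, expected_hash.encode("utf-8"), hashlib.sha256
        ).hexdigest()
        if not hmac.compare_digest(expected_sig, entry["sig"]):
            return False
        prev_hash = expected_hash
    return True
```

Writing each entry to a WORM target as it is produced, and periodically publishing the latest hash somewhere independent, limits an attacker's ability to rewrite the whole chain even if the signing key leaks.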

Monitoring and Observability

  • Metrics and dashboards: Export metrics to Prometheus and visualize in Grafana; alert on timing drift, VRRP state changes, replication lag, Sentinel failover, SNR gating, and IQ archival failures.
  • Auth and API exposure: Use Keycloak for SSO and DreamFactory for read-only REST over Postgres where external dashboards need limited access.

Simulation and Operator Readiness

  • Scenario simulator: Provide a simulator to run RWT, RMT, and EAN preemption scenarios, capturing operator acknowledgements and timing deltas.
  • Playback review: Record simulated audio/IQ alongside logs to confirm end-to-end timing and UI prompts.

Acceptance Checklist

  • Timing harness validates SAME headers, AFSK bit rates, attention tone duration/level, and EOM clearing in loopback and OTA paths.
  • Chrony+GPS PPS health alarms trigger on drift beyond tolerance; VRRP, replication, and Sentinel failovers alert immediately.
  • SDRs calibrated per-stick with enforced SNR gates and archived IQ slices before/after alerts.
  • Udev persistent names, serial whitelists, and libgpiod-based GPIO confirmed on both nodes.
  • CAP ingestion de-duplication, IPAWS preference, and rate-limiting verified under load tests.
  • WORM log storage with signed hash chains and mTLS-enforced IPAWS connectivity in place; certificate expiry monitoring active.
  • Simulator exercises RWT/RMT/EAN preemption workflows and logs operator acknowledgements for drills.

This document is served from docs/process/certification_reliability_plan.md in the EAS Station installation.