Architecture¶
This document describes the internal architecture of Supervice for contributors and anyone interested in how it works under the hood.
Module Overview¶
supervice/
├── main.py Entry point, argument parsing, daemonization
├── core.py Supervisor orchestrator — central coordinator
├── process.py Process lifecycle with async state machine
├── config.py INI config parser with validation
├── models.py Data models (dataclasses)
├── rpc.py Unix socket RPC server (JSON over length-prefixed protocol)
├── client.py CLI client (supervicectl) and Controller class
├── events.py EventBus — async pub/sub for state changes
├── health.py Health check implementations (TCP, script)
└── logger.py Logging setup with rotation
Data Flow¶
Config File (INI)
│
▼
┌──────────────┐
│ parse_config │ config.py
└──────┬───────┘
│ SupervisorConfig
▼
┌──────────────┐
│ Supervisor │ core.py
│ load_config │
└──────┬───────┘
│ Creates Process instances
▼
┌──────────────────────┐
│ async run() │
│ │
│ ┌────────────────┐ │
│ │ Process(s) │ │ process.py
│ │ supervise() │ │
│ └────────┬───────┘ │
│ │ │
│ ┌────────▼───────┐ │
│ │ RPCServer │ │ rpc.py
│ │ (Unix sock) │ │
│ └────────┬───────┘ │
│ │ │
│ ┌────────▼───────┐ │
│ │ EventBus │ │ events.py
│ │ (pub/sub) │ │
│ └────────────────┘ │
└──────────────────────┘
▲
│ JSON/Unix Socket
│
┌──────────────────────┐
│ Controller │ client.py
│ (supervicectl) │
└──────────────────────┘
Process State Machine¶
Each Process instance manages a single OS process through a state machine:
┌──────────┐
│ STOPPED │ ◄── initial state
└────┬─────┘
│ should_run = true
▼
┌──────────┐
┌───▶│ STARTING │
│ └────┬─────┘
│ │ spawn succeeds
│ ▼
│ ┌──────────┐ ┌───────────┐
│ │ RUNNING │────────▶│ UNHEALTHY │
│ └────┬─────┘ health │ │
│ │ fails └─────┬─────┘
│ │ │ auto-restart
│ ▼ │ (kill + restart)
│ ┌──────────┐ │
│ │ STOPPING │ ◄─────────────┘
│ └────┬─────┘
│ │ process exits
│ ▼
│ ┌──────────┐
│ │ EXITED │
│ └────┬─────┘
│ │ autorestart = true
│ ▼ ┌───────┐
│ ┌──────────┐ retries > max │ FATAL │
└────│ BACKOFF │─────────────────▶│ │
└──────────┘ └───────┘
State Transitions¶
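| From | To | Trigger |
|---|---|---|
| STOPPED | STARTING | should_run = true |
| STARTING | RUNNING | Process spawned successfully |
| STARTING | | Spawn failed (command not found, permission error) |
| RUNNING | STOPPING | Stop requested or health check restart |
| RUNNING | UNHEALTHY | Health check failures exceed threshold |
| UNHEALTHY | RUNNING | Health check passes again |
| UNHEALTHY | STOPPING | Auto-restart triggered |
| STOPPING | | Process exited after stop signal |
| | EXITED | Process exited |
| EXITED | BACKOFF | autorestart = true |
| BACKOFF | STARTING | Backoff delay elapsed |
| BACKOFF | FATAL | Retry count exceeds the maximum |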
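The edges drawn in the state machine diagram can be encoded as a lookup table that rejects unexpected transitions. This is a minimal sketch and not Supervice's actual code: ProcessState, ALLOWED_TRANSITIONS, and can_transition are hypothetical names, and only edges that the diagram or the trigger descriptions make explicit are listed.

```python
from enum import Enum


class ProcessState(Enum):
    STOPPED = "STOPPED"
    STARTING = "STARTING"
    RUNNING = "RUNNING"
    UNHEALTHY = "UNHEALTHY"
    STOPPING = "STOPPING"
    EXITED = "EXITED"
    BACKOFF = "BACKOFF"
    FATAL = "FATAL"


S = ProcessState

# Hypothetical table: each state maps to the states it may move to.
ALLOWED_TRANSITIONS = {
    S.STOPPED: {S.STARTING},               # should_run = true
    S.STARTING: {S.RUNNING},               # spawn succeeds
    S.RUNNING: {S.UNHEALTHY, S.STOPPING},  # health fails / stop requested
    S.UNHEALTHY: {S.RUNNING, S.STOPPING},  # health recovers / auto-restart
    S.STOPPING: {S.EXITED},                # process exits
    S.EXITED: {S.BACKOFF},                 # autorestart = true
    S.BACKOFF: {S.STARTING, S.FATAL},      # delay elapsed / retries > max
    S.FATAL: set(),                        # no outgoing edge in the diagram
}


def can_transition(current: ProcessState, target: ProcessState) -> bool:
    """Return True if the sketch table allows current -> target."""
    return target in ALLOWED_TRANSITIONS[current]
```

A table like this keeps the legal edges in one place, so a state setter can check can_transition() before changing state.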
Concurrency Safety¶
State transitions are protected by an asyncio.Lock (_state_lock) to prevent
race conditions between the supervision loop, RPC commands, and health check
tasks.
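To illustrate why the lock matters, here is a small sketch of a compare-and-set transition guarded by an asyncio.Lock. Only the _state_lock name comes from this document; the StateHolder class, its transition() method, and the demo are assumptions made for illustration.

```python
import asyncio


class StateHolder:
    """Hypothetical stand-in for the state-carrying part of Process."""

    def __init__(self) -> None:
        self.state = "STOPPED"
        self._state_lock = asyncio.Lock()

    async def transition(self, expected: str, new: str) -> bool:
        """Move to `new` only if the state is still `expected`."""
        async with self._state_lock:
            if self.state != expected:
                return False
            # A real transition may await here (e.g. publishing an event);
            # the lock keeps other callers out until the state is updated.
            await asyncio.sleep(0)
            self.state = new
            return True


async def demo():
    holder = StateHolder()
    # Two callers race to start the same process; only one may win.
    results = await asyncio.gather(
        holder.transition("STOPPED", "STARTING"),
        holder.transition("STOPPED", "STARTING"),
    )
    return results, holder.state
```

Without the lock, both callers could pass the check before either one updates the state, and the process would be started twice.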
Supervisor (core.py)¶
The Supervisor class is the central coordinator:
Loads configuration: parses the INI file and creates Process instances
Starts supervision: launches an async task for each process
Signal handling: SIGINT and SIGTERM trigger shutdown; SIGHUP is ignored
PID file locking: prevents multiple instances via fcntl.flock()
Manages RPC server: delegates commands to individual processes
Hot reload: adds and removes processes based on config changes
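The PID file locking mentioned above can be sketched with fcntl.flock() as follows. The function name acquire_pid_lock and the file mode are assumptions; the sketch only shows the general technique of taking an exclusive, non-blocking lock and holding the descriptor open for the lifetime of the daemon.

```python
import fcntl
import os


def acquire_pid_lock(path: str) -> int:
    """Lock `path` exclusively and write our PID into it.

    Returns the open file descriptor; the lock is held until it is closed.
    Raises BlockingIOError if another holder already has the lock.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # LOCK_NB makes the call fail immediately instead of waiting.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        os.close(fd)
        raise
    # Only truncate after the lock is ours, so a losing process never
    # clobbers the PID written by the running instance.
    os.ftruncate(fd, 0)
    os.write(fd, f"{os.getpid()}\n".encode())
    return fd
```

Because the kernel releases the lock when the descriptor is closed or the process dies, a stale PID file left by a crash does not block the next start.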
Shutdown Sequence¶
Receive SIGINT/SIGTERM
Set shutdown event
Release PID file lock
Stop RPC server
Stop EventBus
Stop all processes (with shutdown_timeout)
Exit
RPC Server (rpc.py)¶
The RPC server listens on a Unix domain socket with restrictive permissions
(0o600).
Protocol¶
Length-prefixed JSON over Unix socket:
┌─────────────┬──────────────────┐
│ 4-byte len │ JSON payload │
│ (uint32 BE) │ │
└─────────────┴──────────────────┘
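As a rough illustration of this framing, the sketch below packs and unpacks a message with a 4-byte big-endian length prefix. The function names and the {"command": "status"} payload are hypothetical; the 1 MB cap mirrors the maximum message size stated in the Security section.

```python
import json
import struct

HEADER = struct.Struct(">I")        # 4-byte unsigned int, big-endian
MAX_MESSAGE_SIZE = 1024 * 1024      # 1 MB limit from the Security section


def encode_message(obj) -> bytes:
    """Serialize `obj` to JSON and prepend its byte length."""
    payload = json.dumps(obj).encode("utf-8")
    if len(payload) > MAX_MESSAGE_SIZE:
        raise ValueError("message too large")
    return HEADER.pack(len(payload)) + payload


def decode_message(data: bytes):
    """Parse one complete frame and return the decoded JSON value."""
    if len(data) < HEADER.size:
        raise ValueError("incomplete header")
    (length,) = HEADER.unpack_from(data)
    if length > MAX_MESSAGE_SIZE:
        raise ValueError("message too large")
    payload = data[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        raise ValueError("incomplete payload")
    return json.loads(payload.decode("utf-8"))


# Hypothetical request shape; the real field names are not shown here.
frame = encode_message({"command": "status"})
```

Reading the length first lets the receiver reject oversized messages before allocating a buffer for the payload.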
Commands¶
| Command | Parameters | Description |
|---|---|---|
| | (none) | List all processes with state, PID, uptime |
| | | Start a process |
| | | Stop a process |
| | | Restart a process |
| | | Start all processes in group |
| | | Stop all processes in group |
| | (none) | Reload configuration |
Security¶
The socket is created under umask(0o177), so its permissions are restrictive from the moment it exists
Unknown commands are rejected with an UNKNOWN_COMMAND error
Invalid JSON is rejected with an INVALID_JSON error
Maximum message size: 1 MB
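The umask technique from the first point can be sketched as follows. The helper name bind_restricted_socket is an assumption. The idea is that setting the umask before bind() avoids the window that a bind-then-chmod sequence would leave open.

```python
import os
import socket


def bind_restricted_socket(path: str) -> socket.socket:
    """Bind a Unix stream socket whose file mode is 0o600 from creation."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    # umask 0o177 masks out every bit except owner read/write.
    old_umask = os.umask(0o177)
    try:
        sock.bind(path)
    except OSError:
        sock.close()
        raise
    finally:
        # Always restore the previous umask for the rest of the process.
        os.umask(old_umask)
    sock.listen()
    return sock
```

Note that the umask is process-wide, so in a multi-threaded program other threads creating files at the same moment would be affected too.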
EventBus (events.py)¶
Async publish/subscribe system for process state changes.
Design¶
A bounded asyncio.Queue (default 1000 events) prevents memory exhaustion
When the queue is full, the oldest events are dropped with a warning
Subscribers receive events asynchronously via await handler(event)
Event processing errors are logged but don’t crash the bus
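The design points above can be sketched in a few lines. This is an illustrative approximation, not the EventBus implementation: the class name BoundedEventBus and its publish(), subscribe(), and drain() methods are assumptions.

```python
import asyncio
import logging

log = logging.getLogger("eventbus-sketch")


class BoundedEventBus:
    """Hypothetical bus: bounded queue, drop-oldest, isolated handlers."""

    def __init__(self, maxsize: int = 1000) -> None:
        self._queue = asyncio.Queue(maxsize=maxsize)
        self._handlers = []

    def subscribe(self, handler) -> None:
        self._handlers.append(handler)

    def publish(self, event) -> None:
        if self._queue.full():
            dropped = self._queue.get_nowait()  # discard the oldest event
            log.warning("queue full, dropping oldest event: %r", dropped)
        self._queue.put_nowait(event)

    async def drain(self) -> None:
        """Deliver every queued event to every subscriber."""
        while not self._queue.empty():
            event = self._queue.get_nowait()
            for handler in self._handlers:
                try:
                    await handler(event)
                except Exception:
                    # One failing subscriber must not stop delivery.
                    log.exception("handler failed for event %r", event)


async def demo():
    bus = BoundedEventBus(maxsize=2)
    seen = []

    async def record(event):
        seen.append(event)

    async def broken(event):
        raise RuntimeError("boom")

    bus.subscribe(broken)
    bus.subscribe(record)
    for n in (1, 2, 3):          # the third publish evicts event 1
        bus.publish(n)
    await bus.drain()
    return seen
```

Dropping the oldest event favours recent state changes over historical ones, which suits a supervisor where the latest process state is what matters most.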
Event Types¶
| Event | Payload |
|---|---|
| | processname, groupname, from_state, pid |
| | processname, groupname, from_state, pid |
| | processname, groupname, from_state, pid |
| | processname, groupname, from_state, pid |
| | processname, groupname, from_state, pid |
| | processname, groupname, from_state, pid |
| | processname, groupname, from_state, pid |
| | processname, groupname, from_state, pid |
| | processname, message, pid |
| | processname, message, failures, pid |
Health Checks (health.py)¶
Health checks run as separate asyncio.Task instances alongside each process.
Architecture¶
Process.supervise()
│
├── spawn() ──▶ _start_health_checks() ──▶ asyncio.Task(_run_health_checks)
│ │
│ ├── sleep(start_period)
│ ├── loop:
│ │ ├── checker.check()
│ │ ├── handle result
│ │ └── sleep(interval)
│ │
└── kill() ───▶ _stop_health_checks() ──────────▶ task.cancel()
Factory Pattern¶
create_health_checker() returns the appropriate checker based on config:
HealthCheckType.TCP → TCPHealthChecker
HealthCheckType.SCRIPT → ScriptHealthChecker
HealthCheckType.NONE → None
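A factory of this shape might look like the sketch below. The names create_health_checker, HealthCheckType, TCPHealthChecker, ScriptHealthChecker, and check() come from this document. The HealthCheckConfig fields (host, port, command, timeout), the enum values, and the checker bodies are assumptions for illustration only.

```python
import asyncio
from dataclasses import dataclass
from enum import Enum


class HealthCheckType(Enum):
    NONE = "none"
    TCP = "tcp"
    SCRIPT = "script"


@dataclass
class HealthCheckConfig:
    """Hypothetical config shape; the real model may differ."""
    type: HealthCheckType = HealthCheckType.NONE
    host: str = "127.0.0.1"
    port: int = 0
    command: str = ""
    timeout: float = 2.0


class TCPHealthChecker:
    def __init__(self, config: HealthCheckConfig) -> None:
        self.config = config

    async def check(self) -> bool:
        """Healthy if a TCP connection can be opened within the timeout."""
        try:
            _, writer = await asyncio.wait_for(
                asyncio.open_connection(self.config.host, self.config.port),
                self.config.timeout,
            )
        except (OSError, asyncio.TimeoutError):
            return False
        writer.close()
        return True


class ScriptHealthChecker:
    def __init__(self, config: HealthCheckConfig) -> None:
        self.config = config

    async def check(self) -> bool:
        """Healthy if the command exits with status 0 within the timeout."""
        proc = await asyncio.create_subprocess_shell(self.config.command)
        try:
            return await asyncio.wait_for(proc.wait(), self.config.timeout) == 0
        except asyncio.TimeoutError:
            proc.kill()
            await proc.wait()
            return False


def create_health_checker(config: HealthCheckConfig):
    """Return a checker for the configured type, or None when disabled."""
    if config.type is HealthCheckType.TCP:
        return TCPHealthChecker(config)
    if config.type is HealthCheckType.SCRIPT:
        return ScriptHealthChecker(config)
    return None
```

Returning None for HealthCheckType.NONE lets the caller skip creating a health check task entirely instead of running a no-op checker.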
Daemonization (main.py)¶
The _daemonize() function implements the standard Unix double-fork:
First fork: the parent exits, the child continues
setsid(): creates a new session and detaches from the terminal
Second fork: prevents reacquisition of a controlling terminal
Redirect stdio: stdin/stdout/stderr → /dev/null
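The four steps above correspond to the following generic double-fork sketch. It is not the body of _daemonize(). The function name daemonize and the choice of os._exit() for the exiting parents are assumptions.

```python
import os
import sys


def daemonize() -> None:
    """Detach the calling process from its terminal (generic sketch)."""
    # Flush buffered output so it is not written twice after fork().
    sys.stdout.flush()
    sys.stderr.flush()

    # First fork: the parent exits, so the child is not a group leader.
    if os.fork() > 0:
        os._exit(0)

    # New session: the child becomes session leader without a terminal.
    os.setsid()

    # Second fork: the grandchild is not a session leader, so it can
    # never reacquire a controlling terminal.
    if os.fork() > 0:
        os._exit(0)

    # Point stdin, stdout, and stderr at /dev/null.
    devnull = os.open(os.devnull, os.O_RDWR)
    for fd in (0, 1, 2):
        os.dup2(devnull, fd)
    if devnull > 2:
        os.close(devnull)
```

Only the final grandchild returns from daemonize(); both intermediate processes exit immediately.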
Child Process Management¶
Process Groups¶
Each child process is started with start_new_session=True, which places it in a
new session and therefore a new process group. This lets os.killpg() signal the
entire group (the main process and the children it spawns), not just the top-level process.
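The effect can be seen in a short sketch using subprocess. The helper names spawn_in_new_session and kill_process_group are hypothetical, and Supervice itself may spawn children through asyncio rather than subprocess.Popen.

```python
import os
import signal
import subprocess


def spawn_in_new_session(argv):
    """Start a child as the leader of its own session and process group."""
    return subprocess.Popen(argv, start_new_session=True)


def kill_process_group(proc, sig=signal.SIGTERM, timeout=5.0):
    """Signal every process in the child's group, then reap the child."""
    try:
        os.killpg(os.getpgid(proc.pid), sig)
    except ProcessLookupError:
        pass  # the group is already gone
    return proc.wait(timeout=timeout)
```

For example, a shell started with spawn_in_new_session(["sh", "-c", "sleep 30 & wait"]) and the background sleep it launched share one group, so a single kill_process_group() call reaches both.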
Orphan Prevention (Linux)¶
On Linux, prctl(PR_SET_PDEATHSIG, SIGKILL) is set in the preexec_fn to
ensure child processes are killed if the parent dies unexpectedly.
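Python has no built-in wrapper for prctl(), so one common way to make this call is through ctypes. The sketch below is an assumption about how such a hook could be written, not the project's code. The function name set_parent_death_signal is hypothetical, and the constant value 1 is PR_SET_PDEATHSIG as defined in linux/prctl.h.

```python
import ctypes
import signal
import sys

PR_SET_PDEATHSIG = 1  # from <linux/prctl.h>

# Load libc once in the parent; it is a no-op placeholder elsewhere.
_libc = ctypes.CDLL(None, use_errno=True) if sys.platform.startswith("linux") else None


def set_parent_death_signal(sig: int = signal.SIGKILL) -> None:
    """Ask the kernel to send `sig` to this process when its parent dies.

    Intended to run in the child between fork() and exec(), for example
    as a preexec_fn. Does nothing on non-Linux platforms.
    """
    if _libc is None:
        return
    if _libc.prctl(PR_SET_PDEATHSIG, int(sig), 0, 0, 0) != 0:
        raise OSError(ctypes.get_errno(), "prctl(PR_SET_PDEATHSIG) failed")
```

It could be used as subprocess.Popen(argv, preexec_fn=set_parent_death_signal). The setting applies to the process that calls it, so it has to run in the child, not in the supervisor.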
User Switching¶
When user is configured, the preexec_fn callback:
Calls os.initgroups() to set supplementary groups
Calls os.setgid() to set the group ID
Calls os.setuid() to set the user ID
Failures exit with code 126 (EXIT_CODE_USER_SWITCH_FAILED), which the parent
process interprets as a FATAL state.
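The order of the three calls matters: the group changes must happen while the process still has the privileges to make them, so setuid() comes last. A minimal sketch of such a callback follows. EXIT_CODE_USER_SWITCH_FAILED and its value 126 come from this document, while make_user_switcher and the pwd lookup inside the child are assumptions.

```python
import os
import pwd

EXIT_CODE_USER_SWITCH_FAILED = 126


def make_user_switcher(username: str):
    """Build a preexec_fn that drops privileges to `username`."""

    def _switch() -> None:
        try:
            entry = pwd.getpwnam(username)          # KeyError if unknown
            os.initgroups(username, entry.pw_gid)   # supplementary groups
            os.setgid(entry.pw_gid)                 # primary group
            os.setuid(entry.pw_uid)                 # user ID, always last
        except (KeyError, OSError):
            # Exit before exec() so the parent sees the dedicated code.
            os._exit(EXIT_CODE_USER_SWITCH_FAILED)

    return _switch
```

With subprocess.Popen(argv, preexec_fn=make_user_switcher("someuser")), a failed switch shows up in the parent as return code 126 and the target command never runs.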