Architecture

This document describes the internal architecture of Supervice for contributors and anyone interested in how it works under the hood.

Module Overview

supervice/
├── main.py        Entry point, argument parsing, daemonization
├── core.py        Supervisor orchestrator — central coordinator
├── process.py     Process lifecycle with async state machine
├── config.py      INI config parser with validation
├── models.py      Data models (dataclasses)
├── rpc.py         Unix socket RPC server (length-prefixed JSON protocol)
├── client.py      CLI client (supervicectl) and Controller class
├── events.py      EventBus — async pub/sub for state changes
├── health.py      Health check implementations (TCP, script)
└── logger.py      Logging setup with rotation

Data Flow

                    Config File (INI)
                         │
                         ▼
                  ┌──────────────┐
                  │ parse_config │  config.py
                  └──────┬───────┘
                         │ SupervisorConfig
                         ▼
                  ┌──────────────┐
                  │  Supervisor  │  core.py
                  │ load_config  │
                  └──────┬───────┘
                         │ Creates Process instances
                         ▼
              ┌──────────────────────┐
              │     async run()      │
              │                      │
              │  ┌────────────────┐  │
              │  │   Process(s)   │  │  process.py
              │  │   supervise()  │  │
              │  └────────┬───────┘  │
              │           │          │
              │  ┌────────▼───────┐  │
              │  │   RPCServer    │  │  rpc.py
              │  │   (Unix sock)  │  │
              │  └────────┬───────┘  │
              │           │          │
              │  ┌────────▼───────┐  │
              │  │   EventBus     │  │  events.py
              │  │   (pub/sub)    │  │
              │  └────────────────┘  │
              └──────────────────────┘
                         ▲
                         │ JSON/Unix Socket
                         │
              ┌──────────────────────┐
              │     Controller       │  client.py
              │   (supervicectl)     │
              └──────────────────────┘

Process State Machine

Each Process instance manages a single OS process through a state machine:

                          ┌──────────┐
                          │ STOPPED  │ ◄── initial state
                          └────┬─────┘
                               │ should_run = true
                               ▼
                          ┌──────────┐
                     ┌───▶│ STARTING │
                     │    └────┬─────┘
                     │         │ spawn succeeds
                     │         ▼
                     │    ┌──────────┐         ┌───────────┐
                     │    │ RUNNING  │────────▶│ UNHEALTHY │
                     │    └────┬─────┘  health │           │
                     │         │        fails  └─────┬─────┘
                     │         │                     │ auto-restart
                     │         ▼                     │ (kill + restart)
                     │    ┌──────────┐               │
                     │    │ STOPPING │ ◄─────────────┘
                     │    └────┬─────┘
                     │         │ process exits
                     │         ▼
                     │    ┌──────────┐
                     │    │  EXITED  │
                     │    └────┬─────┘
                     │         │ autorestart = true
                     │         ▼                        ┌───────┐
                     │    ┌──────────┐    retries > max  │ FATAL │
                     └────│ BACKOFF  │─────────────────▶│       │
                          └──────────┘                  └───────┘

State Transitions

From        To          Trigger
──────────  ──────────  ──────────────────────────────────────────────────
STOPPED     STARTING    should_run set to true
STARTING    RUNNING     Process spawned successfully
STARTING    FATAL       Spawn failed (command not found, permission error)
RUNNING     STOPPING    Stop requested or health check restart
RUNNING     UNHEALTHY   Health check failures exceed threshold
UNHEALTHY   RUNNING     Health check passes again
UNHEALTHY   STOPPING    Auto-restart triggered
STOPPING    STOPPED     Process exited after stop signal
STOPPING    EXITED      Process exited
EXITED      BACKOFF     autorestart is true
BACKOFF     STARTING    Backoff delay elapsed
BACKOFF     FATAL       Retry count exceeds startretries

Concurrency Safety

State transitions are protected by an asyncio.Lock (_state_lock) to prevent race conditions between the supervision loop, RPC commands, and health check tasks.
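As a rough illustration of how a lock-guarded transition could look, the sketch below encodes the State Transitions table as a dictionary and checks it under an asyncio.Lock. The State enum, the ALLOWED map, and the StateMachine class are hypothetical names chosen for this example; only _state_lock comes from the description above, and the real Process class may be organized differently.

```python
import asyncio
from enum import Enum


class State(Enum):
    STOPPED = "STOPPED"
    STARTING = "STARTING"
    RUNNING = "RUNNING"
    UNHEALTHY = "UNHEALTHY"
    STOPPING = "STOPPING"
    EXITED = "EXITED"
    BACKOFF = "BACKOFF"
    FATAL = "FATAL"


# Allowed transitions, copied from the State Transitions table.
# FATAL has no outgoing entry because the table lists none.
ALLOWED = {
    State.STOPPED: {State.STARTING},
    State.STARTING: {State.RUNNING, State.FATAL},
    State.RUNNING: {State.STOPPING, State.UNHEALTHY},
    State.UNHEALTHY: {State.RUNNING, State.STOPPING},
    State.STOPPING: {State.STOPPED, State.EXITED},
    State.EXITED: {State.BACKOFF},
    State.BACKOFF: {State.STARTING, State.FATAL},
}


class StateMachine:
    """Hypothetical stand-in for the state handling inside Process."""

    def __init__(self) -> None:
        self.state = State.STOPPED
        self._state_lock = asyncio.Lock()

    async def transition(self, new_state: State) -> bool:
        """Move to new_state if the table allows it; return whether it happened."""
        async with self._state_lock:
            if new_state not in ALLOWED.get(self.state, set()):
                return False
            self.state = new_state
            return True
```

Because the check and the assignment happen inside the same lock, a supervision loop, an RPC command, and a health check task cannot interleave between reading the current state and writing the new one.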

Supervisor (core.py)

The Supervisor class is the central coordinator:

  1. Loads configuration — Parses INI, creates Process instances

  2. Starts supervision — Launches async tasks for each process

  3. Signal handling — SIGINT/SIGTERM trigger shutdown, SIGHUP is ignored

  4. PID file locking — Prevents multiple instances via fcntl.flock()

  5. Manages RPC server — Delegates commands to individual processes

  6. Hot reload — Adds/removes processes based on config changes
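The PID file locking in item 4 can be sketched with a non-blocking exclusive fcntl.flock(). The function name acquire_pid_lock and the choice to write the PID into the file are assumptions for this example, not a description of what core.py does.

```python
import fcntl
import os


def acquire_pid_lock(path: str):
    """Return an open, locked file handle, or None if the lock is already held."""
    handle = open(path, "a+")
    try:
        # LOCK_NB makes the call fail immediately instead of waiting.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    # We own the lock: replace any stale content with our PID.
    handle.seek(0)
    handle.truncate()
    handle.write(f"{os.getpid()}\n")
    handle.flush()
    return handle
```

The caller has to keep the returned handle open for the lifetime of the supervisor, since the lock is released when the handle is closed or the process exits.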

Shutdown Sequence

  1. Receive SIGINT/SIGTERM

  2. Set shutdown event

  3. Release PID file lock

  4. Stop RPC server

  5. Stop EventBus

  6. Stop all processes (with shutdown_timeout)

  7. Exit
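A minimal sketch of steps 1, 2, and the ordered teardown that follows, assuming the shutdown event is an asyncio.Event set from a loop signal handler. The function run_until_signal and the idea of passing the teardown steps as a list of coroutine functions are hypothetical, introduced only to show the ordering.

```python
import asyncio
import os
import signal


async def run_until_signal(steps):
    """Wait for SIGINT/SIGTERM, then await each shutdown step in order."""
    loop = asyncio.get_running_loop()
    shutdown = asyncio.Event()
    for sig in (signal.SIGINT, signal.SIGTERM):
        # The handler only sets the event; the real work happens below.
        loop.add_signal_handler(sig, shutdown.set)
    try:
        await shutdown.wait()
    finally:
        for sig in (signal.SIGINT, signal.SIGTERM):
            loop.remove_signal_handler(sig)
    for step in steps:
        await step()
```

loop.add_signal_handler() is only available on Unix and must be called from the main thread, which matches a supervisor that runs its event loop there.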

RPC Server (rpc.py)

The RPC server listens on a Unix domain socket with restrictive permissions (0o600).

Protocol

Length-prefixed JSON over Unix socket:

┌─────────────┬──────────────────┐
│ 4-byte len  │   JSON payload   │
│ (uint32 BE) │                  │
└─────────────┴──────────────────┘
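The framing above can be sketched with struct and json. The helper names and the "command" key in the usage line are assumptions; the document does not specify the JSON schema. The size limit mirrors the 1 MB maximum listed under Security.

```python
import json
import struct

MAX_MESSAGE_SIZE = 1024 * 1024  # 1 MB limit from the Security section


def encode_frame(message: dict) -> bytes:
    """Serialize message as JSON and prepend a 4-byte big-endian length."""
    payload = json.dumps(message).encode("utf-8")
    if len(payload) > MAX_MESSAGE_SIZE:
        raise ValueError("message exceeds maximum size")
    return struct.pack(">I", len(payload)) + payload


def decode_frame(data: bytes) -> dict:
    """Parse one complete frame; raise ValueError if it is malformed."""
    if len(data) < 4:
        raise ValueError("incomplete length header")
    (length,) = struct.unpack(">I", data[:4])
    if length > MAX_MESSAGE_SIZE:
        raise ValueError("declared length exceeds maximum size")
    body = data[4:4 + length]
    if len(body) != length:
        raise ValueError("incomplete payload")
    return json.loads(body.decode("utf-8"))


frame = encode_frame({"command": "status"})  # "command" is a hypothetical key
```

Checking the declared length before reading the payload lets a server reject oversized requests without buffering them.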

Commands

Command      Parameters              Description
───────────  ──────────────────────  ──────────────────────────────────────────
status       (none)                  List all processes with state, PID, uptime
start        name                    Start a process
stop         name                    Stop a process
restart      name, force (optional)  Restart a process
startgroup   name                    Start all processes in group
stopgroup    name                    Stop all processes in group
reload       (none)                  Reload configuration

Security

  • Socket created with umask(0o177) for atomic restrictive permissions

  • Unknown commands are rejected with UNKNOWN_COMMAND error

  • Invalid JSON is rejected with INVALID_JSON error

  • Maximum message size: 1 MB
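The first bullet relies on the fact that bind() creates the socket file with the permissions allowed by the current umask, so the file never exists with looser permissions. A minimal sketch, assuming a blocking socket and a hypothetical helper name:

```python
import os
import socket


def bind_private_socket(path: str) -> socket.socket:
    """Bind a Unix socket whose file is created with mode 0o600."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    # umask 0o177 masks out every bit except owner read/write.
    old_umask = os.umask(0o177)
    try:
        sock.bind(path)
    finally:
        # The umask is process-wide, so restore it right away.
        os.umask(old_umask)
    sock.listen()
    return sock
```

The alternative of calling os.chmod() after bind() would leave a short window in which other users could connect, which is what the umask approach avoids.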

EventBus (events.py)

Async publish/subscribe system for process state changes.

Design

  • Bounded asyncio.Queue (default 1000 events) prevents memory exhaustion

  • When queue is full, oldest events are dropped with a warning

  • Subscribers receive events asynchronously via await handler(event)

  • Event processing errors are logged but don’t crash the bus
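The drop-oldest behavior and the error isolation can be sketched as follows. BoundedBus, publish, and drain are hypothetical names for this example; the real EventBus presumably runs a long-lived consumer task rather than an explicit drain call.

```python
import asyncio
import logging

logger = logging.getLogger("eventbus-sketch")


class BoundedBus:
    """Hypothetical miniature of a bounded pub/sub bus."""

    def __init__(self, maxsize: int = 1000) -> None:
        self._queue = asyncio.Queue(maxsize=maxsize)
        self._handlers = []

    def subscribe(self, handler) -> None:
        self._handlers.append(handler)

    def publish(self, event) -> None:
        if self._queue.full():
            # Make room by discarding the oldest queued event.
            dropped = self._queue.get_nowait()
            logger.warning("queue full, dropping oldest event: %r", dropped)
        self._queue.put_nowait(event)

    async def drain(self) -> None:
        """Deliver every queued event to every subscriber."""
        while not self._queue.empty():
            event = self._queue.get_nowait()
            for handler in self._handlers:
                try:
                    await handler(event)
                except Exception:
                    # A failing subscriber must not stop delivery to the others.
                    logger.exception("handler failed for %r", event)
```

Dropping the oldest event keeps the most recent state changes, which is usually what a subscriber watching process state cares about.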

Event Types

Event                     Payload
────────────────────────  ───────────────────────────────────────
PROCESS_STATE_STARTING    processname, groupname, from_state, pid
PROCESS_STATE_RUNNING     processname, groupname, from_state, pid
PROCESS_STATE_BACKOFF     processname, groupname, from_state, pid
PROCESS_STATE_STOPPING    processname, groupname, from_state, pid
PROCESS_STATE_EXITED      processname, groupname, from_state, pid
PROCESS_STATE_STOPPED     processname, groupname, from_state, pid
PROCESS_STATE_FATAL       processname, groupname, from_state, pid
PROCESS_STATE_UNHEALTHY   processname, groupname, from_state, pid
HEALTHCHECK_PASSED        processname, message, pid
HEALTHCHECK_FAILED        processname, message, failures, pid

Health Checks (health.py)

Health checks run as separate asyncio.Task instances alongside each process.

Architecture

Process.supervise()
     │
     ├── spawn() ──▶ _start_health_checks() ──▶ asyncio.Task(_run_health_checks)
     │                                                │
     │                                                ├── sleep(start_period)
     │                                                ├── loop:
     │                                                │   ├── checker.check()
     │                                                │   ├── handle result
     │                                                │   └── sleep(interval)
     │                                                │
     └── kill() ───▶ _stop_health_checks() ──────────▶ task.cancel()
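The loop in the diagram can be sketched as a coroutine that is wrapped in a task on spawn and cancelled on kill. The function signatures here (a check coroutine function and an on_result callback) are assumptions made to keep the example self-contained.

```python
import asyncio


async def run_health_checks(check, on_result, start_period, interval):
    """Wait for start_period, then call check() every interval seconds."""
    await asyncio.sleep(start_period)
    while True:
        healthy = await check()
        on_result(healthy)
        await asyncio.sleep(interval)


async def stop_health_checks(task: asyncio.Task) -> None:
    """Cancel the health check task and wait until it has finished."""
    task.cancel()
    # return_exceptions=True turns the task's CancelledError into a result.
    await asyncio.gather(task, return_exceptions=True)
```

A caller would start the loop with asyncio.create_task(run_health_checks(...)) after a successful spawn and keep the task so it can be cancelled before the process is killed.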

Factory Pattern

create_health_checker() returns the appropriate checker based on config:

  • HealthCheckType.TCP → TCPHealthChecker

  • HealthCheckType.SCRIPT → ScriptHealthChecker

  • HealthCheckType.NONE → None
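A minimal sketch of the factory, assuming each checker exposes an async check() that returns a bool. The class and function names follow the list above, but the constructor parameters (host, port, command, timeout) and the enum values are guesses for illustration.

```python
import asyncio
from enum import Enum


class HealthCheckType(Enum):
    NONE = "none"
    TCP = "tcp"
    SCRIPT = "script"


class TCPHealthChecker:
    def __init__(self, host: str, port: int, timeout: float = 1.0) -> None:
        self.host, self.port, self.timeout = host, port, timeout

    async def check(self) -> bool:
        """Healthy if a TCP connection can be opened within the timeout."""
        try:
            _, writer = await asyncio.wait_for(
                asyncio.open_connection(self.host, self.port), self.timeout
            )
        except (OSError, asyncio.TimeoutError):
            return False
        writer.close()
        await writer.wait_closed()
        return True


class ScriptHealthChecker:
    def __init__(self, command: list, timeout: float = 5.0) -> None:
        self.command, self.timeout = command, timeout

    async def check(self) -> bool:
        """Healthy if the command exits with status 0 within the timeout."""
        proc = await asyncio.create_subprocess_exec(*self.command)
        try:
            return await asyncio.wait_for(proc.wait(), self.timeout) == 0
        except asyncio.TimeoutError:
            proc.kill()
            await proc.wait()
            return False


def create_health_checker(kind: HealthCheckType, **options):
    """Return a checker for kind, or None when health checks are disabled."""
    if kind is HealthCheckType.TCP:
        return TCPHealthChecker(**options)
    if kind is HealthCheckType.SCRIPT:
        return ScriptHealthChecker(**options)
    return None
```

Returning None for HealthCheckType.NONE lets the caller skip starting a health check task with a single "is None" test.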

Daemonization (main.py)

The _daemonize() function implements standard Unix double-fork:

  1. First fork — Parent exits, child continues

  2. setsid() — Creates new session, detaches from terminal

  3. Second fork — Prevents reacquisition of controlling terminal

  4. Redirect stdio — stdin/stdout/stderr → /dev/null
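The four steps map onto a handful of os calls. The sketch below is a generic double-fork, not a copy of _daemonize(); details such as changing the working directory or resetting the umask are left out because the list above does not mention them.

```python
import os


def daemonize() -> None:
    """Detach the calling process from its terminal using a double fork."""
    # 1. First fork: the parent exits, the child continues.
    if os.fork() > 0:
        os._exit(0)
    # 2. Become leader of a new session with no controlling terminal.
    os.setsid()
    # 3. Second fork: the session leader exits, so the survivor can
    #    never reacquire a controlling terminal.
    if os.fork() > 0:
        os._exit(0)
    # 4. Point stdin, stdout, and stderr at /dev/null.
    devnull = os.open(os.devnull, os.O_RDWR)
    for fd in (0, 1, 2):
        os.dup2(devnull, fd)
    if devnull > 2:
        os.close(devnull)
```

os._exit() is used instead of sys.exit() so the exiting parents do not run cleanup handlers or flush buffers that the daemon also inherited.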

Child Process Management

Process Groups

Each child process is started with start_new_session=True, which places it in a new session and therefore a new process group. os.killpg() then signals the whole group (the main process and any descendants that have not moved to a group of their own), not just the top-level process.
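A minimal sketch of the spawn and group-kill pair, using the blocking subprocess module for brevity. The helper names are hypothetical, and the supervisor itself may well use the asyncio subprocess API instead.

```python
import os
import signal
import subprocess


def spawn_in_new_session(argv: list) -> subprocess.Popen:
    """Start argv as the leader of a new session and process group."""
    return subprocess.Popen(argv, start_new_session=True)


def signal_group(proc: subprocess.Popen, sig: int = signal.SIGTERM) -> None:
    """Send sig to every process in the child's process group."""
    try:
        # The child is a group leader, so its group ID equals its PID.
        os.killpg(proc.pid, sig)
    except ProcessLookupError:
        pass  # the group has already exited
```

Signalling the group rather than the PID matters for commands that are shell wrappers, where the real workload runs as a grandchild of the supervisor.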

Orphan Prevention (Linux)

On Linux, prctl(PR_SET_PDEATHSIG, SIGKILL) is set in the preexec_fn to ensure child processes are killed if the parent dies unexpectedly.
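Python has no built-in wrapper for prctl(), so one way to make this call is through ctypes. The sketch below is an assumption about how that could be done, not the project's code; the factory name is hypothetical, and PR_SET_PDEATHSIG = 1 is the constant from <linux/prctl.h>.

```python
import ctypes
import signal
import sys

PR_SET_PDEATHSIG = 1  # from <linux/prctl.h>


def make_pdeathsig_preexec(sig: int = signal.SIGKILL):
    """Return a preexec_fn that asks the kernel to send sig on parent death."""
    if not sys.platform.startswith("linux"):
        return lambda: None  # prctl is Linux-only; do nothing elsewhere
    # Load libc in the parent so the child does no loading between fork and exec.
    libc = ctypes.CDLL(None, use_errno=True)

    def preexec() -> None:
        if libc.prctl(PR_SET_PDEATHSIG, ctypes.c_ulong(int(sig))) != 0:
            errno = ctypes.get_errno()
            raise OSError(errno, "prctl(PR_SET_PDEATHSIG) failed")

    return preexec
```

The kernel treats the thread that created the child as the "parent" for this purpose, so the setting is most predictable when children are spawned from a thread that lives as long as the supervisor.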

User Switching

When user is configured, the preexec_fn callback:

  1. Calls os.initgroups() to set supplementary groups

  2. Calls os.setgid() to set the group ID

  3. Calls os.setuid() to set the user ID

If any of these steps fails, the child exits with code 126 (EXIT_CODE_USER_SWITCH_FAILED), which the parent process interprets as a FATAL state.
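The three calls and the failure exit code can be sketched as a preexec_fn factory. The factory name and the decision to look up the user inside the child are assumptions for this example; EXIT_CODE_USER_SWITCH_FAILED is the constant named above.

```python
import os
import pwd
import subprocess
import sys

EXIT_CODE_USER_SWITCH_FAILED = 126


def make_user_switch_preexec(username: str):
    """Return a preexec_fn that drops privileges to username."""

    def preexec() -> None:
        try:
            entry = pwd.getpwnam(username)
            # Order matters: groups must be set while still privileged,
            # and setuid() comes last because it gives up that privilege.
            os.initgroups(username, entry.pw_gid)
            os.setgid(entry.pw_gid)
            os.setuid(entry.pw_uid)
        except (KeyError, OSError):
            # Unknown user or insufficient privilege: report via exit code.
            os._exit(EXIT_CODE_USER_SWITCH_FAILED)

    return preexec


# Usage with a user name that is assumed not to exist on the system.
result = subprocess.run(
    [sys.executable, "-c", "pass"],
    preexec_fn=make_user_switch_preexec("no-such-user-supervice-sketch"),
)
```

Because the child exits before exec, the parent sees an ordinary exit status of 126 rather than an exception, which is what allows it to map the failure to FATAL.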