Training with Callbacks Demo#

Introduction#

This example presents a stable training framework for a three-node hyper-causal model. It demonstrates a complete adaptive optimization cycle using depth-aware recursion, finite-difference gradients, and callback-driven monitoring. The goal is to illustrate how a causal system can be trained under dynamic conditions without relying on classical backpropagation.

Key mechanisms include:

DepthScheduler – controls recursion depth dynamically per epoch.
Adaptive learning parameters – shrink the learning rate and the finite-difference perturbation as recursion depth grows.
Finite-difference gradient estimation – replaces analytic gradients with central-difference estimates.
Gradient clipping – rescales the gradient when its norm exceeds a threshold, bounding each update.
Callback telemetry – logs metrics and parameters at every epoch.
Early stopping – halts training once the total loss has failed to improve for more than the allowed number of consecutive epochs (the patience).
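
The early-stopping rule in the training loop further down (patience = 1, stop once bad_epochs > patience) can be isolated as a small helper. This is a minimal sketch: the class EarlyStopper and its method names are hypothetical and do not appear in the demo, and the loss values are made up for illustration.

```python
class EarlyStopper:
    """Hypothetical helper mirroring the demo's patience-based rule."""

    def __init__(self, patience: int = 1):
        self.patience = patience
        self.best = None       # lowest total loss seen so far
        self.bad_epochs = 0    # consecutive epochs without improvement

    def update(self, total_loss: float) -> bool:
        """Record one epoch; return True when training should stop."""
        if self.best is None or total_loss < self.best:
            self.best = total_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs > self.patience


stopper = EarlyStopper(patience=1)
losses = [0.030, 0.029, 0.029, 0.031, 0.028]  # illustrative values only
stop_epoch = None
for epoch, loss in enumerate(losses):
    if stopper.update(loss):
        stop_epoch = epoch
        break

print("stopped at epoch", stop_epoch, "best loss", stopper.best)
```

With patience = 1 the rule tolerates one non-improving epoch and stops on the second consecutive one, so the strict comparison matters: an epoch that merely ties the best loss counts as a bad epoch.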

—

General Flow Structure#

The training loop coordinates recursive backends connected through causal dependencies. Each backend adjusts its transformation depth, projects future states, and contributes to a composite loss. Callbacks set the recursion depth at the start of each epoch and log telemetry at step and epoch boundaries.

DepthAwareBackend: applies recursive transformations \(S_t^{(d)} = \tanh(W^{(d)}S_{t-1} + b^{(d)})\).
Projection: expands each state into \(K\) future branches via a linear projector.
Loss aggregation: combines task, consistency, and coherence components.
Optimizer: updates parameters using finite-difference gradients and gradient clipping.
Scheduler: increases recursion depth gradually to control system complexity.
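
The internals of DepthScheduler are not shown on this page. As an illustration only, the sketch below assumes a linear interpolation from start to end that is rounded to the nearest integer and clamped at end. The function name linear_depth is hypothetical, and epochs=11 assumes EPOCHS = 12, which is inferred from the twelve epoch lines in the output section rather than stated in the code.

```python
def linear_depth(epoch: int, start: int = 1, end: int = 3, epochs: int = 11) -> int:
    """Assumed schedule: linear ramp from start to end, rounded and clamped."""
    frac = min(1.0, max(0.0, epoch / max(1, epochs)))
    return int(round(start + (end - start) * frac))


depths = [linear_depth(e) for e in range(12)]
print(depths)
```

Under these assumptions the schedule holds depth 1 for epochs 0 to 2, depth 2 for epochs 3 to 8, and depth 3 from epoch 9 onward, which agrees with the depth column in the logged output. The real scheduler may reach the same values by a different rule.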

—

How to Run#

# From the project root
python -m examples.ex_training_with_callbacks_demo

# Or directly
python examples/ex_training_with_callbacks_demo.py

—

Relevant Code Snippets#

Definition of the DepthAwareBackend class (recursive tanh transformation and projection)#

        # Excerpt begins inside DepthAwareBackend.set_params(params).
        if "w" in params:
            self.w = float(np.asarray(params["w"]).reshape(()))
        if "b" in params:
            self.b = float(np.asarray(params["b"]).reshape(()))

    def run(self, params: dict | None = None) -> np.ndarray:
        """
        Apply depth-recursive tanh transformation.

        Parameters
        ----------
        params : dict or None, optional
            Optional parameter override for this call.

        Returns
        -------
        np.ndarray
            Validated current state vector.
        """
        if params:
            self.set_params(params)
        x = self._require_input().astype(float)
        s = x
        for _ in range(max(1, int(self.depth))):
            s = np.tanh(self.w * s + self.b)
        return self._validate_state(s)

    def project_future(self, s_t: np.ndarray, branches: int = 2) -> np.ndarray:
        """
        Project future states around ``s_t`` using a depth-adjusted span.

        Parameters
        ----------
        s_t : np.ndarray
            Current state vector.
        branches : int, optional
            Number of future branches (K), by default 2.

        Returns
        -------
        np.ndarray
            Future branch matrix with shape ``(K, D)``.
        """
        s = self._validate_state(s_t)
        k = max(2, int(branches))
        # span reduced with depth, but with a high floor (0.10)
        span = max(self._span_floor, self._base_span / (1.0 + 0.3 * (self.depth - 1)))
        self._projector = LinearProjector(weight=1.0, bias=0.0, span=span)
        fut = self._projector.project(s, branches=k)
        return self._validate_branches(fut)


# ------------------------ Construction utils ------------------------
def build_model_chain(D=3):
    """
    Build a three-node HCModel with depth-aware backends.

    Parameters
    ----------
    D : int, optional
        State dimensionality, by default 3.

    Returns
    -------
    tuple
        (model, nodes, backends)
    """
    cfg = BackendConfig(output_dim=D, seed=11)
    b0 = DepthAwareBackend(cfg, w=0.90, b=0.03, proj_span=0.22)
    b1 = DepthAwareBackend(cfg, w=0.97, b=0.02, proj_span=0.25)
    b2 = DepthAwareBackend(cfg, w=1.05, b=0.00, proj_span=0.30)
    pol = MeanPolicy()
    n0, n1, n2 = HCNode(b0, pol), HCNode(b1, pol), HCNode(b2, pol)
    model = HCModel([n0, n1, n2])
    return model, [n0, n1, n2], [b0, b1, b2]


def params_pack(backends):
    """
    Flatten parameters from all backends into a single dictionary.

    Parameters
    ----------
    backends : list
        List of backend instances.

    Returns
    -------
    dict
        Flattened parameter dictionary suitable for the optimizer.
    """
    packed = {}
    for i, be in enumerate(backends):
        for k, v in be.get_params().items():
            packed[f"b{i}_{k}"] = np.array(v, dtype=float)
    return packed


def params_unpack(backends, packed):
    """
    Distribute flat parameters back to their corresponding backends.

    Parameters
    ----------
    backends : list
        Backend instances to update.
    packed : dict
        Flattened parameter dictionary.
    """
    for i, be in enumerate(backends):
        sub = {}
        for k in ("w", "b"):
            key = f"b{i}_{k}"
            if key in packed:
                sub[k] = packed[key]
        be.set_params(sub)


# ------------------- CENTRAL finite-difference grads -------------------
def central_diff_grads(loss_fn, params, apply_params_fn, eps: float):
    """
    Central finite-difference gradients (more stable than forward diff).

    Gradient ≈ (f(x + eps) - f(x - eps)) / (2 * eps)

    Parameters
    ----------
    loss_fn : callable
        Function that recomputes the full loss given current parameters.
    params : dict
        Parameter dictionary (values are scalar arrays).
    apply_params_fn : callable
        Function to apply a parameter dictionary to the model.
    eps : float

Main function stable_training_demo() (training loop with callbacks and adaptive learning)#

    BASE_EPS = 1e-3
    LOG_PATH = Path("runs/telemetry_stable.jsonl")

    # Model
    model, nodes, backends = build_model_chain(D=D)

    # Data
    t = np.arange(T, dtype=float)
    x_seq = np.stack([
        0.30 * np.sin(0.35 * t + 0.00),
        0.20 * np.sin(0.35 * t + 0.70),
        0.10 * np.cos(0.35 * t + 0.30),
    ], axis=1)
    target_seq = np.zeros((T, D), dtype=float)

    # Losses
    loss_task = MSELoss()
    loss_cons = ConsistencyLoss(alpha=0.8, beta=1.2)
    loss_coh = CoherenceLoss(mode="variance")

    # Optimizer + telemetry
    params = params_pack(backends)
    opt = make_gradient_descent(lr=BASE_LR)  # base LR; recalibrated by depth each epoch
    state = opt.initialize(params)

    callbacks = CallbackList([
        TelemetryLogger(path=LOG_PATH, flush_interval=8),
        MemoryLogger(),
    ])
    # Real DepthScheduler (1 → 3 across EPOCHS-1; clamped at 3)
    depth_cb = DepthScheduler(target_attr="depth", start=1, end=3, epochs=EPOCHS - 1)

    def apply_params_fn(packed):
        params_unpack(backends, packed)

    def forward_and_losses():
        """
        Compute forward pass over the sequence and all loss terms.

        Returns
        -------
        tuple
            (total_loss, details_dict, last_state_sequence)
        """
        total_task = total_cons = total_coh = 0.0
        s_tm1 = None
        y_last = []
        for step in range(T):
            callbacks.on_step_begin(step, {"step": int(step)})
            s_t, s_hat, infos = model.forward_chain(x_seq[step], s_tm1=s_tm1, branches=K)
            y_last.append(s_t)
            total_task += loss_task(s_t, target_seq[step])
            if s_tm1 is not None:
                total_cons += loss_cons(s_tm1, s_t, s_hat)
            coh_vals = []
            for info in infos:
                br = info.get("branches", None)
                if isinstance(br, np.ndarray) and br.ndim == 2:
                    coh_vals.append(loss_coh(br))
            if coh_vals:
                total_coh += float(np.mean(coh_vals))
            s_tm1 = s_t
            callbacks.on_step_end(step, {"step": int(step)})
        task = total_task / T
        cons = total_cons / max(1, T - 1)
        coh = total_coh / T
        total = task + 0.5 * cons + 0.3 * coh
        return total, {"task": float(task), "cons": float(cons), "coh": float(coh), "total": float(total)}, np.asarray(y_last)

    best = None
    patience = 1
    bad_epochs = 0

    for epoch in range(EPOCHS):
        # 1) Adjust depth (real)
        for be in backends:
            depth_cb.on_epoch_begin(epoch, {"backend": be})
        depth_mean = float(np.mean([be.depth for be in backends]))

        # 2) Depth-adaptive LR and EPS
        lr_eff = BASE_LR / (depth_mean ** 2)
        eps_eff = BASE_EPS / (1.0 + 0.5 * (depth_mean - 1.0))
        opt = make_gradient_descent(lr=lr_eff)  # recreate with effective LR

        # 3) JSON-safe telemetry
        callbacks.on_epoch_begin(epoch, {"epoch": int(epoch), "depth": float(depth_mean), "lr_eff": float(lr_eff), "eps_eff": float(eps_eff)})

        # 4) Forward before update
        total0, det0, _ = forward_and_losses()

        # 5) Central gradients + clipping + step
        def loss_wrapper():
            l, _, _ = forward_and_losses()
            return l

        grads = central_diff_grads(loss_wrapper, params, apply_params_fn, eps=eps_eff)
        grads = clip_grads(grads, max_norm=5e-2)
        params, state = opt.step(params, grads, state)
        apply_params_fn(params)

        # 6) Forward after update
        total1, det1, _ = forward_and_losses()

        # 7) Telemetry end
        callbacks.on_epoch_end(epoch, {
            "epoch": int(epoch),
            "loss_before": det0,
            "loss_after": det1,
        })

        # 8) Early stopping
        if best is None or det1["total"] < best["total"]:
            best = {"epoch": int(epoch), **det1}
            bad_epochs = 0
        else:
            bad_epochs += 1

        depths = [int(be.depth) for be in backends]
        print(f"[Epoch {epoch}] total_before={det0['total']:.6f} total_after={det1['total']:.6f} depth={depths} lr_eff={lr_eff:.3e} eps_eff={eps_eff:.3e}")

        if bad_epochs > patience:
            print(f"Early stopping activated at epoch {epoch}. Best total={best['total']:.6f} (epoch {best['epoch']}).")
            break

    # Final metrics using the last forward
    _, _, y_pred_seq = forward_and_losses()
    smape_val = smape_safe(target_seq[:, 0], y_pred_seq[:, 0])
    rmse_val = rmse(target_seq[:, 0], y_pred_seq[:, 0])
    over_val = overshoot(target_seq[:, 0], y_pred_seq[:, 0])
    rob_val = robustness(target_seq[:, 0], y_pred_seq[:, 0])

    print("\n=== Final metrics (channel 0) ===")
    print(f"SMAPE:      {smape_val:.6f} %")
    print(f"RMSE:       {rmse_val:.6f}")
    print(f"Overshoot:  {over_val:.6f}")
    print(f"Robustness: {rob_val:.6f}")
    print("\nBest epoch snapshot:", best)

    if LOG_PATH.exists():
        print(f"\nTelemetry JSONL → {LOG_PATH.resolve()}")

    return {"best": best, "metrics": {"smape": smape_val, "rmse": rmse_val, "overshoot": over_val, "robustness": rob_val}}


# ---------------------------- Entry point ----------------------------
if __name__ == "__main__":
    out = stable_training_demo()
    print("\nSummary:")
    print(json.dumps(out, indent=2))

—

Functional Explanation#

The model trains through controlled recursion and adaptive numerical optimization. Each component has a defined mathematical role in stabilizing and guiding the learning process.

Recursive Depth Evolution

Each backend performs a recursive update of the internal state:

\[S_t^{(d)} = \tanh(W^{(d)} S_{t-1} + b^{(d)})\]

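The recursion depth \(d\) sets how many times the tanh map is applied in each forward call, not per epoch: in the code above, run() starts from the current input and loops \(\max(1, d)\) times with a scalar weight and bias. Increasing \(d\) is intended to let the model capture higher-order temporal dependencies.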
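
The loop in run() can be reproduced without the backend class. The sketch below uses plain Python lists instead of NumPy arrays; the function name recurse_tanh and the input vector are illustrative, while w=0.90 and b=0.03 are the values given to the first backend in build_model_chain.

```python
import math


def recurse_tanh(x, w, b, depth):
    """Apply s <- tanh(w * s + b) elementwise, max(1, depth) times."""
    s = list(x)
    for _ in range(max(1, int(depth))):
        s = [math.tanh(w * v + b) for v in s]
    return s


x = [0.30, 0.20, 0.10]  # illustrative input
for d in (1, 2, 3):
    state = recurse_tanh(x, w=0.90, b=0.03, depth=d)
    print(d, [round(v, 4) for v in state])
```

Because tanh maps into (-1, 1), every additional level keeps the state bounded regardless of the input scale.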

Future Projection

Each current state generates \(K\) predicted future states:

\[S_{t+1}^{(k)} = S_t + \Delta_d \cdot \mathcal{P}_k(S_t)\]

where \(\mathcal{P}_k\) is a projection operator and \(\Delta_d\) scales with depth. This projection step introduces local temporal uncertainty and allows causal branching.
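
The depth-adjusted span in project_future() follows directly from the code above: the base span is divided by \(1 + 0.3(d - 1)\) and floored at 0.10. How LinearProjector places the branches is not shown, so project_branches below is a hypothetical stand-in that spreads \(K\) evenly spaced offsets across \([-\text{span}, +\text{span}]\) around the current state.

```python
def depth_span(base_span: float, depth: int, span_floor: float = 0.10) -> float:
    """Span formula from project_future(): shrinks with depth, never below the floor."""
    return max(span_floor, base_span / (1.0 + 0.3 * (depth - 1)))


def project_branches(s_t, k: int, span: float):
    """Hypothetical projector: K copies of s_t shifted by evenly spaced offsets."""
    k = max(2, int(k))
    offsets = [-span + 2.0 * span * i / (k - 1) for i in range(k)]
    return [[v + o for v in s_t] for o in offsets]


spans = [depth_span(0.22, d) for d in (1, 2, 3)]
branches = project_branches([0.1, -0.2, 0.05], k=3, span=spans[0])
print([round(s, 4) for s in spans])
print(branches)
```

The result has one row per branch and one column per state dimension, matching the (K, D) shape documented for project_future().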

Loss Structure

The total loss integrates three objectives:

\[\mathcal{L}_{total} = \mathcal{L}_{task} + 0.5\,\mathcal{L}_{consistency} + 0.3\,\mathcal{L}_{coherence}\]

- Task loss \(\mathcal{L}_{task} = \frac{1}{T}\sum_t \|S_t - Y_t\|^2\) minimizes prediction error.
- Consistency loss maintains temporal smoothness: \(\mathcal{L}_{consistency} = \alpha\|S_t - S_{t-1}\|^2 + \beta\|S_t - \hat{S}_{t+1}\|^2\).
- Coherence loss enforces similarity among projected branches: \(\mathcal{L}_{coherence} = \mathrm{Var}(S_{t+1}^{(k)})\).
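
As a worked check of the weighting, the component values reported in the best-epoch snapshot of the output section can be recombined by hand. The function name total_loss is illustrative; the weights 0.5 and 0.3 are the ones used in forward_and_losses().

```python
def total_loss(task: float, cons: float, coh: float,
               w_cons: float = 0.5, w_coh: float = 0.3) -> float:
    """Weighted sum used by the demo: task + 0.5 * cons + 0.3 * coh."""
    return task + w_cons * cons + w_coh * coh


# Component losses copied from the epoch 11 snapshot in the output section.
best = {
    "task": 0.018001068416014097,
    "cons": 0.0010439829584105837,
    "coh": 0.012454292234950454,
}
total = total_loss(**best)
print(f"{total:.6f}")  # 0.022259, the total_after value logged for epoch 11
```

The task term contributes about 81% of the total here, so the consistency and coherence terms act as mild regularizers at these weights.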

Gradient Estimation

Finite differences are used to estimate local gradients:

\[g_i = \frac{\mathcal{L}(\theta_i + \epsilon) - \mathcal{L}(\theta_i - \epsilon)}{2\epsilon}\]

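This method needs no symbolic or automatic differentiation, only two loss evaluations per parameter, so it also applies where parts of the pipeline are non-smooth or have no convenient closed-form derivative.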
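
The body of central_diff_grads is cut off in the excerpt above, so the following is one plausible implementation that matches its signature, not the demo's actual code. It uses plain floats instead of scalar arrays, and the toy model and loss are stand-ins chosen so the exact gradient is known.

```python
def central_diff_grads_sketch(loss_fn, params, apply_params_fn, eps):
    """Estimate dL/dp for each parameter as (L(p + eps) - L(p - eps)) / (2 * eps)."""
    grads = {}
    for name, value in params.items():
        trial = dict(params)
        trial[name] = value + eps
        apply_params_fn(trial)
        loss_plus = loss_fn()
        trial[name] = value - eps
        apply_params_fn(trial)
        loss_minus = loss_fn()
        grads[name] = (loss_plus - loss_minus) / (2.0 * eps)
    apply_params_fn(params)  # restore the unperturbed parameters
    return grads


# Toy stand-in for the model: L(w, b) = (w - 1)^2 + 3b, so dL/dw = 2(w - 1), dL/db = 3.
toy_model = {}


def apply_params(p):
    toy_model.update(p)


def loss():
    return (toy_model["w"] - 1.0) ** 2 + 3.0 * toy_model["b"]


grads = central_diff_grads_sketch(loss, {"w": 0.0, "b": 0.5}, apply_params, eps=1e-3)
print(grads)
```

Each parameter costs two full forward passes, which is affordable for a handful of scalar parameters but scales linearly with the parameter count.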

Adaptive Learning Parameters

The effective parameters adjust with recursion depth:

\[\eta_{\text{eff}} = \frac{\eta_0}{d^2}, \qquad \epsilon_{\text{eff}} = \frac{\epsilon_0}{1 + 0.5(d - 1)}\]

These relations reduce step size and perturbation magnitude as depth increases, improving convergence stability for deeper causal recursions.
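
The two scaling rules can be evaluated for the three depths the demo visits. BASE_EPS = 1e-3 is taken from the code above; BASE_LR = 5e-2 is an assumption inferred from the lr_eff value logged at depth 1, because its definition falls outside the excerpt.

```python
BASE_LR = 5e-2   # assumed: inferred from lr_eff at depth 1 in the logged output
BASE_EPS = 1e-3  # from the demo code


def effective_params(depth_mean: float):
    """Return (lr_eff, eps_eff) for a given mean recursion depth."""
    lr_eff = BASE_LR / (depth_mean ** 2)
    eps_eff = BASE_EPS / (1.0 + 0.5 * (depth_mean - 1.0))
    return lr_eff, eps_eff


table = {d: effective_params(float(d)) for d in (1, 2, 3)}
for d, (lr, eps) in table.items():
    print(f"depth={d} lr_eff={lr:.3e} eps_eff={eps:.3e}")
```

The learning rate falls quadratically while the perturbation falls only linearly, so going from depth 1 to depth 3 cuts the step size by a factor of nine but the finite-difference step by a factor of two.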

Gradient Clipping

All gradient vectors are constrained within a trust region:

\[g_i' = g_i \cdot \min\left(1, \frac{\tau}{\|g\|_2}\right)\]

where \(\tau\) is the clipping threshold. This ensures controlled parameter updates.
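
The demo calls clip_grads(grads, max_norm=5e-2), whose body is not shown. The sketch below implements the formula above as global-norm clipping over a dictionary of scalar gradients; the function name clip_by_global_norm and the gradient values are illustrative assumptions.

```python
import math


def clip_by_global_norm(grads: dict, max_norm: float) -> dict:
    """Scale all gradients by min(1, max_norm / ||g||_2); leave small gradients untouched."""
    norm = math.sqrt(sum(g * g for g in grads.values()))
    scale = 1.0 if norm == 0.0 else min(1.0, max_norm / norm)
    return {name: g * scale for name, g in grads.items()}


raw = {"b0_w": 0.3, "b0_b": -0.4}  # illustrative gradient with L2 norm 0.5
clipped = clip_by_global_norm(raw, max_norm=5e-2)
print(clipped)
```

Scaling every component by the same factor preserves the gradient direction and changes only the step length, which is what distinguishes norm clipping from clamping each component independently.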

Callback Coordination

- DepthScheduler: adjusts recursion depth at specific epochs.
- TelemetryLogger: records per-epoch statistics to JSONL.
- MemoryLogger: stores metrics in memory for later visualization.

These components synchronize the optimization and provide complete training traceability.
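
The hook names on_epoch_begin and on_epoch_end come from the demo code; the classes behind them are not shown. The sketch below is a hypothetical, minimal version of the pattern: a base class with no-op hooks, an in-memory logger, and a dispatcher that forwards each event to every registered callback. None of these class names belong to the library.

```python
class Callback:
    """Hypothetical base class: every hook defaults to a no-op."""

    def on_epoch_begin(self, epoch, logs):
        pass

    def on_epoch_end(self, epoch, logs):
        pass


class InMemoryLog(Callback):
    """Keeps a copy of the logs passed at the end of each epoch."""

    def __init__(self):
        self.records = []

    def on_epoch_end(self, epoch, logs):
        self.records.append(dict(logs))


class Dispatcher:
    """Forwards each event to all callbacks in registration order."""

    def __init__(self, callbacks):
        self.callbacks = list(callbacks)

    def on_epoch_begin(self, epoch, logs):
        for cb in self.callbacks:
            cb.on_epoch_begin(epoch, logs)

    def on_epoch_end(self, epoch, logs):
        for cb in self.callbacks:
            cb.on_epoch_end(epoch, logs)


log = InMemoryLog()
callbacks = Dispatcher([log])
for epoch, loss in enumerate([0.031, 0.030, 0.029]):  # illustrative losses
    callbacks.on_epoch_begin(epoch, {"epoch": epoch})
    callbacks.on_epoch_end(epoch, {"epoch": epoch, "total": loss})

print(log.records)
```

Keeping the training loop unaware of what each callback does is the point of the pattern: adding a JSONL writer or a scheduler means registering one more object, not editing the loop.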

—

Exact Output#

[Epoch 0] total_before=0.031141 total_after=0.030663 depth=[1, 1, 1] lr_eff=5.000e-02 eps_eff=1.000e-03
[Epoch 1] total_before=0.030663 total_after=0.030217 depth=[1, 1, 1] lr_eff=5.000e-02 eps_eff=1.000e-03
[Epoch 2] total_before=0.030217 total_after=0.029800 depth=[1, 1, 1] lr_eff=5.000e-02 eps_eff=1.000e-03
[Epoch 3] total_before=0.025941 total_after=0.025629 depth=[2, 2, 2] lr_eff=1.250e-02 eps_eff=6.667e-04
[Epoch 4] total_before=0.025629 total_after=0.025326 depth=[2, 2, 2] lr_eff=1.250e-02 eps_eff=6.667e-04
[Epoch 5] total_before=0.025326 total_after=0.025030 depth=[2, 2, 2] lr_eff=1.250e-02 eps_eff=6.667e-04
[Epoch 6] total_before=0.025030 total_after=0.024742 depth=[2, 2, 2] lr_eff=1.250e-02 eps_eff=6.667e-04
[Epoch 7] total_before=0.024742 total_after=0.024462 depth=[2, 2, 2] lr_eff=1.250e-02 eps_eff=6.667e-04
[Epoch 8] total_before=0.024462 total_after=0.024190 depth=[2, 2, 2] lr_eff=1.250e-02 eps_eff=6.667e-04
[Epoch 9] total_before=0.022960 total_after=0.022723 depth=[3, 3, 3] lr_eff=5.556e-03 eps_eff=5.000e-04
[Epoch 10] total_before=0.022723 total_after=0.022489 depth=[3, 3, 3] lr_eff=5.556e-03 eps_eff=5.000e-04
[Epoch 11] total_before=0.022489 total_after=0.022259 depth=[3, 3, 3] lr_eff=5.556e-03 eps_eff=5.000e-04

=== Final metrics (channel 0) ===
SMAPE:      100.000000 %
RMSE:       0.165016
Overshoot:  0.000000
Robustness: 0.973491

Best epoch snapshot: {'epoch': 11, 'task': 0.018001068416014097, 'cons': 0.0010439829584105837, 'coh': 0.012454292234950454, 'total': 0.022259347565704524}

Telemetry JSONL → runs/telemetry_stable.jsonl

Summary:
{
  "best": {
    "epoch": 11,
    "task": 0.018001068416014097,
    "cons": 0.0010439829584105837,
    "coh": 0.012454292234950454,
    "total": 0.022259347565704524
  },
  "metrics": {
    "smape": 100.0,
    "rmse": 0.1650163483217719,
    "overshoot": 0.0,
    "robustness": 0.9734914432630335
  }
}
