Advanced Training with Callbacks#

Introduction#

This example presents an advanced configuration of the hyper-causal training framework. It extends the previous demo by incorporating external depth scheduling, freeze epochs, and adaptive gradient clipping controlled by recursion depth. The design emphasizes deterministic, depth-aware optimization behavior with stable convergence properties.

Key components include:

- External DepthScheduler – adjusts recursion depth independently of the logger.
- Freeze epochs – disable updates after depth transitions to stabilize new configurations.
- Adaptive gradient clipping – scales gradient bounds dynamically with mean recursion depth.
- Epoch-dependent learning decay – combines depth and time scaling for step-size control.
- Parameter checkpointing – stores best-performing parameters in JSON format.
- Robust metric evaluation – computes SMAPE, RMSE, Overshoot, and Robustness.
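
The parameter checkpoint can be read back with the standard library alone. The sketch below assumes the flat key naming produced by params_pack (b0_w, b0_b, and so on) and that each backend exposes only w and b; the helper name split_by_backend and the sample values (the initial values from build_model_chain, not trained ones) are illustrative, not part of the framework.

```python
import json
import tempfile
from pathlib import Path


def split_by_backend(flat):
    """Group flat keys such as 'b0_w' into {0: {'w': ...}, ...} (hypothetical helper)."""
    grouped = {}
    for key, value in flat.items():
        prefix, name = key.split("_", 1)      # "b0_w" -> ("b0", "w")
        index = int(prefix[1:])               # "b0" -> 0
        grouped.setdefault(index, {})[name] = float(value)
    return grouped


# Illustrative values only: the initial w/b of the three backends.
flat = {
    "b0_w": 0.90, "b0_b": 0.03,
    "b1_w": 0.97, "b1_b": 0.02,
    "b2_w": 1.05, "b2_b": 0.00,
}

# Round-trip through a temporary file instead of runs/best_params.json.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "best_params.json"
    path.write_text(json.dumps(flat, indent=2))
    restored = split_by_backend(json.loads(path.read_text()))

print(restored)
```

Each inner dictionary has the shape that params_unpack passes to set_params, so restoring a run should amount to calling set_params on the matching backend.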

—

General Flow Structure#

The training loop performs controlled optimization with three main backends, each using recursive transformations and linear projections. Depth increases are scheduled externally, triggering freeze epochs where parameters remain static, allowing gradient statistics to stabilize before continuing updates.

- DepthAwareBackend: recursive causal unit using \(S_t^{(d)} = \tanh(W^{(d)}S_{t-1} + b^{(d)})\).
- Projection step: generates K possible futures with span scaled by depth.
- Central finite-difference gradients: provide stable numerical estimation.
- Adaptive learning control: combines recursion-based scaling with epoch decay.
- Clipping: applied adaptively based on the mean depth to limit the L2 norm of the gradient.

—

How to Run#

# From the project root
python -m examples.ex_training_with_callbacks_advanced

# Or directly
python examples/ex_training_with_callbacks_advanced.py

—

Relevant Code Snippets#

Definition of the DepthAwareBackend class and gradient control utilities.#

        np.ndarray
            Current state vector.
        """
        if params:
            self.set_params(params)
        x = self._require_input().astype(float)
        s = x
        for _ in range(max(1, int(self.depth))):
            s = np.tanh(self.w * s + self.b)
        return self._validate_state(s)

    def project_future(self, s_t: np.ndarray, branches: int = 2) -> np.ndarray:
        """
        Generate future states with depth-dependent span.

        Returns
        -------
        np.ndarray
            Projected future states of shape (K, D).
        """
        s = self._validate_state(s_t)
        k = max(2, int(branches))
        span = max(self._span_floor, self._base_span / (1.0 + 0.3 * (self.depth - 1)))
        self._projector = LinearProjector(weight=1.0, bias=0.0, span=span)
        fut = self._projector.project(s, branches=k)
        return self._validate_branches(fut)


# ----------------------------------------------------------------------
# Model-building utilities
# ----------------------------------------------------------------------
def build_model_chain(D=3):
    """Build a three-node model chain with depth-aware backends."""
    cfg = BackendConfig(output_dim=D, seed=11)
    b0 = DepthAwareBackend(cfg, w=0.90, b=0.03, proj_span=0.22)
    b1 = DepthAwareBackend(cfg, w=0.97, b=0.02, proj_span=0.25)
    b2 = DepthAwareBackend(cfg, w=1.05, b=0.00, proj_span=0.30)
    pol = MeanPolicy()
    n0, n1, n2 = HCNode(b0, pol), HCNode(b1, pol), HCNode(b2, pol)
    model = HCModel([n0, n1, n2])
    return model, [n0, n1, n2], [b0, b1, b2]


def params_pack(backends):
    """Flatten parameters of all backends into a single dictionary."""
    packed = {}
    for i, be in enumerate(backends):
        for k, v in be.get_params().items():
            packed[f"b{i}_{k}"] = np.array(v, dtype=float)
    return packed


def params_unpack(backends, packed):
    """Distribute flat parameters back to each backend."""
    for i, be in enumerate(backends):
        sub = {}
        for k in ("w", "b"):
            key = f"b{i}_{k}"
            if key in packed:
                sub[k] = packed[key]
        be.set_params(sub)


# ----------------------------------------------------------------------
# Finite-difference gradients (central)
# ----------------------------------------------------------------------
def central_diff_grads(loss_fn, params, apply_params_fn, eps: float):
    """
    Compute central finite-difference gradients for better stability.

    Gradient ~= (f(x + eps) - f(x - eps)) / (2 * eps)

    """
    grads = {}
    base = {k: v.copy() for k, v in params.items()}

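    def setp(p):
        apply_params_fn(p)

    # Evaluate once at the unperturbed parameters; the value itself is discarded.
    setp(base)
    _ = loss_fn()

    for k, v in base.items():
        # Perturb one parameter at a time, leaving the others at their base values.
        vp = {kk: vv.copy() for kk, vv in base.items()}
        vm = {kk: vv.copy() for kk, vv in base.items()}
        vp[k] = v + eps
        vm[k] = v - eps
        setp(vp)
        lp = loss_fn()  # loss at theta_k + eps
        setp(vm)
        lm = loss_fn()  # loss at theta_k - eps
        g = (lp - lm) / (2.0 * eps)
        grads[k] = np.array([g], dtype=float)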

    setp(base)
    return grads


def grad_norm(grads: dict) -> float:
    """Compute L2 norm of gradients."""
    sq = 0.0
    for g in grads.values():
        val = float(np.asarray(g).reshape(()))
        sq += val * val
    return float(np.sqrt(sq))


def clip_grads_adaptive(grads: dict, depth_mean: float) -> tuple[dict, float, float]:
    """
    Adaptive gradient clipping based on mean depth.

    Returns
    -------
    tuple
        (clipped_gradients, norm_before, norm_after)
    """
    if depth_mean < 1.5:
        max_norm = 5e-2
    elif depth_mean < 2.5:
        max_norm = 7.5e-2
    else:
        max_norm = 1e-1

    n_before = grad_norm(grads)
    if n_before <= max_norm or n_before == 0.0:
        return grads, n_before, n_before

    scale = max_norm / n_before
    clipped = {k: np.array([float(np.asarray(v).reshape(())) * scale], dtype=float) for k, v in grads.items()}
    n_after = grad_norm(clipped)
    return clipped, n_before, n_after


# ----------------------------------------------------------------------
# Training procedure
# ----------------------------------------------------------------------
def advanced_training_with_freeze():
    """
    Perform advanced hyper-causal training with depth freeze and adaptive clipping.

    Returns
    -------
    dict
        Dictionary containing best loss snapshot and final metrics.
    """
    D, K, T = 3, 5, 48
    EPOCHS = 16
    BASE_LR = 5e-2
    BASE_EPS = 1e-3
    LOG_PATH = Path("runs/telemetry_stable.jsonl")
    SAVE_BEST = True
    BEST_PATH = Path("runs/best_params.json")

    model, nodes, backends = build_model_chain(D=D)

    # Data
    t = np.arange(T, dtype=float)
    x_seq = np.stack([
        0.30 * np.sin(0.35 * t + 0.00),
        0.20 * np.sin(0.35 * t + 0.70),
        0.10 * np.cos(0.35 * t + 0.30),
    ], axis=1)
    target_seq = np.zeros((T, D), dtype=float)

    # Losses
    loss_task = MSELoss()
    loss_cons = ConsistencyLoss(alpha=0.8, beta=1.2)
    loss_coh = CoherenceLoss(mode="variance")

    # Optimizer and callbacks
    params = params_pack(backends)
    opt = make_gradient_descent(lr=BASE_LR)
    state = opt.initialize(params)

    callbacks = CallbackList([
        TelemetryLogger(path=LOG_PATH, flush_interval=8),
        MemoryLogger(),
    ])
    depth_cb = DepthScheduler(target_attr="depth", start=1, end=3, epochs=EPOCHS - 1)

    def apply_params_fn(packed): params_unpack(backends, packed)

    def forward_and_losses():
        total_task = total_cons = total_coh = 0.0
        s_tm1 = None

Main training procedure advanced_training_with_freeze() implementing freeze logic and adaptive clipping.#

                "freeze": True,
            })
            if best is None or det0["total"] < best["total"]:
                best = {"epoch": int(epoch), **det0}
                best_params = {k: float(np.asarray(v).reshape(())) for k, v in params.items()}
                bad_epochs = 0
            else:
                bad_epochs += 1
            print(f"[Epoch {epoch}] FREEZE depth={depths} total={det0['total']:.6f} lr_eff={lr_eff:.3e} eps_eff={eps_eff:.3e}")
            if bad_epochs > patience:
                print(f"Early stopping activated at epoch {epoch} (freeze). Best total={best['total']:.6f} (epoch {best['epoch']}).")
                break
            continue

        def loss_wrapper():
            l, _, _ = forward_and_losses()
            return l

        grads = central_diff_grads(loss_wrapper, params, apply_params_fn, eps=eps_eff)
        grads, gnorm_before, gnorm_after = clip_grads_adaptive(grads, depth_mean)
        params, state = opt.step(params, grads, state)
        apply_params_fn(params)

        total1, det1, _ = forward_and_losses()

        callbacks.on_epoch_end(epoch, {
            "epoch": int(epoch),
            "loss_before": det0,
            "loss_after": det1,
            "grad_norm_before": float(gnorm_before),
            "grad_norm_after": float(gnorm_after),
            "freeze": False,
        })

        if best is None or det1["total"] < best["total"]:
            best = {"epoch": int(epoch), **det1}
            best_params = {k: float(np.asarray(v).reshape(())) for k, v in params.items()}
            bad_epochs = 0
        else:
            bad_epochs += 1

        print(
            f"[Epoch {epoch}] total_before={det0['total']:.6f} total_after={det1['total']:.6f} "
            f"depth={depths} lr_eff={lr_eff:.3e} eps_eff={eps_eff:.3e} "
            f"||g||_before={gnorm_before:.3e} ||g||_after={gnorm_after:.3e}"
        )

        if bad_epochs > patience:
            print(f"Early stopping activated at epoch {epoch}. Best total={best['total']:.6f} (epoch {best['epoch']}).")
            break

    if SAVE_BEST and best_params is not None:
        BEST_PATH.parent.mkdir(parents=True, exist_ok=True)
        with BEST_PATH.open("w") as f:
            json.dump(best_params, f, indent=2)
        print(f"\nBest parameters saved at: {BEST_PATH.resolve()}")

    _, _, y_pred_seq = forward_and_losses()
    smape_val = smape_safe(target_seq[:, 0], y_pred_seq[:, 0])
    rmse_val = rmse(target_seq[:, 0], y_pred_seq[:, 0])
    over_val = overshoot(target_seq[:, 0], y_pred_seq[:, 0])
    rob_val = robustness(target_seq[:, 0], y_pred_seq[:, 0])

    print("\n=== Final metrics (channel 0) ===")
    print(f"SMAPE:      {smape_val:.6f} %")
    print(f"RMSE:       {rmse_val:.6f}")
    print(f"Overshoot:  {over_val:.6f}")
    print(f"Robustness: {rob_val:.6f}")
    print("\nBest epoch snapshot:", {
        "epoch": int(best["epoch"]),
        "task": float(best["task"]),
        "cons": float(best["cons"]),
        "coh":  float(best["coh"]),
        "total": float(best["total"]),
    })

    if LOG_PATH.exists():
        print(f"\nTelemetry JSONL → {LOG_PATH.resolve()}")

    return {
        "best": {
            "epoch": int(best["epoch"]),
            "task": float(best["task"]),
            "cons": float(best["cons"]),
            "coh":  float(best["coh"]),
            "total": float(best["total"]),
        },
        "metrics": {
            "smape": smape_val,
            "rmse": rmse_val,
            "overshoot": over_val,
            "robustness": rob_val
        }
    }


# ----------------------------------------------------------------------
# Entry point
# ----------------------------------------------------------------------
if __name__ == "__main__":
    out = advanced_training_with_freeze()
    print("\nSummary:")

—

Functional Explanation#

This training routine enhances the baseline model with depth adaptation and parameter freezing, resulting in a more controlled optimization process that preserves gradient stability. All components operate deterministically (the backends are built from a configuration with a fixed seed, and gradients come from finite differences), so results are reproducible across runs.

Recursive Backend Dynamics

Each backend updates its state using depth-controlled recursion:

\[S_t^{(d)} = \tanh(W^{(d)} S_{t-1} + b^{(d)})\]

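Depth \(d\) determines the number of recursive evaluations per step. Increasing depth raises representational capacity but also produces larger gradients: in the run shown under Exact Output, the pre-clipping gradient norm grows from roughly 0.2 at depth 1 to about 0.5 at depth 2 and 0.9 at depth 3, which is why the learning rate, perturbation size, and clipping threshold are all tied to depth.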
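
To make the depth loop concrete, the following minimal sketch applies the element-wise update tanh(w * s + b) to an input vector `depth` times, in the same way as the loop in the backend snippet. The function name recurse_state and the sample input are assumptions made for illustration; w = 0.90 and b = 0.03 are the values of the first backend in build_model_chain.

```python
import math


def recurse_state(x, w, b, depth):
    """Apply s <- tanh(w * s + b) element-wise, `depth` times (at least once)."""
    s = list(x)
    for _ in range(max(1, int(depth))):
        s = [math.tanh(w * v + b) for v in s]
    return s


x = [0.30, 0.20, 0.10]  # illustrative input, not taken from a real run
for d in (1, 2, 3):
    state = recurse_state(x, w=0.90, b=0.03, depth=d)
    print(d, [round(v, 4) for v in state])
```

Because tanh is bounded, every component stays inside (-1, 1) regardless of depth; what changes with depth is how strongly the output depends on w and b, since each parameter enters the composition once per recursion level.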

Future Projection and Span Scaling

Future states are generated through a linear projection mechanism with depth-dependent span:

\[S_{t+1}^{(k)} = S_t + \Delta_d \cdot \mathcal{P}_k(S_t)\]

where \(\Delta_d\) decreases with increasing depth to maintain bounded perturbations. This ensures consistent diversity among causal branches without instability.
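
The span rule used in project_future can be written as a small standalone function. In this sketch the base span 0.22 is the proj_span of the first backend (assuming proj_span is what the backend stores as its base span), while the floor value 0.05 is a placeholder, since the actual span floor is not shown in the snippet.

```python
def depth_span(base_span, depth, span_floor=0.05):
    """Span shrinks as 1 / (1 + 0.3 * (depth - 1)) and never drops below the floor.

    span_floor=0.05 is an assumed placeholder value.
    """
    return max(span_floor, base_span / (1.0 + 0.3 * (depth - 1)))


for d in (1, 2, 3):
    print(d, round(depth_span(0.22, d), 4))
```

The floor matters only for large depths; at the depths 1 to 3 used in this example the span is governed by the 0.3 factor alone.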

Composite Loss Function

The loss combines predictive, consistency, and coherence terms:

\[\mathcal{L}_{total} = \mathcal{L}_{task} + 0.5\,\mathcal{L}_{consistency} + 0.3\,\mathcal{L}_{coherence}\]

Each term regulates a specific property:
- Task: prediction accuracy.
- Consistency: smooth temporal evolution.
- Coherence: branch uniformity at projection level.
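
As a quick arithmetic check of the weighting, the snippet below recombines the three components reported in the best-epoch snapshot of the example output and compares the result with the reported total. The function name total_loss is an assumption; the weights are the ones in the formula above.

```python
def total_loss(task, cons, coh, w_cons=0.5, w_coh=0.3):
    """Weighted sum of the task, consistency, and coherence terms."""
    return task + w_cons * cons + w_coh * coh


# Component values copied from the best-epoch snapshot (epoch 15).
task = 0.01889342623335241
cons = 0.0010430885080397983
coh = 0.01243963263570178

total = total_loss(task, cons, coh)
print(f"{total:.6f}")
```

The recombined value agrees with the reported total of 0.023146860278082843 to within floating-point rounding, which confirms that the snapshot's "total" field is the weighted sum rather than a plain sum of the three terms.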

Finite-Difference Gradient Estimation

Gradients are computed numerically:

\[g_i = \frac{\mathcal{L}(\theta_i + \epsilon) - \mathcal{L}(\theta_i - \epsilon)}{2\epsilon}\]

This avoids dependency on differentiable computation graphs and maintains stability under recursion, at the cost of two full loss evaluations per parameter in every non-frozen epoch.
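
The central-difference estimate can be checked on a loss whose gradient is known in closed form. The sketch below is a simplified variant of the idea: it passes the parameter dictionary directly to the loss instead of going through apply_params_fn, and the names central_diff and toy_loss are hypothetical.

```python
def central_diff(loss_fn, params, eps):
    """Estimate d(loss)/d(param) for each scalar parameter by central differences."""
    grads = {}
    for key in params:
        plus = dict(params)
        minus = dict(params)
        plus[key] = params[key] + eps
        minus[key] = params[key] - eps
        grads[key] = (loss_fn(plus) - loss_fn(minus)) / (2.0 * eps)
    return grads


def toy_loss(p):
    """Quadratic toy loss with analytic gradient (2 * (w - 1), 6 * b)."""
    return (p["w"] - 1.0) ** 2 + 3.0 * p["b"] ** 2


grads = central_diff(toy_loss, {"w": 0.9, "b": 0.03}, eps=1e-3)
print(grads)
```

For a quadratic loss the central difference is exact up to rounding, so the estimates match the analytic values -0.2 and 0.18 closely. For the recursive tanh backends the truncation error scales with the square of the perturbation size, which is one reason to shrink the perturbation as training progresses.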

Adaptive Learning and Perturbation Scaling

Learning parameters decay with both depth and epoch index:

\[\eta_{\text{eff}} = \frac{\eta_0}{(1 + 0.5(d - 1))(1 + 0.5e)}, \quad \epsilon_{\text{eff}} = \frac{\epsilon_0}{(1 + 0.3(d - 1))(1 + 0.3e)}\]

where \(e\) is the current epoch. This provides temporal damping, ensuring smaller updates as the system stabilizes.
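
The two schedules can be evaluated directly and compared with the lr_eff and eps_eff columns of the example output. The function names below are assumptions; the constants come from the formula above, with BASE_LR = 5e-2 and BASE_EPS = 1e-3 as in the training function.

```python
def effective_lr(base_lr, depth, epoch):
    """Learning rate damped by both recursion depth and epoch index."""
    return base_lr / ((1.0 + 0.5 * (depth - 1)) * (1.0 + 0.5 * epoch))


def effective_eps(base_eps, depth, epoch):
    """Finite-difference perturbation damped by depth and epoch index."""
    return base_eps / ((1.0 + 0.3 * (depth - 1)) * (1.0 + 0.3 * epoch))


# (depth, epoch) pairs taken from the example run.
for depth, epoch in [(1, 0), (1, 1), (2, 4), (3, 15)]:
    lr = effective_lr(5e-2, depth, epoch)
    eps = effective_eps(1e-3, depth, epoch)
    print(f"epoch={epoch:2d} depth={depth} lr_eff={lr:.3e} eps_eff={eps:.3e}")
```

Formatted to three decimals, these values reproduce the logged ones, for example lr_eff=1.111e-02 and eps_eff=3.497e-04 at the first freeze epoch (epoch 4, depth 2).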

Freeze Epochs

After each depth increase, one epoch executes without parameter updates:

\[\theta_{e+1} = \theta_e \quad \text{if depth\_changed=True at epoch } e\]

This step prevents transient gradient noise from destabilizing new recursion levels.
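
The freeze decision reduces to comparing the current depth with the previous epoch's depth. The sketch below uses a hypothetical linear schedule, linear_depth, that rounds an interpolation from 1 to 3 over 15 epochs; this is an assumption about how DepthScheduler behaves, chosen because it happens to reproduce the transitions seen in the example output, not a description of its actual implementation.

```python
def linear_depth(epoch, start=1, end=3, epochs=15):
    """Hypothetical schedule: linear interpolation from start to end, rounded."""
    frac = min(1.0, epoch / epochs)
    return int(round(start + (end - start) * frac))


def freeze_epochs(total_epochs):
    """Return the epochs in which the depth differs from the previous epoch."""
    frozen = []
    prev_depth = None
    for epoch in range(total_epochs):
        depth = linear_depth(epoch)
        depth_changed = prev_depth is not None and depth != prev_depth
        if depth_changed:
            # A real loop would evaluate and log the loss here, then skip the update.
            frozen.append(epoch)
        prev_depth = depth
    return frozen


print(freeze_epochs(16))
```

Under this assumed schedule the frozen epochs are 4 and 12, the same two epochs marked FREEZE in the example output.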

Adaptive Gradient Clipping

The clipping threshold increases with mean depth \(\bar{d}\):

\[\begin{split}\tau = \begin{cases} 5\times10^{-2}, & \bar{d} < 1.5 \\ 7.5\times10^{-2}, & 1.5 \le \bar{d} < 2.5 \\ 1\times10^{-1}, & \bar{d} \ge 2.5 \end{cases}\end{split}\]

The gradients are then rescaled:

\[g_i' = g_i \cdot \min\left(1, \frac{\tau}{\|g\|_2}\right)\]

providing consistent control across all recursion depths.
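
A compact version of the clipping rule, written for plain floats rather than one-element arrays, might look as follows. The thresholds are the ones given above; the names clip_threshold and clip_by_norm and the sample gradient are illustrative assumptions.

```python
import math


def clip_threshold(depth_mean):
    """Depth-dependent bound on the gradient L2 norm."""
    if depth_mean < 1.5:
        return 5e-2
    if depth_mean < 2.5:
        return 7.5e-2
    return 1e-1


def clip_by_norm(grads, depth_mean):
    """Rescale all gradients by min(1, tau / ||g||_2); return (clipped, before, after)."""
    tau = clip_threshold(depth_mean)
    norm = math.sqrt(sum(g * g for g in grads.values()))
    scale = min(1.0, tau / norm) if norm > 0.0 else 1.0
    clipped = {k: g * scale for k, g in grads.items()}
    return clipped, norm, norm * scale


# Illustrative gradient with L2 norm 0.5, clipped at depth 1.
clipped, before, after = clip_by_norm({"b0_w": 0.3, "b0_b": 0.4}, depth_mean=1.0)
print(clipped, before, after)
```

Because every component is multiplied by the same factor, the gradient direction is preserved and only its length is bounded. In the example output the norm is clipped in every non-frozen epoch, so the logged post-clipping norm always equals the threshold for the current depth.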

Metric Evaluation

After training, four metrics summarize performance:
- SMAPE: symmetric mean absolute percentage error.
- RMSE: root mean square error.
- Overshoot: excess deviation in prediction amplitude.
- Robustness: correlation-based stability ratio.
All metrics are computed on the first output channel (channel 0) only, so the reported values do not reflect the other two channels.
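
The definitions of smape_safe and rmse are not shown in the source, so the sketch below uses one common convention for each, under hypothetical names. It is meant to illustrate why an all-zero target, as used in this example, drives SMAPE to 100% whenever the prediction is nonzero: with this normalization each term becomes |pred| / |pred|. This appears consistent with the SMAPE of 100% in the example output, but the actual implementation may differ.

```python
import math


def smape_percent(y_true, y_pred, tiny=1e-12):
    """Assumed SMAPE convention: mean of |y - p| / (|y| + |p|), in percent."""
    terms = [abs(y - p) / (abs(y) + abs(p) + tiny) for y, p in zip(y_true, y_pred)]
    return 100.0 * sum(terms) / len(terms)


def rmse_value(y_true, y_pred):
    """Root mean square error."""
    sq = [(y - p) ** 2 for y, p in zip(y_true, y_pred)]
    return math.sqrt(sum(sq) / len(sq))


zeros = [0.0, 0.0, 0.0]
pred = [0.1, -0.2, 0.3]  # illustrative predictions, not from a real run
print(smape_percent(zeros, pred), rmse_value(zeros, pred))
```

With a zero target, RMSE remains informative because it measures the absolute size of the prediction, whereas a percentage error of this form saturates.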

—

Exact Output#

[Epoch 0] total_before=0.031141 total_after=0.030663 depth=[1, 1, 1] lr_eff=5.000e-02 eps_eff=1.000e-03 ||g||_before=1.971e-01 ||g||_after=5.000e-02
[Epoch 1] total_before=0.030663 total_after=0.030362 depth=[1, 1, 1] lr_eff=3.333e-02 eps_eff=7.692e-04 ||g||_before=1.848e-01 ||g||_after=5.000e-02
[Epoch 2] total_before=0.030362 total_after=0.030145 depth=[1, 1, 1] lr_eff=2.500e-02 eps_eff=6.250e-04 ||g||_before=1.767e-01 ||g||_after=5.000e-02
[Epoch 3] total_before=0.030145 total_after=0.029977 depth=[1, 1, 1] lr_eff=2.000e-02 eps_eff=5.263e-04 ||g||_before=1.707e-01 ||g||_after=5.000e-02
[Epoch 4] FREEZE depth=[2, 2, 2] total=0.026492 lr_eff=1.111e-02 eps_eff=3.497e-04
[Epoch 5] total_before=0.026492 total_after=0.026122 depth=[2, 2, 2] lr_eff=9.524e-03 eps_eff=3.077e-04 ||g||_before=5.258e-01 ||g||_after=7.500e-02
[Epoch 6] total_before=0.026122 total_after=0.025806 depth=[2, 2, 2] lr_eff=8.333e-03 eps_eff=2.747e-04 ||g||_before=5.114e-01 ||g||_after=7.500e-02
[Epoch 7] total_before=0.025806 total_after=0.025532 depth=[2, 2, 2] lr_eff=7.407e-03 eps_eff=2.481e-04 ||g||_before=4.988e-01 ||g||_after=7.500e-02
[Epoch 8] total_before=0.025532 total_after=0.025291 depth=[2, 2, 2] lr_eff=6.667e-03 eps_eff=2.262e-04 ||g||_before=4.876e-01 ||g||_after=7.500e-02
[Epoch 9] total_before=0.025291 total_after=0.025076 depth=[2, 2, 2] lr_eff=6.061e-03 eps_eff=2.079e-04 ||g||_before=4.775e-01 ||g||_after=7.500e-02
[Epoch 10] total_before=0.025076 total_after=0.024883 depth=[2, 2, 2] lr_eff=5.556e-03 eps_eff=1.923e-04 ||g||_before=4.683e-01 ||g||_after=7.500e-02
[Epoch 11] total_before=0.024883 total_after=0.024707 depth=[2, 2, 2] lr_eff=5.128e-03 eps_eff=1.789e-04 ||g||_before=4.599e-01 ||g||_after=7.500e-02
[Epoch 12] FREEZE depth=[3, 3, 3] total=0.023982 lr_eff=3.571e-03 eps_eff=1.359e-04
[Epoch 13] total_before=0.023982 total_after=0.023682 depth=[3, 3, 3] lr_eff=3.333e-03 eps_eff=1.276e-04 ||g||_before=9.096e-01 ||g||_after=1.000e-01
[Epoch 14] total_before=0.023682 total_after=0.023404 depth=[3, 3, 3] lr_eff=3.125e-03 eps_eff=1.202e-04 ||g||_before=8.949e-01 ||g||_after=1.000e-01
[Epoch 15] total_before=0.023404 total_after=0.023147 depth=[3, 3, 3] lr_eff=2.941e-03 eps_eff=1.136e-04 ||g||_before=8.810e-01 ||g||_after=1.000e-01

Best parameters saved at: runs/best_params.json

=== Final metrics (channel 0) ===
SMAPE:      100.000000 %
RMSE:       0.167734
Overshoot:  0.000000
Robustness: 0.972635

Best epoch snapshot: {'epoch': 15, 'task': 0.01889342623335241, 'cons': 0.0010430885080397983, 'coh': 0.01243963263570178, 'total': 0.023146860278082843}

Telemetry JSONL → runs/telemetry_stable.jsonl

Summary:
{
  "best": {
    "epoch": 15,
    "task": 0.01889342623335241,
    "cons": 0.0010430885080397983,
    "coh": 0.01243963263570178,
    "total": 0.023146860278082843
  },
  "metrics": {
    "smape": 100.0,
    "rmse": 0.16773380360977846,
    "overshoot": 0.0,
    "robustness": 0.9726352677136917
  }
}
