llms-foundation-models · 2026-04-28 · Tier 2

Hope Architecture: Nested Learning and Continuously Adapting LLMs

Date: 2026-04-28
Sources: AI Papers Academy · YouTube · AI Breakfast
Raw: raw/gmail/2026-04-28-starred.md


TL;DR

Google's Hope architecture introduces Nested Learning — a training paradigm where each module has its own objective, learning rate, and update frequency, directly inspired by the brain's multi-timescale plasticity. The core components replace both attention (via Self-Modifying Titans) and FFN layers (via Continuum Memory System). Hope outperforms Transformers and recurrent baselines on Needle-in-a-Haystack, language modeling, and continual learning benchmarks at 760M and 1.3B parameter scales. It's early (small scale, research setting) but represents the most architecturally coherent challenge to frozen-weights Transformers since SSMs.


The Problem: Anterograde Amnesia in Transformers

Standard LLMs learn during training, then weights freeze. Within a context window they can adapt via in-context learning, but that adaptation disappears at session end. The authors compare this to anterograde amnesia — the inability to form new long-term memories.

Prior attempts at continual learning in Transformers run into catastrophic forgetting: new training overwrites old knowledge unpredictably. This is a fundamental consequence of applying gradient updates uniformly across all parameters.


Nested Learning

The core insight: learning should happen at multiple timescales simultaneously, like biological neural systems. Fast processes handle immediate information; slow processes consolidate long-term memory.

The building block is the Neural Learning Module (NLM): a component that actively learns from the incoming data stream, not just transforms it. Each NLM has:

  • Its own objective (what it's optimizing for)
  • Its own learning rate (how fast it updates)
  • Its own update frequency (how often it updates, measured in tokens between updates)

Models are composed of many NLMs, each learning at its own pace. The result: updates are not synchronized across the entire network, so forgetting in one module doesn't propagate universally.
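
To make the three per-module properties concrete, here is a minimal Python sketch. It is an illustration only, not the authors' implementation: ToyModule, squared_error_grad, and absolute_error_grad are hypothetical names, each module holds a single scalar weight, and gradient averaging over the update window is an assumption made for simplicity.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToyModule:
    """Hypothetical stand-in for one NLM: a single scalar weight."""
    name: str
    grad_fn: Callable[[float, float], float]  # gradient of this module's own objective
    lr: float                                 # this module's own learning rate
    update_every: int                         # tokens between weight updates
    weight: float = 0.0
    grad_sum: float = 0.0
    updates: int = 0

    def observe(self, step: int, x: float) -> None:
        # Accumulate the gradient on every token, but only apply it
        # when this module's own update period has elapsed.
        self.grad_sum += self.grad_fn(self.weight, x)
        if step % self.update_every == 0:
            self.weight -= self.lr * self.grad_sum / self.update_every
            self.grad_sum = 0.0
            self.updates += 1


def squared_error_grad(w: float, x: float) -> float:
    # d/dw of 0.5 * (w - x)^2
    return w - x


def absolute_error_grad(w: float, x: float) -> float:
    # d/dw of |w - x| (taken as 0 when w == x)
    return float((w > x) - (w < x))


fast = ToyModule("fast", squared_error_grad, lr=0.5, update_every=1)
slow = ToyModule("slow", absolute_error_grad, lr=0.1, update_every=8)

for step, x in enumerate([1.0] * 16, start=1):
    for module in (fast, slow):
        module.observe(step, x)

print(fast.updates, slow.updates)  # 16 2
```

After 16 identical tokens the fast module has nearly converged to the input value, while the slow module has updated only twice and moved a fifth of the way. The two modules never share an optimizer step, which is the desynchronization described above.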


Hope Architecture Components

Traditional Transformer layer:
  [Attention] → [FFN]

Hope layer:
  [Self-Modifying Titans] → [Continuum Memory System]
       (replaces attention)        (replaces FFN)

Self-Modifying Titans (Immediate Context)

Unlike attention, which passively reads from the context window, Titans actively compress the input into their own weights in real time. Built from multiple NLMs:

  • One NLM memorizes the input sequence
  • Other NLMs generate the module's own learning rate and decay factors

The model decides how and when to learn from each input — not just what to output. This is meta-learning embedded in the forward pass.
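
The idea of compressing input into weights, with an input-dependent learning rate and decay, can be sketched as a small fast-weight memory. This is a simplified illustration under stated assumptions, not the Self-Modifying Titans update rule from the paper: the delta-rule update, the class name SelfModifyingMemory, and the fixed projection vectors w_eta and w_alpha are all hypothetical. In the described design the components that generate the rate and decay are themselves learning modules; here they are fixed for brevity.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class SelfModifyingMemory:
    """Toy associative memory: M maps keys to values and rewrites itself per input."""

    def __init__(self, dim, w_eta, w_alpha):
        self.M = [[0.0] * dim for _ in range(dim)]
        self.w_eta = w_eta      # assumed projection that generates the learning rate
        self.w_alpha = w_alpha  # assumed projection that generates the decay factor

    def read(self, key):
        return [dot(row, key) for row in self.M]

    def write(self, key, value):
        # The input itself decides how strongly to learn and how much to forget.
        eta = sigmoid(dot(self.w_eta, key))
        alpha = sigmoid(dot(self.w_alpha, key))
        # Prediction error of the current memory on this (key, value) pair.
        err = [v - p for v, p in zip(value, self.read(key))]
        # Decay old content, then add a delta-rule correction (outer product).
        for i, row in enumerate(self.M):
            for j in range(len(row)):
                row[j] = (1.0 - alpha) * row[j] + eta * err[i] * key[j]
        return eta, alpha


memory = SelfModifyingMemory(dim=2, w_eta=[2.0, 2.0], w_alpha=[-6.0, -6.0])
key, value = [1.0, 0.0], [0.5, -1.0]

before = memory.read(key)
for _ in range(5):
    memory.write(key, value)
after = memory.read(key)

print(before, [round(a, 2) for a in after])
```

Before any write the memory returns zeros for the key; after a few writes the read is close to the stored value. No gradient step from an outer training loop is involved: the association is absorbed during the forward pass.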

Continuum Memory System (Long-Term Memory)

A sequence of slow-updating NLMs, each with a different update frequency. The key property: because modules update at different times, knowledge isn't overwritten everywhere simultaneously. If one module forgets a fact, the fact may persist in adjacent modules and propagate back.

This is a structural solution to catastrophic forgetting — not a regularization term applied to a standard architecture, but a genuine multi-timescale design.
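
A toy simulation shows why staggered update periods protect older knowledge. It is a sketch under strong assumptions: MemoryLevel is a hypothetical name, each level stores one scalar instead of a parameter block, the periods and learning rates are arbitrary, and the step where surviving knowledge flows back into faster levels is not modeled.

```python
class MemoryLevel:
    """Toy memory level: one scalar, updated once per `period` tokens."""

    def __init__(self, period, lr):
        self.period = period
        self.lr = lr
        self.value = 0.0
        self.buffer = []

    def step(self, x):
        self.buffer.append(x)
        if len(self.buffer) == self.period:
            # Consolidate the whole chunk in a single update.
            chunk_mean = sum(self.buffer) / self.period
            self.value += self.lr * (chunk_mean - self.value)
            self.buffer.clear()


# Three levels, from fast to slow (values chosen for illustration only).
levels = [MemoryLevel(1, 0.9), MemoryLevel(4, 0.5), MemoryLevel(16, 0.5)]

# 32 tokens carrying an "old fact" (+1.0), then 6 tokens of conflicting data (-1.0).
stream = [1.0] * 32 + [-1.0] * 6
for x in stream:
    for level in levels:
        level.step(x)

for level in levels:
    print(level.period, round(level.value, 3))
```

The fastest level has fully switched to the new signal, while the slowest level has not yet reached its next update and still holds the old one. At no point are all levels overwritten at once.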


Results

At 760M and 1.3B parameters:

  • Outperforms Transformers and recurrent baselines on Needle-in-a-Haystack (long-context retrieval)
  • Best average on language modeling and commonsense reasoning
  • Consistent improvement on continual learning benchmarks over all baselines

Important caveat: these results are at small scale (760M–1.3B). Whether Nested Learning scales to 7B, 70B, 700B is unknown. Transformers are highly optimized across that entire range. Hope is not.


Prior Context

Connects to the SSM/Mamba thread: Mamba (late 2023) was the last serious architectural challenger to Transformers — selective state space models with better long-sequence scaling. Hope attacks a different problem (continual adaptation, not long-context efficiency), but both are responses to Transformer limitations. The difference: Mamba tried to replace attention for efficiency; Hope tries to replace both attention and FFN for adaptability.

Connects to SuperLocalMemory (04-17): that work also targeted cross-session memory persistence for agents. Hope takes the opposite approach: instead of storing memory externally, it bakes persistence directly into the architecture's weight update mechanism. These are complementary hypotheses about where agent memory should live.

DeepSeek V4 Engram (04-24): Both Hope's CMS and DeepSeek V4's Engram Conditional Memory aim to separate factual storage from computation. Hope does it through architectural multi-timescale updates; Engram does it through explicit key-value retrieval. Two different structural solutions to the same problem: knowledge storage should be distinct from reasoning.


Why It Matters

If Nested Learning scales, it fundamentally changes what "training" means. Instead of a discrete train/deploy boundary, models would continuously incorporate new information during deployment. This would break the current paradigm of model versioning, fine-tuning cycles, and knowledge cutoff dates.

The self-modifying aspect of Titans also changes inference economics: computation isn't just matrix multiplication over static weights — it includes weight updates per token, which is more compute-intensive but potentially much more capable.
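
A back-of-envelope count gives a feel for the extra cost. The numbers below are not from the paper: they assume a single d x d memory matrix, a delta-rule write with decay (one possible update rule, chosen for illustration), and naive per-token execution with no chunking or parallel scan. The function names are hypothetical.

```python
def read_flops(d):
    # One d x d matrix-vector product: d*d multiplies plus d*d adds.
    return 2 * d * d


def write_flops(d):
    predict = 2 * d * d  # M @ key, needed to form the prediction error
    error = d            # value - prediction
    scale = d            # multiply the error by the learning rate
    decay = d * d        # scale every entry of M by (1 - alpha)
    outer = 2 * d * d    # one multiply and one add per entry of M
    return predict + error + scale + decay + outer


d = 1024
static_cost = read_flops(d)                        # frozen weights: read only
self_modifying_cost = read_flops(d) + write_flops(d)
ratio = self_modifying_cost / static_cost

print(round(ratio, 2))
```

Under these assumptions a per-token write roughly triples the arithmetic of a static layer of the same size. An optimized implementation could amortize part of this, so the figure is only an order-of-magnitude guide.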


Research Angle (Tier 2 → Tier 1 intersection)

The closest Tier 1 connection: if Hope's CMS acts as a continuously updated KV cache analog, the update frequency design problem becomes a KV cache management problem — when do you evict? When do you consolidate? The KV cache literature (sparse attention, paged attention, H2O eviction) is entirely focused on a static key-value store. Hope asks whether the store itself should be dynamic and self-updating. That's an open research question worth tracking.

Worth Watching: Does Hope's architecture reduce hallucination rates by separating factual retrieval from reasoning? DeepSeek V4's Engram has a 94% hallucination rate despite explicit O(1) factual retrieval — suggesting the separation alone doesn't solve hallucination. Hope's multi-timescale update approach might fare differently.


Open Questions

  1. Scaling: does the multi-timescale update overhead become prohibitive beyond 7B parameters?
  2. Update frequency hyperparameters — how do you set them, and how sensitive are results to these choices?
  3. Does catastrophic forgetting reappear in Hope at longer training horizons, or does the multi-timescale design genuinely eliminate it?

Related Pages