thoughts on continual learning
AGI requires agents that improve from experience rather than starting fresh each time. Continual learning is the bridge between one-shot inference and true intelligence.
The Problem with Memory-R1
Memory-R1[1] introduced something clever: train agents to manage their own external memory banks. A "Memory Manager" learns to ADD, UPDATE, DELETE, and NOOP facts, while an "Answer Agent" retrieves and reasons over them. It works well for factual Q&A.
But it breaks down for complex reasoning. The action space is too simple—you can't compress a 50-step coding trajectory into a single ADD operation. The reward model is sparse and outcome-based, giving no signal about why a trajectory succeeded or failed. Most critically, it has no mechanism to learn from failures. A wrong strategy just pollutes the memory bank.
ReasoningBank's Insight
ReasoningBank[2] showed us what we're actually missing: procedural memory. Instead of storing facts ("Paris is the capital of France"), store strategies ("when the API returns 403, check authentication headers before retrying"). They introduced Memory-aware Test-Time Scaling (MaTTS)—generate multiple rollouts, contrast them, and distill reusable reasoning patterns into a memory bank.
The key insight: failures contain information. A failed trajectory that tried approach A before succeeding with approach B teaches you a guardrail ("avoid A in context X").
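The rollout-and-contrast loop can be sketched in a few lines of Python. Here `agent` and `distill` stand in for LLM calls; the names and data shapes are my assumptions, not ReasoningBank's actual API:

```python
def matts(agent, distill, task, k=4):
    """Memory-aware Test-Time Scaling, roughly as described for ReasoningBank:
    sample k rollouts for one task, split them by outcome, and contrast the
    two groups to extract reusable strategies and guardrails."""
    rollouts = [agent(task) for _ in range(k)]
    successes = [r for r in rollouts if r["ok"]]
    failures = [r for r in rollouts if not r["ok"]]
    # Contrasting successes against failures is what turns raw trajectories
    # into procedural memory ("avoid A in context X", "prefer B instead").
    return distill(successes, failures)
```

The contrast step is the point: a lone success tells you what worked, but success paired with failure tells you what mattered.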
Reinforced Adaptive Memory (RAM)
What if we combined both approaches? Take Memory-R1's RL framework for active memory management, but upgrade it to handle ReasoningBank-style procedural strategies.
Here are three modifications:
1. The Architect Agent
Replace the fact-based manager with an offline "Architect Agent" that compresses entire trajectories into reusable strategies. After each task attempt, it learns to distill:
- Success patterns → skills ("to parse JSON responses, first validate structure")
- Failure patterns → guardrails ("don't use regex on nested HTML")
- Common patterns → rules ("authentication errors always require token refresh")
Train this via an outcome reward model (ORM) that measures: did updating the memory improve performance on held-out tasks?
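The distillation and its outcome reward can be sketched as follows, assuming a black-box `evaluate(memory, tasks)` that returns a held-out success rate; all names here are hypothetical, and the real Architect would be an LLM call rather than a string template:

```python
from dataclasses import dataclass

@dataclass
class Strategy:
    kind: str   # "skill", "guardrail", or "rule"
    text: str

def architect(trajectory, succeeded):
    """Toy stand-in for an LLM call that compresses a whole
    trajectory into one reusable strategy of the right kind."""
    kind = "skill" if succeeded else "guardrail"
    return Strategy(kind=kind, text=f"distilled from {len(trajectory)} steps")

def orm_reward(evaluate, memory, strategy, heldout_tasks):
    """Outcome reward for the Architect: did adding this
    strategy improve performance on held-out tasks?"""
    before = evaluate(memory, heldout_tasks)
    after = evaluate(memory + [strategy], heldout_tasks)
    return after - before
```

Because the reward is a performance delta on held-out tasks, the Architect is never graded on how plausible a strategy sounds, only on whether storing it helps later.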
2. Procedural Action Space
Expand beyond ADD/UPDATE/DELETE/NOOP to operations that match how we actually learn. The agent needs operations to distill successful trajectories into reusable skills, refine existing strategies when they fail in new ways, merge redundant patterns together, and prune outdated approaches that no longer work. This lets the agent build a compositional library of strategies rather than a flat key-value store.
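One way to sketch such an action space over a list-backed strategy bank; the operation names are mine, not from either paper:

```python
from enum import Enum, auto

class MemOp(Enum):
    DISTILL = auto()  # compress a trajectory into a new strategy
    REFINE = auto()   # revise an existing strategy after a new failure mode
    MERGE = auto()    # fold two redundant strategies into one
    PRUNE = auto()    # drop a strategy that no longer helps

def apply_op(bank, op, payload):
    """Apply one memory operation to a list-backed strategy bank."""
    if op is MemOp.DISTILL:
        bank.append(payload)            # payload: new strategy text
    elif op is MemOp.REFINE:
        idx, text = payload             # payload: (index, revised text)
        bank[idx] = text
    elif op is MemOp.MERGE:
        keep, drop = payload            # payload: (index kept, index dropped)
        bank[keep] = bank[keep] + " | " + bank[drop]
        del bank[drop]
    elif op is MemOp.PRUNE:
        del bank[payload]               # payload: index to remove
    return bank
```

A real compositional library would need richer structure than a flat list; this only shows the shape of the action space.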
3. The Filter Agent with Deterministic Rewards
Here's where it gets interesting. Instead of training a process reward model (PRM) to judge memory relevance—which is subjective and expensive—we use a deterministic reward signal.
The setup:
- Collect initial trajectories on N tasks (most will fail)
- Populate memory bank with different granularities: step-level episodes, distilled strategies, compressed skills
- Store the best historical performance for each task: (s_best, l_best), where s_best records success or failure and l_best is the trajectory length
During RL training of the Filter Agent, the reward signal compares current performance against historical bests. If both the current and historical attempts succeed, reward efficiency improvements. If the agent fixes a previously failing task, give a large reward. If it regresses on a task it used to solve, penalize it. If both fail, apply a small penalty. This creates a deterministic signal that encourages the agent to maintain successes, fix failures, and improve efficiency over time.
The Filter Agent learns to select the single best memory from Top-K retrieval at each step, trained to maximize cumulative reward. The reward is deterministic—it compares against historical best, not a subjective judgment.
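The four cases can be written directly as a reward function; the specific magnitudes below are placeholder constants, not tuned values:

```python
def filter_reward(s_cur, l_cur, s_best, l_best,
                  fix_bonus=1.0, regress_penalty=-1.0, fail_penalty=-0.1):
    """Deterministic reward against a task's historical best (s_best, l_best).
    s_* are booleans (success), l_* are trajectory lengths."""
    if s_cur and s_best:
        # Both succeed: reward efficiency, positive iff we beat l_best.
        return (l_best - l_cur) / max(l_best, 1)
    if s_cur and not s_best:
        return fix_bonus          # fixed a previously failing task
    if not s_cur and s_best:
        return regress_penalty    # regressed on a previously solved task
    return fail_penalty           # still failing: small penalty
```

Everything here is computed from logged history, so no reward model needs to be trained or queried.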
Why This Works
Traditional RL for agents suffers from sparse rewards: you only know if the entire trajectory succeeded or failed. PRMs try to fix this by rating each step, but training them requires expensive human labeling or synthetic data generation.
Our approach sidesteps this: the reward at each step is implicitly dense because we're comparing trajectories in progress against historical bests. If retrieving memory X at step 5 leads to a 10-step solution when the best was 15 steps, the Filter Agent learns that X was valuable at step 5 for this task type.
As training progresses:
- Filter Agent gets better at recall precision
- Better recalls lead to better trajectories
- Better trajectories become new benchmarks (l_best decreases)
- The bar keeps rising, creating compound improvement
We can add an entropy-based threshold: only retrieve memories when the agent is uncertain (high entropy over next actions). This prevents memorization and encourages generalization.
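A minimal version of that gate, assuming access to the policy's next-action distribution; the threshold is a free hyperparameter, not a value from either paper:

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def should_retrieve(action_probs, threshold=1.0):
    """Only consult the memory bank when the policy is
    uncertain about its next action (high entropy)."""
    return entropy(action_probs) > threshold
```

A confident policy (one action near probability 1) has entropy near zero and skips retrieval; a near-uniform policy crosses the threshold and gets memory support.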
Why This Matches Human Learning
Humans don't learn by rating individual thoughts. We learn by doing tasks repeatedly and noticing: "Last time this took me 3 hours, this time it took 30 minutes—I must have learned something useful."
The memory bank becomes a crystallization of experience. Early on, it's full of low-level episodes ("clicked the wrong button, got error"). With repetition, the Architect Agent compresses these into higher-level strategies ("always verify form state before submission"). The Filter Agent learns which strategies apply when.
This is closer to how expertise actually develops: through repeated practice with implicit feedback from improving performance.
Open Questions
Some things worth exploring:
Memory decay: Should old strategies fade if they stop being useful? Perhaps weight by recency and retrieval frequency.
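One simple scoring rule along those lines; the half-life and frequency weight are made-up defaults for illustration:

```python
import math

def decayed_score(base_score, age_steps, retrieval_count,
                  half_life=1000.0, freq_weight=0.1):
    """Weight a stored strategy by recency (exponential half-life)
    and by retrieval frequency (log, for diminishing returns)."""
    recency = 0.5 ** (age_steps / half_life)
    frequency = math.log1p(retrieval_count)
    return base_score * recency * (1.0 + freq_weight * frequency)
```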
Transfer learning: Can strategies learned on WebShop[3] transfer to WebArena[4]? You'd need a meta-learning layer that identifies task family similarities.
Multi-agent memory: What if multiple agents share a memory bank? You get diversity in strategy discovery but need consensus mechanisms for quality.
Compositionality: Can we learn to chain strategies? "For authentication errors, first try token refresh, then fallback to re-login."
The Path Forward
The insight here is simple: continual learning isn't about storing everything—it's about learning what to remember and when to recall it. Memory-R1 gave us the RL framework for active memory management. ReasoningBank showed us what to store. RAM combines them with a deterministic reward signal that makes training tractable.
What matters is the positive feedback loop: better memory → better performance → better benchmarks → better memory. Without continual learning, each task is an isolated event. With it, each task makes the agent sharper for the next one.
That's how you get from pattern matching to something that actually improves.
References
- [1] Yan, S., Yang, X., Huang, Z., Nie, E., Ding, Z., Li, Z., Ma, X., Schütze, H., Tresp, V., & Ma, Y. (2025). Memory-R1: Enhancing Large Language Model Agents to Manage and Utilize Memories via Reinforcement Learning. arXiv. https://arxiv.org/abs/2508.19828
- [2] Ouyang, S., Yan, J., Hsu, I-H., Chen, Y., Jiang, K., Wang, Z., Han, R., Le, L. T., Daruki, S., Tang, X., Tirumalashetty, V., Lee, G., Rofouei, M., Lin, H., Han, J., Lee, C-Y., & Pfister, T. (2025). ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory. arXiv. https://arxiv.org/abs/2509.25140
- [3] Yao, S., Chen, H., Yang, J., & Narasimhan, K. (2022). WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents. In Advances in Neural Information Processing Systems (NeurIPS 2022). https://arxiv.org/abs/2207.01206
- [4] Zhou, S., Xu, F. F., Zhu, H., Zhou, X., Lo, R., Sridhar, A., Cheng, X., Ou, T., Bisk, Y., Fried, D., Alon, U., & Neubig, G. (2023). WebArena: A Realistic Web Environment for Building Autonomous Agents. arXiv. https://arxiv.org/abs/2307.13854