MemoAct: Atkinson–Shiffrin-Inspired Memory-Augmented Visuomotor Policy for Robotic Manipulation

Chongqing University
† Corresponding author

Abstract

Memory-augmented robotic policies are essential for handling memory-dependent tasks. However, existing approaches typically just extend the observation window, and they struggle to achieve precise task state tracking and robust long-horizon retention at the same time. To overcome these challenges, and inspired by the Atkinson–Shiffrin memory model, we propose MemoAct, a hierarchical memory-based policy that assigns a distinct memory tier to each bottleneck: lossless short-term memory ensures precise task state tracking, while compressed long-term memory enables robust long-horizon retention. To broaden the evaluation landscape, we construct MemoryRTBench on top of RoboTwin 2.0, specifically tailored to assess policy capabilities in task state tracking and long-horizon retention. Extensive experiments across simulated and real-world scenarios demonstrate that MemoAct outperforms both existing Markovian baselines and history-aware policies.

Introduction

(a) An example of a memory-dependent task. (b) Policies lacking historical awareness fail under identical observations, while existing representative memory mechanisms suffer from limited long-horizon retention and poor task state tracking. (c) Inspired by the Atkinson–Shiffrin memory model, we propose MemoAct, which simultaneously enables precise task state tracking and robust long-horizon retention. (d) Results on MemoryRTBench, RMBench and real-world experiments demonstrate that MemoAct significantly outperforms baseline algorithms.

Overview of MemoAct

The sensory distillation module first encodes RGB images and proprioceptive states into high-fidelity features, termed sensory memory. This memory serves as a query to retrieve relevant historical context from the long short-term memory bank, which is processed by a temporal transformer encoder. Subsequently, a gating network adaptively fuses the retrieved history with the current sensory memory to produce a condition token. Guided by this token, the action decoder iteratively denoises random noise into history-aware action trajectories. Finally, the consolidation module updates the memory bank after each forward pass. For a detailed introduction to the memory consolidation module, please refer to our paper.
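The data flow described above can be illustrated with a minimal NumPy sketch. It is not the MemoAct implementation: the class `TwoTierMemory`, the functions `retrieve` and `gated_fusion`, the token dimension, the capacities, and the random gate weights are all hypothetical. Single-head dot-product attention stands in for the temporal transformer encoder, mean pooling stands in for the consolidation module (whose actual design is given in the paper), and the diffusion-style action decoder is omitted.

```python
import numpy as np
from collections import deque


class TwoTierMemory:
    """Hypothetical two-tier bank: exact recent tokens plus mean-pooled older ones."""

    def __init__(self, dim, short_capacity=4, chunk=2):
        self.dim = dim
        self.short_capacity = short_capacity  # max number of lossless tokens
        self.chunk = chunk                    # tokens merged per long-term entry
        self.short = deque()                  # short-term tier (stored as-is)
        self.long = []                        # long-term tier (compressed)

    def consolidate(self, token):
        """Store the newest token; compress the oldest ones on overflow."""
        self.short.append(token.copy())
        if len(self.short) > self.short_capacity:
            evicted = [self.short.popleft() for _ in range(self.chunk)]
            # Assumption: mean pooling as a placeholder compression scheme.
            self.long.append(np.mean(evicted, axis=0))

    def tokens(self):
        """Return all stored tokens, oldest first, as a (n, dim) array."""
        items = self.long + list(self.short)
        if not items:
            return np.zeros((0, self.dim))
        return np.stack(items)


def retrieve(query, memory):
    """Single-head scaled dot-product attention of one query over the bank."""
    if memory.shape[0] == 0:
        return np.zeros_like(query)
    scores = memory @ query / np.sqrt(query.shape[0])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ memory


def gated_fusion(sensory, history, w_gate, b_gate):
    """Element-wise sigmoid gate blending the current and retrieved features."""
    logits = w_gate @ np.concatenate([sensory, history]) + b_gate
    gate = 1.0 / (1.0 + np.exp(-logits))
    return gate * sensory + (1.0 - gate) * history


rng = np.random.default_rng(0)
dim = 8
bank = TwoTierMemory(dim)
w_gate = rng.normal(scale=0.1, size=(dim, 2 * dim))  # untrained, illustrative
b_gate = np.zeros(dim)

for step in range(7):
    sensory = rng.normal(size=dim)            # stand-in for the sensory memory token
    history = retrieve(sensory, bank.tokens())
    condition = gated_fusion(sensory, history, w_gate, b_gate)
    # An action decoder would consume `condition` here.
    bank.consolidate(sensory)                 # update the bank after the forward pass
```

In this sketch the bank grows by one token per step until the short-term tier overflows, at which point the two oldest tokens collapse into one long-term entry, so storage grows more slowly than the episode length while the most recent steps remain exact.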

MemoryRTBench and Real-world tasks

To comprehensively evaluate task state tracking and long-horizon retention, we introduce MemoryRTBench, which consists of four simulation tasks. To ensure a thorough assessment, we also benchmark MemoAct on the language-independent tasks of RMBench, followed by validation on two real-world tasks. These tasks are characterized by multiple subtasks and frequent occurrences of perceptually identical observations across different states, compelling the policy to rely on historical context for disambiguation. Moreover, their extended execution horizons require recalling the initial states of objects, posing a strict demand for robust long-term memory.
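To see why perceptually identical observations force a policy to use history, consider the toy example below. It is a hypothetical illustration, not one of the MemoryRTBench tasks: the demonstration, the observation names, and the `fit` and `replay` helpers are invented for this sketch. A lookup-table policy is fitted to one demonstration in which the observation "raised" calls for "tap" twice and then "stop", and the table is keyed either on the current observation alone (Markovian) or on the full observation history.

```python
from collections import Counter

# Toy demonstration: the observation "raised" appears three times,
# but the correct action on its third occurrence differs from the first two.
demo = [
    ("raised", "tap"),
    ("lowered", "lift"),
    ("raised", "tap"),
    ("lowered", "lift"),
    ("raised", "stop"),
]


def fit(demo, use_history):
    """Build a majority-vote lookup table from observations (or histories) to actions."""
    votes = {}
    history = ()
    for obs, action in demo:
        history = history + (obs,)
        key = history if use_history else (obs,)
        votes.setdefault(key, Counter())[action] += 1
    return {key: counter.most_common(1)[0][0] for key, counter in votes.items()}


def replay(policy, demo, use_history):
    """Count how many demonstration steps the policy reproduces."""
    history = ()
    correct = 0
    for obs, action in demo:
        history = history + (obs,)
        key = history if use_history else (obs,)
        correct += int(policy[key] == action)
    return correct


markov_correct = replay(fit(demo, use_history=False), demo, use_history=False)
history_correct = replay(fit(demo, use_history=True), demo, use_history=True)
print(markov_correct, history_correct)  # prints "4 5"
```

The Markovian table must pick a single action for "raised", so it keeps tapping and never stops, whereas the history-keyed table separates the three visits. A learned policy faces the same ambiguity whenever distinct task states yield identical observations.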

Video Results on MemoryRTBench, RMBench and Real-world Tasks

Sequential Hammer Tap (SHT) at 2x Speed

MemoAct (ours)

MVMP

SAMP

DP

ACT

Block Place and Return (BPR) at 2x Speed

MemoAct (ours)

MVMP

SAMP

DP

ACT

Dual-Arm Sequential Transfer and Return (DA-STR) at 2x Speed

MemoAct (ours)

MVMP

SAMP

DP

ACT

Dual-Arm Interleaved Transfer and Return (DA-ITR) at 2x Speed

MemoAct (ours)

MVMP

SAMP

DP

ACT

Rearrange Blocks at 2x Speed

MemoAct (ours)

MVMP

SAMP

DP

ACT

Put Back Block at 2x Speed

MemoAct (ours)

MVMP

SAMP

DP

ACT

Swap T at 2x Speed

MemoAct (ours)

MVMP

SAMP

DP

ACT

Sequential Bowl Placement (SBP) at 8x Speed

MemoAct (ours)

MVMP

SAMP

DP

Doll Swap Placement (DSP) at 8x Speed

MemoAct (ours)

MVMP

SAMP

DP

Failure Cases

Despite these promising results, MemoAct has certain limitations. Because the sensory distillation module compresses an entire RGB image into a single token, some visual fidelity is inevitably lost, and MemoAct occasionally makes errors in tasks that are highly sensitive to fine-grained visual information. In future work, we plan to explore adaptive compression mechanisms that better balance memory storage efficiency with visual precision.

Put Back Block

Swap T

SHT Failure Case 1

SHT Failure Case 2

DSP Failure Case 1

DSP Failure Case 2

Citation

@misc{memoact,
      title={MemoAct: Atkinson-Shiffrin-Inspired Memory-Augmented Visuomotor Policy for Robotic Manipulation}, 
      author={Liufan Tan and Jiale Li and Gangshan Jing},
      year={2026},
      eprint={2603.18494},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2603.18494}, 
}