MemoAct

Abstract

Memory-augmented robotic policies are essential in handling memory-dependent tasks. However, existing approaches typically rely on simply extending the observation window, struggling to simultaneously achieve precise task-state tracking and robust long-horizon retention. To overcome these challenges, inspired by the Atkinson–Shiffrin memory model, we propose MemoAct, a hierarchical memory- augmented policy that leverages distinct memory tiers to tackle specific bottlenecks. Specifically, sensory memory filters immediate perceptual inputs, lossless short-term memory supports precise task-state tracking, and compressed long-term memory facilitates robust long-horizon retention. To enrich the evaluation landscape, we construct MemoryRTBench based on RoboTwin 2.0, comprising 6 manipulation tasks that systematically evaluate policy memory capabilities across three dimensions: sequential, spatial, and episodic memory. Extensive experiments across simulated and real-world scenarios demonstrate that MemoAct achieves superior performance compared to both existing Markovian baselines and history-aware policies.

Introduction

(a) An example of a memory-dependent task. (b) Policies lacking historical awareness fail under identical observations, while existing representative memory mechanisms suffer from limited long-horizon retention and poor task-state tracking. (c) Inspired by the Atkinson–Shiffrin memory model, we propose MemoAct, which simultaneously enables precise task-state tracking and robust long-horizon retention. (d) Results on MemoryRTBench, RMBench, and real-world experiments demonstrate that MemoAct significantly outperforms baseline algorithms.

Overview of MemoAct

First, the sensory distillation module encodes RGB images and proprioceptive states into high-fidelity sensory memory. Subsequently, the distilled sensory memory queries relevant historical context from the long short-term memory bank. Next, a gating network adaptively fuses the retrieved history with the current sensory memory to produce a condition embedding, which guides the action decoder to iteratively denoise noisy action trajectories into history-aware action trajectories. Finally, the long short-term memory consolidation module updates the memory bank after each forward pass. Please refer to our paper for details.

MemoryRTBench and Real-world tasks

To evaluate MemoAct, we establish MemoryRTBench with 6 manipulation tasks assessing sequential, spatial, and episodic memory. These three memory dimensions evaluate the ability to execute sub-tasks in the prescribed order, recall the initial scene state after intermediate operations, and repeat a specified task the required number of times, respectively.

Grasp and Release Bowl

Doll Swap Placement

Sequential Transfer and Return

Block Place and Return

Interleaved Transfer and Return

Sequential Hammer Tap

Lift Bottle Twice

Click Bell Twice and Clock Once

Put Block Back

Click Three Buttons in Order

Experiments

As shown in Tables I, II, and III, MemoAct achieves the best overall performance across simulation and real-world evaluations. It obtains average success rates of 94.5%, 49.1%, and 77.5% on MemoryRTBench, RMBench, and real-world tasks, outperforming the strongest baseline MVMP by 19.1%, 20.6%, and 11.25%, respectively. As shown in Table IV, MemoAct demonstrates certain generalization ability, and its memory module remains effective under distribution shifts. As shown in Table VI, our memory module can be seamlessly integrated into existing policies to improve history-aware decision making.

As shown in the figure above, the long-term and short-term memory capacities control long-horizon retention and task-state tracking, respectively.

We conduct an efficiency evaluation using 50 CTBO expert demonstrations with batch size 32. All measurements are conducted on a single NVIDIA RTX 4090 GPU. We measure the parameter count, training time, inference latency, and GPU memory usage for MemoAct and all compared baselines. The results are summarized above.

SeedPolicy

Generalization Experiments at 8x Speed

Unseen Object Color: Red to Yellow

Unseen Object Color: Red to Green

Unseen Background

Failure Cases

Despite these promising results, MemoAct has certain limitations. Since compressing an entire RGB image into a single token via the sensory distillation Module inevitably compromises visual fidelity, MemoAct occasionally makes errors in tasks that are highly sensitive to fine-grained visual information. In future work, we plan to explore adaptive compression mechanisms that better balance memory storage efficiency with visual precision.