(a) An example of a memory-dependent task. (b) Policies lacking historical awareness fail under identical observations, while existing representative memory mechanisms suffer from limited long-horizon retention and poor task state tracking. (c) Inspired by the Atkinson–Shiffrin memory model, we propose MemoAct, which simultaneously enables precise task state tracking and robust long-horizon retention. (d) Results on MemoryRTBench, RMBench and real-world experiments demonstrate that MemoAct significantly outperforms baseline algorithms.
The sensory distillation module first encodes RGB images and proprioceptive states into high-fidelity features, termed sensory memory. This memory serves as a query to retrieve relevant historical context from the long short-term memory bank, which is processed by a temporal transformer encoder. Subsequently, a gating network adaptively fuses the retrieved history with the current sensory memory to produce a condition token. Guided by this token, the action decoder iteratively denoises random noise into history-aware action trajectories. Finally, the consolidation module updates the memory bank after each forward pass. For a detailed introduction to the memory consolidation module, please refer to our paper.
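The retrieve-then-fuse step described above can be illustrated with a minimal sketch. It is not the MemoAct implementation: the temporal transformer encoder is replaced here by a single scaled dot-product attention readout, the gate is a plain per-dimension sigmoid, and the consolidation module is stood in for by appending to a fixed-length FIFO buffer. All names (`retrieve`, `gated_fuse`, `step`, `gate_w`, `gate_b`) are hypothetical and chosen only for illustration.

```python
import math
from collections import deque


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def retrieve(query, bank):
    """Scaled dot-product attention of one query over stored features.

    Assumption: stands in for the temporal transformer encoder.
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, key) / scale for key in bank])
    return [
        sum(w * key[i] for w, key in zip(weights, bank))
        for i in range(len(query))
    ]


def gated_fuse(sensory, history, gate_w, gate_b):
    """Blend history and sensory features with a per-dimension sigmoid gate.

    Assumption: output = g * history + (1 - g) * sensory, where g is
    computed from the concatenation of both inputs.
    """
    concat = sensory + history
    fused = []
    for i in range(len(sensory)):
        g = 1.0 / (1.0 + math.exp(-(dot(gate_w[i], concat) + gate_b[i])))
        fused.append(g * history[i] + (1.0 - g) * sensory[i])
    return fused


def step(sensory, bank, gate_w, gate_b):
    """One forward pass: retrieve, fuse, then update the memory bank."""
    if bank:
        history = retrieve(sensory, list(bank))
        condition = gated_fuse(sensory, history, gate_w, gate_b)
    else:
        # Nothing stored yet: condition on the current sensory memory only.
        condition = list(sensory)
    # Placeholder for consolidation: a FIFO append, not the paper's module.
    bank.append(list(sensory))
    return condition
```

To try it, create `bank = deque(maxlen=4)` and call `step` once per timestep with a feature vector of dimension `d`, a `d × 2d` nested list for `gate_w`, and a length-`d` list for `gate_b`. The returned condition token is what an action decoder would then be conditioned on.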
To evaluate task state tracking and long-horizon retention, we introduce MemoryRTBench, which consists of four simulation tasks. We also benchmark MemoAct on the language-independent tasks of RMBench and validate it on two real-world tasks. Each task comprises multiple subtasks, and perceptually identical observations recur frequently across different states, so the policy must rely on historical context to disambiguate them. Their extended execution horizons additionally require recalling the initial states of objects, which demands robust long-term memory.
Sequential Bowl Placement
Doll Swap Placement
Dual-Arm Sequential Transfer and Return
Block Place and Return
Dual-Arm Interleaved Transfer and Return
Sequential Hammer Tap
Methods compared: MemoAct (ours), MVMP, SAMP, DP, and ACT (ACT does not appear in every comparison).
Despite these promising results, MemoAct has limitations. Because the sensory distillation module compresses an entire RGB image into a single token, some visual fidelity is inevitably lost, and MemoAct occasionally makes errors on tasks that are highly sensitive to fine-grained visual information. In future work, we plan to explore adaptive compression mechanisms that better balance memory storage efficiency with visual precision.
Put Back Block
Swap T
SHT Failure Case 1
SHT Failure Case 2
DSP Failure Case 1
DSP Failure Case 2
@misc{memoact,
title={MemoAct: Atkinson-Shiffrin-Inspired Memory-Augmented Visuomotor Policy for Robotic Manipulation},
author={Liufan Tan and Jiale Li and Gangshan Jing},
year={2026},
eprint={2603.18494},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2603.18494},
}