Robotics - MemoryVLA Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation

Author: ernestasposkus
Published: Wed 27 Aug 2025
Episode Link: https://www.paperledge.com/e/robotics-memoryvla-perceptual-cognitive-memory-in-vision-language-action-models-for-robotic-manipulation/

Alright learning crew, Ernis here, ready to dive into some seriously cool robotics research that's all about giving robots a better memory! We're talking about a new system called MemoryVLA, and it's inspired by how our brains work.

You know how sometimes you need to remember what you were just doing – like, did I turn off the stove? That's your working memory. And then there are those longer-term memories, like your awesome vacation last year. Well, this research taps into both those types of memory to help robots perform complex tasks.

See, most robots struggle with tasks that take a while, especially when things change along the way. It's like trying to follow a recipe where the instructions keep changing – super frustrating, right? That's because traditional robot "brains" often forget what happened just a few steps ago. They lack that crucial temporal context.

In short: traditional Vision-Language-Action (VLA) models used in robotics tend to act on what they see right now, so they lose track of earlier steps and struggle with long-horizon tasks that depend on what happened before.

MemoryVLA tackles this with a clever system that mimics human cognition. Think of it as having two memory systems for the robot:

Working Memory: This is like the robot's short-term notepad. It keeps track of what's happening right now, the immediate task at hand.

Memory Bank: This is the robot's long-term storage. It stores both specific details ("I picked up the red block") and general knowledge ("red blocks are usually on the left") from past experiences.

This Memory Bank isn't just a static record. It's constantly being updated with new information from the working memory, and it's smart about it too, getting rid of redundancies to stay efficient. It's like organizing your notes after a meeting, keeping the important stuff and tossing out the rest.
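To make that "organizing your notes" idea concrete, here's a minimal sketch of a memory bank that stays within a fixed size by merging its two most similar neighboring entries. Everything in it is an assumption for illustration: the `MemoryBank` class, the `capacity` parameter, cosine similarity as the redundancy measure, and plain averaging as the merge rule are not taken from the paper, which may consolidate memories differently.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class MemoryBank:
    """Hypothetical fixed-capacity store that merges redundant entries."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.entries = []  # oldest entry first

    def write(self, vector):
        """Append a new entry, then consolidate if the bank is over capacity."""
        self.entries.append(list(vector))
        if len(self.entries) > self.capacity:
            self._merge_most_similar_neighbors()

    def _merge_most_similar_neighbors(self):
        # Assumed rule: the adjacent pair with the highest similarity is the
        # most redundant, so replace the two entries with their average.
        best = max(
            range(len(self.entries) - 1),
            key=lambda i: cosine(self.entries[i], self.entries[i + 1]),
        )
        first, second = self.entries[best], self.entries[best + 1]
        merged = [(x + y) / 2 for x, y in zip(first, second)]
        self.entries[best:best + 2] = [merged]


bank = MemoryBank(capacity=3)
for step in [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]:
    bank.write(step)
print(len(bank.entries))  # the bank never grows past its capacity
```

In this toy run, the first two vectors point in nearly the same direction, so they are the ones that get folded together when the fourth entry arrives. The other two stay untouched.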

So, how does this all come together? First, a "brain" takes in visual information (like camera images) and converts it into tokens (small, meaningful chunks of data) that make up the working memory. The system then decides which parts of the working memory are worth keeping and stores them in the Memory Bank. When the robot needs to make a decision, it pulls relevant memories from the bank and uses them, along with the current information, to figure out the next best action.
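The "pull relevant memories, then combine them with what's happening now" step can be sketched in a few lines. This is a toy version under stated assumptions: the function names `retrieve` and `fuse`, the dot-product scoring with softmax weights, and the fixed `gate` value are all hypothetical choices for illustration, not the paper's actual retrieval or fusion mechanism.

```python
import math


def retrieve(query, bank):
    """Return a softmax-weighted average of bank entries.

    Each entry is weighted by its dot product with the query, so memories
    that resemble the current situation contribute the most.
    """
    if not bank:
        return list(query)  # nothing stored yet: fall back to the query itself
    scores = [sum(q * m for q, m in zip(query, entry)) for entry in bank]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [
        sum(w * entry[i] for w, entry in zip(weights, bank)) / total
        for i in range(len(query))
    ]


def fuse(current, recalled, gate=0.5):
    """Blend current features with recalled memory; gate=1.0 keeps only the current ones."""
    return [gate * c + (1 - gate) * r for c, r in zip(current, recalled)]


# Toy example: the current observation resembles the first stored memory.
stored = [[10.0, 0.0], [0.0, 10.0]]
now = [1.0, 0.0]
context = fuse(now, retrieve(now, stored))
print(context)
```

A real system would learn the gate rather than hard-code it, and would pass the fused result on to whatever module picks the next action.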

Imagine a robot learning to make a sandwich. It uses its working memory to remember what ingredient it just added, and its memory bank to recall the proper order of ingredients and how to spread mustard without making a mess. MemoryVLA then uses what the paper calls a memory-conditioned diffusion action expert to produce temporally aware action sequences. In other words, the actions are generated with the recalled history in mind, so the robot can figure out what needs to be done next and in what order.

"MemoryVLA, a Cognition-Memory-Action framework for long-horizon robotic manipulation."

The researchers tested MemoryVLA on a bunch of different robots doing all sorts of tasks, both in simulation and in the real world. And guess what? It crushed the competition! It was way better at completing long, complicated tasks than robots using older systems. In some cases, it improved performance by over 25%!

This is huge because it means we're getting closer to robots that can truly understand and adapt to changing situations, making them much more useful in all sorts of applications.

Why does this matter to you?

Future Robot Owners: Imagine a robot that can actually help you around the house, learning your preferences and remembering where you left your keys.

Engineers/Researchers: This research provides a powerful new framework for building more intelligent and capable robots.

Anyone Curious About AI: MemoryVLA is a great example of how we can draw inspiration from the human brain to improve artificial intelligence.

So, here are a few things that really got me thinking:

How far away are we from robots that can learn new tasks simply by watching us, like learning a new dance or cooking a new dish?

Could a system like MemoryVLA eventually be used to help people with memory problems, like Alzheimer's disease?

What are the ethical implications of giving robots such advanced memory capabilities?

I'm super excited to see where this research leads us. It's a big step towards creating robots that are not just tools, but true collaborators. What do you think, learning crew? Let me know your thoughts!

Credit to paper authors: Hao Shi, Bin Xie, Yingfei Liu, Lin Sun, Fengrong Liu, Tiancai Wang, Erjin Zhou, Haoqiang Fan, Xiangyu Zhang, Gao Huang
