As companies continue to adopt large language models (LLMs) in various applications, one of the key challenges they face is improving the models’ factual knowledge and reducing hallucinations. In a new paper, researchers at Meta AI propose “scalable memory layers,” which could be one of several possible solutions to this problem.
Scalable memory layers add more parameters to LLMs to increase their learning capacity without requiring additional compute. The architecture is useful for applications where you can spare extra memory for factual knowledge but also want the inference speed of nimbler models.
Dense and memory layers
Traditional language models use “dense layers” to encode large amounts of information in their parameters. In dense layers, all parameters are used at full capacity and are mostly activated at the same time during inference. Dense layers can learn complex functions, but scaling them up requires additional compute and energy.
In contrast, for simple factual knowledge, much simpler layers with associative memory architectures would be more efficient and interpretable. That’s what memory layers do: they use sparse activations and key-value lookup mechanisms to encode and retrieve knowledge. Sparse layers take up more memory than dense layers but only use a small portion of their parameters at a time, which makes them much more computationally efficient.
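To make the mechanism concrete, here is a minimal, illustrative sketch of a key-value memory layer in PyTorch. It is not the architecture from the Meta paper, which uses more sophisticated key lookups and far larger tables; the sizes, names and top-k selection below are assumptions chosen for readability.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMemoryLayer(nn.Module):
    """Illustrative key-value memory layer: each token's hidden state acts as
    a query, the top-k most similar keys are selected, and the output is a
    weighted sum of the corresponding values. Only k of the num_keys value
    vectors are read per token, so activations stay sparse."""

    def __init__(self, dim: int, num_keys: int = 4096, topk: int = 8):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(num_keys, dim) / dim ** 0.5)
        self.values = nn.Embedding(num_keys, dim)  # large, sparsely accessed table
        self.topk = topk

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim) hidden states acting as queries
        scores = x @ self.keys.t()                   # (batch, seq, num_keys)
        top_scores, top_idx = scores.topk(self.topk, dim=-1)
        weights = F.softmax(top_scores, dim=-1)      # normalize over selected keys
        selected = self.values(top_idx)              # (batch, seq, topk, dim)
        return (weights.unsqueeze(-1) * selected).sum(dim=-2)

layer = SimpleMemoryLayer(dim=64)
out = layer(torch.randn(2, 10, 64))   # -> (2, 10, 64)
```

The key property is that only the `topk` selected value vectors are read per token, so the table can grow very large while per-token compute stays small.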
Memory layers have existed for several years but are rarely used in modern deep learning architectures, partly because they are not optimized for current hardware accelerators.
Current frontier LLMs typically use some form of “mixture of experts” (MoE) architecture, which relies on a mechanism vaguely similar to memory layers. MoE models are composed of many smaller expert components that specialize in specific tasks. At inference time, a routing mechanism determines which experts are activated based on the input sequence. PEER, an architecture recently developed by Google DeepMind, extends MoE to millions of experts, providing more granular control over which parameters are activated during inference.
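For comparison, here is a toy top-k MoE block showing the routing idea. The expert count, router design and top-k value are illustrative and not tied to any particular production model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    """Toy mixture-of-experts block: a linear router scores all experts,
    each token is sent to its top-k experts, and their outputs are combined
    with the router weights. Only k experts run per token."""

    def __init__(self, dim: int, num_experts: int = 8, topk: int = 2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
             for _ in range(num_experts)]
        )
        self.topk = topk

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, dim) flattened token representations
        gate_scores = self.router(x)                          # (tokens, num_experts)
        top_scores, top_idx = gate_scores.topk(self.topk, dim=-1)
        weights = F.softmax(top_scores, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.topk):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e                  # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out

moe = TinyMoE(dim=32)
y = moe(torch.randn(16, 32))   # -> (16, 32)
```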
Upgrading Memory Layers
Memory layers are compute-light but memory-heavy, which presents specific challenges for today’s hardware and software infrastructures. In their paper, the Meta researchers propose several modifications that address these challenges and enable their widespread use.
First, the researchers configured the memory layers for parallelization, distributing them across multiple GPUs to store millions of key-value pairs without changing other layers of the model. They also implemented a special CUDA kernel to handle high-memory-bandwidth operations, and developed a parameter-sharing mechanism that supports a single set of memory parameters across multiple memory layers within a model. This means that the keys and values used for lookups are shared across layers.
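A rough sketch of the parameter-sharing idea, leaving out the cross-GPU sharding and the custom CUDA kernels: several memory layers consult a single shared key-value table, and only a small query projection is layer-specific. The class and attribute names here are hypothetical, not from the paper’s code.

```python
import torch
import torch.nn as nn

class SharedMemoryBank(nn.Module):
    """One key/value table reused by every memory layer in the model."""
    def __init__(self, dim: int, num_keys: int = 4096):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(num_keys, dim) / dim ** 0.5)
        self.values = nn.Embedding(num_keys, dim)

class MemoryLayer(nn.Module):
    """Memory layer that looks up the shared bank; only its query
    projection is layer-specific, so adding layers adds few parameters."""
    def __init__(self, dim: int, bank: SharedMemoryBank, topk: int = 8):
        super().__init__()
        self.bank = bank                       # shared, not duplicated per layer
        self.query_proj = nn.Linear(dim, dim)  # per-layer parameters stay small
        self.topk = topk

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q = self.query_proj(x)
        scores = q @ self.bank.keys.t()
        top_scores, top_idx = scores.topk(self.topk, dim=-1)
        weights = torch.softmax(top_scores, dim=-1)
        return (weights.unsqueeze(-1) * self.bank.values(top_idx)).sum(dim=-2)

bank = SharedMemoryBank(dim=64)
layers = nn.ModuleList([MemoryLayer(64, bank) for _ in range(3)])  # 3 layers, 1 table
```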
These modifications make it possible to implement memory layers within LLMs without slowing down the model.
“Memory layers, with their sparse activations, complement dense networks well, providing increased capacity for knowledge acquisition while being computationally lightweight,” the researchers write. “They can be scaled efficiently and offer practitioners an interesting new direction for trading off memory and computation.”
To test the memory layers, the researchers modified Llama models by replacing one or more dense layers with a shared memory layer. They compared the memory-enhanced models to dense LLMs as well as MoE and PEER models on several tasks, including factual question answering, scientific and common-sense world knowledge, and coding.
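As a rough illustration of the replacement step (not the authors’ actual code), one might swap the feed-forward sublayer of selected transformer blocks for a shared memory layer roughly like this; the `layers` and `ffn` attribute names and the chosen indices are assumptions, since they vary across codebases.

```python
import torch.nn as nn

def replace_ffn_with_memory(model: nn.Module, memory_layer: nn.Module,
                            layer_indices=(4, 8, 12)):
    """Hypothetical helper: replace the feed-forward sublayer of the chosen
    transformer blocks with one shared memory layer. Assumes the model
    exposes its blocks as `model.layers`, each with an `ffn` attribute."""
    for i in layer_indices:
        model.layers[i].ffn = memory_layer  # the same shared module fills every slot
    return model

# Usage (hypothetical): model = replace_ffn_with_memory(model, MemoryLayer(...))
```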

Their results show that memory models improve significantly over dense baselines and rival models that use two to four times more compute. They also match the performance of MoE models with the same compute budget and parameter count. The models’ performance is particularly notable on tasks that require factual knowledge. For example, on factual question answering, a memory model with 1.3 billion parameters approaches the performance of Llama-2-7B, which was trained on twice as many tokens and 10 times as much compute.
Additionally, the researchers found that the benefits of memory models remain consistent as models grow, with experiments scaling from 134 million to 8 billion parameters.
“Given these results, we strongly advocate that memory layers be integrated into all next-generation AI architectures,” the researchers write, while adding that much remains to be done. “In particular, we hope that new learning methods can be developed to push the effectiveness of these layers even further, allowing for less forgetting, fewer hallucinations and continual learning.”