With the rise of large language models (LLMs), efficiently scaling these models while keeping computational costs manageable has become increasingly important. The Mixture-of-Experts (MoE) technique addresses some of these challenges by routing each input to specialized “expert” modules, optimizing parameter usage and reducing inference costs. However, current MoE architectures face limitations when scaling to a larger number of experts. In response, Google DeepMind has unveiled a novel architecture known as Parameter Efficient Expert Retrieval (PEER), which can scale MoE models to millions of experts, opening up new possibilities for performance gains in LLMs.
The traditional transformer architecture used in LLMs consists of attention layers and feedforward (FFW) layers, with the latter accounting for a large share of the model’s parameters. Because the computational footprint of an FFW layer grows in direct proportion to its size, FFW layers become a bottleneck when scaling transformers under computational and memory constraints. MoE addresses this by replacing the dense FFW layer with sparsely activated expert modules, each containing a fraction of the parameters of the full layer and specializing in particular kinds of input. By routing each input to only a small subset of these experts, MoE increases model capacity without a corresponding increase in computational cost.
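As a rough illustration of this idea, the sketch below implements a top-k MoE feedforward layer in PyTorch: a small gating network scores all experts and only the k highest-scoring expert networks are actually run for each token. The class name and hyperparameters (`TopKMoE`, `d_hidden`, `k`) are illustrative placeholders, not taken from any particular MoE implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal top-k mixture-of-experts feedforward layer (illustrative sketch)."""

    def __init__(self, d_model, d_hidden, num_experts, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts)   # gating network scores every expert
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                               # x: (num_tokens, d_model)
        weights, idx = self.router(x).topk(self.k, dim=-1)  # keep only the top-k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):                      # run only the selected experts
            for e in idx[:, slot].unique().tolist():
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out

# Only k of the num_experts expert networks run for any given token.
layer = TopKMoE(d_model=64, d_hidden=256, num_experts=8, k=2)
y = layer(torch.randn(4, 64))
```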
Research has shown that the optimal number of experts in an MoE model depends on factors such as the number of training tokens and the compute budget. High-granularity MoE models, those with a larger number of smaller experts, have demonstrated performance gains, particularly when combined with increased model size and training data. Greater granularity also makes it easier to learn new knowledge and adapt to evolving data streams, a potential benefit for language models deployed in dynamic environments.
DeepMind’s PEER architecture represents a significant advancement in scaling MoE models to a larger number of experts. By replacing fixed routers with a learned index, PEER efficiently routes input data to a vast pool of experts without compromising speed. Unlike previous MoE architectures with large experts, PEER utilizes tiny experts with a single neuron in the hidden layer, allowing for enhanced knowledge transfer and parameter efficiency. Additionally, PEER employs a multi-head retrieval approach similar to the multi-head attention mechanism in transformer models, further optimizing model performance.
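A minimal sketch of this design is shown below, assuming a PEER-style layer in PyTorch: multi-head queries are matched against two product sub-key tables, the top-scoring experts are retrieved from an n×n grid, and each retrieved expert is a single hidden neuron (one down-projection and one up-projection vector). Class and parameter names (`PEERLayer`, `n_sub_keys`, `d_key`) are placeholders, and the hyperparameters are illustrative rather than the values used by DeepMind.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PEERLayer(nn.Module):
    """Sketch of a PEER-style layer: product-key retrieval over a large pool of
    single-neuron experts, with multi-head retrieval (hyperparameters are illustrative)."""

    def __init__(self, d_model, num_heads=4, n_sub_keys=32, d_key=64, k=8):
        super().__init__()
        self.num_experts = n_sub_keys ** 2         # pool size grows as n^2 (e.g. 32*32 = 1024)
        self.num_heads, self.k, self.n = num_heads, k, n_sub_keys
        # One query per head; each query is split in two halves for product-key retrieval.
        self.query = nn.Linear(d_model, num_heads * d_key)
        self.sub_keys1 = nn.Parameter(torch.randn(n_sub_keys, d_key // 2))
        self.sub_keys2 = nn.Parameter(torch.randn(n_sub_keys, d_key // 2))
        # Each expert is a single hidden neuron: one down- and one up-projection vector.
        self.down = nn.Embedding(self.num_experts, d_model)
        self.up = nn.Embedding(self.num_experts, d_model)

    def forward(self, x):                          # x: (num_tokens, d_model)
        B = x.shape[0]
        q = self.query(x).view(B, self.num_heads, -1)
        q1, q2 = q.chunk(2, dim=-1)                # split each query for the product keys
        # Score each half against its sub-key table: O(n) comparisons instead of O(n^2).
        s1 = q1 @ self.sub_keys1.T                 # (B, heads, n)
        s2 = q2 @ self.sub_keys2.T
        t1, i1 = s1.topk(self.k, dim=-1)
        t2, i2 = s2.topk(self.k, dim=-1)
        # Cartesian product of the two shortlists -> k*k candidate experts, keep the top k.
        cand = (t1.unsqueeze(-1) + t2.unsqueeze(-2)).view(B, self.num_heads, -1)
        scores, flat = cand.topk(self.k, dim=-1)
        row = torch.gather(i1, -1, flat // self.k)
        col = torch.gather(i2, -1, flat % self.k)
        idx = row * self.n + col                   # expert ids in the full n-by-n grid
        w = F.softmax(scores, dim=-1)
        # Run the retrieved single-neuron experts and sum over heads.
        h = torch.einsum("bd,bhkd->bhk", x, self.down(idx))
        h = F.gelu(h) * w
        return torch.einsum("bhk,bhkd->bd", h, self.up(idx))

layer = PEERLayer(d_model=128)
y = layer(torch.randn(2, 128))   # each token touches only k experts per retrieval head
```

The product-key structure is what makes the pool scalable: scoring two sub-key tables of size n is far cheaper than scoring all n² experts directly, so retrieval cost grows roughly with the square root of the number of experts.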
PEER’s focus on parameter efficiency aligns with techniques such as parameter-efficient fine-tuning (PEFT), which aim to minimize the number of active parameters during model fine-tuning. By reducing the number of active parameters in the MoE layer, PEER reduces computation and activation memory consumption during both pre-training and inference stages. This emphasis on parameter efficiency not only improves model performance but also provides opportunities for dynamic adaptation and knowledge enhancement in LLMs, potentially transforming the way large language models are developed and utilized.
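To make the parameter-efficiency argument concrete, a back-of-the-envelope calculation (with made-up numbers, not figures from the paper) contrasts the parameters stored in a PEER-style expert pool with the parameters actually activated per token:

```python
# Rough parameter arithmetic for a PEER-style layer (illustrative numbers only).
d_model   = 1024            # model width
N_experts = 1_000_000       # total single-neuron experts in the pool
heads, k  = 8, 16           # retrieval heads and experts retrieved per head

per_expert    = 2 * d_model                 # one down- and one up-projection vector
total_params  = N_experts * per_expert      # ~2.0B parameters stored in the layer
active_params = heads * k * per_expert      # ~262K parameters used per token

print(f"total:  {total_params:,}")
print(f"active: {active_params:,}  ({active_params / total_params:.4%} of the pool)")
```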
Researchers evaluated PEER on several language modeling benchmarks, comparing it against transformers with dense feedforward layers and against other MoE architectures. The results indicate that PEER models achieve a better performance-compute tradeoff, reaching lower perplexity at equivalent computational budgets. Moreover, increasing the number of experts in a PEER model reduces perplexity further, underscoring the scalability and efficiency of the architecture. These findings challenge the common view that MoE models pay off only with a limited number of large experts, and they lay the groundwork for broader adoption of scalable expert retrieval mechanisms in future language models.