Orchestration of Experts: The First-Principle Multi-Model System

Community Article · Published April 16, 2024

Leeroo Team: Majid Yazdani, Alireza Mohammadshahi, Ali Shaikh.

This article presents our multi-model LLM platform, which outperforms traditional single-LLM systems. Evaluated on the MMLU and GSM8k benchmarks, our system significantly outperforms both generic and domain-specific LLM experts. Additionally, our LeerooDedicated-Math-7b model is available on the Hugging Face Hub 🤗 for the open-source community.


Introduction

Over the past year, the rise of closed-source, general-purpose LLMs has been fueled by billions of dollars spent encoding vast amounts of human knowledge into models. These models have demonstrated remarkable success across a broad spectrum of tasks, becoming the standard for AI system prototyping. Tailoring them to specific enterprise applications has relied on prompt engineering and adding external knowledge to their input context. This approach, however, is unpredictable and requires extensive experimentation to achieve the desired outcomes, and there is no guarantee that techniques effective with one model iteration will carry over to subsequent versions. Furthermore, as regulatory frameworks tighten and data security concerns grow, individuals and enterprises are increasingly reluctant to share sensitive information with providers of closed-source models. The difficulty of steering these models' outputs, coupled with the significant inference costs of platforms like GPT-4, presents additional obstacles to their viability for business applications.

Meanwhile, the development of smaller, domain-specific models is becoming more accessible, and fine-tuning methodologies are maturing. Fine-tuning is the most common way enterprises teach LLMs their use cases. Yet, despite these advances, the limited knowledge of smaller fine-tuned models makes them brittle on tail use cases, which are rarely represented in fine-tuning training sets. This raises a pivotal question: can we intelligently aggregate the knowledge of the world's best models to not only match but exceed the capabilities of the leading closed-source models, with greater efficiency, while allowing enterprises to train their own dedicated models? This blog post explores the potential for achieving this ambitious goal.

Orchestration of Experts

We introduce our multi-model LLM building system, which produces AI systems that surpass the capabilities of conventional single-LLM systems. At the heart of our approach lies an LLM-based orchestrator trained to estimate the knowledge of the underlying LLM experts. Given a list of underlying experts, encompassing both generic and domain-specific models, the trained orchestrator predicts their performance on different queries without running their inference. A decoding mechanism then identifies the optimal expert based on performance, cost-effectiveness, and privacy criteria. To train the orchestrator, we run various models on a diverse set of inputs and evaluate their performance. This data lets the orchestrator effectively 'interview' these models offline and learn their capabilities, so that at inference time it can deploy the most suitable expert based on the insights gathered during training.
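To make the routing step concrete, here is a minimal sketch of the idea, not our production implementation; the Expert fields and the orchestrator.predict_quality call are hypothetical stand-ins for the trained performance estimator described above:

from dataclasses import dataclass

@dataclass
class Expert:
    name: str
    cost: float        # relative inference cost per query
    on_premise: bool   # True if the model can run privately

def route(query, experts, orchestrator, budget, private_only=False):
    # Apply cost and privacy constraints first.
    candidates = [e for e in experts
                  if e.cost <= budget and (e.on_premise or not private_only)]
    # The orchestrator estimates each expert's quality on this query
    # without actually running any of the experts.
    return max(candidates, key=lambda e: orchestrator.predict_quality(query, e.name))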

We evaluated our approach on the MMLU benchmark, which is particularly notable for its breadth and depth, covering 57 diverse domains. As underlying experts, we used open-source models of 7B, 13B, and 34B parameters from the Hugging Face Hub. We then employed a submodular selection algorithm, described in our paper, to identify the most synergistic subset of models to serve as the universe of experts.
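For intuition, a standard greedy approximation for this kind of submodular selection looks roughly like the sketch below; the value function is a hypothetical stand-in for the routed-performance objective defined in the paper, not its exact formulation:

def select_universe(models, k, value):
    # Greedy submodular maximization: repeatedly add the model with the
    # largest marginal gain in estimated routed performance.
    universe = []
    for _ in range(k):
        remaining = [m for m in models if m not in universe]
        if not remaining:
            break
        best = max(remaining, key=lambda m: value(universe + [m]) - value(universe))
        if value(universe + [best]) - value(universe) <= 0:
            break  # no remaining model improves the universe; stop early
        universe.append(best)
    return universe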
In the figure below, we compare our results with leading open-source and closed-source LLMs, namely GPT-4, Mixtral, and LLaMA-2. Using only open-source experts, the Leeroo model achieves 5.27% higher accuracy than Mixtral, the top-performing open-source LLM, at a comparable inference cost. Notably, it is competitive with GPT-3.5 while cutting costs by 73.3% and remaining fully open source. When GPT-4-turbo is added as one of the underlying experts, the Leeroo model matches GPT-4's performance while sending 50% of queries to open-source experts, reducing inference costs by nearly half. Interestingly, the Leeroo model surpasses GPT-4's performance by 0.24% while still directing 25% of queries to open-source experts.

To further investigate the source of improvement, the figure below shows the distribution of performance across 17 subcategories. A standout area of success is STEM domains, such as mathematics and computer science, where our models particularly excel. This is largely attributable to the incorporation of specialized small models (around 7B) fine-tuned by the community for mathematics and coding tasks. This analysis also highlights domains where experts are scarce, so future research can focus on improving these areas by developing effective domain-specific experts.

Note: Leeroo (mix) denotes the scenario where GPT-4 is included as one of the underlying experts alongside the Hugging Face open-source LLMs.

Leeroo Math 7B

One instantiation of our architecture is a dedicated model complemented by a closed-source generic model, e.g. GPT-4, to provide an optimal trade-off between ownership and quality. The Leeroo Math 7B model exemplifies this approach. Designed to address mathematical queries, it either generates a solution itself or, when necessary, defers to GPT-4 to fill gaps in its knowledge. On the GSM8k dataset, Leeroo Math 7B achieves an accuracy of 84.77% in a 5-shot setting, placing it among the top performers in its class and notably surpassing its base model, MetaMath 7B, which scores 68.84% on the same dataset (per the Hugging Face Open LLM Leaderboard). It accomplishes this while relying on GPT-4 for responses to half of the GSM8k questions.

With more training data, we can maintain or improve quality while progressively reducing the dependency on external, off-premise models like GPT-4. For example, by training on 50% more data, we could increase ownership (the share of queries answered by the dedicated model rather than GPT-4) by 7.4%.

Try Our Model 🤗

You can try LeerooDedicated-Math-7b on the Leeroo HF hub. Given a query, the model either generates the answer itself or emits the special tag <GPT4> to indicate the query should be routed to GPT-4.

In the following sample, the model generates the answer:

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("leeroo/LeerooDedicated-Math-7b", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("leeroo/LeerooDedicated-Math-7b")
device = model.device  # place inputs on the same device as the model
# the following question is answered by the leeroo expert
question = "Natalia sold clips to 48 of her friends in April,and then she sold half as many clips in May.How many clips did Natalia sell altogether in April and May?"
encodeds = tokenizer([question], return_tensors="pt")
model_inputs = encodeds['input_ids'].to(device)
generated_ids = model.generate(model_inputs, max_new_tokens=100, do_sample=False)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])
# Natalia sold 48 clips in April.\nIn May, she sold half as many clips as in April,
# so she sold 48/2 = 24 clips.\nAltogether, Natalia sold 48 + 24 = 72 clips in April and May.\n#### 72\nThe answer is: 72</s>

For the query below, the model routes the query to GPT-4, indicated by the special tag <GPT4>:

question = "James loves to go swimming and has to swim across a 20-mile lake.  He can swim at a pace of 2 miles per hour.  He swims 60% of the distance.  After that, he stops on an island and rests for half as long as the swimming time.  He then finishes the remaining distance while going half the speed.  How long did it take him to get across the lake?"
encodeds = tokenizer([question], return_tensors="pt")
model_inputs = encodeds['input_ids'].to(device)
generated_ids = model.generate(model_inputs, max_new_tokens=100, do_sample=False)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])
# <GPT4></s>
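
Putting the two behaviors together, a simple deployment wrapper might look like the following sketch; call_gpt4 is a hypothetical placeholder for your own GPT-4 client, not part of our released code:

def answer(question):
    encodeds = tokenizer([question], return_tensors="pt")
    model_inputs = encodeds['input_ids'].to(device)
    generated_ids = model.generate(model_inputs, max_new_tokens=100, do_sample=False)
    decoded = tokenizer.batch_decode(generated_ids)[0]
    if "<GPT4>" in decoded:
        # The dedicated model flagged a knowledge gap: hand off to GPT-4.
        return call_gpt4(question)  # hypothetical GPT-4 client, bring your own
    return decoded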

Learn More

🔍 For a deeper dive into our method and results, see our publication and repository.
🌍 Join the Leeroo community for further updates: LinkedIn, Discord, X, Website.

Citation

@misc{mohammadshahi2024leeroo,
    title={Leeroo Orchestrator: Elevating LLMs Performance Through Model Integration},
    author={Alireza Mohammadshahi and Ali Shaikh and Majid Yazdani},
    year={2024},
    eprint={2401.13979},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}