Inclusion Arena: A New Benchmark for LLM Performance – Prioritizing Real-World Usage
Viqus Verdict: 9

What is the Viqus Verdict?
We evaluate each news story based on its real impact versus its media hype to offer a clear and objective perspective.
AI Analysis:
While the initial data set is limited, the core concept of evaluating models through real-world usage represents a significant advance. The potential for scale and the validation of user preferences are both high, suggesting a greater long-term impact than current benchmarking approaches.
Article Summary
Inclusion AI’s ‘Inclusion Arena’ represents a significant shift in how Large Language Model (LLM) performance is evaluated. Recognizing the limitations of traditional benchmarks, which rely on static datasets and fixed testing environments, Inclusion AI has developed a live leaderboard that integrates directly into AI-powered applications and gathers data from actual user interactions. The core innovation is its use of user preferences: users interact with multiple LLMs within apps like ‘Joyland’ and ‘T-Box’ and then select the response they prefer, without knowing which model generated it. This approach mirrors real-world usage scenarios and provides a far more nuanced understanding of a model’s strengths and weaknesses.

The system uses the Bradley-Terry model, as Chatbot Arena does, to analyze pairwise comparisons, acknowledging that exhaustive head-to-head comparisons become infeasible as the number of models grows. Inclusion Arena therefore incorporates a ‘placement match mechanism’ and ‘proximity sampling’ to prioritize comparisons and maximize information gain efficiently. The initial experiment, running up to July 2025, captured 501,003 comparisons and identified Anthropic’s Claude 3.7 Sonnet and DeepSeek v3-0324 as top performers. The project addresses the escalating difficulty enterprises face in evaluating the growing landscape of LLMs, offering a method to align model selection with demonstrable, real-world utility. This focus on user preference is a crucial step toward building a more reliable and relevant system for enterprise AI deployments.

Key Points
- Inclusion Arena prioritizes model performance based on real-world user preferences, moving beyond static benchmarks.
- The system leverages the Bradley-Terry model, similar to Chatbot Arena, to efficiently analyze pairwise comparisons and handle the increasing number of LLMs.
- Integration into AI-powered applications allows for continuous data collection, providing a dynamic and evolving understanding of model performance.
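To make the mechanics above concrete, the sketch below fits Bradley-Terry strengths from a matrix of pairwise preference counts using the standard iterative (MM) update, and includes a simple stand-in for proximity sampling that ranks model pairs by how close their estimated strengths are. Note that Inclusion Arena’s actual placement-match and proximity-sampling algorithms are not detailed in this summary, so the model names, win counts, and the `closest_pairs` heuristic here are purely illustrative assumptions:

```python
import itertools

def bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths from pairwise wins via the standard
    MM (minorization-maximization) update.
    wins[i][j] = number of times model i's response was preferred over model j's."""
    n = len(wins)
    p = [1.0] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i])
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            new_p.append(total_wins / denom if denom > 0 else p[i])
        s = sum(new_p)
        p = [x * n / s for x in new_p]  # renormalize to keep the scale stable
    return p

def closest_pairs(strengths, names, k=1):
    """Illustrative proximity-sampling stand-in: rank model pairs by how close
    their estimated strengths are, so near-tied matchups (the most informative
    comparisons under Bradley-Terry) are sampled first."""
    pairs = itertools.combinations(range(len(strengths)), 2)
    ranked = sorted(pairs, key=lambda ij: abs(strengths[ij[0]] - strengths[ij[1]]))
    return [(names[i], names[j]) for i, j in ranked[:k]]

# Hypothetical preference counts among three models (not real Arena data)
names = ["model-a", "model-b", "model-c"]
wins = [
    [0, 30, 40],   # model-a preferred 30x over model-b, 40x over model-c
    [20, 0, 25],
    [10, 15, 0],
]
strengths = bradley_terry(wins)
# Fitted probability that model-a's response is preferred over model-b's:
p_a_over_b = strengths[0] / (strengths[0] + strengths[1])
```

Under this model, the probability that model i beats model j is p_i / (p_i + p_j), which is what allows a leaderboard to be recovered from sparse pairwise data rather than exhaustive all-pairs testing.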

