Inclusion Arena: A New Benchmark for LLM Performance – Prioritizing Real-World Usage

Large Language Models • AI Benchmarking • Enterprise AI • LLM Leaderboards • Inclusion Arena • Bradley-Terry Model • AI Evaluation
August 19, 2025
Viqus Verdict: 9
Real-World Validation
Media Hype 6/10
Real Impact 9/10

Article Summary

Inclusion AI’s ‘Inclusion Arena’ represents a significant shift in how Large Language Model (LLM) performance is evaluated. Recognizing the limits of traditional benchmarks, which rely on static datasets and fixed test environments, Inclusion AI has built a live leaderboard that integrates directly into AI-powered applications and gathers data from actual user interactions. The core innovation is its use of user preferences: users interact with multiple LLMs inside apps such as ‘Joyland’ and ‘T-Box’, then select the response they prefer without knowing which model generated it. This blind, in-app comparison mirrors real-world usage and yields a far more nuanced picture of a model’s strengths and weaknesses. Like Chatbot Arena, the system uses the Bradley-Terry model to turn pairwise preferences into a ranking. Because exhaustively comparing every pair becomes impractical as the number of models grows, Inclusion Arena adds a ‘placement match mechanism’ and ‘proximity sampling’ to prioritize the most informative comparisons. The initial experiment, running through July 2025, captured 501,003 comparisons and ranked Anthropic’s Claude 3.7 Sonnet and DeepSeek v3-0324 as the top performers. The project addresses the escalating difficulty enterprises face in evaluating the growing landscape of LLMs, offering a way to align model selection with demonstrable, real-world utility. This focus on user preference is a crucial step toward a more reliable and relevant evaluation system for enterprise AI deployments.
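To make the ranking mechanics concrete, here is a minimal sketch of a Bradley-Terry fit over pairwise win counts, using the classic MM update (Zermelo / Hunter). Under this model, the probability that model i is preferred over model j is p_i / (p_i + p_j). The model names and win counts below are illustrative placeholders, not Inclusion Arena’s actual data or code.

```python
# Minimal Bradley-Terry fit via the classic MM update (Zermelo / Hunter 2004).
# wins[i][j] = number of times model i was preferred over model j.
# All names and counts are illustrative, not Inclusion Arena data.

def fit_bradley_terry(wins, iters=500, tol=1e-9):
    n = len(wins)
    p = [1.0 / n] * n  # initial strengths, normalized
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i])  # times model i was preferred
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n) if j != i
            )
            new_p.append(total_wins / denom if denom else p[i])
        s = sum(new_p)
        new_p = [x / s for x in new_p]  # renormalize so strengths sum to 1
        if max(abs(a - b) for a, b in zip(p, new_p)) < tol:
            return new_p
        p = new_p
    return p

# Toy pairwise results for three hypothetical models.
models = ["model-a", "model-b", "model-c"]
wins = [
    [0, 60, 70],  # model-a beat model-b 60 times, model-c 70 times
    [40, 0, 55],
    [30, 45, 0],
]
for name, strength in zip(models, fit_bradley_terry(wins)):
    print(f"{name}: {strength:.3f}")
# P(model-a preferred over model-b) = s_a / (s_a + s_b)
```

The MM update converges to the maximum-likelihood strengths without gradients, which is why it remains a common choice for leaderboard-style rankings built on pairwise preference data.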

Key Points

  • Inclusion Arena prioritizes model performance based on real-world user preferences, moving beyond static benchmarks.
  • The system leverages the Bradley-Terry model, as in Chatbot Arena, to analyze pairwise comparisons efficiently as the number of LLMs grows (see the sampling sketch after this list).
  • Integration into AI-powered applications allows for continuous data collection, providing a dynamic and evolving understanding of model performance.
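
The article does not publish the exact placement-match or sampling algorithm, but the intuition behind proximity sampling can be sketched: weight candidate matchups toward models whose current ratings are close, since near-even comparisons carry the most information per user vote. Everything below (the softmax-over-gap weighting, the model names, the ratings) is an illustrative assumption, not Inclusion Arena’s implementation.

```python
import math
import random

# Hedged sketch of proximity sampling: weight candidate matchups toward
# models whose current ratings are close, since near-even comparisons are
# the most informative. The weighting rule and all ratings here are
# illustrative assumptions, not Inclusion Arena's published algorithm.

def sample_pair(ratings, temperature=50.0):
    names = list(ratings)
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    # Smaller rating gap -> larger weight -> more likely to be sampled.
    weights = [
        math.exp(-abs(ratings[a] - ratings[b]) / temperature)
        for a, b in pairs
    ]
    return random.choices(pairs, weights=weights, k=1)[0]

# Toy Elo-style ratings for hypothetical models.
ratings = {"model-a": 1510, "model-b": 1495, "model-c": 1320}
print(sample_pair(ratings))  # usually ("model-a", "model-b")
```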

Why It Matters

The rise of LLMs has left enterprises facing a rapidly expanding selection of models, and existing benchmarks often fail to capture the nuances of real-world usage. Inclusion Arena addresses this gap by incorporating user feedback directly into the evaluation process, which is vital for data-driven AI adoption: it lets enterprises choose models that are demonstrably optimized for their specific applications. The focus on efficiency, through methods like proximity sampling, is equally important for managing the computational cost of evaluating an ever-larger pool of models. This affects investment decisions, resource allocation, and the overall pace of AI innovation.
