Inclusion Arena: A New Benchmark for LLM Performance – Prioritizing Real-World Usage

Large Language Models • AI Benchmarking • Enterprise AI • LLM Leaderboards • Inclusion Arena • Bradley-Terry Model • AI Evaluation
August 19, 2025
Viqus Verdict: 9
Real-World Validation
Media Hype 6/10
Real Impact 9/10

Article Summary

Inclusion AI’s ‘Inclusion Arena’ represents a significant shift in how Large Language Model (LLM) performance is evaluated. Recognizing the limits of traditional benchmarks, which rely on static datasets and fixed test environments, Inclusion AI has built a live leaderboard that integrates directly into AI-powered applications and gathers data from actual user interactions. The core innovation is its use of user preferences: users interact with multiple LLMs inside apps such as ‘Joyland’ and ‘T-Box’, then select the response they prefer without knowing which model generated it. This blind, in-app comparison mirrors real-world usage and yields a far more nuanced picture of a model’s strengths and weaknesses. Like Chatbot Arena, the system uses the Bradley-Terry model to turn pairwise preferences into a ranking. Because exhaustively comparing every pair becomes impractical as the number of models grows, Inclusion Arena adds a ‘placement match mechanism’ and ‘proximity sampling’ to prioritize the most informative comparisons. The initial experiment, running through July 2025, captured 501,003 comparisons and ranked Anthropic’s Claude 3.7 Sonnet and DeepSeek v3-0324 as the top performers. The project addresses the escalating difficulty enterprises face in evaluating the growing landscape of LLMs, offering a way to align model selection with demonstrable, real-world utility. This focus on user preference is a crucial step toward a more reliable and relevant evaluation system for enterprise AI deployments.
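To make the ranking mechanics concrete, here is a minimal sketch of a Bradley-Terry fit over pairwise win counts, using the classic MM update (Zermelo / Hunter). Under this model, the probability that model i is preferred over model j is p_i / (p_i + p_j). The model names and win counts below are illustrative placeholders, not Inclusion Arena’s actual data or code.

```python
# Minimal Bradley-Terry fit via the classic MM update (Zermelo / Hunter 2004).
# wins[i][j] = number of times model i was preferred over model j.
# All names and counts are illustrative, not Inclusion Arena data.

def fit_bradley_terry(wins, iters=500, tol=1e-9):
    n = len(wins)
    p = [1.0 / n] * n  # initial strengths, normalized
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i])  # times model i was preferred
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n) if j != i
            )
            new_p.append(total_wins / denom if denom else p[i])
        s = sum(new_p)
        new_p = [x / s for x in new_p]  # renormalize so strengths sum to 1
        if max(abs(a - b) for a, b in zip(p, new_p)) < tol:
            return new_p
        p = new_p
    return p

# Toy pairwise results for three hypothetical models.
models = ["model-a", "model-b", "model-c"]
wins = [
    [0, 60, 70],  # model-a beat model-b 60 times, model-c 70 times
    [40, 0, 55],
    [30, 45, 0],
]
for name, strength in zip(models, fit_bradley_terry(wins)):
    print(f"{name}: {strength:.3f}")
# P(model-a preferred over model-b) = s_a / (s_a + s_b)
```

The MM update converges to the maximum-likelihood strengths without gradients, which is why it remains a common choice for leaderboard-style rankings built on pairwise preference data.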

Key Points

  • Inclusion Arena prioritizes model performance based on real-world user preferences, moving beyond static benchmarks.
  • The system leverages the Bradley-Terry model, as in Chatbot Arena, to analyze pairwise comparisons efficiently as the number of LLMs grows (see the sampling sketch after this list).
  • Integration into AI-powered applications allows for continuous data collection, providing a dynamic and evolving understanding of model performance.
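
The article does not publish the exact placement-match or sampling algorithm, but the intuition behind proximity sampling can be sketched: weight candidate matchups toward models whose current ratings are close, since near-even comparisons carry the most information per user vote. Everything below (the softmax-over-gap weighting, the model names, the ratings) is an illustrative assumption, not Inclusion Arena’s implementation.

```python
import math
import random

# Hedged sketch of proximity sampling: weight candidate matchups toward
# models whose current ratings are close, since near-even comparisons are
# the most informative. The weighting rule and all ratings here are
# illustrative assumptions, not Inclusion Arena's published algorithm.

def sample_pair(ratings, temperature=50.0):
    names = list(ratings)
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    # Smaller rating gap -> larger weight -> more likely to be sampled.
    weights = [
        math.exp(-abs(ratings[a] - ratings[b]) / temperature)
        for a, b in pairs
    ]
    return random.choices(pairs, weights=weights, k=1)[0]

# Toy Elo-style ratings for hypothetical models.
ratings = {"model-a": 1510, "model-b": 1495, "model-c": 1320}
print(sample_pair(ratings))  # usually ("model-a", "model-b")
```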

Why It Matters

The rise of LLMs has left enterprises facing a rapidly expanding selection of models, and existing benchmarks often fail to capture the nuances of real-world usage. Inclusion Arena addresses this gap by incorporating user feedback directly into the evaluation process, which is vital for data-driven AI adoption: it lets enterprises choose models that are demonstrably optimized for their specific applications. The focus on efficiency, through methods like proximity sampling, is equally important for managing the computational cost of evaluating an ever-larger pool of models. This affects investment decisions, resource allocation, and the overall pace of AI innovation.
