Benchmark Battle: Qwen 3.6 vs. Claude Opus 4.7 for Creative Generative Tasks

Qwen3.6-35B-A3B Claude Opus 4.7 pelican benchmark generative AI large language models SVG generation

April 16, 2026

Source: Simon Willison

Efficient Scale Wins Over Raw Power

Media Hype 3/10

Real Impact 5/10

What is the Viqus Verdict?

We evaluate each news story based on its real impact versus its media hype to offer a clear and objective perspective.

AI Analysis:

Moderate signal value regarding the growing viability of local, optimized models, though the benchmark itself remains niche and the comparisons are anecdotal.

Article Summary

Using the idiosyncratic 'pelican on a bicycle' benchmark, Simon Willison compared the output of Qwen3.6-35B-A3B and Anthropic's Claude Opus 4.7. While noting that the benchmark is intentionally absurd, the author finds a direct, albeit loose, correlation between generative quality and general usefulness. The comparison shows that while Opus 4.7 is state-of-the-art, the smaller, quantized Qwen model currently wins on specific SVG generation tasks, suggesting that resource-constrained local models might outperform massive, proprietary cloud APIs for certain types of creative output. The piece concludes with a reflection on the difficulty of benchmarking LLMs, asserting that the utility of these models is no longer simply tied to the absurdity of the comparison task.

Key Points

Qwen 3.6-35B-A3B, run locally on consumer hardware, currently produced better results for SVG illustration than Anthropic's powerful Claude Opus 4.7.
The article underscores the difficulty in establishing a reliable 'utility' metric for modern LLMs, as the connection between benchmark quality and real-world application is weakening.
While proprietary cloud models (like Opus 4.7) set the high bar, smaller, quantized models are proving surprisingly effective for specific, creative, resource-bound outputs.

Why It Matters

This post is not a structural shift, but it is highly insightful for developers and technical professionals. It challenges the prevailing assumption that the largest, most expensive, proprietary models (like Claude Opus) are inherently superior for all tasks. Instead, it validates the growing trend toward powerful, optimized, and locally run smaller models (SLMs). Companies building applications should pay closer attention to the performance-per-resource-unit of models like Qwen over merely chasing the highest benchmark score from the largest available API. This shifts focus from pure scale to efficiency and deployability.

Benchmark Battle: Qwen 3.6 vs. Claude Opus 4.7 for Creative Generative Tasks

What is the Viqus Verdict?

Article Summary

Key Points

Why It Matters

You might also be interested in

Chinese Women Embrace AI Boyfriends: A New Era of Digital Companionship

OpenAI Acquires Convogo Team – A Talent Acquisition Strategy

Reinforcement Learning Drives AI Progress, But With a Crucial Caveat