AI Benchmarks
Compare AI model performance across industry-standard benchmarks.
Arena ELO: Chatbot Arena crowdsourced Elo rating. 27 models tested; top: Claude Opus 4.6 (1504.0). (See the Elo update sketch below.)
HumanEval: Code generation benchmark. 4 models tested; top: o3 (95.2).
MATH: Competition mathematics. 3 models tested; top: DeepSeek R1 (97.3).
MMLU: Massive Multitask Language Understanding. 8 models tested; top: DeepSeek R1 (90.8).
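For context on the Arena ELO entry, an Elo rating is built from pairwise human votes: after each head-to-head comparison, the winner gains points in proportion to how unexpected the win was. The Python sketch below shows the standard Elo update rule only; the K-factor of 32 and the 1000-point starting ratings are illustrative assumptions, not the parameters Chatbot Arena actually uses.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected win probability of model A against model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0) -> tuple[float, float]:
    """Apply one head-to-head result; score_a is 1.0 for an A win, 0.0 for a loss, 0.5 for a tie."""
    expected_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b


# Illustrative only: both models start at 1000, then model A wins a single vote.
a, b = 1000.0, 1000.0
a, b = elo_update(a, b, score_a=1.0)
print(a, b)  # 1016.0 984.0
```

In the example, an even matchup makes a win exactly half-expected, so the winner gains 16 points and the loser drops 16; repeated over many crowdsourced votes, these updates converge to ratings like the ones listed above.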