Discover, evaluate, and create benchmarks for LLMs
Broad knowledge and multi-task language understanding
Logical deduction, inference, and chain-of-thought
Programming, code generation, and software engineering
Mathematical reasoning and problem solving
Long-context understanding and retrieval
Visual understanding, captioning, and OCR
Audio transcription and speech recognition
Search relevance and information retrieval
Financial analysis and economic reasoning
Function calling and tool-use capabilities
Alignment, harmlessness, and safety evaluations
JSON, schemas, and structured data generation
Spreadsheet manipulation, formulas, and data analysis
Document generation, editing, formatting, and understanding
Slide generation, presentation design, and layout