📊
Evals & Observability
86 tools
Evaluation harnesses, LLM tracing, monitoring, and prompt/agent observability. Dashboards, judges, replay tools, and regression suites for LLM apps.
OpenAI Agents SDK
OpenAI
Build agentic workflows with OpenAI.
2 stories
ARC Prize
ARC Prize Foundation
Test today's ARC-AGI puzzle.
1 story
Baidu Qianfan
Baidu
Baidu AI Cloud's large-model platform
1 story
Braintrust
Braintrust Data, Inc.
The AI observability platform for building quality AI products
1 story
Context.ai
Explore Interfaces Inc.
Build, run, and improve AI agents
1 story
DFlash
DFlash
DFlash software product
1 story
HealthBench Professional
OpenAI
Healthcare benchmark from OpenAI
1 story
ProgramBench
Meta
Can language models rebuild programs from scratch?
1 story
Vals AI
Vals AI
Benchmark Generative AI for Enterprise Applications.
1 story
Weights & Biases
Weights and Biases, LLC
Machine learning platform
1 story
agent-install
Splunk
Agent installation
0 stories
Agentation
Dip
Visual feedback for AI coding agents
0 stories
AGNTCY
Outshift by Cisco
Open-source infrastructure for AI agents
0 stories
AI21 Maestro
AI21 Labs
Enterprise AI orchestration
0 stories
Andon Labs
Andon Labs
Autonomous organizations without humans in the loop
0 stories
AppWorld
StonyBrookNLP
A Controllable World of Apps and People for Benchmarking Interactive Coding Agents
0 stories
ARFBench
Datadog
A time series question-answering benchmark based on real incidents.
0 stories
Artificial Analysis
Artificial Analysis
AI model benchmarks and comparisons
0 stories
AssistantBench
Independent
Benchmark for evaluating AI assistants
0 stories
Attention Heads
HeyNEO
Attention-head analysis tool for GPT-2.
0 stories
AutoHypothesis
Unconfirmed
Automated hypothesis generation
0 stories
Better Agent
LangWatch
Build reliable, testable, production-grade AI agents with the Better Agents CLI, the reliability layer for agent development
0 stories
Blueprint
Blueprint Software Systems
Migrate & Improve your RPA Estate
0 stories
Blueprint-Bench 2
Andon Labs
Spatial reasoning benchmark for converting apartment photos into floor plans.
0 stories
BullshitBench
Peter Gostev
Measures whether models detect nonsense, call it out clearly, and avoid confidently continuing with invalid assumptions.
0 stories
Claude token counter
Anthropic
Count tokens before you send a message to Claude.
0 stories
ClawMark
Evolvent AI
ClawMark by Evolvent AI
0 stories
Code Arena
Arena Intelligence, Inc.
Build & Test with AI Coding Models
0 stories
Context Arena
Context Arena
Context Arena software product
0 stories
COSMO
Space Telescope Science Institute
Space Telescope Science Institute software
0 stories
CUA-World
Carnegie Mellon University
Interactive computer-use benchmark
0 stories
Dagger Cloud
Dagger
Observability for Delivery Workflows
0 stories
Datadog
Datadog, Inc.
Cloud observability and security platform
0 stories
DeepAgents Deploy
LangChain
Deploy DeepAgents with LangChain
0 stories
Dogfood
Dogfood
Dogfood your product, the efficient way
0 stories
DSPy
Stanford NLP Group
Programming—not prompting—language models.
0 stories
Entire
Entire
Software product from Entire.
0 stories
Evals
OpenAI
Manage and run evals in the OpenAI platform.
0 stories
FrontierSWE
Proximal
Benchmarking software engineering skill at the edge of human ability.
0 stories
Future AGI
Future AGI
AI evaluation and observability platform
0 stories
GEPA
gepa-ai
Optimize Anything with LLMs
0 stories
GitHub Repo Stats
GitHub
Repository analytics for GitHub projects.
0 stories
Google AI Edge Gallery
Google LLC
Explore, Experience, and Evaluate the Future of On-Device Generative AI with Google AI Edge.
0 stories
Grafana
Grafana Labs
Dashboard anything. Observe everything.
0 stories
GuideLLM
Red Hat
SLO-aware Benchmarking and Evaluation Platform for Optimizing Real-World LLM Inference
0 stories
Gym-Anything
Carnegie Mellon University
Turn Any Software into an Agent Environment
0 stories
HiL-Bench
Scale AI
Benchmark measuring whether AI coding agents know when to ask for help.
0 stories
Interfere
Interfere, Inc.
Build software that never breaks
0 stories
LangSmith
LangChain
LLM observability and evaluation platform
0 stories
LangSmith Fleet
LangChain
Manage AI application fleets at scale.
0 stories
Lucent
Lucent AI, Inc.
AI Bug Detection from Session Replays
0 stories
Mastra
Kepler Software Inc.
TypeScript AI Agent Framework & Platform
0 stories
Meta-Harness
Stanford IRIS Lab
Framework for automated search over task-specific model harnesses.
0 stories
Microsoft Agent 365
Microsoft
The Control Plane for Agents
0 stories
Observability
TrueFoundry
End-to-End LLM Observability, Simplified
0 stories
Opik
Comet
Open-source LLM observability and evaluation platform
0 stories
Opik Test Suites
Comet
Straightforward unit & regression testing for AI agents
0 stories
Overmind
Overmind Technology Inc.
Prevent your next outage
0 stories
Parallel Monitor API
Parallel
Monitor API by Parallel
0 stories
PARE-Bench
Unknown
A research framework for evaluating proactive AI assistants through active user simulation
0 stories
ParseBench
LlamaIndex
The document parsing benchmark for AI agents
0 stories
PostHog
PostHog
Open source product analytics.
0 stories
PostTrainBench
AI Safety and Alignment Group
Measuring how well AI agents can post-train language models
0 stories
prinzbench
Unattributed
Benchmark software
0 stories
Prompt Builder
Prompt Builder
Build and manage prompts.
0 stories
Promptfoo
OpenAI
LLM security and testing for teams of all sizes
0 stories
rams
HSLA0001 Inc.
rams
0 stories
Seer
Seer
AI software product
0 stories
Seer Agent
Sentry
AI debugging agent
0 stories
sentrux
sentrux
Software service named sentrux
0 stories
Sentry
Sentry
Application Performance Monitoring & Error Tracking Software
0 stories
Sentry MCP
Sentry
MCP server for Sentry
0 stories
SimGym
Shopify
Simulate theme changes with human-like AI shoppers
0 stories
SlopCodeBench
Sprocket Lab
Code benchmark from Sprocket Lab
0 stories
Small Harness
The Doggie Lift
Vet-Created for Easy Care
0 stories
Smithery
Clavia, Inc.
Discover and install MCP servers
0 stories
Tautulli
Tautulli
Monitor your Plex Media Server
0 stories
Tessl
Tessl AI Limited
The package manager for agent skills and context
0 stories
THUNDERDOME
Thunderdome
Open Source Agile Planning Poker app
0 stories
Tinker
Shopify
Shopify product called Tinker
0 stories
Tokenjuice
Tokenjuice
Lean output compaction for terminal-heavy agent workflows.
0 stories
Tokenspeed
Tokenspeed
A software product named Tokenspeed.
0 stories
Vertex AI
Google Cloud
Build, deploy, and scale machine learning models.
0 stories
VoxelBench
VoxelBench
Minecraft Server Benchmark
0 stories
W&B LEET
Weights & Biases
Lightweight Experiment Exploration Tool
0 stories
ZenMux
AI Force Singapore Pte. Ltd.
AI software platform
0 stories