📊
Evals & Observability
82 tools
Evaluation harnesses, LLM tracing, monitoring, and prompt/agent observability. Dashboards, judges, replay tools, and regression suites for LLM apps.
agent-install
Splunk
Deploy application server agents
0 stories
Agentation
Dip
Visual feedback for AI coding agents
0 stories
AGNTCY
Outshift by Cisco
Open-source infrastructure for AI agents
0 stories
AI21 Maestro
AI21 Labs
Enterprise AI orchestration
0 stories
Andon Labs
Andon Labs
Agent evals for long-horizon AI tasks
0 stories
AppWorld
StonyBrookNLP
A Controllable World of Apps and People for Benchmarking Interactive Coding Agents
0 stories
ARC Prize
ARC Prize Foundation
The North Star for AGI
0 stories
ARFBench
Datadog
A time series question-answering benchmark based on real incidents.
0 stories
Artificial Analysis
Artificial Analysis
AI model benchmarks and comparisons
0 stories
AssistantBench
Independent
Benchmark for evaluating AI assistants
0 stories
AutoHypothesis
Unconfirmed
Automated hypothesis generation
0 stories
Baidu Qianfan
Baidu
A one-stop, enterprise-grade large model service platform centered on Agents
0 stories
Better Agent
LangWatch
Build reliable, testable, production-grade AI agents with Better Agents CLI - the reliability layer for agent development
0 stories
Blueprint
Blueprint Software Systems
Migrate & Improve your RPA Estate
0 stories
Blueprint-Bench 2
Andon Labs
Spatial reasoning benchmark for converting apartment photos into floor plans.
0 stories
Braintrust
Braintrust Data, Inc.
The AI observability platform for building quality AI products
0 stories
BullshitBench
Peter Gostev
Measures whether models detect nonsense, call it out clearly, and avoid confidently continuing with invalid assumptions.
0 stories
Claude token counter
Anthropic
Count tokens before you send a message to Claude.
0 stories
ClawMark
Evolvent AI
ClawMark by Evolvent AI
0 stories
Context Arena
Context Arena
Context Arena software product
0 stories
Context.ai
Explore Interfaces Inc.
Build, run, and improve AI agents
0 stories
COSMO
Space Telescope Science Institute
COS Monitoring
0 stories
CUA-World
Carnegie Mellon University
Interactive computer-use benchmark
0 stories
Dagger Cloud
Dagger
Observability for Delivery Workflows
0 stories
Datadog
Datadog, Inc.
Cloud Monitoring as a Service
0 stories
DeepAgents Deploy
LangChain
Deploy DeepAgents with LangChain
0 stories
Dogfood
Dogfood
Dogfood your product, the efficient way
0 stories
DSPy
Stanford NLP Group
Programming—not prompting—LMs
0 stories
Entire
Entire
Software product from Entire.
0 stories
Evals
OpenAI
Manage and run evals in the OpenAI platform.
0 stories
FrontierSWE
Proximal
Benchmarking software engineering skill at the edge of human ability.
0 stories
Future AGI
Future AGI
AI Agents hallucinate, fix it faster.
0 stories
GEPA
gepa-ai
Optimize Anything with LLMs
0 stories
GitHub Repo Stats
GitHub
Repository analytics for GitHub projects.
0 stories
Google AI Edge Gallery
Google LLC
Explore, Experience, and Evaluate the Future of On-Device Generative AI with Google AI Edge.
0 stories
Grafana
Grafana Labs
Dashboard anything. Observe everything.
0 stories
GuideLLM
Red Hat
SLO-aware Benchmarking and Evaluation Platform for Optimizing Real-World LLM Inference
0 stories
Gym-Anything
Carnegie Mellon University
Turn Any Software into an Agent Environment
0 stories
HealthBench Professional
OpenAI
Healthcare benchmark from OpenAI
0 stories
HiL-Bench
Scale AI
Benchmark measuring whether AI coding agents know when to ask for help.
0 stories
Interfere
Interfere, Inc.
Build software that never breaks
0 stories
LangSmith
LangChain
LLM observability and evaluation platform
0 stories
LangSmith Fleet
LangChain
Agents for the whole company
0 stories
Lucent
Lucent AI, Inc.
AI Bug Detection from Session Replays
0 stories
Mastra
Kepler Software Inc.
TypeScript AI Agent Framework & Platform
0 stories
Meta-Harness
Stanford IRIS Lab
Framework for automated search over task-specific model harnesses.
0 stories
Microsoft Agent 365
Microsoft
The Control Plane for Agents
0 stories
Observability
TrueFoundry
End-to-End LLM Observability, Simplified
0 stories
OpenAI Agents SDK
OpenAI
Build agentic AI apps and multi-agent workflows.
0 stories
Opik
Comet
Open-source LLM observability and evaluation platform
0 stories
Opik Test Suites
Comet
Straightforward unit & regression testing for AI agents
0 stories
Overmind
Overmind Technology Inc.
Prevent your next outage
0 stories
Parallel Monitor API
Parallel
Monitor API by Parallel
0 stories
PARE-Bench
Unknown
A research framework for evaluating proactive AI assistants through active user simulation
0 stories
ParseBench
ParseBench
Document Parsing Benchmark for AI Agents
0 stories
PostHog
PostHog
Open source product analytics.
0 stories
PostTrainBench
AI Safety and Alignment Group
Measuring how well AI agents can post-train language models
0 stories
prinzbench
Unattributed
Private benchmark for legal research and needle-in-the-haystack search
0 stories
ProgramBench
Meta
Can language models rebuild programs from scratch?
0 stories
Prompt Builder
Prompt Builder
Build AI prompts 10x faster with drag-and-drop blocks
0 stories
Promptfoo
OpenAI
LLM security and testing for teams of all sizes
0 stories
rams
HSLA0001 Inc.
rams
0 stories
Seer
Seer
AI software product
0 stories
Seer Agent
Sentry
AI debugging agent
0 stories
sentrux
sentrux
Software service named sentrux
0 stories
Sentry
Sentry
Application Performance Monitoring & Error Tracking Software
0 stories
Sentry MCP
Sentry
Connect AI coding tools to Sentry via MCP.
0 stories
SlopCodeBench
Sprocket Lab
Community driven benchmark for measuring code erosion under iterative specification refinement.
0 stories
Small Harness
The Doggie Lift
Vet-Created for Easy Care
0 stories
Smithery
Clavia, Inc.
Connect agents to MCPs in minutes
0 stories
Tautulli
Tautulli
Monitor your Plex Media Server
0 stories
Tessl
Tessl AI Limited
The package manager for agent skills and context
0 stories
THUNDERDOME
Thunderdome
Open Source Agile Planning Poker app
0 stories
Tinker
Shopify
Shopify product called Tinker
0 stories
Tokenjuice
Tokenjuice
Lean output compaction for terminal-heavy agent workflows.
0 stories
Tokenspeed
Tokenspeed
A software product named Tokenspeed.
0 stories
Vals AI
Vals AI
Benchmark Generative AI for Enterprise Applications.
0 stories
Vertex AI
Google Cloud
Build, deploy, and scale machine learning models.
0 stories
VoxelBench
VoxelBench
Minecraft Server Benchmark
0 stories
W&B LEET
Weights & Biases
Lightweight Experiment Exploration Tool
0 stories
Weights & Biases
Weights and Biases, LLC
Machine learning platform
0 stories
ZenMux
AI Force Singapore Pte. Ltd.
AI software platform
0 stories