📊
Evals & Observability
144 tools
Evaluation harnesses, LLM tracing, monitoring, and prompt/agent observability. Dashboards, judges, replay tools, and regression suites for LLM apps.
BenchLocal
BenchLocal
Local SEO benchmarking and reporting platform.
1 story
Claude Console
Anthropic
Claude API console
1 story
Sentry
Sentry
Application monitoring for developers
1 story
Agent Installer
Splunk
Installer utility for Splunk agents.
0 stories
Agent Sandbox
Kubernetes SIG Apps
A sandbox for agents on Kubernetes.
0 stories
Agent Session App
Agent Session
Agent session app
0 stories
Agent View
Agent View
Agent View
0 stories
agent-trace
Open Source
Trace and debug AI agents.
0 stories
Agentation
dip Corporation
Recruiting software by dip Corporation.
0 stories
AgentRank
AgentRank
Agent ranking platform
0 stories
AgentsView
AgentsView
AgentsView
0 stories
AGNTCY
Outshift by Cisco
Agent interoperability platform
0 stories
AI Gateway
Vercel
A single gateway for AI model access and routing.
0 stories
AI Usage
AI Usage
AI usage tracking software
0 stories
AI21 Maestro
AI21 Labs
Enterprise AI orchestration platform
0 stories
aiewf-eval
Daily
Daily-associated evaluation tool
0 stories
Andon Labs
Andon Labs
AI research and engineering lab
0 stories
AntithesisHQ
Antithesis
Autonomous software testing platform
0 stories
AppWorld
StonyBrookNLP
Benchmark and execution environment for generalist agents
0 stories
Aptabase
Aptabase
Open-source, privacy-friendly analytics for your app.
0 stories
ARC Prize
ARC Prize Foundation
ARC-AGI prize challenge
0 stories
ARFBench
Datadog
Benchmarking tool
0 stories
Artificial Analysis
Artificial Analysis
AI model benchmarking and analysis platform
0 stories
ASSERT
Microsoft
Unverified Microsoft product target
0 stories
AssistantBench
Independent
Benchmark for AI assistant evaluation
0 stories
Attention Head Visualiser
HeyNEO
Attention head visualisation tool.
0 stories
AttuneBench
Unknown
Unverified benchmark product
0 stories
AutoHypothesis
AutoHypothesis
AutoHypothesis
0 stories
Baidu Qianfan
Baidu
Baidu's large-model development platform
0 stories
Better Agent
LangWatch
Build better AI agents.
0 stories
Blueprint
Blueprint Software Systems
Enterprise software product
0 stories
Blueprint-Bench 2
Andon Labs
Blueprint-Bench 2
0 stories
Braintrust
Braintrust Data, Inc.
AI evaluation and observability platform
0 stories
BridgeBench
BridgeMind
BridgeBench software product
0 stories
Bugsink
Bugsink
Error tracking for your app.
0 stories
BullshitBench
Peter Gostev
Benchmark for bullshit detection
0 stories
ccusage
Independent
Claude Code usage tracking
0 stories
CHI-Bench
Independent
Benchmark software
0 stories
Chrome DevTools for agents
Google
Chrome DevTools for agents
0 stories
Clarity
Microsoft
Web analytics with session recordings and heatmaps
0 stories
Claude Counter
Independent
Claude Counter
0 stories
Claude token counter
Anthropic
Count tokens for Claude prompts and requests.
0 stories
ClawMark
Evolvent AI
Unverified software product target
0 stories
Code Arena
Arena Intelligence, Inc.
Code Arena by Arena Intelligence
0 stories
Context Arena
Context Arena
AI context evaluation platform
0 stories
Context.ai
Explore Interfaces Inc.
Context.ai software product
0 stories
COSMO
Space Telescope Science Institute
STScI software product
0 stories
Coval
Coval
Testing and evaluation for voice AI agents
0 stories
CUA-World
Carnegie Mellon University
Benchmark for computer-use agents
0 stories
CursorBench
Unattributed
Benchmark for Cursor-style coding workflows
0 stories
Dagger Cloud
Dagger
Observe and debug your pipelines
0 stories
Datadog
Datadog, Inc.
Cloud observability and security platform
0 stories
Daybreak
Daybreak
Daybreak
0 stories
DeepAgents Deploy
LangChain
Deploy DeepAgents with LangChain
0 stories
DFlash
Z Lab
DFlash
0 stories
Dify
LangGenius, Inc.
AI application development platform
0 stories
Dogfood
Dogfood
Dogfood
0 stories
DSPy
Stanford NLP Group
Programming—not prompting—for language models.
0 stories
EnterpriseRAG-Bench
Open-source community
Benchmark for enterprise RAG evaluation
0 stories
Entire
Entire
Software product named Entire.
0 stories
Evals
OpenAI
OpenAI's framework for evaluating models and prompts
0 stories
FrontierSWE
Proximal
Software engineering evaluation for coding agents
0 stories
Future AGI
Future AGI
AI platform for testing and monitoring applications
0 stories
GEPA
gepa-ai
Prompt optimization for LLM agents
0 stories
Gepa-Viz
Modaic
Software product associated with Modaic
0 stories
GitHub Repo Stats
GitHub
Repository statistics and insights for GitHub.
0 stories
GlitchTip
GlitchTip
Open source error tracking
0 stories
Google ADK
Google
Open-source toolkit for building AI agents.
0 stories
Google AI Edge Gallery
Google LLC
Explore and run AI models on-device.
0 stories
Grafana
Grafana Labs
Open and composable observability platform
0 stories
GuideLLM
Red Hat
Benchmarking tool for LLM serving systems
0 stories
Gym-Anything
Carnegie Mellon University
Research software toolkit
0 stories
Hermes Agent Control Room
Hermes
Agent control room
0 stories
HiL-Bench
Scale AI
Human-in-the-loop benchmark
0 stories
howtoeval
Unknown
Unknown software product.
0 stories
HydraDB
AGI Context, Inc.
Database product
0 stories
Interfere
Interfere, Inc.
Unverified software product.
0 stories
ITBench-AA
Unverified
Unverified product listing
0 stories
KramaBench
MIT CSAIL
Research benchmark
0 stories
LangSmith
LangChain
Debug, test, and monitor your LLM applications.
0 stories
LangSmith Engine
LangChain
Platform for tracing, evaluation, and monitoring of LLM applications
0 stories
LangSmith Fleet
LangChain
LangChain software product
0 stories
LangSmith Sandboxes
LangChain
Sandboxed code execution for LangSmith
0 stories
llm-checker
Independent
LLM checker
0 stories
LogRocket
LogRocket
Product analytics, session replay, and frontend monitoring.
0 stories
Lucent
Lucent AI, Inc.
AI software product
0 stories
Lumetric
Lumetric
Lumetric software product
0 stories
Mastra
Kepler Software Inc.
The TypeScript AI framework
0 stories
Medmarks
Medmarks
Medical software product
0 stories
Meta-Harness
Stanford IRIS Lab
Meta-Harness software tool from Stanford IRIS Lab.
0 stories
Microsoft Agent 365
Microsoft
Agent governance control plane
0 stories
minigepa
Independent
Independent software product
0 stories
Mistral Studio
Mistral AI
Studio for building AI applications on Mistral AI
0 stories
ModelClock
ModelClock
ModelClock software product
0 stories
MulTaBench
Independent
MulTaBench
0 stories
Observability
TrueFoundry
Observability for TrueFoundry
0 stories
Open Inspect
Open Inspect
Open Inspect software product
0 stories
OpenAI Agents SDK
OpenAI
Build agents with code.
0 stories
Opik
Comet
Open-source LLM evaluation and observability platform
0 stories
Opik Test Suites
Comet
Test suites for Opik
0 stories
Overmind
Overmind Technology Inc.
Change risk analysis platform
0 stories
Parallel Monitor API
Parallel
Monitoring API from Parallel
0 stories
PARE-Bench
Unknown
Benchmark software
0 stories
ParseBench
LlamaIndex
Document parsing benchmark
0 stories
Phoenix
Arize AI
Open-source LLM observability and evaluation platform
0 stories
PhoenixScore
Unknown vendor
PhoenixScore software product
0 stories
Plurai
Plurai
Software product by Plurai.
0 stories
PostHog
PostHog
Open source product analytics platform.
0 stories
PostTrainBench
AI Safety and Alignment Group
Benchmark for post-training evaluation.
0 stories
prinzbench
prinz-ai
AI benchmarking tool
0 stories
ProgramBench
Meta
Programming benchmark suite
0 stories
Prompt Builder
Salesforce, Inc.
Build prompts inside Salesforce.
0 stories
Promptfoo
Promptfoo
Open-source LLM evals and red teaming
0 stories
Ramp Sheets
Ramp
Spreadsheet workflows for Ramp users.
0 stories
rams
HSLA0001 Inc.
Software product
0 stories
Rerun
Rerun
Open-source visualization for multimodal data
0 stories
Seer
Seer
Seer
0 stories
Seer Agent
Sentry
Sentry's AI debugging agent.
0 stories
Sentrux
Sentrux
Sentrux software product
0 stories
Sentry MCP
Sentry
Connect Sentry to MCP-compatible AI clients.
0 stories
SimGym
Shopify
Shopify software tool
0 stories
SkillOpt
Microsoft
Skill optimization software.
0 stories
SLEIGHT-Bench
Independent
Open benchmark suite for evaluating model behavior
0 stories
SlopCodeBench
Sprocket Lab
Code benchmark product
0 stories
Small Harness
The Doggie Lift
Small-size harness from The Doggie Lift.
0 stories
Smithery
Clavia, Inc.
Discover, install, and host MCP servers
0 stories
SWE-check
Unknown
Developer tool for code checks
0 stories
SWE-Marathon
Independent
Benchmarking and evaluation for coding agents
0 stories
Tautulli
Tautulli
A Python-based web application for monitoring Plex Media Server.
0 stories
Tessl
Tessl AI Limited
AI-native software development platform
0 stories
THUNDERDOME
Thunderdome
THUNDERDOME software product.
0 stories
Tinker
Shopify
Shopify product
0 stories
Tokenjuice
Vincent Koc
Token counter for AI prompts
0 stories
Tokenspeed
Tokenspeed
Tokenspeed
0 stories
TraceQuest
TraceQuest
TraceQuest
0 stories
Vals AI
Vals AI
AI evaluation platform
0 stories
Vertex AI
Google Cloud
A unified machine learning platform for building and using generative AI
0 stories
vitest-evals
Sentry
Vitest-based evals for LLM testing.
0 stories
VoxelBench
VoxelBench
Unverified software product
0 stories
W&B LEET
Weights & Biases
Unverified W&B product reference
0 stories
Watchmen
Watchmen
Watchmen
0 stories
Weights & Biases
Weights & Biases, LLC
The AI developer platform
0 stories
Workshop
Workshop
Workshop
0 stories
ZenMux
AI Force Singapore Pte. Ltd.
Unverified software product.
0 stories