📊
Evals & Observability
145 tools
Evaluation harnesses, LLM tracing, monitoring, and prompt/agent observability. Dashboards, judges, replay tools, and regression suites for LLM apps.
Vals AI
Vals AI
AI evaluation platform
4 stories
Weights & Biases
Weights & Biases, LLC
The AI developer platform
3 stories
Artificial Analysis
Artificial Analysis
AI model benchmarking and analysis platform
2 stories
DFlash
Z Lab
DFlash
2 stories
LangSmith
LangChain
Debug, test, and monitor your LLM applications.
2 stories
OpenAI Agents SDK
OpenAI
Build agents with code.
2 stories
ARC Prize
ARC Prize Foundation
ARC-AGI prize challenge
1 story
Baidu Qianfan
Baidu
Baidu's large-model development platform
1 story
Braintrust
Braintrust Data, Inc.
Braintrust
1 story
Chrome DevTools for agents
Google
Chrome DevTools for AI agents
1 story
Claude Console
Anthropic
Claude API console
1 story
Context.ai
Explore Interfaces Inc.
Context.ai software product
1 story
Daybreak
Daybreak
Daybreak
1 story
LangSmith Engine
LangChain
LangChain product suite entry for LangSmith Engine
1 story
LangSmith Sandboxes
LangChain
Sandboxed code execution for LangSmith
1 story
Mastra
Kepler Software Inc.
Build AI agents with TypeScript
1 story
Medmarks
Medmarks
Medical software product
1 story
Plurai
Plurai
Software product by Plurai.
1 story
ProgramBench
Meta
Benchmarking program understanding and generation
1 story
Ramp Sheets
Ramp
Spreadsheet workflows for Ramp users.
1 story
SkillOpt
Microsoft
Skill optimization software.
1 story
Tinker
Shopify
Shopify software product
1 story
Tokenjuice
Vincent Koc
Token counter for AI prompts
1 story
Workshop
Workshop
Workshop
1 story
Agent Installer
Splunk
Installer utility for Splunk agents.
0 stories
Agent Sandbox
Kubernetes SIG Apps
Sandboxed AI agent execution on Kubernetes
0 stories
Agent Session App
Agent Session
Agent session app
0 stories
Agent View
Agent View
Unverified software product named Agent View
0 stories
agent-trace
Open Source
Trace and debug AI agents.
0 stories
Agentation
dip Corporation
Agentation
0 stories
AgentRank
AgentRank
AgentRank
0 stories
AgentsView
AgentsView
AgentsView software product
0 stories
AGNTCY
Outshift by Cisco
Agent interoperability platform
0 stories
AI Gateway
Vercel
Gateway to AI models
0 stories
AI Usage
AI Usage
AI usage software
0 stories
AI21 Maestro
AI21 Labs
Enterprise AI orchestration platform
0 stories
aiewf-eval
Daily
Daily-associated evaluation tool
0 stories
Andon Labs
Andon Labs
AI research and engineering lab
0 stories
AntithesisHQ
Antithesis
Autonomous software testing platform
0 stories
AppWorld
StonyBrookNLP
Benchmark and platform for agentic app workflows
0 stories
Aptabase
Aptabase
Open-source analytics for mobile, web, and desktop apps.
0 stories
ARFBench
Datadog
Open-source benchmark for retrieval and function-calling workflows
0 stories
ASSERT
Microsoft
Unverified Microsoft product target
0 stories
AssistantBench
Independent
Benchmark for AI assistant evaluation
0 stories
Attention Head Visualiser
HeyNEO
Attention head visualisation tool.
0 stories
AttuneBench
Unknown
AttuneBench
0 stories
AutoHypothesis
AutoHypothesis
AutoHypothesis
0 stories
BenchLocal
BenchLocal
Local SEO benchmarking and reporting platform.
0 stories
Better Agent
LangWatch
Build and improve AI agents.
0 stories
Blueprint
Blueprint Software Systems
Enterprise software product
0 stories
Blueprint-Bench 2
Andon Labs
Benchmark for agent evaluation
0 stories
BridgeBench
BridgeMind
BridgeBench software product
0 stories
Bugsink
Bugsink
Bugsink
0 stories
BullshitBench
Peter Gostev
AI benchmark for bullshit detection
0 stories
ccusage
Independent
Claude Code usage tracker.
0 stories
CHI-Bench
Independent
Benchmark for model evaluation
0 stories
Claude Counter
Independent
Claude Counter
0 stories
Claude token counter
Anthropic
Count tokens for Claude prompts and requests.
0 stories
ClawdMeter
Independent
Independent software product
0 stories
ClawMark
Evolvent AI
Unverified software product target
0 stories
Code Arena
Arena Intelligence, Inc.
Code Arena by Arena Intelligence
0 stories
Context Arena
Context Arena
Context management platform
0 stories
COSMO
Space Telescope Science Institute
STScI software product
0 stories
Coval
Coval
Testing and evaluation for voice AI agents
0 stories
CUA-World
Carnegie Mellon University
Benchmark for computer-using agents
0 stories
CursorBench
Unattributed
CursorBench
0 stories
Dagger Cloud
Dagger
Visibility into your Dagger runs
0 stories
Datadog
Datadog, Inc.
Cloud observability and security platform
0 stories
DeepAgents Deploy
LangChain
Deploy DeepAgents with LangChain
0 stories
Dify
LangGenius, Inc.
The open-source LLM app development platform.
0 stories
Dogfood
Dogfood
Customer feedback and product insights platform.
0 stories
DSPy
Stanford NLP Group
Programming—not prompting—for language models.
0 stories
EnterpriseRAG-Bench
Open-source community
Benchmark for enterprise RAG evaluation
0 stories
Entire
Entire
Software product named Entire.
0 stories
Evals
OpenAI
OpenAI's framework for evaluating models and prompts
0 stories
FrontierSWE
Proximal
Code-focused software engineering benchmark
0 stories
Future AGI
Future AGI
AI platform for testing and monitoring applications
0 stories
GEPA
gepa-ai
Prompt optimization for LLM agents
0 stories
Gepa-Viz
Modaic
Software product associated with Modaic
0 stories
GitHub Repo Stats
GitHub
Repository statistics and insights for GitHub.
0 stories
GlitchTip
GlitchTip
Open-source error monitoring platform.
0 stories
Google ADK
Google
Google's agent development kit
0 stories
Google AI Edge Gallery
Google LLC
Google AI Edge Gallery
0 stories
Grafana
Grafana Labs
The open observability platform
0 stories
GuideLLM
Red Hat
Benchmarking tool for LLM serving systems
0 stories
Gym-Anything
Carnegie Mellon University
Open-source reinforcement-learning environment toolkit.
0 stories
Hermes Agent Control Room
Hermes
Agent control room
0 stories
HiL-Bench
Scale AI
Human-in-the-loop benchmark
0 stories
howtoeval
Unknown
Unverified software product entry for the exact target name.
0 stories
HydraDB
AGI Context, Inc.
AI-native database
0 stories
Interfere
Interfere, Inc.
Interfere software
0 stories
ITBench-AA
Unverified
Public details could not be verified in this run.
0 stories
KramaBench
MIT CSAIL
MIT CSAIL benchmark
0 stories
LangSmith Fleet
LangChain
LangChain software product
0 stories
llm-checker
Independent
LLM checker
0 stories
LogRocket
LogRocket
Frontend monitoring and session replay
0 stories
Lucent
Lucent AI, Inc.
Lucent
0 stories
Lumetric
Lumetric
Lumetric software product
0 stories
Meta-Harness
Stanford IRIS Lab
Meta-Harness software from Stanford IRIS Lab
0 stories
Microsoft Agent 365
Microsoft
The control plane for AI agents
0 stories
Microsoft Clarity
Microsoft
Understand user behavior on your website
0 stories
minigepa
Independent
minigepa
0 stories
Mistral Studio
Mistral AI
Studio for building and managing AI workflows and agents.
0 stories
ModelClock
ModelClock
ModelClock software product
0 stories
MulTaBench
Independent
MulTaBench
0 stories
Observability
TrueFoundry
Observability
0 stories
Open Inspect
Open Inspect
Open Inspect software product
0 stories
Opik
Comet
Open-source LLM evaluation and observability platform
0 stories
Opik Test Suites
Comet
Test suites for Opik evaluations
0 stories
Overmind
Overmind Technology Inc.
Change risk analysis platform
0 stories
Parallel Monitor API
Parallel
Monitoring API from Parallel
0 stories
PARE-Bench
Unknown
PARE-Bench benchmark software
0 stories
ParseBench
LlamaIndex
Document parsing benchmark
0 stories
Phoenix
Arize AI
Open-source LLM observability and evaluation platform
0 stories
PhoenixScore
Unknown vendor
Unverified software product
0 stories
PostHog
PostHog
Open source product analytics platform.
0 stories
PostTrainBench
AI Safety and Alignment Group
Benchmark for post-training evaluation
0 stories
prinzbench
prinz-ai
AI benchmarking tool
0 stories
Prompt Builder
Salesforce, Inc.
Build prompts inside Salesforce.
0 stories
Promptfoo
Promptfoo
The LLM evals platform.
0 stories
rams
HSLA0001 Inc.
Software product
0 stories
Rerun
Rerun
Open source multimodal data visualization and logging
0 stories
Seer
Seer
Seer
0 stories
Seer Agent
Sentry
Sentry's AI debugging agent.
0 stories
Sentrux
Sentrux
Sentrux software product
0 stories
Sentry
Sentry
Application monitoring for developers
0 stories
Sentry MCP
Sentry
Connect Sentry to MCP-compatible AI clients.
0 stories
SimGym
Shopify
Shopify software tool
0 stories
SLEIGHT-Bench
Independent
Benchmark
0 stories
SlopCodeBench
Sprocket Lab
Code benchmark product
0 stories
Small Harness
The Doggie Lift
Small Harness
0 stories
Smithery
Clavia, Inc.
Discover, install, and host MCP servers
0 stories
SWE-check
Unknown
Developer tool for code checks
0 stories
SWE-Marathon
Independent
Open-source benchmark for autonomous software engineering agents.
0 stories
Tautulli
Tautulli
A Python-based web application for monitoring and tracking your Plex Media Server.
0 stories
Tessl
Tessl AI Limited
AI-native software development platform
0 stories
THUNDERDOME
Thunderdome
Software platform
0 stories
Tokenspeed
Tokenspeed
Tokenspeed
0 stories
TraceQuest
TraceQuest
TraceQuest
0 stories
Vertex AI
Google Cloud
A unified machine learning platform for building and using generative AI
0 stories
vitest-evals
Sentry
LLM evals with Vitest
0 stories
VoxelBench
VoxelBench
Voxel benchmarking software
0 stories
W&B LEET
Weights & Biases
Unverified W&B product reference
0 stories
Watchmen
Watchmen
Watchmen
0 stories
ZenMux
AI Force Singapore Pte. Ltd.
Unverified software product.
0 stories