📊
Evals & Observability
144 tools
Evaluation harnesses, LLM tracing, monitoring, and prompt/agent observability. Dashboards, judges, replay tools, and regression suites for LLM apps.
Vals AI
Vals AI
AI agent evaluation platform
4 stories
Weights & Biases
Weights & Biases, LLC
The AI Developer Platform
3 stories
Artificial Analysis
Artificial Analysis
AI model analysis platform
2 stories
DFlash
Z Lab
Software product from Z Lab.
2 stories
LangSmith
LangChain
Debug, test, and monitor your LLM apps.
2 stories
OpenAI Agents SDK
OpenAI
Build agents with OpenAI
2 stories
ARC Prize
ARC Prize Foundation
Competition platform for ARC-AGI
1 story
Baidu Qianfan
Baidu
Large-model platform for building AI applications
1 story
Braintrust
Braintrust Data, Inc.
The AI evaluation platform
1 story
Chrome DevTools for agents
Google
Chrome DevTools for agents
1 story
Claude Console
Anthropic
Developer console for Claude
1 story
Context.ai
Explore Interfaces Inc.
AI product analytics
1 story
Daybreak
Daybreak
Daybreak
1 story
LangSmith Engine
LangChain
LLM observability and evaluation platform
1 story
LangSmith Sandboxes
LangChain
Isolated code execution for LangSmith
1 story
Mastra
Kepler Software Inc.
The TypeScript AI framework.
1 story
Medmarks
Medmarks
Software product by Medmarks
1 story
Plurai
Plurai
AI software platform
1 story
ProgramBench
Meta
Benchmark project for program-centric evaluation
1 story
Ramp Sheets
Ramp
Spreadsheet-style finance workflows
1 story
SkillOpt
Microsoft
Unverified Microsoft software product
1 story
Tinker
Shopify
Shopify product Tinker
1 story
Tokenjuice
Vincent Koc
AI token utility
1 story
Workshop
Workshop
Workshop
1 story
Agent Installer
Splunk
Splunk agent installer
0 stories
Agent Sandbox
Kubernetes SIG Apps
Sandbox for experimenting with agents on Kubernetes
0 stories
Agent Session App
Agent Session
Agent Session App
0 stories
Agent View
Agent View
Agent View
0 stories
agent-trace
Open Source
Trace AI agent runs.
0 stories
Agentation
dip Corporation
Unknown tagline
0 stories
AgentRank
AgentRank
AgentRank
0 stories
AgentsView
AgentsView
AgentsView
0 stories
AGNTCY
Outshift by Cisco
The Internet of Agents
0 stories
AI Gateway
Vercel
A unified gateway for AI model providers
0 stories
AI Usage
AI Usage
AI usage tracking platform
0 stories
AI21 Maestro
AI21 Labs
AI21 Maestro
0 stories
aiewf-eval
Daily
Daily software product
0 stories
Andon Labs
Andon Labs
Andon Labs
0 stories
AntithesisHQ
Antithesis
Autonomous software testing platform
0 stories
AppWorld
StonyBrookNLP
Benchmark and environment for app-using agents
0 stories
Aptabase
Aptabase
Privacy-friendly analytics for your apps.
0 stories
ARFBench
Datadog
Benchmark repository
0 stories
ASSERT
Microsoft
Microsoft software product ASSERT
0 stories
AssistantBench
Independent
Benchmark for AI assistants
0 stories
Attention Head Visualiser
HeyNEO
Visualiser for transformer attention heads.
0 stories
AttuneBench
Unknown
Software product
0 stories
AutoHypothesis
AutoHypothesis
Hypothesis automation software
0 stories
BenchLocal
BenchLocal
Local SEO reporting platform
0 stories
Better Agent
LangWatch
Build better agents
0 stories
Blueprint
Blueprint Software Systems
Requirements and product design management platform
0 stories
Blueprint-Bench 2
Andon Labs
Benchmark product
0 stories
BridgeBench
BridgeMind
BridgeBench by BridgeMind
0 stories
Bugsink
Bugsink
Self-hosted error tracking
0 stories
BullshitBench
Peter Gostev
Benchmark for bullshit-free AI answers
0 stories
ccusage
Independent
Claude Code usage analytics
0 stories
CHI-Bench
Independent
Independent benchmark software product
0 stories
Clarity
Microsoft
Understand user behavior on your site.
0 stories
Claude Counter
Independent
Independent software product.
0 stories
Claude token counter
Anthropic
Count tokens for Claude prompts
0 stories
ClawMark
Evolvent AI
Software product by Evolvent AI.
0 stories
Code Arena
Arena Intelligence, Inc.
Code Arena
0 stories
Context Arena
Context Arena
Context handling platform
0 stories
COSMO
Space Telescope Science Institute
COSMO software product
0 stories
Coval
Coval
Voice AI evaluation platform
0 stories
CUA-World
Carnegie Mellon University
Benchmark for computer-use agents.
0 stories
CursorBench
Unattributed
Benchmark/tool for cursor-based coding workflows.
0 stories
Dagger Cloud
Dagger
CI/CD observability for Dagger
0 stories
Datadog
Datadog, Inc.
Observability and security platform for cloud applications
0 stories
DeepAgents Deploy
LangChain
Deploy DeepAgents
0 stories
Dify
LangGenius, Inc.
The open-source LLM app development platform.
0 stories
Dogfood
Dogfood
Turn user feedback into product decisions.
0 stories
DSPy
Stanford NLP Group
The framework for programming—not prompting—language models.
0 stories
EnterpriseRAG-Bench
Open-source community
Benchmark for enterprise RAG evaluation.
0 stories
Entire
Entire
Software product named Entire.
0 stories
Evals
OpenAI
Framework for evaluating LLMs and systems
0 stories
FrontierSWE
Proximal
FrontierSWE by Proximal
0 stories
Future AGI
Future AGI
AI evaluation and observability platform
0 stories
GEPA
gepa-ai
Prompt optimization for LLM agents
0 stories
Gepa-Viz
Modaic
Visualization software product
0 stories
GitHub Repo Stats
GitHub
Repository statistics and activity insights for GitHub repositories.
0 stories
GlitchTip
GlitchTip
Open source error tracking and monitoring
0 stories
Google ADK
Google
Agent Development Kit
0 stories
Google AI Edge Gallery
Google LLC
Explore on-device AI models and demos.
0 stories
Grafana
Grafana Labs
The open and composable observability platform
0 stories
GuideLLM
Red Hat
LLM inference benchmarking and optimization tool
0 stories
Gym-Anything
Carnegie Mellon University
Software product
0 stories
Hermes Agent Control Room
Hermes
AI agent control room
0 stories
HiL-Bench
Scale AI
Benchmark for human-in-the-loop AI evaluation
0 stories
howtoeval
Unknown
Evaluation-oriented software product
0 stories
HydraDB
AGI Context, Inc.
Database product
0 stories
Interfere
Interfere, Inc.
Software platform
0 stories
ITBench-AA
Unknown
IT benchmarking software
0 stories
KramaBench
MIT CSAIL
MIT CSAIL benchmark
0 stories
LangSmith Fleet
LangChain
Managed agent fleet operations in LangSmith
0 stories
llm-checker
Independent
Software product named llm-checker
0 stories
LogRocket
LogRocket
The frontend monitoring platform
0 stories
Lucent
Lucent AI, Inc.
Lucent
0 stories
Lumetric
Lumetric
Lumetric
0 stories
Meta-Harness
Stanford IRIS Lab
Meta-harness for evaluation
0 stories
Microsoft Agent 365
Microsoft
Control plane for agents
0 stories
minigepa
Independent
Software product named minigepa
0 stories
Mistral Studio
Mistral AI
Build and operate AI applications and agents.
0 stories
ModelClock
ModelClock
ModelClock
0 stories
MulTaBench
Independent
Open-source benchmark software
0 stories
Observability
TrueFoundry
Monitor logs, metrics, traces, and alerts.
0 stories
Open Inspect
Open Inspect
Open Inspect
0 stories
Opik
Comet
Open-source LLM observability and evaluation platform
0 stories
Opik Test Suites
Comet
Test suites for Opik
0 stories
Overmind
Overmind Technology Inc.
Overmind
0 stories
Parallel Monitor API
Parallel
Monitoring API
0 stories
PARE-Bench
Unknown
Benchmark for evaluating AI systems
0 stories
ParseBench
LlamaIndex
Document parsing benchmark
0 stories
Phoenix
Arize AI
Open-source LLM observability and evaluation platform.
0 stories
PhoenixScore
Unknown vendor
PhoenixScore software product
0 stories
PostHog
PostHog
The all-in-one product analytics platform
0 stories
PostTrainBench
AI Safety and Alignment Group
Benchmark for post-training safety evaluation
0 stories
prinzbench
prinz-ai
Benchmark software from prinz-ai.
0 stories
Prompt Builder
Prompt Builder
Prompt Builder
0 stories
Promptfoo
Promptfoo
Test, eval, and red-team LLM apps.
0 stories
rams
HSLA0001 Inc.
rams
0 stories
Rerun
Rerun
Multimodal data stack
0 stories
Seer
Seer
Seer
0 stories
Seer Agent
Sentry
AI debugging agent
0 stories
Sentrux
Sentrux
Sentrux
0 stories
Sentry
Sentry
Application monitoring for developers
0 stories
Sentry MCP
Sentry
Connect AI assistants to Sentry through MCP.
0 stories
SimGym
Shopify
Shopify simulation environment for shopping agents
0 stories
SLEIGHT-Bench
Independent
Benchmark suite for model and agent evaluation
0 stories
SlopCodeBench
Sprocket Lab
Code benchmark
0 stories
Small Harness
The Doggie Lift
Small Harness
0 stories
Smithery
Clavia, Inc.
Discover and use MCP servers.
0 stories
SWE-check
Unknown
Developer tool for code checks
0 stories
SWE-Marathon
Independent
Open benchmark for long-horizon software engineering tasks.
0 stories
Tautulli
Tautulli
Plex monitoring and tracking
0 stories
Tessl
Tessl AI Limited
Build software with AI
0 stories
THUNDERDOME
Thunderdome
Thunderdome platform
0 stories
Tokenspeed
Tokenspeed
Unverified software product under the Tokenspeed name.
0 stories
TraceQuest
TraceQuest
TraceQuest
0 stories
Vertex AI
Google Cloud
Build, deploy, and scale machine learning models and AI applications.
0 stories
vitest-evals
Sentry
Vitest-based evals
0 stories
VoxelBench
VoxelBench
VoxelBench
0 stories
W&B LEET
Weights & Biases
W&B software product
0 stories
Watchmen
Watchmen
Watchmen
0 stories
ZenMux
AI Force Singapore Pte. Ltd.
AI product
0 stories