Evals & Observability

145 tools

Evaluation harnesses, LLM tracing, monitoring, and prompt/agent observability. Dashboards, judges, replay tools, and regression suites for LLM apps.

AI evaluation platform

Weights & Biases

Weights & Biases, LLC

The AI developer platform

Artificial Analysis

Artificial Analysis

AI model benchmarking and analysis platform

Debug, test, and monitor your LLM applications.

OpenAI Agents SDK

Build agents with code.

ARC Prize Foundation

ARC-AGI prize challenge

Baidu's large-model development platform

Braintrust Data, Inc.

Chrome DevTools for agents

Chrome DevTools for AI agents

Claude API console

Explore Interfaces Inc.

Context.ai software product

LangSmith Engine

LangChain product suite entry for LangSmith Engine

LangSmith Sandboxes

Sandboxed code execution for LangSmith

Kepler Software Inc.

Build AI agents with TypeScript

Medical software product

Software product by Plurai.

Benchmarking program understanding and generation

Spreadsheet workflows for Ramp users.

Skill optimization software.

Shopify software product

Token counter for AI prompts

Agent Installer

Installer utility for Splunk agents.

Kubernetes SIG Apps

Sandboxed AI agent execution on Kubernetes

Agent Session App

Agent session app

Unverified software product named Agent View

Trace and debug AI agents.

dip Corporation

AgentsView software product

Outshift by Cisco

Agent interoperability platform

Gateway to AI models

AI usage software

Enterprise AI orchestration platform

Daily-associated evaluation tool

AI research and engineering lab

Autonomous software testing platform

Benchmark and platform for agentic app workflows

Open-source analytics for mobile, web, and desktop apps.

Open-source benchmark for retrieval and function-calling workflows

Unverified Microsoft product target

Benchmark for AI assistant evaluation

Attention Head Visualiser

Attention head visualisation tool.

Local SEO benchmarking and reporting platform.

Build and improve AI agents.

Blueprint Software Systems

Enterprise software product

Blueprint-Bench 2

Benchmark for agent evaluation

BridgeBench software product

AI benchmark for bullshit detection

Claude Code usage tracker.

Benchmark for model evaluation

Claude token counter

Count tokens for Claude prompts and requests.

Independent software product

Unverified software product target

Arena Intelligence, Inc.

Code Arena by Arena Intelligence

Context management platform

Space Telescope Science Institute

STScI software product

Testing and evaluation for voice AI agents

Carnegie Mellon University

Benchmark for computer-using agents

Visibility into your Dagger runs

Cloud observability and security platform

DeepAgents Deploy

Deploy DeepAgents with LangChain

LangGenius, Inc.

The open-source LLM app development platform.

Customer feedback and product insights platform.

Stanford NLP Group

Programming—not prompting—for language models.

EnterpriseRAG-Bench

Open-source community

Benchmark for enterprise RAG evaluation

Software product named Entire.

OpenAI's framework for evaluating models and prompts

Code-focused software engineering benchmark

AI platform for testing and monitoring applications

Prompt optimization for LLM agents

Software product associated with Modaic

GitHub Repo Stats

Repository statistics and insights for GitHub.

Open-source error monitoring platform.

Google's agent development kit

Google AI Edge Gallery

Google AI Edge Gallery

The open observability platform

Benchmarking tool for LLM serving systems

Carnegie Mellon University

Open-source reinforcement-learning environment toolkit.

Hermes Agent Control Room

Agent control room

Human-in-the-loop benchmark

Unverified software product entry for the exact target name.

AGI Context, Inc.

AI-native database

Interfere, Inc.

Interfere software

Public details could not be verified.

MIT CSAIL benchmark

LangSmith Fleet

LangChain software product

Frontend monitoring and session replay

Lucent AI, Inc.

Lumetric software product

Stanford IRIS Lab

Meta-Harness software from Stanford IRIS Lab

Microsoft Agent 365

The control plane for AI agents

Microsoft Clarity

Understand user behavior on your website

Studio for building and managing AI workflows and agents.

ModelClock software product

Open Inspect software product

Open-source LLM evaluation and observability platform

Opik Test Suites

Test suites for Opik evaluations

Overmind Technology Inc.

Change risk analysis platform

Parallel Monitor API

Monitoring API from Parallel

PARE-Bench benchmark software

Document parsing benchmark

Open-source LLM observability and evaluation platform

Unverified software product

Open-source product analytics platform.

AI Safety and Alignment Group

Benchmark for post-training evaluation

AI benchmarking tool

Salesforce, Inc.

Build prompts inside Salesforce.

The LLM evals platform.

Software product

Open-source multimodal data visualization and logging

Sentry's AI debugging agent.

Sentrux software product

Application monitoring for developers

Connect Sentry to MCP-compatible AI clients.

Shopify software tool

Code benchmark product

The Doggie Lift

Discover, install, and host MCP servers

Developer tool for code checks

Open-source benchmark for autonomous software engineering agents.

A Python-based web application for monitoring and tracking your Plex Media Server.

Tessl AI Limited

AI-native software development platform

Software platform

A unified machine learning platform for building and using generative AI

LLM evals with Vitest

Voxel benchmarking software

Weights & Biases

Unverified W&B product reference

AI Force Singapore Pte. Ltd.

Unverified software product.

AI Primer

Your daily guide to AI tools, workflows, and creative inspiration.

© 2026 AI Primer. All rights reserved.