Evals & Observability

144 tools

Evaluation harnesses, LLM tracing, monitoring, and prompt/agent observability. Dashboards, judges, replay tools, and regression suites for LLM apps.

Local SEO benchmarking and reporting platform.

Claude API console

Application monitoring for developers

Agent Installer

Installer utility for Splunk agents.

Kubernetes SIG Apps

A sandbox for agents on Kubernetes.

Agent Session App

Agent session app

Trace and debug AI agents.

dip Corporation

Recruiting software by dip Corporation.

Agent ranking platform

Outshift by Cisco

Agent interoperability platform

A single gateway for AI model access and routing.

AI usage tracking software

Enterprise AI orchestration platform

Daily-associated evaluation tool

AI research and engineering lab

Autonomous software testing platform

Benchmark and execution environment for generalist agents

Open-source, privacy-friendly analytics for your app.

ARC Prize Foundation

ARC-AGI prize challenge

Benchmarking tool

Artificial Analysis

AI model benchmarking and analysis platform

Unverified Microsoft product target

Benchmark for AI assistant evaluation

Attention Head Visualiser

Attention head visualisation tool.

Unverified benchmark product

Baidu's large-model development platform

Build better AI agents.

Blueprint Software Systems

Enterprise software product

Blueprint-Bench 2

Braintrust Data, Inc.

AI evaluation and observability platform

BridgeBench software product

Error tracking for your app.

Benchmark for bullshit detection

Claude Code usage tracking

Benchmark software

Chrome DevTools for agents

Web analytics with session recordings and heatmaps

Claude token counter

Count tokens for Claude prompts and requests.

Unverified software product target

Arena Intelligence, Inc.

Code Arena by Arena Intelligence

AI context evaluation platform

Explore Interfaces Inc.

Context.ai software product

Space Telescope Science Institute

STScI software product

Testing and evaluation for voice AI agents

Carnegie Mellon University

Benchmark for computer-use agents

Benchmark for Cursor-style coding workflows

Observe and debug your pipelines

Cloud observability and security platform

DeepAgents Deploy

Deploy DeepAgents with LangChain

LangGenius, Inc.

AI application development platform

Stanford NLP Group

Programming—not prompting—for language models.

EnterpriseRAG-Bench

Open-source community

Benchmark for enterprise RAG evaluation

Software product named Entire.

OpenAI's framework for evaluating models and prompts

Software engineering evaluation for coding agents

AI platform for testing and monitoring applications

Prompt optimization for LLM agents

Software product associated with Modaic

GitHub Repo Stats

Repository statistics and insights for GitHub.

Open source error tracking

Open-source toolkit for building AI agents.

Google AI Edge Gallery

Explore and run AI models on-device.

Open and composable observability platform

Benchmarking tool for LLM serving systems

Carnegie Mellon University

Research software toolkit

Hermes Agent Control Room

Agent control room

Human-in-the-loop benchmark

Unknown software product.

AGI Context, Inc.

Database product

Interfere, Inc.

Unverified software product.

Unverified product listing

Research benchmark

Debug, test, and monitor your LLM applications.

LangSmith Engine

Platform for tracing, evaluation, and monitoring of LLM applications

LangSmith Fleet

LangChain software product

LangSmith Sandboxes

Sandboxed code execution for LangSmith

Product analytics, session replay, and frontend monitoring.

Lucent AI, Inc.

AI software product

Lumetric software product

Kepler Software Inc.

The TypeScript AI framework

Medical software product

Stanford IRIS Lab

Meta-Harness software tool from Stanford IRIS Lab.

Microsoft Agent 365

Agent governance control plane

Independent software product

Studio for building AI applications on Mistral AI

ModelClock software product

Observability for TrueFoundry

Open Inspect software product

OpenAI Agents SDK

Build agents with code.

Open-source LLM evaluation and observability platform

Opik Test Suites

Test suites for Opik

Overmind Technology Inc.

Change risk analysis platform

Parallel Monitor API

Monitoring API from Parallel

Benchmark software

Document parsing benchmark

Open-source LLM observability and evaluation platform

PhoenixScore software product

Software product by Plurai.

Open source product analytics platform.

AI Safety and Alignment Group

Benchmark for post-training evaluation.

AI benchmarking tool

Programming benchmark suite

Salesforce, Inc.

Build prompts inside Salesforce.

Open-source LLM evals and red teaming

Spreadsheet workflows for Ramp users.

Software product

Open-source visualization for multimodal data

Sentry's AI debugging agent.

Sentrux software product

Connect Sentry to MCP-compatible AI clients.

Shopify software tool

Skill optimization software.

Open benchmark suite for evaluating model behavior

Code benchmark product

The Doggie Lift

Small-size harness from The Doggie Lift.

Discover, install, and host MCP servers

Developer tool for code checks

Benchmarking and evaluation for coding agents

A Python-based web application for monitoring Plex Media Server.

Tessl AI Limited

AI-native software development platform

THUNDERDOME software product.

Shopify product

Token counter for AI prompts

AI evaluation platform

A unified machine learning platform for building and using generative AI

Vitest-based evals for LLM testing.

Unverified software product

Weights & Biases

Unverified W&B product reference

Weights & Biases

Weights & Biases, LLC

The AI developer platform

AI Force Singapore Pte. Ltd.

Unverified software product.

AI Primer

Your daily guide to AI tools, workflows, and creative inspiration.

© 2026 AI Primer. All rights reserved.