Metrics

Published on
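February 19, 2026
Building AI Agent Applications (Part 4): Validation and Measurement
AI-Agents Evaluation Testing Metrics LLM-as-Judge Best-Practices Engineering Reading-Notes
This is the fourth post in a series of commentaries on "Building Applications with AI Agents". The quality of an AI agent is not a matter of how well its answers are written; it depends on whether the agent can reliably complete tasks in a real environment. This post offers an end-to-end practical guide, from offline evaluation to production monitoring, to help engineering teams turn changes that "look smarter" into changes that are "provably more reliable".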
