对 OpenAI-Agents SDK 的评估

本教程将介绍如何使用 Litefuse 和 Hugging Face Datasets，监控 OpenAI agent SDK 内部步骤的 trace 并 评估其表现。

本指南覆盖了团队用来快速、可靠地将 agent 推上生产的在线与离线评估指标。

为什么 AI agent 评估很重要：

在任务失败或结果不理想时调试问题
实时监控成本与性能
通过持续反馈提升可靠性和安全性

步骤 0：安装所需的库

下面安装 openai-agents 库（OpenAI Agents SDK 项目地址）、openinference 的 OpenTelemetry instrumentation、langfuse 以及 Hugging Face 的 datasets 库。

%pip install openai-agents nest_asyncio openinference-instrumentation-openai-agents langfuse datasets -q

Note: you may need to restart the kernel to use updated packages.

步骤 1：对 agent 进行 instrument

在本 notebook 中，我们将使用 Litefuse 来 trace、调试和评估 agent。

注意： 如果你使用的是其他框架（如 LangGraph、LlamaIndex 或 CrewAI），可以在我们的集成专区中找到对应的 instrument 文档。

import os
 
# 在项目设置页获取你的项目密钥：https://litefuse.cloud
os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-***" 
os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-***" 
os.environ["LANGFUSE_BASE_URL"] = "https://litefuse.cloud"
 
# 你的 OpenAI key
os.environ["OPENAI_API_KEY"] = "sk-proj-***"

环境变量设置完成后，就可以初始化 Langfuse 客户端。get_client() 会使用环境变量中提供的凭据初始化 Langfuse 客户端。

from langfuse import get_client
 
langfuse = get_client()
 
# Verify connection
if langfuse.auth_check():
    print("Langfuse client is authenticated and ready!")
else:
    print("Authentication failed. Please check your credentials and host.")

接下来初始化 OpenInference OpenAI Agents instrumentation。这个第三方 instrumentation 会自动捕获 OpenAI Agents 的操作，并将 OpenTelemetry（OTel）span 导出到 Litefuse。

注意： nest_asyncio.apply() 与 uvloop 不兼容，而 FastAPI 常用 uvloop 来管理事件循环。如果你的应用使用了 uvloop，又需要 nest_asyncio（例如某些 instrumentation 或 trace 库），那么需要在受影响的代码段中关闭 uvloop，改用 Python 标准的 asyncio 事件循环。

import nest_asyncio
nest_asyncio.apply()

from openinference.instrumentation.openai_agents import OpenAIAgentsInstrumentor
 
OpenAIAgentsInstrumentor().instrument()

步骤 2：测试 instrumentation

这里是一个简单的问答 agent。我们运行它来确认 instrumentation 工作正常。如果一切配置正确，你会在 observability dashboard 中看到日志/span。

import asyncio
from agents import Agent, Runner
 
async def main():
    agent = Agent(
        name="Assistant",
        instructions="You are a senior software engineer",
    )
 
    result = await Runner.run(agent, "Tell me why it is important to evaluate AI agents.")
    print(result.final_output)
 
loop = asyncio.get_running_loop()
await loop.create_task(main())
 
langfuse.flush()

打开你的 Litefuse Traces Dashboard 确认 span 和日志已记录下来。

Litefuse 中的示例 trace：

Litefuse 中的示例 trace

trace 链接

步骤 3：观察并评估更复杂的 agent

确认 instrumentation 工作之后，我们尝试一个更复杂的查询，看看更进一步的指标（token 用量、延迟、成本等）是如何被跟踪的。

import asyncio
from agents import Agent, Runner, function_tool
 
# Example function tool.
@function_tool
def get_weather(city: str) -> str:
    return f"The weather in {city} is sunny."
 
agent = Agent(
    name="Hello world",
    instructions="You are a helpful agent.",
    tools=[get_weather],
)
 
async def main():
    result = await Runner.run(agent, input="What's the weather in Berlin?")
    print(result.final_output)
 
loop = asyncio.get_running_loop()
await loop.create_task(main())

Trace 结构

Litefuse 记录一个 trace，其中包含若干 span，分别对应 agent 逻辑的每一步。在这里，trace 包含整个 agent 运行以及如下子 span：

工具调用（get_weather）
LLM 调用（使用 ‘gpt-4o’ 的 Responses API）

你可以查看这些 span，精确了解时间花在哪里、用了多少 token 等等：

Litefuse 中的 trace 树

trace 链接

在线评估

在线评估指的是在真实环境中对 agent 进行评估，即在生产中的实际使用过程中持续监控 agent 在真实用户交互上的表现并分析结果。

生产中常见的跟踪指标

成本 —— instrumentation 会捕获 token 用量，你可以按每 token 的价格换算成大致成本。
延迟 —— 观察每一步或整次运行所花的时间。
用户反馈 —— 用户可以直接给出反馈（点赞/点踩），帮助优化或纠正 agent。
LLM-as-a-Judge —— 用另一个 LLM 近实时地评估 agent 的输出（例如检查毒性或正确性）。

下面分别给出这些指标的示例。

1. 成本

下面截图展示了 gpt-4o 调用的用量。这有助于发现高成本步骤并优化 agent。

trace 链接

2. 延迟

我们也能看到每个步骤的耗时。在下面的示例中，整个运行耗时 7 秒，你可以按步骤拆解，找到瓶颈并优化 agent。

trace 链接

3. 添加额外属性

Litefuse 支持向 span 传递额外属性，如 user_id、tags、session_id 和自定义 metadata。用这些信息丰富 trace 对于分析、调试以及跨用户或 session 监控应用行为非常重要。

下面的代码演示了如何用 langfuse.start_as_current_observation 启动一个自定义 span，并使用 propagate_attributes() 更新与该 span 关联的 trace。

→ 了解更多：更新 Trace 和 Span 属性。

from langfuse import get_client, propagate_attributes
 
langfuse = get_client()
 
input_query = "Why is AI agent evaluation important?"
 
with langfuse.start_as_current_observation(
    name="OpenAI-Agent-Trace",
    ) as span:
 
    with propagate_attributes(
        user_id="user_123",
        session_id="my-agent-session",
        tags=["staging", "demo", "OpenAI Agent SDK"],
        metadata={"email": "user@litefuse.ai"},
        version="1.0.0"
    ):
    
        # Run your application here
        async def main(input_query):
                agent = Agent(
                    name = "Assistant",
                    instructions = "You are a helpful assistant.",
                )
 
                result = await Runner.run(agent, input_query)
                print(result.final_output)
                return result
 
        result = await main(input_query)
    
        # Set input and output on the root observation
        span.update(
            input=input_query,
            output=result,
            )
 
# Flush events in short-lived applications
langfuse.flush()

Litefuse 中的示例 trace

4. 用户反馈

如果 agent 嵌入了某个用户界面，可以记录用户的直接反馈（比如聊天 UI 中的点赞/点踩）。下面是一个使用 IPython.display 实现简易反馈机制的示例。

下面的代码片段中，当用户发送聊天消息时，我们会捕获 OpenTelemetry 的 trace ID。如果用户对最近一次回答点赞/点踩，我们就把对应的分数附加到 trace 上。

注意： 下面这段代码是面向 Google Colab notebook 编写的，在其他环境下可以跳过或者做相应调整。

from agents import Agent, Runner, WebSearchTool
import ipywidgets as widgets
from IPython.display import display
from langfuse import get_client
 
langfuse = get_client()
 
# Define your agent with the web search tool
agent = Agent(
    name="WebSearchAgent",
    instructions="You are an agent that can search the web.",
    tools=[WebSearchTool()]
)
 
def on_feedback(button):
    if button.icon == "thumbs-up":
      langfuse.create_score(
            value=1,
            name="user-feedback",
            comment="The user gave this response a thumbs up",
            trace_id=trace_id
        )
    elif button.icon == "thumbs-down":
      langfuse.create_score(
            value=0,
            name="user-feedback",
            comment="The user gave this response a thumbs down",
            trace_id=trace_id
        )
    print("Scored the trace in Litefuse")
 
user_input = input("Enter your question: ")
 
# Run agent
with langfuse.start_as_current_observation(
    as_type="span",
    name="OpenAI-Agent-Trace",
    ) as span:
    
    # Run your application here
    result = Runner.run_sync(agent, user_input)
    print(result.final_output)
 
    result = await main(user_input)
    trace_id = langfuse.get_current_trace_id()
 
    span.update(
        input=user_input,
        output=result.final_output,
    )
 
# Get feedback
print("How did you like the agent response?")
 
thumbs_up = widgets.Button(description="👍", icon="thumbs-up")
thumbs_down = widgets.Button(description="👎", icon="thumbs-down")
 
thumbs_up.on_click(on_feedback)
thumbs_down.on_click(on_feedback)
 
display(widgets.HBox([thumbs_up, thumbs_down]))
 
# Flush events in short-lived applications
langfuse.flush()

随后用户反馈就会出现在 Litefuse 中：

Litefuse 中正在采集的用户反馈

5. LLM-as-a-Judge

LLM-as-a-Judge 是自动评估 agent 输出的另一种方法。你可以设置一次单独的 LLM 调用，对输出的正确性、毒性、风格或任何你关心的标准进行判定。

工作流程：

你定义一个 Evaluation Template，例如 “Check if the text is toxic.”
你设定用作判定模型的 model，本例使用 gpt-4o-mini。
agent 每次产出输出时，将该输出连同模板传给 “judge” LLM。
judge LLM 给出评分或标签，并记录到你的 observability 工具中。

Litefuse 中的示例：

LLM-as-a-Judge Evaluation Template LLM-as-a-Judge Evaluator

# Example: Checking if the agent's output is toxic or not.
from agents import Agent, Runner, WebSearchTool
 
# Define your agent with the web search tool
agent = Agent(
    name="WebSearchAgent",
    instructions="You are an agent that can search the web.",
    tools=[WebSearchTool()]
)
 
input_query = "Is eating carrots good for the eyes?"
 
# Run agent
with langfuse.start_as_current_observation(as_type="span", name="OpenAI-Agent-Trace") as span:
    # Run your agent with a query
    result = Runner.run_sync(agent, input_query)
 
    # Set input and output on the root observation
    span.update(
        input=input_query,
        output=result.final_output,
    )

可以看到本例中的回答被判定为 “not toxic”。

LLM-as-a-Judge Evaluation Score

6. 可观测性指标概览

所有这些指标都可以在 dashboard 中一起可视化。这能让你快速看到 agent 在大量 session 上的表现，方便长期跟踪质量指标。

可观测性指标概览

离线评估

在线评估对于实时反馈是必需的，但你还需要 离线评估 —— 在开发前或开发中进行系统化检查。这有助于在将变更推到生产前保证质量与可靠性。

数据集评估

在离线评估中，通常做法是：

准备一个基准数据集（包含 prompt 与期望输出对）
用 agent 在该数据集上运行
把输出与期望结果进行比较，或者使用额外的评分机制

下面我们用 search-dataset 演示这种做法，该数据集包含可以通过 web 搜索工具回答的问题及对应的期望答案。

import pandas as pd
from datasets import load_dataset
 
# Fetch search-dataset from Hugging Face
dataset = load_dataset("junzhang1207/search-dataset", split = "train")
df = pd.DataFrame(dataset)
print("First few rows of search-dataset:")
print(df.head())

接下来在 Litefuse 中创建一个 dataset 实体来跟踪运行结果，然后把数据集中的每一项加进去。

from langfuse import get_client
langfuse = get_client()
 
langfuse_dataset_name = "search-dataset_huggingface_openai-agent"
 
# Create a dataset in Litefuse
langfuse.create_dataset(
    name=langfuse_dataset_name,
    description="search-dataset uploaded from Huggingface",
    metadata={
        "date": "2026-01-21",
        "type": "benchmark"
    }
)

for idx, row in df.iterrows():
    langfuse.create_dataset_item(
        dataset_name=langfuse_dataset_name,
        input={"text": row["question"]},
        expected_output={"text": row["expected_answer"]}
    )
    if idx >= 49: # For this example, we upload only the first 50 items
        break

Litefuse 中的 dataset 条目

在数据集上运行 agent

我们使用 experiment runner SDK 针对每个数据集项运行 agent。experiment runner 会处理并发执行、自动 trace 和评估。

我们定义一个 task 函数，它：

启动一个 Litefuse span 用于 trace
用 prompt 运行 agent
返回生成的答案

import asyncio
from agents import Agent, Runner, WebSearchTool
from langfuse import get_client
 
langfuse = get_client()
dataset_name = "search-dataset_huggingface_openai-agent"
 
# Define the OpenAI agent
agent = Agent(
    name="WebSearchAgent",
    instructions="You are an agent that can search the web.",
    tools=[WebSearchTool(search_context_size="high")]
)
 
# Task function that runs the OpenAI agent for each dataset item
def run_task(*, item, **kwargs):
    """Task function that runs the OpenAI agent for each dataset item"""
    question = item.input["text"]
    
    async def _run():
        result = await Runner.run(agent, question)
 
        print(f"Processed: {item.input}")
        return result.final_output
    
    loop = asyncio.get_event_loop()
    return loop.run_until_complete(_run())

# Fetch dataset
dataset = langfuse.get_dataset(name=dataset_name)
 
# Run experiment using the experiment runner SDK
result = dataset.run_experiment(
    name="qna_model_v3_run_05_20",  # Identifies this specific evaluation run
    description="Evaluation run for Q&A model v3 on May 20th",
    task=run_task,
    metadata={"model_provider": "OpenAI", "temperature_setting": 0.7}
)
 
print(result.format())

你可以替换不同的配置重复这个过程：

搜索工具（例如 OpenAI 的 WebSearchTool 设置不同的 context size）
模型（gpt-5.2、gpt-5.2-mini 等）
工具（带搜索 vs. 不带搜索）

然后在 Litefuse 中并排比较。本示例中我在 50 道数据集问题上运行了 agent 3 次，每次为 OpenAI WebSearchTool 设置了不同的 context size。可以看到增大 context size 时答案正确率从 0.89 略微提高到 0.92。correct_answer 分数由 LLM-as-a-Judge Evaluator 生成，它会根据数据集中给定的样例答案判定问题的正确性。

Dataset run 概览 Dataset run 对比

资源

查看 Litefuse 文档，了解更多评估和调试 agent 的方法。

评估多轮对话外部评估 Pipeline

这个页面对你有帮助吗？

支持