评估 LangGraph Agent

本教程将介绍如何使用 Litefuse 和 Hugging Face Datasets，监控 LangGraph agents 内部步骤的 trace 并 评估其表现。

本指南覆盖了团队用来快速、可靠地将 agent 推上生产的在线和离线评估指标。

为什么 AI agent 评估很重要：

在任务失败或结果不理想时调试问题
实时监控成本与性能
通过持续反馈提升可靠性和安全性

步骤 0：安装所需的库

下面安装 langgraph 库、langfuse 以及 Hugging Face 的 datasets 库。

%pip install langfuse langchain langgraph langchain_openai langchain_community langchain_huggingface -q

步骤 1：设置环境变量

通过注册 Litefuse 云服务或自托管 Litefuse，获取你的 Litefuse API key。

import os
 
# 在项目设置页获取你的项目密钥：https://litefuse.cloud
os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-***" 
os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-***" 
os.environ["LANGFUSE_BASE_URL"] = "https://litefuse.cloud"
 
# 你的 OpenAI key
os.environ["OPENAI_API_KEY"] = "sk-proj-***"

环境变量设置完成后，就可以初始化 Langfuse 客户端。get_client() 会使用环境变量中提供的凭据初始化 Langfuse 客户端。

from langfuse import get_client
 
langfuse = get_client()
 
# Verify connection
if langfuse.auth_check():
    print("Langfuse client is authenticated and ready!")
else:
    print("Authentication failed. Please check your credentials and host.")

Langfuse client is authenticated and ready!

步骤 2：测试 instrumentation

这里是一个简单的问答 agent。我们运行它来确认 instrumentation 工作正常。如果一切配置正确，你会在 observability dashboard 中看到日志/span。

from typing import Annotated
 
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage
from typing_extensions import TypedDict
 
from langgraph.graph import StateGraph
from langgraph.graph.message import add_messages
 
class State(TypedDict):
    # Messages have the type "list". The `add_messages` function in the annotation defines how this state key should be updated
    # (in this case, it appends messages to the list, rather than overwriting them)
    messages: Annotated[list, add_messages]
 
graph_builder = StateGraph(State)
 
llm = ChatOpenAI(model = "gpt-4o", temperature = 0.2)
 
# The chatbot node function takes the current State as input and returns an updated messages list. This is the basic pattern for all LangGraph node functions.
def chatbot(state: State):
    return {"messages": [llm.invoke(state["messages"])]}
 
# Add a "chatbot" node. Nodes represent units of work. They are typically regular python functions.
graph_builder.add_node("chatbot", chatbot)
 
# Add an entry point. This tells our graph where to start its work each time we run it.
graph_builder.set_entry_point("chatbot")
 
# Set a finish point. This instructs the graph "any time this node is run, you can exit."
graph_builder.set_finish_point("chatbot")
 
# To be able to run our graph, call "compile()" on the graph builder. This creates a "CompiledGraph" we can use invoke on our state.
graph = graph_builder.compile()

from langfuse.langchain import CallbackHandler
 
# Initialize Langfuse CallbackHandler for Langchain (tracing)
langfuse_handler = CallbackHandler()
 
for s in graph.stream(
    {"messages": [HumanMessage(content = "What is Litefuse?")]},
    config={"callbacks": [langfuse_handler]}):
    print(s)

打开你的 Litefuse Traces Dashboard 确认 span 和日志已记录下来。

Litefuse 中的示例 trace：

Litefuse 中的示例 trace

trace 链接

步骤 3：观察并评估更复杂的 agent

确认 instrumentation 工作之后，我们尝试一个更复杂的查询，看看更进一步的指标（token 用量、延迟、成本等）是如何被跟踪的。

import os
from typing import TypedDict, List, Dict, Any, Optional
from langgraph.graph import StateGraph, START, END
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage

class EmailState(TypedDict):
    email: Dict[str, Any]           
    is_spam: Optional[bool]         
    spam_reason: Optional[str]      
    email_category: Optional[str]   
    draft_response: Optional[str]   
    messages: List[Dict[str, Any]]

# Initialize LLM
model = ChatOpenAI( model="gpt-4o",temperature=0)
 
class EmailState(TypedDict):
    email: Dict[str, Any]
    is_spam: Optional[bool]
    draft_response: Optional[str]
    messages: List[Dict[str, Any]]
 
# Define nodes
def read_email(state: EmailState):
    email = state["email"]
    print(f"Alfred is processing an email from {email['sender']} with subject: {email['subject']}")
    return {}
 
def classify_email(state: EmailState):
    email = state["email"]
    
    prompt = f"""
As Alfred the butler of Mr wayne and it's SECRET identity Batman, analyze this email and determine if it is spam or legitimate and should be brought to Mr wayne's attention.
 
Email:
From: {email['sender']}
Subject: {email['subject']}
Body: {email['body']}
 
First, determine if this email is spam.
answer with SPAM or HAM if it's legitimate. Only return the answer
Answer :
    """
    messages = [HumanMessage(content=prompt)]
    response = model.invoke(messages)
    
    response_text = response.content.lower()
    print(response_text)
    is_spam = "spam" in response_text and "ham" not in response_text
    
    if not is_spam:
        new_messages = state.get("messages", []) + [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": response.content}
        ]
    else :
        new_messages = state.get("messages", [])
    
    return {
        "is_spam": is_spam,
        "messages": new_messages
    }
 
def handle_spam(state: EmailState):
    print(f"Alfred has marked the email as spam.")
    print("The email has been moved to the spam folder.")
    return {}
 
def drafting_response(state: EmailState):
    email = state["email"]
    
    prompt = f"""
As Alfred the butler, draft a polite preliminary response to this email.
 
Email:
From: {email['sender']}
Subject: {email['subject']}
Body: {email['body']}
 
Draft a brief, professional response that Mr. Wayne can review and personalize before sending.
    """
    
    messages = [HumanMessage(content=prompt)]
    response = model.invoke(messages)
    
    new_messages = state.get("messages", []) + [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": response.content}
    ]
    
    return {
        "draft_response": response.content,
        "messages": new_messages
    }
 
def notify_mr_wayne(state: EmailState):
    email = state["email"]
    
    print("\n" + "="*50)
    print(f"Sir, you've received an email from {email['sender']}.")
    print(f"Subject: {email['subject']}")
    print("\nI've prepared a draft response for your review:")
    print("-"*50)
    print(state["draft_response"])
    print("="*50 + "\n")
    
    return {}
 
# Define routing logic
def route_email(state: EmailState) -> str:
    if state["is_spam"]:
        return "spam"
    else:
        return "legitimate"
 
# Create the graph
email_graph = StateGraph(EmailState)
 
# Add nodes
email_graph.add_node("read_email", read_email) # the read_email node executes the read_mail function
email_graph.add_node("classify_email", classify_email) # the classify_email node will execute the classify_email function
email_graph.add_node("handle_spam", handle_spam) #same logic 
email_graph.add_node("drafting_response", drafting_response) #same logic
email_graph.add_node("notify_mr_wayne", notify_mr_wayne) # same logic

# Add edges
email_graph.add_edge(START, "read_email") # After starting we go to the "read_email" node
 
email_graph.add_edge("read_email", "classify_email") # after_reading we classify
 
# Add conditional edges
email_graph.add_conditional_edges(
    "classify_email", # after classify, we run the "route_email" function"
    route_email,
    {
        "spam": "handle_spam", # if it return "Spam", we go the "handle_span" node
        "legitimate": "drafting_response" # and if it's legitimate, we go to the "drafting response" node
    }
)
 
# Add final edges
email_graph.add_edge("handle_spam", END) # after handling spam we always end
email_graph.add_edge("drafting_response", "notify_mr_wayne")
email_graph.add_edge("notify_mr_wayne", END) # after notifyinf Me wayne, we can end  too

# Compile the graph
compiled_graph = email_graph.compile()

 # Example emails for testing
legitimate_email = {
    "sender": "Joker",
    "subject": "Found you Batman ! ",
    "body": "Mr. Wayne,I found your secret identity ! I know you're batman ! Ther's no denying it, I have proof of that and I'm coming to find you soon. I'll get my revenge. JOKER"
}
 
spam_email = {
    "sender": "Crypto bro",
    "subject": "The best investment of 2025",
    "body": "Mr Wayne, I just launched an ALT coin and want you to buy some !"
}

from langfuse.langchain import CallbackHandler
 
# Initialize Langfuse CallbackHandler for Langchain (tracing)
langfuse_handler = CallbackHandler()
 
# Process legitimate email
print("\nProcessing legitimate email...")
legitimate_result = compiled_graph.invoke(
    input = {
        "email": legitimate_email,
        "is_spam": None,
        "draft_response": None,
        "messages": []
        },
    config={"callbacks": [langfuse_handler]}
)
 
# Process spam email
print("\nProcessing spam email...")
spam_result = compiled_graph.invoke(
    input = {
        "email": spam_email,
        "is_spam": None,
        "draft_response": None,
        "messages": []
        },
    config={"callbacks": [langfuse_handler]}
)

Trace 结构

Litefuse 记录一个 trace，其中包含若干 span，分别对应 agent 逻辑的每一步。在这里，trace 包含整个 agent 运行以及如下子 span：

工具调用（get_weather）
LLM 调用（使用 ‘gpt-4o’ 的 Responses API）

你可以查看这些 span，精确了解时间花在哪里、用了多少 token 等等：

Litefuse 中的 trace 树

trace 链接

在线评估

在线评估指的是在真实环境中对 agent 进行评估，即在生产中的实际使用过程中持续监控 agent 在真实用户交互上的表现并分析结果。

生产中常见的跟踪指标

成本 —— instrumentation 会捕获 token 用量，你可以按每 token 的价格换算成大致成本。
延迟 —— 观察每一步或整次运行所花的时间。
用户反馈 —— 用户可以直接给出反馈（点赞/点踩），帮助优化或纠正 agent。
LLM-as-a-Judge —— 用另一个 LLM 近实时地评估 agent 的输出（例如检查毒性或正确性）。

下面分别给出这些指标的示例。

1. 成本

下面截图展示了 gpt-4o 调用的用量。这有助于发现高成本步骤并优化 agent。

trace 链接

2. 延迟

我们也能看到每个步骤的耗时。在下面的示例中，整个运行约耗时 3 秒，你可以按步骤拆解，找到瓶颈并优化 agent。

trace 链接

3. 用户反馈

如果 agent 嵌入了某个用户界面，可以记录用户的直接反馈（比如聊天 UI 中的点赞/点踩）。

from langfuse import get_client
 
langfuse = get_client()
 
# Option 1: Use the yielded span object from the context manager
with langfuse.start_as_current_observation(
    as_type="span",
    name="langgraph-request") as span:
    # ... LangGraph execution ...
 
    # Score using the span object
    span.score_trace(
        name="user-feedback",
        value=1,
        data_type="NUMERIC",
        comment="This was correct, thank you"
    )
 
# Option 2: Use langfuse.score_current_trace() if still in context
with langfuse.start_as_current_observation(as_type="span", name="langgraph-request") as span:
    # ... LangGraph execution ...
 
    # Score using current context
    langfuse.score_current_trace(
        name="user-feedback",
        value=1,
        data_type="NUMERIC"
    )
 
# Option 3: Use create_score() with trace ID (when outside context)
langfuse.create_score(
    trace_id="predefined-trace-id", # Needs to be a valid trace id format (see docs)
    name="user-feedback",
    value=1,
    data_type="NUMERIC",
    comment="This was correct, thank you"
)

随后用户反馈就会出现在 Litefuse 中：

Litefuse 中正在采集的用户反馈

4. 自动化的 LLM-as-a-Judge 打分

LLM-as-a-Judge 是自动评估 agent 输出的另一种方法。你可以设置一次单独的 LLM 调用，对输出的正确性、毒性、风格或任何你关心的标准进行判定。

工作流程：

你定义一个 Evaluation Template，例如 “Check if the text is toxic.”
你设定用作判定模型的 model，本例使用 gpt-4o-mini。
agent 每次产出输出时，将该输出连同模板传给 “judge” LLM。
judge LLM 给出评分或标签，并记录到你的 observability 工具中。

Litefuse 中的示例：

LLM-as-a-Judge Evaluation Template LLM-as-a-Judge Evaluator

# Process spam email
print("\nProcessing spam email...")
spam_result = compiled_graph.invoke(
    input = {
        "email": spam_email,
        "is_spam": None,
        "draft_response": None,
        "messages": []
        },
    config={"callbacks": [langfuse_handler]}
)

Processing spam email… Alfred is processing an email from Crypto bro with subject: The best investment of 2025 spam Alfred has marked the email as spam. The email has been moved to the spam folder. 可以看到本例中的回答被判定为 “not toxic”。

LLM-as-a-Judge Evaluation Score

5. 可观测性指标概览

所有这些指标都可以在 dashboard 中一起可视化。这能让你快速看到 agent 在大量 session 上的表现，方便长期跟踪质量指标。

可观测性指标概览

离线评估

在线评估对于实时反馈是必需的，但你还需要 离线评估 —— 在开发前或开发中进行系统化检查。这有助于在将变更推到生产前保证质量与可靠性。

数据集评估

在离线评估中，通常做法是：

准备一个基准数据集（包含 prompt 与期望输出对）
用 agent 在该数据集上运行
把输出与期望结果进行比较，或者使用额外的评分机制

下面我们用 q&a-dataset 演示这种做法，该数据集包含问题和期望答案。

import pandas as pd
from datasets import load_dataset
 
# Fetch search-dataset from Hugging Face
dataset = load_dataset("junzhang1207/search-dataset", split = "train")
df = pd.DataFrame(dataset)
print("First few rows of search-dataset:")
print(df.head())

接下来在 Litefuse 中创建一个 dataset 实体来跟踪运行结果，然后把数据集中的每一项加进去。

from langfuse import Langfuse
langfuse = Langfuse()
 
langfuse_dataset_name = "qa-dataset_langgraph-agent"
 
# Create a dataset in Langfuse
langfuse.create_dataset(
    name=langfuse_dataset_name,
    description="q&a dataset uploaded from Hugging Face",
    metadata={
        "date": "2025-03-21",
        "type": "benchmark"
    }
)

df_30 = df.sample(30) # For this example, we upload only 30 dataset questions
 
for idx, row in df_30.iterrows():
    langfuse.create_dataset_item(
        dataset_name=langfuse_dataset_name,
        input={"text": row["question"]},
        expected_output={"text": row["expected_answer"]}
    )

Litefuse 中的 dataset 条目

在数据集上运行 agent

首先，我们定义一个 task 函数 my_task()，用来包装我们的 LangGraph agent。

from langfuse import get_client
from langfuse.langchain import CallbackHandler
from langchain_openai import ChatOpenAI
from langgraph.graph.message import add_messages
from langgraph.graph import StateGraph
from langchain_core.messages import HumanMessage
from typing import Annotated
from typing_extensions import TypedDict
 
# Initialize Langfuse CallbackHandler for Langchain
langfuse = get_client()
langfuse_handler = CallbackHandler()
 
# Define the task function we pass to the experiment runner method
def my_task(*, item, **kwargs):
    
    question = item.input["text"]
    
    class State(TypedDict):
        messages: Annotated[list, add_messages]
 
    graph_builder = StateGraph(State)
    llm = ChatOpenAI(model = "gpt-4o")
 
    def chatbot(state: State):
        return {"messages": [llm.invoke(state["messages"])]}
    
    graph_builder.add_node("chatbot", chatbot)
    graph_builder.set_entry_point("chatbot")
    graph_builder.set_finish_point("chatbot")
    graph = graph_builder.compile()
 
    response = graph.invoke(
        input={"messages": [HumanMessage(content=question)]},
        config={"callbacks": [langfuse_handler]}
        )
        
    print(question)
    print(response["messages"][1].content)
 
    return response["messages"][1].content

最后，使用 experiment runner SDK 针对每个数据集项运行 task 函数。experiment runner 会处理并发执行、自动 trace 和评估。

# Fetch dataset and run experiment
dataset = langfuse.get_dataset('qa-dataset_langgraph-agent')
 
result = dataset.run_experiment(
    name="run_gpt-4o",
    description="My first run",
    task=my_task,
    metadata={"model": "gpt-4o"}
)
 
# Flush the langfuse client to ensure all data is sent to the server at the end of the experiment run
langfuse.flush()
print(result.format())

你可以替换不同的 agent 配置重复这个过程，例如：

模型（gpt-5.1、gpt-5-mini 等）
prompt
工具（带搜索 vs. 不带搜索）
agent 复杂度（多 agent vs. 单 agent）

然后在 Litefuse 中并排比较。本示例中我在 30 道数据集问题上运行了 agent 3 次，每次使用不同的 OpenAI 模型。可以看到使用更大模型时正确回答的题数有所提升（符合预期）。correct_answer 分数由 LLM-as-a-Judge Evaluator 生成，它会根据数据集中给定的样例答案判定问题的正确性。

Dataset run 概览 Dataset run 对比

指南 - 构建意图分类 pipeline Example Llm Security Monitoring

这个页面对你有帮助吗？

支持