Access Agents & Criteria
Go to the Agents & Criteria section in the Beam AI Evaluation Framework.
Select the workspace relevant to the agent you’re evaluating.
Select the Agent and Workflow
Choose the agent you are evaluating.
Identify the specific workflow associated with the test cases.
Define Steps for Each Workflow
Within each workflow, you’ll see individual steps representing discrete tasks the agent must complete.
Each step has its own evaluation criteria to measure the agent’s performance on that task.
Set Evaluation Criteria for Each Step
Click on a step to view or edit its evaluation details.
Define the Evaluation Technique:
Choose an appropriate technique based on the evaluation needs, such as scoring based on expected output.
Choose the Check Against criteria:
This determines the basis for comparison, such as using Expected Output (a precise, correct answer) or Prompt (a general template with placeholders).
Best Practices for Choosing Between Expected Output and Prompt
Use Expected Output:
When the agent’s response must be exact or very specific.
For tasks where there is a single, correct answer or a tightly defined result.
Example scenarios include data extraction tasks, where a particular piece of information (like a number or identifier) must be extracted correctly.
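For instance, a hypothetical invoice-extraction step might use the expected output “Invoice number: INV-10234” (illustrative value only), so the evaluation checks whether the agent returned exactly that identifier.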
Use Prompt:
When flexibility is needed in the response and the output can vary within an acceptable structure.
For tasks where the format is more important than the exact wording, such as generating responses with a specific structure.
This approach is suitable when responses may contain varying details but should still meet a consistent template.
Writing Clear and Effective Expected Outputs and Prompts
Expected Outputs:
Write expected outputs to be as precise as possible, detailing exactly what the agent should return.
Avoid ambiguity to ensure that the agent’s response can be evaluated accurately against a clear standard.
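For example, a vague expected output such as “the approved amount” is hard to score against, whereas a precise one such as “Approved amount: 49.99 USD” (hypothetical value) gives the evaluation an unambiguous standard.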
Prompts:
Write prompts with placeholders for variable elements, focusing on the structure and critical components rather than specific wording.
Use clear labels in placeholders to specify what type of information should go there (e.g., <Customer Name> or <Order Number>).
Ensure the prompt covers all essential parts of the response, so even with flexible wording, the agent meets the required structure.
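As an illustration, a hypothetical order-confirmation prompt might read: “Dear <Customer Name>, your order <Order Number> has been received and will ship by <Delivery Date>.” The agent’s exact wording can vary, as long as every labeled element appears in this structure.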
Review and Finalize Criteria
Ensure each step’s criteria are concise, consistent, and aligned with the desired agent behavior.
Save any changes to finalize the evaluation criteria.