assessment evaluation metrics quality Evaluation System 12-Tier Comprehensive Evaluation
Industry-leading evaluation with 100+ metrics across 400+ modules
Enterprise Evaluation
Part of 237 enterprise modules with real-time monitoring, offline analysis, and continuous improvement. See Enterprise Documentation .
Real-time
Online Monitoring
Comprehensive
Offline Analysis
Overview The Agentic AI Framework provides a comprehensive 12-tier evaluation system that covers every aspect of AI agent performanceβfrom basic model quality to enterprise compliance and cost optimization.
Python from agenticaiframework.evaluation import (
EvaluationManager ,
ModelQualityEvaluator ,
TaskPerformanceEvaluator ,
ToolEffectivenessEvaluator ,
MemoryRAGEvaluator ,
AutonomyPerformanceEvaluator ,
SecurityRiskEvaluator ,
CostQualityEvaluator ,
HumanBusinessEvaluator ,
DriftDetector ,
ABTestingFramework ,
CanaryDeployment ,
WorkflowAnalytics
)
# Initialize comprehensive evaluation
evaluation_manager = EvaluationManager (
enable_all_tiers = True ,
real_time_monitoring = True ,
alert_thresholds = "production"
)
12-Tier Evaluation Architecture Tier 1 Model Quality Accuracy, latency, hallucination detection, factual grounding
Tier 2 Task Performance Completion rates, efficiency, step optimization
Tier 3 Tool Effectiveness Tool usage, error rates, selection accuracy
Tier 4 Memory & RAG Retrieval quality, context utilization, recall rates
Tier 5 Autonomy Decision quality, escalation appropriateness
Tier 6 Security & Risk Threat detection, compliance, vulnerability scanning
Tier 7 Cost & Quality Token optimization, ROI analysis, resource efficiency
Tier 8 Human Alignment User satisfaction, business value, feedback loops
Tier 9 Drift Detection Behavior drift, performance degradation, model decay
Tier 10 A/B Testing Experiment management, statistical significance
Tier 11 Canary Deployment Gradual rollout, risk mitigation, rollback triggers
Tier 12 Workflow Analytics End-to-end analysis, bottleneck detection, optimization
Tier 1: Model Quality Evaluation Comprehensive assessment of LLM response quality, accuracy, and reliability.
Model Quality Evaluator Python import logging
logger = logging . getLogger ( __name__ )
from agenticaiframework.evaluation import ModelQualityEvaluator
evaluator = ModelQualityEvaluator (
metrics = [ "accuracy" ,
"response_latency" ,
"hallucination_rate" ,
"factual_grounding" ,
"coherence" ,
"relevance" ,
"completeness"
],
ground_truth_source = "knowledge_base" ,
hallucination_detector = "advanced"
)
# Evaluate a response
result = await evaluator . evaluate (
prompt = "What are the benefits of solar energy?" ,
response = agent_response ,
context = retrieved_documents
)
logger . info ( f "Accuracy: { result . accuracy : .2% } " )
logger . info ( f "Hallucination Score: { result . hallucination_score : .2f } " )
logger . info ( f "Latency: { result . latency_ms } ms" )
logger . info ( f "Factual Grounding: { result . factual_grounding : .2% } " )
Hallucination Detection Python import logging
logger = logging . getLogger ( __name__ )
from agenticaiframework.evaluation.model_quality import HallucinationDetector
detector = HallucinationDetector (
detection_methods = [ "semantic_similarity" ,
"knowledge_base_verification" ,
"self_consistency" ,
"source_attribution"
],
threshold = 0.8 ,
alert_on_detection = True
)
# Check for hallucinations
analysis = await detector . analyze (
response = agent_response ,
source_documents = documents ,
knowledge_base = kb
)
if analysis . has_hallucination :
logger . info ( f "Detected hallucinations: { analysis . hallucinated_segments } " )
logger . info ( f "Confidence: { analysis . confidence : .2% } " )
logger . info ( f "Suggested corrections: { analysis . corrections } " )
Response Quality Scoring Python import logging
logger = logging . getLogger ( __name__ )
from agenticaiframework.evaluation.model_quality import QualityScorer
scorer = QualityScorer (
dimensions = [ "helpfulness" ,
"harmlessness" ,
"honesty" ,
"specificity" ,
"clarity" ,
"actionability"
],
model = "gpt-4" , # Judge model
use_human_feedback = True
)
# Score response quality
scores = await scorer . score (
query = "How do I fix this bug in my code?" ,
response = agent_response ,
user_intent = "debugging_help"
)
logger . info ( f "Overall Quality: { scores . overall : .2f } /5.0" )
for dimension , score in scores . dimensions . items ():
logger . info ( f " { dimension } : { score : .2f } /5.0" )
Measure how effectively agents complete their assigned tasks.
Python import logging
logger = logging . getLogger ( __name__ )
from agenticaiframework.evaluation import TaskPerformanceEvaluator
evaluator = TaskPerformanceEvaluator (
metrics = [ "completion_rate" ,
"step_efficiency" ,
"time_to_completion" ,
"error_recovery_rate" ,
"subtask_accuracy" ,
"goal_achievement"
],
track_intermediate_steps = True
)
# Evaluate task performance
task_result = await evaluator . evaluate_task (
task = task ,
agent = agent ,
expected_outcome = ground_truth
)
logger . info ( f "Completion Rate: { task_result . completion_rate : .2% } " )
logger . info ( f "Steps Used: { task_result . steps_used } / { task_result . optimal_steps } " )
logger . info ( f "Efficiency Score: { task_result . efficiency : .2% } " )
logger . info ( f "Time: { task_result . execution_time } s" )
Step Analysis Python import logging
logger = logging . getLogger ( __name__ )
from agenticaiframework.evaluation.task_tool import StepAnalyzer
analyzer = StepAnalyzer (
detect_loops = True ,
identify_bottlenecks = True ,
suggest_optimizations = True
)
# Analyze task execution steps
analysis = await analyzer . analyze ( task_execution )
logger . info ( f "Total Steps: { analysis . total_steps } " )
logger . info ( f "Redundant Steps: { analysis . redundant_steps } " )
logger . info ( f "Loops Detected: { len ( analysis . loops ) } " )
logger . info ( f "Bottlenecks: { analysis . bottlenecks } " )
logger . info ( f "Optimization Suggestions: { analysis . suggestions } " )
Task Complexity Assessment Python import logging
logger = logging . getLogger ( __name__ )
from agenticaiframework.evaluation.task_tool import ComplexityAssessor
assessor = ComplexityAssessor ()
# Assess task complexity
complexity = await assessor . assess ( task )
logger . info ( f "Complexity Score: { complexity . score } /10" )
logger . info ( f "Estimated Steps: { complexity . estimated_steps } " )
logger . info ( f "Skill Requirements: { complexity . required_skills } " )
logger . info ( f "Risk Level: { complexity . risk_level } " )
logger . info ( f "Recommended Agent Tier: { complexity . recommended_tier } " )
Analyze how well agents utilize tools and make optimal tool selections.
Python import logging
logger = logging . getLogger ( __name__ )
from agenticaiframework.evaluation import ToolEffectivenessEvaluator
evaluator = ToolEffectivenessEvaluator (
metrics = [ "selection_accuracy" ,
"usage_efficiency" ,
"error_rate" ,
"retry_rate" ,
"parameter_accuracy" ,
"outcome_success"
]
)
# Evaluate tool usage
tool_stats = await evaluator . evaluate (
agent = agent ,
task_executions = executions
)
logger . info ( f "Tool Selection Accuracy: { tool_stats . selection_accuracy : .2% } " )
logger . info ( f "Average Error Rate: { tool_stats . error_rate : .2% } " )
logger . info ( f "Retry Rate: { tool_stats . retry_rate : .2% } " )
# Per-tool breakdown
for tool_name , stats in tool_stats . per_tool . items ():
logger . info ( f " \n { tool_name } :" )
logger . info ( f " Usage Count: { stats . usage_count } " )
logger . info ( f " Success Rate: { stats . success_rate : .2% } " )
logger . info ( f " Avg Latency: { stats . avg_latency_ms } ms" )
Python import logging
logger = logging . getLogger ( __name__ )
from agenticaiframework.evaluation.task_tool import ToolOptimizer
optimizer = ToolOptimizer (
analyze_patterns = True ,
suggest_alternatives = True
)
# Analyze and optimize tool usage
optimization = await optimizer . analyze ( tool_usage_logs )
logger . info ( f "Optimization Opportunities: { len ( optimization . opportunities ) } " )
for opp in optimization . opportunities :
logger . info ( f " - Replace { opp . current_tool } with { opp . suggested_tool } " )
logger . info ( f " Expected improvement: { opp . improvement : .2% } " )
Tier 4: Memory & RAG Evaluation Evaluate memory systems and retrieval-augmented generation quality.
Memory RAG Evaluator Python import logging
logger = logging . getLogger ( __name__ )
from agenticaiframework.evaluation import MemoryRAGEvaluator
evaluator = MemoryRAGEvaluator (
metrics = [ "retrieval_precision" ,
"retrieval_recall" ,
"context_relevance" ,
"answer_faithfulness" ,
"memory_utilization" ,
"knowledge_coverage"
]
)
# Evaluate RAG performance
rag_result = await evaluator . evaluate (
query = "What was discussed in yesterday's meeting?" ,
retrieved_documents = documents ,
generated_response = response ,
ground_truth = actual_meeting_notes
)
logger . info ( f "Retrieval Precision: { rag_result . precision : .2% } " )
logger . info ( f "Retrieval Recall: { rag_result . recall : .2% } " )
logger . info ( f "Answer Faithfulness: { rag_result . faithfulness : .2% } " )
logger . info ( f "Context Relevance: { rag_result . relevance : .2% } " )
Python import logging
logger = logging . getLogger ( __name__ )
from agenticaiframework.evaluation.memory_rag import MemoryAnalyzer
analyzer = MemoryAnalyzer (
memory_types = [ "episodic" , "semantic" , "procedural" , "working" ]
)
# Analyze memory performance
analysis = await analyzer . analyze ( memory_manager )
logger . info ( f "Memory Utilization: { analysis . utilization : .2% } " )
logger . info ( f "Hit Rate: { analysis . hit_rate : .2% } " )
logger . info ( f "Average Retrieval Time: { analysis . avg_retrieval_ms } ms" )
for memory_type , stats in analysis . by_type . items ():
logger . info ( f " \n { memory_type . title () } Memory:" )
logger . info ( f " Size: { stats . size } items" )
logger . info ( f " Access Frequency: { stats . access_frequency } " )
logger . info ( f " Decay Rate: { stats . decay_rate : .4f } " )
Retrieval Quality Benchmarking Python import logging
logger = logging . getLogger ( __name__ )
from agenticaiframework.evaluation.memory_rag import RetrievalBenchmark
benchmark = RetrievalBenchmark (
datasets = [ "squad" , "natural_questions" , "custom_qa" ],
metrics = [ "mrr" , "map" , "ndcg" , "hit_rate" ]
)
# Run retrieval benchmark
results = await benchmark . run (
retriever = knowledge_base . retriever ,
num_samples = 1000
)
logger . info ( f "MRR@10: { results . mrr_10 : .4f } " )
logger . info ( f "MAP: { results . map : .4f } " )
logger . info ( f "NDCG@10: { results . ndcg_10 : .4f } " )
logger . info ( f "Hit Rate@10: { results . hit_rate_10 : .2% } " )
Assess agent decision-making quality and appropriate escalation behavior.
Autonomy Evaluator Python import logging
logger = logging . getLogger ( __name__ )
from agenticaiframework.evaluation import AutonomyPerformanceEvaluator
evaluator = AutonomyPerformanceEvaluator (
metrics = [ "decision_accuracy" ,
"escalation_appropriateness" ,
"independence_score" ,
"judgment_quality" ,
"self_correction_rate"
]
)
# Evaluate autonomy performance
autonomy = await evaluator . evaluate (
agent = agent ,
decisions = decision_logs
)
logger . info ( f "Decision Accuracy: { autonomy . decision_accuracy : .2% } " )
logger . info ( f "Escalation Appropriateness: { autonomy . escalation_score : .2% } " )
logger . info ( f "Independence Score: { autonomy . independence : .2% } " )
logger . info ( f "Self-Correction Rate: { autonomy . self_correction : .2% } " )
Escalation Analysis Python import logging
logger = logging . getLogger ( __name__ )
from agenticaiframework.evaluation.autonomy_performance import EscalationAnalyzer
analyzer = EscalationAnalyzer (
ideal_escalation_rate = 0.15 ,
analyze_patterns = True
)
# Analyze escalation behavior
escalation = await analyzer . analyze ( agent_logs )
logger . info ( f "Escalation Rate: { escalation . rate : .2% } " )
logger . info ( f "Appropriate Escalations: { escalation . appropriate : .2% } " )
logger . info ( f "Missed Escalations: { escalation . missed_count } " )
logger . info ( f "Unnecessary Escalations: { escalation . unnecessary_count } " )
logger . info ( f "Top Escalation Reasons: { escalation . top_reasons } " )
Decision Quality Scoring Python import logging
logger = logging . getLogger ( __name__ )
from agenticaiframework.evaluation.autonomy_performance import DecisionScorer
scorer = DecisionScorer (
evaluation_model = "gpt-4" ,
scoring_rubric = "comprehensive"
)
# Score agent decisions
for decision in agent_decisions :
score = await scorer . score ( decision )
logger . info ( f "Decision: { decision . summary } " )
logger . info ( f " Quality Score: { score . quality : .2f } /5.0" )
logger . info ( f " Risk Assessment: { score . risk_level } " )
logger . info ( f " Confidence Calibration: { score . confidence_calibration : .2% } " )
Tier 6: Security & Risk Evaluation Comprehensive security assessment and risk monitoring.
Security Risk Evaluator Python import logging
logger = logging . getLogger ( __name__ )
from agenticaiframework.evaluation import SecurityRiskEvaluator
evaluator = SecurityRiskEvaluator (
checks = [ "prompt_injection" ,
"jailbreak_attempts" ,
"data_leakage" ,
"authorization_bypass" ,
"input_sanitization" ,
"output_filtering"
],
severity_threshold = "medium"
)
# Evaluate security posture
security = await evaluator . evaluate (
agent = agent ,
interactions = interaction_logs
)
logger . info ( f "Security Score: { security . overall_score : .2f } /100" )
logger . info ( f "Vulnerabilities Found: { len ( security . vulnerabilities ) } " )
logger . info ( f "High Severity: { security . high_severity_count } " )
logger . info ( f "Medium Severity: { security . medium_severity_count } " )
for vuln in security . vulnerabilities :
logger . info ( f " \n { vuln . type } : { vuln . description } " )
logger . info ( f " Severity: { vuln . severity } " )
logger . info ( f " Remediation: { vuln . remediation } " )
Threat Detection Python import logging
logger = logging . getLogger ( __name__ )
from agenticaiframework.evaluation.security_risk import ThreatDetector
detector = ThreatDetector (
real_time = True ,
patterns = [ "prompt_injection" ,
"data_exfiltration" ,
"privilege_escalation" ,
"denial_of_service"
],
alert_webhook = "https://security.company.com/alerts"
)
# Enable real-time threat detection
detector . start_monitoring ( agent )
# Check detected threats
threats = await detector . get_threats (
time_range = "last_24h" ,
severity = "high"
)
for threat in threats :
logger . info ( f " { threat . type } : { threat . description } " )
logger . info ( f " Time: { threat . timestamp } " )
logger . info ( f " Source: { threat . source } " )
logger . info ( f " Mitigated: { threat . mitigated } " )
Compliance Verification Python import logging
logger = logging . getLogger ( __name__ )
from agenticaiframework.evaluation.security_risk import ComplianceChecker
checker = ComplianceChecker (
frameworks = [ "SOC2" , "GDPR" , "HIPAA" , "ISO27001" ],
continuous_monitoring = True
)
# Verify compliance
compliance = await checker . verify ( agent_system )
logger . info ( f "Overall Compliance: { compliance . overall_score : .2% } " )
for framework in compliance . frameworks :
logger . info ( f " \n { framework . name } :" )
logger . info ( f " Score: { framework . score : .2% } " )
logger . info ( f " Controls Passed: { framework . passed } / { framework . total } " )
for issue in framework . issues :
logger . info ( f " { issue . control } : { issue . description } " )
Tier 7: Cost & Quality Optimization Balance cost efficiency with quality outcomes.
Cost Quality Evaluator Python import logging
logger = logging . getLogger ( __name__ )
from agenticaiframework.evaluation import CostQualityEvaluator
evaluator = CostQualityEvaluator (
cost_tracking = True ,
quality_threshold = 0.85 ,
budget_alerts = True
)
# Evaluate cost-quality tradeoffs
analysis = await evaluator . evaluate (
agent = agent ,
time_period = "last_30_days"
)
logger . info ( f "Total Cost: $ { analysis . total_cost : .2f } " )
logger . info ( f "Cost per Task: $ { analysis . cost_per_task : .4f } " )
logger . info ( f "Quality Score: { analysis . quality_score : .2% } " )
logger . info ( f "Cost-Quality Ratio: { analysis . cost_quality_ratio : .4f } " )
logger . info ( f "Token Efficiency: { analysis . token_efficiency : .2% } " )
Token Usage Optimization Python import logging
logger = logging . getLogger ( __name__ )
from agenticaiframework.evaluation.cost_quality import TokenOptimizer
optimizer = TokenOptimizer (
analyze_waste = True ,
suggest_compression = True ,
model_recommendations = True
)
# Analyze and optimize token usage
optimization = await optimizer . analyze ( usage_logs )
logger . info ( f "Total Tokens Used: { optimization . total_tokens : , } " )
logger . info ( f "Wasted Tokens: { optimization . wasted_tokens : , } ( { optimization . waste_rate : .1% } )" )
logger . info ( f "Potential Savings: $ { optimization . potential_savings : .2f } " )
logger . info ( " \n Optimization Suggestions:" )
for suggestion in optimization . suggestions :
logger . info ( f " - { suggestion . action } " )
logger . info ( f " Impact: { suggestion . token_reduction : , } tokens saved" )
logger . info ( f " Savings: $ { suggestion . cost_savings : .2f } /month" )
Model Tier Optimization Python import logging
logger = logging . getLogger ( __name__ )
from agenticaiframework.evaluation.model_tier import TierOptimizer
optimizer = TierOptimizer (
available_models = [ "gpt-4" , "gpt-4-turbo" , "gpt-3.5-turbo" , "claude-3" ],
quality_requirements = { "min_accuracy" : 0.90 }
)
# Recommend optimal model tiers
recommendations = await optimizer . analyze ( task_history )
logger . info ( "Model Tier Recommendations:" )
for task_type , rec in recommendations . by_task_type . items ():
logger . info ( f " \n { task_type } :" )
logger . info ( f " Current Model: { rec . current_model } " )
logger . info ( f " Recommended: { rec . recommended_model } " )
logger . info ( f " Cost Reduction: { rec . cost_reduction : .1% } " )
logger . info ( f " Quality Impact: { rec . quality_impact : +.1% } " )
Tier 8: Human & Business Alignment Measure user satisfaction and business value generation.
Human Business Evaluator Python import logging
logger = logging . getLogger ( __name__ )
from agenticaiframework.evaluation import HumanBusinessEvaluator
evaluator = HumanBusinessEvaluator (
metrics = [ "user_satisfaction" ,
"nps_score" ,
"task_completion_satisfaction" ,
"business_value" ,
"time_saved" ,
"error_reduction"
],
feedback_collection = True
)
# Evaluate human alignment
alignment = await evaluator . evaluate (
agent = agent ,
user_feedback = feedback_data ,
business_metrics = business_data
)
logger . info ( f "User Satisfaction: { alignment . satisfaction : .2% } " )
logger . info ( f "NPS Score: { alignment . nps } " )
logger . info ( f "Business Value Generated: $ { alignment . business_value : ,.2f } " )
logger . info ( f "Time Saved: { alignment . hours_saved : ,.1f } hours" )
logger . info ( f "Error Reduction: { alignment . error_reduction : .1% } " )
User Satisfaction Tracking Python import logging
logger = logging . getLogger ( __name__ )
from agenticaiframework.evaluation.human_business import SatisfactionTracker
tracker = SatisfactionTracker (
collection_method = "implicit" , # or "explicit"
sentiment_analysis = True ,
trend_detection = True
)
# Track satisfaction over time
satisfaction = await tracker . analyze (
interactions = interaction_logs ,
time_range = "last_90_days"
)
logger . info ( f "Current Satisfaction: { satisfaction . current : .2% } " )
logger . info ( f "Trend: { satisfaction . trend } " ) # rising, falling, stable
logger . info ( f "30-Day Change: { satisfaction . change_30d : +.1% } " )
# Identify satisfaction drivers
logger . info ( " \n Satisfaction Drivers:" )
for driver in satisfaction . drivers :
logger . info ( f " { driver . factor } : { driver . impact : +.2f } " )
Business Value Calculator Python import logging
logger = logging . getLogger ( __name__ )
from agenticaiframework.evaluation.human_business import BusinessValueCalculator
calculator = BusinessValueCalculator (
hourly_rate = 75.0 ,
error_cost = 500.0 ,
include_indirect_benefits = True
)
# Calculate business value
value = await calculator . calculate (
agent = agent ,
period = "quarterly"
)
logger . info ( f "Direct Value:" )
logger . info ( f " Time Saved: $ { value . time_savings : ,.2f } " )
logger . info ( f " Error Prevention: $ { value . error_savings : ,.2f } " )
logger . info ( f " Productivity Gain: $ { value . productivity_gain : ,.2f } " )
logger . info ( f " \n Indirect Value:" )
logger . info ( f " Employee Satisfaction: $ { value . employee_satisfaction : ,.2f } " )
logger . info ( f " Customer Experience: $ { value . customer_experience : ,.2f } " )
logger . info ( f " \n Total Quarterly Value: $ { value . total : ,.2f } " )
logger . info ( f "ROI: { value . roi : .1% } " )
Tier 9: Drift Detection Monitor for behavior drift and performance degradation over time.
Drift Detector Python import logging
logger = logging . getLogger ( __name__ )
from agenticaiframework.evaluation import DriftDetector
detector = DriftDetector (
drift_types = [ "behavior_drift" ,
"performance_degradation" ,
"distribution_shift" ,
"concept_drift" ,
"model_decay"
],
sensitivity = "medium" ,
alert_on_detection = True
)
# Detect drift
drift = await detector . analyze (
current_data = recent_interactions ,
baseline_data = baseline_interactions
)
logger . info ( f "Drift Detected: { drift . detected } " )
logger . info ( f "Drift Score: { drift . score : .4f } " )
logger . info ( f "Drift Type: { drift . drift_type } " )
logger . info ( f "Affected Areas: { drift . affected_areas } " )
logger . info ( f "Recommended Action: { drift . recommendation } " )
Python import logging
logger = logging . getLogger ( __name__ )
from agenticaiframework.evaluation.drift import PerformanceMonitor
monitor = PerformanceMonitor (
metrics = [ "accuracy" , "latency" , "success_rate" , "user_satisfaction" ],
baseline_period = "first_30_days" ,
alert_thresholds = {
"accuracy_drop" : 0.05 ,
"latency_increase" : 0.20 ,
"success_rate_drop" : 0.10
}
)
# Start continuous monitoring
monitor . start ()
# Check current performance vs baseline
comparison = await monitor . compare_to_baseline ()
logger . info ( f "Accuracy Change: { comparison . accuracy_change : +.2% } " )
logger . info ( f "Latency Change: { comparison . latency_change : +.2% } " )
logger . info ( f "Success Rate Change: { comparison . success_rate_change : +.2% } " )
for alert in comparison . alerts :
logger . info ( f " { alert . metric } : { alert . message } " )
Concept Drift Analyzer Python import logging
logger = logging . getLogger ( __name__ )
from agenticaiframework.evaluation.drift import ConceptDriftAnalyzer
analyzer = ConceptDriftAnalyzer (
detection_methods = [ "adwin" , "page_hinkley" , "ddm" ],
window_size = 1000
)
# Analyze for concept drift
analysis = await analyzer . analyze (
feature_stream = feature_data ,
label_stream = label_data
)
logger . info ( f "Concept Drift Detected: { analysis . detected } " )
logger . info ( f "Drift Point: { analysis . drift_point } " )
logger . info ( f "Severity: { analysis . severity } " )
logger . info ( f "Affected Features: { analysis . affected_features } " )
logger . info ( f "Recommendation: { analysis . recommendation } " )
Tier 10: A/B Testing Framework Experiment with different configurations and measure impact.
A/B Testing Framework Python import logging
logger = logging . getLogger ( __name__ )
from agenticaiframework.evaluation import ABTestingFramework
ab_framework = ABTestingFramework (
traffic_allocation = "even" ,
statistical_engine = "bayesian" ,
min_sample_size = 1000
)
# Create experiment
experiment = await ab_framework . create_experiment (
name = "prompt_optimization_v2" ,
description = "Testing new prompt template" ,
variants = {
"control" : { "prompt_template" : "original_template" },
"treatment_a" : { "prompt_template" : "template_v2" },
"treatment_b" : { "prompt_template" : "template_v3" }
},
metrics = [ "success_rate" , "user_satisfaction" , "latency" ],
primary_metric = "success_rate"
)
# Start experiment
await experiment . start ()
# Check results
results = await experiment . get_results ()
logger . info ( f "Experiment: { results . name } " )
logger . info ( f "Status: { results . status } " )
logger . info ( f "Sample Size: { results . total_samples } " )
logger . info ( f "Duration: { results . duration } " )
for variant , data in results . variants . items ():
logger . info ( f " \n { variant } :" )
logger . info ( f " Success Rate: { data . success_rate : .2% } " )
logger . info ( f " Improvement: { data . improvement : +.2% } " )
logger . info ( f " Statistical Significance: { data . significance : .2% } " )
Multi-Armed Bandit Optimization Python import logging
logger = logging . getLogger ( __name__ )
from agenticaiframework.evaluation.ab_testing import MultiArmedBandit
bandit = MultiArmedBandit (
algorithm = "thompson_sampling" ,
arms = [{ "name" : "model_a" , "config" : { "model" : "gpt-4" }},
{ "name" : "model_b" , "config" : { "model" : "gpt-4-turbo" }},
{ "name" : "model_c" , "config" : { "model" : "claude-3" }}
],
optimization_metric = "quality_per_dollar"
)
# Start optimization
bandit . start ()
# Get current allocation
allocation = await bandit . get_allocation ()
logger . info ( f "Current Traffic Allocation: { allocation } " )
# Get performance summary
summary = await bandit . get_summary ()
for arm , stats in summary . items ():
logger . info ( f " \n { arm } :" )
logger . info ( f " Trials: { stats . trials } " )
logger . info ( f " Success Rate: { stats . success_rate : .2% } " )
logger . info ( f " Confidence: { stats . confidence : .2% } " )
Experiment Analytics Python import logging
logger = logging . getLogger ( __name__ )
from agenticaiframework.evaluation.ab_testing import ExperimentAnalytics
analytics = ExperimentAnalytics ()
# Get experiment history
history = await analytics . get_history (
time_range = "last_6_months" ,
status = "completed"
)
logger . info ( f "Total Experiments: { history . total } " )
logger . info ( f "Successful: { history . successful } ( { history . success_rate : .1% } )" )
logger . info ( f "Average Lift: { history . average_lift : +.2% } " )
# Get learnings
learnings = await analytics . get_learnings ()
for learning in learnings :
logger . info ( f " \n { learning . insight } " )
logger . info ( f " Confidence: { learning . confidence : .2% } " )
logger . info ( f " Based on: { learning . experiments } experiments" )
Tier 11: Canary Deployment Safely deploy changes with gradual rollout and automatic rollback.
Canary Deployment Manager Python import logging
logger = logging . getLogger ( __name__ )
from agenticaiframework.evaluation import CanaryDeployment
canary = CanaryDeployment (
rollout_strategy = "gradual" ,
initial_percentage = 5 ,
increment = 10 ,
increment_interval_minutes = 30 ,
auto_rollback = True
)
# Deploy new version
deployment = await canary . deploy (
name = "agent_v2.1" ,
new_config = new_agent_config ,
baseline_config = current_agent_config ,
success_criteria = {
"error_rate" : { "max" : 0.02 },
"latency_p99" : { "max" : 2000 },
"success_rate" : { "min" : 0.95 }
}
)
# Monitor deployment
status = await deployment . get_status ()
logger . info ( f "Deployment: { status . name } " )
logger . info ( f "Phase: { status . phase } " )
logger . info ( f "Traffic Percentage: { status . traffic_percentage } %" )
logger . info ( f "Health: { status . health } " )
# Metrics comparison
logger . info ( " \n Metrics Comparison:" )
logger . info ( f " Error Rate: { status . canary_error_rate : .3% } vs { status . baseline_error_rate : .3% } " )
logger . info ( f " Latency P99: { status . canary_latency } ms vs { status . baseline_latency } ms" )
logger . info ( f " Success Rate: { status . canary_success : .2% } vs { status . baseline_success : .2% } " )
Rollback Management Python import logging
logger = logging . getLogger ( __name__ )
from agenticaiframework.evaluation.canary import RollbackManager
rollback_manager = RollbackManager (
automatic = True ,
triggers = [{ "metric" : "error_rate" , "threshold" : 0.05 , "window" : "5m" },
{ "metric" : "latency_p99" , "threshold" : 3000 , "window" : "5m" },
{ "metric" : "success_rate" , "threshold" : 0.90 , "comparison" : "lt" }
]
)
# Configure rollback behavior
rollback_manager . configure (
notification_channels = [ "slack" , "pagerduty" ],
preserve_logs = True ,
post_rollback_analysis = True
)
# Check rollback history
history = await rollback_manager . get_history ()
for rollback in history :
logger . info ( f " \n { rollback . deployment } " )
logger . info ( f " Time: { rollback . timestamp } " )
logger . info ( f " Trigger: { rollback . trigger_reason } " )
logger . info ( f " Recovery Time: { rollback . recovery_time_seconds } s" )
Deployment Analytics Python import logging
logger = logging . getLogger ( __name__ )
from agenticaiframework.evaluation.canary import DeploymentAnalytics
analytics = DeploymentAnalytics ()
# Get deployment metrics
metrics = await analytics . get_metrics ( time_range = "last_30_days" )
logger . info ( f "Deployments: { metrics . total } " )
logger . info ( f "Successful: { metrics . successful } ( { metrics . success_rate : .1% } )" )
logger . info ( f "Rolled Back: { metrics . rolled_back } " )
logger . info ( f "Mean Time to Deploy: { metrics . mttr_minutes : .1f } minutes" )
logger . info ( f "Mean Time to Rollback: { metrics . mttr_minutes : .1f } minutes" )
Tier 12: Workflow Analytics End-to-end analysis of multi-agent workflows and processes.
Workflow Analytics Engine Python import logging
logger = logging . getLogger ( __name__ )
from agenticaiframework.evaluation import WorkflowAnalytics
analytics = WorkflowAnalytics (
metrics = [ "end_to_end_latency" ,
"step_efficiency" ,
"handoff_quality" ,
"bottleneck_detection" ,
"failure_analysis" ,
"resource_utilization"
]
)
# Analyze workflow performance
analysis = await analytics . analyze (
workflow = workflow ,
time_range = "last_7_days"
)
logger . info ( f "Total Executions: { analysis . total_executions } " )
logger . info ( f "Success Rate: { analysis . success_rate : .2% } " )
logger . info ( f "Average Duration: { analysis . avg_duration_seconds } s" )
logger . info ( f "P99 Duration: { analysis . p99_duration_seconds } s" )
# Bottleneck analysis
logger . info ( " \n Bottlenecks:" )
for bottleneck in analysis . bottlenecks :
logger . info ( f " { bottleneck . step } : { bottleneck . avg_wait_time } s wait time" )
logger . info ( f " Impact: { bottleneck . impact_percentage : .1% } of total time" )
Python import logging
logger = logging . getLogger ( __name__ )
from agenticaiframework.evaluation.workflow import TeamAnalyzer
analyzer = TeamAnalyzer ()
# Analyze team performance
team_analysis = await analyzer . analyze ( team )
logger . info ( f "Team Efficiency: { team_analysis . efficiency : .2% } " )
logger . info ( f "Collaboration Score: { team_analysis . collaboration_score : .2f } /5.0" )
logger . info ( " \n Agent Performance:" )
for agent_stats in team_analysis . agents :
logger . info ( f " \n { agent_stats . name } :" )
logger . info ( f " Tasks Completed: { agent_stats . tasks_completed } " )
logger . info ( f " Success Rate: { agent_stats . success_rate : .2% } " )
logger . info ( f " Avg Response Time: { agent_stats . avg_response_time } s" )
logger . info ( f " Contribution: { agent_stats . contribution : .1% } " )
Process Mining Python import logging
logger = logging . getLogger ( __name__ )
from agenticaiframework.evaluation.workflow import ProcessMiner
miner = ProcessMiner (
discovery_algorithm = "inductive" ,
conformance_checking = True
)
# Mine workflow patterns
patterns = await miner . mine ( execution_logs )
logger . info ( f "Discovered Patterns: { len ( patterns . patterns ) } " )
logger . info ( f "Process Conformance: { patterns . conformance : .2% } " )
for pattern in patterns . patterns :
logger . info ( f " \n { pattern . name } " )
logger . info ( f " Frequency: { pattern . frequency } " )
logger . info ( f " Avg Duration: { pattern . avg_duration } s" )
logger . info ( f " Success Rate: { pattern . success_rate : .2% } " )
# Get optimization recommendations
recommendations = await miner . get_recommendations ()
for rec in recommendations :
logger . info ( f " \n { rec . suggestion } " )
logger . info ( f " Expected Improvement: { rec . improvement : .1% } " )
Online vs Offline Evaluation Online Evaluation (Real-time) Python import logging
logger = logging . getLogger ( __name__ )
from agenticaiframework.evaluation import OnlineEvaluator
online = OnlineEvaluator (
metrics = [ "latency" ,
"success_rate" ,
"user_feedback" ,
"error_rate"
],
sampling_rate = 0.1 , # 10% of traffic
real_time_dashboard = True
)
# Enable online evaluation
online . attach ( agent )
# Get real-time metrics
metrics = await online . get_current_metrics ()
logger . info ( f "Current Success Rate: { metrics . success_rate : .2% } " )
logger . info ( f "Current P50 Latency: { metrics . latency_p50 } ms" )
logger . info ( f "Error Rate (last 5m): { metrics . error_rate_5m : .3% } " )
# Configure alerts
online . set_alerts ({
"error_rate" : { "threshold" : 0.05 , "window" : "5m" },
"latency_p99" : { "threshold" : 3000 , "window" : "1m" },
"success_rate" : { "threshold" : 0.90 , "comparison" : "lt" }
})
Offline Evaluation (Batch) Python import logging
logger = logging . getLogger ( __name__ )
from agenticaiframework.evaluation import OfflineEvaluator
offline = OfflineEvaluator (
datasets = [ "test_set_v2" , "golden_set" , "edge_cases" ],
metrics = [ "accuracy" ,
"f1_score" ,
"hallucination_rate" ,
"task_completion"
],
parallel = True
)
# Run offline evaluation
results = await offline . evaluate (
agent = agent ,
num_samples = 10000
)
logger . info ( f "Accuracy: { results . accuracy : .2% } " )
logger . info ( f "F1 Score: { results . f1_score : .4f } " )
logger . info ( f "Hallucination Rate: { results . hallucination_rate : .2% } " )
logger . info ( f "Task Completion: { results . task_completion : .2% } " )
# Compare with baseline
comparison = await offline . compare (
current_agent = agent ,
baseline_agent = baseline_agent
)
logger . info ( " \n Comparison with Baseline:" )
for metric , diff in comparison . differences . items ():
symbol = "β" if diff > 0 else "β" if diff < 0 else "="
logger . info ( f " { metric } : { diff : +.2% } { symbol } " )
Comprehensive Evaluation Dashboard Setting Up the Dashboard Python import logging
logger = logging . getLogger ( __name__ )
from agenticaiframework.evaluation import EvaluationDashboard
dashboard = EvaluationDashboard (
title = "Agent Performance Dashboard" ,
tiers = "all" , # Include all 12 tiers
refresh_interval = 30 ,
theme = "dark"
)
# Configure dashboard panels
dashboard . add_panel ( "model_quality" , position = "top-left" )
dashboard . add_panel ( "task_performance" , position = "top-right" )
dashboard . add_panel ( "cost_analysis" , position = "bottom-left" )
dashboard . add_panel ( "security_overview" , position = "bottom-right" )
# Start dashboard server
dashboard . serve ( port = 8080 )
logger . info ( "Dashboard available at http://localhost:8080" )
Automated Reporting Python import logging
logger = logging . getLogger ( __name__ )
from agenticaiframework.evaluation import ReportGenerator
generator = ReportGenerator (
tiers = [ "model_quality" , "cost_quality" , "security_risk" ],
format = "pdf" ,
schedule = "weekly"
)
# Generate comprehensive report
report = await generator . generate (
time_range = "last_7_days" ,
include_recommendations = True
)
# Send report
await report . send_to ([ "team@company.com" ,
"stakeholders@company.com"
])
logger . info ( f "Report generated: { report . path } " )
logger . info ( f "Pages: { report . pages } " )
logger . info ( f "Recommendations: { len ( report . recommendations ) } " )
Evaluation Best Practices Start with Baselines Establish baseline metrics before making changes. Always compare new versions against established baselines.
Focus on Business Metrics While technical metrics matter, always tie evaluation back to business value and user satisfaction.
Continuous Evaluation Don't just evaluate once. Set up continuous monitoring to catch drift and degradation early.
Test in Production Use canary deployments and A/B testing to safely validate changes with real traffic.
Evaluation Checklist Python import logging
logger = logging . getLogger ( __name__ )
from agenticaiframework.evaluation import EvaluationChecklist
checklist = EvaluationChecklist ()
# Run pre-deployment checklist
results = await checklist . run_pre_deployment ( agent )
logger . info ( "Pre-Deployment Checklist:" )
for item in results . items :
status = "" if item . passed else ""
logger . info ( f " { status } { item . name } " )
if not item . passed :
logger . info ( f " Reason: { item . failure_reason } " )
logger . info ( f " Fix: { item . suggestion } " )
logger . info ( f " \n Overall: { 'PASS' if results . passed else 'FAIL' } " )
Next Steps Set up comprehensive observability for your agents
Debug and analyze agent execution flows
Implement security best practices
Deploy your agents to production