# 📚 API Reference
## 🧭 Navigation Quick Links

- 🧠 Core Engines: ModelInferenceEngine, AutoSuggestionEngine, TestDatasetGenerator
- 📂 Registry & Storage: ModelRegistry
- 📊 Evaluation & Scoring: ScoringStrategies
- 🛠️ Utilities & CLI: Command Line Interface
- 🔧 Advanced Usage Patterns, 📋 Error Handling & Exceptions, 🔍 Performance & Optimization
## 🧠 Core Engines
### ModelInferenceEngine

**Core evaluation engine for running prompts against LLMs**

📍 **Location**: `llm_evaluation_framework.model_inference_engine`
#### **Class Definition**

```python
class ModelInferenceEngine:
    """
    Core engine for evaluating Large Language Models (LLMs).

    Handles prompt execution, response collection, metrics calculation,
    and cost analysis across different model providers.
    """
```
#### **Constructor**

**Parameters:**

- `model_registry` *(ModelRegistry)*: Instance containing registered model configurations

**Example:**

```python
from llm_evaluation_framework import ModelRegistry, ModelInferenceEngine

registry = ModelRegistry()
registry.register("gpt-3.5-turbo", {
    "provider": "openai",
    "api_cost_input": 0.001,
    "api_cost_output": 0.002,
    "capabilities": ["reasoning", "creativity"]
})

engine = ModelInferenceEngine(registry)
```
#### **Methods**

##### `evaluate_model()`

```python
def evaluate_model(
    self,
    model_id: str,
    test_cases: List[Dict[str, Any]],
    use_case_requirements: Dict[str, Any]
) -> Dict[str, Any]
```

**Description:** Evaluates a model against test cases and returns comprehensive results.

**Parameters:**

- `model_id` *(str)*: Registered model identifier
- `test_cases` *(List[Dict])*: Test cases with `id`, `type`, `prompt`, `evaluation_criteria`
- `use_case_requirements` *(Dict)*: Requirements including `max_response_time`, `budget`, `required_capabilities`

**Returns:** *(Dict)* - Evaluation results with structure:

```python
{
    "model_id": str,
    "test_results": List[Dict],
    "aggregate_metrics": {
        "total_cost": float,
        "avg_response_time": float,
        "success_rate": float,
        "overall_score": float
    },
    "timestamp": str,
    "metadata": Dict
}
```
**Complete Example:**

```python
# Define test cases
test_cases = [
    {
        "id": "tc001",
        "type": "reasoning",
        "prompt": "If it takes 5 machines 5 minutes to make 5 widgets, how long would it take 100 machines to make 100 widgets?",
        "expected_output": "5 minutes",
        "evaluation_criteria": {
            "accuracy": 0.8,
            "reasoning_quality": 0.7
        }
    },
    {
        "id": "tc002",
        "type": "creativity",
        "prompt": "Write a haiku about machine learning",
        "evaluation_criteria": {
            "creativity": 0.9,
            "format_compliance": 1.0
        }
    }
]

# Define requirements
requirements = {
    "max_response_time": 30.0,  # seconds
    "budget": 0.10,             # USD
    "required_capabilities": ["reasoning", "creativity"],
    "quality_threshold": 0.7
}

# Run evaluation
results = engine.evaluate_model("gpt-3.5-turbo", test_cases, requirements)

print(f"Overall Score: {results['aggregate_metrics']['overall_score']}")
print(f"Total Cost: ${results['aggregate_metrics']['total_cost']:.4f}")
```
##### `evaluate_multiple_models()`

```python
def evaluate_multiple_models(
    self,
    model_ids: List[str],
    test_cases: List[Dict[str, Any]],
    use_case_requirements: Dict[str, Any]
) -> Dict[str, Dict[str, Any]]
```

**Description:** Evaluates multiple models against the same test cases for comparison.

**Parameters:**

- `model_ids` *(List[str])*: List of registered model identifiers
- `test_cases` *(List[Dict])*: Shared test cases for all models
- `use_case_requirements` *(Dict)*: Shared requirements for all models

**Returns:** *(Dict)* - Results keyed by model ID

**Example:**

```python
models = ["gpt-3.5-turbo", "gpt-4", "claude-3"]
comparative_results = engine.evaluate_multiple_models(models, test_cases, requirements)

for model_id, results in comparative_results.items():
    score = results['aggregate_metrics']['overall_score']
    cost = results['aggregate_metrics']['total_cost']
    print(f"{model_id}: Score={score:.3f}, Cost=${cost:.4f}")
```
### AutoSuggestionEngine

**Intelligent model recommendation based on evaluation results**

📍 **Location**: `llm_evaluation_framework.auto_suggestion_engine`

#### **Class Definition**

```python
class AutoSuggestionEngine:
    """
    Intelligent recommendation engine for model selection.

    Analyzes evaluation results and suggests optimal models based on
    performance metrics, cost constraints, and use case requirements.
    """
```

#### **Constructor**

**Parameters:**

- `model_registry` *(ModelRegistry)*: Instance for accessing model metadata
#### **Methods**

##### `suggest_model()`

```python
def suggest_model(
    self,
    evaluation_results: List[Dict[str, Any]],
    use_case_requirements: Dict[str, Any]
) -> List[Dict[str, Any]]
```

**Description:** Suggests best models based on evaluation results and requirements.

**Parameters:**

- `evaluation_results` *(List[Dict])*: Results from ModelInferenceEngine
- `use_case_requirements` *(Dict)*: Prioritization weights and constraints

**Returns:** *(List[Dict])* - Ranked recommendations with structure:

```python
[
    {
        "model_id": str,
        "recommendation_score": float,
        "strengths": List[str],
        "weaknesses": List[str],
        "cost_analysis": Dict,
        "performance_summary": Dict,
        "confidence": float
    }
]
```
**Complete Example:**

```python
from llm_evaluation_framework import AutoSuggestionEngine

# Initialize suggestion engine
suggestion_engine = AutoSuggestionEngine(registry)

# Define prioritization requirements
requirements = {
    "weights": {
        "accuracy": 0.4,        # 40% weight on accuracy
        "cost": 0.3,            # 30% weight on cost efficiency
        "response_time": 0.2,   # 20% weight on speed
        "creativity": 0.1       # 10% weight on creativity
    },
    "constraints": {
        "max_cost_per_request": 0.01,
        "max_response_time": 15.0,
        "min_accuracy": 0.7
    },
    "use_case": "content_generation"
}

# Get recommendations (using the previous evaluation results as a list)
recommendations = suggestion_engine.suggest_model(
    list(comparative_results.values()), requirements
)

# Display top recommendation
top_choice = recommendations[0]
print(f"🏆 Recommended Model: {top_choice['model_id']}")
print(f"📊 Recommendation Score: {top_choice['recommendation_score']:.3f}")
print(f"💪 Strengths: {', '.join(top_choice['strengths'])}")
print(f"⚠️ Considerations: {', '.join(top_choice['weaknesses'])}")
```

##### `compare_models()`

**Description:** Provides detailed comparison analysis between models.

**Example:**
### TestDatasetGenerator

**Synthetic dataset generation for comprehensive evaluation**

📍 **Location**: `llm_evaluation_framework.test_dataset_generator`

#### **Class Definition**

```python
class TestDatasetGenerator:
    """
    Generates synthetic test datasets for model evaluation.

    Creates domain-specific prompts, edge cases, and evaluation criteria
    to ensure comprehensive model testing across various scenarios.
    """
```
#### **Methods**

##### `generate_test_cases()`

```python
def generate_test_cases(
    self,
    use_case_requirements: Dict[str, Any],
    num_cases: int = 10
) -> List[Dict[str, Any]]
```

**Description:** Generates test cases tailored to specific use case requirements.

**Parameters:**

- `use_case_requirements` *(Dict)*: Requirements including `domain`, `required_capabilities`, `difficulty_level`
- `num_cases` *(int)*: Number of test cases to generate (default: 10)

**Returns:** *(List[Dict])* - Generated test cases with structure:

```python
[
    {
        "id": str,
        "type": str,
        "prompt": str,
        "expected_output": str,
        "evaluation_criteria": Dict[str, float],
        "difficulty": str,
        "domain": str,
        "metadata": Dict
    }
]
```
**Complete Example:**

```python
from llm_evaluation_framework import TestDatasetGenerator

generator = TestDatasetGenerator()

# Define generation requirements
requirements = {
    "domain": "customer_service",
    "required_capabilities": ["reasoning", "empathy", "problem_solving"],
    "difficulty_levels": ["easy", "medium", "hard"],
    "topics": ["product_issues", "billing_questions", "technical_support"],
    "response_types": ["informative", "empathetic", "solution_oriented"]
}

# Generate test cases
test_cases = generator.generate_test_cases(requirements, num_cases=15)

print(f"Generated {len(test_cases)} test cases:")
for case in test_cases[:3]:  # Show first 3
    print(f"  📝 {case['id']}: {case['type']} ({case['difficulty']})")
    print(f"     Prompt: {case['prompt'][:80]}...")
```
##### `generate_edge_cases()`

```python
def generate_edge_cases(
    self,
    base_requirements: Dict[str, Any],
    edge_case_types: List[str]
) -> List[Dict[str, Any]]
```

**Description:** Generates edge cases to test model robustness.

**Parameters:**

- `base_requirements` *(Dict)*: Base requirements for generation
- `edge_case_types` *(List[str])*: Types like `["empty_input", "very_long_input", "multilingual", "ambiguous"]`

**Example:**

```python
edge_cases = generator.generate_edge_cases(
    requirements,
    edge_case_types=["empty_input", "very_long_input", "ambiguous_prompt"]
)

print(f"Generated {len(edge_cases)} edge cases for robustness testing")
```

##### `load_real_world_dataset()`

**Description:** Loads and formats real-world datasets for evaluation.

**Example:**
## 📂 Registry & Storage

### ModelRegistry

**Centralized model configuration and metadata management**

📍 **Location**: `llm_evaluation_framework.registry.model_registry`

#### **Class Definition**

```python
class ModelRegistry:
    """
    Centralized registry for model configurations and metadata.

    Manages model registration, configuration validation, capability tracking,
    and provides consistent access to model information across the framework.
    """
```

#### **Constructor**

**Parameters:**

- `config_file` *(Optional[str])*: Path to model configuration file
#### **Methods**

##### `register()`

**Description:** Registers a new model with validation.

**Parameters:**

- `name` *(str)*: Unique model identifier
- `config` *(Dict)*: Model configuration including provider, costs, capabilities

**Configuration Schema:**

```python
{
    "provider": str,                # e.g., "openai", "anthropic", "azure"
    "api_endpoint": Optional[str],  # Custom endpoint URL
    "api_key_env": Optional[str],   # Environment variable for API key
    "api_cost_input": float,        # Cost per 1K input tokens
    "api_cost_output": float,       # Cost per 1K output tokens
    "max_tokens": int,              # Maximum tokens per request
    "capabilities": List[str],      # Model capabilities
    "rate_limits": Dict[str, int],  # Rate limiting configuration
    "context_window": int,          # Maximum context window size
    "metadata": Dict[str, Any]      # Additional metadata
}
```
**Complete Example:**

```python
from llm_evaluation_framework.registry import ModelRegistry

registry = ModelRegistry()

# Register OpenAI model
openai_config = {
    "provider": "openai",
    "api_key_env": "OPENAI_API_KEY",
    "api_cost_input": 0.001,   # $0.001 per 1K input tokens
    "api_cost_output": 0.002,  # $0.002 per 1K output tokens
    "max_tokens": 4096,
    "capabilities": ["reasoning", "creativity", "code_generation"],
    "rate_limits": {
        "requests_per_minute": 3500,
        "tokens_per_minute": 90000
    },
    "context_window": 16385,
    "metadata": {
        "version": "gpt-3.5-turbo-0125",
        "training_cutoff": "2023-09",
        "multimodal": False
    }
}

success = registry.register("gpt-3.5-turbo", openai_config)
print(f"Registration successful: {success}")

# Register Anthropic model
anthropic_config = {
    "provider": "anthropic",
    "api_key_env": "ANTHROPIC_API_KEY",
    "api_cost_input": 0.00325,
    "api_cost_output": 0.01625,
    "max_tokens": 4096,
    "capabilities": ["reasoning", "analysis", "creative_writing"],
    "context_window": 200000,
    "metadata": {
        "version": "claude-3-haiku-20240307",
        "safety_filtered": True
    }
}

registry.register("claude-3-haiku", anthropic_config)
```
##### `list_models()`

**Description:** Lists registered models with optional filtering.

**Example:**

```python
# List all models
all_models = registry.list_models()
print(f"Registered models: {all_models}")

# Filter by capability
reasoning_models = registry.list_models(filter_by={"capabilities": "reasoning"})
print(f"Reasoning capable models: {reasoning_models}")

# Filter by cost range
budget_models = registry.list_models(filter_by={"max_input_cost": 0.002})
```
##### `get_model()`

**Description:** Retrieves complete model configuration.

**Example:**

```python
model_config = registry.get_model("gpt-3.5-turbo")

print(f"Provider: {model_config['provider']}")
print(f"Capabilities: {model_config['capabilities']}")
print(f"Input cost: ${model_config['api_cost_input']}/1K tokens")
```

##### `update_model()`

**Description:** Updates existing model configuration.

##### `validate_model()`

**Description:** Validates model configuration and connectivity.
## 📊 Evaluation & Scoring

### ScoringStrategies

**Comprehensive evaluation metrics and scoring algorithms**

📍 **Location**: `llm_evaluation_framework.evaluation.scoring_strategies`

#### **Available Scoring Strategies**

##### `AccuracyScorer`

```python
class AccuracyScorer:
    """Exact match accuracy scoring for factual responses."""

    def score(self, predicted: str, reference: str) -> float
```

**Example:**

```python
from llm_evaluation_framework.evaluation.scoring_strategies import AccuracyScorer

scorer = AccuracyScorer()
score = scorer.score("Paris", "Paris")   # Returns 1.0
score = scorer.score("London", "Paris")  # Returns 0.0
```
##### `SemanticSimilarityScorer`

```python
class SemanticSimilarityScorer:
    """Semantic similarity scoring using embeddings."""

    def score(self, predicted: str, reference: str) -> float
```

**Example:**

```python
from llm_evaluation_framework.evaluation.scoring_strategies import SemanticSimilarityScorer

scorer = SemanticSimilarityScorer(model="sentence-transformers/all-MiniLM-L6-v2")
score = scorer.score("The sky is blue", "Blue is the color of the sky")  # Returns ~0.85
```
##### `BLEUScorer`

```python
class BLEUScorer:
    """BLEU score for translation and text generation quality."""

    def score(self, predicted: str, reference: str, n_gram: int = 4) -> float
```
##### `ROUGEScorer`

```python
class ROUGEScorer:
    """ROUGE score for summarization quality."""

    def score(self, predicted: str, reference: str, rouge_type: str = "rouge-l") -> float
```
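A short usage sketch built from the signatures above, assuming both constructors take no required arguments:

```python
from llm_evaluation_framework.evaluation.scoring_strategies import BLEUScorer, ROUGEScorer

bleu = BLEUScorer()    # assumption: no constructor arguments required
rouge = ROUGEScorer()  # assumption: no constructor arguments required

predicted = "The cat sat quietly on the warm mat."
reference = "The cat sat on the mat."

bleu_score = bleu.score(predicted, reference, n_gram=4)
rouge_score = rouge.score(predicted, reference, rouge_type="rouge-l")

print(f"BLEU: {bleu_score:.3f}, ROUGE-L: {rouge_score:.3f}")
```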
##### `CustomScorer`

```python
class CustomScorer:
    """Base class for implementing custom scoring strategies."""

    def score(self, predicted: str, reference: str, **kwargs) -> float:
        """Implement custom scoring logic."""
        raise NotImplementedError
```

**Custom Scorer Example:**

```python
class DomainSpecificScorer(CustomScorer):
    """Custom scorer for domain-specific evaluation."""

    def __init__(self, domain_keywords: List[str]):
        self.domain_keywords = domain_keywords

    def score(self, predicted: str, reference: str, **kwargs) -> float:
        # Weighted blend of semantic similarity and domain keyword coverage
        keyword_match_score = self._calculate_keyword_match(predicted)
        semantic_score = self._calculate_semantic_similarity(predicted, reference)
        return 0.6 * semantic_score + 0.4 * keyword_match_score

    def _calculate_keyword_match(self, text: str) -> float:
        # Fraction of domain keywords that appear in the response
        text_lower = text.lower()
        hits = sum(1 for kw in self.domain_keywords if kw.lower() in text_lower)
        return hits / len(self.domain_keywords) if self.domain_keywords else 0.0

    def _calculate_semantic_similarity(self, pred: str, ref: str) -> float:
        # One option: reuse the framework's SemanticSimilarityScorer
        return SemanticSimilarityScorer().score(pred, ref)

# Usage
custom_scorer = DomainSpecificScorer(["finance", "investment", "portfolio"])
predicted_response = "Diversify your portfolio across asset classes to reduce investment risk."
reference_response = "A diversified portfolio lowers overall risk."
score = custom_scorer.score(predicted_response, reference_response)
```
## 🛠️ Utilities & CLI

### Command Line Interface

**Powerful command-line interface for evaluation workflows**

📍 **Location**: `llm_evaluation_framework.cli`

#### **CLI Commands Overview**

##### `evaluate`

**Options:**

- `--model, -m`: Model ID to evaluate
- `--test-cases, -t`: Path to test cases file
- `--requirements, -r`: Path to requirements file
- `--output, -o`: Output file for results
- `--format`: Output format (json, yaml, csv)
- `--verbose, -v`: Verbose logging

```bash
python -m llm_evaluation_framework.cli evaluate \
    --model gpt-3.5-turbo \
    --test-cases test_cases.json \
    --requirements requirements.yaml \
    --output results.json
```
##### `compare`

```bash
python -m llm_evaluation_framework.cli compare \
    --models gpt-3.5-turbo gpt-4 claude-3 \
    --test-cases test_cases.json \
    --output comparison.html \
    --format html
```
##### `suggest`

```bash
python -m llm_evaluation_framework.cli suggest \
    --evaluation-results results.json \
    --requirements requirements.yaml \
    --top-n 3
```
##### `generate-dataset`

```bash
python -m llm_evaluation_framework.cli generate-dataset \
    --domain customer_service \
    --num-cases 50 \
    --difficulty mixed \
    --output generated_test_cases.json
```
##### `register-model`

```bash
python -m llm_evaluation_framework.cli register-model \
    --name custom-model \
    --config model_config.yaml
```
##### `list-models`

```bash
python -m llm_evaluation_framework.cli list-models \
    --filter-by capabilities:reasoning \
    --format table
```

#### **Configuration Files**
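The exact contents of `test_cases.json` and `requirements.yaml` are not fixed here; the sketches below are inferred from the test-case and requirements structures used earlier in this reference, so treat the keys as assumptions rather than a fixed schema.

**Test Cases Format** (`test_cases.json`), a sketch:

```json
[
  {
    "id": "tc001",
    "type": "reasoning",
    "prompt": "If it takes 5 machines 5 minutes to make 5 widgets, how long would it take 100 machines to make 100 widgets?",
    "expected_output": "5 minutes",
    "evaluation_criteria": {
      "accuracy": 0.8,
      "reasoning_quality": 0.7
    }
  }
]
```

**Requirements Format** (`requirements.yaml`), a sketch:

```yaml
max_response_time: 30.0   # seconds
budget: 0.10              # USD
required_capabilities:
  - reasoning
  - creativity
quality_threshold: 0.7
```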
## 🔧 Advanced Usage Patterns

### Async Evaluation

```python
import asyncio

from llm_evaluation_framework.engines import AsyncInferenceEngine

async def run_parallel_evaluation():
    """Example of running evaluations in parallel for multiple models."""
    engine = AsyncInferenceEngine(registry)

    # Evaluate multiple models concurrently
    tasks = []
    models = ["gpt-3.5-turbo", "gpt-4", "claude-3"]

    for model_id in models:
        task = engine.evaluate_model_async(model_id, test_cases, requirements)
        tasks.append(task)

    # Wait for all evaluations to complete
    results = await asyncio.gather(*tasks)

    # Process results
    for model_id, result in zip(models, results):
        print(f"{model_id}: {result['aggregate_metrics']['overall_score']:.3f}")

# Run async evaluation
asyncio.run(run_parallel_evaluation())
```
### Custom Pipeline

```python
class CustomEvaluationPipeline:
    """Example of building a custom evaluation pipeline."""

    def __init__(self, registry: ModelRegistry):
        self.registry = registry
        self.inference_engine = ModelInferenceEngine(registry)
        self.suggestion_engine = AutoSuggestionEngine(registry)
        self.dataset_generator = TestDatasetGenerator()

    def run_complete_evaluation(
        self,
        use_case_requirements: Dict[str, Any],
        models_to_test: List[str]
    ) -> Dict[str, Any]:
        """Run complete evaluation pipeline."""
        # Step 1: Generate test dataset
        test_cases = self.dataset_generator.generate_test_cases(
            use_case_requirements,
            num_cases=20
        )

        # Step 2: Evaluate all models
        all_results = {}
        for model_id in models_to_test:
            results = self.inference_engine.evaluate_model(
                model_id, test_cases, use_case_requirements
            )
            all_results[model_id] = results

        # Step 3: Get recommendations
        evaluation_results = list(all_results.values())
        recommendations = self.suggestion_engine.suggest_model(
            evaluation_results, use_case_requirements
        )

        return {
            "test_cases": test_cases,
            "evaluation_results": all_results,
            "recommendations": recommendations,
            "summary": self._generate_summary(all_results, recommendations)
        }

    def _generate_summary(self, results, recommendations):
        """Generate evaluation summary."""
        return {
            "models_tested": len(results),
            "best_model": recommendations[0]["model_id"],
            "total_cost": sum(r["aggregate_metrics"]["total_cost"] for r in results.values()),
            "avg_score": sum(r["aggregate_metrics"]["overall_score"] for r in results.values()) / len(results)
        }

# Usage
pipeline = CustomEvaluationPipeline(registry)
results = pipeline.run_complete_evaluation(requirements, models_to_test)
```
## 📋 Error Handling & Exceptions

### Common Exceptions

```python
from llm_evaluation_framework.utils.error_handler import (
    ModelNotFoundError,
    InvalidConfigurationError,
    EvaluationTimeoutError,
    InsufficientFundsError,
    RateLimitExceededError
)

try:
    results = engine.evaluate_model("unknown-model", test_cases, requirements)
except ModelNotFoundError as e:
    print(f"Model not registered: {e}")
except InvalidConfigurationError as e:
    print(f"Configuration error: {e}")
except EvaluationTimeoutError as e:
    print(f"Evaluation timed out: {e}")
except InsufficientFundsError as e:
    print(f"Budget exceeded: {e}")
except RateLimitExceededError as e:
    print(f"Rate limit hit: {e}")
```
### Graceful Error Handling

```python
from llm_evaluation_framework.utils.error_handler import handle_evaluation_errors

@handle_evaluation_errors
def safe_evaluation(model_id: str, test_cases: List[Dict]) -> Optional[Dict]:
    """Evaluation with automatic error handling and retry logic."""
    return engine.evaluate_model(model_id, test_cases, requirements)

# Usage with error handling
result = safe_evaluation("gpt-3.5-turbo", test_cases)
if result:
    print("Evaluation successful")
else:
    print("Evaluation failed after retries")
```
## 🔍 Performance & Optimization

### Performance Best Practices

#### **Batch Processing**

```python
# Efficient batch evaluation
results = engine.evaluate_multiple_models(
    model_ids=["gpt-3.5-turbo", "gpt-4"],
    test_cases=test_cases,
    use_case_requirements=requirements,
    batch_size=10  # Process in batches to manage memory
)
```
#### **Caching**

```python
from llm_evaluation_framework.utils.cache import enable_caching

# Enable result caching to avoid re-evaluation
enable_caching(cache_dir="./evaluation_cache", ttl_hours=24)

# Subsequent identical evaluations will use cached results
results = engine.evaluate_model("gpt-3.5-turbo", test_cases, requirements)
```
#### **Resource Management**

```python
from llm_evaluation_framework.utils.monitoring import ResourceMonitor

# Monitor resource usage during evaluation
with ResourceMonitor() as monitor:
    results = engine.evaluate_model("gpt-4", large_test_cases, requirements)

print(f"Peak memory usage: {monitor.peak_memory_mb:.2f} MB")
print(f"Execution time: {monitor.execution_time:.2f} seconds")
```
## 🎯 Quick Reference Summary

**Essential imports for getting started:**

```python
# Core components
from llm_evaluation_framework import (
    ModelRegistry,
    ModelInferenceEngine,
    AutoSuggestionEngine,
    TestDatasetGenerator
)

# Evaluation utilities
from llm_evaluation_framework.evaluation.scoring_strategies import (
    AccuracyScorer,
    SemanticSimilarityScorer,
    BLEUScorer
)

# Persistence and utilities
from llm_evaluation_framework.persistence import PersistenceManager
from llm_evaluation_framework.utils.logger import setup_logging
```

---

### 📚 **Related Documentation**

- [Getting Started](getting-started.md)
- [Examples](categories/examples.md)
- [Advanced Usage](categories/advanced-usage.md)
- [Developer Guide](categories/developer-guide.md)

---

**🤝 Questions or need help? [Open an issue](https://github.com/isathish/LLMEvaluationFramework/issues) or check our [discussion forums](https://github.com/isathish/LLMEvaluationFramework/discussions)**

*Updated: API Reference v1.0.0 | Complete documentation coverage*