# Getting Started

## What You'll Achieve

| Milestone | Time | Outcome |
|---|---|---|
| Installation | ~2 minutes | Complete framework installed and verified |
| First evaluation | ~5 minutes | Run your first model evaluation |
| CLI proficiency | ~8 minutes | Master command-line workflows |
| Production ready | ~15 minutes | Understand advanced patterns |

By the end of this guide, you'll be able to:

- Install and configure the framework in any environment
- Register and evaluate your first language model
- Generate synthetic datasets for comprehensive testing
- Run evaluations and analyze detailed results
- Use the CLI for automated evaluation workflows
- Export results in multiple formats for reporting
- Troubleshoot common issues independently
## Prerequisites & System Requirements

### System Requirements & Platform Support
| Component | Minimum | Recommended | Notes |
|---|---|---|---|
| Python | 3.8+ | 3.11+ | Type hints & performance improvements |
| Memory | 4GB RAM | 8GB+ RAM | For large-scale evaluations |
| Storage | 1GB free | 5GB+ free | For datasets and cached results |
| Network | Stable internet | High-speed connection | For API calls to LLM providers |
  
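Before installing, it is worth confirming the interpreter meets the 3.8 minimum from the table above. A minimal stand-alone check (not part of the framework):

```python
import sys

# The framework requires Python 3.8+; fail fast with a clear message otherwise.
MIN_VERSION = (3, 8)

if sys.version_info < MIN_VERSION:
    raise SystemExit(
        f"Python {MIN_VERSION[0]}.{MIN_VERSION[1]}+ required, "
        f"found {sys.version_info.major}.{sys.version_info.minor}"
    )
print(f"Python {sys.version_info.major}.{sys.version_info.minor} detected - OK")
```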
### Development Tools (Optional but Recommended)

## Quick Installation

### Express Installation (Recommended)

The express installation script automatically:

- Creates an isolated Python environment
- Installs the latest stable version
- Configures CLI tools
- Runs verification tests

### Standard Installation Methods

The four methods below cover the common setups: PyPI, development (from source), Docker, and Conda.
#### Method 1: PyPI (Most Common)

# Step 1: Create virtual environment (strongly recommended)
python -m venv llm-eval-env
# Step 2: Activate environment
# On macOS/Linux:
source llm-eval-env/bin/activate
# On Windows PowerShell:
llm-eval-env\Scripts\Activate.ps1
# On Windows Command Prompt:
llm-eval-env\Scripts\activate.bat
# Step 3: Upgrade pip for better dependency resolution
python -m pip install --upgrade pip
# Step 4: Install framework with all components
pip install "llm-evaluation-framework[complete]"
# Step 5: Verify installation
python -c "import llm_evaluation_framework; print('Installation successful!')"
llm-eval --help
#### Method 2: Development Installation

# Step 1: Clone repository
git clone https://github.com/isathish/LLMEvaluationFramework.git
cd LLMEvaluationFramework
# Step 2: Create and activate virtual environment
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# Step 3: Install in development mode with all dependencies
pip install -e ".[dev,test,docs]"
# Step 4: Install pre-commit hooks (for contributors)
pre-commit install
# Step 5: Run test suite to verify installation
pytest --cov=llm_evaluation_framework -v
#### Method 3: Docker Installation

# Step 1: Pull latest image
docker pull ghcr.io/isathish/llm-evaluation-framework:latest
# Step 2: Run interactive container
docker run -it --rm \
-v $(pwd)/data:/app/data \
-v $(pwd)/results:/app/results \
ghcr.io/isathish/llm-evaluation-framework:latest
# Step 3: Verify inside container
llm-eval --version
#### Method 4: Conda Installation

# Step 1: Create conda environment
conda create -n llm-eval python=3.11 -y
conda activate llm-eval
# Step 2: Install from conda-forge
conda install -c conda-forge llm-evaluation-framework
# Alternative: Install via pip in conda environment
pip install llm-evaluation-framework
# Step 3: Verify installation
llm-eval --version
### Installation Verification

Run these commands to ensure everything is working correctly. If all of them run without errors, you're ready to proceed.
# 1. Version check
llm-eval --version
# Expected output: LLM Evaluation Framework v1.0.0

# 2. Core import test
python -c "
from llm_evaluation_framework import (
    ModelRegistry,
    ModelInferenceEngine,
    TestDatasetGenerator,
)
print('Core components imported successfully')
"

# 3. CLI functionality test
llm-eval list --type capabilities
# Expected output: List of available capabilities

# 4. Quick functionality test
python -c "
from llm_evaluation_framework import ModelRegistry
registry = ModelRegistry()
print(f'Registry initialized with {len(registry.list_models())} pre-configured models')
"
## Your First Evaluation

### Step 1: Environment Setup

Create a new directory for your evaluation project:
# Create project directory
mkdir my-llm-evaluation
cd my-llm-evaluation
# Create Python file for our first evaluation
touch first_evaluation.py
### Step 2: Initialize the Framework

**Create `first_evaluation.py`:**
### Step 3: Run Your First Evaluation

Run the script with `python first_evaluation.py`. The expected output follows the listing.

"""
My First LLM Evaluation
=======================
This script demonstrates the complete evaluation workflow:
1. Framework initialization
2. Model registration
3. Test dataset generation
4. Model evaluation
5. Results analysis
"""
from datetime import datetime

from llm_evaluation_framework import (
    ModelRegistry,
    ModelInferenceEngine,
    AutoSuggestionEngine,
    TestDatasetGenerator,
)
from llm_evaluation_framework.persistence import JSONStore


def main():
    """Run complete evaluation workflow."""
    print("LLM Evaluation Framework - First Evaluation")
    print("=" * 55)
    print(f"Started at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

    # Step 1: Initialize core components
    print("\nStep 1: Initializing framework components...")
    registry = ModelRegistry()
    engine = ModelInferenceEngine(registry)
    suggestion_engine = AutoSuggestionEngine(registry)
    dataset_generator = TestDatasetGenerator()
    print("All components initialized successfully!")

    # Step 2: Register models for evaluation
    setup_models(registry)

    # Step 3: Generate test dataset
    test_cases = generate_test_dataset(dataset_generator)

    # Step 4: Run evaluations
    results = run_evaluations(engine, test_cases)

    # Step 5: Get recommendations
    recommendations = get_recommendations(suggestion_engine, results)

    # Step 6: Save and display results
    save_and_display_results(results, recommendations, test_cases)

    print("\nFirst evaluation completed successfully!")
    print("Next steps: Try different models or custom test cases")
def setup_models(registry):
    """Register models for evaluation."""
    print("\nStep 2: Registering evaluation models...")

    # Model configurations
    models = {
        "gpt-3.5-turbo": {
            "provider": "openai",
            "api_cost_input": 0.0015,
            "api_cost_output": 0.002,
            "capabilities": ["reasoning", "creativity", "coding"],
            "max_tokens": 4096,
            "context_window": 16385,
        },
        "claude-3-haiku": {
            "provider": "anthropic",
            "api_cost_input": 0.00325,
            "api_cost_output": 0.01625,
            "capabilities": ["reasoning", "analysis", "creative_writing"],
            "max_tokens": 4096,
            "context_window": 200000,
        },
        "mock-model": {
            "provider": "mock",  # For testing without API costs
            "api_cost_input": 0.0,
            "api_cost_output": 0.0,
            "capabilities": ["reasoning", "creativity"],
            "max_tokens": 2048,
            "context_window": 4096,
        },
    }

    # Register each model
    for model_name, config in models.items():
        success = registry.register_model(model_name, config)
        status = "OK" if success else "FAILED"
        print(f"  {status} {model_name}")

    print(f"Total registered models: {len(registry.list_models())}")
def generate_test_dataset(generator):
    """Generate comprehensive test dataset."""
    print("\nStep 3: Generating test dataset...")

    # Define evaluation requirements
    requirements = {
        "domain": "general_knowledge",
        "required_capabilities": ["reasoning", "creativity"],
        "difficulty_levels": ["easy", "medium", "hard"],
        "test_types": ["factual", "analytical", "creative"],
    }

    # Generate test cases
    test_cases = generator.generate_test_cases(requirements, count=8)

    print(f"Generated {len(test_cases)} test cases:")
    for i, case in enumerate(test_cases, 1):
        print(f"  {i}. {case.get('type', 'unknown').title()}: {case.get('prompt', '')[:60]}...")

    return test_cases
def run_evaluations(engine, test_cases):
    """Run evaluations for all registered models."""
    print("\nStep 4: Running model evaluations...")

    # Evaluation configuration
    evaluation_config = {
        "max_response_time": 30.0,  # seconds
        "budget": 0.50,  # USD
        "quality_threshold": 0.7,
        "required_capabilities": ["reasoning"],
    }

    # Evaluate only the mock model here, to keep the demo free of API costs
    available_models = ["mock-model"]

    results = {}
    for model_id in available_models:
        print(f"  Evaluating {model_id}...")
        try:
            result = engine.evaluate_model(model_id, test_cases, evaluation_config)
            results[model_id] = result

            # Display basic metrics
            metrics = result.get('aggregate_metrics', {})
            print(f"    Score: {metrics.get('overall_score', 0):.3f}")
            print(f"    Cost: ${metrics.get('total_cost', 0):.4f}")
            print(f"    Time: {metrics.get('avg_response_time', 0):.2f}s")
            print(f"    Success Rate: {metrics.get('success_rate', 0):.1%}")
        except Exception as e:
            print(f"    Evaluation failed: {e}")
            continue

    print(f"Completed evaluations for {len(results)} models")
    return results
def get_recommendations(suggestion_engine, results):
    """Get model recommendations based on results."""
    print("\nStep 5: Generating recommendations...")

    if not results:
        print("  No results available for recommendations")
        return []

    # Recommendation requirements
    recommendation_config = {
        "weights": {
            "accuracy": 0.4,
            "cost_efficiency": 0.3,
            "response_time": 0.2,
            "reliability": 0.1,
        },
        "constraints": {
            "max_cost_per_request": 0.05,
            "min_accuracy": 0.6,
        },
    }

    try:
        evaluation_results = list(results.values())
        recommendations = suggestion_engine.suggest_model(
            evaluation_results,
            recommendation_config,
        )
        print(f"Generated {len(recommendations)} recommendations")
        if recommendations:
            top_rec = recommendations[0]
            print(f"  Top recommendation: {top_rec.get('model_id', 'Unknown')}")
            print(f"  Recommendation score: {top_rec.get('recommendation_score', 0):.3f}")
        return recommendations
    except Exception as e:
        print(f"  Recommendation generation failed: {e}")
        return []
def save_and_display_results(results, recommendations, test_cases):
    """Save results and display summary."""
    print("\nStep 6: Saving results and generating summary...")

    # Prepare comprehensive results
    complete_results = {
        "evaluation_metadata": {
            "timestamp": datetime.now().isoformat(),
            "framework_version": "1.0.0",
            "total_test_cases": len(test_cases),
            "models_evaluated": len(results),
        },
        "test_cases": test_cases,
        "evaluation_results": results,
        "recommendations": recommendations,
    }

    # Save to JSON file
    try:
        store = JSONStore("first_evaluation_results.json")
        store.save_evaluation_result(complete_results)
        print("Results saved to 'first_evaluation_results.json'")
    except Exception as e:
        print(f"Failed to save results: {e}")

    # Display final summary
    print("\nEVALUATION SUMMARY")
    print("=" * 25)

    if results:
        for model_id, result in results.items():
            metrics = result.get('aggregate_metrics', {})
            print(f"\n{model_id.upper()}:")
            print(f"  Overall Score: {metrics.get('overall_score', 0):.3f}/1.0")
            print(f"  Total Cost: ${metrics.get('total_cost', 0):.4f}")
            print(f"  Average Response Time: {metrics.get('avg_response_time', 0):.2f}s")
            print(f"  Success Rate: {metrics.get('success_rate', 0):.1%}")
    else:
        print("No evaluation results to display")

    if recommendations:
        print(f"\nTOP RECOMMENDATION: {recommendations[0].get('model_id', 'Unknown')}")
        print(f"Confidence: {recommendations[0].get('recommendation_score', 0):.3f}")


if __name__ == "__main__":
    main()
**Expected Output:**

LLM Evaluation Framework - First Evaluation
=======================================================
Started at: 2024-01-15 14:30:25

Step 1: Initializing framework components...
All components initialized successfully!

Step 2: Registering evaluation models...
  OK gpt-3.5-turbo
  OK claude-3-haiku
  OK mock-model
Total registered models: 3

Step 3: Generating test dataset...
Generated 8 test cases:
  1. Factual: What is the capital of France and what is its pop...
  2. Analytical: Compare the advantages and disadvantages of re...
  3. Creative: Write a short story about a robot learning to...

Step 4: Running model evaluations...
  Evaluating mock-model...
    Score: 0.847
    Cost: $0.0000
    Time: 0.15s
    Success Rate: 100.0%
Completed evaluations for 1 models

Step 5: Generating recommendations...
Generated 1 recommendations
  Top recommendation: mock-model
  Recommendation score: 0.923

Step 6: Saving results and generating summary...
Results saved to 'first_evaluation_results.json'

EVALUATION SUMMARY
=========================

MOCK-MODEL:
  Overall Score: 0.847/1.0
  Total Cost: $0.0000
  Average Response Time: 0.15s
  Success Rate: 100.0%

TOP RECOMMENDATION: mock-model
Confidence: 0.923

First evaluation completed successfully!
Next steps: Try different models or custom test cases
## CLI Quickstart

### Essential CLI Commands

The LLM Evaluation Framework provides a powerful command-line interface for streamlined workflows.

#### Discovery Commands
# View all available commands and options
llm-eval --help
# List available resources
llm-eval list # List all available resources
llm-eval list --type models # List registered models
llm-eval list --type capabilities # List supported capabilities
llm-eval list --type providers # List supported providers
llm-eval list --type scorers # List scoring strategies
# Get detailed help for specific commands
llm-eval evaluate --help
llm-eval generate --help
llm-eval score --help
#### Quick Evaluation Commands

# Generate test dataset quickly
llm-eval generate \
--capability reasoning \
--count 10 \
--domain "customer_service" \
--output test_cases.json \
--format json
# Run evaluation with generated dataset
llm-eval evaluate \
--model gpt-3.5-turbo \
--test-cases test_cases.json \
--budget 0.10 \
--output evaluation_results.json \
--format json \
--verbose
# Score predictions quickly
llm-eval score \
--predictions "The sky is blue" "Paris is a city" \
--references "The sky is blue" "Paris is the capital of France" \
--metric semantic_similarity \
--output scores.json
#### Model Management Commands

# Register a new model
llm-eval register-model \
--name "my-custom-model" \
--provider "custom" \
--config model_config.yaml
# Validate model configuration
llm-eval validate-model --name "gpt-3.5-turbo"
# Update model configuration
llm-eval update-model \
--name "gpt-3.5-turbo" \
--field "api_cost_input" \
--value "0.002"
# Remove model registration
llm-eval remove-model --name "my-custom-model" --confirm
### Complete CLI Workflow Example

Let's walk through a complete evaluation workflow using only CLI commands.
# Step 1: Set up project directory
mkdir cli-evaluation && cd cli-evaluation

# Step 2: Generate comprehensive test dataset
llm-eval generate \
    --capability reasoning creativity coding \
    --count 15 \
    --domain "software_development" \
    --difficulty mixed \
    --output comprehensive_tests.json \
    --verbose

# Step 3: Run evaluation on multiple models
llm-eval evaluate \
    --model gpt-3.5-turbo gpt-4 claude-3 \
    --test-cases comprehensive_tests.json \
    --budget 0.25 \
    --max-time 60 \
    --output evaluation_results.json \
    --format json \
    --parallel \
    --verbose

# Step 4: Get model recommendations
llm-eval recommend \
    --evaluation-results evaluation_results.json \
    --weights accuracy:0.4 cost:0.3 speed:0.2 creativity:0.1 \
    --constraints max_cost:0.05 min_accuracy:0.7 \
    --output recommendations.json \
    --top-n 3

# Step 5: Generate analysis report
llm-eval analyze \
    --evaluation-results evaluation_results.json \
    --recommendations recommendations.json \
    --output analysis_report.html \
    --format html \
    --include-charts

# Step 6: View quick summary
llm-eval summary \
    --evaluation-results evaluation_results.json \
    --format table
**Expected CLI Output:**

EVALUATION SUMMARY
====================
Models Evaluated: 3
Test Cases: 15
Total Cost: $0.18
Total Time: 47.3s

TOP PERFORMERS:
1. gpt-4 (Score: 0.924, Cost: $0.12)
2. claude-3 (Score: 0.891, Cost: $0.04)
3. gpt-3.5-turbo (Score: 0.856, Cost: $0.02)

RECOMMENDATION: claude-3 (Best cost-performance balance)
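The "best cost-performance balance" call depends on how you weigh quality against spend; plain score-per-dollar would favor the cheapest model instead. A minimal sketch of one such trade-off, using the numbers from the summary above (the metric itself is an illustrative choice, not the framework's documented formula):

```python
# Scores and costs copied from the summary table above.
results = {
    "gpt-4": {"score": 0.924, "cost": 0.12},
    "claude-3": {"score": 0.891, "cost": 0.04},
    "gpt-3.5-turbo": {"score": 0.856, "cost": 0.02},
}

# One illustrative balance metric: quality score minus dollar cost,
# i.e. penalizing each dollar by one quality point. Tune the weight to taste.
balance = {name: r["score"] - r["cost"] for name, r in results.items()}
best = max(balance, key=balance.get)
print(best)  # claude-3 under this weighting
```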
### Advanced CLI Patterns

#### Configuration File Usage

Create reusable configuration files for consistent evaluations.

**`evaluation_config.yaml`:**
models:
  - gpt-3.5-turbo
  - claude-3-haiku

test_generation:
  capabilities: [reasoning, creativity]
  count: 20
  domain: customer_service
  difficulty: mixed

evaluation:
  budget: 0.20
  max_response_time: 30
  quality_threshold: 0.7

scoring:
  weights:
    accuracy: 0.4
    cost_efficiency: 0.3
    response_time: 0.2
    creativity: 0.1

output:
  format: json
  include_metadata: true
  save_individual_results: true
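Conceptually, a config file like this replaces a long list of per-run flags. As a rough sketch of that mapping (the flattening logic and flag names here are my own assumptions for illustration, not documented framework behavior):

```python
# Hypothetical: build an `llm-eval evaluate` argument list from a config dict
# mirroring evaluation_config.yaml above. Flag names are assumptions.
config = {
    "models": ["gpt-3.5-turbo", "claude-3-haiku"],
    "evaluation": {"budget": 0.20, "max_response_time": 30},
}

args = ["llm-eval", "evaluate", "--model", *config["models"]]
for key, value in config["evaluation"].items():
    # YAML keys use underscores; CLI flags conventionally use dashes.
    args += [f"--{key.replace('_', '-')}", str(value)]
print(" ".join(args))
```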
#### Pipeline Integration

#!/bin/bash
# CI/CD pipeline example: quality-gate evaluation
llm-eval evaluate \
    --model production-model \
    --test-cases qa_test_suite.json \
    --quality-gate 0.85 \
    --fail-on-low-score \
    --output qa_results.json

# If the evaluation passes, deploy
if [ $? -eq 0 ]; then
    echo "Quality gate passed - deploying model"
    ./deploy_model.sh
else
    echo "Quality gate failed - blocking deployment"
    exit 1
fi
## Environment Configuration

### API Keys and Credentials

Set up authentication for different LLM providers:
# Create .env file in your project directory
touch .env
# Add your API keys (never commit this file!)
echo "OPENAI_API_KEY=your_openai_api_key_here" >> .env
echo "ANTHROPIC_API_KEY=your_anthropic_api_key_here" >> .env
echo "AZURE_OPENAI_API_KEY=your_azure_api_key_here" >> .env
echo "AZURE_OPENAI_ENDPOINT=your_azure_endpoint_here" >> .env
# Alternative: Set environment variables directly
export OPENAI_API_KEY="your_openai_api_key_here"
export ANTHROPIC_API_KEY="your_anthropic_api_key_here"
### Framework Configuration

**Create `llm_eval_config.yaml`:**

# Global framework configuration
framework:
  logging_level: INFO
  cache_enabled: true
  cache_ttl_hours: 24
  max_concurrent_evaluations: 5

# Default model settings
defaults:
  max_tokens: 2048
  temperature: 0.7
  timeout_seconds: 30
  retry_attempts: 3

# Cost management
cost_management:
  daily_budget_limit: 10.00  # USD
  per_request_limit: 0.10    # USD
  alert_threshold: 0.80      # 80% of budget

# Output preferences
output:
  default_format: json
  include_metadata: true
  save_intermediate_results: true
  results_directory: "./evaluation_results"
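To make the cost-management arithmetic concrete: a request is rejected if it exceeds the per-request cap or would push the day's spend past the budget, and an alert fires once spend crosses 80% of the budget. A sketch of that logic in plain Python (this is illustrative client-side math, not the framework's actual enforcement code):

```python
# The cost_management limits from the YAML above, applied client-side.
DAILY_BUDGET = 10.00      # daily_budget_limit, USD
PER_REQUEST_LIMIT = 0.10  # per_request_limit, USD
ALERT_THRESHOLD = 0.80    # alert at 80% of the daily budget

def check_request(spent_today: float, request_cost: float) -> tuple[bool, bool]:
    """Return (allowed, alert) for a proposed request under the limits above."""
    if request_cost > PER_REQUEST_LIMIT:
        return False, False            # over the per-request cap
    projected = spent_today + request_cost
    if projected > DAILY_BUDGET:
        return False, True             # would blow the daily budget
    return True, projected >= ALERT_THRESHOLD * DAILY_BUDGET
```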
### Project Structure

Recommended project organization:

my-llm-evaluation/
├── .env                      # API keys (git-ignored)
├── llm_eval_config.yaml      # Framework configuration
├── requirements.txt          # Python dependencies
├── evaluation_scripts/
│   ├── first_evaluation.py
│   ├── batch_evaluation.py
│   └── custom_evaluation.py
├── test_datasets/
│   ├── general_knowledge.json
│   ├── reasoning_tests.json
│   └── creativity_tests.json
├── results/
│   ├── 2024-01-15/
│   └── latest/
├── configs/
│   ├── evaluation_config.yaml
│   └── model_configs/
└── reports/
    ├── analysis_report.html
    └── summary_dashboard.html
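If you set up several evaluation projects, the skeleton above can be created programmatically. A small stand-alone helper (the `scaffold_project` name and directory list are my own, mirroring the layout shown):

```python
from pathlib import Path

# Hypothetical helper that creates the recommended skeleton under a chosen root.
SUBDIRS = [
    "evaluation_scripts",
    "test_datasets",
    "results/latest",
    "configs/model_configs",
    "reports",
]

def scaffold_project(root: str) -> Path:
    base = Path(root)
    for sub in SUBDIRS:
        (base / sub).mkdir(parents=True, exist_ok=True)
    (base / ".env").touch()  # remember to git-ignore this file
    return base
```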
## Verification & Testing

### Complete Installation Test

Run this comprehensive test to verify your installation.

**Create `installation_test.py`:**
"""
LLM Evaluation Framework Installation Test
==========================================
Comprehensive test to verify all components are working correctly.
"""
import sys
import traceback
from datetime import datetime


def run_test(test_name, test_func):
    """Run a test and report results."""
    try:
        print(f"Running {test_name}...")
        test_func()
        print(f"{test_name} PASSED")
        return True
    except Exception as e:
        print(f"{test_name} FAILED: {e}")
        if "--verbose" in sys.argv:
            traceback.print_exc()
        return False
def test_core_imports():
    """Test core framework imports."""
    from llm_evaluation_framework import (
        ModelRegistry,
        ModelInferenceEngine,
        AutoSuggestionEngine,
        TestDatasetGenerator,
    )


def test_registry_functionality():
    """Test model registry functionality."""
    from llm_evaluation_framework import ModelRegistry

    registry = ModelRegistry()

    # Test model registration
    config = {
        "provider": "mock",
        "api_cost_input": 0.001,
        "api_cost_output": 0.002,
        "capabilities": ["reasoning"],
    }
    assert registry.register_model("test-model", config)
    assert "test-model" in registry.list_models()
    assert registry.get_model("test-model")["provider"] == "mock"
def test_dataset_generation():
    """Test dataset generation functionality."""
    from llm_evaluation_framework import TestDatasetGenerator

    generator = TestDatasetGenerator()
    requirements = {
        "domain": "general",
        "required_capabilities": ["reasoning"],
    }
    test_cases = generator.generate_test_cases(requirements, count=3)
    assert len(test_cases) == 3
    assert all("prompt" in case for case in test_cases)


def test_inference_engine():
    """Test inference engine functionality."""
    from llm_evaluation_framework import ModelRegistry, ModelInferenceEngine

    registry = ModelRegistry()
    engine = ModelInferenceEngine(registry)

    # Register mock model for testing
    config = {
        "provider": "mock",
        "api_cost_input": 0.0,
        "api_cost_output": 0.0,
        "capabilities": ["reasoning"],
    }
    registry.register_model("mock-test", config)

    # Test evaluation
    test_cases = [{
        "id": "test1",
        "prompt": "What is 2+2?",
        "expected_output": "4",
    }]
    requirements = {"budget": 1.0}
    results = engine.evaluate_model("mock-test", test_cases, requirements)
    assert "aggregate_metrics" in results
    assert "test_results" in results
def test_cli_availability():
    """Test CLI command availability."""
    import subprocess

    result = subprocess.run(
        ["llm-eval", "--version"],
        capture_output=True,
        text=True,
    )
    assert result.returncode == 0


def test_persistence():
    """Test persistence functionality."""
    from llm_evaluation_framework.persistence import JSONStore
    import tempfile
    import os

    with tempfile.NamedTemporaryFile(suffix='.json', delete=False) as tmp:
        store = JSONStore(tmp.name)
        test_data = {"test": "data", "timestamp": datetime.now().isoformat()}
        store.save_evaluation_result(test_data)
        loaded_data = store.load_evaluation_result()
        assert loaded_data["test"] == "data"
    os.unlink(tmp.name)
def main():
    """Run all installation tests."""
    print("LLM Evaluation Framework Installation Test")
    print("=" * 50)
    print(f"Test started at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    print()

    tests = [
        ("Core Imports", test_core_imports),
        ("Model Registry", test_registry_functionality),
        ("Dataset Generation", test_dataset_generation),
        ("Inference Engine", test_inference_engine),
        ("CLI Availability", test_cli_availability),
        ("Persistence Layer", test_persistence),
    ]

    passed = 0
    total = len(tests)

    for test_name, test_func in tests:
        if run_test(test_name, test_func):
            passed += 1
        print()

    print("TEST SUMMARY")
    print("=" * 15)
    print(f"Passed: {passed}/{total}")
    print(f"Failed: {total - passed}/{total}")
    print(f"Success Rate: {(passed/total)*100:.1f}%")

    if passed == total:
        print("\nALL TESTS PASSED! Installation is complete and working correctly.")
        print("You're ready to start evaluating LLMs!")
    else:
        print("\nSome tests failed. Please check the error messages above.")
        print("Need help? Visit: https://github.com/isathish/LLMEvaluationFramework/issues")
        sys.exit(1)


if __name__ == "__main__":
    main()

**Run the test:**

python installation_test.py
## Troubleshooting

### Common Issues and Solutions

The sections below cover installation, authentication, cost management, performance, and result issues.

#### Installation Issues

**Error:**

ModuleNotFoundError: No module named 'llm_evaluation_framework'
**Problem**: Framework not properly installed or virtual environment not activated.

**Solutions**:

# 1. Check if virtual environment is activated
which python
# Should show path to virtual environment
# 2. Reinstall framework
pip uninstall llm-evaluation-framework
pip install llm-evaluation-framework
# 3. Check Python path
python -c "import sys; print('\n'.join(sys.path))"
# 4. Force reinstall with no cache
pip install --no-cache-dir --force-reinstall llm-evaluation-framework
**Error:**

Command 'llm-eval' not found

**Problem**: CLI not properly installed or not in PATH.

**Solutions**:

# 1. Check if pip installed scripts correctly
pip show -f llm-evaluation-framework | grep bin
# 2. Add pip bin directory to PATH
export PATH="$PATH:$(python -m site --user-base)/bin"
# 3. Use module syntax as alternative
python -m llm_evaluation_framework.cli --help
# 4. Reinstall with explicit user flag
pip install --user llm-evaluation-framework
**Error:** Permission denied errors

**Problem**: Insufficient permissions for installation.

**Solutions**:

# 1. Use a virtual environment (recommended)
python -m venv llm-eval-env
source llm-eval-env/bin/activate
pip install llm-evaluation-framework
# 2. Install for current user only
pip install --user llm-evaluation-framework
# 3. Fix permissions (macOS/Linux)
sudo chown -R $(whoami) ~/.local
# 4. Use conda environment
conda create -n llm-eval python=3.11
conda activate llm-eval
pip install llm-evaluation-framework
#### Authentication Issues

**Error:** API key authentication failures

**Problem**: API keys not configured or invalid.

**Solutions**:

# 1. Verify environment variables are set
echo $OPENAI_API_KEY
echo $ANTHROPIC_API_KEY
# 2. Test API key validity
curl -H "Authorization: Bearer $OPENAI_API_KEY" \
https://api.openai.com/v1/models
# 3. Check .env file is in correct location
cat .env # Should show your API keys
# 4. Verify Python can read environment variables
python -c "import os; print(os.getenv('OPENAI_API_KEY', 'NOT_SET'))"
**Error:** Rate limiting or quota errors

**Problem**: API rate limits exceeded or quota exhausted.

**Solutions**:

# 1. Check current usage (OpenAI)
llm-eval check-usage --provider openai
# 2. Reduce concurrent requests in config
# Edit llm_eval_config.yaml:
# max_concurrent_evaluations: 1
# 3. Add delays between requests
llm-eval evaluate --rate-limit 1.0 # 1 second between requests
# 4. Use mock models for testing
llm-eval evaluate --model mock-model --test-cases test.json
#### Cost Management Issues

**Issue:** Unexpected high API costs

**Problem**: Evaluation costs higher than expected.

**Prevention & Solutions**:

# 1. Set strict budget limits
llm-eval evaluate --budget 0.10 --fail-on-exceed
# 2. Use cost estimation before evaluation
llm-eval estimate-cost \
--model gpt-4 \
--test-cases large_dataset.json
# 3. Use cheaper models for testing
llm-eval evaluate --model gpt-3.5-turbo # Instead of gpt-4
# 4. Monitor costs in real-time
llm-eval evaluate --monitor-costs --cost-alert 0.05
#### Performance Issues

**Issue:** Slow evaluation performance

**Problem**: Evaluations taking too long to complete.

**Solutions**:

# 1. Enable parallel processing
llm-eval evaluate --parallel --max-workers 4
# 2. Use batch processing
llm-eval evaluate --batch-size 10
# 3. Optimize model configuration
# Reduce max_tokens in model config
llm-eval update-model --name gpt-3.5-turbo --max-tokens 1000
# 4. Enable caching for repeated evaluations
llm-eval evaluate --enable-cache
# 5. Use faster models for initial testing
llm-eval evaluate --model claude-3-haiku # Faster than claude-3-opus
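Why parallelism helps here: evaluation requests are network-bound, so most wall-clock time is spent waiting on the provider, and a thread pool can keep several requests in flight at once. A minimal sketch of that idea (`fake_evaluate` is a stand-in for one model call, not the framework's API):

```python
from concurrent.futures import ThreadPoolExecutor

# fake_evaluate is a placeholder for a single (slow, network-bound) model call.
def fake_evaluate(case):
    return {"id": case["id"], "score": 0.9}

def evaluate_parallel(cases, max_workers=4):
    # pool.map preserves input order, so results line up with cases.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fake_evaluate, cases))
```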
#### Result Issues

**Issue:** Inconsistent or unexpected results

**Problem**: Evaluation results vary significantly between runs.

**Solutions**:

# 1. Set deterministic parameters
llm-eval evaluate --temperature 0.0 --seed 42
# 2. Increase test case count for statistical significance
llm-eval generate --count 50 # Instead of 10
# 3. Run multiple evaluation rounds
llm-eval evaluate --rounds 3 --aggregate-method mean
# 4. Use more robust scoring methods
llm-eval evaluate --scorer semantic_similarity # Instead of exact_match
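The point of running multiple rounds is that the spread across rounds tells you whether a score difference is signal or noise. A stand-alone sketch of the aggregation (the round scores are made-up example values):

```python
from statistics import mean, stdev

# Hypothetical scores for one model across three evaluation rounds.
rounds = [0.81, 0.86, 0.84]

# Report the mean alongside the standard deviation: if two models' means
# differ by less than their spreads, the ranking is probably noise.
summary = {"mean": round(mean(rounds), 3), "stdev": round(stdev(rounds), 3)}
print(summary)
```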
### Diagnostic Commands

Use these commands to diagnose issues:

# System information
llm-eval diagnose --system
# Framework configuration check
llm-eval diagnose --config
# Model registry validation
llm-eval diagnose --models
# Connectivity test
llm-eval diagnose --connectivity
# Permission check
llm-eval diagnose --permissions
# Generate comprehensive diagnostic report
llm-eval diagnose --all --output diagnostic_report.txt
### Getting Help

If you're still experiencing issues:

**When reporting issues, include:**

- Output of `llm-eval diagnose --all`
- Your operating system and Python version
- Complete error message and stack trace
- Steps to reproduce the issue
- Your configuration files (with sensitive data removed)
| Resource | Best For | Response Time |
|---|---|---|
| GitHub Issues | Bug reports, feature requests | 1-3 days |
| GitHub Discussions | Q&A, general help, ideas | Community-driven |
| Documentation | Detailed guides, API reference | Immediate |
| Stack Overflow | Technical questions | Community-driven |
## Next Steps & Learning Path

### Congratulations! You're Now Ready to Evaluate LLMs

You've successfully set up the LLM Evaluation Framework and run your first evaluation.

### What You've Accomplished

- **Installed** the framework in your environment
- **Configured** models and evaluation settings
- **Generated** synthetic test datasets
- **Executed** comprehensive model evaluations
- **Analyzed** results and got recommendations
- **Learned** CLI commands for automation

### Ready for Advanced Topics?

Continue with [Core Concepts](core-concepts.md), [Examples](examples.md), [Advanced Usage](advanced-usage.md), or the [Developer Guide](developer-guide.md).

---

### Join Our Community

Ask questions in [GitHub Discussions](https://github.com/isathish/LLMEvaluationFramework/discussions), browse [Stack Overflow](https://stackoverflow.com/questions/tagged/llm-evaluation-framework), or read the [contributing guide](../contributing.md).

---

**Found this helpful? [Star us on GitHub](https://github.com/isathish/LLMEvaluationFramework) to support the project!**

*Happy evaluating!*