
# 🚀 Getting Started

![Getting Started](https://img.shields.io/badge/Getting%20Started-Your%20LLM%20Evaluation%20Journey-22c55e?style=for-the-badge&logo=rocket&logoColor=white) **From zero to expert in 15 minutes** *Complete setup guide with interactive examples and instant verification* [![Quick Install](https://img.shields.io/badge/⚡_Quick_Install-2_Minutes-ef4444?style=for-the-badge)](#quick-installation) [![First Evaluation](https://img.shields.io/badge/🎯_First_Evaluation-5_Minutes-f59e0b?style=for-the-badge)](#your-first-evaluation) [![CLI Mastery](https://img.shields.io/badge/🖥️_CLI_Mastery-3_Minutes-3b82f6?style=for-the-badge)](#cli-quickstart) [![Troubleshooting](https://img.shields.io/badge/🛠️_Help_%26_Support-Always_Available-8b5cf6?style=for-the-badge)](#troubleshooting)

## 🎯 What You'll Achieve

| Milestone | Time | Outcome |
|---|---|---|
| ⚡ Installation Complete | 2 minutes | Framework installed and verified |
| 🎯 First Evaluation | 5 minutes | Run your first model evaluation |
| 🖥️ CLI Proficiency | 8 minutes | Master command-line workflows |
| 🚀 Production Ready | 15 minutes | Understand advanced patterns |

**🌟 By the end of this guide, you'll be able to:**

- ✅ Install and configure the framework in any environment
- ✅ Register and evaluate your first language model
- ✅ Generate synthetic datasets for comprehensive testing
- ✅ Run evaluations and analyze detailed results
- ✅ Use the CLI for automated evaluation workflows
- ✅ Export results in multiple formats for reporting
- ✅ Troubleshoot common issues independently

## 📋 Prerequisites & System Requirements

### 🖥️ **System Requirements**

| Component | Minimum | Recommended | Notes |
|---|---|---|---|
| Python | 3.8+ | 3.11+ | Type hints & performance improvements |
| Memory | 4 GB RAM | 8 GB+ RAM | For large-scale evaluations |
| Storage | 1 GB free | 5 GB+ free | For datasets and cached results |
| Network | Stable internet | High-speed connection | For API calls to LLM providers |

### 🛠️ **Platform Support**

![Linux](https://img.shields.io/badge/Linux-Ubuntu%20|%20CentOS%20|%20RHEL-22c55e?style=for-the-badge&logo=linux&logoColor=white) ![macOS](https://img.shields.io/badge/macOS-Intel%20%26%20Apple%20Silicon-3b82f6?style=for-the-badge&logo=apple&logoColor=white) ![Windows](https://img.shields.io/badge/Windows-10%20|%2011-f59e0b?style=for-the-badge&logo=windows&logoColor=white)

### 🔧 **Development Tools** (Optional but Recommended)

```bash
# Version management
pyenv          # Python version management
pipenv         # Dependency management
conda          # Data science environment

# Development tools
git            # Version control
docker         # Containerization
vs-code        # IDE with Python extensions
```

## ⚡ Quick Installation

### 🏃‍♂️ **Express Installation** (Recommended)

```bash
# 🚀 One-command installation
curl -sSL https://install.llmevalframework.com | bash

# ✅ Verify installation
llm-eval --version && echo "🎉 Ready to evaluate!"
```

**This script automatically:**

- Creates an isolated Python environment
- Installs the latest stable version
- Configures CLI tools
- Runs verification tests
### 📦 **Standard Installation Methods**

#### **Method 1: PyPI (Most Common)**

```bash
# Step 1: Create virtual environment (strongly recommended)
python -m venv llm-eval-env

# Step 2: Activate environment
# On macOS/Linux:
source llm-eval-env/bin/activate
# On Windows PowerShell:
llm-eval-env\Scripts\Activate.ps1
# On Windows Command Prompt:
llm-eval-env\Scripts\activate.bat

# Step 3: Upgrade pip for better dependency resolution
python -m pip install --upgrade pip

# Step 4: Install framework with all components
pip install "llm-evaluation-framework[complete]"

# Step 5: Verify installation
python -c "import llm_evaluation_framework; print('✅ Installation successful!')"
llm-eval --help
```
#### **Method 2: Development Installation**

```bash
# Step 1: Clone repository
git clone https://github.com/isathish/LLMEvaluationFramework.git
cd LLMEvaluationFramework

# Step 2: Create and activate virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Step 3: Install in development mode with all dependencies
pip install -e ".[dev,test,docs]"

# Step 4: Install pre-commit hooks (for contributors)
pre-commit install

# Step 5: Run test suite to verify installation
pytest --cov=llm_evaluation_framework -v
```
#### **Method 3: Docker Installation**

```bash
# Step 1: Pull latest image
docker pull ghcr.io/isathish/llm-evaluation-framework:latest

# Step 2: Run interactive container
docker run -it --rm \
  -v $(pwd)/data:/app/data \
  -v $(pwd)/results:/app/results \
  ghcr.io/isathish/llm-evaluation-framework:latest

# Step 3: Verify inside container
llm-eval --version
```
#### **Method 4: Conda Installation**

```bash
# Step 1: Create conda environment
conda create -n llm-eval python=3.11 -y
conda activate llm-eval

# Step 2: Install from conda-forge
conda install -c conda-forge llm-evaluation-framework

# Alternative: Install via pip in conda environment
pip install llm-evaluation-framework

# Step 3: Verify installation
llm-eval --version
```
### 🔍 **Installation Verification**

Run these commands to ensure everything is working correctly:

```bash
# 1️⃣ Version check
llm-eval --version
# Expected output: LLM Evaluation Framework v1.0.0

# 2️⃣ Core import test
python -c "
from llm_evaluation_framework import (
    ModelRegistry,
    ModelInferenceEngine,
    TestDatasetGenerator
)
print('✅ Core components imported successfully')
"

# 3️⃣ CLI functionality test
llm-eval list --type capabilities
# Expected output: List of available capabilities

# 4️⃣ Quick functionality test
python -c "
from llm_evaluation_framework import ModelRegistry
registry = ModelRegistry()
print(f'✅ Registry initialized with {len(registry.list_models())} pre-configured models')
"
```

**🎉 If all commands run without errors, you're ready to proceed!**

## 🎯 Your First Evaluation

### 🏗️ **Step 1: Environment Setup**

Create a new directory for your evaluation project:

```bash
# Create project directory
mkdir my-llm-evaluation
cd my-llm-evaluation

# Create Python file for our first evaluation
touch first_evaluation.py
```

### 🤖 **Step 2: Initialize the Framework**

**Create `first_evaluation.py`:**
"""
๐Ÿš€ My First LLM Evaluation
=========================

This script demonstrates the complete evaluation workflow:
1. Framework initialization
2. Model registration  
3. Test dataset generation
4. Model evaluation
5. Results analysis
"""

import os
from datetime import datetime
from llm_evaluation_framework import (
    ModelRegistry,
    ModelInferenceEngine,
    AutoSuggestionEngine,
    TestDatasetGenerator
)
from llm_evaluation_framework.persistence import JSONStore

def main():
    """Run complete evaluation workflow."""

    print("๐Ÿš€ LLM Evaluation Framework - First Evaluation")
    print("=" * 55)
    print(f"๐Ÿ“… Started at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

    # Step 1: Initialize core components
    print("\n๐Ÿ”ง Step 1: Initializing framework components...")
    registry = ModelRegistry()
    engine = ModelInferenceEngine(registry)
    suggestion_engine = AutoSuggestionEngine(registry)
    dataset_generator = TestDatasetGenerator()
    print("โœ… All components initialized successfully!")

    # Step 2: Register models for evaluation
    setup_models(registry)

    # Step 3: Generate test dataset
    test_cases = generate_test_dataset(dataset_generator)

    # Step 4: Run evaluations
    results = run_evaluations(engine, test_cases)

    # Step 5: Get recommendations
    recommendations = get_recommendations(suggestion_engine, results)

    # Step 6: Save and display results
    save_and_display_results(results, recommendations, test_cases)

    print("\n๐ŸŽ‰ First evaluation completed successfully!")
    print("๐Ÿ’ก Next steps: Try different models or custom test cases")

def setup_models(registry):
    """Register models for evaluation."""
    print("\n๐Ÿค– Step 2: Registering evaluation models...")

    # Model configurations
    models = {
        "gpt-3.5-turbo": {
            "provider": "openai",
            "api_cost_input": 0.0015,
            "api_cost_output": 0.002,
            "capabilities": ["reasoning", "creativity", "coding"],
            "max_tokens": 4096,
            "context_window": 16385
        },
        "claude-3-haiku": {
            "provider": "anthropic", 
            "api_cost_input": 0.00325,
            "api_cost_output": 0.01625,
            "capabilities": ["reasoning", "analysis", "creative_writing"],
            "max_tokens": 4096,
            "context_window": 200000
        },
        "mock-model": {
            "provider": "mock",  # For testing without API costs
            "api_cost_input": 0.0,
            "api_cost_output": 0.0,
            "capabilities": ["reasoning", "creativity"],
            "max_tokens": 2048,
            "context_window": 4096
        }
    }

    # Register each model
    for model_name, config in models.items():
        success = registry.register_model(model_name, config)
        status = "โœ…" if success else "โŒ"
        print(f"   {status} {model_name}")

    print(f"๐Ÿ“Š Total registered models: {len(registry.list_models())}")

def generate_test_dataset(generator):
    """Generate comprehensive test dataset."""
    print("\n๐Ÿงช Step 3: Generating test dataset...")

    # Define evaluation requirements
    requirements = {
        "domain": "general_knowledge",
        "required_capabilities": ["reasoning", "creativity"],
        "difficulty_levels": ["easy", "medium", "hard"],
        "test_types": ["factual", "analytical", "creative"]
    }

    # Generate test cases
    test_cases = generator.generate_test_cases(requirements, count=8)

    print(f"โœ… Generated {len(test_cases)} test cases:")
    for i, case in enumerate(test_cases, 1):
        print(f"   {i}. {case.get('type', 'unknown').title()}: {case.get('prompt', '')[:60]}...")

    return test_cases

def run_evaluations(engine, test_cases):
    """Run evaluations for all registered models."""
    print("\nโšก Step 4: Running model evaluations...")

    # Evaluation configuration
    evaluation_config = {
        "max_response_time": 30.0,  # seconds
        "budget": 0.50,             # USD
        "quality_threshold": 0.7,
        "required_capabilities": ["reasoning"]
    }

    # Get available models (excluding potentially expensive ones for demo)
    available_models = ["mock-model"]  # Safe for testing

    results = {}
    for model_id in available_models:
        print(f"   ๐Ÿ”„ Evaluating {model_id}...")
        try:
            result = engine.evaluate_model(model_id, test_cases, evaluation_config)
            results[model_id] = result

            # Display basic metrics
            metrics = result.get('aggregate_metrics', {})
            print(f"      ๐Ÿ“Š Score: {metrics.get('overall_score', 0):.3f}")
            print(f"      ๐Ÿ’ฐ Cost: ${metrics.get('total_cost', 0):.4f}")
            print(f"      โฑ๏ธ  Time: {metrics.get('avg_response_time', 0):.2f}s")
            print(f"      โœ… Success Rate: {metrics.get('success_rate', 0):.1%}")

        except Exception as e:
            print(f"      โŒ Evaluation failed: {str(e)}")
            continue

    print(f"โœ… Completed evaluations for {len(results)} models")
    return results

def get_recommendations(suggestion_engine, results):
    """Get model recommendations based on results."""
    print("\n๐ŸŽฏ Step 5: Generating recommendations...")

    if not results:
        print("   โš ๏ธ  No results available for recommendations")
        return []

    # Recommendation requirements
    recommendation_config = {
        "weights": {
            "accuracy": 0.4,
            "cost_efficiency": 0.3,
            "response_time": 0.2,
            "reliability": 0.1
        },
        "constraints": {
            "max_cost_per_request": 0.05,
            "min_accuracy": 0.6
        }
    }

    try:
        evaluation_results = list(results.values())
        recommendations = suggestion_engine.suggest_model(
            evaluation_results, 
            recommendation_config
        )

        print(f"โœ… Generated {len(recommendations)} recommendations")
        if recommendations:
            top_rec = recommendations[0]
            print(f"   ๐Ÿ† Top recommendation: {top_rec.get('model_id', 'Unknown')}")
            print(f"   ๐Ÿ“Š Recommendation score: {top_rec.get('recommendation_score', 0):.3f}")

        return recommendations

    except Exception as e:
        print(f"   โŒ Recommendation generation failed: {str(e)}")
        return []

def save_and_display_results(results, recommendations, test_cases):
    """Save results and display summary."""
    print("\n๐Ÿ’พ Step 6: Saving results and generating summary...")

    # Prepare comprehensive results
    complete_results = {
        "evaluation_metadata": {
            "timestamp": datetime.now().isoformat(),
            "framework_version": "1.0.0",
            "total_test_cases": len(test_cases),
            "models_evaluated": len(results)
        },
        "test_cases": test_cases,
        "evaluation_results": results,
        "recommendations": recommendations
    }

    # Save to JSON file
    try:
        store = JSONStore("first_evaluation_results.json")
        store.save_evaluation_result(complete_results)
        print("โœ… Results saved to 'first_evaluation_results.json'")
    except Exception as e:
        print(f"โŒ Failed to save results: {str(e)}")

    # Display final summary
    print("\n๐Ÿ“‹ EVALUATION SUMMARY")
    print("=" * 25)

    if results:
        for model_id, result in results.items():
            metrics = result.get('aggregate_metrics', {})
            print(f"\n๐Ÿค– {model_id.upper()}:")
            print(f"   ๐Ÿ“Š Overall Score: {metrics.get('overall_score', 0):.3f}/1.0")
            print(f"   ๐Ÿ’ฐ Total Cost: ${metrics.get('total_cost', 0):.4f}")
            print(f"   โฑ๏ธ  Average Response Time: {metrics.get('avg_response_time', 0):.2f}s")
            print(f"   โœ… Success Rate: {metrics.get('success_rate', 0):.1%}")
    else:
        print("โš ๏ธ  No evaluation results to display")

    if recommendations:
        print(f"\n๐ŸŽฏ TOP RECOMMENDATION: {recommendations[0].get('model_id', 'Unknown')}")
        print(f"๐Ÿ“ˆ Confidence: {recommendations[0].get('recommendation_score', 0):.3f}")

if __name__ == "__main__":
    main()
### ๐Ÿƒโ€โ™‚๏ธ **Step 3: Run Your First Evaluation**
# Execute the evaluation script
python first_evaluation.py
**Expected Output:**

```text
🚀 LLM Evaluation Framework - First Evaluation
=======================================================
📅 Started at: 2024-01-15 14:30:25

🔧 Step 1: Initializing framework components...
✅ All components initialized successfully!

🤖 Step 2: Registering evaluation models...
   ✅ gpt-3.5-turbo
   ✅ claude-3-haiku
   ✅ mock-model
📊 Total registered models: 3

🧪 Step 3: Generating test dataset...
✅ Generated 8 test cases:
   1. Factual: What is the capital of France and what is its pop...
   2. Analytical: Compare the advantages and disadvantages of re...
   3. Creative: Write a short story about a robot learning to...

⚡ Step 4: Running model evaluations...
   🔄 Evaluating mock-model...
      📊 Score: 0.847
      💰 Cost: $0.0000
      ⏱️  Time: 0.15s
      ✅ Success Rate: 100.0%
✅ Completed evaluations for 1 models

🎯 Step 5: Generating recommendations...
✅ Generated 1 recommendations
   🏆 Top recommendation: mock-model
   📊 Recommendation score: 0.923

💾 Step 6: Saving results and generating summary...
✅ Results saved to 'first_evaluation_results.json'

📋 EVALUATION SUMMARY
=========================

🤖 MOCK-MODEL:
   📊 Overall Score: 0.847/1.0
   💰 Total Cost: $0.0000
   ⏱️  Average Response Time: 0.15s
   ✅ Success Rate: 100.0%

🎯 TOP RECOMMENDATION: mock-model
📈 Confidence: 0.923

🎉 First evaluation completed successfully!
💡 Next steps: Try different models or custom test cases
```

๐Ÿ–ฅ๏ธ CLI Quickstart

### ๐ŸŽฏ **Essential CLI Commands** The LLM Evaluation Framework provides a powerful command-line interface for streamlined workflows. #### **๐Ÿ” Discovery Commands**
# View all available commands and options
llm-eval --help

# List available resources
llm-eval list                           # List all available resources
llm-eval list --type models            # List registered models
llm-eval list --type capabilities      # List supported capabilities  
llm-eval list --type providers         # List supported providers
llm-eval list --type scorers           # List scoring strategies

# Get detailed help for specific commands
llm-eval evaluate --help
llm-eval generate --help
llm-eval score --help
#### **⚡ Quick Evaluation Commands**

```bash
# Generate test dataset quickly
llm-eval generate \
  --capability reasoning \
  --count 10 \
  --domain "customer_service" \
  --output test_cases.json \
  --format json

# Run evaluation with generated dataset
llm-eval evaluate \
  --model gpt-3.5-turbo \
  --test-cases test_cases.json \
  --budget 0.10 \
  --output evaluation_results.json \
  --format json \
  --verbose

# Score predictions quickly
llm-eval score \
  --predictions "The sky is blue" "Paris is a city" \
  --references "The sky is blue" "Paris is the capital of France" \
  --metric semantic_similarity \
  --output scores.json
```
#### **🔧 Model Management Commands**

```bash
# Register a new model
llm-eval register-model \
  --name "my-custom-model" \
  --provider "custom" \
  --config model_config.yaml

# Validate model configuration
llm-eval validate-model --name "gpt-3.5-turbo"

# Update model configuration
llm-eval update-model \
  --name "gpt-3.5-turbo" \
  --field "api_cost_input" \
  --value "0.002"

# Remove model registration
llm-eval remove-model --name "my-custom-model" --confirm
```
### 🔄 **Complete CLI Workflow Example**

Let's walk through a complete evaluation workflow using only CLI commands:

```bash
# 🏗️ Step 1: Setup project directory
mkdir cli-evaluation && cd cli-evaluation

# 🧪 Step 2: Generate comprehensive test dataset
llm-eval generate \
  --capability reasoning creativity coding \
  --count 15 \
  --domain "software_development" \
  --difficulty mixed \
  --output comprehensive_tests.json \
  --verbose

# 📊 Step 3: Run evaluation on multiple models
llm-eval evaluate \
  --model gpt-3.5-turbo gpt-4 claude-3 \
  --test-cases comprehensive_tests.json \
  --budget 0.25 \
  --max-time 60 \
  --output evaluation_results.json \
  --format json \
  --parallel \
  --verbose

# 🎯 Step 4: Get model recommendations
llm-eval recommend \
  --evaluation-results evaluation_results.json \
  --weights accuracy:0.4 cost:0.3 speed:0.2 creativity:0.1 \
  --constraints max_cost:0.05 min_accuracy:0.7 \
  --output recommendations.json \
  --top-n 3

# 📈 Step 5: Generate analysis report
llm-eval analyze \
  --evaluation-results evaluation_results.json \
  --recommendations recommendations.json \
  --output analysis_report.html \
  --format html \
  --include-charts

# 📋 Step 6: View quick summary
llm-eval summary \
  --evaluation-results evaluation_results.json \
  --format table
```
**Expected CLI Output:**

```text
📊 EVALUATION SUMMARY
====================
Models Evaluated: 3
Test Cases: 15
Total Cost: $0.18
Total Time: 47.3s

🏆 TOP PERFORMERS:
1. gpt-4         (Score: 0.924, Cost: $0.12)
2. claude-3      (Score: 0.891, Cost: $0.04)
3. gpt-3.5-turbo (Score: 0.856, Cost: $0.02)

💡 RECOMMENDATION: claude-3 (Best cost-performance balance)
```
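Conceptually, the `recommend` step above reduces to a weighted sum of normalized per-model metrics. A minimal sketch of that arithmetic (the metric values below are illustrative placeholders, not real benchmark data, and this is not the framework's internal API):

```python
def weighted_score(metrics: dict, weights: dict) -> float:
    """Combine normalized metric values (0-1) using a weight mapping."""
    return sum(weights[name] * metrics.get(name, 0.0) for name in weights)

# Weights mirror the CLI flags above; candidate metrics are made up for illustration
weights = {"accuracy": 0.4, "cost": 0.3, "speed": 0.2, "creativity": 0.1}
candidates = {
    "gpt-4":         {"accuracy": 0.92, "cost": 0.40, "speed": 0.55, "creativity": 0.90},
    "claude-3":      {"accuracy": 0.89, "cost": 0.85, "speed": 0.90, "creativity": 0.80},
    "gpt-3.5-turbo": {"accuracy": 0.86, "cost": 0.95, "speed": 0.80, "creativity": 0.70},
}

# Rank models by combined score, best first
ranked = sorted(candidates, key=lambda m: weighted_score(candidates[m], weights),
                reverse=True)
print(ranked[0])  # → claude-3
```

Note how a high cost-efficiency score lets a cheaper model outrank one with slightly better raw accuracy, which is exactly the trade-off the weights are there to express.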
### 🔧 **Advanced CLI Patterns**

#### **Configuration File Usage**

Create reusable configuration files for consistent evaluations:

**`evaluation_config.yaml`:**

```yaml
models:
  - gpt-3.5-turbo
  - claude-3-haiku

test_generation:
  capabilities: [reasoning, creativity]
  count: 20
  domain: customer_service
  difficulty: mixed

evaluation:
  budget: 0.20
  max_response_time: 30
  quality_threshold: 0.7

scoring:
  weights:
    accuracy: 0.4
    cost_efficiency: 0.3
    response_time: 0.2
    creativity: 0.1

output:
  format: json
  include_metadata: true
  save_individual_results: true
```

**Use configuration file:**

```bash
llm-eval run-batch --config evaluation_config.yaml --output batch_results/
```
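Before launching a batch run, it is worth sanity-checking the config, for example that the scoring weights sum to 1.0. Assuming the YAML above has been loaded into a dict (for instance with PyYAML's `yaml.safe_load`), a quick check might look like this (the inline dict stands in for the loaded file):

```python
# Stands in for: config = yaml.safe_load(open("evaluation_config.yaml"))
config = {
    "scoring": {"weights": {"accuracy": 0.4, "cost_efficiency": 0.3,
                            "response_time": 0.2, "creativity": 0.1}},
    "evaluation": {"budget": 0.20, "quality_threshold": 0.7},
}

weights = config["scoring"]["weights"]
total = sum(weights.values())
# Use a tolerance: 0.4 + 0.3 + 0.2 + 0.1 is not exactly 1.0 in binary floats
assert abs(total - 1.0) < 1e-9, f"weights sum to {total}, expected 1.0"
print("config OK")
```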
#### **Pipeline Integration**

```bash
#!/bin/bash
# CI/CD pipeline example

# Quality gate evaluation
llm-eval evaluate \
  --model production-model \
  --test-cases qa_test_suite.json \
  --quality-gate 0.85 \
  --fail-on-low-score \
  --output qa_results.json

# If evaluation passes, deploy
if [ $? -eq 0 ]; then
    echo "✅ Quality gate passed - deploying model"
    ./deploy_model.sh
else
    echo "❌ Quality gate failed - blocking deployment"
    exit 1
fi
```
#### **Monitoring and Alerts**

```bash
# Continuous monitoring
llm-eval monitor \
  --model production-model \
  --baseline baseline_metrics.json \
  --alert-threshold 0.05 \
  --webhook https://alerts.company.com/llm-degradation \
  --interval 3600  # Check every hour
```
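Under the hood, a degradation monitor like this reduces to comparing fresh metrics against a baseline with a relative threshold. A sketch of that comparison (the function is illustrative, not part of the framework):

```python
def degraded(baseline: float, current: float, threshold: float = 0.05) -> bool:
    """True if `current` fell more than `threshold` (relative) below `baseline`."""
    if baseline <= 0:
        return False  # no meaningful baseline to compare against
    return (baseline - current) / baseline > threshold

# A 2.2% drop stays quiet; an 11% drop would fire the webhook
print(degraded(0.90, 0.88))  # → False
print(degraded(0.90, 0.80))  # → True
```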

๐Ÿ› ๏ธ Environment Configuration

### ๐Ÿ” **API Keys and Credentials** Set up authentication for different LLM providers:
# Create .env file in your project directory
touch .env

# Add your API keys (never commit this file!)
echo "OPENAI_API_KEY=your_openai_api_key_here" >> .env
echo "ANTHROPIC_API_KEY=your_anthropic_api_key_here" >> .env
echo "AZURE_OPENAI_API_KEY=your_azure_api_key_here" >> .env
echo "AZURE_OPENAI_ENDPOINT=your_azure_endpoint_here" >> .env

# Alternative: Set environment variables directly
export OPENAI_API_KEY="your_openai_api_key_here"
export ANTHROPIC_API_KEY="your_anthropic_api_key_here"
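If you prefer not to add a dotenv dependency, a minimal stdlib loader for simple `KEY=VALUE` files can stand in (a sketch only; real `.env` dialects also handle quoting and `export` prefixes):

```python
import os

def load_dotenv(path: str = ".env") -> None:
    """Load simple KEY=VALUE lines into os.environ (existing vars win)."""
    try:
        with open(path) as f:
            for line in f:
                line = line.strip()
                # Skip blanks, comments, and anything that isn't KEY=VALUE
                if not line or line.startswith("#") or "=" not in line:
                    continue
                key, _, value = line.partition("=")
                os.environ.setdefault(key.strip(), value.strip())
    except FileNotFoundError:
        pass  # no .env is fine; fall back to the real environment

load_dotenv()
print(os.getenv("OPENAI_API_KEY", "NOT_SET"))
```

Using `setdefault` means values already exported in the shell take precedence over the file, which is the convention most dotenv tools follow.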
### ⚙️ **Framework Configuration**

**Create `llm_eval_config.yaml`:**

```yaml
# Global framework configuration
framework:
  logging_level: INFO
  cache_enabled: true
  cache_ttl_hours: 24
  max_concurrent_evaluations: 5

# Default model settings
defaults:
  max_tokens: 2048
  temperature: 0.7
  timeout_seconds: 30
  retry_attempts: 3

# Cost management
cost_management:
  daily_budget_limit: 10.00  # USD
  per_request_limit: 0.10    # USD
  alert_threshold: 0.80      # 80% of budget

# Output preferences
output:
  default_format: json
  include_metadata: true
  save_intermediate_results: true
  results_directory: "./evaluation_results"
```
### 📁 **Project Structure**

Recommended project organization:

```text
my-llm-evaluation/
├── .env                          # API keys (git-ignored)
├── llm_eval_config.yaml          # Framework configuration
├── requirements.txt              # Python dependencies
├── evaluation_scripts/
│   ├── first_evaluation.py
│   ├── batch_evaluation.py
│   └── custom_evaluation.py
├── test_datasets/
│   ├── general_knowledge.json
│   ├── reasoning_tests.json
│   └── creativity_tests.json
├── results/
│   ├── 2024-01-15/
│   └── latest/
├── configs/
│   ├── evaluation_config.yaml
│   └── model_configs/
└── reports/
    ├── analysis_report.html
    └── summary_dashboard.html
```
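The tree above can be scaffolded in a few lines with `pathlib` (directory names copied from the layout; adjust to taste):

```python
from pathlib import Path

# Directories from the recommended layout above
DIRS = [
    "evaluation_scripts", "test_datasets", "results/latest",
    "configs/model_configs", "reports",
]

root = Path("my-llm-evaluation")
for d in DIRS:
    (root / d).mkdir(parents=True, exist_ok=True)  # safe to re-run
(root / ".env").touch()  # remember to git-ignore this file

print(sorted(p.name for p in root.iterdir()))
```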

## 🧪 Verification & Testing

### ✅ **Complete Installation Test**

Run this comprehensive test to verify your installation.

**Create `installation_test.py`:**
"""
๐Ÿงช LLM Evaluation Framework Installation Test
===========================================

Comprehensive test to verify all components are working correctly.
"""

import sys
import traceback
from datetime import datetime

def run_test(test_name, test_func):
    """Run a test and report results."""
    try:
        print(f"๐Ÿ”„ Running {test_name}...")
        test_func()
        print(f"โœ… {test_name} PASSED")
        return True
    except Exception as e:
        print(f"โŒ {test_name} FAILED: {str(e)}")
        if "--verbose" in sys.argv:
            traceback.print_exc()
        return False

def test_core_imports():
    """Test core framework imports."""
    from llm_evaluation_framework import (
        ModelRegistry,
        ModelInferenceEngine,
        AutoSuggestionEngine,
        TestDatasetGenerator
    )

def test_registry_functionality():
    """Test model registry functionality."""
    from llm_evaluation_framework import ModelRegistry

    registry = ModelRegistry()

    # Test model registration
    config = {
        "provider": "mock",
        "api_cost_input": 0.001,
        "api_cost_output": 0.002,
        "capabilities": ["reasoning"]
    }

    assert registry.register_model("test-model", config)
    assert "test-model" in registry.list_models()
    assert registry.get_model("test-model")["provider"] == "mock"

def test_dataset_generation():
    """Test dataset generation functionality."""
    from llm_evaluation_framework import TestDatasetGenerator

    generator = TestDatasetGenerator()
    requirements = {
        "domain": "general",
        "required_capabilities": ["reasoning"]
    }

    test_cases = generator.generate_test_cases(requirements, count=3)
    assert len(test_cases) == 3
    assert all("prompt" in case for case in test_cases)

def test_inference_engine():
    """Test inference engine functionality."""
    from llm_evaluation_framework import ModelRegistry, ModelInferenceEngine

    registry = ModelRegistry()
    engine = ModelInferenceEngine(registry)

    # Register mock model for testing
    config = {
        "provider": "mock",
        "api_cost_input": 0.0,
        "api_cost_output": 0.0,
        "capabilities": ["reasoning"]
    }
    registry.register_model("mock-test", config)

    # Test evaluation
    test_cases = [{
        "id": "test1",
        "prompt": "What is 2+2?",
        "expected_output": "4"
    }]

    requirements = {"budget": 1.0}
    results = engine.evaluate_model("mock-test", test_cases, requirements)

    assert "aggregate_metrics" in results
    assert "test_results" in results

def test_cli_availability():
    """Test CLI command availability."""
    import subprocess

    result = subprocess.run(
        ["llm-eval", "--version"],
        capture_output=True,
        text=True
    )
    assert result.returncode == 0

def test_persistence():
    """Test persistence functionality."""
    from llm_evaluation_framework.persistence import JSONStore
    import tempfile
    import os

    with tempfile.NamedTemporaryFile(suffix='.json', delete=False) as tmp:
        store = JSONStore(tmp.name)
        test_data = {"test": "data", "timestamp": datetime.now().isoformat()}

        store.save_evaluation_result(test_data)
        loaded_data = store.load_evaluation_result()

        assert loaded_data["test"] == "data"
        os.unlink(tmp.name)

def main():
    """Run all installation tests."""
    print("๐Ÿงช LLM Evaluation Framework Installation Test")
    print("=" * 50)
    print(f"๐Ÿ“… Test started at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    print()

    tests = [
        ("Core Imports", test_core_imports),
        ("Model Registry", test_registry_functionality),
        ("Dataset Generation", test_dataset_generation),
        ("Inference Engine", test_inference_engine),
        ("CLI Availability", test_cli_availability),
        ("Persistence Layer", test_persistence)
    ]

    passed = 0
    total = len(tests)

    for test_name, test_func in tests:
        if run_test(test_name, test_func):
            passed += 1
        print()

    print("๐Ÿ“‹ TEST SUMMARY")
    print("=" * 15)
    print(f"โœ… Passed: {passed}/{total}")
    print(f"โŒ Failed: {total - passed}/{total}")
    print(f"๐Ÿ“Š Success Rate: {(passed/total)*100:.1f}%")

    if passed == total:
        print("\n๐ŸŽ‰ ALL TESTS PASSED! Installation is complete and working correctly.")
        print("๐Ÿ’ก You're ready to start evaluating LLMs!")
    else:
        print("\nโš ๏ธ  Some tests failed. Please check the error messages above.")
        print("๐Ÿ”— Need help? Visit: https://github.com/isathish/LLMEvaluationFramework/issues")
        sys.exit(1)

if __name__ == "__main__":
    main()
**Run the test:**

```bash
python installation_test.py

# For detailed error information:
python installation_test.py --verbose
```
### 🚀 **Performance Benchmark**

Verify performance with the included benchmark:

```bash
# Run performance benchmark
llm-eval benchmark \
  --models mock-model \
  --test-cases 100 \
  --iterations 3 \
  --output benchmark_results.json

# Expected output:
# 📊 Average evaluation time: 0.15s per test case
# 🚀 Throughput: 6.67 evaluations per second
# 💾 Memory usage: 45.2 MB peak
```

## 🚨 Troubleshooting

### 🔧 **Common Issues and Solutions**

#### **🐛 Installation Issues**

**`ModuleNotFoundError: No module named 'llm_evaluation_framework'`**

**Problem**: Framework not properly installed or virtual environment not activated.

**Solutions**:

```bash
# 1. Check if virtual environment is activated
which python
# Should show path to virtual environment

# 2. Reinstall framework
pip uninstall llm-evaluation-framework
pip install llm-evaluation-framework

# 3. Check Python path
python -c "import sys; print('\n'.join(sys.path))"

# 4. Force reinstall with no cache
pip install --no-cache-dir --force-reinstall llm-evaluation-framework
```
**`Command 'llm-eval' not found`**

**Problem**: CLI not properly installed or not in PATH.

**Solutions**:

```bash
# 1. Check if pip installed scripts correctly
pip show -f llm-evaluation-framework | grep bin

# 2. Add pip bin directory to PATH
export PATH="$PATH:$(python -m site --user-base)/bin"

# 3. Use module syntax as alternative
python -m llm_evaluation_framework.cli --help

# 4. Reinstall with explicit user flag
pip install --user llm-evaluation-framework
```
**Permission denied errors**

**Problem**: Insufficient permissions for installation.

**Solutions**:

```bash
# 1. Use virtual environment (recommended)
python -m venv llm-eval-env
source llm-eval-env/bin/activate
pip install llm-evaluation-framework

# 2. Install for current user only
pip install --user llm-evaluation-framework

# 3. Fix permissions (macOS/Linux)
sudo chown -R $(whoami) ~/.local

# 4. Use conda environment
conda create -n llm-eval python=3.11
conda activate llm-eval
pip install llm-evaluation-framework
```
#### **🔑 Authentication Issues**

**API key authentication failures**

**Problem**: API keys not configured or invalid.

**Solutions**:

```bash
# 1. Verify environment variables are set
echo $OPENAI_API_KEY
echo $ANTHROPIC_API_KEY

# 2. Test API key validity
curl -H "Authorization: Bearer $OPENAI_API_KEY" \
     https://api.openai.com/v1/models

# 3. Check .env file is in correct location
cat .env  # Should show your API keys

# 4. Verify Python can read environment variables
python -c "import os; print(os.getenv('OPENAI_API_KEY', 'NOT_SET'))"
```
**Rate limiting or quota errors**

**Problem**: API rate limits exceeded or quota exhausted.

**Solutions**:

```bash
# 1. Check current usage (OpenAI)
llm-eval check-usage --provider openai

# 2. Reduce concurrent requests in config
# Edit llm_eval_config.yaml:
# max_concurrent_evaluations: 1

# 3. Add delays between requests
llm-eval evaluate --rate-limit 1.0  # 1 second between requests

# 4. Use mock models for testing
llm-eval evaluate --model mock-model --test-cases test.json
```
#### **💰 Cost Management Issues**

**Unexpected high API costs**

**Problem**: Evaluation costs higher than expected.

**Prevention & Solutions**:

```bash
# 1. Set strict budget limits
llm-eval evaluate --budget 0.10 --fail-on-exceed

# 2. Use cost estimation before evaluation
llm-eval estimate-cost \
  --model gpt-4 \
  --test-cases large_dataset.json

# 3. Use cheaper models for testing
llm-eval evaluate --model gpt-3.5-turbo  # Instead of gpt-4

# 4. Monitor costs in real-time
llm-eval evaluate --monitor-costs --cost-alert 0.05
```
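The arithmetic behind cost estimation is straightforward: average token counts times the per-1K-token rates. A back-of-envelope sketch (the rates are the example gpt-3.5-turbo values registered earlier, not live pricing, and the helper is illustrative):

```python
def estimate_cost(n_cases: int, in_tokens: int, out_tokens: int,
                  rate_in: float, rate_out: float) -> float:
    """Rough USD estimate: per-1K-token rates applied to average token counts."""
    per_case = (in_tokens / 1000) * rate_in + (out_tokens / 1000) * rate_out
    return n_cases * per_case

# 100 test cases, ~200 prompt and ~300 completion tokens each
cost = estimate_cost(100, 200, 300, rate_in=0.0015, rate_out=0.002)
print(f"${cost:.2f}")  # → $0.09
```

Running a rough estimate like this before pointing an evaluation at an expensive model is the cheapest budget guard there is.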
#### **๐Ÿš€ Performance Issues**
**Slow evaluation performance**

**Problem**: Evaluations take too long to complete.

**Solutions**:

```bash
# 1. Enable parallel processing
llm-eval evaluate --parallel --max-workers 4

# 2. Use batch processing
llm-eval evaluate --batch-size 10

# 3. Optimize the model configuration
# Reduce max_tokens in the model config
llm-eval update-model --name gpt-3.5-turbo --max-tokens 1000

# 4. Enable caching for repeated evaluations
llm-eval evaluate --enable-cache

# 5. Use faster models for initial testing
llm-eval evaluate --model claude-3-haiku  # Faster than claude-3-opus
```
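
Parallelism helps because evaluation calls are I/O-bound: the process mostly waits on the network. If you are scripting around the framework, the same idea is a few lines with the standard library (`evaluate_one` is a placeholder for whatever issues a single evaluation call; this is not the framework's `--parallel` implementation):

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_parallel(test_cases, evaluate_one, max_workers=4):
    """Evaluate independent test cases concurrently; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(evaluate_one, test_cases))
```

Keep `max_workers` at or below your provider's concurrency limit, otherwise the speedup just turns into rate-limit errors.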
#### **๐Ÿ“Š Result Issues**
**Inconsistent or unexpected results**

**Problem**: Evaluation results vary significantly between runs.

**Solutions**:

```bash
# 1. Set deterministic parameters
llm-eval evaluate --temperature 0.0 --seed 42

# 2. Increase the test case count for statistical significance
llm-eval generate --count 50  # Instead of 10

# 3. Run multiple evaluation rounds
llm-eval evaluate --rounds 3 --aggregate-method mean

# 4. Use more robust scoring methods
llm-eval evaluate --scorer semantic_similarity  # Instead of exact_match
```
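
Multi-round aggregation is easy to sanity-check yourself. A minimal sketch that reports the per-case mean and spread across rounds (a large standard deviation flags an unstable test case; this is an illustration, not the framework's `--aggregate-method` code):

```python
from statistics import mean, stdev

def aggregate_rounds(rounds):
    """Given per-round score lists, return per-case mean and stdev."""
    per_case = list(zip(*rounds))  # transpose: rounds x cases -> cases x rounds
    return [
        {"mean": mean(scores), "stdev": stdev(scores) if len(scores) > 1 else 0.0}
        for scores in per_case
    ]
```

Cases with a near-zero stdev are stable; cases with a large one are where you should look for ambiguous prompts or overly strict scoring.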
### ๐Ÿ” **Diagnostic Commands** Use these commands to diagnose issues:
# System information
llm-eval diagnose --system

# Framework configuration check
llm-eval diagnose --config

# Model registry validation
llm-eval diagnose --models

# Connectivity test
llm-eval diagnose --connectivity

# Permission check
llm-eval diagnose --permissions

# Generate comprehensive diagnostic report
llm-eval diagnose --all --output diagnostic_report.txt
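
The environment facts a bug report asks for can also be gathered directly with the standard library; a minimal sketch (the field names are illustrative, not the `diagnose` report schema):

```python
import platform
import sys

def system_report():
    """Collect the basic environment facts requested in bug reports."""
    return {
        "os": platform.system(),          # e.g. "Linux", "Darwin", "Windows"
        "os_version": platform.release(),
        "python": sys.version.split()[0], # e.g. "3.11.4"
        "machine": platform.machine(),    # e.g. "x86_64", "arm64"
    }
```

Paste the output alongside the diagnostic report when filing an issue.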
### ๐Ÿ“ž **Getting Help**
If you're still experiencing issues:
| Resource | Best For | Response Time |
|----------|----------|---------------|
| GitHub Issues | Bug reports, feature requests | 1-3 days |
| GitHub Discussions | Q&A, general help, ideas | Community-driven |
| Documentation | Detailed guides, API reference | Immediate |
| Stack Overflow | Technical questions | Community-driven |
**When reporting issues, include:**

- Output of `llm-eval diagnose --all`
- Your operating system and Python version
- The complete error message and stack trace
- Steps to reproduce the issue
- Your configuration files (with sensitive data removed)

๐ŸŽฏ Next Steps & Learning Path

### ๐Ÿ›ค๏ธ **Recommended Learning Path**
#### **Phase 1: Foundation** (Days 1-2)

- ✅ **Completed**: Installation and first evaluation
- 🎯 **Next**: [Core Concepts](core-concepts.md) - Understand framework architecture
- 📚 **Resource**: [API Reference](api-reference.md) - Bookmark for quick reference

#### **Phase 2: Practical Skills** (Days 3-5)

- 🎯 **Goal**: [Examples & Tutorials](examples.md) - Hands-on learning with real scenarios
- 🎯 **Goal**: [CLI Usage Guide](cli-usage.md) - Master command-line workflows
- 🎯 **Goal**: [Evaluation & Scoring](evaluation-and-scoring.md) - Advanced scoring strategies

#### **Phase 3: Advanced Usage** (Week 2)

- 🎯 **Goal**: [Advanced Usage](advanced-usage.md) - Performance optimization and production patterns
- 🎯 **Goal**: [Model Registry](model-registry.md) - Advanced model management
- 🎯 **Goal**: [Persistence Layer](persistence-layer.md) - Data management and storage

#### **Phase 4: Customization** (Week 3)

- 🎯 **Goal**: [Developer Guide](developer-guide.md) - Build custom components
- 🎯 **Goal**: [Auto-Suggestion Engine](auto-suggestion-engine.md) - Advanced recommendation systems
- 🎯 **Goal**: [Contributing](../contributing.md) - Contribute back to the project
### ๐ŸŽฏ **Immediate Next Actions**
**๐Ÿ“‹ Choose your immediate next step:**

๐Ÿง  I want to understand the concepts

Learn the framework architecture and design principles

โ†’ Read Core Concepts

๐Ÿ’ป I want to see more examples

Explore practical examples and real-world use cases

โ†’ Try Examples

โšก I want to optimize performance

Learn advanced patterns and production best practices

โ†’ Advanced Usage

๐Ÿ› ๏ธ I want to customize the framework

Build custom components and contribute to the project

โ†’ Developer Guide
### ๐Ÿš€ **Quick Wins to Try**
**๐ŸŽฏ Complete these in the next 30 minutes:** 1. **๐Ÿ”ง Model Comparison**
llm-eval evaluate --model gpt-3.5-turbo claude-3-haiku --test-cases 5 --output comparison.json
2. **๐Ÿ“Š Custom Scoring**
llm-eval score --predictions "Your model output" --references "Expected output" --metric bleu
3. **๐ŸŽจ Creative Dataset**
llm-eval generate --capability creativity --domain "science fiction" --count 10
4. **๐Ÿ“ˆ Results Analysis**
llm-eval analyze --evaluation-results comparison.json --format html
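
For the custom-scoring quick win, it helps to see what a scorer does under the hood. A toy lenient string scorer in pure Python (this is not the framework's built-in `bleu` or `semantic_similarity` metric, just an illustration of exact match softened into token overlap):

```python
import re

def normalized_match(prediction, reference):
    """1.0 on a case/punctuation-insensitive match, else a token-overlap F1."""
    def norm(s):
        # Lowercase, strip punctuation, split on whitespace
        return re.sub(r"[^\w\s]", "", s.lower()).split()
    pred, ref = norm(prediction), norm(reference)
    if pred == ref:
        return 1.0
    overlap = len(set(pred) & set(ref))
    if overlap == 0:
        return 0.0
    precision = overlap / len(set(pred))
    recall = overlap / len(set(ref))
    return 2 * precision * recall / (precision + recall)
```

Scorers like this sit between brittle exact match (fails on a stray comma) and heavyweight semantic similarity (needs an embedding model).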
### ๐Ÿ† **Achievements System** Track your progress as you learn:
- [ ] **๐Ÿš€ First Steps** - Complete installation and first evaluation - [ ] **๐ŸŽฏ Evaluator** - Run 10 different evaluations - [ ] **๐Ÿงช Dataset Creator** - Generate 5 different test datasets - [ ] **โšก CLI Master** - Use 10 different CLI commands - [ ] **๐Ÿ”ง Optimizer** - Improve evaluation performance by 50% - [ ] **๐Ÿ—๏ธ Builder** - Create a custom scoring strategy - [ ] **๐Ÿค Contributor** - Submit your first pull request - [ ] **๐Ÿ“š Teacher** - Help others in GitHub Discussions **Share your achievements on social media with #LLMEvalFramework!**

## ๐ŸŽ‰ Congratulations! You're Now Ready to Evaluate LLMs **You've successfully set up the LLM Evaluation Framework and run your first evaluation!** ### ๐ŸŒŸ **What You've Accomplished** โœ… **Installed** the framework in your environment โœ… **Configured** models and evaluation settings โœ… **Generated** synthetic test datasets โœ… **Executed** comprehensive model evaluations โœ… **Analyzed** results and got recommendations โœ… **Learned** CLI commands for automation ### ๐Ÿš€ **Ready for Advanced Topics?** [![Core Concepts](https://img.shields.io/badge/๐Ÿ“š_Core_Concepts-Understand_the_Framework-22c55e?style=for-the-badge)](core-concepts.md) [![Examples Hub](https://img.shields.io/badge/๐Ÿ’ก_Examples_Hub-Hands%20On%20Learning-3b82f6?style=for-the-badge)](examples.md) [![Advanced Usage](https://img.shields.io/badge/โšก_Advanced_Usage-Power_User_Guide-f59e0b?style=for-the-badge)](advanced-usage.md) [![Developer Guide](https://img.shields.io/badge/๐Ÿ› ๏ธ_Developer_Guide-Build_%26_Contribute-8b5cf6?style=for-the-badge)](developer-guide.md) --- ### ๐Ÿ’ฌ **Join Our Community** [![GitHub Discussions](https://img.shields.io/badge/๐Ÿ’ฌ_GitHub_Discussions-Join_the_Conversation-purple?style=for-the-badge)](https://github.com/isathish/LLMEvaluationFramework/discussions) [![Stack Overflow](https://img.shields.io/badge/โ“_Stack_Overflow-Get_Technical_Help-orange?style=for-the-badge)](https://stackoverflow.com/questions/tagged/llm-evaluation-framework) [![Contributing](https://img.shields.io/badge/๐Ÿค_Contributing-Make_a_Difference-green?style=for-the-badge)](../contributing.md) --- **โญ Found this helpful? [Star us on GitHub](https://github.com/isathish/LLMEvaluationFramework) to support the project!** *Happy evaluating! ๐Ÿš€*