
# 🚀 Getting Started

![Getting Started](https://img.shields.io/badge/Getting%20Started-Your%20LLM%20Evaluation%20Journey-22c55e?style=for-the-badge&logo=rocket&logoColor=white) **From zero to expert in 15 minutes** *Complete setup guide with interactive examples and instant verification* [![Quick Install](https://img.shields.io/badge/⚡_Quick_Install-2_Minutes-ef4444?style=for-the-badge)](#quick-installation) [![First Evaluation](https://img.shields.io/badge/🎯_First_Evaluation-5_Minutes-f59e0b?style=for-the-badge)](#your-first-evaluation) [![CLI Mastery](https://img.shields.io/badge/🖥️_CLI_Mastery-3_Minutes-3b82f6?style=for-the-badge)](#cli-quickstart) [![Troubleshooting](https://img.shields.io/badge/🛠️_Help_%26_Support-Always_Available-8b5cf6?style=for-the-badge)](#troubleshooting)

## 🎯 What You'll Achieve

| Milestone | Time | Outcome |
|---|---|---|
| ⚡ Installation Complete | 2 minutes | Framework installed and verified |
| 🎯 First Evaluation | 5 minutes | Run your first model evaluation |
| 🖥️ CLI Proficiency | 8 minutes | Master command-line workflows |
| 🚀 Production Ready | 15 minutes | Understand advanced patterns |

**🌟 By the end of this guide, you'll be able to:**

- ✅ Install and configure the framework in any environment
- ✅ Register and evaluate your first language model
- ✅ Generate synthetic datasets for comprehensive testing
- ✅ Run evaluations and analyze detailed results
- ✅ Use the CLI for automated evaluation workflows
- ✅ Export results in multiple formats for reporting
- ✅ Troubleshoot common issues independently

## 📋 Prerequisites & System Requirements

### 🖥️ **System Requirements**

| Component | Minimum | Recommended | Notes |
|---|---|---|---|
| Python | 3.8+ | 3.11+ | Type hints & performance improvements |
| Memory | 4 GB RAM | 8 GB+ RAM | For large-scale evaluations |
| Storage | 1 GB free | 5 GB+ free | For datasets and cached results |
| Network | Stable internet | High-speed connection | For API calls to LLM providers |

### 🛠️ **Platform Support**

![Linux](https://img.shields.io/badge/Linux-Ubuntu%20|%20CentOS%20|%20RHEL-22c55e?style=for-the-badge&logo=linux&logoColor=white) ![macOS](https://img.shields.io/badge/macOS-Intel%20%26%20Apple%20Silicon-3b82f6?style=for-the-badge&logo=apple&logoColor=white) ![Windows](https://img.shields.io/badge/Windows-10%20|%2011-f59e0b?style=for-the-badge&logo=windows&logoColor=white)

### 🔧 **Development Tools** (Optional but Recommended)

```bash
# Version management
pyenv          # Python version management
pipenv         # Dependency management
conda          # Data science environment

# Development tools
git            # Version control
docker         # Containerization
vs-code        # IDE with Python extensions
```

## ⚡ Quick Installation

### 🏃‍♂️ **Express Installation** (Recommended)

```bash
# 🚀 One-command installation
curl -sSL https://install.llmevalframework.com | bash

# ✅ Verify installation
llm-eval --version && echo "🎉 Ready to evaluate!"
```

**This script automatically:**

- Creates an isolated Python environment
- Installs the latest stable version
- Configures CLI tools
- Runs verification tests
### 📦 **Standard Installation Methods**

#### **Method 1: PyPI (Most Common)**

```bash
# Step 1: Create virtual environment (strongly recommended)
python -m venv llm-eval-env

# Step 2: Activate environment
# On macOS/Linux:
source llm-eval-env/bin/activate
# On Windows PowerShell:
llm-eval-env\Scripts\Activate.ps1
# On Windows Command Prompt:
llm-eval-env\Scripts\activate.bat

# Step 3: Upgrade pip for better dependency resolution
python -m pip install --upgrade pip

# Step 4: Install framework with all components
pip install "llm-evaluation-framework[complete]"

# Step 5: Verify installation
python -c "import llm_evaluation_framework; print('✅ Installation successful!')"
llm-eval --help
```
#### **Method 2: Development Installation**

```bash
# Step 1: Clone repository
git clone https://github.com/isathish/LLMEvaluationFramework.git
cd LLMEvaluationFramework

# Step 2: Create and activate virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Step 3: Install in development mode with all dependencies
pip install -e ".[dev,test,docs]"

# Step 4: Install pre-commit hooks (for contributors)
pre-commit install

# Step 5: Run test suite to verify installation
pytest --cov=llm_evaluation_framework -v
```
#### **Method 3: Docker Installation**

```bash
# Step 1: Pull latest image
docker pull ghcr.io/isathish/llm-evaluation-framework:latest

# Step 2: Run interactive container
docker run -it --rm \
  -v $(pwd)/data:/app/data \
  -v $(pwd)/results:/app/results \
  ghcr.io/isathish/llm-evaluation-framework:latest

# Step 3: Verify inside container
llm-eval --version
```
#### **Method 4: Conda Installation**

```bash
# Step 1: Create conda environment
conda create -n llm-eval python=3.11 -y
conda activate llm-eval

# Step 2: Install from conda-forge
conda install -c conda-forge llm-evaluation-framework

# Alternative: Install via pip in conda environment
pip install llm-evaluation-framework

# Step 3: Verify installation
llm-eval --version
```
### 🔍 **Installation Verification**

Run these commands to ensure everything is working correctly:

```bash
# 1️⃣ Version check
llm-eval --version
# Expected output: LLM Evaluation Framework v1.0.0

# 2️⃣ Core import test
python -c "
from llm_evaluation_framework import (
    ModelRegistry,
    ModelInferenceEngine,
    TestDatasetGenerator
)
print('✅ Core components imported successfully')
"

# 3️⃣ CLI functionality test
llm-eval list --type capabilities
# Expected output: List of available capabilities

# 4️⃣ Quick functionality test
python -c "
from llm_evaluation_framework import ModelRegistry
registry = ModelRegistry()
print(f'✅ Registry initialized with {len(registry.list_models())} pre-configured models')
"
```

**🎉 If all commands run without errors, you're ready to proceed!**

## 🎯 Your First Evaluation

### 🏗️ **Step 1: Environment Setup**

Create a new directory for your evaluation project:

```bash
# Create project directory
mkdir my-llm-evaluation
cd my-llm-evaluation

# Create Python file for our first evaluation
touch first_evaluation.py
```

### 🤖 **Step 2: Initialize the Framework**

**Create `first_evaluation.py`:**
"""
๐Ÿš€ My First LLM Evaluation
=========================

This script demonstrates the complete evaluation workflow:
1. Framework initialization
2. Model registration  
3. Test dataset generation
4. Model evaluation
5. Results analysis
"""

import os
from datetime import datetime
from llm_evaluation_framework import (
    ModelRegistry,
    ModelInferenceEngine,
    AutoSuggestionEngine,
    TestDatasetGenerator
)
from llm_evaluation_framework.persistence import JSONStore

def main():
    """Run complete evaluation workflow."""

    print("๐Ÿš€ LLM Evaluation Framework - First Evaluation")
    print("=" * 55)
    print(f"๐Ÿ“… Started at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

    # Step 1: Initialize core components
    print("\n๐Ÿ”ง Step 1: Initializing framework components...")
    registry = ModelRegistry()
    engine = ModelInferenceEngine(registry)
    suggestion_engine = AutoSuggestionEngine(registry)
    dataset_generator = TestDatasetGenerator()
    print("โœ… All components initialized successfully!")

    # Step 2: Register models for evaluation
    setup_models(registry)

    # Step 3: Generate test dataset
    test_cases = generate_test_dataset(dataset_generator)

    # Step 4: Run evaluations
    results = run_evaluations(engine, test_cases)

    # Step 5: Get recommendations
    recommendations = get_recommendations(suggestion_engine, results)

    # Step 6: Save and display results
    save_and_display_results(results, recommendations, test_cases)

    print("\n๐ŸŽ‰ First evaluation completed successfully!")
    print("๐Ÿ’ก Next steps: Try different models or custom test cases")

def setup_models(registry):
    """Register models for evaluation."""
    print("\n๐Ÿค– Step 2: Registering evaluation models...")

    # Model configurations
    models = {
        "gpt-3.5-turbo": {
            "provider": "openai",
            "api_cost_input": 0.0015,
            "api_cost_output": 0.002,
            "capabilities": ["reasoning", "creativity", "coding"],
            "max_tokens": 4096,
            "context_window": 16385
        },
        "claude-3-haiku": {
            "provider": "anthropic", 
            "api_cost_input": 0.00325,
            "api_cost_output": 0.01625,
            "capabilities": ["reasoning", "analysis", "creative_writing"],
            "max_tokens": 4096,
            "context_window": 200000
        },
        "mock-model": {
            "provider": "mock",  # For testing without API costs
            "api_cost_input": 0.0,
            "api_cost_output": 0.0,
            "capabilities": ["reasoning", "creativity"],
            "max_tokens": 2048,
            "context_window": 4096
        }
    }

    # Register each model
    for model_name, config in models.items():
        success = registry.register_model(model_name, config)
        status = "โœ…" if success else "โŒ"
        print(f"   {status} {model_name}")

    print(f"๐Ÿ“Š Total registered models: {len(registry.list_models())}")

def generate_test_dataset(generator):
    """Generate comprehensive test dataset."""
    print("\n๐Ÿงช Step 3: Generating test dataset...")

    # Define evaluation requirements
    requirements = {
        "domain": "general_knowledge",
        "required_capabilities": ["reasoning", "creativity"],
        "difficulty_levels": ["easy", "medium", "hard"],
        "test_types": ["factual", "analytical", "creative"]
    }

    # Generate test cases
    test_cases = generator.generate_test_cases(requirements, count=8)

    print(f"โœ… Generated {len(test_cases)} test cases:")
    for i, case in enumerate(test_cases, 1):
        print(f"   {i}. {case.get('type', 'unknown').title()}: {case.get('prompt', '')[:60]}...")

    return test_cases

def run_evaluations(engine, test_cases):
    """Run evaluations for all registered models."""
    print("\nโšก Step 4: Running model evaluations...")

    # Evaluation configuration
    evaluation_config = {
        "max_response_time": 30.0,  # seconds
        "budget": 0.50,             # USD
        "quality_threshold": 0.7,
        "required_capabilities": ["reasoning"]
    }

    # Get available models (excluding potentially expensive ones for demo)
    available_models = ["mock-model"]  # Safe for testing

    results = {}
    for model_id in available_models:
        print(f"   ๐Ÿ”„ Evaluating {model_id}...")
        try:
            result = engine.evaluate_model(model_id, test_cases, evaluation_config)
            results[model_id] = result

            # Display basic metrics
            metrics = result.get('aggregate_metrics', {})
            print(f"      ๐Ÿ“Š Score: {metrics.get('overall_score', 0):.3f}")
            print(f"      ๐Ÿ’ฐ Cost: ${metrics.get('total_cost', 0):.4f}")
            print(f"      โฑ๏ธ  Time: {metrics.get('avg_response_time', 0):.2f}s")
            print(f"      โœ… Success Rate: {metrics.get('success_rate', 0):.1%}")

        except Exception as e:
            print(f"      โŒ Evaluation failed: {str(e)}")
            continue

    print(f"โœ… Completed evaluations for {len(results)} models")
    return results

def get_recommendations(suggestion_engine, results):
    """Get model recommendations based on results."""
    print("\n๐ŸŽฏ Step 5: Generating recommendations...")

    if not results:
        print("   โš ๏ธ  No results available for recommendations")
        return []

    # Recommendation requirements
    recommendation_config = {
        "weights": {
            "accuracy": 0.4,
            "cost_efficiency": 0.3,
            "response_time": 0.2,
            "reliability": 0.1
        },
        "constraints": {
            "max_cost_per_request": 0.05,
            "min_accuracy": 0.6
        }
    }

    try:
        evaluation_results = list(results.values())
        recommendations = suggestion_engine.suggest_model(
            evaluation_results, 
            recommendation_config
        )

        print(f"โœ… Generated {len(recommendations)} recommendations")
        if recommendations:
            top_rec = recommendations[0]
            print(f"   ๐Ÿ† Top recommendation: {top_rec.get('model_id', 'Unknown')}")
            print(f"   ๐Ÿ“Š Recommendation score: {top_rec.get('recommendation_score', 0):.3f}")

        return recommendations

    except Exception as e:
        print(f"   โŒ Recommendation generation failed: {str(e)}")
        return []

def save_and_display_results(results, recommendations, test_cases):
    """Save results and display summary."""
    print("\n๐Ÿ’พ Step 6: Saving results and generating summary...")

    # Prepare comprehensive results
    complete_results = {
        "evaluation_metadata": {
            "timestamp": datetime.now().isoformat(),
            "framework_version": "1.0.0",
            "total_test_cases": len(test_cases),
            "models_evaluated": len(results)
        },
        "test_cases": test_cases,
        "evaluation_results": results,
        "recommendations": recommendations
    }

    # Save to JSON file
    try:
        store = JSONStore("first_evaluation_results.json")
        store.save_evaluation_result(complete_results)
        print("โœ… Results saved to 'first_evaluation_results.json'")
    except Exception as e:
        print(f"โŒ Failed to save results: {str(e)}")

    # Display final summary
    print("\n๐Ÿ“‹ EVALUATION SUMMARY")
    print("=" * 25)

    if results:
        for model_id, result in results.items():
            metrics = result.get('aggregate_metrics', {})
            print(f"\n๐Ÿค– {model_id.upper()}:")
            print(f"   ๐Ÿ“Š Overall Score: {metrics.get('overall_score', 0):.3f}/1.0")
            print(f"   ๐Ÿ’ฐ Total Cost: ${metrics.get('total_cost', 0):.4f}")
            print(f"   โฑ๏ธ  Average Response Time: {metrics.get('avg_response_time', 0):.2f}s")
            print(f"   โœ… Success Rate: {metrics.get('success_rate', 0):.1%}")
    else:
        print("โš ๏ธ  No evaluation results to display")

    if recommendations:
        print(f"\n๐ŸŽฏ TOP RECOMMENDATION: {recommendations[0].get('model_id', 'Unknown')}")
        print(f"๐Ÿ“ˆ Confidence: {recommendations[0].get('recommendation_score', 0):.3f}")

if __name__ == "__main__":
    main()
### ๐Ÿƒโ€โ™‚๏ธ **Step 3: Run Your First Evaluation**
# Execute the evaluation script
python first_evaluation.py
**Expected Output:**

```text
🚀 LLM Evaluation Framework - First Evaluation
=======================================================
📅 Started at: 2024-01-15 14:30:25

🔧 Step 1: Initializing framework components...
✅ All components initialized successfully!

🤖 Step 2: Registering evaluation models...
   ✅ gpt-3.5-turbo
   ✅ claude-3-haiku
   ✅ mock-model
📊 Total registered models: 3

🧪 Step 3: Generating test dataset...
✅ Generated 8 test cases:
   1. Factual: What is the capital of France and what is its pop...
   2. Analytical: Compare the advantages and disadvantages of re...
   3. Creative: Write a short story about a robot learning to...

⚡ Step 4: Running model evaluations...
   🔄 Evaluating mock-model...
      📊 Score: 0.847
      💰 Cost: $0.0000
      ⏱️  Time: 0.15s
      ✅ Success Rate: 100.0%
✅ Completed evaluations for 1 models

🎯 Step 5: Generating recommendations...
✅ Generated 1 recommendations
   🏆 Top recommendation: mock-model
   📊 Recommendation score: 0.923

💾 Step 6: Saving results and generating summary...
✅ Results saved to 'first_evaluation_results.json'

📋 EVALUATION SUMMARY
=========================

🤖 MOCK-MODEL:
   📊 Overall Score: 0.847/1.0
   💰 Total Cost: $0.0000
   ⏱️  Average Response Time: 0.15s
   ✅ Success Rate: 100.0%

🎯 TOP RECOMMENDATION: mock-model
📈 Confidence: 0.923

🎉 First evaluation completed successfully!
💡 Next steps: Try different models or custom test cases
```

๐Ÿ–ฅ๏ธ CLI Quickstart

### ๐ŸŽฏ **Essential CLI Commands** The LLM Evaluation Framework provides a powerful command-line interface for streamlined workflows. #### **๐Ÿ” Discovery Commands**
# View all available commands and options
llm-eval --help

# List available resources
llm-eval list                           # List all available resources
llm-eval list --type models            # List registered models
llm-eval list --type capabilities      # List supported capabilities  
llm-eval list --type providers         # List supported providers
llm-eval list --type scorers           # List scoring strategies

# Get detailed help for specific commands
llm-eval evaluate --help
llm-eval generate --help
llm-eval score --help
#### **⚡ Quick Evaluation Commands**

```bash
# Generate test dataset quickly
llm-eval generate \
  --capability reasoning \
  --count 10 \
  --domain "customer_service" \
  --output test_cases.json \
  --format json

# Run evaluation with generated dataset
llm-eval evaluate \
  --model gpt-3.5-turbo \
  --test-cases test_cases.json \
  --budget 0.10 \
  --output evaluation_results.json \
  --format json \
  --verbose

# Score predictions quickly
llm-eval score \
  --predictions "The sky is blue" "Paris is a city" \
  --references "The sky is blue" "Paris is the capital of France" \
  --metric semantic_similarity \
  --output scores.json
```
#### **🔧 Model Management Commands**

```bash
# Register a new model
llm-eval register-model \
  --name "my-custom-model" \
  --provider "custom" \
  --config model_config.yaml

# Validate model configuration
llm-eval validate-model --name "gpt-3.5-turbo"

# Update model configuration
llm-eval update-model \
  --name "gpt-3.5-turbo" \
  --field "api_cost_input" \
  --value "0.002"

# Remove model registration
llm-eval remove-model --name "my-custom-model" --confirm
```
### 🔄 **Complete CLI Workflow Example**

Let's walk through a complete evaluation workflow using only CLI commands:

```bash
# 🏗️ Step 1: Setup project directory
mkdir cli-evaluation && cd cli-evaluation

# 🧪 Step 2: Generate comprehensive test dataset
llm-eval generate \
  --capability reasoning creativity coding \
  --count 15 \
  --domain "software_development" \
  --difficulty mixed \
  --output comprehensive_tests.json \
  --verbose

# 📊 Step 3: Run evaluation on multiple models
llm-eval evaluate \
  --model gpt-3.5-turbo gpt-4 claude-3 \
  --test-cases comprehensive_tests.json \
  --budget 0.25 \
  --max-time 60 \
  --output evaluation_results.json \
  --format json \
  --parallel \
  --verbose

# 🎯 Step 4: Get model recommendations
llm-eval recommend \
  --evaluation-results evaluation_results.json \
  --weights accuracy:0.4 cost:0.3 speed:0.2 creativity:0.1 \
  --constraints max_cost:0.05 min_accuracy:0.7 \
  --output recommendations.json \
  --top-n 3

# 📈 Step 5: Generate analysis report
llm-eval analyze \
  --evaluation-results evaluation_results.json \
  --recommendations recommendations.json \
  --output analysis_report.html \
  --format html \
  --include-charts

# 📋 Step 6: View quick summary
llm-eval summary \
  --evaluation-results evaluation_results.json \
  --format table
```
**Expected CLI Output:**

```text
📊 EVALUATION SUMMARY
====================
Models Evaluated: 3
Test Cases: 15
Total Cost: $0.18
Total Time: 47.3s

🏆 TOP PERFORMERS:
1. gpt-4         (Score: 0.924, Cost: $0.12)
2. claude-3      (Score: 0.891, Cost: $0.04)
3. gpt-3.5-turbo (Score: 0.856, Cost: $0.02)

💡 RECOMMENDATION: claude-3 (Best cost-performance balance)
```
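Conceptually, the `recommend` step above reduces to a weighted sum of normalized per-model metrics. A minimal sketch of that arithmetic (the metric values below are illustrative placeholders, not real benchmark data, and this is not the framework's internal API):

```python
def weighted_score(metrics: dict, weights: dict) -> float:
    """Combine normalized metric values (0-1) using a weight mapping."""
    return sum(weights[name] * metrics.get(name, 0.0) for name in weights)

# Weights mirror the CLI flags above; candidate metrics are made up for illustration
weights = {"accuracy": 0.4, "cost": 0.3, "speed": 0.2, "creativity": 0.1}
candidates = {
    "gpt-4":         {"accuracy": 0.92, "cost": 0.40, "speed": 0.55, "creativity": 0.90},
    "claude-3":      {"accuracy": 0.89, "cost": 0.85, "speed": 0.90, "creativity": 0.80},
    "gpt-3.5-turbo": {"accuracy": 0.86, "cost": 0.95, "speed": 0.80, "creativity": 0.70},
}

# Rank models by combined score, best first
ranked = sorted(candidates, key=lambda m: weighted_score(candidates[m], weights),
                reverse=True)
print(ranked[0])  # → claude-3
```

Note how a high cost-efficiency score lets a cheaper model outrank one with slightly better raw accuracy, which is exactly the trade-off the weights are there to express.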
### 🔧 **Advanced CLI Patterns**

#### **Configuration File Usage**

Create reusable configuration files for consistent evaluations:

**`evaluation_config.yaml`:**

```yaml
models:
  - gpt-3.5-turbo
  - claude-3-haiku

test_generation:
  capabilities: [reasoning, creativity]
  count: 20
  domain: customer_service
  difficulty: mixed

evaluation:
  budget: 0.20
  max_response_time: 30
  quality_threshold: 0.7

scoring:
  weights:
    accuracy: 0.4
    cost_efficiency: 0.3
    response_time: 0.2
    creativity: 0.1

output:
  format: json
  include_metadata: true
  save_individual_results: true
```

**Use configuration file:**

```bash
llm-eval run-batch --config evaluation_config.yaml --output batch_results/
```
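Before launching a batch run, it is worth sanity-checking the config, for example that the scoring weights sum to 1.0. Assuming the YAML above has been loaded into a dict (for instance with PyYAML's `yaml.safe_load`), a quick check might look like this (the inline dict stands in for the loaded file):

```python
# Stands in for: config = yaml.safe_load(open("evaluation_config.yaml"))
config = {
    "scoring": {"weights": {"accuracy": 0.4, "cost_efficiency": 0.3,
                            "response_time": 0.2, "creativity": 0.1}},
    "evaluation": {"budget": 0.20, "quality_threshold": 0.7},
}

weights = config["scoring"]["weights"]
total = sum(weights.values())
# Use a tolerance: 0.4 + 0.3 + 0.2 + 0.1 is not exactly 1.0 in binary floats
assert abs(total - 1.0) < 1e-9, f"weights sum to {total}, expected 1.0"
print("config OK")
```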
#### **Pipeline Integration**

```bash
#!/bin/bash
# CI/CD pipeline example

# Quality gate evaluation
llm-eval evaluate \
  --model production-model \
  --test-cases qa_test_suite.json \
  --quality-gate 0.85 \
  --fail-on-low-score \
  --output qa_results.json

# If evaluation passes, deploy
if [ $? -eq 0 ]; then
    echo "✅ Quality gate passed - deploying model"
    ./deploy_model.sh
else
    echo "❌ Quality gate failed - blocking deployment"
    exit 1
fi
```
#### **Monitoring and Alerts**

```bash
# Continuous monitoring
llm-eval monitor \
  --model production-model \
  --baseline baseline_metrics.json \
  --alert-threshold 0.05 \
  --webhook https://alerts.company.com/llm-degradation \
  --interval 3600  # Check every hour
```
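Under the hood, a degradation monitor like this reduces to comparing fresh metrics against a baseline with a relative threshold. A sketch of that comparison (the function is illustrative, not part of the framework):

```python
def degraded(baseline: float, current: float, threshold: float = 0.05) -> bool:
    """True if `current` fell more than `threshold` (relative) below `baseline`."""
    if baseline <= 0:
        return False  # no meaningful baseline to compare against
    return (baseline - current) / baseline > threshold

# A 2.2% drop stays quiet; an 11% drop would fire the webhook
print(degraded(0.90, 0.88))  # → False
print(degraded(0.90, 0.80))  # → True
```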

๐Ÿ› ๏ธ Environment Configuration

### ๐Ÿ” **API Keys and Credentials** Set up authentication for different LLM providers:
# Create .env file in your project directory
touch .env

# Add your API keys (never commit this file!)
echo "OPENAI_API_KEY=your_openai_api_key_here" >> .env
echo "ANTHROPIC_API_KEY=your_anthropic_api_key_here" >> .env
echo "AZURE_OPENAI_API_KEY=your_azure_api_key_here" >> .env
echo "AZURE_OPENAI_ENDPOINT=your_azure_endpoint_here" >> .env

# Alternative: Set environment variables directly
export OPENAI_API_KEY="your_openai_api_key_here"
export ANTHROPIC_API_KEY="your_anthropic_api_key_here"
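If you prefer not to add a dotenv dependency, a minimal stdlib loader for simple `KEY=VALUE` files can stand in (a sketch only; real `.env` dialects also handle quoting and `export` prefixes):

```python
import os

def load_dotenv(path: str = ".env") -> None:
    """Load simple KEY=VALUE lines into os.environ (existing vars win)."""
    try:
        with open(path) as f:
            for line in f:
                line = line.strip()
                # Skip blanks, comments, and anything that isn't KEY=VALUE
                if not line or line.startswith("#") or "=" not in line:
                    continue
                key, _, value = line.partition("=")
                os.environ.setdefault(key.strip(), value.strip())
    except FileNotFoundError:
        pass  # no .env is fine; fall back to the real environment

load_dotenv()
print(os.getenv("OPENAI_API_KEY", "NOT_SET"))
```

Using `setdefault` means values already exported in the shell take precedence over the file, which is the convention most dotenv tools follow.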
### ⚙️ **Framework Configuration**

**Create `llm_eval_config.yaml`:**

```yaml
# Global framework configuration
framework:
  logging_level: INFO
  cache_enabled: true
  cache_ttl_hours: 24
  max_concurrent_evaluations: 5

# Default model settings
defaults:
  max_tokens: 2048
  temperature: 0.7
  timeout_seconds: 30
  retry_attempts: 3

# Cost management
cost_management:
  daily_budget_limit: 10.00  # USD
  per_request_limit: 0.10    # USD
  alert_threshold: 0.80      # 80% of budget

# Output preferences
output:
  default_format: json
  include_metadata: true
  save_intermediate_results: true
  results_directory: "./evaluation_results"
```
### 📁 **Project Structure**

Recommended project organization:

```text
my-llm-evaluation/
├── .env                          # API keys (git-ignored)
├── llm_eval_config.yaml          # Framework configuration
├── requirements.txt              # Python dependencies
├── evaluation_scripts/
│   ├── first_evaluation.py
│   ├── batch_evaluation.py
│   └── custom_evaluation.py
├── test_datasets/
│   ├── general_knowledge.json
│   ├── reasoning_tests.json
│   └── creativity_tests.json
├── results/
│   ├── 2024-01-15/
│   └── latest/
├── configs/
│   ├── evaluation_config.yaml
│   └── model_configs/
└── reports/
    ├── analysis_report.html
    └── summary_dashboard.html
```
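The tree above can be scaffolded in a few lines with `pathlib` (directory names copied from the layout; adjust to taste):

```python
from pathlib import Path

# Directories from the recommended layout above
DIRS = [
    "evaluation_scripts", "test_datasets", "results/latest",
    "configs/model_configs", "reports",
]

root = Path("my-llm-evaluation")
for d in DIRS:
    (root / d).mkdir(parents=True, exist_ok=True)  # safe to re-run
(root / ".env").touch()  # remember to git-ignore this file

print(sorted(p.name for p in root.iterdir()))
```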

## 🧪 Verification & Testing

### ✅ **Complete Installation Test**

Run this comprehensive test to verify your installation.

**Create `installation_test.py`:**
"""
๐Ÿงช LLM Evaluation Framework Installation Test
===========================================

Comprehensive test to verify all components are working correctly.
"""

import sys
import traceback
from datetime import datetime

def run_test(test_name, test_func):
    """Run a test and report results."""
    try:
        print(f"๐Ÿ”„ Running {test_name}...")
        test_func()
        print(f"โœ… {test_name} PASSED")
        return True
    except Exception as e:
        print(f"โŒ {test_name} FAILED: {str(e)}")
        if "--verbose" in sys.argv:
            traceback.print_exc()
        return False

def test_core_imports():
    """Test core framework imports."""
    from llm_evaluation_framework import (
        ModelRegistry,
        ModelInferenceEngine,
        AutoSuggestionEngine,
        TestDatasetGenerator
    )

def test_registry_functionality():
    """Test model registry functionality."""
    from llm_evaluation_framework import ModelRegistry

    registry = ModelRegistry()

    # Test model registration
    config = {
        "provider": "mock",
        "api_cost_input": 0.001,
        "api_cost_output": 0.002,
        "capabilities": ["reasoning"]
    }

    assert registry.register_model("test-model", config)
    assert "test-model" in registry.list_models()
    assert registry.get_model("test-model")["provider"] == "mock"

def test_dataset_generation():
    """Test dataset generation functionality."""
    from llm_evaluation_framework import TestDatasetGenerator

    generator = TestDatasetGenerator()
    requirements = {
        "domain": "general",
        "required_capabilities": ["reasoning"]
    }

    test_cases = generator.generate_test_cases(requirements, count=3)
    assert len(test_cases) == 3
    assert all("prompt" in case for case in test_cases)

def test_inference_engine():
    """Test inference engine functionality."""
    from llm_evaluation_framework import ModelRegistry, ModelInferenceEngine

    registry = ModelRegistry()
    engine = ModelInferenceEngine(registry)

    # Register mock model for testing
    config = {
        "provider": "mock",
        "api_cost_input": 0.0,
        "api_cost_output": 0.0,
        "capabilities": ["reasoning"]
    }
    registry.register_model("mock-test", config)

    # Test evaluation
    test_cases = [{
        "id": "test1",
        "prompt": "What is 2+2?",
        "expected_output": "4"
    }]

    requirements = {"budget": 1.0}
    results = engine.evaluate_model("mock-test", test_cases, requirements)

    assert "aggregate_metrics" in results
    assert "test_results" in results

def test_cli_availability():
    """Test CLI command availability."""
    import subprocess

    result = subprocess.run(
        ["llm-eval", "--version"],
        capture_output=True,
        text=True
    )
    assert result.returncode == 0

def test_persistence():
    """Test persistence functionality."""
    from llm_evaluation_framework.persistence import JSONStore
    import tempfile
    import os

    with tempfile.NamedTemporaryFile(suffix='.json', delete=False) as tmp:
        store = JSONStore(tmp.name)
        test_data = {"test": "data", "timestamp": datetime.now().isoformat()}

        store.save_evaluation_result(test_data)
        loaded_data = store.load_evaluation_result()

        assert loaded_data["test"] == "data"
        os.unlink(tmp.name)

def main():
    """Run all installation tests."""
    print("๐Ÿงช LLM Evaluation Framework Installation Test")
    print("=" * 50)
    print(f"๐Ÿ“… Test started at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    print()

    tests = [
        ("Core Imports", test_core_imports),
        ("Model Registry", test_registry_functionality),
        ("Dataset Generation", test_dataset_generation),
        ("Inference Engine", test_inference_engine),
        ("CLI Availability", test_cli_availability),
        ("Persistence Layer", test_persistence)
    ]

    passed = 0
    total = len(tests)

    for test_name, test_func in tests:
        if run_test(test_name, test_func):
            passed += 1
        print()

    print("๐Ÿ“‹ TEST SUMMARY")
    print("=" * 15)
    print(f"โœ… Passed: {passed}/{total}")
    print(f"โŒ Failed: {total - passed}/{total}")
    print(f"๐Ÿ“Š Success Rate: {(passed/total)*100:.1f}%")

    if passed == total:
        print("\n๐ŸŽ‰ ALL TESTS PASSED! Installation is complete and working correctly.")
        print("๐Ÿ’ก You're ready to start evaluating LLMs!")
    else:
        print("\nโš ๏ธ  Some tests failed. Please check the error messages above.")
        print("๐Ÿ”— Need help? Visit: https://github.com/isathish/LLMEvaluationFramework/issues")
        sys.exit(1)

if __name__ == "__main__":
    main()
**Run the test:**

```bash
python installation_test.py

# For detailed error information:
python installation_test.py --verbose
```
### 🚀 **Performance Benchmark**

Verify performance with the included benchmark:

```bash
# Run performance benchmark
llm-eval benchmark \
  --models mock-model \
  --test-cases 100 \
  --iterations 3 \
  --output benchmark_results.json

# Expected output:
# 📊 Average evaluation time: 0.15s per test case
# 🚀 Throughput: 6.67 evaluations per second
# 💾 Memory usage: 45.2 MB peak
```

## 🚨 Troubleshooting

### 🔧 **Common Issues and Solutions**

#### **🐛 Installation Issues**

**`ModuleNotFoundError: No module named 'llm_evaluation_framework'`**

**Problem**: Framework not properly installed or virtual environment not activated.

**Solutions**:

```bash
# 1. Check if virtual environment is activated
which python
# Should show path to virtual environment

# 2. Reinstall framework
pip uninstall llm-evaluation-framework
pip install llm-evaluation-framework

# 3. Check Python path
python -c "import sys; print('\n'.join(sys.path))"

# 4. Force reinstall with no cache
pip install --no-cache-dir --force-reinstall llm-evaluation-framework
```
**`Command 'llm-eval' not found`**

**Problem**: CLI not properly installed or not in PATH.

**Solutions**:

```bash
# 1. Check if pip installed scripts correctly
pip show -f llm-evaluation-framework | grep bin

# 2. Add pip bin directory to PATH
export PATH="$PATH:$(python -m site --user-base)/bin"

# 3. Use module syntax as alternative
python -m llm_evaluation_framework.cli --help

# 4. Reinstall with explicit user flag
pip install --user llm-evaluation-framework
```
**Permission denied errors**

**Problem**: Insufficient permissions for installation.

**Solutions**:

```bash
# 1. Use virtual environment (recommended)
python -m venv llm-eval-env
source llm-eval-env/bin/activate
pip install llm-evaluation-framework

# 2. Install for current user only
pip install --user llm-evaluation-framework

# 3. Fix permissions (macOS/Linux)
sudo chown -R $(whoami) ~/.local

# 4. Use conda environment
conda create -n llm-eval python=3.11
conda activate llm-eval
pip install llm-evaluation-framework
```
#### **🔑 Authentication Issues**

**API key authentication failures**

**Problem**: API keys not configured or invalid.

**Solutions**:

```bash
# 1. Verify environment variables are set
echo $OPENAI_API_KEY
echo $ANTHROPIC_API_KEY

# 2. Test API key validity
curl -H "Authorization: Bearer $OPENAI_API_KEY" \
     https://api.openai.com/v1/models

# 3. Check .env file is in correct location
cat .env  # Should show your API keys

# 4. Verify Python can read environment variables
python -c "import os; print(os.getenv('OPENAI_API_KEY', 'NOT_SET'))"
```
**Rate limiting or quota errors**

**Problem**: API rate limits exceeded or quota exhausted.

**Solutions**:

```bash
# 1. Check current usage (OpenAI)
llm-eval check-usage --provider openai

# 2. Reduce concurrent requests in config
# Edit llm_eval_config.yaml:
# max_concurrent_evaluations: 1

# 3. Add delays between requests
llm-eval evaluate --rate-limit 1.0  # 1 second between requests

# 4. Use mock models for testing
llm-eval evaluate --model mock-model --test-cases test.json
```
#### **💰 Cost Management Issues**

**Unexpected high API costs**

**Problem**: Evaluation costs higher than expected.

**Prevention & Solutions**:

```bash
# 1. Set strict budget limits
llm-eval evaluate --budget 0.10 --fail-on-exceed

# 2. Use cost estimation before evaluation
llm-eval estimate-cost \
  --model gpt-4 \
  --test-cases large_dataset.json

# 3. Use cheaper models for testing
llm-eval evaluate --model gpt-3.5-turbo  # Instead of gpt-4

# 4. Monitor costs in real-time
llm-eval evaluate --monitor-costs --cost-alert 0.05
```
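The arithmetic behind cost estimation is straightforward: average token counts times the per-1K-token rates. A back-of-envelope sketch (the rates are the example gpt-3.5-turbo values registered earlier, not live pricing, and the helper is illustrative):

```python
def estimate_cost(n_cases: int, in_tokens: int, out_tokens: int,
                  rate_in: float, rate_out: float) -> float:
    """Rough USD estimate: per-1K-token rates applied to average token counts."""
    per_case = (in_tokens / 1000) * rate_in + (out_tokens / 1000) * rate_out
    return n_cases * per_case

# 100 test cases, ~200 prompt and ~300 completion tokens each
cost = estimate_cost(100, 200, 300, rate_in=0.0015, rate_out=0.002)
print(f"${cost:.2f}")  # → $0.09
```

Running a rough estimate like this before pointing an evaluation at an expensive model is the cheapest budget guard there is.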
#### **๐Ÿš€ Performance Issues**
**Slow evaluation performance**

**Problem**: Evaluations take too long to complete.

**Solutions**:

```bash
# 1. Enable parallel processing
llm-eval evaluate --parallel --max-workers 4

# 2. Use batch processing
llm-eval evaluate --batch-size 10

# 3. Optimize the model configuration
# Reduce max_tokens in the model config
llm-eval update-model --name gpt-3.5-turbo --max-tokens 1000

# 4. Enable caching for repeated evaluations
llm-eval evaluate --enable-cache

# 5. Use faster models for initial testing
llm-eval evaluate --model claude-3-haiku  # Faster than claude-3-opus
```
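
Parallelism helps because evaluation calls are I/O-bound: the process mostly waits on the network. If you are scripting around the framework, the same idea is a few lines with the standard library (`evaluate_one` is a placeholder for whatever issues a single evaluation call; this is not the framework's `--parallel` implementation):

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_parallel(test_cases, evaluate_one, max_workers=4):
    """Evaluate independent test cases concurrently; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(evaluate_one, test_cases))
```

Keep `max_workers` at or below your provider's concurrency limit, otherwise the speedup just turns into rate-limit errors.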
#### **๐Ÿ“Š Result Issues**
**Inconsistent or unexpected results**

**Problem**: Evaluation results vary significantly between runs.

**Solutions**:

```bash
# 1. Set deterministic parameters
llm-eval evaluate --temperature 0.0 --seed 42

# 2. Increase the test case count for statistical significance
llm-eval generate --count 50  # Instead of 10

# 3. Run multiple evaluation rounds
llm-eval evaluate --rounds 3 --aggregate-method mean

# 4. Use more robust scoring methods
llm-eval evaluate --scorer semantic_similarity  # Instead of exact_match
```
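
Multi-round aggregation is easy to sanity-check yourself. A minimal sketch that reports the per-case mean and spread across rounds (a large standard deviation flags an unstable test case; this is an illustration, not the framework's `--aggregate-method` code):

```python
from statistics import mean, stdev

def aggregate_rounds(rounds):
    """Given per-round score lists, return per-case mean and stdev."""
    per_case = list(zip(*rounds))  # transpose: rounds x cases -> cases x rounds
    return [
        {"mean": mean(scores), "stdev": stdev(scores) if len(scores) > 1 else 0.0}
        for scores in per_case
    ]
```

Cases with a near-zero stdev are stable; cases with a large one are where you should look for ambiguous prompts or overly strict scoring.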
### ๐Ÿ” **Diagnostic Commands** Use these commands to diagnose issues:
# System information
llm-eval diagnose --system

# Framework configuration check
llm-eval diagnose --config

# Model registry validation
llm-eval diagnose --models

# Connectivity test
llm-eval diagnose --connectivity

# Permission check
llm-eval diagnose --permissions

# Generate comprehensive diagnostic report
llm-eval diagnose --all --output diagnostic_report.txt
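
The environment facts a bug report asks for can also be gathered directly with the standard library; a minimal sketch (the field names are illustrative, not the `diagnose` report schema):

```python
import platform
import sys

def system_report():
    """Collect the basic environment facts requested in bug reports."""
    return {
        "os": platform.system(),          # e.g. "Linux", "Darwin", "Windows"
        "os_version": platform.release(),
        "python": sys.version.split()[0], # e.g. "3.11.4"
        "machine": platform.machine(),    # e.g. "x86_64", "arm64"
    }
```

Paste the output alongside the diagnostic report when filing an issue.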
### ๐Ÿ“ž **Getting Help**
If you're still experiencing issues:
| Resource | Best For | Response Time |
|----------|----------|---------------|
| GitHub Issues | Bug reports, feature requests | 1-3 days |
| GitHub Discussions | Q&A, general help, ideas | Community-driven |
| Documentation | Detailed guides, API reference | Immediate |
| Stack Overflow | Technical questions | Community-driven |
**When reporting issues, include:**

- Output of `llm-eval diagnose --all`
- Your operating system and Python version
- The complete error message and stack trace
- Steps to reproduce the issue
- Your configuration files (with sensitive data removed)

๐ŸŽฏ Next Steps & Learning Path

### ๐Ÿ›ค๏ธ **Recommended Learning Path**
#### **Phase 1: Foundation** (Days 1-2)

- ✅ **Completed**: Installation and first evaluation
- 🎯 **Next**: [Core Concepts](core-concepts.md) - Understand framework architecture
- 📚 **Resource**: [API Reference](api-reference.md) - Bookmark for quick reference

#### **Phase 2: Practical Skills** (Days 3-5)

- 🎯 **Goal**: [Examples & Tutorials](examples.md) - Hands-on learning with real scenarios
- 🎯 **Goal**: [CLI Usage Guide](cli-usage.md) - Master command-line workflows
- 🎯 **Goal**: [Evaluation & Scoring](evaluation-and-scoring.md) - Advanced scoring strategies

#### **Phase 3: Advanced Usage** (Week 2)

- 🎯 **Goal**: [Advanced Usage](advanced-usage.md) - Performance optimization and production patterns
- 🎯 **Goal**: [Model Registry](model-registry.md) - Advanced model management
- 🎯 **Goal**: [Persistence Layer](persistence-layer.md) - Data management and storage

#### **Phase 4: Customization** (Week 3)

- 🎯 **Goal**: [Developer Guide](developer-guide.md) - Build custom components
- 🎯 **Goal**: [Auto-Suggestion Engine](auto-suggestion-engine.md) - Advanced recommendation systems
- 🎯 **Goal**: [Contributing](../contributing.md) - Contribute back to the project
### ๐ŸŽฏ **Immediate Next Actions**
**๐Ÿ“‹ Choose your immediate next step:**

๐Ÿง  I want to understand the concepts

Learn the framework architecture and design principles

โ†’ Read Core Concepts

๐Ÿ’ป I want to see more examples

Explore practical examples and real-world use cases

โ†’ Try Examples

โšก I want to optimize performance

Learn advanced patterns and production best practices

โ†’ Advanced Usage

๐Ÿ› ๏ธ I want to customize the framework

Build custom components and contribute to the project

โ†’ Developer Guide
### ๐Ÿš€ **Quick Wins to Try**
**๐ŸŽฏ Complete these in the next 30 minutes:** 1. **๐Ÿ”ง Model Comparison**
llm-eval evaluate --model gpt-3.5-turbo claude-3-haiku --test-cases 5 --output comparison.json
2. **๐Ÿ“Š Custom Scoring**
llm-eval score --predictions "Your model output" --references "Expected output" --metric bleu
3. **๐ŸŽจ Creative Dataset**
llm-eval generate --capability creativity --domain "science fiction" --count 10
4. **๐Ÿ“ˆ Results Analysis**
llm-eval analyze --evaluation-results comparison.json --format html
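
For the custom-scoring quick win, it helps to see what a scorer does under the hood. A toy lenient string scorer in pure Python (this is not the framework's built-in `bleu` or `semantic_similarity` metric, just an illustration of exact match softened into token overlap):

```python
import re

def normalized_match(prediction, reference):
    """1.0 on a case/punctuation-insensitive match, else a token-overlap F1."""
    def norm(s):
        # Lowercase, strip punctuation, split on whitespace
        return re.sub(r"[^\w\s]", "", s.lower()).split()
    pred, ref = norm(prediction), norm(reference)
    if pred == ref:
        return 1.0
    overlap = len(set(pred) & set(ref))
    if overlap == 0:
        return 0.0
    precision = overlap / len(set(pred))
    recall = overlap / len(set(ref))
    return 2 * precision * recall / (precision + recall)
```

Scorers like this sit between brittle exact match (fails on a stray comma) and heavyweight semantic similarity (needs an embedding model).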
### ๐Ÿ† **Achievements System** Track your progress as you learn:
- [ ] **๐Ÿš€ First Steps** - Complete installation and first evaluation - [ ] **๐ŸŽฏ Evaluator** - Run 10 different evaluations - [ ] **๐Ÿงช Dataset Creator** - Generate 5 different test datasets - [ ] **โšก CLI Master** - Use 10 different CLI commands - [ ] **๐Ÿ”ง Optimizer** - Improve evaluation performance by 50% - [ ] **๐Ÿ—๏ธ Builder** - Create a custom scoring strategy - [ ] **๐Ÿค Contributor** - Submit your first pull request - [ ] **๐Ÿ“š Teacher** - Help others in GitHub Discussions **Share your achievements on social media with #LLMEvalFramework!**

## ๐ŸŽ‰ Congratulations! You're Now Ready to Evaluate LLMs **You've successfully set up the LLM Evaluation Framework and run your first evaluation!** ### ๐ŸŒŸ **What You've Accomplished** โœ… **Installed** the framework in your environment โœ… **Configured** models and evaluation settings โœ… **Generated** synthetic test datasets โœ… **Executed** comprehensive model evaluations โœ… **Analyzed** results and got recommendations โœ… **Learned** CLI commands for automation ### ๐Ÿš€ **Ready for Advanced Topics?** [![Core Concepts](https://img.shields.io/badge/๐Ÿ“š_Core_Concepts-Understand_the_Framework-22c55e?style=for-the-badge)](core-concepts.md) [![Examples Hub](https://img.shields.io/badge/๐Ÿ’ก_Examples_Hub-Hands%20On%20Learning-3b82f6?style=for-the-badge)](examples.md) [![Advanced Usage](https://img.shields.io/badge/โšก_Advanced_Usage-Power_User_Guide-f59e0b?style=for-the-badge)](advanced-usage.md) [![Developer Guide](https://img.shields.io/badge/๐Ÿ› ๏ธ_Developer_Guide-Build_%26_Contribute-8b5cf6?style=for-the-badge)](developer-guide.md) --- ### ๐Ÿ’ฌ **Join Our Community** [![GitHub Discussions](https://img.shields.io/badge/๐Ÿ’ฌ_GitHub_Discussions-Join_the_Conversation-purple?style=for-the-badge)](https://github.com/isathish/LLMEvaluationFramework/discussions) [![Stack Overflow](https://img.shields.io/badge/โ“_Stack_Overflow-Get_Technical_Help-orange?style=for-the-badge)](https://stackoverflow.com/questions/tagged/llm-evaluation-framework) [![Contributing](https://img.shields.io/badge/๐Ÿค_Contributing-Make_a_Difference-green?style=for-the-badge)](../contributing.md) --- **โญ Found this helpful? [Star us on GitHub](https://github.com/isathish/LLMEvaluationFramework) to support the project!** *Happy evaluating! ๐Ÿš€*