# 👨‍💻 Developer Guide

## 🎯 Developer Journey Map
| 🚀 **Setup** | 🧪 **Testing** | 🔧 **Building** | 🤝 **Contributing** |
|--------------|----------------|-----------------|---------------------|
| Environment & tools | Quality & validation | Features & extensions | Collaboration & review |
## 🚀 Quick Development Setup

### ⚡ Lightning Setup (5 minutes)
```bash
# 1️⃣ Clone and enter directory
git clone https://github.com/isathish/LLMEvaluationFramework.git
cd LLMEvaluationFramework

# 2️⃣ Create isolated environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# 3️⃣ Install development dependencies
pip install -e ".[dev,docs,test]"

# 4️⃣ Verify installation
python -c "from llm_evaluation_framework import ModelRegistry; print('✅ Setup successful!')"

# 5️⃣ Run quick test
pytest tests/test_quick_setup.py -v
```
### 🛠️ Complete Development Environment
#### Prerequisites

- **Python 3.8+** (recommended: 3.11)
- **Git** for version control
- **Visual Studio Code** (recommended IDE)
- **Docker** (optional, for containerized development)

#### Development Dependencies
```bash
# Install all development tools
pip install -e ".[dev]"

# This includes:
# - pytest (testing framework)
# - pytest-cov (coverage reporting)
# - black (code formatting)
# - flake8 (linting)
# - mypy (type checking)
# - pre-commit (git hooks)
# - sphinx (documentation)
```
#### IDE Configuration

**Visual Studio Code Settings** (`.vscode/settings.json`):

```json
{
  "python.defaultInterpreterPath": "./venv/bin/python",
  "python.formatting.provider": "black",
  "python.linting.enabled": true,
  "python.linting.flake8Enabled": true,
  "python.linting.mypyEnabled": true,
  "python.testing.pytestEnabled": true,
  "python.testing.pytestArgs": [
    "tests",
    "--cov=llm_evaluation_framework",
    "--cov-report=html"
  ],
  "files.associations": {
    "*.md": "markdown"
  },
  "markdown.extension.toc.levels": "1..3"
}
```

#### Pre-commit Hooks Setup

With the dev extras installed, run `pre-commit install` once to register the git hooks; the hook configuration itself is shown in the Development Tools Configuration section of this guide.
## 🧪 Comprehensive Testing Guide

### 🎯 Testing Philosophy
Our testing strategy ensures reliability, maintainability, and confidence in every release:
| **Level** | **Coverage** | **Purpose** | **Tools** |
|-----------|--------------|-------------|-----------|
| **Unit Tests** | Individual functions/classes | Logic validation | pytest, unittest.mock |
| **Integration Tests** | Component interactions | Workflow validation | pytest, fixtures |
| **End-to-End Tests** | Complete workflows | User experience | pytest, real API calls |
| **Performance Tests** | Speed/memory usage | Optimization validation | pytest-benchmark |
### 🔬 Testing Commands
#### Basic Testing

```bash
# Run all tests
pytest

# Run with verbose output
pytest -v

# Run a specific test file
pytest tests/test_model_registry.py

# Run a specific test function
pytest tests/test_model_registry.py::test_register_model

# Run tests matching a pattern
pytest -k "test_model" -v
```

#### Coverage Analysis

```bash
# Generate coverage report
pytest --cov=llm_evaluation_framework

# Generate HTML coverage report
pytest --cov=llm_evaluation_framework --cov-report=html

# Enforce a coverage threshold
pytest --cov=llm_evaluation_framework --cov-fail-under=85

# Show missing lines
pytest --cov=llm_evaluation_framework --cov-report=term-missing
```

#### Performance Testing

Performance tests live in `tests/benchmarks/` and use `pytest-benchmark` (see the table above); run them with `pytest tests/benchmarks/`.

#### Advanced Testing Options

The markers defined in `pyproject.toml` let you select subsets of the suite, e.g. `pytest -m "not slow"` or `pytest -m integration`.
### 🧪 Writing Quality Tests

#### Test Categories and Conventions

| **Test Type** | **Naming** | **Location** | **Purpose** |
|---------------|------------|--------------|-------------|
| **Unit Tests** | `test_<name>` | `tests/unit/` | Individual function testing |
| **Integration Tests** | `test_<name>_integration` | `tests/integration/` | Component interaction testing |
| **E2E Tests** | `test_<name>_e2e` | `tests/e2e/` | Complete workflow testing |
| **Performance Tests** | `test_<name>_performance` | `tests/benchmarks/` | Performance validation |

#### Test Structure Template
"""
Test template following AAA pattern (Arrange, Act, Assert)
"""
import pytest
from unittest.mock import Mock, patch
from llm_evaluation_framework import ModelRegistry
class TestModelRegistry:
"""Test class for ModelRegistry functionality"""
def setup_method(self):
"""Setup before each test method"""
self.registry = ModelRegistry()
self.sample_config = {
"provider": "openai",
"api_cost_input": 0.001,
"api_cost_output": 0.002,
"capabilities": ["reasoning"]
}
def test_register_model_success(self):
"""Test successful model registration"""
# Arrange
model_name = "test-model"
# Act
result = self.registry.register_model(model_name, self.sample_config)
# Assert
assert result is True
assert model_name in self.registry._models
assert self.registry._models[model_name] == self.sample_config
def test_register_model_invalid_config(self):
"""Test model registration with invalid configuration"""
# Arrange
model_name = "test-model"
invalid_config = {"provider": "unknown"}
# Act & Assert
with pytest.raises(ValueError, match="Invalid provider"):
self.registry.register_model(model_name, invalid_config)
@pytest.mark.parametrize("provider,expected", [
("openai", True),
("anthropic", True),
("invalid", False)
])
def test_validate_provider(self, provider, expected):
"""Test provider validation with multiple inputs"""
# Arrange
config = self.sample_config.copy()
config["provider"] = provider
# Act
result = self.registry._validate_provider(config)
# Assert
assert result == expected
@patch('llm_evaluation_framework.model_registry.requests.get')
def test_validate_api_key(self, mock_get):
"""Test API key validation with mocked HTTP calls"""
# Arrange
mock_get.return_value.status_code = 200
# Act
result = self.registry._validate_api_key("test-key", "openai")
# Assert
assert result is True
mock_get.assert_called_once()
# Integration test example
class TestModelRegistryIntegration:
"""Integration tests for ModelRegistry with other components"""
def test_registry_with_inference_engine(self):
"""Test registry integration with inference engine"""
# Arrange
registry = ModelRegistry()
engine = ModelInferenceEngine(registry)
# Register model
registry.register_model("gpt-3.5-turbo", self.sample_config)
# Act
available_models = engine.get_available_models()
# Assert
assert "gpt-3.5-turbo" in available_models
# Fixture examples
@pytest.fixture
def sample_registry():
"""Fixture providing a pre-configured registry"""
registry = ModelRegistry()
registry.register_model("test-model", {
"provider": "openai",
"capabilities": ["reasoning"]
})
return registry
@pytest.fixture
def sample_test_cases():
"""Fixture providing sample test cases"""
return [
{
"id": "test_1",
"prompt": "What is 2+2?",
"expected_output": "4",
"metadata": {"difficulty": "easy"}
},
{
"id": "test_2",
"prompt": "Explain quantum computing",
"expected_output": "Quantum computing uses quantum mechanics...",
"metadata": {"difficulty": "hard"}
}
]
## 🔧 Building Custom Extensions

### 🏗️ Extension Architecture
The framework is designed with **extensibility** as a core principle. Every major component can be extended or replaced:
```mermaid
graph TB
    subgraph "Extension Points"
        A[Scoring Strategies]
        B[Persistence Backends]
        C[Model Providers]
        D[Test Generators]
        E[Evaluation Hooks]
    end

    subgraph "Base Interfaces"
        F[ScoringStrategy]
        G[PersistenceBackend]
        H[ModelProvider]
        I[TestGenerator]
        J[EvaluationHook]
    end

    A --> F
    B --> G
    C --> H
    D --> I
    E --> J

    subgraph "Your Extensions"
        K[CustomScorer]
        L[CloudStorage]
        M[CustomLLM]
        N[DomainGenerator]
        O[MetricsHook]
    end

    F --> K
    G --> L
    H --> M
    I --> N
    J --> O
```

### 🎯 Custom Scoring Strategy
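The concrete `ScoringStrategy` base class ships with the framework; what follows is a hedged sketch of the interface that custom scorers implement. The `calculate_score` signature is taken from the example in this guide; the abstract-base-class shape and the `ExactMatchScorer` demo are assumptions for illustration.

```python
# Assumed shape of the framework's scoring interface (sketch, not the real class).
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class ScoringStrategy(ABC):
    """Base interface that custom scorers subclass."""

    @abstractmethod
    def calculate_score(self, predictions: List[str],
                        references: List[str]) -> Dict[str, Any]:
        """Return an overall score plus any per-component detail."""


class ExactMatchScorer(ScoringStrategy):
    """Minimal concrete strategy: fraction of exact matches."""

    def calculate_score(self, predictions, references):
        matches = sum(p == r for p, r in zip(predictions, references))
        total = max(len(references), 1)  # avoid division by zero
        return {"overall_score": matches / total}
```

Any subclass that returns a dict with at least an `overall_score` key slots into the evaluation flow the same way the larger example below does.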
"""
Example: Building a domain-specific scoring strategy
"""
from llm_evaluation_framework.evaluation.scoring_strategies import ScoringStrategy
from typing import List, Dict, Any
import re
import nltk
from textstat import flesch_reading_ease
class ReadabilityScorer(ScoringStrategy):
"""
Scoring strategy focused on text readability and comprehension
Perfect for educational content evaluation
"""
def __init__(self, target_grade_level: int = 8):
"""
Initialize readability scorer
Args:
target_grade_level: Target reading grade level (1-12)
"""
self.target_grade_level = target_grade_level
self.weights = {
'readability': 0.4,
'clarity': 0.3,
'completeness': 0.3
}
def calculate_score(self, predictions: List[str], references: List[str]) -> Dict[str, Any]:
"""Calculate readability-focused scores"""
scores = {
'readability_scores': [],
'clarity_scores': [],
'completeness_scores': []
}
for pred, ref in zip(predictions, references):
# Readability analysis
readability = self._analyze_readability(pred)
clarity = self._analyze_clarity(pred)
completeness = self._analyze_completeness(pred, ref)
scores['readability_scores'].append(readability)
scores['clarity_scores'].append(clarity)
scores['completeness_scores'].append(completeness)
# Calculate component averages
avg_scores = {
'readability': sum(scores['readability_scores']) / len(scores['readability_scores']),
'clarity': sum(scores['clarity_scores']) / len(scores['clarity_scores']),
'completeness': sum(scores['completeness_scores']) / len(scores['completeness_scores'])
}
# Calculate weighted overall score
overall_score = sum(
avg_scores[component] * self.weights[component]
for component in avg_scores
)
return {
'overall_score': overall_score,
'component_scores': avg_scores,
'detailed_metrics': {
'target_grade_level': self.target_grade_level,
'weights_used': self.weights,
'individual_scores': scores
}
}
def _analyze_readability(self, text: str) -> float:
"""Analyze text readability using multiple metrics"""
# Flesch Reading Ease Score
flesch_score = flesch_reading_ease(text)
# Convert Flesch score to grade level approximation
if flesch_score >= 90:
grade_level = 5
elif flesch_score >= 80:
grade_level = 6
elif flesch_score >= 70:
grade_level = 7
elif flesch_score >= 60:
grade_level = 8
elif flesch_score >= 50:
grade_level = 9
elif flesch_score >= 30:
grade_level = 10
else:
grade_level = 12
# Score based on proximity to target grade level
grade_diff = abs(grade_level - self.target_grade_level)
readability_score = max(0, 1 - (grade_diff / 6)) # Normalize to 0-1
return readability_score
def _analyze_clarity(self, text: str) -> float:
"""Analyze text clarity using linguistic features"""
sentences = nltk.sent_tokenize(text)
words = nltk.word_tokenize(text)
# Average sentence length (optimal: 15-20 words)
avg_sentence_length = len(words) / len(sentences)
if 15 <= avg_sentence_length <= 20:
length_score = 1.0
else:
length_score = max(0, 1 - abs(avg_sentence_length - 17.5) / 17.5)
# Transition words and phrases
transitions = [
'however', 'therefore', 'furthermore', 'moreover',
'in addition', 'for example', 'in conclusion', 'as a result'
]
transition_count = sum(1 for transition in transitions
if transition in text.lower())
transition_score = min(1.0, transition_count / 3)
# Active vs passive voice (prefer active)
passive_indicators = ['was', 'were', 'been', 'being']
passive_count = sum(1 for indicator in passive_indicators
if indicator in text.lower())
active_score = max(0, 1 - (passive_count / len(words)) * 10)
# Combined clarity score
clarity_score = (length_score + transition_score + active_score) / 3
return clarity_score
def _analyze_completeness(self, prediction: str, reference: str) -> float:
"""Analyze completeness compared to reference"""
# Key concepts extraction (simplified)
pred_concepts = set(re.findall(r'\b\w{4,}\b', prediction.lower()))
ref_concepts = set(re.findall(r'\b\w{4,}\b', reference.lower()))
if not ref_concepts:
return 1.0
# Calculate concept coverage
covered_concepts = pred_concepts.intersection(ref_concepts)
coverage_score = len(covered_concepts) / len(ref_concepts)
return coverage_score
# Plugin registration system
class ScorerRegistry:
"""Registry for custom scoring strategies"""
_scorers = {}
@classmethod
def register(cls, name: str, scorer_class: type):
"""Register a custom scorer"""
cls._scorers[name] = scorer_class
@classmethod
def get_scorer(cls, name: str, **kwargs):
"""Get a registered scorer instance"""
if name not in cls._scorers:
raise ValueError(f"Scorer '{name}' not registered")
return cls._scorers[name](**kwargs)
@classmethod
def list_scorers(cls) -> List[str]:
"""List all registered scorers"""
return list(cls._scorers.keys())
# Register the custom scorer
ScorerRegistry.register('readability', ReadabilityScorer)
# Usage example
def use_custom_scorer():
"""Example of using custom scoring strategy"""
# Create custom scorer
readability_scorer = ScorerRegistry.get_scorer('readability', target_grade_level=8)
# Use in evaluation
from llm_evaluation_framework import ModelInferenceEngine, ModelRegistry
registry = ModelRegistry()
engine = ModelInferenceEngine(registry)
# Register model
registry.register_model("gpt-3.5-turbo", {
"provider": "openai",
"capabilities": ["reasoning"]
})
# Generate educational test cases
test_cases = [
{
"prompt": "Explain photosynthesis to an 8th grader",
"expected_output": "Photosynthesis is how plants make food using sunlight, water, and carbon dioxide."
}
]
# Evaluate with custom scorer
results = engine.evaluate_model(
model_name="gpt-3.5-turbo",
test_cases=test_cases,
scoring_strategy=readability_scorer
)
# Display results
print(f"Overall Readability Score: {results['scores']['overall_score']:.2f}")
print(f"Component Scores: {results['scores']['component_scores']}")
### 🗄️ Custom Persistence Backend
"""
Example: Building a custom cloud storage backend
"""
import json
import boto3
from typing import Dict, Any, List, Optional
from llm_evaluation_framework.persistence.base_store import BaseStore
from botocore.exceptions import ClientError
class S3PersistenceBackend(BaseStore):
"""
AWS S3 persistence backend for scalable evaluation storage
"""
def __init__(self, bucket_name: str, region: str = 'us-east-1',
key_prefix: str = 'llm-evaluations/'):
"""
Initialize S3 backend
Args:
bucket_name: S3 bucket name
region: AWS region
key_prefix: Prefix for all keys
"""
self.bucket_name = bucket_name
self.key_prefix = key_prefix
# Initialize S3 client
self.s3_client = boto3.client('s3', region_name=region)
# Verify bucket access
self._verify_bucket_access()
def _verify_bucket_access(self):
"""Verify bucket exists and is accessible"""
try:
self.s3_client.head_bucket(Bucket=self.bucket_name)
except ClientError as e:
if e.response['Error']['Code'] == '404':
raise ValueError(f"Bucket {self.bucket_name} does not exist")
else:
raise ValueError(f"Cannot access bucket {self.bucket_name}: {e}")
def save(self, key: str, data: Dict[str, Any]) -> bool:
"""Save evaluation results to S3"""
try:
full_key = f"{self.key_prefix}{key}.json"
# Add metadata for search/indexing
metadata = self._extract_metadata(data)
# Upload to S3
self.s3_client.put_object(
Bucket=self.bucket_name,
Key=full_key,
Body=json.dumps(data, default=str),
ContentType='application/json',
Metadata=metadata,
ServerSideEncryption='AES256' # Enable encryption
)
return True
except Exception as e:
print(f"S3 save failed: {e}")
return False
def load(self, key: str) -> Dict[str, Any]:
"""Load evaluation results from S3"""
try:
full_key = f"{self.key_prefix}{key}.json"
response = self.s3_client.get_object(
Bucket=self.bucket_name,
Key=full_key
)
return json.loads(response['Body'].read())
except ClientError as e:
if e.response['Error']['Code'] == 'NoSuchKey':
raise KeyError(f"Key not found: {key}")
raise
def query(self, filters: Dict[str, Any] = None) -> List[Dict[str, Any]]:
"""Query evaluations with optional filters"""
try:
# List objects with prefix
response = self.s3_client.list_objects_v2(
Bucket=self.bucket_name,
Prefix=self.key_prefix
)
if 'Contents' not in response:
return []
results = []
for obj in response['Contents']:
try:
# Get object metadata
head_response = self.s3_client.head_object(
Bucket=self.bucket_name,
Key=obj['Key']
)
metadata = head_response.get('Metadata', {})
# Apply filters if provided
if filters and not self._matches_filters(metadata, filters):
continue
# Load full object if it matches filters
key = obj['Key'].replace(self.key_prefix, '').replace('.json', '')
full_data = self.load(key)
results.append(full_data)
except Exception as e:
print(f"Error processing object {obj['Key']}: {e}")
continue
return results
except Exception as e:
print(f"S3 query failed: {e}")
return []
def delete(self, key: str) -> bool:
"""Delete evaluation from S3"""
try:
full_key = f"{self.key_prefix}{key}.json"
self.s3_client.delete_object(
Bucket=self.bucket_name,
Key=full_key
)
return True
except Exception as e:
print(f"S3 delete failed: {e}")
return False
def list_evaluations(self, limit: int = 100) -> List[Dict[str, str]]:
"""List available evaluations with metadata"""
try:
response = self.s3_client.list_objects_v2(
Bucket=self.bucket_name,
Prefix=self.key_prefix,
MaxKeys=limit
)
if 'Contents' not in response:
return []
evaluations = []
for obj in response['Contents']:
key = obj['Key'].replace(self.key_prefix, '').replace('.json', '')
evaluations.append({
'key': key,
'last_modified': obj['LastModified'].isoformat(),
'size': obj['Size']
})
return evaluations
except Exception as e:
print(f"S3 list failed: {e}")
return []
def _extract_metadata(self, data: Dict[str, Any]) -> Dict[str, str]:
"""Extract metadata for S3 object tagging"""
metadata = {}
if 'model_name' in data:
metadata['model-name'] = data['model_name']
if 'timestamp' in data:
metadata['timestamp'] = data['timestamp']
if 'aggregate_metrics' in data:
metrics = data['aggregate_metrics']
if 'accuracy' in metrics:
metadata['accuracy'] = str(round(metrics['accuracy'], 3))
if 'total_cost' in metrics:
metadata['total-cost'] = str(round(metrics['total_cost'], 4))
return metadata
def _matches_filters(self, metadata: Dict[str, str],
filters: Dict[str, Any]) -> bool:
"""Check if metadata matches filters"""
for filter_key, filter_value in filters.items():
if filter_key == 'model_name':
if metadata.get('model-name') != filter_value:
return False
elif filter_key == 'min_accuracy':
accuracy = float(metadata.get('accuracy', 0))
if accuracy < filter_value:
return False
# Add more filter conditions as needed
return True
# Usage example
def use_custom_persistence():
"""Example of using custom S3 persistence backend"""
# Initialize S3 backend
s3_backend = S3PersistenceBackend(
bucket_name='my-llm-evaluations',
region='us-west-2',
key_prefix='evaluations/production/'
)
# Use in persistence manager
from llm_evaluation_framework.persistence import PersistenceManager
persistence_manager = PersistenceManager({
's3': s3_backend,
'local': JsonStore('./local_backup/') # Local backup
})
# Use in evaluation engine
from llm_evaluation_framework import ModelInferenceEngine, ModelRegistry
registry = ModelRegistry()
engine = ModelInferenceEngine(registry, persistence_manager)
# Run evaluation - results automatically saved to S3
results = engine.evaluate_model("gpt-3.5-turbo", test_cases)
# Query stored evaluations
recent_evaluations = s3_backend.query({
'model_name': 'gpt-3.5-turbo',
'min_accuracy': 0.8
})
print(f"Found {len(recent_evaluations)} high-accuracy evaluations")
## 📏 Code Quality Standards

### 🎯 Quality Guidelines
We maintain **enterprise-grade code quality** through rigorous standards:

#### Code Style & Formatting

- **PEP 8**: Python Enhancement Proposal 8 compliance
- **Black**: Automatic code formatting (line length: 88 characters)
- **isort**: Import statement organization
- **Line Length**: Maximum 88 characters (Black standard)

#### Type Safety

- **100% Type Hints**: All functions, methods, and variables must have type annotations
- **mypy**: Static type checking with a strict configuration
- **Runtime Validation**: Input validation using type checking

#### Documentation Standards

- **Docstrings**: Google-style docstrings for all public APIs
- **Type Documentation**: Document complex types and data structures
- **Examples**: Include usage examples in docstrings
- **API Documentation**: Auto-generated from code using Sphinx

#### Testing Requirements

- **Coverage**: Minimum 85% test coverage
- **Test Types**: Unit, integration, and end-to-end tests
- **Documentation**: Tests serve as living documentation
- **Performance**: Include performance regression tests
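As a sketch of these documentation standards in practice (Google-style docstring, full type hints, an inline example), here is an illustrative function; `estimate_cost` is not a framework API, just a demonstration of the expected style:

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  input_rate: float, output_rate: float) -> float:
    """Estimate the API cost of a single model call.

    Args:
        input_tokens: Number of prompt tokens sent to the model.
        output_tokens: Number of completion tokens returned.
        input_rate: Cost per input token in USD.
        output_rate: Cost per output token in USD.

    Returns:
        Estimated cost in USD.

    Example:
        >>> estimate_cost(1000, 500, 0.001, 0.002)
        2.0
    """
    return input_tokens * input_rate + output_tokens * output_rate
```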
### 📋 Code Review Checklist
#### Before Submitting a PR

- [ ] **Tests Pass**: All tests pass locally
- [ ] **Coverage**: New code has appropriate test coverage
- [ ] **Type Checking**: mypy passes without errors
- [ ] **Linting**: flake8 passes without errors
- [ ] **Formatting**: Code formatted with Black
- [ ] **Documentation**: Public APIs documented
- [ ] **Examples**: Complex features include usage examples
- [ ] **Performance**: No significant performance regression

#### PR Description Requirements

- [ ] **Clear Title**: Descriptive PR title
- [ ] **Problem Description**: What issue does this solve?
- [ ] **Solution Overview**: How does this solve the issue?
- [ ] **Testing**: How was this tested?
- [ ] **Breaking Changes**: Any breaking changes noted
- [ ] **Documentation**: Documentation updates included
### 🔧 Development Tools Configuration
#### pyproject.toml Configuration
```toml
[tool.black]
line-length = 88
target-version = ['py38']
include = '\.pyi?$'
extend-exclude = '''
/(
  # directories
  \.eggs
  | \.git
  | \.hg
  | \.mypy_cache
  | \.tox
  | \.venv
  | build
  | dist
)/
'''

[tool.isort]
profile = "black"
multi_line_output = 3
line_length = 88
known_first_party = ["llm_evaluation_framework"]

[tool.mypy]
python_version = "3.8"
warn_return_any = true
warn_unused_configs = true
disallow_untyped_defs = true
disallow_incomplete_defs = true
check_untyped_defs = true
no_implicit_optional = true
warn_redundant_casts = true
warn_unused_ignores = true
show_error_codes = true

[tool.pytest.ini_options]
minversion = "6.0"
addopts = "-ra -q --strict-markers --cov=llm_evaluation_framework"
testpaths = ["tests"]
markers = [
    "slow: marks tests as slow (deselect with '-m \"not slow\"')",
    "integration: marks tests as integration tests",
    "e2e: marks tests as end-to-end tests",
]

[tool.coverage.run]
source = ["llm_evaluation_framework"]
omit = [
    "*/tests/*",
    "*/test_*",
    "*/conftest.py",
]

[tool.coverage.report]
exclude_lines = [
    "pragma: no cover",
    "def __repr__",
    "if self.debug:",
    "if settings.DEBUG",
    "raise AssertionError",
    "raise NotImplementedError",
    "if 0:",
    "if __name__ == .__main__.:",
    "class .*\\bProtocol\\):",
    "@(abc\\.)?abstractmethod",
]
```
#### Pre-commit Configuration (`.pre-commit-config.yaml`)

```yaml
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.4.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-yaml
      - id: check-added-large-files

  - repo: https://github.com/psf/black
    rev: 23.1.0
    hooks:
      - id: black
        language_version: python3

  - repo: https://github.com/pycqa/isort
    rev: 5.12.0
    hooks:
      - id: isort

  - repo: https://github.com/pycqa/flake8
    rev: 6.0.0
    hooks:
      - id: flake8

  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v1.0.1
    hooks:
      - id: mypy
        additional_dependencies: [types-all]
```
## 🤝 Contributing Workflow

### 🚀 Contribution Process
#### 1. Issue Discovery & Planning

```bash
# Check for existing issues:
# https://github.com/isathish/LLMEvaluationFramework/issues

# Create a new issue if needed, using the issue templates
# for bugs, features, or documentation.
```

#### 2. Development Setup

```bash
# Fork the repository on GitHub, then clone your fork
git clone https://github.com/YOUR_USERNAME/LLMEvaluationFramework.git
cd LLMEvaluationFramework

# Add the upstream remote
git remote add upstream https://github.com/isathish/LLMEvaluationFramework.git

# Create a feature branch
git checkout -b feature/your-feature-name
```

#### 3. Development Cycle

Implement your change, keep the test suite passing locally, and run the formatting, linting, and type-checking tools described in the quality standards above before each commit.

#### 4. Submission Process

Push your branch to your fork and open a pull request against the main repository, following the PR description requirements above.

#### 5. Review & Merge

- **Code Review**: Maintainers review code
- **CI Checks**: Automated tests must pass
- **Approval**: At least one maintainer approval required
- **Merge**: Squash and merge to main branch
### 🏷️ Commit Message Standards

We follow the **Conventional Commits** specification:

#### Commit Format

`<type>(<scope>): <description>`, with an optional body and footer after a blank line.

#### Commit Types

- **feat**: New feature
- **fix**: Bug fix
- **docs**: Documentation-only changes
- **style**: Changes that do not affect the meaning of the code
- **refactor**: Code change that neither fixes a bug nor adds a feature
- **perf**: Performance improvement
- **test**: Adding missing tests or correcting existing tests
- **chore**: Changes to the build process or auxiliary tools

#### Examples
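Some messages following this convention (the scopes are illustrative, not an exhaustive list for this repository):

```text
feat(scoring): add readability scoring strategy
fix(persistence): handle missing S3 keys in load()
docs: clarify development setup instructions
test(registry): add parametrized provider validation tests
chore: bump pre-commit hook versions
```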
## 📁 Project Structure & Architecture

### 🏗️ Directory Structure
```text
LLMEvaluationFramework/
├── 📁 llm_evaluation_framework/          # Main package
│   ├── 📄 __init__.py                    # Package initialization
│   ├── 📄 cli.py                         # Command-line interface
│   ├── 📄 model_registry.py              # Model management
│   ├── 📄 model_inference_engine.py      # Evaluation engine
│   ├── 📄 test_dataset_generator.py      # Test case generation
│   ├── 📄 auto_suggestion_engine.py      # AI recommendations
│   │
│   ├── 📁 core/                          # Core interfaces
│   │   ├── 📄 __init__.py
│   │   ├── 📄 base_engine.py             # Base engine interface
│   │   └── 📄 base_registry.py           # Base registry interface
│   │
│   ├── 📁 engines/                       # Evaluation engines
│   │   ├── 📄 __init__.py
│   │   └── 📄 async_inference_engine.py  # Async evaluation
│   │
│   ├── 📁 evaluation/                    # Scoring strategies
│   │   ├── 📄 __init__.py
│   │   └── 📄 scoring_strategies.py      # Evaluation metrics
│   │
│   ├── 📁 persistence/                   # Data storage
│   │   ├── 📄 __init__.py
│   │   ├── 📄 persistence_manager.py     # Storage coordination
│   │   ├── 📄 json_store.py              # JSON file storage
│   │   └── 📄 db_store.py                # Database storage
│   │
│   ├── 📁 registry/                      # Model registries
│   │   ├── 📄 __init__.py
│   │   └── 📄 model_registry.py          # Model configuration
│   │
│   ├── 📁 utils/                         # Utilities
│   │   ├── 📄 __init__.py
│   │   ├── 📄 error_handler.py           # Error management
│   │   └── 📄 logger.py                  # Logging utilities
│   │
│   └── 📁 cli/                           # CLI components
│       ├── 📄 __init__.py
│       └── 📄 main.py                    # CLI implementation
│
├── 📁 tests/                             # Test suite
│   ├── 📄 __init__.py
│   ├── 📄 conftest.py                    # Test configuration
│   ├── 📁 unit/                          # Unit tests
│   ├── 📁 integration/                   # Integration tests
│   ├── 📁 e2e/                           # End-to-end tests
│   └── 📁 benchmarks/                    # Performance tests
│
├── 📁 docs/                              # Documentation
│   ├── 📄 index.md                       # Main documentation
│   ├── 📁 categories/                    # Documentation categories
│   └── 📄 mkdocs.yml                     # Documentation config
│
├── 📁 examples/                          # Usage examples
│   ├── 📄 basic_usage.py                 # Simple examples
│   ├── 📄 advanced_async_usage.py        # Advanced patterns
│   └── 📄 custom_scoring_and_persistence.py
│
├── 📁 scripts/                           # Development scripts
│   ├── 📄 setup_dev.sh                   # Development setup
│   ├── 📄 run_tests.sh                   # Test execution
│   └── 📄 build_docs.sh                  # Documentation build
│
├── 📄 pyproject.toml                     # Project configuration
├── 📄 setup.py                           # Package setup
├── 📄 requirements.txt                   # Dependencies
├── 📄 requirements-dev.txt               # Development dependencies
├── 📄 README.md                          # Project overview
├── 📄 LICENSE                            # License file
├── 📄 CHANGELOG.md                       # Version history
└── 📄 CONTRIBUTING.md                    # Contribution guidelines
```
### 🎯 Component Architecture
```mermaid
graph TB
    subgraph "User Interface Layer"
        CLI[CLI Interface]
        API[Python API]
    end

    subgraph "Core Framework Layer"
        Registry[Model Registry]
        Engine[Inference Engine]
        Generator[Dataset Generator]
        Suggestions[Auto Suggestions]
    end

    subgraph "Evaluation Layer"
        Scoring[Scoring Strategies]
        Validation[Result Validation]
        Analytics[Analytics Engine]
    end

    subgraph "Persistence Layer"
        Manager[Persistence Manager]
        JSON[JSON Store]
        DB[Database Store]
        Custom[Custom Backends]
    end

    subgraph "Infrastructure Layer"
        Logging[Logging System]
        Errors[Error Handling]
        Config[Configuration]
        Monitor[Monitoring]
    end

    CLI --> Engine
    API --> Engine
    Engine --> Registry
    Engine --> Generator
    Engine --> Scoring
    Engine --> Manager
    Scoring --> Validation
    Manager --> JSON
    Manager --> DB
    Manager --> Custom
    Engine --> Logging
    Engine --> Errors
```

## 🚀 Release Process
### 📦 Versioning Strategy
We follow **Semantic Versioning (SemVer)**:

- **MAJOR.MINOR.PATCH** (e.g., 1.2.3)
- **MAJOR**: Breaking changes
- **MINOR**: New features (backward compatible)
- **PATCH**: Bug fixes (backward compatible)

#### Release Schedule

- **Major Releases**: Quarterly (breaking changes)
- **Minor Releases**: Monthly (new features)
- **Patch Releases**: As needed (critical bug fixes)

#### Release Process

1. **Planning**: Feature planning and roadmap review
2. **Development**: Feature implementation and testing
3. **Testing**: Comprehensive testing across environments
4. **Documentation**: Update documentation and examples
5. **Release**: Package build and distribution
6. **Announcement**: Release notes and community update
### 🎯 Getting Involved
#### Ways to Contribute

| **Level** | **Activities** | **Time Commitment** |
|-----------|----------------|---------------------|
| **Beginner** | Bug reports, documentation fixes, typos | 15-30 minutes |
| **Intermediate** | Feature requests, small features, tests | 1-3 hours |
| **Advanced** | Major features, performance improvements | 3+ hours |
| **Expert** | Architecture decisions, mentoring, releases | Ongoing |

#### Communication Channels

- **GitHub Issues**: Bug reports and feature requests
- **GitHub Discussions**: Q&A and community chat
- **Pull Requests**: Code contributions and reviews
- **Discord** (Coming Soon): Real-time community chat

#### Recognition

- **Contributors**: Listed in CONTRIBUTORS.md
- **Maintainers**: Core team with merge permissions
- **Champions**: Community leaders and advocates
## 🎉 Ready to Contribute?

**Join our community of developers building the future of LLM evaluation!**

- [Report a Bug](https://github.com/isathish/LLMEvaluationFramework/issues/new?template=bug_report.md)
- [Request a Feature](https://github.com/isathish/LLMEvaluationFramework/issues/new?template=feature_request.md)
- [Fork the Repository](https://github.com/isathish/LLMEvaluationFramework/fork)
- [Join the Discussions](https://github.com/isathish/LLMEvaluationFramework/discussions)

---

**Every contribution makes a difference! 🎉**

*Whether you're fixing a typo or building a major feature, your work helps thousands of developers build better AI systems.*