
๐Ÿ‘จโ€๐Ÿ’ป Developer Guide

![Developer Guide](https://img.shields.io/badge/Developer%20Guide-Build%20%26%20Contribute-6366f1?style=for-the-badge&logo=code&logoColor=white) **Complete guide for contributors, maintainers, and extension developers** *Join the community building the future of LLM evaluation*

## 🎯 Developer Journey Map

1. **🚀 Setup**: environment & tools
2. **🧪 Testing**: quality & validation
3. **🔧 Building**: features & extensions
4. **🤝 Contributing**: collaboration & review


## 🚀 Quick Development Setup

### ⚡ Lightning Setup (5 minutes)

```bash
# 1️⃣ Clone and enter the directory
git clone https://github.com/isathish/LLMEvaluationFramework.git
cd LLMEvaluationFramework

# 2️⃣ Create an isolated environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# 3️⃣ Install development dependencies
pip install -e ".[dev,docs,test]"

# 4️⃣ Verify the installation
python -c "from llm_evaluation_framework import ModelRegistry; print('✅ Setup successful!')"

# 5️⃣ Run a quick test
pytest tests/test_quick_setup.py -v
```

**🎉 You're ready to develop!**

๐Ÿ› ๏ธ Complete Development Environment

#### **Prerequisites**

- **Python 3.8+** (recommended: 3.11)
- **Git** for version control
- **Visual Studio Code** (recommended IDE)
- **Docker** (optional, for containerized development)

#### **Development Dependencies**
```bash
# Install all development tools
pip install -e ".[dev]"

# This includes:
# - pytest (testing framework)
# - pytest-cov (coverage reporting)
# - black (code formatting)
# - flake8 (linting)
# - mypy (type checking)
# - pre-commit (git hooks)
# - sphinx (documentation)
```
#### **IDE Configuration**

**Visual Studio Code Settings** (`.vscode/settings.json`):
```json
{
    "python.defaultInterpreterPath": "./venv/bin/python",
    "python.formatting.provider": "black",
    "python.linting.enabled": true,
    "python.linting.flake8Enabled": true,
    "python.linting.mypyEnabled": true,
    "python.testing.pytestEnabled": true,
    "python.testing.pytestArgs": [
        "tests",
        "--cov=llm_evaluation_framework",
        "--cov-report=html"
    ],
    "files.associations": {
        "*.md": "markdown"
    },
    "markdown.extension.toc.levels": "1..3"
}
```
#### **Pre-commit Hooks Setup**

```bash
# Install pre-commit hooks
pre-commit install

# Run hooks on all files (optional)
pre-commit run --all-files
```

## 🧪 Comprehensive Testing Guide

### 🎯 Testing Philosophy

Our testing strategy ensures reliability, maintainability, and confidence in every release:

| **Level** | **Coverage** | **Purpose** | **Tools** |
|-----------|--------------|-------------|-----------|
| **Unit Tests** | Individual functions/classes | Logic validation | pytest, unittest.mock |
| **Integration Tests** | Component interactions | Workflow validation | pytest, fixtures |
| **End-to-End Tests** | Complete workflows | User experience | pytest, real API calls |
| **Performance Tests** | Speed/memory usage | Optimization validation | pytest-benchmark |
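As a tiny illustration of the unit level, a collaborator can be replaced with `unittest.mock.Mock` so the test runs fast and offline. The `summarize` helper and its `client` here are hypothetical, not framework APIs:

```python
from unittest.mock import Mock

def summarize(client, prompt: str) -> str:
    """Hypothetical helper that delegates to an LLM client."""
    return client.complete(prompt).strip()

# Replace the real client with a Mock; no network call is made
client = Mock()
client.complete.return_value = "  42  "

assert summarize(client, "What is 6 * 7?") == "42"
client.complete.assert_called_once_with("What is 6 * 7?")
```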

### 🔬 Testing Commands

#### **Basic Testing**

```bash
# Run all tests
pytest

# Run with verbose output
pytest -v

# Run a specific test file
pytest tests/test_model_registry.py

# Run a specific test function
pytest tests/test_model_registry.py::test_register_model

# Run tests matching a pattern
pytest -k "test_model" -v
```
#### **Coverage Analysis**
```bash
# Generate a coverage report
pytest --cov=llm_evaluation_framework

# Generate an HTML coverage report
pytest --cov=llm_evaluation_framework --cov-report=html

# Fail if coverage drops below a threshold
pytest --cov=llm_evaluation_framework --cov-fail-under=85

# Show missing lines
pytest --cov=llm_evaluation_framework --cov-report=term-missing
```
#### **Performance Testing**
```bash
# Run performance benchmarks
pytest tests/benchmarks/ --benchmark-only

# Compare against saved results
pytest tests/benchmarks/ --benchmark-compare

# Save benchmark results as a baseline
pytest tests/benchmarks/ --benchmark-save=baseline
```
#### **Advanced Testing Options**
```bash
# Parallel testing (faster; requires the pytest-xdist plugin)
pytest -n auto

# Stop on first failure
pytest -x

# Run only the last failed tests
pytest --lf

# Show local variables on failures
pytest -l

# Disable warnings
pytest --disable-warnings
```

### 🧪 Writing Quality Tests

#### **Test Structure Template**
"""
Test template following AAA pattern (Arrange, Act, Assert)
"""
import pytest
from unittest.mock import Mock, patch
from llm_evaluation_framework import ModelRegistry

class TestModelRegistry:
    """Test class for ModelRegistry functionality"""

    def setup_method(self):
        """Setup before each test method"""
        self.registry = ModelRegistry()
        self.sample_config = {
            "provider": "openai",
            "api_cost_input": 0.001,
            "api_cost_output": 0.002,
            "capabilities": ["reasoning"]
        }

    def test_register_model_success(self):
        """Test successful model registration"""
        # Arrange
        model_name = "test-model"

        # Act
        result = self.registry.register_model(model_name, self.sample_config)

        # Assert
        assert result is True
        assert model_name in self.registry._models
        assert self.registry._models[model_name] == self.sample_config

    def test_register_model_invalid_config(self):
        """Test model registration with invalid configuration"""
        # Arrange
        model_name = "test-model"
        invalid_config = {"provider": "unknown"}

        # Act & Assert
        with pytest.raises(ValueError, match="Invalid provider"):
            self.registry.register_model(model_name, invalid_config)

    @pytest.mark.parametrize("provider,expected", [
        ("openai", True),
        ("anthropic", True),
        ("invalid", False)
    ])
    def test_validate_provider(self, provider, expected):
        """Test provider validation with multiple inputs"""
        # Arrange
        config = self.sample_config.copy()
        config["provider"] = provider

        # Act
        result = self.registry._validate_provider(config)

        # Assert
        assert result == expected

    @patch('llm_evaluation_framework.model_registry.requests.get')
    def test_validate_api_key(self, mock_get):
        """Test API key validation with mocked HTTP calls"""
        # Arrange
        mock_get.return_value.status_code = 200

        # Act
        result = self.registry._validate_api_key("test-key", "openai")

        # Assert
        assert result is True
        mock_get.assert_called_once()

# Integration test example
class TestModelRegistryIntegration:
    """Integration tests for ModelRegistry with other components"""

    def test_registry_with_inference_engine(self):
        """Test registry integration with inference engine"""
        # Arrange
        registry = ModelRegistry()
        engine = ModelInferenceEngine(registry)

        # Register model
        registry.register_model("gpt-3.5-turbo", self.sample_config)

        # Act
        available_models = engine.get_available_models()

        # Assert
        assert "gpt-3.5-turbo" in available_models

# Fixture examples
@pytest.fixture
def sample_registry():
    """Fixture providing a pre-configured registry"""
    registry = ModelRegistry()
    registry.register_model("test-model", {
        "provider": "openai",
        "capabilities": ["reasoning"]
    })
    return registry

@pytest.fixture
def sample_test_cases():
    """Fixture providing sample test cases"""
    return [
        {
            "id": "test_1",
            "prompt": "What is 2+2?",
            "expected_output": "4",
            "metadata": {"difficulty": "easy"}
        },
        {
            "id": "test_2", 
            "prompt": "Explain quantum computing",
            "expected_output": "Quantum computing uses quantum mechanics...",
            "metadata": {"difficulty": "hard"}
        }
    ]
#### **Test Categories and Conventions**

| **Test Type** | **Naming** | **Location** | **Purpose** |
|---------------|------------|--------------|-------------|
| **Unit Tests** | `test_*` | `tests/unit/` | Individual function testing |
| **Integration Tests** | `test_*_integration` | `tests/integration/` | Component interaction testing |
| **E2E Tests** | `test_*_e2e` | `tests/e2e/` | Complete workflow testing |
| **Performance Tests** | `test_*_performance` | `tests/benchmarks/` | Performance validation |

## 🔧 Building Custom Extensions

๐Ÿ—๏ธ Extension Architecture

The framework is designed with **extensibility** as a core principle. Every major component can be extended or replaced:
```mermaid
graph TB
    subgraph "Extension Points"
        A[Scoring Strategies]
        B[Persistence Backends]
        C[Model Providers]
        D[Test Generators]
        E[Evaluation Hooks]
    end

    subgraph "Base Interfaces"
        F[ScoringStrategy]
        G[PersistenceBackend]
        H[ModelProvider]
        I[TestGenerator]
        J[EvaluationHook]
    end

    A --> F
    B --> G
    C --> H
    D --> I
    E --> J

    subgraph "Your Extensions"
        K[CustomScorer]
        L[CloudStorage]
        M[CustomLLM]
        N[DomainGenerator]
        O[MetricsHook]
    end

    F --> K
    G --> L
    H --> M
    I --> N
    J --> O
```
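In its simplest form, extending the framework means subclassing a base interface and overriding one method. The sketch below uses a stand-in abstract base that only mimics the shape of `ScoringStrategy` (the real one lives in `llm_evaluation_framework.evaluation.scoring_strategies`), with a toy exact-match scorer:

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List

class ScoringStrategy(ABC):
    """Stand-in for the framework's scoring interface (illustrative only)."""

    @abstractmethod
    def calculate_score(self, predictions: List[str],
                        references: List[str]) -> Dict[str, Any]:
        ...

class ExactMatchScorer(ScoringStrategy):
    """Toy extension: fraction of predictions that equal their reference."""

    def calculate_score(self, predictions, references):
        matches = sum(p.strip() == r.strip()
                      for p, r in zip(predictions, references))
        return {"overall_score": matches / max(len(predictions), 1)}

scorer = ExactMatchScorer()
print(scorer.calculate_score(["4", "Paris"], ["4", "London"]))  # → {'overall_score': 0.5}
```

The full, domain-specific version of this pattern follows.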

### 🎯 Custom Scoring Strategy

"""
Example: Building a domain-specific scoring strategy
"""
from llm_evaluation_framework.evaluation.scoring_strategies import ScoringStrategy
from typing import List, Dict, Any
import re
import nltk
from textstat import flesch_reading_ease

class ReadabilityScorer(ScoringStrategy):
    """
    Scoring strategy focused on text readability and comprehension
    Perfect for educational content evaluation
    """

    def __init__(self, target_grade_level: int = 8):
        """
        Initialize readability scorer

        Args:
            target_grade_level: Target reading grade level (1-12)
        """
        self.target_grade_level = target_grade_level
        self.weights = {
            'readability': 0.4,
            'clarity': 0.3,
            'completeness': 0.3
        }

    def calculate_score(self, predictions: List[str], references: List[str]) -> Dict[str, Any]:
        """Calculate readability-focused scores"""

        scores = {
            'readability_scores': [],
            'clarity_scores': [],
            'completeness_scores': []
        }

        for pred, ref in zip(predictions, references):
            # Readability analysis
            readability = self._analyze_readability(pred)
            clarity = self._analyze_clarity(pred)
            completeness = self._analyze_completeness(pred, ref)

            scores['readability_scores'].append(readability)
            scores['clarity_scores'].append(clarity)
            scores['completeness_scores'].append(completeness)

        # Calculate component averages
        avg_scores = {
            'readability': sum(scores['readability_scores']) / len(scores['readability_scores']),
            'clarity': sum(scores['clarity_scores']) / len(scores['clarity_scores']),
            'completeness': sum(scores['completeness_scores']) / len(scores['completeness_scores'])
        }

        # Calculate weighted overall score
        overall_score = sum(
            avg_scores[component] * self.weights[component]
            for component in avg_scores
        )

        return {
            'overall_score': overall_score,
            'component_scores': avg_scores,
            'detailed_metrics': {
                'target_grade_level': self.target_grade_level,
                'weights_used': self.weights,
                'individual_scores': scores
            }
        }

    def _analyze_readability(self, text: str) -> float:
        """Analyze text readability using multiple metrics"""

        # Flesch Reading Ease Score
        flesch_score = flesch_reading_ease(text)

        # Convert Flesch score to grade level approximation
        if flesch_score >= 90:
            grade_level = 5
        elif flesch_score >= 80:
            grade_level = 6
        elif flesch_score >= 70:
            grade_level = 7
        elif flesch_score >= 60:
            grade_level = 8
        elif flesch_score >= 50:
            grade_level = 9
        elif flesch_score >= 30:
            grade_level = 10
        else:
            grade_level = 12

        # Score based on proximity to target grade level
        grade_diff = abs(grade_level - self.target_grade_level)
        readability_score = max(0, 1 - (grade_diff / 6))  # Normalize to 0-1

        return readability_score

    def _analyze_clarity(self, text: str) -> float:
        """Analyze text clarity using linguistic features"""

        sentences = nltk.sent_tokenize(text)
        words = nltk.word_tokenize(text)

        # Average sentence length (optimal: 15-20 words)
        avg_sentence_length = len(words) / len(sentences)
        if 15 <= avg_sentence_length <= 20:
            length_score = 1.0
        else:
            length_score = max(0, 1 - abs(avg_sentence_length - 17.5) / 17.5)

        # Transition words and phrases
        transitions = [
            'however', 'therefore', 'furthermore', 'moreover',
            'in addition', 'for example', 'in conclusion', 'as a result'
        ]
        transition_count = sum(1 for transition in transitions 
                             if transition in text.lower())
        transition_score = min(1.0, transition_count / 3)

        # Active vs passive voice (prefer active)
        passive_indicators = ['was', 'were', 'been', 'being']
        passive_count = sum(1 for indicator in passive_indicators 
                          if indicator in text.lower())
        active_score = max(0, 1 - (passive_count / len(words)) * 10)

        # Combined clarity score
        clarity_score = (length_score + transition_score + active_score) / 3

        return clarity_score

    def _analyze_completeness(self, prediction: str, reference: str) -> float:
        """Analyze completeness compared to reference"""

        # Key concepts extraction (simplified)
        pred_concepts = set(re.findall(r'\b\w{4,}\b', prediction.lower()))
        ref_concepts = set(re.findall(r'\b\w{4,}\b', reference.lower()))

        if not ref_concepts:
            return 1.0

        # Calculate concept coverage
        covered_concepts = pred_concepts.intersection(ref_concepts)
        coverage_score = len(covered_concepts) / len(ref_concepts)

        return coverage_score

# Plugin registration system
class ScorerRegistry:
    """Registry for custom scoring strategies"""

    _scorers = {}

    @classmethod
    def register(cls, name: str, scorer_class: type):
        """Register a custom scorer"""
        cls._scorers[name] = scorer_class

    @classmethod
    def get_scorer(cls, name: str, **kwargs):
        """Get a registered scorer instance"""
        if name not in cls._scorers:
            raise ValueError(f"Scorer '{name}' not registered")
        return cls._scorers[name](**kwargs)

    @classmethod
    def list_scorers(cls) -> List[str]:
        """List all registered scorers"""
        return list(cls._scorers.keys())

# Register the custom scorer
ScorerRegistry.register('readability', ReadabilityScorer)

# Usage example
def use_custom_scorer():
    """Example of using custom scoring strategy"""

    # Create custom scorer
    readability_scorer = ScorerRegistry.get_scorer('readability', target_grade_level=8)

    # Use in evaluation
    from llm_evaluation_framework import ModelInferenceEngine, ModelRegistry

    registry = ModelRegistry()
    engine = ModelInferenceEngine(registry)

    # Register model
    registry.register_model("gpt-3.5-turbo", {
        "provider": "openai",
        "capabilities": ["reasoning"]
    })

    # Generate educational test cases
    test_cases = [
        {
            "prompt": "Explain photosynthesis to an 8th grader",
            "expected_output": "Photosynthesis is how plants make food using sunlight, water, and carbon dioxide."
        }
    ]

    # Evaluate with custom scorer
    results = engine.evaluate_model(
        model_name="gpt-3.5-turbo",
        test_cases=test_cases,
        scoring_strategy=readability_scorer
    )

    # Display results
    print(f"Overall Readability Score: {results['scores']['overall_score']:.2f}")
    print(f"Component Scores: {results['scores']['component_scores']}")

๐Ÿ—„๏ธ Custom Persistence Backend

"""
Example: Building a custom cloud storage backend
"""
import json
import boto3
from typing import Dict, Any, List, Optional
from llm_evaluation_framework.persistence.base_store import BaseStore
from botocore.exceptions import ClientError

class S3PersistenceBackend(BaseStore):
    """
    AWS S3 persistence backend for scalable evaluation storage
    """

    def __init__(self, bucket_name: str, region: str = 'us-east-1', 
                 key_prefix: str = 'llm-evaluations/'):
        """
        Initialize S3 backend

        Args:
            bucket_name: S3 bucket name
            region: AWS region
            key_prefix: Prefix for all keys
        """
        self.bucket_name = bucket_name
        self.key_prefix = key_prefix

        # Initialize S3 client
        self.s3_client = boto3.client('s3', region_name=region)

        # Verify bucket access
        self._verify_bucket_access()

    def _verify_bucket_access(self):
        """Verify bucket exists and is accessible"""
        try:
            self.s3_client.head_bucket(Bucket=self.bucket_name)
        except ClientError as e:
            if e.response['Error']['Code'] == '404':
                raise ValueError(f"Bucket {self.bucket_name} does not exist")
            else:
                raise ValueError(f"Cannot access bucket {self.bucket_name}: {e}")

    def save(self, key: str, data: Dict[str, Any]) -> bool:
        """Save evaluation results to S3"""
        try:
            full_key = f"{self.key_prefix}{key}.json"

            # Add metadata for search/indexing
            metadata = self._extract_metadata(data)

            # Upload to S3
            self.s3_client.put_object(
                Bucket=self.bucket_name,
                Key=full_key,
                Body=json.dumps(data, default=str),
                ContentType='application/json',
                Metadata=metadata,
                ServerSideEncryption='AES256'  # Enable encryption
            )

            return True

        except Exception as e:
            print(f"S3 save failed: {e}")
            return False

    def load(self, key: str) -> Dict[str, Any]:
        """Load evaluation results from S3"""
        try:
            full_key = f"{self.key_prefix}{key}.json"

            response = self.s3_client.get_object(
                Bucket=self.bucket_name,
                Key=full_key
            )

            return json.loads(response['Body'].read())

        except ClientError as e:
            if e.response['Error']['Code'] == 'NoSuchKey':
                raise KeyError(f"Key not found: {key}")
            raise

    def query(self, filters: Dict[str, Any] = None) -> List[Dict[str, Any]]:
        """Query evaluations with optional filters"""
        try:
            # List objects with prefix
            response = self.s3_client.list_objects_v2(
                Bucket=self.bucket_name,
                Prefix=self.key_prefix
            )

            if 'Contents' not in response:
                return []

            results = []
            for obj in response['Contents']:
                try:
                    # Get object metadata
                    head_response = self.s3_client.head_object(
                        Bucket=self.bucket_name,
                        Key=obj['Key']
                    )

                    metadata = head_response.get('Metadata', {})

                    # Apply filters if provided
                    if filters and not self._matches_filters(metadata, filters):
                        continue

                    # Load full object if it matches filters
                    key = obj['Key'].replace(self.key_prefix, '').replace('.json', '')
                    full_data = self.load(key)
                    results.append(full_data)

                except Exception as e:
                    print(f"Error processing object {obj['Key']}: {e}")
                    continue

            return results

        except Exception as e:
            print(f"S3 query failed: {e}")
            return []

    def delete(self, key: str) -> bool:
        """Delete evaluation from S3"""
        try:
            full_key = f"{self.key_prefix}{key}.json"

            self.s3_client.delete_object(
                Bucket=self.bucket_name,
                Key=full_key
            )

            return True

        except Exception as e:
            print(f"S3 delete failed: {e}")
            return False

    def list_evaluations(self, limit: int = 100) -> List[Dict[str, str]]:
        """List available evaluations with metadata"""
        try:
            response = self.s3_client.list_objects_v2(
                Bucket=self.bucket_name,
                Prefix=self.key_prefix,
                MaxKeys=limit
            )

            if 'Contents' not in response:
                return []

            evaluations = []
            for obj in response['Contents']:
                key = obj['Key'].replace(self.key_prefix, '').replace('.json', '')
                evaluations.append({
                    'key': key,
                    'last_modified': obj['LastModified'].isoformat(),
                    'size': obj['Size']
                })

            return evaluations

        except Exception as e:
            print(f"S3 list failed: {e}")
            return []

    def _extract_metadata(self, data: Dict[str, Any]) -> Dict[str, str]:
        """Extract metadata for S3 object tagging"""
        metadata = {}

        if 'model_name' in data:
            metadata['model-name'] = data['model_name']

        if 'timestamp' in data:
            metadata['timestamp'] = data['timestamp']

        if 'aggregate_metrics' in data:
            metrics = data['aggregate_metrics']
            if 'accuracy' in metrics:
                metadata['accuracy'] = str(round(metrics['accuracy'], 3))
            if 'total_cost' in metrics:
                metadata['total-cost'] = str(round(metrics['total_cost'], 4))

        return metadata

    def _matches_filters(self, metadata: Dict[str, str], 
                        filters: Dict[str, Any]) -> bool:
        """Check if metadata matches filters"""
        for filter_key, filter_value in filters.items():
            if filter_key == 'model_name':
                if metadata.get('model-name') != filter_value:
                    return False
            elif filter_key == 'min_accuracy':
                accuracy = float(metadata.get('accuracy', 0))
                if accuracy < filter_value:
                    return False
            # Add more filter conditions as needed

        return True

# Usage example
def use_custom_persistence():
    """Example of using custom S3 persistence backend"""

    # Initialize S3 backend
    s3_backend = S3PersistenceBackend(
        bucket_name='my-llm-evaluations',
        region='us-west-2',
        key_prefix='evaluations/production/'
    )

    # Use in persistence manager
    from llm_evaluation_framework.persistence import PersistenceManager

    persistence_manager = PersistenceManager({
        's3': s3_backend,
        'local': JsonStore('./local_backup/')  # Local backup
    })

    # Use in evaluation engine
    from llm_evaluation_framework import ModelInferenceEngine, ModelRegistry

    registry = ModelRegistry()
    engine = ModelInferenceEngine(registry, persistence_manager)

    # Run evaluation - results automatically saved to S3
    results = engine.evaluate_model("gpt-3.5-turbo", test_cases)

    # Query stored evaluations
    recent_evaluations = s3_backend.query({
        'model_name': 'gpt-3.5-turbo',
        'min_accuracy': 0.8
    })

    print(f"Found {len(recent_evaluations)} high-accuracy evaluations")

๐Ÿ“ Code Quality Standards

### 🎯 Quality Guidelines

We maintain **enterprise-grade code quality** through rigorous standards:

#### **Code Style & Formatting**

- **PEP 8**: Python Enhancement Proposal 8 compliance
- **Black**: Automatic code formatting (line length: 88 characters)
- **isort**: Import statement organization
- **Line Length**: Maximum 88 characters (Black standard)

#### **Type Safety**

- **100% Type Hints**: All functions, methods, and variables must have type annotations
- **mypy**: Static type checking with strict configuration
- **Runtime Validation**: Input validation using type checking

#### **Documentation Standards**

- **Docstrings**: Google-style docstrings for all public APIs
- **Type Documentation**: Document complex types and data structures
- **Examples**: Include usage examples in docstrings
- **API Documentation**: Auto-generated from code using Sphinx

#### **Testing Requirements**

- **Coverage**: Minimum 85% test coverage
- **Test Types**: Unit, integration, and end-to-end tests
- **Documentation**: Tests serve as living documentation
- **Performance**: Include performance regression tests
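Putting the docstring and type-hint rules together, a public function looks roughly like this. The `estimate_cost` function is a made-up illustration, not a framework API:

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  rate_in: float = 0.001, rate_out: float = 0.002) -> float:
    """Estimate the API cost of a single model call.

    Args:
        input_tokens: Number of tokens in the prompt.
        output_tokens: Number of tokens in the completion.
        rate_in: Cost per input token in USD.
        rate_out: Cost per output token in USD.

    Returns:
        Estimated cost in USD.

    Example:
        >>> estimate_cost(1000, 500)
        2.0
    """
    return input_tokens * rate_in + output_tokens * rate_out
```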

### 📋 Code Review Checklist

#### **Before Submitting PR**

- [ ] **Tests Pass**: All tests pass locally
- [ ] **Coverage**: New code has appropriate test coverage
- [ ] **Type Checking**: mypy passes without errors
- [ ] **Linting**: flake8 passes without errors
- [ ] **Formatting**: Code formatted with Black
- [ ] **Documentation**: Public APIs documented
- [ ] **Examples**: Complex features include usage examples
- [ ] **Performance**: No significant performance regression

#### **PR Description Requirements**

- [ ] **Clear Title**: Descriptive PR title
- [ ] **Problem Description**: What issue does this solve?
- [ ] **Solution Overview**: How does this solve the issue?
- [ ] **Testing**: How was this tested?
- [ ] **Breaking Changes**: Any breaking changes noted
- [ ] **Documentation**: Documentation updates included

### 🔧 Development Tools Configuration

#### **pyproject.toml Configuration**
```toml
[tool.black]
line-length = 88
target-version = ['py38']
include = '\.pyi?$'
extend-exclude = '''
/(
  # directories
  \.eggs
  | \.git
  | \.hg
  | \.mypy_cache
  | \.tox
  | \.venv
  | build
  | dist
)/
'''

[tool.isort]
profile = "black"
multi_line_output = 3
line_length = 88
known_first_party = ["llm_evaluation_framework"]

[tool.mypy]
python_version = "3.8"
warn_return_any = true
warn_unused_configs = true
disallow_untyped_defs = true
disallow_incomplete_defs = true
check_untyped_defs = true
no_implicit_optional = true
warn_redundant_casts = true
warn_unused_ignores = true
show_error_codes = true

[tool.pytest.ini_options]
minversion = "6.0"
addopts = "-ra -q --strict-markers --cov=llm_evaluation_framework"
testpaths = ["tests"]
markers = [
    "slow: marks tests as slow (deselect with '-m \"not slow\"')",
    "integration: marks tests as integration tests",
    "e2e: marks tests as end-to-end tests",
]

[tool.coverage.run]
source = ["llm_evaluation_framework"]
omit = [
    "*/tests/*",
    "*/test_*",
    "*/conftest.py",
]

[tool.coverage.report]
exclude_lines = [
    "pragma: no cover",
    "def __repr__",
    "if self.debug:",
    "if settings.DEBUG",
    "raise AssertionError",
    "raise NotImplementedError",
    "if 0:",
    "if __name__ == .__main__.:",
    # Literal strings avoid TOML escape errors in the regexes below
    'class .*\bProtocol\):',
    '@(abc\.)?abstractmethod',
]
```
#### **Pre-commit Configuration** (`.pre-commit-config.yaml`)

```yaml
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.4.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-yaml
      - id: check-added-large-files

  - repo: https://github.com/psf/black
    rev: 23.1.0
    hooks:
      - id: black
        language_version: python3

  - repo: https://github.com/pycqa/isort
    rev: 5.12.0
    hooks:
      - id: isort

  - repo: https://github.com/pycqa/flake8
    rev: 6.0.0
    hooks:
      - id: flake8

  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v1.0.1
    hooks:
      - id: mypy
        additional_dependencies: [types-all]
```

๐Ÿค Contributing Workflow

### 🔄 Contribution Process

#### **1. Issue Discovery & Planning**

```bash
# Check for existing issues:
# https://github.com/isathish/LLMEvaluationFramework/issues

# Create a new issue if needed
# Use issue templates for bugs, features, or documentation
```
#### **2. Development Setup**

```bash
# Fork the repository on GitHub, then clone your fork
git clone https://github.com/YOUR_USERNAME/LLMEvaluationFramework.git
cd LLMEvaluationFramework

# Add the upstream remote
git remote add upstream https://github.com/isathish/LLMEvaluationFramework.git

# Create a feature branch
git checkout -b feature/your-feature-name
```
#### **3. Development Cycle**

```bash
# Make changes, write tests, update documentation

# Check code quality
black .
isort .
flake8
mypy .

# Run tests
pytest --cov=llm_evaluation_framework

# Commit changes
git add .
git commit -m "feat: add new feature description"
```
#### **4. Submission Process**

```bash
# Sync with upstream
git fetch upstream
git rebase upstream/main

# Push to your fork
git push origin feature/your-feature-name

# Create a Pull Request on GitHub and fill out the PR template completely
```
#### **5. Review & Merge**

- **Code Review**: Maintainers review code
- **CI Checks**: Automated tests must pass
- **Approval**: At least one maintainer approval required
- **Merge**: Squash and merge to main branch

๐Ÿท๏ธ Commit Message Standards

We follow the **Conventional Commits** specification.

#### **Commit Format**

```text
<type>[optional scope]: <description>

[optional body]

[optional footer(s)]
```

#### **Commit Types**

- **feat**: New feature
- **fix**: Bug fix
- **docs**: Documentation-only changes
- **style**: Changes that do not affect the meaning of the code
- **refactor**: Code change that neither fixes a bug nor adds a feature
- **perf**: Performance improvement
- **test**: Adding missing tests or correcting existing tests
- **chore**: Changes to the build process or auxiliary tools

#### **Examples**

```text
feat(registry): add model validation with capability checking

fix(engine): resolve timeout issue in async evaluation

docs(api): update ModelRegistry docstrings with examples

test(scoring): add comprehensive tests for F1 scoring strategy

refactor(persistence): simplify database connection management
```
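A quick way to sanity-check a commit header against this convention is a small regex. This is a sketch, not the project's actual CI tooling; the type list mirrors the one above and the scope is optional:

```python
import re

# Matches "<type>(optional-scope): <description>"
COMMIT_RE = re.compile(
    r"^(feat|fix|docs|style|refactor|perf|test|chore)"
    r"(\([\w-]+\))?: .+"
)

assert COMMIT_RE.match("feat(registry): add model validation with capability checking")
assert COMMIT_RE.match("fix: resolve timeout issue")
assert not COMMIT_RE.match("updated some files")
```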

## 📊 Project Structure & Architecture

๐Ÿ—๏ธ Directory Structure

```text
LLMEvaluationFramework/
├── 📁 llm_evaluation_framework/          # Main package
│   ├── 📄 __init__.py                    # Package initialization
│   ├── 📄 cli.py                         # Command-line interface
│   ├── 📄 model_registry.py              # Model management
│   ├── 📄 model_inference_engine.py      # Evaluation engine
│   ├── 📄 test_dataset_generator.py      # Test case generation
│   ├── 📄 auto_suggestion_engine.py      # AI recommendations
│   │
│   ├── 📁 core/                          # Core interfaces
│   │   ├── 📄 __init__.py
│   │   ├── 📄 base_engine.py             # Base engine interface
│   │   └── 📄 base_registry.py           # Base registry interface
│   │
│   ├── 📁 engines/                       # Evaluation engines
│   │   ├── 📄 __init__.py
│   │   └── 📄 async_inference_engine.py  # Async evaluation
│   │
│   ├── 📁 evaluation/                    # Scoring strategies
│   │   ├── 📄 __init__.py
│   │   └── 📄 scoring_strategies.py      # Evaluation metrics
│   │
│   ├── 📁 persistence/                   # Data storage
│   │   ├── 📄 __init__.py
│   │   ├── 📄 persistence_manager.py     # Storage coordination
│   │   ├── 📄 json_store.py              # JSON file storage
│   │   └── 📄 db_store.py                # Database storage
│   │
│   ├── 📁 registry/                      # Model registries
│   │   ├── 📄 __init__.py
│   │   └── 📄 model_registry.py          # Model configuration
│   │
│   ├── 📁 utils/                         # Utilities
│   │   ├── 📄 __init__.py
│   │   ├── 📄 error_handler.py           # Error management
│   │   └── 📄 logger.py                  # Logging utilities
│   │
│   └── 📁 cli/                           # CLI components
│       ├── 📄 __init__.py
│       └── 📄 main.py                    # CLI implementation
│
├── 📁 tests/                             # Test suite
│   ├── 📄 __init__.py
│   ├── 📄 conftest.py                    # Test configuration
│   ├── 📁 unit/                          # Unit tests
│   ├── 📁 integration/                   # Integration tests
│   ├── 📁 e2e/                           # End-to-end tests
│   └── 📁 benchmarks/                    # Performance tests
│
├── 📁 docs/                              # Documentation
│   ├── 📄 index.md                       # Main documentation
│   ├── 📁 categories/                    # Documentation categories
│   └── 📄 mkdocs.yml                     # Documentation config
│
├── 📁 examples/                          # Usage examples
│   ├── 📄 basic_usage.py                 # Simple examples
│   ├── 📄 advanced_async_usage.py        # Advanced patterns
│   └── 📄 custom_scoring_and_persistence.py
│
├── 📁 scripts/                           # Development scripts
│   ├── 📄 setup_dev.sh                   # Development setup
│   ├── 📄 run_tests.sh                   # Test execution
│   └── 📄 build_docs.sh                  # Documentation build
│
├── 📄 pyproject.toml                     # Project configuration
```
โ”œโ”€โ”€ ๐Ÿ“„ setup.py                           # Package setup
โ”œโ”€โ”€ ๐Ÿ“„ requirements.txt                   # Dependencies
โ”œโ”€โ”€ ๐Ÿ“„ requirements-dev.txt               # Development dependencies
โ”œโ”€โ”€ ๐Ÿ“„ README.md                          # Project overview
โ”œโ”€โ”€ ๐Ÿ“„ LICENSE                            # License file
โ”œโ”€โ”€ ๐Ÿ“„ CHANGELOG.md                       # Version history
โ””โ”€โ”€ ๐Ÿ“„ CONTRIBUTING.md                    # Contribution guidelines

๐ŸŽฏ Component Architecture

graph TB
    subgraph "User Interface Layer"
        CLI[CLI Interface]
        API[Python API]
    end

    subgraph "Core Framework Layer"
        Registry[Model Registry]
        Engine[Inference Engine]
        Generator[Dataset Generator]
        Suggestions[Auto Suggestions]
    end

    subgraph "Evaluation Layer"
        Scoring[Scoring Strategies]
        Validation[Result Validation]
        Analytics[Analytics Engine]
    end

    subgraph "Persistence Layer"
        Manager[Persistence Manager]
        JSON[JSON Store]
        DB[Database Store]
        Custom[Custom Backends]
    end

    subgraph "Infrastructure Layer"
        Logging[Logging System]
        Errors[Error Handling]
        Config[Configuration]
        Monitor[Monitoring]
    end

    CLI --> Engine
    API --> Engine
    Engine --> Registry
    Engine --> Generator
    Engine --> Scoring
    Engine --> Manager
    Scoring --> Validation
    Manager --> JSON
    Manager --> DB
    Manager --> Custom
    Engine --> Logging
    Engine --> Errors
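The dependency flow in the diagram above (API → Engine → Registry/Scoring/Persistence) can be sketched with minimal stand-in classes. These are simplified placeholders for illustration only; the real framework classes have richer interfaces:

```python
# Stand-in classes mirroring the layered architecture; not the real API.

class ModelRegistry:
    """Core Framework Layer: tracks registered model configurations."""
    def __init__(self):
        self.models = {}

    def register(self, name, config):
        self.models[name] = config

class ScoringStrategy:
    """Evaluation Layer: turns (output, expected) into a score."""
    def score(self, output, expected):
        return 1.0 if output == expected else 0.0

class PersistenceManager:
    """Persistence Layer: coordinates result storage backends."""
    def __init__(self):
        self.results = []

    def save(self, record):
        self.results.append(record)

class InferenceEngine:
    """Coordinates the registry, scoring, and persistence layers."""
    def __init__(self, registry, scorer, store):
        self.registry, self.scorer, self.store = registry, scorer, store

    def evaluate(self, model_name, output, expected):
        if model_name not in self.registry.models:
            raise KeyError(f"model not registered: {model_name}")
        score = self.scorer.score(output, expected)
        self.store.save({"model": model_name, "score": score})
        return score

registry = ModelRegistry()
registry.register("demo-model", {"provider": "local"})
engine = InferenceEngine(registry, ScoringStrategy(), PersistenceManager())
print(engine.evaluate("demo-model", "42", "42"))  # 1.0
```

The key design point the diagram encodes: the engine depends on the other layers through their interfaces, so registries, scorers, and storage backends can be swapped independently.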

๐Ÿš€ Release Process

๐Ÿ“ฆ Versioning Strategy

We follow **Semantic Versioning (SemVer)**: **MAJOR.MINOR.PATCH** (e.g., 1.2.3)

- **MAJOR**: Breaking changes
- **MINOR**: New features (backward compatible)
- **PATCH**: Bug fixes (backward compatible)

#### **Release Schedule**

- **Major Releases**: Quarterly (breaking changes)
- **Minor Releases**: Monthly (new features)
- **Patch Releases**: As needed (critical bug fixes)

#### **Release Process**

1. **Planning**: Feature planning and roadmap review
2. **Development**: Feature implementation and testing
3. **Testing**: Comprehensive testing across environments
4. **Documentation**: Update documentation and examples
5. **Release**: Package build and distribution
6. **Announcement**: Release notes and community update
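The SemVer bump rules described above can be made concrete with a small sketch. This is a hypothetical illustration, not a framework API; note that bumping MAJOR resets MINOR and PATCH, and bumping MINOR resets PATCH:

```python
def bump(version: str, part: str) -> str:
    """Apply a SemVer bump to a MAJOR.MINOR.PATCH version string."""
    major, minor, patch = (int(p) for p in version.split("."))
    if part == "major":   # breaking changes
        return f"{major + 1}.0.0"
    if part == "minor":   # new, backward-compatible features
        return f"{major}.{minor + 1}.0"
    if part == "patch":   # backward-compatible bug fixes
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown part: {part}")

print(bump("1.2.3", "minor"))  # 1.3.0
print(bump("1.2.3", "major"))  # 2.0.0
```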

๐ŸŽฏ Getting Involved

#### **Ways to Contribute**

| **Level** | **Activities** | **Time Commitment** |
|-----------|----------------|---------------------|
| **Beginner** | Bug reports, documentation fixes, typos | 15-30 minutes |
| **Intermediate** | Feature requests, small features, tests | 1-3 hours |
| **Advanced** | Major features, performance improvements | 3+ hours |
| **Expert** | Architecture decisions, mentoring, releases | Ongoing |

#### **Communication Channels**

- **GitHub Issues**: Bug reports and feature requests
- **GitHub Discussions**: Q&A and community chat
- **Pull Requests**: Code contributions and reviews
- **Discord** (Coming Soon): Real-time community chat

#### **Recognition**

- **Contributors**: Listed in CONTRIBUTORS.md
- **Maintainers**: Core team with merge permissions
- **Champions**: Community leaders and advocates

## ๐ŸŽ‰ Ready to Contribute?

**Join our community of developers building the future of LLM evaluation!**

[![๐Ÿ› Report Bug](https://img.shields.io/badge/๐Ÿ›_Report_Bug-GitHub%20Issues-ef4444?style=for-the-badge)](https://github.com/isathish/LLMEvaluationFramework/issues/new?template=bug_report.md)
[![โœจ Request Feature](https://img.shields.io/badge/โœจ_Request_Feature-GitHub%20Issues-22c55e?style=for-the-badge)](https://github.com/isathish/LLMEvaluationFramework/issues/new?template=feature_request.md)
[![๐Ÿค Start Contributing](https://img.shields.io/badge/๐Ÿค_Start_Contributing-Fork%20Now-6366f1?style=for-the-badge)](https://github.com/isathish/LLMEvaluationFramework/fork)
[![๐Ÿ’ฌ Join Discussions](https://img.shields.io/badge/๐Ÿ’ฌ_Join_Discussions-GitHub-f59e0b?style=for-the-badge)](https://github.com/isathish/LLMEvaluationFramework/discussions)

---

**Every contribution makes a difference! ๐ŸŒŸ**

*Whether you're fixing a typo or building a major feature, your work helps thousands of developers build better AI systems.*