# 👨‍💻 Developer Guide

## 🎯 Developer Journey Map
| 🚀 **Setup** | 🧪 **Testing** | 🔧 **Building** | 🤝 **Contributing** |
|--------------|----------------|-----------------|---------------------|
| Environment & tools | Quality & validation | Features & extensions | Collaboration & review |
## 🚀 Quick Development Setup

### ⚡ Lightning Setup (5 minutes)
```bash
# 1️⃣ Clone and enter directory
git clone https://github.com/isathish/LLMEvaluationFramework.git
cd LLMEvaluationFramework

# 2️⃣ Create isolated environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# 3️⃣ Install development dependencies
pip install -e ".[dev,docs,test]"

# 4️⃣ Verify installation
python -c "from llm_evaluation_framework import ModelRegistry; print('✅ Setup successful!')"

# 5️⃣ Run quick test
pytest tests/test_quick_setup.py -v
```
### 🛠️ Complete Development Environment
#### Prerequisites

- **Python 3.8+** (recommended: 3.11)
- **Git** for version control
- **Visual Studio Code** (recommended IDE)
- **Docker** (optional, for containerized development)

#### Development Dependencies
```bash
# Install all development tools
pip install -e ".[dev]"

# This includes:
# - pytest (testing framework)
# - pytest-cov (coverage reporting)
# - black (code formatting)
# - flake8 (linting)
# - mypy (type checking)
# - pre-commit (git hooks)
# - sphinx (documentation)
```
#### IDE Configuration

**Visual Studio Code Settings** (`.vscode/settings.json`):

```json
{
  "python.defaultInterpreterPath": "./venv/bin/python",
  "python.formatting.provider": "black",
  "python.linting.enabled": true,
  "python.linting.flake8Enabled": true,
  "python.linting.mypyEnabled": true,
  "python.testing.pytestEnabled": true,
  "python.testing.pytestArgs": [
    "tests",
    "--cov=llm_evaluation_framework",
    "--cov-report=html"
  ],
  "files.associations": {
    "*.md": "markdown"
  },
  "markdown.extension.toc.levels": "1..3"
}
```

#### Pre-commit Hooks Setup

With the dev extras installed, run `pre-commit install` once to register the git hooks; the hook configuration itself is shown in the Development Tools Configuration section of this guide.
## 🧪 Comprehensive Testing Guide

### 🎯 Testing Philosophy
Our testing strategy ensures reliability, maintainability, and confidence in every release:
| **Level** | **Coverage** | **Purpose** | **Tools** |
|-----------|--------------|-------------|-----------|
| **Unit Tests** | Individual functions/classes | Logic validation | pytest, unittest.mock |
| **Integration Tests** | Component interactions | Workflow validation | pytest, fixtures |
| **End-to-End Tests** | Complete workflows | User experience | pytest, real API calls |
| **Performance Tests** | Speed/memory usage | Optimization validation | pytest-benchmark |
### 🔬 Testing Commands
#### Basic Testing

```bash
# Run all tests
pytest

# Run with verbose output
pytest -v

# Run a specific test file
pytest tests/test_model_registry.py

# Run a specific test function
pytest tests/test_model_registry.py::test_register_model

# Run tests matching a pattern
pytest -k "test_model" -v
```

#### Coverage Analysis

```bash
# Generate coverage report
pytest --cov=llm_evaluation_framework

# Generate HTML coverage report
pytest --cov=llm_evaluation_framework --cov-report=html

# Enforce a coverage threshold
pytest --cov=llm_evaluation_framework --cov-fail-under=85

# Show missing lines
pytest --cov=llm_evaluation_framework --cov-report=term-missing
```

#### Performance Testing

Performance tests live in `tests/benchmarks/` and use `pytest-benchmark` (see the table above); run them with `pytest tests/benchmarks/`.

#### Advanced Testing Options

The markers defined in `pyproject.toml` let you select subsets of the suite, e.g. `pytest -m "not slow"` or `pytest -m integration`.
### 🧪 Writing Quality Tests

#### Test Categories and Conventions

| **Test Type** | **Naming** | **Location** | **Purpose** |
|---------------|------------|--------------|-------------|
| **Unit Tests** | `test_<name>` | `tests/unit/` | Individual function testing |
| **Integration Tests** | `test_<name>_integration` | `tests/integration/` | Component interaction testing |
| **E2E Tests** | `test_<name>_e2e` | `tests/e2e/` | Complete workflow testing |
| **Performance Tests** | `test_<name>_performance` | `tests/benchmarks/` | Performance validation |

#### Test Structure Template
"""
Test template following AAA pattern (Arrange, Act, Assert)
"""
import pytest
from unittest.mock import Mock, patch
from llm_evaluation_framework import ModelRegistry
class TestModelRegistry:
"""Test class for ModelRegistry functionality"""
def setup_method(self):
"""Setup before each test method"""
self.registry = ModelRegistry()
self.sample_config = {
"provider": "openai",
"api_cost_input": 0.001,
"api_cost_output": 0.002,
"capabilities": ["reasoning"]
}
def test_register_model_success(self):
"""Test successful model registration"""
# Arrange
model_name = "test-model"
# Act
result = self.registry.register_model(model_name, self.sample_config)
# Assert
assert result is True
assert model_name in self.registry._models
assert self.registry._models[model_name] == self.sample_config
def test_register_model_invalid_config(self):
"""Test model registration with invalid configuration"""
# Arrange
model_name = "test-model"
invalid_config = {"provider": "unknown"}
# Act & Assert
with pytest.raises(ValueError, match="Invalid provider"):
self.registry.register_model(model_name, invalid_config)
@pytest.mark.parametrize("provider,expected", [
("openai", True),
("anthropic", True),
("invalid", False)
])
def test_validate_provider(self, provider, expected):
"""Test provider validation with multiple inputs"""
# Arrange
config = self.sample_config.copy()
config["provider"] = provider
# Act
result = self.registry._validate_provider(config)
# Assert
assert result == expected
@patch('llm_evaluation_framework.model_registry.requests.get')
def test_validate_api_key(self, mock_get):
"""Test API key validation with mocked HTTP calls"""
# Arrange
mock_get.return_value.status_code = 200
# Act
result = self.registry._validate_api_key("test-key", "openai")
# Assert
assert result is True
mock_get.assert_called_once()
# Integration test example
class TestModelRegistryIntegration:
"""Integration tests for ModelRegistry with other components"""
def test_registry_with_inference_engine(self):
"""Test registry integration with inference engine"""
# Arrange
registry = ModelRegistry()
engine = ModelInferenceEngine(registry)
# Register model
registry.register_model("gpt-3.5-turbo", self.sample_config)
# Act
available_models = engine.get_available_models()
# Assert
assert "gpt-3.5-turbo" in available_models
# Fixture examples
@pytest.fixture
def sample_registry():
"""Fixture providing a pre-configured registry"""
registry = ModelRegistry()
registry.register_model("test-model", {
"provider": "openai",
"capabilities": ["reasoning"]
})
return registry
@pytest.fixture
def sample_test_cases():
"""Fixture providing sample test cases"""
return [
{
"id": "test_1",
"prompt": "What is 2+2?",
"expected_output": "4",
"metadata": {"difficulty": "easy"}
},
{
"id": "test_2",
"prompt": "Explain quantum computing",
"expected_output": "Quantum computing uses quantum mechanics...",
"metadata": {"difficulty": "hard"}
}
]
## 🔧 Building Custom Extensions

### 🏗️ Extension Architecture
The framework is designed with **extensibility** as a core principle. Every major component can be extended or replaced:
```mermaid
graph TB
    subgraph "Extension Points"
        A[Scoring Strategies]
        B[Persistence Backends]
        C[Model Providers]
        D[Test Generators]
        E[Evaluation Hooks]
    end

    subgraph "Base Interfaces"
        F[ScoringStrategy]
        G[PersistenceBackend]
        H[ModelProvider]
        I[TestGenerator]
        J[EvaluationHook]
    end

    A --> F
    B --> G
    C --> H
    D --> I
    E --> J

    subgraph "Your Extensions"
        K[CustomScorer]
        L[CloudStorage]
        M[CustomLLM]
        N[DomainGenerator]
        O[MetricsHook]
    end

    F --> K
    G --> L
    H --> M
    I --> N
    J --> O
```

### 🎯 Custom Scoring Strategy
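The concrete `ScoringStrategy` base class ships with the framework; what follows is a hedged sketch of the interface that custom scorers implement. The `calculate_score` signature is taken from the example in this guide; the abstract-base-class shape and the `ExactMatchScorer` demo are assumptions for illustration.

```python
# Assumed shape of the framework's scoring interface (sketch, not the real class).
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class ScoringStrategy(ABC):
    """Base interface that custom scorers subclass."""

    @abstractmethod
    def calculate_score(self, predictions: List[str],
                        references: List[str]) -> Dict[str, Any]:
        """Return an overall score plus any per-component detail."""


class ExactMatchScorer(ScoringStrategy):
    """Minimal concrete strategy: fraction of exact matches."""

    def calculate_score(self, predictions, references):
        matches = sum(p == r for p, r in zip(predictions, references))
        total = max(len(references), 1)  # avoid division by zero
        return {"overall_score": matches / total}
```

Any subclass that returns a dict with at least an `overall_score` key slots into the evaluation flow the same way the larger example below does.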
"""
Example: Building a domain-specific scoring strategy
"""
from llm_evaluation_framework.evaluation.scoring_strategies import ScoringStrategy
from typing import List, Dict, Any
import re
import nltk
from textstat import flesch_reading_ease
class ReadabilityScorer(ScoringStrategy):
"""
Scoring strategy focused on text readability and comprehension
Perfect for educational content evaluation
"""
def __init__(self, target_grade_level: int = 8):
"""
Initialize readability scorer
Args:
target_grade_level: Target reading grade level (1-12)
"""
self.target_grade_level = target_grade_level
self.weights = {
'readability': 0.4,
'clarity': 0.3,
'completeness': 0.3
}
def calculate_score(self, predictions: List[str], references: List[str]) -> Dict[str, Any]:
"""Calculate readability-focused scores"""
scores = {
'readability_scores': [],
'clarity_scores': [],
'completeness_scores': []
}
for pred, ref in zip(predictions, references):
# Readability analysis
readability = self._analyze_readability(pred)
clarity = self._analyze_clarity(pred)
completeness = self._analyze_completeness(pred, ref)
scores['readability_scores'].append(readability)
scores['clarity_scores'].append(clarity)
scores['completeness_scores'].append(completeness)
# Calculate component averages
avg_scores = {
'readability': sum(scores['readability_scores']) / len(scores['readability_scores']),
'clarity': sum(scores['clarity_scores']) / len(scores['clarity_scores']),
'completeness': sum(scores['completeness_scores']) / len(scores['completeness_scores'])
}
# Calculate weighted overall score
overall_score = sum(
avg_scores[component] * self.weights[component]
for component in avg_scores
)
return {
'overall_score': overall_score,
'component_scores': avg_scores,
'detailed_metrics': {
'target_grade_level': self.target_grade_level,
'weights_used': self.weights,
'individual_scores': scores
}
}
def _analyze_readability(self, text: str) -> float:
"""Analyze text readability using multiple metrics"""
# Flesch Reading Ease Score
flesch_score = flesch_reading_ease(text)
# Convert Flesch score to grade level approximation
if flesch_score >= 90:
grade_level = 5
elif flesch_score >= 80:
grade_level = 6
elif flesch_score >= 70:
grade_level = 7
elif flesch_score >= 60:
grade_level = 8
elif flesch_score >= 50:
grade_level = 9
elif flesch_score >= 30:
grade_level = 10
else:
grade_level = 12
# Score based on proximity to target grade level
grade_diff = abs(grade_level - self.target_grade_level)
readability_score = max(0, 1 - (grade_diff / 6)) # Normalize to 0-1
return readability_score
def _analyze_clarity(self, text: str) -> float:
"""Analyze text clarity using linguistic features"""
sentences = nltk.sent_tokenize(text)
words = nltk.word_tokenize(text)
# Average sentence length (optimal: 15-20 words)
avg_sentence_length = len(words) / len(sentences)
if 15 <= avg_sentence_length <= 20:
length_score = 1.0
else:
length_score = max(0, 1 - abs(avg_sentence_length - 17.5) / 17.5)
# Transition words and phrases
transitions = [
'however', 'therefore', 'furthermore', 'moreover',
'in addition', 'for example', 'in conclusion', 'as a result'
]
transition_count = sum(1 for transition in transitions
if transition in text.lower())
transition_score = min(1.0, transition_count / 3)
# Active vs passive voice (prefer active)
passive_indicators = ['was', 'were', 'been', 'being']
passive_count = sum(1 for indicator in passive_indicators
if indicator in text.lower())
active_score = max(0, 1 - (passive_count / len(words)) * 10)
# Combined clarity score
clarity_score = (length_score + transition_score + active_score) / 3
return clarity_score
def _analyze_completeness(self, prediction: str, reference: str) -> float:
"""Analyze completeness compared to reference"""
# Key concepts extraction (simplified)
pred_concepts = set(re.findall(r'\b\w{4,}\b', prediction.lower()))
ref_concepts = set(re.findall(r'\b\w{4,}\b', reference.lower()))
if not ref_concepts:
return 1.0
# Calculate concept coverage
covered_concepts = pred_concepts.intersection(ref_concepts)
coverage_score = len(covered_concepts) / len(ref_concepts)
return coverage_score
# Plugin registration system
class ScorerRegistry:
"""Registry for custom scoring strategies"""
_scorers = {}
@classmethod
def register(cls, name: str, scorer_class: type):
"""Register a custom scorer"""
cls._scorers[name] = scorer_class
@classmethod
def get_scorer(cls, name: str, **kwargs):
"""Get a registered scorer instance"""
if name not in cls._scorers:
raise ValueError(f"Scorer '{name}' not registered")
return cls._scorers[name](**kwargs)
@classmethod
def list_scorers(cls) -> List[str]:
"""List all registered scorers"""
return list(cls._scorers.keys())
# Register the custom scorer
ScorerRegistry.register('readability', ReadabilityScorer)
# Usage example
def use_custom_scorer():
"""Example of using custom scoring strategy"""
# Create custom scorer
readability_scorer = ScorerRegistry.get_scorer('readability', target_grade_level=8)
# Use in evaluation
from llm_evaluation_framework import ModelInferenceEngine, ModelRegistry
registry = ModelRegistry()
engine = ModelInferenceEngine(registry)
# Register model
registry.register_model("gpt-3.5-turbo", {
"provider": "openai",
"capabilities": ["reasoning"]
})
# Generate educational test cases
test_cases = [
{
"prompt": "Explain photosynthesis to an 8th grader",
"expected_output": "Photosynthesis is how plants make food using sunlight, water, and carbon dioxide."
}
]
# Evaluate with custom scorer
results = engine.evaluate_model(
model_name="gpt-3.5-turbo",
test_cases=test_cases,
scoring_strategy=readability_scorer
)
# Display results
print(f"Overall Readability Score: {results['scores']['overall_score']:.2f}")
print(f"Component Scores: {results['scores']['component_scores']}")
### 🗄️ Custom Persistence Backend
"""
Example: Building a custom cloud storage backend
"""
import json
import boto3
from typing import Dict, Any, List, Optional
from llm_evaluation_framework.persistence.base_store import BaseStore
from botocore.exceptions import ClientError
class S3PersistenceBackend(BaseStore):
"""
AWS S3 persistence backend for scalable evaluation storage
"""
def __init__(self, bucket_name: str, region: str = 'us-east-1',
key_prefix: str = 'llm-evaluations/'):
"""
Initialize S3 backend
Args:
bucket_name: S3 bucket name
region: AWS region
key_prefix: Prefix for all keys
"""
self.bucket_name = bucket_name
self.key_prefix = key_prefix
# Initialize S3 client
self.s3_client = boto3.client('s3', region_name=region)
# Verify bucket access
self._verify_bucket_access()
def _verify_bucket_access(self):
"""Verify bucket exists and is accessible"""
try:
self.s3_client.head_bucket(Bucket=self.bucket_name)
except ClientError as e:
if e.response['Error']['Code'] == '404':
raise ValueError(f"Bucket {self.bucket_name} does not exist")
else:
raise ValueError(f"Cannot access bucket {self.bucket_name}: {e}")
def save(self, key: str, data: Dict[str, Any]) -> bool:
"""Save evaluation results to S3"""
try:
full_key = f"{self.key_prefix}{key}.json"
# Add metadata for search/indexing
metadata = self._extract_metadata(data)
# Upload to S3
self.s3_client.put_object(
Bucket=self.bucket_name,
Key=full_key,
Body=json.dumps(data, default=str),
ContentType='application/json',
Metadata=metadata,
ServerSideEncryption='AES256' # Enable encryption
)
return True
except Exception as e:
print(f"S3 save failed: {e}")
return False
def load(self, key: str) -> Dict[str, Any]:
"""Load evaluation results from S3"""
try:
full_key = f"{self.key_prefix}{key}.json"
response = self.s3_client.get_object(
Bucket=self.bucket_name,
Key=full_key
)
return json.loads(response['Body'].read())
except ClientError as e:
if e.response['Error']['Code'] == 'NoSuchKey':
raise KeyError(f"Key not found: {key}")
raise
def query(self, filters: Dict[str, Any] = None) -> List[Dict[str, Any]]:
"""Query evaluations with optional filters"""
try:
# List objects with prefix
response = self.s3_client.list_objects_v2(
Bucket=self.bucket_name,
Prefix=self.key_prefix
)
if 'Contents' not in response:
return []
results = []
for obj in response['Contents']:
try:
# Get object metadata
head_response = self.s3_client.head_object(
Bucket=self.bucket_name,
Key=obj['Key']
)
metadata = head_response.get('Metadata', {})
# Apply filters if provided
if filters and not self._matches_filters(metadata, filters):
continue
# Load full object if it matches filters
key = obj['Key'].replace(self.key_prefix, '').replace('.json', '')
full_data = self.load(key)
results.append(full_data)
except Exception as e:
print(f"Error processing object {obj['Key']}: {e}")
continue
return results
except Exception as e:
print(f"S3 query failed: {e}")
return []
def delete(self, key: str) -> bool:
"""Delete evaluation from S3"""
try:
full_key = f"{self.key_prefix}{key}.json"
self.s3_client.delete_object(
Bucket=self.bucket_name,
Key=full_key
)
return True
except Exception as e:
print(f"S3 delete failed: {e}")
return False
def list_evaluations(self, limit: int = 100) -> List[Dict[str, str]]:
"""List available evaluations with metadata"""
try:
response = self.s3_client.list_objects_v2(
Bucket=self.bucket_name,
Prefix=self.key_prefix,
MaxKeys=limit
)
if 'Contents' not in response:
return []
evaluations = []
for obj in response['Contents']:
key = obj['Key'].replace(self.key_prefix, '').replace('.json', '')
evaluations.append({
'key': key,
'last_modified': obj['LastModified'].isoformat(),
'size': obj['Size']
})
return evaluations
except Exception as e:
print(f"S3 list failed: {e}")
return []
def _extract_metadata(self, data: Dict[str, Any]) -> Dict[str, str]:
"""Extract metadata for S3 object tagging"""
metadata = {}
if 'model_name' in data:
metadata['model-name'] = data['model_name']
if 'timestamp' in data:
metadata['timestamp'] = data['timestamp']
if 'aggregate_metrics' in data:
metrics = data['aggregate_metrics']
if 'accuracy' in metrics:
metadata['accuracy'] = str(round(metrics['accuracy'], 3))
if 'total_cost' in metrics:
metadata['total-cost'] = str(round(metrics['total_cost'], 4))
return metadata
def _matches_filters(self, metadata: Dict[str, str],
filters: Dict[str, Any]) -> bool:
"""Check if metadata matches filters"""
for filter_key, filter_value in filters.items():
if filter_key == 'model_name':
if metadata.get('model-name') != filter_value:
return False
elif filter_key == 'min_accuracy':
accuracy = float(metadata.get('accuracy', 0))
if accuracy < filter_value:
return False
# Add more filter conditions as needed
return True
# Usage example
def use_custom_persistence():
"""Example of using custom S3 persistence backend"""
# Initialize S3 backend
s3_backend = S3PersistenceBackend(
bucket_name='my-llm-evaluations',
region='us-west-2',
key_prefix='evaluations/production/'
)
# Use in persistence manager
from llm_evaluation_framework.persistence import PersistenceManager
persistence_manager = PersistenceManager({
's3': s3_backend,
'local': JsonStore('./local_backup/') # Local backup
})
# Use in evaluation engine
from llm_evaluation_framework import ModelInferenceEngine, ModelRegistry
registry = ModelRegistry()
engine = ModelInferenceEngine(registry, persistence_manager)
# Run evaluation - results automatically saved to S3
results = engine.evaluate_model("gpt-3.5-turbo", test_cases)
# Query stored evaluations
recent_evaluations = s3_backend.query({
'model_name': 'gpt-3.5-turbo',
'min_accuracy': 0.8
})
print(f"Found {len(recent_evaluations)} high-accuracy evaluations")
## 📏 Code Quality Standards

### 🎯 Quality Guidelines
We maintain **enterprise-grade code quality** through rigorous standards:

#### Code Style & Formatting

- **PEP 8**: Python Enhancement Proposal 8 compliance
- **Black**: Automatic code formatting (line length: 88 characters)
- **isort**: Import statement organization
- **Line Length**: Maximum 88 characters (Black standard)

#### Type Safety

- **100% Type Hints**: All functions, methods, and variables must have type annotations
- **mypy**: Static type checking with a strict configuration
- **Runtime Validation**: Input validation using type checking

#### Documentation Standards

- **Docstrings**: Google-style docstrings for all public APIs
- **Type Documentation**: Document complex types and data structures
- **Examples**: Include usage examples in docstrings
- **API Documentation**: Auto-generated from code using Sphinx

#### Testing Requirements

- **Coverage**: Minimum 85% test coverage
- **Test Types**: Unit, integration, and end-to-end tests
- **Documentation**: Tests serve as living documentation
- **Performance**: Include performance regression tests
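As a sketch of these documentation standards in practice (Google-style docstring, full type hints, an inline example), here is an illustrative function; `estimate_cost` is not a framework API, just a demonstration of the expected style:

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  input_rate: float, output_rate: float) -> float:
    """Estimate the API cost of a single model call.

    Args:
        input_tokens: Number of prompt tokens sent to the model.
        output_tokens: Number of completion tokens returned.
        input_rate: Cost per input token in USD.
        output_rate: Cost per output token in USD.

    Returns:
        Estimated cost in USD.

    Example:
        >>> estimate_cost(1000, 500, 0.001, 0.002)
        2.0
    """
    return input_tokens * input_rate + output_tokens * output_rate
```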
### 📋 Code Review Checklist
#### Before Submitting a PR

- [ ] **Tests Pass**: All tests pass locally
- [ ] **Coverage**: New code has appropriate test coverage
- [ ] **Type Checking**: mypy passes without errors
- [ ] **Linting**: flake8 passes without errors
- [ ] **Formatting**: Code formatted with Black
- [ ] **Documentation**: Public APIs documented
- [ ] **Examples**: Complex features include usage examples
- [ ] **Performance**: No significant performance regression

#### PR Description Requirements

- [ ] **Clear Title**: Descriptive PR title
- [ ] **Problem Description**: What issue does this solve?
- [ ] **Solution Overview**: How does this solve the issue?
- [ ] **Testing**: How was this tested?
- [ ] **Breaking Changes**: Any breaking changes noted
- [ ] **Documentation**: Documentation updates included
### 🔧 Development Tools Configuration
#### pyproject.toml Configuration
```toml
[tool.black]
line-length = 88
target-version = ['py38']
include = '\.pyi?$'
extend-exclude = '''
/(
  # directories
  \.eggs
  | \.git
  | \.hg
  | \.mypy_cache
  | \.tox
  | \.venv
  | build
  | dist
)/
'''

[tool.isort]
profile = "black"
multi_line_output = 3
line_length = 88
known_first_party = ["llm_evaluation_framework"]

[tool.mypy]
python_version = "3.8"
warn_return_any = true
warn_unused_configs = true
disallow_untyped_defs = true
disallow_incomplete_defs = true
check_untyped_defs = true
no_implicit_optional = true
warn_redundant_casts = true
warn_unused_ignores = true
show_error_codes = true

[tool.pytest.ini_options]
minversion = "6.0"
addopts = "-ra -q --strict-markers --cov=llm_evaluation_framework"
testpaths = ["tests"]
markers = [
    "slow: marks tests as slow (deselect with '-m \"not slow\"')",
    "integration: marks tests as integration tests",
    "e2e: marks tests as end-to-end tests",
]

[tool.coverage.run]
source = ["llm_evaluation_framework"]
omit = [
    "*/tests/*",
    "*/test_*",
    "*/conftest.py",
]

[tool.coverage.report]
exclude_lines = [
    "pragma: no cover",
    "def __repr__",
    "if self.debug:",
    "if settings.DEBUG",
    "raise AssertionError",
    "raise NotImplementedError",
    "if 0:",
    "if __name__ == .__main__.:",
    "class .*\\bProtocol\\):",
    "@(abc\\.)?abstractmethod",
]
```
#### Pre-commit Configuration (`.pre-commit-config.yaml`)

```yaml
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.4.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-yaml
      - id: check-added-large-files

  - repo: https://github.com/psf/black
    rev: 23.1.0
    hooks:
      - id: black
        language_version: python3

  - repo: https://github.com/pycqa/isort
    rev: 5.12.0
    hooks:
      - id: isort

  - repo: https://github.com/pycqa/flake8
    rev: 6.0.0
    hooks:
      - id: flake8

  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v1.0.1
    hooks:
      - id: mypy
        additional_dependencies: [types-all]
```
## 🤝 Contributing Workflow

### 🚀 Contribution Process
#### 1. Issue Discovery & Planning

```bash
# Check for existing issues:
# https://github.com/isathish/LLMEvaluationFramework/issues

# Create a new issue if needed, using the issue templates
# for bugs, features, or documentation.
```

#### 2. Development Setup

```bash
# Fork the repository on GitHub, then clone your fork
git clone https://github.com/YOUR_USERNAME/LLMEvaluationFramework.git
cd LLMEvaluationFramework

# Add the upstream remote
git remote add upstream https://github.com/isathish/LLMEvaluationFramework.git

# Create a feature branch
git checkout -b feature/your-feature-name
```

#### 3. Development Cycle

Implement your change, keep the test suite passing locally, and run the formatting, linting, and type-checking tools described in the quality standards above before each commit.

#### 4. Submission Process

Push your branch to your fork and open a pull request against the main repository, following the PR description requirements above.

#### 5. Review & Merge

- **Code Review**: Maintainers review code
- **CI Checks**: Automated tests must pass
- **Approval**: At least one maintainer approval required
- **Merge**: Squash and merge to main branch
### 🏷️ Commit Message Standards

We follow the **Conventional Commits** specification:

#### Commit Format

`<type>(<scope>): <description>`, with an optional body and footer after a blank line.

#### Commit Types

- **feat**: New feature
- **fix**: Bug fix
- **docs**: Documentation-only changes
- **style**: Changes that do not affect the meaning of the code
- **refactor**: Code change that neither fixes a bug nor adds a feature
- **perf**: Performance improvement
- **test**: Adding missing tests or correcting existing tests
- **chore**: Changes to the build process or auxiliary tools

#### Examples
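Some messages following this convention (the scopes are illustrative, not an exhaustive list for this repository):

```text
feat(scoring): add readability scoring strategy
fix(persistence): handle missing S3 keys in load()
docs: clarify development setup instructions
test(registry): add parametrized provider validation tests
chore: bump pre-commit hook versions
```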
## 📁 Project Structure & Architecture

### 🏗️ Directory Structure
```text
LLMEvaluationFramework/
├── 📁 llm_evaluation_framework/          # Main package
│   ├── 📄 __init__.py                    # Package initialization
│   ├── 📄 cli.py                         # Command-line interface
│   ├── 📄 model_registry.py              # Model management
│   ├── 📄 model_inference_engine.py      # Evaluation engine
│   ├── 📄 test_dataset_generator.py      # Test case generation
│   ├── 📄 auto_suggestion_engine.py      # AI recommendations
│   │
│   ├── 📁 core/                          # Core interfaces
│   │   ├── 📄 __init__.py
│   │   ├── 📄 base_engine.py             # Base engine interface
│   │   └── 📄 base_registry.py           # Base registry interface
│   │
│   ├── 📁 engines/                       # Evaluation engines
│   │   ├── 📄 __init__.py
│   │   └── 📄 async_inference_engine.py  # Async evaluation
│   │
│   ├── 📁 evaluation/                    # Scoring strategies
│   │   ├── 📄 __init__.py
│   │   └── 📄 scoring_strategies.py      # Evaluation metrics
│   │
│   ├── 📁 persistence/                   # Data storage
│   │   ├── 📄 __init__.py
│   │   ├── 📄 persistence_manager.py     # Storage coordination
│   │   ├── 📄 json_store.py              # JSON file storage
│   │   └── 📄 db_store.py                # Database storage
│   │
│   ├── 📁 registry/                      # Model registries
│   │   ├── 📄 __init__.py
│   │   └── 📄 model_registry.py          # Model configuration
│   │
│   ├── 📁 utils/                         # Utilities
│   │   ├── 📄 __init__.py
│   │   ├── 📄 error_handler.py           # Error management
│   │   └── 📄 logger.py                  # Logging utilities
│   │
│   └── 📁 cli/                           # CLI components
│       ├── 📄 __init__.py
│       └── 📄 main.py                    # CLI implementation
│
├── 📁 tests/                             # Test suite
│   ├── 📄 __init__.py
│   ├── 📄 conftest.py                    # Test configuration
│   ├── 📁 unit/                          # Unit tests
│   ├── 📁 integration/                   # Integration tests
│   ├── 📁 e2e/                           # End-to-end tests
│   └── 📁 benchmarks/                    # Performance tests
│
├── 📁 docs/                              # Documentation
│   ├── 📄 index.md                       # Main documentation
│   ├── 📁 categories/                    # Documentation categories
│   └── 📄 mkdocs.yml                     # Documentation config
│
├── 📁 examples/                          # Usage examples
│   ├── 📄 basic_usage.py                 # Simple examples
│   ├── 📄 advanced_async_usage.py        # Advanced patterns
│   └── 📄 custom_scoring_and_persistence.py
│
├── 📁 scripts/                           # Development scripts
│   ├── 📄 setup_dev.sh                   # Development setup
│   ├── 📄 run_tests.sh                   # Test execution
│   └── 📄 build_docs.sh                  # Documentation build
│
├── 📄 pyproject.toml                     # Project configuration
├── 📄 setup.py                           # Package setup
├── 📄 requirements.txt                   # Dependencies
├── 📄 requirements-dev.txt               # Development dependencies
├── 📄 README.md                          # Project overview
├── 📄 LICENSE                            # License file
├── 📄 CHANGELOG.md                       # Version history
└── 📄 CONTRIBUTING.md                    # Contribution guidelines
```
### 🎯 Component Architecture
```mermaid
graph TB
    subgraph "User Interface Layer"
        CLI[CLI Interface]
        API[Python API]
    end

    subgraph "Core Framework Layer"
        Registry[Model Registry]
        Engine[Inference Engine]
        Generator[Dataset Generator]
        Suggestions[Auto Suggestions]
    end

    subgraph "Evaluation Layer"
        Scoring[Scoring Strategies]
        Validation[Result Validation]
        Analytics[Analytics Engine]
    end

    subgraph "Persistence Layer"
        Manager[Persistence Manager]
        JSON[JSON Store]
        DB[Database Store]
        Custom[Custom Backends]
    end

    subgraph "Infrastructure Layer"
        Logging[Logging System]
        Errors[Error Handling]
        Config[Configuration]
        Monitor[Monitoring]
    end

    CLI --> Engine
    API --> Engine
    Engine --> Registry
    Engine --> Generator
    Engine --> Scoring
    Engine --> Manager
    Scoring --> Validation
    Manager --> JSON
    Manager --> DB
    Manager --> Custom
    Engine --> Logging
    Engine --> Errors
```

## 🚀 Release Process
### 📦 Versioning Strategy
We follow **Semantic Versioning (SemVer)**:

- **MAJOR.MINOR.PATCH** (e.g., 1.2.3)
- **MAJOR**: Breaking changes
- **MINOR**: New features (backward compatible)
- **PATCH**: Bug fixes (backward compatible)

#### Release Schedule

- **Major Releases**: Quarterly (breaking changes)
- **Minor Releases**: Monthly (new features)
- **Patch Releases**: As needed (critical bug fixes)

#### Release Process

1. **Planning**: Feature planning and roadmap review
2. **Development**: Feature implementation and testing
3. **Testing**: Comprehensive testing across environments
4. **Documentation**: Update documentation and examples
5. **Release**: Package build and distribution
6. **Announcement**: Release notes and community update
### 🎯 Getting Involved
#### Ways to Contribute

| **Level** | **Activities** | **Time Commitment** |
|-----------|----------------|---------------------|
| **Beginner** | Bug reports, documentation fixes, typos | 15-30 minutes |
| **Intermediate** | Feature requests, small features, tests | 1-3 hours |
| **Advanced** | Major features, performance improvements | 3+ hours |
| **Expert** | Architecture decisions, mentoring, releases | Ongoing |

#### Communication Channels

- **GitHub Issues**: Bug reports and feature requests
- **GitHub Discussions**: Q&A and community chat
- **Pull Requests**: Code contributions and reviews
- **Discord** (Coming Soon): Real-time community chat

#### Recognition

- **Contributors**: Listed in CONTRIBUTORS.md
- **Maintainers**: Core team with merge permissions
- **Champions**: Community leaders and advocates
## 🎉 Ready to Contribute?

**Join our community of developers building the future of LLM evaluation!**

- [Report a Bug](https://github.com/isathish/LLMEvaluationFramework/issues/new?template=bug_report.md)
- [Request a Feature](https://github.com/isathish/LLMEvaluationFramework/issues/new?template=feature_request.md)
- [Fork the Repository](https://github.com/isathish/LLMEvaluationFramework/fork)
- [Join the Discussions](https://github.com/isathish/LLMEvaluationFramework/discussions)

---

**Every contribution makes a difference! 🎉**

*Whether you're fixing a typo or building a major feature, your work helps thousands of developers build better AI systems.*