CLI Usage Guide

CLI Overview
The LLM Evaluation Framework provides a comprehensive command-line interface (llm-eval) that enables automation, batch processing, and integration with CI/CD pipelines. All framework functionality is available through intuitive CLI commands.
Core Commands

Quick Start Commands
# Get help for any command
llm-eval --help
llm-eval evaluate --help
# Check CLI version and status
llm-eval --version
# List available capabilities and models
llm-eval list
Evaluate Command
The evaluate command runs comprehensive model evaluations with customizable parameters.
Basic Syntax

Command Options

Usage Examples

Output Format
Console Output:
Starting evaluation for gpt-3.5-turbo...
Model registered successfully
Generated 10 test cases for reasoning
Running evaluation...

Evaluation Results:
• Model: gpt-3.5-turbo
• Capability: reasoning
• Test Cases: 10
• Accuracy: 85.0%
• Total Cost: $0.0045
• Total Time: 3.2s
• Average Response Time: 0.32s/case

Evaluation completed successfully!
Results saved to: evaluation_results.json
JSON Output Structure:
{
  "model_name": "gpt-3.5-turbo",
  "capability": "reasoning",
  "aggregate_metrics": {
    "accuracy": 0.85,
    "total_cost": 0.0045,
    "total_time": 3.2,
    "test_count": 10
  },
  "test_results": [
    {
      "test_id": 1,
      "prompt": "...",
      "expected": "...",
      "actual": "...",
      "score": 1.0,
      "cost": 0.0004,
      "time": 0.3
    }
  ]
}
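Downstream tooling can consume a results file in this shape directly. A minimal post-processing sketch (the `summarize` helper and its one-line format are illustrative, not part of the framework; only the field names come from the structure above):

```python
import json

def summarize(path):
    """Load an evaluation results file and return a one-line summary."""
    with open(path) as f:
        data = json.load(f)
    m = data["aggregate_metrics"]
    return (f"{data['model_name']} / {data['capability']}: "
            f"accuracy={m['accuracy']:.1%}, cost=${m['total_cost']:.4f}, "
            f"time={m['total_time']}s over {m['test_count']} cases")
```

For the example run above this would print `gpt-3.5-turbo / reasoning: accuracy=85.0%, cost=$0.0045, time=3.2s over 10 cases`.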
Generate Command
The generate command creates synthetic test datasets for model evaluation.
Basic Syntax

Command Options

Usage Examples
# Generate basic reasoning tests
llm-eval generate --capability reasoning
# Output:
# Generating test dataset...
# Generated 5 test cases for reasoning
#
# 1. Solve this logic puzzle: If all roses are flowers...
#    Criteria: Logical reasoning, step-by-step thinking
#
# 2. What comes next in the sequence: 2, 4, 8, 16...
#    Criteria: Pattern recognition, mathematical reasoning

# Generate large coding dataset
llm-eval generate \
  --capability coding \
  --count 100 \
  --domain "web development" \
  --output coding_tests.json
# Output:
# Generating 100 test cases for coding...
# Domain: web development
# Generated 100 test cases successfully
# Saved to: coding_tests.json
Available Capabilities
| Capability | Description | Example Use Cases |
|---|---|---|
| reasoning | Logical thinking, problem-solving | Logic puzzles, math problems |
| creativity | Creative writing, ideation | Story writing, brainstorming |
| coding | Programming, code generation | Algorithm implementation, debugging |
| factual | Knowledge recall, fact checking | Historical facts, scientific data |
| instruction | Following complex instructions | Multi-step tasks, procedures |
Score Command
The score command evaluates predictions against reference answers using various metrics.
Basic Syntax

Command Options

Usage Examples

Available Metrics
| Metric | Description | Best For |
|---|---|---|
| accuracy | Exact match percentage | Classification, exact answers |
| f1 | Harmonic mean of precision/recall | Text similarity, partial matches |
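The framework's own scoring code isn't shown here, but these two metrics are conventionally computed as below (a sketch, not the framework's implementation; `token_f1` assumes simple whitespace tokenization):

```python
def accuracy(predictions, references):
    """Fraction of predictions that exactly match their reference."""
    matches = sum(p == r for p, r in zip(predictions, references))
    return matches / len(predictions)

def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall for one pair."""
    pred, ref = prediction.split(), reference.split()
    # Count tokens shared between prediction and reference
    common = sum(min(pred.count(t), ref.count(t)) for t in set(pred))
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For example, `accuracy(["a", "b"], ["a", "c"])` gives 0.5, matching the `llm-eval score` example later in this guide.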
List Command
The list command displays available resources and configurations.
Basic Syntax

Command Options

Usage Examples

Advanced CLI Usage

Automation & Scripting
Batch Evaluation Script:
#!/bin/bash
# batch_evaluate.sh
models=("gpt-3.5-turbo" "gpt-4" "claude-3")
capabilities=("reasoning" "creativity" "coding")
mkdir -p results  # ensure the output directory exists before writing results

for model in "${models[@]}"; do
  for capability in "${capabilities[@]}"; do
    echo "Evaluating $model on $capability..."
    llm-eval evaluate \
      --model "$model" \
      --capability "$capability" \
      --test-cases 20 \
      --output "results/${model}_${capability}.json" \
      --verbose
    echo "Completed $model/$capability"
  done
done

echo "Batch evaluation completed!"
Performance Monitoring:
#!/bin/bash
# monitor_performance.sh
# Daily performance check
today=$(date +%Y-%m-%d)
output_dir="monitoring/$today"
mkdir -p "$output_dir"
# Run evaluation suite
llm-eval evaluate \
  --model gpt-3.5-turbo \
  --test-cases 50 \
  --output "$output_dir/daily_check.json"

# Check for performance regression
python check_performance.py "$output_dir/daily_check.json"
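`check_performance.py` is not shipped with the framework; a minimal version might read the accuracy field from the results file and fail the run when it drops below a fixed threshold (the 0.80 threshold and the script's structure are assumptions):

```python
import json
import sys

THRESHOLD = 0.80  # assumed minimum acceptable accuracy

def check(path, threshold=THRESHOLD):
    """Return True if the run's aggregate accuracy meets the threshold."""
    with open(path) as f:
        accuracy = json.load(f)["aggregate_metrics"]["accuracy"]
    return accuracy >= threshold

if __name__ == "__main__" and len(sys.argv) > 1:
    ok = check(sys.argv[1])
    print("OK" if ok else "REGRESSION")
    sys.exit(0 if ok else 1)  # non-zero exit fails the monitoring job
```

Exiting non-zero lets the surrounding shell script (or a CI job) treat a regression as a failure.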
CI/CD Integration
GitHub Actions Example:
# .github/workflows/llm-evaluation.yml
name: LLM Model Evaluation

on:
  push:
    branches: [main]
  schedule:
    - cron: '0 8 * * *'  # Daily at 8 AM

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Install framework
        run: pip install llm-evaluation-framework
      - name: Run evaluation
        run: |
          llm-eval evaluate \
            --model gpt-3.5-turbo \
            --test-cases 25 \
            --output evaluation_results.json
      - name: Upload results
        uses: actions/upload-artifact@v3
        with:
          name: evaluation-results
          path: evaluation_results.json
Data Pipeline Integration
Apache Airflow DAG:
from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'data-team',
    'depends_on_past': False,
    'start_date': datetime(2024, 1, 1),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    'llm_evaluation_pipeline',
    default_args=default_args,
    description='Daily LLM evaluation pipeline',
    schedule_interval=timedelta(days=1),
)

# Generate test data
generate_tests = BashOperator(
    task_id='generate_test_data',
    bash_command='''
    llm-eval generate \
      --capability reasoning \
      --count 100 \
      --output /data/test_cases.json
    ''',
    dag=dag,
)

# Run evaluation
run_evaluation = BashOperator(
    task_id='run_evaluation',
    bash_command='''
    llm-eval evaluate \
      --model gpt-3.5-turbo \
      --test-cases 100 \
      --output /data/results.json
    ''',
    dag=dag,
)

generate_tests >> run_evaluation
Configuration & Environment

Configuration Files
The CLI supports configuration files for default settings:
~/.llm-eval/config.yaml:
# Default CLI configuration
defaults:
  model: gpt-3.5-turbo
  test_cases: 10
  capability: reasoning
  verbose: false

# Model configurations
models:
  gpt-3.5-turbo:
    provider: openai
    api_cost_input: 0.0015
    api_cost_output: 0.002
  gpt-4:
    provider: openai
    api_cost_input: 0.03
    api_cost_output: 0.06

# Output settings
output:
  format: json
  directory: ./results
  timestamp: true
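The `api_cost_input`/`api_cost_output` values above match OpenAI's published per-1K-token pricing units, so a run's cost can be estimated from token counts. A sketch under that per-1K assumption (`estimate_cost` is illustrative, not a framework function):

```python
def estimate_cost(input_tokens, output_tokens, rate_in, rate_out):
    """Estimate API cost, assuming rates are USD per 1K tokens."""
    return input_tokens / 1000 * rate_in + output_tokens / 1000 * rate_out
```

With the gpt-3.5-turbo rates from the config, 1000 input tokens and 500 output tokens come to `estimate_cost(1000, 500, 0.0015, 0.002)` = $0.0025.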
Environment Variables

| Variable | Description | Example |
|---|---|---|
| LLM_EVAL_CONFIG | Config file path | /path/to/config.yaml |
| LLM_EVAL_LOG_LEVEL | Logging level | DEBUG |
| LLM_EVAL_OUTPUT_DIR | Default output directory | ./evaluation_results |
# Set environment variables
export LLM_EVAL_CONFIG="$HOME/.llm-eval/config.yaml"
export LLM_EVAL_LOG_LEVEL="INFO"
export LLM_EVAL_OUTPUT_DIR="./results"
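A common CLI convention is for an explicit flag to override the environment variable, which in turn overrides the built-in default. A sketch of that resolution order (the precedence itself is an assumption about this CLI, and `resolve_output_dir` is illustrative):

```python
import os

def resolve_output_dir(cli_value=None):
    """Resolve output directory: CLI flag > LLM_EVAL_OUTPUT_DIR > default."""
    if cli_value:
        return cli_value
    return os.environ.get("LLM_EVAL_OUTPUT_DIR", "./results")
```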
Error Handling & Troubleshooting

Common CLI Issues
# Reinstall framework
pip install --upgrade llm-evaluation-framework
# Check PATH
which llm-eval
echo $PATH
# Check available models
llm-eval list --type models
# Register model first
python -c "
from llm_evaluation_framework import ModelRegistry
registry = ModelRegistry()
registry.register_model('unknown-model', {...})
"
Debug Mode
Enable verbose logging for troubleshooting:
# Enable debug output
llm-eval evaluate \
  --model gpt-3.5-turbo \
  --verbose
# Or set environment variable
export LLM_EVAL_LOG_LEVEL=DEBUG
llm-eval evaluate --model gpt-3.5-turbo
CLI Reference Summary

Quick Command Reference
# Installation & Setup
pip install llm-evaluation-framework
llm-eval --version
llm-eval --help
# Core Operations
llm-eval evaluate --model MODEL [OPTIONS]
llm-eval generate --capability CAPABILITY [OPTIONS]
llm-eval score --predictions [...] --references [...] [OPTIONS]
llm-eval list [--type TYPE]
# Common Workflows
llm-eval evaluate --model gpt-3.5-turbo --test-cases 10
llm-eval generate --capability coding --count 50 --output tests.json
llm-eval score --predictions "a" "b" --references "a" "c" --metric accuracy
Exit Codes

| Code | Meaning | Description |
|---|---|---|
| 0 | Success | Command completed successfully |
| 1 | General Error | Command failed due to an error |
| 2 | Invalid Usage | Invalid command-line arguments |
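Wrapper scripts can branch on these codes. A small illustrative helper (`run_and_classify` is not part of the framework; it works for any executable):

```python
import subprocess

# Exit-code table from the reference above
EXIT_MEANINGS = {0: "Success", 1: "General Error", 2: "Invalid Usage"}

def run_and_classify(cmd):
    """Run a command and translate its exit code via the table above."""
    code = subprocess.run(cmd).returncode
    return EXIT_MEANINGS.get(code, f"Unknown ({code})")
```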