CLI Usage Guide

CLI Overview
The LLM Evaluation Framework provides a comprehensive command-line interface (llm-eval) that enables automation, batch processing, and integration with CI/CD pipelines. All framework functionality is available through intuitive CLI commands.
Core Commands

Quick Start Commands
# Get help for any command
llm-eval --help
llm-eval evaluate --help
# Check CLI version and status
llm-eval --version
# List available capabilities and models
llm-eval list
Evaluate Command
The evaluate command runs comprehensive model evaluations with customizable parameters.
Basic Syntax

Command Options

Usage Examples

Output Format
Console Output:
Starting evaluation for gpt-3.5-turbo...
Model registered successfully
Generated 10 test cases for reasoning
Running evaluation...

Evaluation Results:
• Model: gpt-3.5-turbo
• Capability: reasoning
• Test Cases: 10
• Accuracy: 85.0%
• Total Cost: $0.0045
• Total Time: 3.2s
• Average Response Time: 0.32s/case

Evaluation completed successfully!
Results saved to: evaluation_results.json
JSON Output Structure:
{
  "model_name": "gpt-3.5-turbo",
  "capability": "reasoning",
  "aggregate_metrics": {
    "accuracy": 0.85,
    "total_cost": 0.0045,
    "total_time": 3.2,
    "test_count": 10
  },
  "test_results": [
    {
      "test_id": 1,
      "prompt": "...",
      "expected": "...",
      "actual": "...",
      "score": 1.0,
      "cost": 0.0004,
      "time": 0.3
    }
  ]
}
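Downstream tooling can consume a results file in this shape directly. A minimal post-processing sketch (the `summarize` helper and its one-line format are illustrative, not part of the framework; only the field names come from the structure above):

```python
import json

def summarize(path):
    """Load an evaluation results file and return a one-line summary."""
    with open(path) as f:
        data = json.load(f)
    m = data["aggregate_metrics"]
    return (f"{data['model_name']} / {data['capability']}: "
            f"accuracy={m['accuracy']:.1%}, cost=${m['total_cost']:.4f}, "
            f"time={m['total_time']}s over {m['test_count']} cases")
```

For the example run above this would print `gpt-3.5-turbo / reasoning: accuracy=85.0%, cost=$0.0045, time=3.2s over 10 cases`.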
Generate Command
The generate command creates synthetic test datasets for model evaluation.
Basic Syntax

Command Options

Usage Examples
# Generate basic reasoning tests
llm-eval generate --capability reasoning
# Output:
# Generating test dataset...
# Generated 5 test cases for reasoning
#
# 1. Solve this logic puzzle: If all roses are flowers...
#    Criteria: Logical reasoning, step-by-step thinking
#
# 2. What comes next in the sequence: 2, 4, 8, 16...
#    Criteria: Pattern recognition, mathematical reasoning

# Generate large coding dataset
llm-eval generate \
  --capability coding \
  --count 100 \
  --domain "web development" \
  --output coding_tests.json
# Output:
# Generating 100 test cases for coding...
# Domain: web development
# Generated 100 test cases successfully
# Saved to: coding_tests.json
Available Capabilities
| Capability | Description | Example Use Cases |
|---|---|---|
| reasoning | Logical thinking, problem-solving | Logic puzzles, math problems |
| creativity | Creative writing, ideation | Story writing, brainstorming |
| coding | Programming, code generation | Algorithm implementation, debugging |
| factual | Knowledge recall, fact checking | Historical facts, scientific data |
| instruction | Following complex instructions | Multi-step tasks, procedures |
Score Command
The score command evaluates predictions against reference answers using various metrics.
Basic Syntax

Command Options

Usage Examples

Available Metrics
| Metric | Description | Best For |
|---|---|---|
| accuracy | Exact match percentage | Classification, exact answers |
| f1 | Harmonic mean of precision/recall | Text similarity, partial matches |
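The framework's own scoring code isn't shown here, but these two metrics are conventionally computed as below (a sketch, not the framework's implementation; `token_f1` assumes simple whitespace tokenization):

```python
def accuracy(predictions, references):
    """Fraction of predictions that exactly match their reference."""
    matches = sum(p == r for p, r in zip(predictions, references))
    return matches / len(predictions)

def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall for one pair."""
    pred, ref = prediction.split(), reference.split()
    # Count tokens shared between prediction and reference
    common = sum(min(pred.count(t), ref.count(t)) for t in set(pred))
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For example, `accuracy(["a", "b"], ["a", "c"])` gives 0.5, matching the `llm-eval score` example later in this guide.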
List Command
The list command displays available resources and configurations.
Basic Syntax

Command Options

Usage Examples

Advanced CLI Usage

Automation & Scripting
Batch Evaluation Script:
#!/bin/bash
# batch_evaluate.sh
models=("gpt-3.5-turbo" "gpt-4" "claude-3")
capabilities=("reasoning" "creativity" "coding")
mkdir -p results  # ensure the output directory exists before writing results

for model in "${models[@]}"; do
  for capability in "${capabilities[@]}"; do
    echo "Evaluating $model on $capability..."
    llm-eval evaluate \
      --model "$model" \
      --capability "$capability" \
      --test-cases 20 \
      --output "results/${model}_${capability}.json" \
      --verbose
    echo "Completed $model/$capability"
  done
done

echo "Batch evaluation completed!"
Performance Monitoring:
#!/bin/bash
# monitor_performance.sh
# Daily performance check
today=$(date +%Y-%m-%d)
output_dir="monitoring/$today"
mkdir -p "$output_dir"
# Run evaluation suite
llm-eval evaluate \
  --model gpt-3.5-turbo \
  --test-cases 50 \
  --output "$output_dir/daily_check.json"

# Check for performance regression
python check_performance.py "$output_dir/daily_check.json"
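`check_performance.py` is not shipped with the framework; a minimal version might read the accuracy field from the results file and fail the run when it drops below a fixed threshold (the 0.80 threshold and the script's structure are assumptions):

```python
import json
import sys

THRESHOLD = 0.80  # assumed minimum acceptable accuracy

def check(path, threshold=THRESHOLD):
    """Return True if the run's aggregate accuracy meets the threshold."""
    with open(path) as f:
        accuracy = json.load(f)["aggregate_metrics"]["accuracy"]
    return accuracy >= threshold

if __name__ == "__main__" and len(sys.argv) > 1:
    ok = check(sys.argv[1])
    print("OK" if ok else "REGRESSION")
    sys.exit(0 if ok else 1)  # non-zero exit fails the monitoring job
```

Exiting non-zero lets the surrounding shell script (or a CI job) treat a regression as a failure.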
CI/CD Integration
GitHub Actions Example:
# .github/workflows/llm-evaluation.yml
name: LLM Model Evaluation

on:
  push:
    branches: [main]
  schedule:
    - cron: '0 8 * * *'  # Daily at 8 AM

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Install framework
        run: pip install llm-evaluation-framework
      - name: Run evaluation
        run: |
          llm-eval evaluate \
            --model gpt-3.5-turbo \
            --test-cases 25 \
            --output evaluation_results.json
      - name: Upload results
        uses: actions/upload-artifact@v3
        with:
          name: evaluation-results
          path: evaluation_results.json
Data Pipeline Integration
Apache Airflow DAG:
from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'data-team',
    'depends_on_past': False,
    'start_date': datetime(2024, 1, 1),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    'llm_evaluation_pipeline',
    default_args=default_args,
    description='Daily LLM evaluation pipeline',
    schedule_interval=timedelta(days=1),
)

# Generate test data
generate_tests = BashOperator(
    task_id='generate_test_data',
    bash_command='''
    llm-eval generate \
      --capability reasoning \
      --count 100 \
      --output /data/test_cases.json
    ''',
    dag=dag,
)

# Run evaluation
run_evaluation = BashOperator(
    task_id='run_evaluation',
    bash_command='''
    llm-eval evaluate \
      --model gpt-3.5-turbo \
      --test-cases 100 \
      --output /data/results.json
    ''',
    dag=dag,
)

generate_tests >> run_evaluation
Configuration & Environment

Configuration Files
The CLI supports configuration files for default settings:
~/.llm-eval/config.yaml:
# Default CLI configuration
defaults:
  model: gpt-3.5-turbo
  test_cases: 10
  capability: reasoning
  verbose: false

# Model configurations
models:
  gpt-3.5-turbo:
    provider: openai
    api_cost_input: 0.0015
    api_cost_output: 0.002
  gpt-4:
    provider: openai
    api_cost_input: 0.03
    api_cost_output: 0.06

# Output settings
output:
  format: json
  directory: ./results
  timestamp: true
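The `api_cost_input`/`api_cost_output` values above match OpenAI's published per-1K-token pricing units, so a run's cost can be estimated from token counts. A sketch under that per-1K assumption (`estimate_cost` is illustrative, not a framework function):

```python
def estimate_cost(input_tokens, output_tokens, rate_in, rate_out):
    """Estimate API cost, assuming rates are USD per 1K tokens."""
    return input_tokens / 1000 * rate_in + output_tokens / 1000 * rate_out
```

With the gpt-3.5-turbo rates from the config, 1000 input tokens and 500 output tokens come to `estimate_cost(1000, 500, 0.0015, 0.002)` = $0.0025.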
Environment Variables

| Variable | Description | Example |
|---|---|---|
| LLM_EVAL_CONFIG | Config file path | /path/to/config.yaml |
| LLM_EVAL_LOG_LEVEL | Logging level | DEBUG |
| LLM_EVAL_OUTPUT_DIR | Default output directory | ./evaluation_results |
# Set environment variables
export LLM_EVAL_CONFIG="$HOME/.llm-eval/config.yaml"
export LLM_EVAL_LOG_LEVEL="INFO"
export LLM_EVAL_OUTPUT_DIR="./results"
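A common CLI convention is for an explicit flag to override the environment variable, which in turn overrides the built-in default. A sketch of that resolution order (the precedence itself is an assumption about this CLI, and `resolve_output_dir` is illustrative):

```python
import os

def resolve_output_dir(cli_value=None):
    """Resolve output directory: CLI flag > LLM_EVAL_OUTPUT_DIR > default."""
    if cli_value:
        return cli_value
    return os.environ.get("LLM_EVAL_OUTPUT_DIR", "./results")
```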
Error Handling & Troubleshooting

Common CLI Issues
# Reinstall framework
pip install --upgrade llm-evaluation-framework
# Check PATH
which llm-eval
echo $PATH
# Check available models
llm-eval list --type models
# Register model first
python -c "
from llm_evaluation_framework import ModelRegistry
registry = ModelRegistry()
registry.register_model('unknown-model', {...})
"
Debug Mode
Enable verbose logging for troubleshooting:
# Enable debug output
llm-eval evaluate \
  --model gpt-3.5-turbo \
  --verbose
# Or set environment variable
export LLM_EVAL_LOG_LEVEL=DEBUG
llm-eval evaluate --model gpt-3.5-turbo
CLI Reference Summary

Quick Command Reference
# Installation & Setup
pip install llm-evaluation-framework
llm-eval --version
llm-eval --help
# Core Operations
llm-eval evaluate --model MODEL [OPTIONS]
llm-eval generate --capability CAPABILITY [OPTIONS]
llm-eval score --predictions [...] --references [...] [OPTIONS]
llm-eval list [--type TYPE]
# Common Workflows
llm-eval evaluate --model gpt-3.5-turbo --test-cases 10
llm-eval generate --capability coding --count 50 --output tests.json
llm-eval score --predictions "a" "b" --references "a" "c" --metric accuracy
Exit Codes

| Code | Meaning | Description |
|---|---|---|
| 0 | Success | Command completed successfully |
| 1 | General Error | Command failed due to an error |
| 2 | Invalid Usage | Invalid command-line arguments |
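Wrapper scripts can branch on these codes. A small illustrative helper (`run_and_classify` is not part of the framework; it works for any executable):

```python
import subprocess

# Exit-code table from the reference above
EXIT_MEANINGS = {0: "Success", 1: "General Error", 2: "Invalid Usage"}

def run_and_classify(cmd):
    """Run a command and translate its exit code via the table above."""
    code = subprocess.run(cmd).returncode
    return EXIT_MEANINGS.get(code, f"Unknown ({code})")
```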