
# 🖥️ CLI Usage Guide

![CLI Usage](https://img.shields.io/badge/CLI%20Usage-Master%20the%20Command%20Line-orange?style=for-the-badge&logo=terminal)

**Complete reference for all CLI commands and automation workflows**

## 📋 CLI Overview

The LLM Evaluation Framework provides a comprehensive command-line interface, `llm-eval`, that enables automation, batch processing, and integration with CI/CD pipelines. All framework functionality is available through intuitive CLI commands.

## 🎯 Core Commands

| Command | Purpose | Use Case |
|---------|---------|----------|
| **🔍 `evaluate`** | Run model evaluations | Performance testing, benchmarking |
| **🧪 `generate`** | Create test datasets | Data preparation, synthetic data |
| **📊 `score`** | Score predictions | Accuracy measurement, comparison |
| **📋 `list`** | Show available resources | Discovery, configuration |

## 🚀 Quick Start Commands

```bash
# Get help for any command
llm-eval --help
llm-eval evaluate --help

# Check CLI version and status
llm-eval --version

# List available capabilities and models
llm-eval list
```

๐Ÿ” Evaluate Command

The `evaluate` command runs comprehensive model evaluations with customizable parameters.

๐Ÿ“ Basic Syntax

```bash
llm-eval evaluate [OPTIONS]
```

### ⚙️ Command Options

| Option | Type | Required | Description | Example |
|--------|------|----------|-------------|---------|
| `--model` | string | ✅ | Model name to evaluate | `gpt-3.5-turbo` |
| `--test-cases` | integer | ❌ | Number of test cases (default: 5) | `10` |
| `--capability` | string | ❌ | Capability to test (default: reasoning) | `coding` |
| `--output` | path | ❌ | Output file for results | `results.json` |
| `--verbose` | flag | ❌ | Enable detailed logging | N/A |

### 🎯 Usage Examples

```bash
# Simple model evaluation
llm-eval evaluate --model gpt-3.5-turbo

# Output:
# 🚀 Starting evaluation for gpt-3.5-turbo...
# ✅ Generated 5 test cases for reasoning
# ⚡ Running evaluation...
# 📊 Results: 80% accuracy, $0.0032 cost, 2.5s time

# Evaluate with more test cases
llm-eval evaluate \
  --model gpt-4 \
  --test-cases 20 \
  --capability creativity

# Output:
# 🚀 Starting evaluation for gpt-4...
# ✅ Generated 20 test cases for creativity
# ⚡ Running evaluation...
# 📊 Results: 92% accuracy, $0.0156 cost, 8.2s time

# Save detailed results
llm-eval evaluate \
  --model claude-3 \
  --test-cases 15 \
  --capability reasoning \
  --output claude_evaluation.json \
  --verbose

# Creates: claude_evaluation.json with full results

# Test different capabilities
for capability in reasoning creativity coding; do
  llm-eval evaluate \
    --model gpt-3.5-turbo \
    --capability "$capability" \
    --output "results_${capability}.json"
done
```

### 📊 Output Format

**Console Output:**

```
🚀 Starting evaluation for gpt-3.5-turbo...
✅ Model registered successfully
🧪 Generated 10 test cases for reasoning
⚡ Running evaluation...

📊 Evaluation Results:
   • Model: gpt-3.5-turbo
   • Capability: reasoning
   • Test Cases: 10
   • Accuracy: 85.0%
   • Total Cost: $0.0045
   • Total Time: 3.2s
   • Average Response Time: 0.32s/case

✅ Evaluation completed successfully!
💾 Results saved to: evaluation_results.json
```

**JSON Output Structure:**

```json
{
  "model_name": "gpt-3.5-turbo",
  "capability": "reasoning",
  "aggregate_metrics": {
    "accuracy": 0.85,
    "total_cost": 0.0045,
    "total_time": 3.2,
    "test_count": 10
  },
  "test_results": [
    {
      "test_id": 1,
      "prompt": "...",
      "expected": "...",
      "actual": "...",
      "score": 1.0,
      "cost": 0.0004,
      "time": 0.3
    }
  ]
}
```
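For downstream analysis, the results file can be consumed with a few lines of Python. A minimal sketch, assuming the field layout shown above (the sample payload is inlined here; in practice you would `json.load` the output file):

```python
import json

# Inlined sample matching the structure above; in practice:
# results = json.load(open("evaluation_results.json"))
results = json.loads("""
{
  "model_name": "gpt-3.5-turbo",
  "aggregate_metrics": {"accuracy": 0.85, "total_cost": 0.0045,
                        "total_time": 3.2, "test_count": 10},
  "test_results": [
    {"test_id": 1, "score": 1.0, "cost": 0.0004, "time": 0.3},
    {"test_id": 2, "score": 0.0, "cost": 0.0005, "time": 0.4}
  ]
}
""")

metrics = results["aggregate_metrics"]
cost_per_case = metrics["total_cost"] / metrics["test_count"]
failed = [t["test_id"] for t in results["test_results"] if t["score"] < 1.0]

print(f"{results['model_name']}: {metrics['accuracy']:.1%} accuracy, "
      f"${cost_per_case:.5f}/case, failed: {failed}")
# → gpt-3.5-turbo: 85.0% accuracy, $0.00045/case, failed: [2]
```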


## 🧪 Generate Command

The `generate` command creates synthetic test datasets for model evaluation.

๐Ÿ“ Basic Syntax

```bash
llm-eval generate [OPTIONS]
```

### ⚙️ Command Options

| Option | Type | Required | Description | Example |
|--------|------|----------|-------------|---------|
| `--capability` | string | ❌ | Target capability (default: reasoning) | `coding` |
| `--count` | integer | ❌ | Number of test cases (default: 5) | `50` |
| `--domain` | string | ❌ | Domain context (default: general) | `medical` |
| `--output` | path | ❌ | Output file path | `dataset.json` |
| `--format` | string | ❌ | Output format (json/csv) | `json` |

### 🎯 Usage Examples

```bash
# Generate basic reasoning tests
llm-eval generate --capability reasoning

# Output:
# 🧪 Generating test dataset...
# ✅ Generated 5 test cases for reasoning
#
# 1. Solve this logic puzzle: If all roses are flowers...
#    Criteria: Logical reasoning, step-by-step thinking
#
# 2. What comes next in the sequence: 2, 4, 8, 16...
#    Criteria: Pattern recognition, mathematical reasoning

# Generate a large coding dataset
llm-eval generate \
  --capability coding \
  --count 100 \
  --domain "web development" \
  --output coding_tests.json

# Output:
# 🧪 Generating 100 test cases for coding...
# 📊 Domain: web development
# ✅ Generated 100 test cases successfully
# 💾 Saved to: coding_tests.json

# Generate tests for all capabilities
for cap in reasoning creativity coding factual instruction; do
  llm-eval generate \
    --capability "$cap" \
    --count 25 \
    --output "tests_${cap}.json"
done
```

### 📋 Available Capabilities

| Capability | Description | Example Use Cases |
|------------|-------------|-------------------|
| `reasoning` | Logical thinking, problem-solving | Logic puzzles, math problems |
| `creativity` | Creative writing, ideation | Story writing, brainstorming |
| `coding` | Programming, code generation | Algorithm implementation, debugging |
| `factual` | Knowledge recall, fact checking | Historical facts, scientific data |
| `instruction` | Following complex instructions | Multi-step tasks, procedures |

## 📊 Score Command

The `score` command evaluates predictions against reference answers using various metrics.

๐Ÿ“ Basic Syntax

```bash
llm-eval score [OPTIONS]
```

### ⚙️ Command Options

| Option | Type | Required | Description | Example |
|--------|------|----------|-------------|---------|
| `--predictions` | list | ✅ | Model predictions | `"pred1" "pred2"` |
| `--references` | list | ✅ | Reference answers | `"ref1" "ref2"` |
| `--metric` | string | ❌ | Scoring metric (default: accuracy) | `f1` |
| `--output` | path | ❌ | Save scores to file | `scores.json` |

### 🎯 Usage Examples

```bash
# Simple accuracy calculation
llm-eval score \
  --predictions "The sky is blue" "Paris is in France" \
  --references "The sky is blue" "Paris is the capital of France"

# Output:
# 📊 Scoring Results
# Average Score
# Accuracy score: 0.5000 (50%)

# Use the F1 scoring metric
llm-eval score \
  --predictions "Machine learning is AI" "Python is great" \
  --references "ML is artificial intelligence" "Python is awesome" \
  --metric f1

# Output:
# 📊 Scoring Results
# Average Score
# F1 score: 0.7500 (75%)

# Score predictions from files (one item per line);
# mapfile keeps each line intact as a single argument
mapfile -t predictions < predictions.txt
mapfile -t references < references.txt

llm-eval score \
  --predictions "${predictions[@]}" \
  --references "${references[@]}" \
  --metric accuracy \
  --output batch_scores.json
```

### 📈 Available Metrics

| Metric | Description | Best For |
|--------|-------------|----------|
| `accuracy` | Exact match percentage | Classification, exact answers |
| `f1` | Harmonic mean of precision/recall | Text similarity, partial matches |
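The framework's internal scorers aren't reproduced here, but the usual formulations behind these two metrics can be sketched as follows (exact-match accuracy and token-level F1; an illustrative approximation, not the framework's actual code):

```python
def exact_match(pred: str, ref: str) -> float:
    """Accuracy: 1.0 only if the strings match exactly (after trimming)."""
    return float(pred.strip() == ref.strip())

def token_f1(pred: str, ref: str) -> float:
    """Token-level F1: harmonic mean of precision and recall
    over the unique tokens shared by prediction and reference."""
    pred_tokens = pred.lower().split()
    ref_tokens = ref.lower().split()
    common = set(pred_tokens) & set(ref_tokens)
    if not common:
        return 0.0
    precision = len(common) / len(pred_tokens)
    recall = len(common) / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

On the earlier example pair, `token_f1("Python is great", "Python is awesome")` scores 2/3: two of three tokens overlap, so precision and recall are both 2/3, while `exact_match` gives 0.0.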

## 📋 List Command

The `list` command displays available resources and configurations.

๐Ÿ“ Basic Syntax

```bash
llm-eval list [OPTIONS]
```

### ⚙️ Command Options

| Option | Type | Required | Description | Example |
|--------|------|----------|-------------|---------|
| `--type` | string | ❌ | Resource type to list | `models` |
| `--format` | string | ❌ | Output format (table/json) | `json` |

### 🎯 Usage Examples

```bash
# Show all available resources
llm-eval list

# Output:
# Available Capabilities
#   - reasoning
#   - creativity
#   - factual
#   - instruction
#   - coding

# List models only
llm-eval list --type models

# List scoring strategies
llm-eval list --type strategies

# List capabilities
llm-eval list --type capabilities

# Get machine-readable output
llm-eval list --format json

# Output:
# {
#   "capabilities": ["reasoning", "creativity", "coding"],
#   "models": [],
#   "strategies": ["accuracy", "f1"]
# }
```

## 🔧 Advanced CLI Usage

### 🚀 Automation & Scripting

**Batch Evaluation Script:**

```bash
#!/bin/bash
# batch_evaluate.sh

models=("gpt-3.5-turbo" "gpt-4" "claude-3")
capabilities=("reasoning" "creativity" "coding")

for model in "${models[@]}"; do
  for capability in "${capabilities[@]}"; do
    echo "🚀 Evaluating $model on $capability..."

    llm-eval evaluate \
      --model "$model" \
      --capability "$capability" \
      --test-cases 20 \
      --output "results/${model}_${capability}.json" \
      --verbose

    echo "✅ Completed $model/$capability"
  done
done

echo "🎉 Batch evaluation completed!"
```

**Performance Monitoring:**

```bash
#!/bin/bash
# monitor_performance.sh

# Daily performance check
today=$(date +%Y-%m-%d)
output_dir="monitoring/$today"
mkdir -p "$output_dir"

# Run evaluation suite
llm-eval evaluate \
  --model gpt-3.5-turbo \
  --test-cases 50 \
  --output "$output_dir/daily_check.json"

# Check for performance regression
python check_performance.py "$output_dir/daily_check.json"
```
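`check_performance.py` is a user-supplied script, not part of the framework. A minimal sketch of what such a regression gate could look like, assuming the JSON layout shown earlier and an arbitrary 80% accuracy threshold:

```python
# check_performance.py (sketch): fail when accuracy drops below a baseline
import json
import sys

THRESHOLD = 0.80  # arbitrary baseline; tune to your model's history

def check(path: str) -> int:
    """Return 0 if accuracy meets the threshold, 1 otherwise (CI-friendly)."""
    with open(path) as f:
        results = json.load(f)
    accuracy = results["aggregate_metrics"]["accuracy"]
    if accuracy < THRESHOLD:
        print(f"REGRESSION: accuracy {accuracy:.1%} is below {THRESHOLD:.0%}")
        return 1
    print(f"OK: accuracy {accuracy:.1%}")
    return 0

if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(check(sys.argv[1]))
```

Returning the exit code (rather than just printing) lets the shell script or CI step fail the build when a regression is detected.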

### 🔄 CI/CD Integration

**GitHub Actions Example:**

```yaml
# .github/workflows/llm-evaluation.yml
name: LLM Model Evaluation

on:
  push:
    branches: [main]
  schedule:
    - cron: '0 8 * * *'  # Daily at 8 AM

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v3

    - name: Setup Python
      uses: actions/setup-python@v4
      with:
        python-version: '3.10'

    - name: Install framework
      run: pip install llm-evaluation-framework

    - name: Run evaluation
      run: |
        llm-eval evaluate \
          --model gpt-3.5-turbo \
          --test-cases 25 \
          --output evaluation_results.json

    - name: Upload results
      uses: actions/upload-artifact@v3
      with:
        name: evaluation-results
        path: evaluation_results.json
```

### 📊 Data Pipeline Integration

**Apache Airflow DAG:**

```python
from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'data-team',
    'depends_on_past': False,
    'start_date': datetime(2024, 1, 1),
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}

dag = DAG(
    'llm_evaluation_pipeline',
    default_args=default_args,
    description='Daily LLM evaluation pipeline',
    schedule_interval=timedelta(days=1)
)

# Generate test data
generate_tests = BashOperator(
    task_id='generate_test_data',
    bash_command='''
    llm-eval generate \
      --capability reasoning \
      --count 100 \
      --output /data/test_cases.json
    ''',
    dag=dag
)

# Run evaluation
run_evaluation = BashOperator(
    task_id='run_evaluation',
    bash_command='''
    llm-eval evaluate \
      --model gpt-3.5-turbo \
      --test-cases 100 \
      --output /data/results.json
    ''',
    dag=dag
)

generate_tests >> run_evaluation
```

## 🛠️ Configuration & Environment

๐Ÿ“ Configuration Files

The CLI supports configuration files for default settings:

**`~/.llm-eval/config.yaml`:**

```yaml
# Default CLI configuration
defaults:
  model: gpt-3.5-turbo
  test_cases: 10
  capability: reasoning
  verbose: false

# Model configurations
models:
  gpt-3.5-turbo:
    provider: openai
    api_cost_input: 0.0015
    api_cost_output: 0.002

  gpt-4:
    provider: openai
    api_cost_input: 0.03
    api_cost_output: 0.06

# Output settings
output:
  format: json
  directory: ./results
  timestamp: true
```

๐ŸŒ Environment Variables

| Variable | Description | Example |
|----------|-------------|---------|
| `LLM_EVAL_CONFIG` | Config file path | `/path/to/config.yaml` |
| `LLM_EVAL_LOG_LEVEL` | Logging level | `DEBUG` |
| `LLM_EVAL_OUTPUT_DIR` | Default output directory | `./evaluation_results` |

```bash
# Set environment variables
export LLM_EVAL_CONFIG="$HOME/.llm-eval/config.yaml"
export LLM_EVAL_LOG_LEVEL="INFO"
export LLM_EVAL_OUTPUT_DIR="./results"
```
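How config files, environment variables, and flags interact is not spelled out above; a common convention is that an explicit CLI flag overrides the environment variable, which overrides the built-in default. A hypothetical resolver sketch (the variable name comes from the table; the function itself is illustrative, not framework API):

```python
import os

DEFAULT_OUTPUT_DIR = "./evaluation_results"  # built-in default from the table

def resolve_output_dir(cli_value=None):
    """CLI flag wins, then LLM_EVAL_OUTPUT_DIR, then the built-in default."""
    if cli_value:
        return cli_value
    return os.environ.get("LLM_EVAL_OUTPUT_DIR", DEFAULT_OUTPUT_DIR)
```

For example, with `LLM_EVAL_OUTPUT_DIR=./results` set, `resolve_output_dir()` returns `./results`, but `resolve_output_dir("./custom")` still returns `./custom`.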

## 🚨 Error Handling & Troubleshooting

### Common CLI Issues

#### โŒ **Command not found: llm-eval** **Solution:**
# Reinstall framework
pip install --upgrade llm-evaluation-framework

# Check PATH
which llm-eval
echo $PATH
#### โŒ **Invalid model name** **Error:** `Model 'unknown-model' not found` **Solution:**
# Check available models
llm-eval list --type models

# Register model first
python -c "
from llm_evaluation_framework import ModelRegistry
registry = ModelRegistry()
registry.register_model('unknown-model', {...})
"
#### โŒ **Insufficient test cases** **Error:** `Number of test cases must be positive` **Solution:**
# Use positive integer
llm-eval evaluate --test-cases 5  # โœ… Valid
llm-eval evaluate --test-cases 0  # โŒ Invalid
#### โŒ **File permission errors** **Solution:**
# Check permissions
ls -la output_directory/

# Create directory with proper permissions
mkdir -p results && chmod 755 results

๐Ÿ“ Debug Mode

Enable verbose logging for troubleshooting:

```bash
# Enable debug output
llm-eval evaluate \
  --model gpt-3.5-turbo \
  --verbose

# Or set environment variable
export LLM_EVAL_LOG_LEVEL=DEBUG
llm-eval evaluate --model gpt-3.5-turbo
```

## 📚 CLI Reference Summary

### Quick Command Reference

```bash
# Installation & Setup
pip install llm-evaluation-framework
llm-eval --version
llm-eval --help

# Core Operations
llm-eval evaluate --model MODEL [OPTIONS]
llm-eval generate --capability CAPABILITY [OPTIONS]
llm-eval score --predictions [...] --references [...] [OPTIONS]
llm-eval list [--type TYPE]

# Common Workflows
llm-eval evaluate --model gpt-3.5-turbo --test-cases 10
llm-eval generate --capability coding --count 50 --output tests.json
llm-eval score --predictions "a" "b" --references "a" "c" --metric accuracy
```

### Exit Codes

| Code | Meaning | Description |
|------|---------|-------------|
| `0` | Success | Command completed successfully |
| `1` | General Error | Command failed due to an error |
| `2` | Invalid Usage | Invalid command-line arguments |
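In automation, these codes can gate downstream steps. A small sketch using Python's `subprocess`; the exit-code mapping follows the table, and the `python -c 'raise SystemExit(...)'` call is only a portable stand-in for a real `llm-eval` invocation:

```python
import subprocess
import sys

# Exit-code meanings from the table above
EXIT_MEANINGS = {0: "Success", 1: "General Error", 2: "Invalid Usage"}

def run_cli(argv):
    """Run a command and translate its exit code using the table above."""
    proc = subprocess.run(argv)
    return EXIT_MEANINGS.get(proc.returncode, f"Unknown ({proc.returncode})")

# Stand-in for: run_cli(["llm-eval", "evaluate", "--model", "gpt-3.5-turbo"])
status = run_cli([sys.executable, "-c", "raise SystemExit(0)"])
print(status)  # → Success
```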

## 🎯 Master the CLI

**You now have complete knowledge of the CLI interface!**

**Ready to explore more?**

[![Examples](https://img.shields.io/badge/Try-Advanced%20Examples-green?style=for-the-badge)](examples.md) [![API Reference](https://img.shields.io/badge/View-API%20Reference-blue?style=for-the-badge)](api-reference.md) [![Core Concepts](https://img.shields.io/badge/Learn-Core%20Concepts-orange?style=for-the-badge)](core-concepts.md)

---

*Automate your LLM evaluations with powerful CLI workflows! 🚀*