Answers 1.6
Theory
- Evaluating LLM answers is necessary to understand their effectiveness, alignment with goals, and areas for improvement; assess accuracy, relevance, and completeness.
- Key metrics: accuracy, precision, recall, F1, and user satisfaction ratings. These guide product development and release decisions (a metrics sketch follows this list).
- The path to production is iterative: start with quick prototypes, find gaps, gradually increase complexity and dataset coverage. Practical value matters more than perfection.
- High‑stakes scenarios (medicine, law, finance) require stricter validation, bias detection/mitigation, and ethical review.
- Best practices: start small, iterate quickly, automate testing and quality checks.
- Automated tests speed up gold‑standard comparisons, surface errors, and provide continuous feedback (a minimal gold‑answer check closes this section).
- Choose metrics and rigor to match the application’s goals and risks; use heightened rigor for high stakes.
- A full evaluation framework includes a rubric, protocols (who/what/how), and gold‑standard comparison when needed.
- Advanced techniques: semantic similarity (embeddings), crowd evaluation, automated coherence/logic checks, and adaptive schemes tailored to the domain (an embedding sketch follows this list).
- Continuous evaluation and diverse test cases increase reliability and relevance across scenarios.
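A minimal sketch of the precision/recall/F1 metrics mentioned above, in plain Python over binary correctness labels (1 = correct, 0 = incorrect); in practice scikit‑learn's precision_recall_fscore_support covers the same ground:

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for binary labels (1 = correct)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```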
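A semantic‑similarity sketch, assuming the sentence-transformers package and the all-MiniLM-L6-v2 model; both are illustrative choices, and any embedding API can be substituted:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')  # assumed model choice

def semantic_similarity(answer: str, gold: str) -> float:
    """Cosine similarity between answer and gold embeddings; near 1.0 means near-identical meaning."""
    a, b = model.encode([answer, gold])
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```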
Practice (sketches)
- Rubric‑based evaluation function (per‑criterion scoring is left as a stub):

```python
def evaluate_response(response: str, rubric: dict) -> dict:
    """Score a response against a weighted rubric; returns per-criterion and overall results."""
    results = {}
    total_weight = sum(details['weight'] for details in rubric.values())
    total_score = 0
    for criterion, details in rubric.items():
        score = details.get('weight', 1)  # stub: replace with real scoring logic on `response`
        feedback = f"Stub feedback for {criterion}."
        results[criterion] = {'score': score, 'feedback': feedback}
        total_score += score * details['weight']
    results['overall'] = {
        'weighted_average_score': total_score / total_weight,
        'feedback': 'Overall feedback based on the rubric.',
    }
    return results
```
- Rubric template:

```python
rubric = {
    'accuracy': {'weight': 3},
    'relevance': {'weight': 2},
    'completeness': {'weight': 3},
    'coherence': {'weight': 2},
}
```
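Putting the function and template together; the answer string is a placeholder, and the printed value reflects the stub scoring (each score equals its weight):

```python
results = evaluate_response("Paris is the capital of France.", rubric)
print(results['overall']['weighted_average_score'])  # 2.6 = (9 + 4 + 9 + 4) / 10
```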
- The ideal (gold) answer serves as a comparison point for weighted scoring and textual feedback; a minimal check is sketched below.
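A minimal gold‑answer check combining exact match with the semantic_similarity sketch above; the 0.8 threshold is an assumed starting point to tune per domain, not a recommendation:

```python
def check_against_gold(answer: str, gold: str, threshold: float = 0.8) -> dict:
    """Pass if the answer matches the gold answer exactly or is semantically close."""
    exact = answer.strip().lower() == gold.strip().lower()
    similarity = semantic_similarity(answer, gold)  # from the embedding sketch above
    return {
        'exact_match': exact,
        'similarity': similarity,
        'passed': exact or similarity >= threshold,
    }
```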