1.6 Building and Evaluating LLM Applications
Building applications powered by large language models (LLMs) requires more than clean integration; it also needs systematic quality evaluation that covers both objective and subjective aspects. In practice, you combine accuracy, recall, and F1 (when gold answers are available) with user ratings and customer satisfaction metrics such as CSAT, while also tracking operational indicators like cost and latency. This blend exposes weak spots, informs release decisions, and guides targeted improvements.
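When gold labels exist, these metrics are straightforward to compute with standard tooling. The sketch below is a minimal example, assuming scikit-learn is available; the function name, field layout, and the p95 percentile choice are illustrative rather than part of any fixed API:
import math

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def score_run(gold_labels, predicted_labels, latencies_s, cost_usd):
    """Aggregate objective quality metrics and operational indicators for one evaluation run."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        gold_labels, predicted_labels, average="macro", zero_division=0
    )
    p95_index = math.ceil(0.95 * len(latencies_s)) - 1  # index of the 95th-percentile latency
    return {
        "accuracy": accuracy_score(gold_labels, predicted_labels),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "p95_latency_s": sorted(latencies_s)[p95_index],
        "total_cost_usd": cost_usd,
    }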
The typical path to production starts with simple prompts and a small dataset for quick iteration; then you broaden coverage, add more challenging scenarios, and refine metrics and quality criteria, remembering that perfection isn’t always necessary: it is often enough to consistently solve the target tasks within quality and budget constraints. In high-stakes domains (medicine, law enforcement, finance), stricter validation becomes essential: random sampling and hold-out tests, bias and error checks, and attention to ethical and legal issues, including preventing harm, ensuring explainability, and enabling audit.
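A minimal sketch of the hold-out idea, using only the standard library and assuming each test case is a dict; the split ratio and seed are arbitrary choices:
import random

def split_holdout(test_cases, holdout_fraction=0.2, seed=42):
    """Randomly split evaluation cases into a development set and a hold-out set."""
    rng = random.Random(seed)
    shuffled = test_cases[:]
    rng.shuffle(shuffled)
    cutoff = int(len(shuffled) * (1 - holdout_fraction))
    return shuffled[:cutoff], shuffled[cutoff:]  # (dev_set, holdout_set)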
Good engineering style emphasizes modularity and fast iteration, automated regression tests and measurements, thoughtful metric selection aligned with business goals, and mandatory bias/fairness analysis with regular reviews.
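As one possible shape for such a regression test, the sketch below uses pytest and assumes hypothetical project modules (run_pipeline, load_regression_cases); the rubric evaluator it calls is defined later in this section:
import pytest

# Hypothetical imports: adjust to your own project layout.
from my_app.pipeline import run_pipeline
from my_app.evals import evaluate_response_against_detailed_rubric, load_regression_cases

# Frozen cases with known queries, context, and expected answers.
REGRESSION_CASES = load_regression_cases()

@pytest.mark.parametrize("case", REGRESSION_CASES)
def test_answer_quality_does_not_regress(case):
    answer = run_pipeline(case["customer_query"], case["context"])
    result = evaluate_response_against_detailed_rubric(case, answer)
    # Threshold is illustrative; tune it to your rubric's 0-10 scale.
    assert result["overall_score"] >= 7.0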
To make evaluation reproducible, use rubrics and evaluation protocols: define criteria in advance — relevance to user intent and context, factual correctness, completeness, and coherence/fluency — as well as the process, scales, and thresholds. For subjective tasks, use multiple independent raters and automatic consistency checks. Where possible, compare answers to ideal (expert) responses — a “gold standard” provides an anchor for more objective judgments. Here’s a small environment scaffold and a helper function for calling the model, useful for reproducible experiments and evaluations:
import os

from dotenv import load_dotenv
from openai import OpenAI

# Load OPENAI_API_KEY from a local .env file and create the client once.
load_dotenv()
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def fetch_llm_response(prompts, model="gpt-4o-mini", temperature=0, max_tokens=500):
    """Send a list of chat messages to the model and return the text of the first reply."""
    response = client.chat.completions.create(
        model=model,
        messages=prompts,
        temperature=temperature,
        max_tokens=max_tokens,
    )
    # In the openai>=1.0 SDK the message is an object, not a dict.
    return response.choices[0].message.content
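A quick usage check might look like this (the messages are just an example):
messages = [
    {"role": "system", "content": "You are a concise support assistant."},
    {"role": "user", "content": "How do I reset my password?"},
]
print(fetch_llm_response(messages, temperature=0))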
Next, formalize rubric‑based evaluation and assign weights to compute an overall score with detailed feedback. Below is a template where the model produces an assessment according to given criteria; the parsing is a stub and should be replaced with logic suited to your model’s output format:
def evaluate_response_against_detailed_rubric(test_data, llm_response):
    """
    Evaluate the answer on accuracy, relevance, completeness, and coherence.
    Return an overall score and detailed feedback.
    """
    rubric_criteria = {
        'accuracy': {'weight': 3, 'score': None, 'feedback': ''},
        'relevance': {'weight': 2, 'score': None, 'feedback': ''},
        'completeness': {'weight': 3, 'score': None, 'feedback': ''},
        'coherence': {'weight': 2, 'score': None, 'feedback': ''},
    }
    total_weight = sum(c['weight'] for c in rubric_criteria.values())

    system_prompt = "Assess the support agent’s answer given the provided context."
    evaluation_prompt = f"""\
[Question]: {test_data['customer_query']}
[Context]: {test_data['context']}
[Expected answers]: {test_data.get('expected_answers', 'N/A')}
[LLM answer]: {llm_response}
Evaluate the answer on accuracy, relevance, completeness, and coherence.
Provide scores (0–10) for each criterion and specific feedback.
"""
    evaluation_results = fetch_llm_response([
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": evaluation_prompt},
    ])

    # Parsing stub: replace with real parsing of evaluation_results
    # (e.g. ask the model for JSON output and use json.loads).
    for k in rubric_criteria:
        rubric_criteria[k]['score'] = 8
        rubric_criteria[k]['feedback'] = "Good performance on this criterion."

    overall = sum(v['score'] * v['weight'] for v in rubric_criteria.values()) / total_weight
    detailed = {k: {"score": v['score'], "feedback": v['feedback']} for k, v in rubric_criteria.items()}
    return {"overall_score": overall, "detailed_scores": detailed}
When you need a gold‑standard comparison, explicitly compare the model’s answer with the ideal expert answer and score high‑priority criteria (factual accuracy, alignment, completeness, coherence). Here’s a skeleton that returns both an aggregate score and the raw comparison text for audit:
def detailed_evaluation_against_ideal_answer(test_data, llm_response):
    """Compare the model's answer with an ideal (expert) answer and score weighted criteria."""
    criteria = {
        'factual_accuracy': {'weight': 4, 'score': None, 'feedback': ''},
        'alignment_with_ideal': {'weight': 3, 'score': None, 'feedback': ''},
        'completeness': {'weight': 3, 'score': None, 'feedback': ''},
        'coherence': {'weight': 2, 'score': None, 'feedback': ''},
    }
    total = sum(c['weight'] for c in criteria.values())

    system_prompt = "Compare the LLM answer to the ideal answer, focusing on factual content and alignment."
    comparison_prompt = f"""\
[Question]: {test_data['customer_query']}
[Ideal answer]: {test_data['ideal_answer']}
[LLM answer]: {llm_response}
"""
    evaluation_text = fetch_llm_response([
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": comparison_prompt},
    ])

    # Parsing stub: replace with real parsing of evaluation_text.
    for k in criteria:
        criteria[k]['score'] = 8
        criteria[k]['feedback'] = "Good alignment with the gold answer."

    score = sum(v['score'] * v['weight'] for v in criteria.values()) / total
    return {"overall_score": score, "details": criteria, "raw": evaluation_text}
On top of these basics, add advanced techniques: evaluate semantic similarity via embeddings and similarity metrics (not just surface overlap), bring in independent reviewers for crowd evaluation, include automated checks for coherence and logic, and build adaptive evaluation frameworks tailored to your domain and task types. In production, continuous evaluation is crucial: track version and metric history; close the loop from user feedback back to development; include diverse cases, edge cases, and cultural/linguistic variation; involve experts (including blind reviews to reduce bias); compare with alternative models; and employ specialized “judges” to detect contradictions and factual errors. Together, rigorous methods and constant iteration — plus rubrics, gold standards, expert reviews, and automated checks — help you build reliable and ethical systems.
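As one concrete way to score semantic similarity, the sketch below reuses the client defined earlier, embeds two answers with the OpenAI embeddings endpoint, and compares them by cosine similarity; the model name and threshold are illustrative choices:
import math

def embed(text, model="text-embedding-3-small"):
    """Return the embedding vector for a piece of text."""
    return client.embeddings.create(model=model, input=text).data[0].embedding

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def semantically_similar(answer, reference, threshold=0.8):
    """Flag whether two answers are close in meaning, not just in surface wording."""
    return cosine_similarity(embed(answer), embed(reference)) >= threshold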
Theory Questions
- Why evaluate LLM answers, and along which dimensions?
- Give examples of metrics and explain their role in development.
- What does the iterative path from development to production look like?
- Why do high-stakes scenarios demand greater rigor? Give examples.
- List best practices for bootstrapping, iteration, and automated testing.
- How do automated tests help development?
- Why should metrics be tuned to the specific task?
- How do you build a rubric and evaluation protocols?
- Which advanced evaluation techniques apply and why?
- How do continuous evaluation and broad test coverage improve reliability?
Practical Tasks
- Write a function that reads the API key from the environment, queries the LLM, and measures runtime and tokens used.