One of the main questions evaluators have when using AI is: "How can I know if it performs well?" To answer this question, we have developed a guidance note that captures what we have learned so far about integrating AI into evaluation and offers a framework for further exploration.
Since we began our experiments with Large Language Models (LLMs) in the Spring of 2023, we have made significant progress in using Generative Artificial Intelligence (GenAI) for processing and analyzing large volumes of text data in our evaluations. Currently, we are running multiple experiments across IEG to test the various capabilities of LLMs in different evaluation use cases.
As we continue to explore the strengths, limitations, and risks of this technology, we have partnered with others in the evaluation community to learn from each other and identify promising practices. The guidance note is the result of our collaboration with the Independent Office of Evaluation of the International Fund for Agricultural Development.
Our guidance note establishes clear criteria to judge AI performance and good practices for designing meaningful experiments. We use these criteria to test AI on key tasks such as text classification, summarization, and synthesis.
Assessing AI’s Performance
LLMs do not always provide appropriate or accurate responses, and their output must be thoroughly validated before use, especially in evaluation practice, where analytical rigor is crucial. The guidance note proposes several performance dimensions, depending on the task at hand.
For classification tasks, standard machine learning metrics like accuracy, precision, recall, balanced accuracy, and F1 scores are used (please see the guidance note for a description of metrics and scores). These metrics measure the overlap between machine-annotated 'predicted' labels and human-annotated 'ground-truth' labels. Separate training, validation, testing, and prediction sets are necessary to compute useful performance metrics and reduce bias.
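As a simple illustration, the snippet below computes these metrics with scikit-learn on a small, made-up set of binary labels; the example data and the choice of library are our own assumptions, not part of the guidance note.

```python
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical example: human-annotated ground-truth labels vs. labels
# predicted by the model for a binary classification task
# (1 = relevant to the evaluation question, 0 = not relevant).
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 0, 1]

print("Accuracy:         ", accuracy_score(y_true, y_pred))
print("Balanced accuracy:", balanced_accuracy_score(y_true, y_pred))
print("Precision:        ", precision_score(y_true, y_pred))
print("Recall:           ", recall_score(y_true, y_pred))
print("F1 score:         ", f1_score(y_true, y_pred))
```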
For text summarization, text synthesis, and information extraction tasks, producing human-annotated data as ground-truth is not always feasible. In such cases, we assess the quality of model responses using criteria from the field of Natural Language Generation, a subfield of AI: faithfulness, relevance, and coherence. Faithfulness checks if the AI-generated information is factually consistent with the source, relevance assesses if the content selected by AI is the most important, and coherence evaluates the overall quality of the generated sentences.
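One lightweight way to operationalize these criteria, sketched below under our own assumptions, is a structured review checklist that a human reviewer (or, potentially, a second model acting as judge) scores for each generated summary; the rubric wording and the helper function are illustrative, not drawn from the guidance note.

```python
# Hypothetical rubric for reviewing an AI-generated summary against its
# source document. Scoring here is manual (a reviewer fills in the scores);
# an LLM-as-judge setup could be substituted, but that is an assumption on
# our part.
CRITERIA = {
    "faithfulness": "Is every statement in the summary factually consistent with the source text?",
    "relevance": "Does the summary select the most important content from the source?",
    "coherence": "Are the generated sentences well structured and do they read as a coherent whole?",
}

def review_prompt(source: str, summary: str) -> str:
    """Build a structured review checklist for a human (or model) reviewer."""
    questions = "\n".join(f"- {name}: {question}" for name, question in CRITERIA.items())
    return (
        "Rate the summary against the source on a 1-5 scale for each criterion.\n"
        f"{questions}\n\nSOURCE:\n{source}\n\nSUMMARY:\n{summary}"
    )

print(review_prompt("Example source text...", "Example AI-generated summary..."))
```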
To determine the minimum acceptable value for each metric, we take a context-specific approach. For example, in a text classification task for a structured literature review, recall and precision scores of 0.75 and 0.60 respectively may be sufficient for the initial literature identification, especially with highly class-imbalanced data. This was the case in our applied experiments within an ongoing IEG evaluation. However, for tasks with higher evaluative stakes, such as judging the extent to which a portfolio of activities has reached key performance targets, higher metric values are required. The guidance note provides a framework of assessment criteria with clear definitions.
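In practice, such context-specific minimums can be written down explicitly and checked automatically before model output is used. The sketch below is a minimal illustration; the stricter threshold values for the higher-stakes task are hypothetical, chosen only to show the contrast.

```python
# Hypothetical, context-specific acceptance thresholds. The 0.75 / 0.60
# values mirror the literature-identification example above; the stricter
# profile is an illustrative assumption for a higher-stakes task.
THRESHOLDS = {
    "literature_identification": {"recall": 0.75, "precision": 0.60},
    "portfolio_target_assessment": {"recall": 0.90, "precision": 0.85},
}

def meets_thresholds(task: str, scores: dict) -> bool:
    """Return True only if every required metric reaches its minimum value."""
    required = THRESHOLDS[task]
    return all(scores.get(metric, 0.0) >= minimum for metric, minimum in required.items())

print(meets_thresholds("literature_identification", {"recall": 0.78, "precision": 0.62}))   # True
print(meets_thresholds("portfolio_target_assessment", {"recall": 0.78, "precision": 0.62}))  # False
```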
Other Good Practices
Identifying use cases where LLMs can add significant value compared to traditional approaches is crucial. Not every task will see (net) benefits from the use of LLMs, so aligning experiments with tasks that can truly leverage their capabilities is essential. Planning workflows within use cases involves breaking down tasks into granular steps to understand where and how LLMs can be effectively applied. Translating typical evaluation flows into AI-enabled workflows is a critical step. Incorporating a modular design for AI-enabled workflows can further enhance the process, since it allows successful components to be reused within and across use cases.
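As a minimal sketch of what such a modular workflow might look like, the example below chains three illustrative steps (screening, classification, synthesis) as small reusable functions; the step names and the placeholder rule-based logic are our own assumptions, standing in for the actual LLM calls.

```python
def screen_documents(documents: list[str]) -> list[str]:
    """Step 1: keep only documents that mention the topic of interest."""
    return [doc for doc in documents if "irrigation" in doc.lower()]

def classify(documents: list[str]) -> list[tuple[str, str]]:
    """Step 2: assign a label to each document (placeholder logic standing in
    for an LLM classification call)."""
    return [(doc, "relevant" if "outcome" in doc.lower() else "background") for doc in documents]

def summarize(labelled: list[tuple[str, str]]) -> str:
    """Step 3: produce a simple synthesis of the relevant documents."""
    relevant = [doc for doc, label in labelled if label == "relevant"]
    return f"{len(relevant)} relevant document(s) identified for synthesis."

# Steps compose into a workflow, and each component can be swapped or reused.
corpus = ["Irrigation project outcome report", "General background note on irrigation"]
print(summarize(classify(screen_documents(corpus))))
```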
Clear understanding and agreement on necessary resources and expected outcomes are vital within the multidisciplinary teams needed to carry out such work. This includes human resources, technology, timeline, and defining success for each experiment. A robust sampling strategy is needed to partition datasets into training, validation, testing, and prediction sets, facilitating effective prompt development, model evaluation, and, most importantly, model responses that are helpful for the task. Iterative prompt development and validation involve testing and refining prompts, including requests for justification to gauge the model's reasoning, which itself helps with prompt refinement.
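One way to implement such a partitioning, assuming scikit-learn and a labelled corpus, is sketched below; the split proportions and dataset sizes are illustrative.

```python
from sklearn.model_selection import train_test_split

# Hypothetical corpus of labelled examples used for prompt development and
# validation; the unlabelled documents form the prediction set.
labelled_docs = [f"doc_{i}" for i in range(100)]
labels = [i % 2 for i in range(100)]
prediction_set = [f"new_doc_{i}" for i in range(500)]  # unlabelled, scored by the model only

# Split labelled data: 60% for prompt development (training), 20% for
# validation while iterating on prompts, 20% held out for final testing.
train_docs, temp_docs, train_labels, temp_labels = train_test_split(
    labelled_docs, labels, test_size=0.4, random_state=42, stratify=labels
)
val_docs, test_docs, val_labels, test_labels = train_test_split(
    temp_docs, temp_labels, test_size=0.5, random_state=42, stratify=temp_labels
)

print(len(train_docs), len(val_docs), len(test_docs), len(prediction_set))
```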
Going Forward
Experimenting with AI in evaluation practice involves thoughtful risk-taking, continuous learning, and adaptation. It is a continuous process of questioning, testing, learning, and refining, analogous to how models like GPT themselves learn during their training phase: through constant trial and error. Our guidance note focuses on defining and adapting evaluation workflows to include LLMs where they fit best and building trust through thorough performance testing. Further research, experimentation, and collaboration are needed to standardize and expand frameworks for assessing LLM performance in evaluation. Sharing experiences and findings from experiments across organizations and contexts is essential.
Much has been written about the potential and perils of leveraging LLMs in research and analytical tasks. However, it is through concrete, practical, context-specific experimentation that we can discover what works, what does not, and under what circumstances. We are committed to exploring and sharing our findings as widely as possible.
Additional resources:
Blog series Experimenting with GPT and Generative AI for Evaluation:
- Setting up Experiments to Test GPT for Evaluation
- Fulfilled Promises: Using GPT for Analytical Tasks
- Unfulfilled Promises: Using GPT for Synthetic Tasks