Setting up Experiments to Test GPT for Evaluation

Since OpenAI’s “ChatGPT” caused a frenzy with its entrance into the world at the end of 2022, a lot of hype has developed around generative artificial intelligence (AI) and large language models (LLMs), including in the evaluation community. Oscillating between awe and catastrophism, many opinions have been voiced. At IEG, the Methods Team has been key to introducing and scaling up the use of data science, machine learning, and artificial intelligence for evaluation. We wanted to act as a compass for our colleagues and other evaluation offices when it comes to incorporating generative AI in our practice. We thus decided to take a dispassionate approach by doing what we do best: setting up clear experiments to assess the models as objectively as possible. In this blog series, we will share how we designed the experiments (blog one); what we found worked well (blog two); and what did not work so well (blog three).

By: Estelle Raimondo, Harsh Anuj, and Virginia Ziulu

August 16, 2023


A cyber punk version of The Creation of Adam with one robot hand and one human hand

Like many of you, when we started thinking about testing GPT, we were excited by the prospect of what we would find, but we also had a healthy dose of skepticism.

Models like GPT are essentially deep learning models that can generate output by mimicking the data on which they were trained and can acquire quite a bit of knowledge in the process. Among the deep learning models, LLMs, such as GPT, and multimodal learning implementations, such as DALL-E, stand out because they have been trained on such a plethora of data that they can generate plausible-sounding text, images, or programming code, based on pretty much anything. At IEG, until now, we have only used discriminative AI (as opposed to generative), which models decisions about boundaries between different classes of text or images (for example, text classification and image segmentation) but does not generate anything new.

Now, what are we hoping to get out of using GPT or other LLMs for our evaluation practice? Essentially, speed, enhanced capabilities, new insight, and improved quality. If you have already toyed with GPT, the desire for speed won’t surprise you. You can, for example, generate hundreds of lines of code in minutes instead of days. GPT’s ability to perform code writing and analytical tasks through prompts can greatly enhance capability. Imagine an evaluation team leader with rusty Stata skills working with GPT to arrive at an analysis without having to write a line of code themselves, or a data scientist leveraging a chatbot to quickly develop code to test a new modeling technique. New insight can be drawn because of the vastness of data that LLMs have been trained on and their summarization capabilities. We were the most doubtful about quality improvement. Yet, for example, the quality of the generated code, with its clear functions and explanations, was quite impressive at first glance.

We were also very aware of several perils presented by generative AI. The most obvious are ethical issues and biases stemming from the fact that these models have been trained on data that include explicit and implicit biases (for example, toxic texts or images that are not representative of the overall population), even if their developers are controlling the outputs. For our use cases, we were particularly concerned about the lack of transparency of some deep-learning models and the limited information on models’ architecture or the training data. Finally, the issue of truthfulness was top of mind for us. It has been shown quite extensively that when facing complicated situations, the models tend to provide false information and make up responses that sound an awful lot like reality. We’ll share several instances of “hallucination” in our experiments in our next blog!

We followed several principles when designing our experiments.

We represented the various profiles of our IEG colleagues: (i) data scientists, who already have specialized programming skills and would use LLMs to enhance natural language processing applications and gain speed mostly through the use of the model’s application programming interface; (ii) analysts, who have the ability to interpret and verify the output of LLMs, can navigate GPT Playground and manipulate plugins, and can be trained in prompt engineering; and (iii) evaluation team leaders who would mostly use the chatbot function to accomplish tasks in natural language.
We compared the output generated by the various models with existing outputs produced by IEG colleagues in our “conventional” way.
We tested various models: mostly GPT-4, the World Bank’s enterprise version called mAI (which at the time was powered by GPT-3.5 via Microsoft), and for only a couple of applications, Google’s LaMDA and Bard chatbot. We also tested various approaches, including OpenAI’s application programming interface and certain ChatGPT plugins (Webpilot, Code Interpreter, and ScholarAI).
We allowed some level of probing and interaction, but we also kept a close eye on the time, as one of our objectives was to test for the efficiency (time gained) of the approach.
Finally, we used only publicly available information.
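
For readers who think in code, one way to picture these principles is as a small record kept for each trial. The sketch below is purely illustrative and is not the tooling we used: the class name, field names, and the numbers in the example are all hypothetical, and the text-overlap score is only a crude surface measure that cannot replace expert review of the output.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher

# Hypothetical labels, taken from the profiles and phases named in this blog.
PROFILES = ("data scientist", "analyst", "evaluation team leader")
PHASES = ("pre-analysis", "analysis", "post-analysis")


@dataclass
class ExperimentRun:
    """One trial of one model on one application (all fields illustrative)."""
    application: str
    phase: str
    profile: str
    model: str
    baseline_output: str      # produced the "conventional" way
    model_output: str         # produced with the LLM
    baseline_minutes: float
    model_minutes: float      # includes time spent probing and re-prompting

    def __post_init__(self):
        # Reject labels outside the three phases and three profiles.
        if self.phase not in PHASES:
            raise ValueError(f"unknown phase: {self.phase}")
        if self.profile not in PROFILES:
            raise ValueError(f"unknown profile: {self.profile}")

    def minutes_saved(self) -> float:
        """Time gained relative to the conventional approach."""
        return self.baseline_minutes - self.model_minutes

    def text_overlap(self) -> float:
        """Crude 0-to-1 surface similarity between the two outputs."""
        matcher = SequenceMatcher(None, self.baseline_output, self.model_output)
        return matcher.ratio()


# Placeholder values only; these are not results from our experiments.
run = ExperimentRun(
    application="placeholder application",
    phase="analysis",
    profile="analyst",
    model="GPT-4",
    baseline_output="placeholder baseline text",
    model_output="placeholder model text",
    baseline_minutes=120.0,   # made-up number
    model_minutes=30.0,       # made-up number
)
print(run.minutes_saved(), round(run.text_overlap(), 2))
```

Keeping the time spent alongside both outputs makes the efficiency question and the quality question visible in the same place, which is the point of comparing against a conventional baseline.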

For our first batch of experiments, we zeroed in on nine applications, spanning the three typical phases of the evaluation workflow: pre-analysis, analysis, and post-analysis.

In the next blog, we’ll show you the experiments that worked well.

This blog is part of the series Experimenting with GPT and Generative AI for Evaluation.