These experiments by no means represent all potential generative AI applications for evaluation practice, but they provide a nice sample that spans various stages of the evaluation process (pre-analysis, analysis, post-analysis), user profile needs (data scientists, analysts, and team leaders), and output types (text, images, and programming code). And, for all these experiments, we could compare GPT’s output with output already generated by an evaluation team.

As you read, keep three things in mind. First, these models evolve constantly, and the results of the experiments run in June 2023 could soon be obsolete. Second, output quality varied quite a bit by model. Third, the nature of generative models makes exact replication impossible: running the same prompts several times with the same model produced slightly different results each time. That is not very satisfying for rigorous evaluators, but this is what we have.

1. Writing Code for Preprocessing Textual Data

At IEG, we routinely incorporate text analytics and machine learning into our products, including for portfolio identification, portfolio analysis, and evaluative synthesis. These applications require the relatively tedious and time-consuming process of text preprocessing. We asked ChatGPT to generate Python code to complete a text preprocessing pipeline on the training set used for text classification for IEG’s annual Results and Performance report. Although it took several iterations to fix some issues, GPT produced correct code that met our needs. ChatGPT can be very useful for speeding up the writing of code for standard yet variable tasks such as this one; however, the prompt must be very specific, and the user must know what the correct output looks like.
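To make "text preprocessing pipeline" concrete, here is a minimal sketch of the kind of code one might ask for. It is not the code ChatGPT produced for us: the `preprocess` function, the tiny `STOPWORDS` list, and the sample sentence are all hypothetical, and it uses only the Python standard library.

```python
import string

# Tiny illustrative stopword list (an assumption; a real pipeline would use a fuller one).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "is", "are", "on", "with"}

# Map every punctuation mark and digit to a space so neighboring words stay separated.
_STRIP_TABLE = str.maketrans({ch: " " for ch in string.punctuation + string.digits})


def preprocess(text: str) -> list[str]:
    """Lowercase, strip punctuation and digits, tokenize, and drop stopwords."""
    cleaned = text.lower().translate(_STRIP_TABLE)
    tokens = cleaned.split()
    # Drop stopwords and single-character leftovers.
    return [tok for tok in tokens if tok not in STOPWORDS and len(tok) > 1]


# Hypothetical sample sentence, not taken from the actual training set.
sample = "The project aims to improve access to water in 12 rural districts."
print(preprocess(sample))
# ['project', 'aims', 'improve', 'access', 'water', 'rural', 'districts']
```

Even for a sketch this small, the point made above holds: you can only judge the result if you already know what the cleaned tokens should look like.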

2. Explaining Programming Code

As data science skills grow in IEG, we are decentralizing data science practices. This requires having code that is sharable, reproducible, understandable, and reusable by others. Ask any data scientist—it takes time to write code that functions and is reusable. We asked ChatGPT to add comments to R scripts, which we use to bulk download documents, to provide a high-level summary of the code and detailed descriptions of a user-defined function within the code. ChatGPT did a great job. We strongly recommend using large language models and chatbots to make your code more understandable or to interpret others’ code.

3. Conducting a Simple Classification

IEG routinely uses machine learning to identify complex portfolios (those that resist simple classifications by existing sector or theme attributes). We asked ChatGPT, its application programming interface (API), and the World Bank’s enterprise version, mAI (powered by GPT-3.5), to classify textual data (the project development objectives that characterize every World Bank project) according to whether they were related to disaster risk reduction. We assessed the accuracy of each model against manual coding already done for an evaluation. ChatGPT and the GPT-4 API performed very well. Without specific training data (a “zero-shot application”), they reached an accuracy rate above 76%, which is similar to what a supervised learning model achieves after time and resources have been spent developing a training set. Conversely, mAI reached an accuracy rate of about 57%, hardly better than the toss of a coin. We recommend using GPT-4 for simple classifications, especially through the API, because the chatbot could handle only 25 entries at a time. Keep in mind that the output should still be validated.
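The validation step described above can be sketched in a few lines of Python. This is an illustration only: `batches` mimics the 25-entries-at-a-time limit, `accuracy` compares predictions with manual coding, and `keyword_classifier` is a hypothetical stand-in for the model call (no real API is used here). The objectives and labels are invented examples, not our evaluation data.

```python
from typing import Iterator, Sequence


def batches(items: Sequence[str], size: int = 25) -> Iterator[Sequence[str]]:
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def accuracy(predicted: Sequence[bool], manual: Sequence[bool]) -> float:
    """Share of predictions that match the manual coding."""
    if not manual or len(predicted) != len(manual):
        raise ValueError("predicted and manual must be non-empty and of equal length")
    matches = sum(p == m for p, m in zip(predicted, manual))
    return matches / len(manual)


def keyword_classifier(batch: Sequence[str]) -> list[bool]:
    """Hypothetical stand-in for a model call: flags texts mentioning 'disaster' or 'flood'."""
    return [("disaster" in text.lower()) or ("flood" in text.lower()) for text in batch]


# Invented project development objectives with invented manual labels.
objectives = [
    "Strengthen disaster risk management in coastal cities.",
    "Improve learning outcomes in primary schools.",
    "Reduce the vulnerability of households to flood events.",
    "Increase access to rural finance.",
    "Enhance resilience to earthquakes and cyclones.",
]
manual_labels = [True, False, True, False, True]

predictions: list[bool] = []
for batch in batches(objectives, size=25):
    predictions.extend(keyword_classifier(batch))

print(accuracy(predictions, manual_labels))  # 0.8 (the stand-in misses the last objective)
```

Swapping the stand-in for a real model call leaves the batching and the accuracy check unchanged, which is the part that lets you validate the output.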

4. Conducting Sentiment Analysis

GPT-4 performed even better when conducting sentiment analysis, which we sometimes use to classify whether factors are positively or negatively associated with a desired outcome or to classify the tone of lessons from large repositories of evaluative evidence. We asked GPT-4 to provide the sentiment (positive or negative) of sentences we had already manually coded as a training set for the 2023 Results and Performance (RAP) report. Here, GPT-4 outperformed our best model by nearly 8 percentage points (94.5% for GPT versus 86.8% for SIEBERT). GPT-4’s API is a solid option for sentiment analysis. ChatGPT can be useful for just a few sentences (it could only process about 50 entries at a time) but must be scrutinized thoroughly: in our case, it started hallucinating new sentences halfway through. Starting a new window for each prompt may help with this issue.
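One simple way to catch the hallucinated sentences mentioned above is to check that every sentence the chatbot returns was actually in the input. The sketch below assumes the model echoes each sentence next to its label; the function name `find_unmatched` and the sample sentences are hypothetical.

```python
def find_unmatched(inputs: list[str], returned: list[str]) -> list[str]:
    """Return the sentences in `returned` that do not appear in `inputs`.

    Comparison ignores case and differences in whitespace, so trivial
    reformatting by the model is not flagged.
    """
    def normalize(sentence: str) -> str:
        return " ".join(sentence.lower().split())

    known = {normalize(s) for s in inputs}
    return [s for s in returned if normalize(s) not in known]


# Invented example: the second returned sentence was never submitted.
submitted = [
    "Strong government ownership supported implementation.",
    "Weak monitoring systems delayed corrective action.",
]
echoed = [
    "Strong government  ownership supported implementation.",
    "Limited staff capacity slowed procurement.",
]

print(find_unmatched(submitted, echoed))
# ['Limited staff capacity slowed procurement.']
```

A check like this does not verify that the labels are right, only that the model is still labeling the sentences it was given.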

5. Conducting Econometric Analysis

IEG sometimes uses econometric analysis to test the association between World Bank interventions and desired outcomes. We asked ChatGPT to provide R code to replicate a multivariate regression analysis conducted for the early-stage evaluation of the World Bank’s economic response to the pandemic. We provided ChatGPT with very specific instructions, and the code it generated performed well, allowing us to replicate the results of the study, including generating the plots. This seems like a good application of ChatGPT: it reduces the need to write code, or at least saves significant time on that task.

6. Summarizing Individual Documents

We tested how well ChatGPT summarized a single document using the recently published Morocco Country Program Evaluation. ChatGPT produced an accurate, well-written high-level summary of the document based on its key topics. It thus seems that ChatGPT can help in the process of producing an evaluation synthesis by generating high-level summaries of long reports that can then be synthesized by humans.


Clearly, some of AI’s promises were fulfilled, and we are enthusiastic about using GPT-4’s API for some pre-analysis and analysis tasks, especially those that revolve around generating code. Of course, users need to exercise caution and ensure that whoever uses GPT-4 or other large language models also has the ability and domain expertise to judge and verify the output.

Read the final blog to learn about our less-successful experiments.

This blog is part of the series Experimenting with GPT and Generative AI for Evaluation:

  1. Setting up Experiments to Test GPT for Evaluation
  2. Fulfilled Promises: Using GPT for Analytical Tasks
  3. Unfulfilled Promises: Using GPT for Synthetic Tasks



This is really interesting, and I am glad to see the attempt to take advantage of AI technology. We can learn a lot from this. Our biggest "concern" deals with security. Much of our data are from confidential interviews, where we promise anonymity. While the AI sites use high-level security, we worry that the data might be harvested for future analysis. For now, we are only experimenting with analyzing public documents, such as public statements and project documents.
