1. Generating Synthetic Images for Data Augmentation in the Use of Geospatial Analysis

IEG has begun incorporating geospatial analysis into its evaluation practice (for example, using satellite data, geo-tagged data, and drone images). We also recently experimented with deep learning and computer vision applications to assess the effectiveness of World Bank interventions. These techniques require a large number of medium- or high-resolution images, which can restrict applicability. Several approaches can remedy this, including transfer learning and data augmentation; each has its strengths and weaknesses as well as specific recommended use cases. We tested DALL-E on its ability to help with data augmentation through synthetic image generation by asking it to generate urban images in the style of Bathore, Albania (the site of a geospatial study). We were not impressed! Only four images could be produced per prompt, and it was hard to assess the similarity between the generated images and real images. Other data augmentation techniques, such as the use of generative adversarial networks, perform much better at this, and thus we do not recommend this approach for the time being.
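For readers less familiar with the term, the simplest form of data augmentation does not generate new images at all: it applies label-preserving geometric transforms (flips and rotations) to tiles that already exist. The sketch below is purely illustrative and is not IEG's pipeline. The 2x2 "tile" is a hypothetical stand-in for a real image, and the function names are our own.

```python
from typing import List

# Hypothetical stand-in for an image tile: rows of grayscale pixel values.
Image = List[List[int]]


def flip_horizontal(img: Image) -> Image:
    """Mirror the tile left to right."""
    return [row[::-1] for row in img]


def rotate_90_clockwise(img: Image) -> Image:
    """Rotate the tile a quarter turn clockwise."""
    return [list(col) for col in zip(*img[::-1])]


def augment(img: Image) -> List[Image]:
    """Return the 8 flip/rotation variants of one tile."""
    variants: List[Image] = []
    current = img
    for _ in range(4):
        variants.append(current)                   # rotated copy
        variants.append(flip_horizontal(current))  # mirrored rotated copy
        current = rotate_90_clockwise(current)
    return variants


tile = [[1, 2], [3, 4]]
variants = augment(tile)
print(len(variants))
```

Transforms like these only reshuffle existing pixels, so they cannot add genuinely new scenes. That limitation is the gap that generative approaches try to fill.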

2. Conducting a Literature Review

Structured literature reviews (SLRs) are a building block of IEG’s evaluations. Depending on the intended use, they vary in breadth and rigor, but they are unavoidable. We asked ChatGPT and the World Bank’s enterprise version, mAI, to conduct a literature review using the following prompt: “Please write a short literature review of the advantages and challenges of the use of Doing Business Indicators.” IEG had recently completed a robust structured literature review on this topic, so we could check GPT’s output against IEG’s work. Here our findings are a bit more ambivalent. While both ChatGPT and mAI provided plausible responses that overlapped with IEG’s findings (and ChatGPT provided a quite long and detailed response), ascertaining the veracity of the responses and mitigating the risk of hallucination were difficult. For instance, many of the references in the reference list produced by ChatGPT were entirely made up. They sounded like real journal articles, but a web search for them yielded no results. Using the ScholarAI plug-in did not fix this. Thus, we do not recommend using these tools to conduct a literature review. However, they can still be used, with caution, to obtain background knowledge on a specific topic.

3. Conducting an Evaluation Synthesis

IEG’s new product line, Evaluation Insight Notes, is designed to produce new insights by synthesizing existing evidence from IEG’s evaluative work. We tested ChatGPT’s capacity to synthesize information from a well-defined set of reports. Using the WebPilot plug-in, we asked ChatGPT to ingest the text from six project evaluations (Project Performance Assessment Reports) on domestic resource mobilization (the topic of a recent Evaluation Insight Note) and produce an evaluative synthesis based on this evidence. The synthesis was produced iteratively with multiple rounds of prompts and interactions between our data scientist and the chatbot. While the writing produced sounded very good and the high-level messages (the insights) were appropriate, the model fabricated evidence and examples. For example, it invented examples of interventions that did not exist in the projects and text that did not exist in the reports. This eroded our trust in the whole exercise. Worse, when called out on these fabrications, the chatbot denied them! We do not recommend using chatbots to synthesize evidence from multiple sources, especially if attempting to generate specific examples or evidence to back up overarching themes.


On the basis of these nine experiments, we think using AI for basic tasks can be very helpful. However, at this stage, we advise staying away from using GPT for more complex tasks, where the risk of untruthful answers becomes higher and harder to detect. When engaging with the chatbot particularly, exercising caution is very important. One colleague’s reaction to our findings stuck with us: “Is anyone else alarmed that some of the behaviors observed in AI, such as lying, hallucination and the belief it can’t be wrong, would be diagnosed as personality disorders or psychoses in a colleague? If we had an analyst who lied, hallucinated, and had unquestioning self-confidence in their conclusions no matter the evidence, would we accept them on the grounds that they were fast at Stata?”

In figure 1, we’ve captured our findings from the nine experiments, which we hope can be helpful to your own practice. We will continue to cautiously experiment with GPT and other large language models in the coming months and look forward to sharing our progress.

This blog is part of the series Experimenting with GPT and Generative AI for Evaluation:

  1. Setting up Experiments to Test GPT for Evaluation
  2. Fulfilled Promises: Using GPT for Analytical Tasks
  3. Unfulfilled Promises: Using GPT for Synthetic Tasks


Submitted by Dr Frans L Leeuw on Sun, 09/03/2023 - 09:54


I very much appreciate these three blogs. Great work!! If you add to your insights, insights from evaluations in fields like medicine, development, service delivery where (field) experiments and even sometimes theory-driven studies are done to sort out what the benefits (and 'costs') are of 'algorithmization'/combining human intelligence and AI, then these contributions help the evaluation community to become the Luddites of the knowledge industry in the 21st century.
