Unfulfilled Promises: Using GPT for Synthetic Tasks

In our previous blogs, we shared our findings and recommendations for six experiments that ended up working quite well. For the next three experiments, we wanted to test GPT’s performance on more advanced tasks that didn’t revolve around simply producing code or summarizing documents but rather required the synthesis of information. Here the perils that we had anticipated materialized more vividly, especially on issues of truthfulness.

By: Estelle Raimondo

Harsh Anuj

Virginia Ziulu

August 30, 2023


1. Generating Synthetic Images for Data Augmentation in the Use of Geospatial Analysis

IEG has begun incorporating geospatial analysis into its evaluation practice (for example, using satellite data, geo-tagged data, and drone images). We also recently experimented with deep learning and computer vision applications to assess the effectiveness of World Bank interventions. These techniques require a large number of medium- or high-resolution images, which can restrict their applicability. Several approaches can remedy this, including transfer learning and data augmentation; each has its strengths and weaknesses as well as specific recommended use cases. We tested DALL-E on its ability to help with data augmentation through synthetic image generation by asking it to generate urban images in the style of Bathore, Albania (the site of a geospatial study). We were not impressed! Only four images could be produced per prompt, and it was hard to assess the similarity between the generated images and real images. Other data augmentation techniques, such as the use of generative adversarial networks, perform much better at this, and thus we do not recommend this approach for the time being.
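For readers unfamiliar with data augmentation, here is a minimal sketch of its simplest form: generating flipped and rotated copies of an existing image tile. This is a classic geometric approach, not the generative adversarial network technique mentioned above and not necessarily what IEG used. The function name `augment_tile` and the random array standing in for a satellite tile are assumptions made for illustration.

```python
import numpy as np


def augment_tile(tile: np.ndarray) -> list[np.ndarray]:
    """Return the 8 rotation/flip variants of a square image tile (H, W, C)."""
    variants = []
    for k in range(4):
        # Rotate by k * 90 degrees in the image plane (rows, columns).
        rotated = np.rot90(tile, k=k, axes=(0, 1))
        variants.append(rotated)
        # Mirror the rotated tile left to right.
        variants.append(np.flip(rotated, axis=1))
    return variants


# A random array stands in for a real 64x64 RGB satellite tile (assumption).
rng = np.random.default_rng(0)
tile = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
augmented = augment_tile(tile)
```

Unlike synthetic generation, every variant here is derived from a real image, so there is no question about how closely it resembles the study site. The trade-off is that it adds no genuinely new scenes.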

2. Conducting a Literature Review

Structured literature reviews (SLRs) are a building block of IEG’s evaluations. Depending on the intended use, they vary in breadth and rigor, but they are unavoidable. We asked ChatGPT and the World Bank’s enterprise version, mAI, to conduct a literature review using the following prompt: “Please write a short literature review of the advantages and challenges of the use of Doing Business Indicators.” IEG had recently completed a robust structured literature review on this topic, so we could check GPT’s output against IEG’s work. Here our findings are a bit more ambivalent. While both ChatGPT and mAI provided plausible responses that overlapped with IEG’s findings—and ChatGPT provided a quite long and detailed response—ascertaining the veracity of the responses and mitigating the risk of hallucination was difficult. For instance, many of the references in the reference list produced by ChatGPT were entirely made up. They sounded like real journal articles, but a web search for them yielded no results. Using the ScholarAI plug-in did not fix this. Thus, we do not recommend using these tools to conduct a literature review. However, they can still be used—with caution—to obtain background knowledge on a specific topic.

3. Conducting an Evaluation Synthesis

IEG’s new product line, Evaluation Insight Notes, is designed to produce new insights by synthesizing existing evidence from IEG’s evaluative work. We tested ChatGPT’s capacity to synthesize information from a well-defined set of reports. Using the WebPilot plug-in, we asked ChatGPT to ingest the text from six project evaluations (Project Performance Assessment Reports) on domestic resource mobilization (the topic of a recent Evaluation Insight Note) and produce an evaluative synthesis based on this evidence. The synthesis was produced iteratively with multiple rounds of prompts and interactions between our data scientist and the chatbot. While the writing produced sounded very good and the high-level messages (the insights) were appropriate, the model fabricated evidence and examples. For example, it invented examples of interventions that did not exist in the projects and text that did not exist in the reports. This eroded our trust in the whole exercise. Worse, when called out on these fabrications, the chatbot denied them! We do not recommend using chatbots to synthesize evidence from multiple sources, especially if attempting to generate specific examples or evidence to back up overarching themes.
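One partial safeguard against invented text is to check mechanically whether each passage the chatbot presents as a quotation actually appears in the source reports. The sketch below is one simple way to do that and is not the procedure IEG followed. The function names, report names, and report text are hypothetical placeholders. It only catches fabricated verbatim quotes, not paraphrased or invented examples.

```python
import re


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so line breaks do not block a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def find_unsupported_quotes(quotes: list[str], reports: dict[str, str]) -> list[str]:
    """Return the quotes that do not appear verbatim in any source report."""
    normalized_reports = [normalize(body) for body in reports.values()]
    unsupported = []
    for quote in quotes:
        needle = normalize(quote)
        if not any(needle in body for body in normalized_reports):
            unsupported.append(quote)
    return unsupported


# Placeholder text standing in for the real reports (assumption).
reports = {
    "report_a": "The project supported the tax administration\nin modernizing its IT systems.",
    "report_b": "Revenue collection improved after the reform.",
}
quotes = [
    "supported the tax administration in modernizing its IT systems",
    "the project introduced a mobile payment platform",
]
unsupported = find_unsupported_quotes(quotes, reports)
```

Here the second quote is flagged because it appears in neither placeholder report. A flagged quote still needs a human check, since the model may have altered the wording slightly rather than invented it outright.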

Conclusions

On the basis of these nine experiments, we think using AI for basic tasks can be very helpful. However, at this stage, we advise staying away from using GPT for more complex tasks, where the risk of untruthful answers becomes higher and harder to detect. When engaging with the chatbot particularly, exercising caution is very important. One colleague’s reaction to our findings stuck with us: “Is anyone else alarmed that some of the behaviors observed in AI, such as lying, hallucination and the belief it can’t be wrong, would be diagnosed as personality disorders or psychoses in a colleague? If we had an analyst who lied, hallucinated, and had unquestioning self-confidence in their conclusions no matter the evidence, would we accept them on the grounds that they were fast at Stata?”

In figure 1, we’ve captured our findings from the nine experiments, which we hope can be helpful to your own practice. We will continue to cautiously experiment with GPT and other large language models in the coming months and look forward to sharing our progress.

This blog is part of the series Experimenting with GPT and Generative AI for Evaluation:

Kristin Strohecker, IEG Program Manager for Data, Systems, and Staff Learning, Mari Noelle Roquiz, IEG Monitoring and Evaluation Specialist, and Tao Tao, IEG Data Scientist.


Comments


I very much appreciate these three blogs. Great work!! If you add to your insights those from evaluations in fields like medicine, development, and service delivery, where (field) experiments and sometimes even theory-driven studies are done to sort out the benefits (and 'costs') of 'algorithmization' / combining human intelligence and AI, then these contributions help the evaluation community to become the Luddites of the knowledge industry in the 21st century.
