
Machine Learning in Evaluative Synthesis

Chapter 1 | Machine Learning Applications in Evaluation

What Is Machine Learning?

Machine learning is based on pattern recognition and the theory that computers can autonomously learn to perform certain well-defined tasks (Samuel 1959). The procedure employed usually relies on algorithms: sets of unambiguous mathematical rules used to classify and process data and draw basic inferences. At its core, machine learning is a Bayesian endeavor in which prior beliefs are updated based on new data introduced into the analysis. Though the philosophy underlying this approach dates back to the eighteenth century, recent improvements in the efficiency and accessibility of computational methods have allowed scholars and practitioners to apply machine learning methods to a wide array of complex problems.
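The Bayesian updating described above can be illustrated with a minimal Beta-Binomial sketch, in which a prior belief about a classifier's accuracy is revised as labeled evidence accumulates. The function name and all figures below are illustrative, not drawn from any system discussed in this chapter.

```python
# A minimal sketch of Bayesian updating with a Beta-Binomial model:
# a prior belief about a classifier's accuracy is revised as new
# labeled examples arrive. All numbers here are illustrative.

def update_beta(alpha, beta, successes, failures):
    """Conjugate update: Beta(alpha, beta) prior plus binomial evidence."""
    return alpha + successes, beta + failures

# Start with a weak prior: Beta(1, 1) is uniform over [0, 1].
alpha, beta = 1, 1

# Observe 8 correct and 2 incorrect classifications.
alpha, beta = update_beta(alpha, beta, successes=8, failures=2)

# Posterior mean estimate of accuracy: 9 / (9 + 3) = 0.75.
posterior_mean = alpha / (alpha + beta)
print(round(posterior_mean, 2))  # 0.75
```

Each new batch of evidence simply shifts the posterior, which then serves as the prior for the next batch; this is the sense in which the algorithm "learns" as more information is introduced.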

After training on a subset of the data, a machine learning algorithm extracts generalizable lessons from new data, becoming more precise as more information is provided. Human classification often faces an upper limit on both efficiency and scalability. There are also limits on how perceptive human coders can be regarding patterns hidden in very large data sets; given the complexity of the phenomena under observation, nuanced insights based on fewer observations may be lost in the sea of available data. The same shortcomings that limit the performance of manual methods can, however, serve as a source of strength for automated content analysis. Automated classification tends to become more accurate as the quantity of information increases and does not neglect subtler patterns, provided that the training data used are sufficiently well ordered.

Machine learning applications can involve both supervised and unsupervised methods, as well as a mixture of the two. Supervised-learning algorithms rely on human-coded training sets to train a classification tool to generate predictions from a broader sample of data. Such algorithms are given a set of latent parameters to search for a priori, classifying raw data into categories according to those parameters. Among other uses, they can be trained to categorize text, detect spam, diagnose health issues, and discover fraudulent spending activity. The accuracy of supervised methods depends on how well the classification parameters are vetted and on the quality of the manually coded training set supplied to the algorithms. In short, supervised classification methods require essential inputs from human sources to function properly. Once properly calibrated, however, they tend to repay the initial time investment, parsing and categorizing textual data relevant to a particular topic of interest faster and more accurately than manual approaches.
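The supervised workflow described above, in which a human-coded training set teaches a classifier that then labels unseen text, can be sketched with a tiny multinomial naive Bayes model. The corpus, labels, and category names below are invented for illustration; a production system would use an established library and a far larger training set.

```python
# A minimal supervised text classifier: human-labeled examples train
# a multinomial naive Bayes model, which then labels new text.
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (text, label). Returns per-label word counts and label counts."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts

def classify(text, word_counts, label_counts):
    """Pick the label maximizing log prior + sum of log word likelihoods."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n in label_counts.items():
        score = math.log(n / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            # Laplace smoothing so unseen words do not zero out a class.
            score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical hand-coded training set of evaluation-report snippets.
training_set = [
    ("procurement delays slowed disbursement", "implementation"),
    ("contractor disputes delayed construction", "implementation"),
    ("strong stakeholder ownership of outcomes", "engagement"),
    ("community consultation improved ownership", "engagement"),
]
wc, lc = train(training_set)
print(classify("construction delays in procurement", wc, lc))  # implementation
```

The quality of the output depends entirely on the labeled examples, mirroring the point above that supervised methods stand or fall on the vetting of their training data.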

Unsupervised approaches, conversely, do not rely on human input. Instead, they independently search input data for potential correlates and clusters based on different underlying features. Both approaches offer unique advantages specific to different applications. Unsupervised methods can best be thought of as tools that support a Popperian “logic of discovery,” serving as an exploratory probe for detecting clusters and patterns in texts (Aggarwal and Zhai 2012).1 However, though unsupervised classification tools can successfully detect patterns in complex and multidimensional data corpora, they can also be susceptible to misclassification errors and overfitting.

In rare cases, unsupervised approaches may surface substantively meaningless but statistically “significant” quirks in the data they are analyzing. Not every hidden association within data is useful for a particular research topic. Human intervention is therefore needed to ensure that unsupervised algorithms generate results that are substantively meaningful and not driven by stochastic noise in the underlying data. Such intervention becomes more pertinent as the complexity of the data increases. In practice, unsupervised learning can often be used with great success to detect hitherto unclassified clusters in data, highlight potential outliers in data sets, or reduce dimensionality within a complex framework.2 But practitioners should not rely on unsupervised learning to produce consistent and meaningful outputs without some degree of vetting by those with substantive knowledge of the underlying phenomena of interest.
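The cluster detection described above can be sketched with one-dimensional k-means, an unsupervised method of the kind applied in several studies cited in this chapter. The algorithm groups unlabeled points with no human-coded categories; the data points and starting centroids below are invented.

```python
# A minimal sketch of unsupervised clustering via k-means in one
# dimension. No labels are supplied; structure emerges from the data.

def kmeans(points, centroids, iterations=10):
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: recompute each centroid as its cluster's mean.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Two obvious groups hidden in unlabeled data.
data = [1.0, 1.2, 0.8, 9.0, 9.5, 8.7]
centroids, clusters = kmeans(data, centroids=[0.0, 10.0])
print(sorted(round(c, 2) for c in centroids))  # [1.0, 9.07]
```

Note the caveat from the text: the algorithm will always return some clustering, meaningful or not, which is why human vetting of the resulting groups remains essential.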

Previous Applications

Practical applications of machine learning and text analytics in the realm of evaluation have primarily focused on three areas: automatic coding of key implementation challenges, risk identification, and impact evaluation. Though different machine learning methods can offer a variety of efficiencies related to the practice of evaluation, arguably the most pertinent method has involved supervised or semisupervised classification of large quantities of text. Previous applications have taken advantage of tools for such supervised classification in several different contexts. A variety of studies that have applied machine learning to data in health care, pharmaceutical research, transportation, energy, and labor, among other areas, have noted the benefits of such an approach.

Cimiano et al. (2005) use machine learning to categorize a large corpus of heterogeneous data, extracting common text features and examining interrelationships among the various terms identified. Tanguy et al. (2016) use support vector machine learning to classify and evaluate safety event records and archival documents, which enables them to categorize incident reports in the aviation sector. The resulting output improves the accuracy and reliability of analysis conducted by aviation experts, providing insights relevant to facets of aviation incidents. Schmidt, Schnitzer, and Rensing (2015) similarly take advantage of an automated classification algorithm for text-heavy source data, in this case a catalog of job offers based on hours of work, modes of employment, and functional work areas. The resulting output consists of a domain-specific search engine that enables subject-specific knowledge to be exploited more efficiently using a set of supervised subject filters.

Padmanabhan (2015) applies a battery of supervised multilabel classifiers and natural-language-processing techniques to analyze policy documents and survey data on psychological counseling for military servicemembers. He then uses the resulting output as a framework for explaining how the policies of the United States’ Military Health System influence servicemembers’ access to psychological services. Burscher, Vliegenthart, and De Vreese (2015) use a supervised-learning algorithm to categorize policy issues, political articles, and parliamentary discourse by salience and topic. The authors then use the results to investigate the generalizability of policy issue classifiers, testing the relevance of different machine-coded topics relative to those yielded by hand-coded training sets.

In regard to risk assessments, machine learning can help policy makers identify category-specific risk factors and quantify their impact, drawing on insights from challenges and obstacles encountered in earlier projects. In this context, Rona-Tas et al. (2019) use supervised learning in the field of food safety to assess the two main issues related to food hazards, helping practitioners better understand underlying ambiguities and emergent risks related to monitoring and inspection practices. Quantification of risk factors provides specific benefits in this context, as the output of the model employed (assessing the need for potential safety warnings and recalls) demands accurate and timely assessments of food risk parameters. Similarly, Abdellatif et al. (2015) and Ali (2007) use neural networks to assess flood risks and river water quality, generating output that helps manage urban water systems and minimize loss of life and property after water-based disasters. Galindo and Tamayo (2000) apply supervised-learning algorithms such as classification and regression tree models and neural networks to evaluate risk among financial intermediaries, generating an important diagnostic tool for assessing institutional risks and volatility.

Okori and Obua (2011) apply machine learning techniques to predict famines in Uganda, using data from the country’s northern region to train their tool on inputs from other regions. They employ a combination of support vector machine, k-nearest neighbors, naïve Bayes, and decision tree analysis to highlight meaningful relationships related to food security and famines, yielding output beneficial for evaluating causal variables related to theorized causes of food scarcity. Ofli et al. (2016) combine crowdsourcing and real-time supervised machine learning to evaluate large quantities of aerial and satellite imagery for time-sensitive disaster response. Jean et al. (2016) similarly apply machine learning to survey data and satellite imagery from Malawi, Nigeria, Rwanda, Tanzania, and Uganda, training a convolutional neural network to identify variations in local economic outcomes. The resulting output offers a scalable tool for predicting poverty according to a combination of data sources. Likewise, McBride and Nichols (2015) implement stochastic ensemble methods such as quantile regression forests to improve the accuracy of beneficiary targeting in poverty reduction, generating economies in areas in which conventional means testing can be prohibitively costly.

Impact evaluation has also benefited from advances in applied machine learning techniques. Counterfactual designs determine the effect of a policy intervention by comparing a treatment group with a control group over time, using experimental or quasi-experimental techniques to control for observable and unobservable causal factors. However, this type of comparison is not always feasible or desirable. In practice, achieving a proper balance between treatment and control groups is no easy feat, particularly when the active samples (such as specific social groups or geographical areas) tend to be structurally diverse. Matching techniques, including unsupervised learning, can be used in this area (see, for example, Gertler et al. 2016). In one example, Ruz, Varas, and Villena (2013) use k-means clustering algorithms to identify the common characteristics of households lacking internet access as a means of evaluating whether an unconditional broadband subsidy campaign had a significant effect on broadband penetration in Chile.

Zheng, Zheng, and Ye (2016) also use machine learning methods to assess the development impact of environmental tax reform in China. Niu, Wang, and Duan (2009) rely on support vector machine analysis to evaluate the impact of power plant construction projects in China, and Burlig et al. (2017) examine, via machine learning, the impact of energy efficiency upgrades in primary and secondary schools. Machine learning can also yield useful meta-analytical insights. Mueller, Gaus, and Konradt (2016) note that progress in evaluation research depends on establishing a productive cycle of scholarly knowledge generation, dissemination, and implementation. Examining the uneven proliferation of scholarly work on evaluation, they employ a cross-national design for predicting evaluation research output, assessing the relative impact of country-specific research output in evaluation research.

In recent years, applications of machine learning and (more complex) deep learning models in the practice of evaluation have become more widespread. For example, the Independent Evaluation Group (IEG), one of the early adopters of data science applications in evaluation, has applied these tools in the analysis of textual data in portfolio identification exercises and content analysis (for example, Franzen et al. 2022), as well as of imagery data in poverty mapping and geospatial impact evaluation (for example, Ziulu et al. 2022).

Potential in Evaluation

The use of machine learning approaches in evaluation is still in its early stages but shows significant potential, not only as part of advanced text analytics but also in the use of other data, such as imagery data. Regarding advanced text analytics, machine learning techniques can be used to process and analyze text documents by automatically coding and categorizing key issues in the documents. For example, machine learning can be used to extract common challenges across the various sectors studied and map the evolution of obstacles over time. Machine learning applications can provide at least two significant advantages over manual approaches in the context of evaluation. First, they can systematically explore large or growing data sources (such as archives or document repositories), analyzing quantities of information that would be prohibitively time consuming for human coders, and they can do so without bias toward or against certain issues. The weight given to the various traits these applications discover will therefore be directly related to the presence or absence of those traits in the data. This attribute is quite valuable in evaluation, as assessments should reflect as closely as possible the underlying features of the evidence examined, without the subjective biases or unintended variations that different human coders might introduce.

Second, automated machine learning applications can continue to improve their assessments as new evaluative data are introduced. As a result, their output represents a “living” classifier: new categories and implementation challenges will be added, updated, and removed as the body of data assessed changes over time. In the case of the work presented in this paper, for example (see chapter 2), the use of machine learning applications allows real-time learning and adaptation by the model in response to evaluator output and the integration of project lessons in practice. Over time, as new data are integrated into supervised analysis, a positive feedback loop can develop between evaluation and practice, allowing future projects to integrate generalizable and context-specific lessons into their design and implementation. This ability to learn and adapt can provide notable efficiency gains relative to manual coding.
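The "living classifier" idea described above can be sketched as incremental updating: newly coded records fold into the model's running counts, and new categories appear as soon as evidence for them exists, rather than requiring full retraining. The class, method, and category names below are hypothetical.

```python
# A minimal sketch of a "living" classifier skeleton: each newly
# labeled record updates running counts, so the set of categories
# grows and shifts with the evidence. Names are illustrative only.
from collections import Counter, defaultdict

class LivingClassifier:
    def __init__(self):
        self.word_counts = defaultdict(Counter)
        self.label_counts = Counter()

    def learn(self, text, label):
        """Fold one newly coded record into the model's counts."""
        self.label_counts[label] += 1
        self.word_counts[label].update(text.lower().split())

    def labels(self):
        """Categories currently supported by at least one record."""
        return sorted(self.label_counts)

model = LivingClassifier()
model.learn("procurement delays in year one", "implementation")
model.learn("new safeguard category emerged", "safeguards")
# A category exists only once evidence for it has been coded.
print(model.labels())  # ['implementation', 'safeguards']
```

Because counts accumulate rather than being recomputed from scratch, each evaluator-coded record immediately refines subsequent predictions, which is the feedback loop between evaluation and practice described above.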

The application presented in this paper focuses on the extraction and classification of implementation challenges from private sector evaluation reports using machine learning techniques. In many ways it is similar to the Delivery Challenges in Operations for Development Effectiveness platform developed for public sector operations by the Global Delivery Initiative. The Delivery Challenges data set uses Implementation Completion and Results Reports from completed projects to generate a taxonomy of common issues that have an impact on project performance. Practitioners can then use insights from the data set to improve implementation and supervision outcomes.3 The experiment outlined in this paper offers a similar output for private sector operations, generating a set of implementation challenges representing specific obstacles encountered in the project cycle.

  1. For example, one particular type of unsupervised method (topic modeling) can be used to extract central themes and topics from documents, something that can be useful for parsing as well as classification (Blei 2012).
  2. For example, unsupervised methods can be used to identify a latent construct represented in clusters of text that contain common words related to a particular construct, such as women’s empowerment, poverty, or democracy.
  3. For more on the taxonomy, see Ortega Nieto, Hagh, and Agarwal (2022).