
Meta-Evaluation of IEG Evaluations

Chapter 4 | In-Depth Review of Evaluations

Evaluation question 4. What are the results of the in-depth review of the eight selected IEG evaluations?

This chapter presents the results of the in-depth review of the eight IEG evaluations selected in the sample. The evaluations were appraised according to the seven attributes distinguished in the framework. The results from this analysis are laid out below.1

Attribute 1: Scope and Focus

The first attribute in the in-depth review of evaluations focuses on the delimitation of the scope, focus, and context in which the evaluations operated. The attribute examines the evaluations’ rationale and the clarity with which evaluation questions are formulated. Particular attention is given to issues of complexity (including the complexity of the evaluand). Given that IEG evaluations often address portfolios of up to hundreds of projects and interventions in multiple countries—portfolios that are often multilevel, multiactor, and multisite in nature—it is crucial that evaluations carefully specify the rationale, scope, and questions studied.2

This attribute also gauges the extent to which evaluation questions are clear and focused instead of manifesting a “bag-of-questions” approach.3 To assess the focus and clarity of questions used in the sample of evaluations, the meta-evaluation drew on previous literature to distinguish between the types of questions typically employed in the context of evaluation.4 Such questions can be disaggregated into five categories: descriptive, exploratory, evaluative, explanatory, and design oriented.

Descriptive questions provide a summary of the state of affairs in a given field, society, or organization. Exploratory questions focus on garnering a better understanding of a topic or development. Evaluative questions deal with the development, implementation, and consequences of policies, programs, or interventions of major organizations. Such questions typically focus on the relevance, effectiveness, or efficiency of interventions. Explanatory questions focus on clarifying the impact and effectiveness of programs or policies, including any side effects that may arise from such interventions. Finally, design-oriented evaluation questions address the development of new intervention designs, including the characteristics of programs, evaluation systems, common property regimes, common pool resources, and so forth. Appendix F categorizes the evaluation questions listed in the evaluations from the sample according to these categories.

Most of the overarching questions cited in the sampled evaluations were descriptive, evaluative, or (to a lesser extent) design oriented. Evaluation questions were almost never formulated in the exploratory or explanatory style. Some questions turned an explicit eye to the future, delineating the design-oriented steps the Bank Group could take, whereas others did not. Though the evaluations reviewed in the sample generally fared well in clearly outlining their scope, the meta-evaluation nonetheless found that evaluation questions were not always brought together in a cohesive manner. Some evaluations did not integrate questions in an accessible section or paragraph. In other cases, it was not immediately clear which questions were more central or how the questions related to one another.5 The issue was raised in several interviews with IEG staff, who noted that the bag-of-questions approach was a suboptimal means of focusing the scope of evaluations.6

All eight Approach Papers were rated as adequate with respect to this attribute. Six of the evaluation reports were rated as adequate, and two received a score of partial. The vignettes below provide greater detail on the ratings and how specific projects fared with respect to this attribute.

The International Finance Corporation’s Approach to Engaging Clients for Increased Development Impact (FY18) provides a useful example of adequate scope and focus considerations (World Bank 2018f). The evaluation distinguished between the three complementary modalities the International Finance Corporation (IFC) has employed: client-focused partnerships, programmatic interventions, and country-focused interventions.7 The report investigated the effectiveness of IFC’s approaches to client engagement between FY04 and 2016, providing a clear delineation of the evaluation’s scope: “Given the importance of the first modality, the report’s focus is on client-focused partnerships” (5). This focus was justified by IFC’s engagement with long-term clients, helping them enter new markets and enhance their contribution to the organization’s strategic priorities. The central outcome was likewise clearly defined as “increasing its developmental impact” (7).

World Bank Group Support to Health Services: Achievements and Challenges (FY18) provides another useful example of adequate scope (World Bank 2018g). The evaluation aimed to fill “an evaluative evidence gap in the health sector” (xi) and was the first comprehensive health sector evaluation carried out by IEG since 2009. In laying out its scope, the evaluation made sure to clearly delineate the many complexities of the health field, its myriad actors, as well as the interconnected systems and operations within it. In particular, it recognized and responded to the political economy of health systems and the challenges in using monitoring data to interpret progress toward health outcomes.

Conversely, Higher Education for Development: An Evaluation of the World Bank Group’s Support (FY17) listed the following as its overarching question: “How has the World Bank Group’s support to higher education contributed to its twin goals of poverty reduction and shared prosperity?” (59). This overarching question was then divided into three subquestions (for example, “Is the World Bank Group’s support for higher education consistent and well articulated?”) and 13 further subcomponents. A somewhat similar situation was found in Growing the Rural Nonfarm Economy to Alleviate Poverty (FY17), which cited two overarching questions, four subquestions, and eight subcomponents. Both examples resemble the bag-of-questions approach noted above.

Overall, the meta-evaluation found that all reports and Approach Papers provided a good range of evaluation questions. However, the sheer number of questions and subquestions listed in some reports (more than 50 across the sample of eight evaluations) in some instances led to a fragmentation of focus. For example, at times 1 or more overarching questions were followed by 10 or more subquestions.

The assessment of evaluation focus also demanded a brief examination of the role of portfolio review and analysis in structuring the scope of IEG evaluations. Portfolio review (to a large extent) is a standardized (if not routine) activity in IEG evaluations. While portfolio-based work has its merits, in certain cases it can reduce the focus and specificity of evaluations. IEG evaluation teams tend to spend a significant amount of time on the identification and description of the portfolio.8 In addition, due to the sheer number of projects (and underlying interventions), effectiveness analysis often focuses on project performance indicators instead of developing a causal analysis of impact. Weaknesses in the system (such as poor-quality outcome indicators)9 can reduce the utility of this type of analysis.

Taken together, the meta-evaluation noted that the information presented in reports and Approach Papers was rather elaborate and relevant: as such, nearly all evaluations scored adequately on this attribute. All reports and Approach Papers paid attention to evaluation questions to guide their assessment: the reports examined in the sample of eight evaluations listed more than 50 evaluation questions and subquestions in total. Usually 1 or more overarching questions were formulated, but certain evaluations subsequently added more than 10 subquestions, resembling a bag-of-questions approach to scoping. Portfolio analysis was used as a standard operation in characterizing and structuring the scope and focus of evaluations.

However, the scope of some IEG evaluations tended to be overambitious and diluted due to two aspects: First, the complexity of the evaluand, especially in terms of the number of and diversity in countries and projects in the portfolio, motivated a broadening of the scope in some instances. Second, this complexity was further amplified due to the multisite, multilevel, and multiactor nature of the interventions supported by the Bank Group (especially in case of the World Bank).

Attribute 2: Reliability

In an IEG blog post by Vaessen (2018), reliability is described as “the idea that if one would repeat the analysis it would lead to the same findings. Even though replicability would be too ambitious a goal in many (especially multilevel, multisite, multiactor) evaluative exercises, at the very least transparency and clarity on research design … should be ensured to enhance the verifiability and defensibility of knowledge claims.”10 The meta-evaluation focused on six sections related to evaluation reliability: evaluation design, data collection, data analysis, synthesis, limitations discussed, and limitations addressed. Of the eight Approach Papers, two were rated adequate, five partial, and one inadequate with respect to this attribute. Of the corresponding evaluation reports, three were rated adequate, four partial, and one inadequate.

The meta-evaluation specifically focused on four topics pertinent to reliability: use of the evaluation design matrix (EDM), the number of methods used in each evaluation, discussions of possible limitations, and the triangulation and synthesis of evaluative evidence. These will now be explored in sequence.

The first topic examines the way in which the EDM is used in evaluations. In terms of the attention paid to methodological approaches, the introduction of the EDM has been an important development, contributing to more transparent and structured evaluations. This view was also reflected in several of the interviews conducted for the meta-evaluation. The EDM provides an essential structure to the evaluation’s questions, methods, rationales, and sources, incentivizing evaluators to think through the methods and sources that should be used in evaluative analysis.
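
To make this structure concrete, the following is a minimal sketch (in Python) of how a single EDM row might tie a question to its methods, data sources, and rationale. The question, method names, and sources are hypothetical illustrations, not drawn from any IEG template.

```python
# Minimal illustrative sketch of one evaluation design matrix (EDM) row.
# The question, methods, sources, and rationale below are hypothetical examples,
# not taken from an actual IEG evaluation.
from dataclasses import dataclass, field
from typing import List


@dataclass
class EDMRow:
    """One row of an EDM: an evaluation question tied to methods, sources, and rationale."""
    question: str
    methods: List[str] = field(default_factory=list)
    data_sources: List[str] = field(default_factory=list)
    rationale: str = ""


edm = [
    EDMRow(
        question="To what extent did the portfolio achieve its stated outcomes?",
        methods=["portfolio review", "country case studies"],
        data_sources=["project completion ratings", "case study interviews"],
        rationale="Triangulate ratings-based evidence with field-level evidence.",
    ),
]

for row in edm:
    print(row.question)
    print("  methods:", ", ".join(row.methods))
    print("  sources:", ", ".join(row.data_sources))
    print("  rationale:", row.rationale)
```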

The evaluation on health services provides an illustrative example of the benefits of the EDM. The report adequately specifies key facets of data collection and analysis, addressing the relevant data architecture used, the theory of change (including intervention-specific theories of change), systematic reviews of existing research, and the range of methods required to address the evaluand. These include document analysis, case studies, interviews, statistical modeling, and social network analysis. The EDM proves particularly useful in justifying the use of specific methods, indicating how they are to be used and the ways in which evaluative evidence from each will be triangulated and synthesized. This was evident across the country case studies, where country-level findings were cross-validated with those from the portfolio and literature reviews.

However, in certain cases the EDM was treated as little more than a list of “evaluative instruments” such as questionnaires, interview topic lists, consultations, project portfolio reviews, statistics, and similar tools. Such reports often do not make a distinction between the “instruments” used for data collection and those used for data analysis. They also seldom discuss evaluation design, instead focusing largely on individual methods. White (2013) discusses these distinctions in detail. “Although the terms ‘research methods’ and ‘research design’ are often used interchangeably, there are important differences between the two. The essence of developing a research design is making decisions about the kinds of evidence required to address your research questions (de Vaus 2001). Research design is not about the logistics of research—how the data are collected, for example—but rather about the logic of inquiry, the links between questions, data and conclusions.”11

Learning and Results in World Bank Operations: Toward a New Learning Strategy (FY15) provides an example of this (World Bank 2015b). In this report, IEG developed a survey instrument to assess the type and quality of evidence on project efficacy, applying it to implementation completion and results reports that discussed experiments, quasi-experimental approaches, and other approaches in line with the literature on evidence hierarchies. The evaluation appendix referred to a “results framework” and several “evaluation instruments” such as seven country case studies, surveys, and semistructured interviews with 50 World Bank staff.12 In addition, the evaluation listed a series of other methods, including an analysis of staff mobility across sectors and regions (using roughly 20,000 individual records from the World Bank’s Time Recording System), as well as a content analysis of responses to an open-ended question in the first Global Practices and Cross-Cutting Solutions Areas Rapid Survey. However, the evaluation made no mention of how insights from this rather large battery of methods and data were synthesized or triangulated.

The second topic addresses the number of methods used in each evaluation. In some cases, up to 10 methodological approaches were deployed, some of which were obtrusive (interviews, surveys, focus groups, consultations) and others unobtrusive (documentary evidence, basic statistics, country-focused evaluations, review of project-level evaluations, and so on). This raised concerns that the proliferation of methodological approaches may come at the expense of considering which methods are most appropriate or useful in terms of each evaluation’s scope and context.13

The third topic addresses the extent to which the limitations of evaluations (including “shoestring” conditions) were discussed.14 A well-developed discussion of limitations can positively impact the scope, breadth, and depth of the evaluation. Most of the evaluations examined in the sample fared well with respect to this factor, addressing limitations in a meaningful and convincing manner. The evaluation on Carbon Markets for Greenhouse Gas Emission Reduction in a Warming World (FY18) presents a good example of this (World Bank 2018a). The report lists six potential limitations, taking care to explain how each was addressed in the evaluation. The evaluation further addressed specific limitations related to each of the methods used, including portfolio analysis (appendix B of the report), causal analysis (appendix C of the report), and econometric analysis (appendix D of the report).

Finally, the fourth topic addresses the triangulation and synthesis of evaluative evidence. The combination of different methodological approaches can facilitate the corroboration of findings. However, a multifaceted research design can expose unforeseen contradictions and nuance. Though triangulation and synthesis are essential in both respects, the meta-evaluation noted that coverage of this facet could be improved. The point was further raised in several of the interviews. That said, several of the reviewed evaluations showed an excellent integration of triangulation and synthesis. For instance, in the health services evaluation report, “triangulation [was] applied at multiple levels, first by cross-checking evidence sources within a given methodological component. For instance, within country case studies interview findings were compared across types of stakeholders (Bank Group staff, government officials, academia, health experts, and other development partners). Second, triangulation across evaluation components—for example, cross-validating findings from country-level case studies with findings from portfolio analysis and literature reviews” (World Bank 2018g, 77). The evaluation also took steps to triangulate evidence across the portfolio analysis, the country case studies, and the intervention case studies of delivery mechanisms for the case of the World Bank’s response to pandemics. The evaluation on the rural nonfarm economy also provided an example of triangulation, pointing out that the structured literature reviews played a central role in guiding the analysis of project documents and data.

Taken together, the meta-evaluation found that most evaluations in the sample performed relatively well in terms of the attributes of reliability outlined above. The integration of the evaluation design matrix was touted as a major improvement in design, clarifying the role of individual methods and enhancing the general reliability of evaluations. The meta-evaluation also found that the use of the EDM had increased in recent years, indicating a positive development with respect to reliability. While the large number of methods used in certain evaluations raised some questions about the adequate use of triangulation and synthesis of findings, in other evaluations this issue was handled in a clear and satisfactory manner.

Attribute 3: Construct Validity

The concept of construct validity originated in psychological research. However, as Strauss and Smith (2009) have shown, this concept has been broadened to cover the operationalization of key concepts and relationships in other forms of research.15 In the context of evaluation, construct validity relates, among other things, to the theory of change or intervention logic used in the conceptualization and delimitation of the evaluand. Bamberger et al. (2004) define construct validity as “the adequacy of the constructs used to define processes, outcomes and impacts,” including “the indicators of outputs, impacts and contextual variables.”16 Specifically, the assessment focuses on three facets of construct validity: attention paid to the identification and operationalization of core concepts or variables, the ways in which theories of change or intervention logics are used, and the integration of existing (academic) research through structured reviews.17 Of the eight Approach Papers reviewed, three were rated as adequate and five as partial. Of the corresponding evaluation reports, four were rated adequate and four as partial.

Most evaluations pay attention to the identification of core concepts, usually defining them in a supplemental glossary. Relatively few evaluations provide a dedicated operationalization of core concepts. The learning and results evaluation presents an interesting example of this discrepancy. The evaluation drew heavily on World Development Report 2015: Mind, Society, and Behavior, which incorporated insights from cognitive, social, psychological, and neuroscience studies to better understand learning in Bank Group operations. The evaluation defines the various types of learning and knowledge used in the analysis of operations. The evaluation also outlines the EAST principles to encourage behavior change, along with some behavioral reactions like forming, storming, and norming.18 Some concepts like signaling are not formally operationalized but can be deduced from the context in which they are used.19

Turning to theories of change and intervention logics, the meta-evaluation noted that all evaluations in the sample included some type of theory. Three main approaches to the use of theories of change were identified in the review.

The first approach involved the presentation of an overarching “causal” framework, often distinguishing among inputs, activities, outputs, and outcomes. The framework often directed or restricted the analysis to specific instruments, their intended results, and (at a high level) related economic, sociological, or policy factors. While the exact relationships between the steps of the theory were usually not fully articulated or empirically tested, the theory nevertheless offered a sense-making framework aimed at deconstructing the complex evaluand under consideration.20

Two examples illustrate this approach. The higher education evaluation presented a conceptual model (the “evaluation framework for higher education”) of Bank Group support in this field (World Bank 2017d, 73). In practice the model resembled a logic model, distinguishing among inputs, outputs, and outcomes without delving into the mechanisms explaining the occurrence of events.21 While the logic model structured the evaluation, it did not serve as a full conceptual model in terms of testing, validating, and assessing points of departure. Similarly, Mobile Metropolises: Urban Transport Matters: An IEG Evaluation of the World Bank Group’s Support for Urban Transport (FY17) provided a theory of change visualizing the links between activities, outputs, intermediate outcomes, and development outcomes (World Bank 2017e). The theory of change also listed eight “enabling factors” such as culture, human capacity, and macro stability; however, the specific relationships between these factors and outcomes were not explicitly specified. Once again, the theory of change resembled a logic model, “reflecting how the World Bank Group’s strategy and sectoral leadership posited that its interventions would contribute to desired outcomes and impact. The emergent elements became focal points of the evaluation, reflected in its chapter organization” (60).22

The second approach to formulating and using theories of change involved presenting a substantive intervention logic, often expanding on the underlying package of interventions in a more rigorous empirical manner. Particular attention was paid to mechanisms (behavioral, cognitive, economic, institutional) that can alter the impact of projects, investments, and other interventions. In the sample of IEG evaluations selected for review, three were identified as employing such an intervention theory.

In the evaluation on IFC client engagement, the theory of change reconstructed how “the objectives sought by IFC’s approach to client engagement were expected to improve client outcomes and IFC’s development impact, as the concept evolved over a series of IFC strategy documents” (World Bank 2018f, 55). The theory of change was then tested, with special focus placed on mechanisms like the targeting of selected companies as long-term partners. IFC supported these entities “with dedicated client relationship teams to provide them with … specialized local knowledge and contacts [to] assist with regulatory issues and mitigation of political risk” (59). Such interventions helped develop transactions that advanced IFC’s strategic objectives, triggering behavioral changes and promoting intangible benefits such as a deeper understanding of client needs and improved access to key client decision-makers.23

In the health services evaluation, the approach relied on a search of relevant literature to develop four specific intervention-related theories of change: conditional cash transfers (CCT), performance-based financing, pandemic preparedness and control, and public-private partnerships (World Bank 2018g). Next, these intervention theories were supported with evidence from Bank Group sources (portfolio data) and existing evaluation literature. For the CCT theory of change, the analysis addressed the degree to which Bank Group support for CCTs in health services had effectively contributed to the achievement of relevant health services-related goals (see figure E.1).

The framework integrated the following assumptions:

  1. The beneficiaries of CCT programs are currently underusing existing health services.
  2. The existing supply of services is sufficient to accommodate increasing demand.
  3. The beneficiaries of CCT programs are aware of the program and correctly informed about eligibility and available benefits.
  4. The cash transfers received are used to finance health services and improve food consumption as opposed to detrimental products like tobacco and alcohol.
  5. The transfers are sufficiently generous to incentivize compliance with the required conditionalities.
  6. The design features of the CCT (enrollment, verification of conditionalities, cash transfer management) are credible means of producing the desired behavioral changes.

The theory was tested against existing literature, including some 30 impact evaluation studies on CCT programs.

The health services evaluation also featured a pandemic preparedness and control theory of change, which was used to structure Bank Group activities conducive to the realization of effective pandemic preparedness and mitigation strategies (World Bank 2018g; see figure E.11). The theory of change noted that such responses required a collective global health response aimed at fulfilling four critical conditions: surveillance, protection of the population, effective outbreak response, and communication.24 Like the analysis of CCTs, the theory of change laid out several assumptions necessary for the achievement of the desired outcomes:

  1. Frontline human resources would continue to provide essential health services even under increasing risk of contagion.
  2. The population and the health workforce would respond to behavior change interventions (for example, information and incentives).

Having laid out this framework of interventions and assumptions, the evaluation then compared outcomes from the Bank Group portfolio with the theory of change.

Finally, the urban transport evaluation paid attention to the “two lenses” of behavior change and service delivery in an appendix (World Bank 2017e). For the topic of behavior change, a model rooted in neoclassical and behavioral economics was developed, showing that such change is dependent on communication, availability of resources, information on incentives, social factors, and psychological factors.25 The model was then tested on a random sample of World Bank urban transport projects, drawn from the larger urban transport portfolio under review. The main objectives of this review were to (i) explore the extent to which information on behavior change is available in project documents, (ii) analyze how behavior change is described and operationalized in project documents, and (iii) assess the quality of the information provided in project documents (140). Likewise, the issue of service delivery was assessed using a theoretical framework applied to a random sample of 68 World Bank investment operations drawn from core World Bank operations identified by the urban transport evaluation (149).

The third approach to formulating and using theories of change involved a combination of a general theory of change underlying a “macro-level” complex evaluand (that is, a thematic or sectoral portfolio) and one or more “nested” theories within this broader theoretical framework. Given its expansive scope, the broader theory of change is not a testable theory and serves as a broad sense-making framework (see previous discussion). As such, only the nested theory is empirically tested in this approach. The carbon finance evaluation provides an excellent example of this approach (World Bank 2018a). The overarching theory of change was “developed around the four main roles of carbon finance (CF), shaped by the changes in global needs and priorities, with a focus on the following components: (i) creating and developing markets, (ii) innovating carbon finance; (iii) building capacity of the clients; and (iv) thought leadership and convening” (85). The approach resembles a more general or synthetic theory of change, listing outputs and outcomes that could emerge from CF interventions in relation to the four key components listed (see figure 1.1 on page 6 of that report).

The evaluation also offered a nested theory on Emission Reduction Purchase Agreements (ERPA) under the general assessment of carbon markets (World Bank 2018a). The ERPA theory of change “fits squarely the logic of what Trochim (1985) popularized as Pattern Matching” (125; figure C.1). The nested ERPA theory was “tested based on new empirical evidence. The empirical strategy retained for this study consisted of a combination of two case-based methods that have a comparative advantage in providing robust evidence for causal analysis: process tracing and QCA applied to 16 cases of ERPAs. For each case, the evaluation team traced the contribution of the Bank Group, the project entity, and other critical actors throughout the process of development, implementation, and follow-through of each ERPA. Data collection was broadly meant to include document review, field visits, and a series of interviews with the key stakeholders engaged throughout the ERPA cycle and beyond. Patterns of convergence and divergence across cases were systematically analyzed, using the logic of QCA, ultimately forming a robust empirical base” (125).
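
To illustrate the logic of crisp-set QCA and pattern matching referred to above, the following is a minimal sketch in Python using entirely hypothetical conditions and case codings (not the evaluation’s 16 ERPA cases or its actual conditions). Cases are reduced to a truth table, configurations consistently associated with the outcome are identified, and a single pass of Boolean minimization merges configurations that differ on only one condition.

```python
# Minimal illustrative sketch of crisp-set QCA (csQCA) with hypothetical cases;
# not the evaluation's actual data or tooling.
from itertools import combinations

# Binary conditions (1 = present, 0 = absent); names are hypothetical.
CONDITIONS = ["bank_support", "client_capacity", "stable_regulation"]

cases = {
    "case_01": {"bank_support": 1, "client_capacity": 1, "stable_regulation": 1, "outcome": 1},
    "case_02": {"bank_support": 1, "client_capacity": 1, "stable_regulation": 0, "outcome": 1},
    "case_03": {"bank_support": 0, "client_capacity": 1, "stable_regulation": 1, "outcome": 0},
    "case_04": {"bank_support": 1, "client_capacity": 0, "stable_regulation": 0, "outcome": 0},
}

def truth_table(cases):
    """Group cases by their configuration of conditions and record the observed outcomes."""
    table = {}
    for data in cases.values():
        config = tuple(data[c] for c in CONDITIONS)
        table.setdefault(config, []).append(data["outcome"])
    return table

def sufficient_configurations(table):
    """Configurations whose cases all display the outcome (perfect consistency)."""
    return [config for config, outcomes in table.items() if all(outcomes)]

def minimize(configs):
    """One pass of Boolean minimization: merge configurations differing on a single
    condition, marking that condition as irrelevant ('-')."""
    reduced, used = set(), set()
    for a, b in combinations(configs, 2):
        diff = [i for i in range(len(CONDITIONS)) if a[i] != b[i]]
        if len(diff) == 1:
            merged = tuple("-" if i == diff[0] else a[i] for i in range(len(CONDITIONS)))
            reduced.add(merged)
            used.update({a, b})
    reduced.update(c for c in configs if c not in used)
    return reduced

table = truth_table(cases)
sufficient = sufficient_configurations(table)
print("Sufficient configurations:", sufficient)
print("Minimized terms:", minimize(sufficient))
```

In this toy example, the two sufficient configurations differ only on the third condition, so minimization returns a single term indicating that the first two conditions jointly suffice for the outcome regardless of the third; the same comparative logic, applied to real cases and supported by process tracing, underpins the evaluation’s cross-case analysis.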

The meta-evaluation’s assessment of construct validity concluded with an appraisal of the integration of existing (academic) research through structured reviews. Several excellent examples were found among the eight reports assessed. In appendix J of World Bank Group Support to Electricity Access (FY15), a structured literature review was presented on “access to electricity for improving health, education and welfare in low- and middle-income countries” (World Bank 2015d, 128). The review served the primary objective of critically analyzing and synthesizing existing evidence to answer the following question: What is the impact of electricity access on health, education, and welfare outcomes in low- and middle-income countries?

In the health services evaluation, existing research was integrated through an evidence gap map (World Bank 2018g). “The evaluation used [evidence gap maps] EGMs to identify knowledge gaps on the effects of selected interventions on expected health outputs and outcomes commonly targeted by World Bank Group projects according to portfolio review evidence… The searches resulted in a total of 5,506 citations coming from the Cochrane Database of Systematic Reviews and others” (73).26
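
As a rough illustration of how an evidence gap map organizes a body of research, the following sketch (in Python, with hypothetical interventions, outcomes, and study counts rather than the evaluation’s 5,506 citations) tallies identified studies by intervention-outcome pair and flags empty cells as gaps.

```python
# Minimal illustrative sketch of an evidence gap map (EGM) structure;
# interventions, outcomes, and counts are hypothetical, not the evaluation's data.
from collections import defaultdict

# Each cell counts the studies found for an intervention-outcome pair;
# empty cells flag evidence gaps.
egm = defaultdict(lambda: defaultdict(int))

identified_studies = [
    ("conditional cash transfers", "service utilization"),
    ("conditional cash transfers", "child health outcomes"),
    ("performance-based financing", "service utilization"),
]

for intervention, outcome in identified_studies:
    egm[intervention][outcome] += 1

interventions = sorted(egm)
outcomes = sorted({o for cells in egm.values() for o in cells})

for intervention in interventions:
    for outcome in outcomes:
        count = egm[intervention][outcome]
        print(f"{intervention:30s} | {outcome:22s} | {count if count else 'GAP'}")
```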

The carbon finance evaluation also made use of this method, using it to better understand the function of the Clean Development Mechanism (CDM), “the major international offset mechanism within the broader world of carbon finance” (World Bank 2018a, 164). The CDM was designed to lead to significant emission reductions “that will help reduce the cost of climate mitigation in countries with commitments as well as contribute to sustainable development in the host countries” (164). As background for the evaluation, IEG carried out a structured literature review on the generation of local community co-benefits from CDM projects.

While the examples listed above showcased the integration of existing research in evaluations, it should be noted that the use of structured literature reviews was not considered standard practice during the period examined (FY15–19). For instance, the higher education evaluation referred to the use of literature in only one section, reviewing “the existing academic and policy literature to provide a better understanding of current thinking about the sector” (World Bank 2017d, 73). Evidence from interviews indicates that structured literature reviews have become more widely used since their “introduction” in 2016.

In summary, the meta-evaluation noted adequate coverage of construct validity issues in the sample of evaluations appraised. The evaluations paid close attention to the definition of key concepts and took steps to outline a meaningful theory of change. At the same time, more attention could be paid to the operationalization of concepts (including the key variables and measurement instruments used): coverage of this facet was less visible in the eight reports reviewed.

As noted above, the reports generally took one of three approaches to formulating a theory of change guiding evaluations. In the first approach, a conceptual framework was used to delineate the inputs, activities, and outputs that enable or restrict outcomes of interest. The frameworks usually served as sense-making frameworks to better understand the often-complex elements underlying the evaluand (for example, as a result of the time period assessed, number of projects examined, and so on). The second approach involved the development of a substantive theory of change that underlies more specific interventions, confronting that theory with evidence from the empirical part of the evaluation. Particular attention was paid to the mechanisms underlying particular interventions. The third approach combined a more general theory of change (covering Bank Group activities on a macro level) with one or more nested theories of change, the latter of which were empirically tested.

The coverage of theoretical frameworks illuminated a potential area of growth for future IEG evaluations: while all the evaluations outlined their underlying intervention logics, more could have been done to link them to the empirical part of the studies.27 Furthermore, capturing insights from existing research and evidence through the adoption of structured literature reviews as a standard practice in evaluation seems to be gaining ground in IEG’s evaluative work. The sample provided several excellent examples highlighting the benefits of this practice.

Attribute 4: Internal Validity

In IEG’s self-evaluation systems evaluation, internal validity was defined as “how well an assessment tool measures what it is intended to measure” (World Bank 2016a, viii). Like accuracy, internal validity also refers to the degree of confidence in the causal or contributory relationship being evaluated, as well as the assurance that findings were not influenced by external factors. In other words, internal validity concerns the extent to which a study establishes a trustworthy causal (attribution) or contributory relationship between interventions and outcomes, including the degree to which possible alternative explanations are addressed and explored.

Internal validity is particularly important given the scope and complexity of IEG evaluations. Conventional threats to internal validity (for example, attrition, maturation) can be exacerbated by the inherent complexity of the evaluand, a notable concern given that the evaluations reviewed in the meta-evaluation often covered hundreds of projects spread over dozens of countries. The meta-evaluation’s assessment of internal validity focused on four attributes: the extent to which issues of causality, attribution, and contribution were discussed, the degree to which causal questions were adequately addressed by the methods employed, the level of attention paid to unintended effects, and the discussion of internal validity concerns relative to the validity of findings.

Of the eight Approach Papers reviewed, two were scored as adequate, three as partial, and three as inadequate. Of the corresponding evaluation reports, two were rated adequate, five as partial, and one as inadequate. Some of the strengths and weaknesses related to internal validity are outlined through the examples highlighted below.

As noted in the discussion of construct validity above, the carbon finance evaluation included a well-developed nested theory of change, along with a pattern-matching exercise and a case study design for causal analysis (World Bank 2018a). The case study design consisted of the following steps assuring internal validity:

First, for each of the 16 cases, we traced the process of change at play throughout the 15 steps of the theory of change (developed in detail in a separate common template for data collection; the main steps are shown in appendix C.1) and the causal contribution of the World Bank Group and other contributory actors and factors, with rich and deep description.

Second, a systematic analysis of patterns of convergence and divergence across cases for each step of the causal chain was performed.

Third, the empirical patterns emerging from the cross-case comparison were linked to the theory of change, checking for match and mismatch.

Fourth, given the causal complexity underlying the explanation of the five main outcomes of interest, the team resorted to crisp-set QCA to formally test the theory of change. Crisp-set QCA is a well-established technique which resorts to Boolean minimization to “simplify complex data structures in a logical and holistic manner” (World Bank 2018a, 126).

The structured literature review on the CDM also produced relevant insights on causality and contribution (World Bank 2018a). Finally, the econometric study assessed the Bank Group’s effectiveness “in reducing greenhouse gas emissions through its support to the Clean Development Mechanism (CDM) interventions” (144). The evaluation combined several approaches and empirical strategies that constituted a convincing causal narrative, supporting the internal validity of the findings.

In the health services evaluation, the complexity of assessing internal validity was discussed in depth:

“Although overall portfolio analysis exploited the breadth of the evaluable material, IEG acknowledges that the assessment of project effectiveness through outcomes ratings challenges the internal validity of the evaluation findings. First, outcome ratings used in the portfolio analyses are based on incomplete samples of closed projects. Second, when available, outcome ratings tend to be a biased measure of the overall projects’ success. Third, the team recognizes that IFC [investment services] IS, IFC [advisory services] AS and World Bank project financing define and monitor objectives differently, therefore direct comparison between interventions with regards to the ratings of project outcomes and [project development objective] PDO’s efficacy should be considered with caution.” (World Bank 2018g, 78)

Though not focusing on internal validity per se, the evaluation took pains to ensure the validity of findings, “including consultations with World Bank Group staff, use of specific protocols and coding templates … and intercoder reliability and quality control measures to guarantee a consistent approach to coding and analysis across evaluation components and across team members” (World Bank 2018g, 77).
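
Where intercoder reliability is assessed quantitatively, a chance-corrected agreement statistic such as Cohen’s kappa is a common choice. The sketch below (in Python, with hypothetical coders and document codes, not the evaluation’s actual coding data) shows one simple way such a check can be computed.

```python
# Minimal illustrative sketch of an intercoder reliability check (Cohen's kappa);
# the coders and codes below are hypothetical, not the evaluation's coding data.
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders assigning categorical codes to the same items."""
    assert len(coder_a) == len(coder_b)
    n = len(coder_a)
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in set(freq_a) | set(freq_b))
    return (observed - expected) / (1 - expected)

# Hypothetical codes applied independently by two coders to ten documents.
coder_a = ["adequate", "partial", "partial", "adequate", "inadequate",
           "adequate", "partial", "adequate", "partial", "inadequate"]
coder_b = ["adequate", "partial", "adequate", "adequate", "inadequate",
           "adequate", "partial", "partial", "partial", "inadequate"]

print(f"Cohen's kappa: {cohens_kappa(coder_a, coder_b):.2f}")
```

Kappa corrects raw agreement for the agreement expected by chance, so values close to 1 indicate that coding was applied consistently across team members.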

The report also noted that the use of outcome ratings in intervention-type case studies presented additional challenges related to the complexity of health projects (World Bank 2018g). Given that health projects are usually composed of multiple overlapping interventions, project outcome ratings can become a rather imperfect measure of the effectiveness of each specific intervention. The evaluation was further complicated by the fact that relatively few closed projects were available for assessment, offering a limited sample for the inference of Bank Group contributions to health outcomes.

The evaluation on growing the rural nonfarm economy presented another interesting vignette with respect to internal validity (World Bank 2017c). An appendix on community-based approaches reviews interventions in terms of their objectives, targeting, metrics, and results. The review is critical with regard to the design of a number of projects, what was measured (often unclear), the completeness of data (often incomplete), how data were treated, and which methods were used. Some of the criteria evaluated were in line with “evidence or design hierarchies” that evaluators use to separate the valuable from the useless when addressing internal validity.28

The IFC client engagement evaluation took several steps to ensure that a consistent approach was taken by the evaluation team members—for example, using a case study template and interview protocols to ensure a common framework and evaluative lens across studies (World Bank 2018f). The evaluation also demonstrated empirically (through an econometric analysis of client learning versus selection) a self-reinforcing selection effect through which client quality and strategic fit promoted a gradual deepening of relationships into a de facto strategic engagement.

It should be noted that several of the evaluations examined in the sample were less successful in addressing issues related to internal validity, engaging in a limited discussion of causality or contribution. For example, the electricity access evaluation made numerous references to effectiveness and impact, but there was never an explicit discussion of causality or contribution issues (World Bank 2015d). Self-reported achievement of project objectives (some measured at output or direct outcome levels) was equated with impact, establishing a line of argumentation that does not apply in situations where human behavior is crucial to making the infrastructure work (for example, through interactions with human dimensions such as awareness, education, gender responsiveness, accessibility, and so on).

While the higher education evaluation made the limitations of the underlying evidence base explicit, the report still advanced largely unfounded higher-order causal claims (World Bank 2017d). Though the evaluators’ instincts may be correct with respect to the conclusions drawn, the mechanisms underpinning the causal analysis were nonetheless weakly formulated. Similar conclusions were drawn from interviews with the learning and results evaluation team.

Taken together, the meta-evaluation’s assessment of internal validity yielded mixed results, making this an important area of improvement for the credibility and quality of IEG evaluations. More could be done to address conventional threats to validity. Although not every evaluation needs to engage in formal causal analysis, triangulation of evidence across different sources and a more explicit acknowledgment of potential limitations would strengthen the internal validity of findings in future evaluations.

Attribute 5: External Validity

External validity (or generalizability) refers to how well the findings from an evaluation can be expected to apply in other settings. For instance, do the findings apply to other people, organizations, situations, and time periods? The meta-evaluation focused on five facets related to the generalizability of findings: the extent to which generalizability was discussed, whether external validity concerns affected the validity of findings, whether attention was paid to population validity, how issues of ecological validity were addressed, and the coverage of temporal validity. Population validity is here defined as the extent to which reports pay attention to the ability to generalize results to other individuals or targeted groups. Ecological validity refers to the level of attention paid to generalizability across different settings. Finally, temporal validity refers to the ability to generalize findings across time. Of the eight Approach Papers reviewed, five were rated as partial and three as inadequate. Of the corresponding evaluation reports, two were rated adequate, four as partial, and two as inadequate.

The assessment found that the coverage of external validity was subject to certain weaknesses among the five facets explored, resulting in partial ratings for several of the reports reviewed. For instance, several reports provided limited discussion of the limitations on generalizability.29 Other reports provided a relatively narrow sample of country-level assessments with limited attempts to systematically establish the causal underpinnings of change observed in relation to the overarching evaluation questions.

While aspects of temporal and ecological validity were well covered, there was no explicit discussion of the generalization of findings in the higher education evaluation (World Bank 2017d). The carbon finance evaluation identified certain weaknesses related to external validity but did not expand on specific mitigation strategies (World Bank 2018a). This was also the case in the IFC client engagement evaluation (World Bank 2018f). However, the rural nonfarm economy evaluation explicitly focused on the way in which variations in country conditions limited the generalizability of findings, aligning with the report’s goal of formulating a holistic understanding of Bank Group engagement in this area (World Bank 2017c).

Although evaluation questions can guide an evaluation toward generating generalizable findings, there are rare instances when (given the institutional context) the external validity achieved can diverge from the intent of the evaluation.30 The urban transport evaluation operationalized urban mobility through four variables, but two of the four were based on evidence from country case studies in Africa (World Bank 2017e, 14–15). The lack of representativeness of these cases (relative to the rest of the Bank Group portfolio) may have affected the ecological validity of the results across other relevant contexts.

However, several evaluations provided excellent coverage of external validity issues. For instance, the evaluation on learning and results in World Bank operations was explicit about the representativeness and randomness of the sample of evidence used (World Bank 2015b, 3–4). The evaluation also made clear its focus on ecological (as opposed to population) validity, specifically for the case studies chosen to reflect the diversity in contexts. Finally, the evaluation noted an intention to arrive at conclusions that would prove useful for the World Bank, incorporating a discussion of how the results should be interpreted to ensure temporal validity (2–3).

To conclude, while the ratings indicate a mixed picture on external validity, the discussion and approach to this attribute were generally consistent with the nature of the evaluations. Aspects of ecological and temporal validity were generally well covered. Some evaluations explicitly spelled out the limitations of generalizability across contexts but provided limited mitigation strategies. Even so, these limitations did not always constrain the inferences made from specific findings to broad conclusions about Bank Group interventions.

Attribute 6: Data Analysis Validity

Hedges (2017) distinguishes between data analysis validity and the more narrowly defined statistical conclusion validity, which gauges whether the conclusions of a study are founded on robust statistical inferences. Data analysis validity is a broader concept that also addresses issues such as whether the evaluation has paid attention to risks of bias (unreliable data, improper choice of methods, incorrect use of methods) and has indicated ways to address risks associated with these issues. Three factors are considered in the meta-evaluation’s assessment of this attribute: whether attention is paid to risks of bias (from unreliable data, incorrect use of methods, and so on), whether the evaluation indicates ways to address risks of bias, and whether data analysis concerns affected the validity of findings. Of the eight Approach Papers reviewed, three were scored adequate, three as partial, and two as inadequate. Of the corresponding evaluation reports, one was rated adequate, six as partial, and one as inadequate.

While the quality of the data analysis was generally found to be good across the sample, two common challenges were noted for this attribute, relating to issues of transparency and triangulation. First, some evaluations faced difficulties in clearly demonstrating the stream of evidence that supported some of the key findings. Second, triangulation of evidence was found to be insufficient in certain contexts. However, certain evaluations proved very successful with respect to both challenges. The carbon finance evaluation took care to ensure data sources were validated at every stage (World Bank 2018a). Likewise, the higher education evaluation effectively addressed the risk of bias in a transparent manner, triangulating evidence from multiple sources to reach a cohesive and convincing assessment (World Bank 2017d). The use of triangulation was evident in the latter evaluation’s assessment of the Bank Group’s support to access, retention, and equity in its higher education portfolio. Evidence from interviews and case studies was explicitly compared with the Country Partnership Frameworks, the country strategy analysis, and portfolio analysis. Both the range of methods used and the transparency with which the output was synthesized reflected a high standard of research.

The evaluations examined in the sample also took steps to discuss the potential limitations of the input data. However, in some instances the data analysis did not go far enough to expand on the quality of the underlying data. The electricity access evaluation provides an example. In this case, the assessment of results drew primarily on the reporting of indicators derived from the projects under review (World Bank 2015d). While these indicators were transparently reported, the risk of bias underpinning the data was not discussed. This contrasted strongly with the explicit consideration of bias in the external literature informing the evaluation. The reliance on secondary data sources had the additional effect of reducing the strength of evidence where reporting was weak; indicators on welfare outcomes (including gender-related outcomes) were more likely to be missing, poorly defined, or inadequately followed up during project implementation.

Overall, while the evaluations examined in the sample were generally robust in addressing data analysis validity, data quality concerns and strategies to mitigate potential biases resulting from weaker data came up as areas of concern under this attribute. Expanded focus on these facets would generally improve the validity of findings in future evaluations.

Attribute 7: Consistency

Consistency refers to the need for a logical flow between the evaluation rationale, questions, design, data collection, analysis, findings, and recommendations. It is therefore only applicable to evaluation reports, given that Approach Papers (by definition) do not integrate any findings. Of the eight reports examined, four were scored as adequate and four as partial. The reports examined fared relatively well with respect to this attribute. As such, the challenges listed below mainly apply to areas in which further improvements can be achieved from an already strong baseline.

There was a generally strong fit between the methods and data sources used to address evaluation questions. However, more could be done to provide a consolidated explanation of how specific methods advanced the evaluation and what each approach was designed to contribute to the analysis under each evaluation question. An example of good practice can be found in the IFC client engagement evaluation (World Bank 2018f): the report outlined each of the methods used and the rationale for using it.31 This gave the reader a clear view of how each method was expected to contribute to the evidence base and the overarching objectives of the evaluation.

While the findings presented in evaluation reports generally related well to the evaluation questions, two related challenges were noted in the sample. First, subtle (but potentially significant) shifts in the interpretation of evaluation questions could alter the course of the evaluation, particularly if the central questions are paraphrased within the report.32 Second, the danger of findings “overreaching” relative to the data analysis can hinder the effectiveness of the prescriptions or generalizations derived from an evaluation. In the electricity access evaluation, the report states that “the World Bank’s performance in the electricity sector is somewhat lower than its performance in other infrastructure sectors combined” (World Bank 2015d, 23). However, it is then suggested that “the complexity and diversity of energy sector activities and operations compared with those of other infrastructure sectors may partly explain this difference” (23–24). This latter claim is neither substantiated nor explored further.

In most cases, recommendations from the report followed logically from the evidence and findings presented. For instance, the carbon finance evaluation presented a clear and explicit flow between the evaluation logic, methods deployed, and findings derived (World Bank 2018a). The chapter “Effectiveness of World Bank Group Roles” was structured in accordance with the theory of change (see figure 1.1 of that report), which was itself clearly justified in relation to the roles of the Bank Group in this sphere (see pp. 3–4, 6). Statements were transparently related to the evidence stream from which they were derived. In addition, endnotes in the chapter provided additional evidence for many of the points made (see pp. 56–60). The flow from the intervention logic to arguments, evidence, and findings presented a clear and compelling case to support the evaluation’s findings.

At a minimum, there was generally a good multitiered depiction of links between different levels of intervention and different levels of outcomes in the evaluations. However, the meta-evaluation did not find examples where this framing was then worked into a model to help better understand and probe the underlying issues identified. This is surprising given that the nature of the evaluand often had strong features of dependency between actions taken at different levels. Yet how such links were investigated was not always sufficiently clear. Exploring and understanding these links in a selective and targeted way is critical, particularly where assumptions of linearity do not hold or else apply only under certain restrictions.33

The higher education evaluation provides an example of this point (World Bank 2017d). The evaluation posed three central questions. First, was World Bank support to higher education consistent and well articulated? Second, did the World Bank contribute to higher education systems? Third, did support for higher education contribute to improved socioeconomic outcomes? To address the third question in a robust way, attention must be paid to what may be dubbed “macro-meso-micro” links: How does World Bank support influence or contribute to what the evaluation framework calls “broader outcomes” like skills and impacts (poverty reduction, employment, productivity)? Such broader outcomes must be measured at the level of beneficiaries. However, the links between the elements in the evaluation framework and micro-level behavior were not addressed.

Several macro-level variables referred to in the visualization of the evaluation’s logic model invoked concepts like political economy, business climate, environmental and social conditions, and so on (World Bank 2017d). But the evaluation did not clearly articulate how these were linked to the meso- (Bank Group support for higher education) and micro- (outcomes impact) levels. The evaluation noted that micro-level interventions “to improve equity, teaching and learning, employability, and research outcomes are all amenable to rigorous piloting and evaluation, unlike systemwide reform, which is more difficult to measure” (34). Elsewhere, the evaluation notes, “although the World Bank supervised the grants, there is little evidence that it provided support or direction to project staff of beneficiaries in the form of evidence on ‘what works’ in higher education pedagogy” (43–44). This presents yet another indicator of the importance of paying closer attention to macro-meso-micro links.

The nature of macro-meso-micro links could also be more explicitly elaborated. Such links can be defined as the way in which Bank Group interventions trickle down to individual decision-makers and beneficiaries. Frameworks such as the Coleman Boat Model are particularly effective at emphasizing such links (Coleman 1990). The model distinguishes between three types of mechanisms that are jointly required to explain the existence of a relationship between macro situations and the characteristics and outcomes of individual behavioral choices. The first (situational mechanisms) operates at the macro-to-micro level, showing how specific social situations shape the beliefs and opportunities of individual actors.34 The second (action-formation mechanisms) operates at the micro-to-micro level, assessing how individual choices and actions are influenced by specific combinations of (individual) behavioral characteristics, capacities, opportunities, and limitations.35 The third (transformation mechanisms) operates at the micro-to-macro level, showing how individuals generate macro-level outcomes through their actions and interactions.36

To conclude, the evaluations performed well on this attribute, showing a strong fit between the methods and data sources used for each evaluation question. Less clearly evident or articulated was the link between the methods and the scope for inference from the evidence they generated. Overall, most of the recommendations followed logically from the evidence presented, although the acknowledgment or assessment of interlevel links tended to be implicit rather than explicit.

  1. The scores are based on a combination of ratings assigned by the external experts to each respective evaluation reviewed in the sample.
  2. For the sake of parsimony, issues related to institutional complexity within the Bank Group itself will not be discussed in this meta-evaluation.
  3. The evaluation questions listed in the evaluations from the sample are summarized in appendix F. While Kane’s (1984) suggestion that all evaluation questions should be posed as a single sentence is an exaggeration, the assessment framework takes steps to assess cases in which evaluation questions are insufficiently focused. Per Goethe’s proverb that “in der Beschränkung zeigt sich erst der Meister” (it is in limitation that the master first reveals himself), the scope of an evaluation can become unclear if it is approached via a set of unstructured questions. When an overarching research problem includes some 10–15 (or more) questions and subquestions, it becomes increasingly difficult to see how each specific question relates to the rest, reducing the overall utility and effectiveness of the queries. Such a failure can also occur in the opposite direction. As an example, Epstein and Martin (2014, 23) cite the question, “what leads people to obey the law?” Though it presents an interesting problem, it is impossible to answer without further disaggregation. Finding the correct balance between these extremes requires careful calibration, something that was appraised in this component of the meta-evaluation. See also White (2010) and Leeuw and Schmeets (2016, chapter 3).
  4. See White (2010), Bunge (1997), Ultee (2001), and Leeuw and Schmeets (2016).
  5. In his article “Who’s afraid of research questions? The neglect of research questions in the methods literature and a call for question-led methods teaching,” White (2013) discusses this issue in the context of the educational sciences. Appendix G addresses potential failures when formulating evaluation questions.
  6. Issues of question clarity and focus could also be addressed in the evaluation design matrix. The bag-of-questions approach can also be characterized by substantial variation in the focus of evaluation questions: some questions address high-level strategic issues, whereas other subquestions address rather specific topics (such as the source, operationalization, and description of service delivery in project appraisal documents).
  7. Furthermore, the report defines two mechanisms for scoping: a self-reinforcing selection mechanism and a demonstration mechanism.
  8. For example, the higher education evaluation portfolio analysis examined the following documents (World Bank 2017d): Implementation Completion and Results Reports, Implementation Completion and Results Reports Reviews, and Project Performance Assessment Reports. Furthermore, “a standard quantitative portfolio review was conducted of IFC’s higher education portfolio detailing the number of new investment projects committed between FY03 and April 30, 2016, and the volume of investments committed” (74–75). In the absence of an identified portfolio, the rural nonfarm economy evaluation “used the theme code ‘rural nonfarm income generation,’ which was applied by the World Bank to 152 projects between 2004 and 2014” (World Bank 2017c, 8). After disaggregating the activities collected under the code, the evaluation “identified 529 World Bank projects, valued at $35 billion, which have directly supported rural nonfarm income generating activities during the same period” (213). In the urban transport evaluation, the portfolio covered 73 community-based projects (plus 32 additional financing), of which 44 (valued at $8.3 billion) were closed and evaluated (World Bank 2017e). “IEG filtered and identified projects approved between 2004 and 2014 that were within the Transport sector board, were rural themed, and that had a ‘Rural and Inter-Urban Roads and Highways’ code or a ‘Roads and Highways’ code (n = 162). It then filtered and identified projects within the Agricultural and Rural Development sector board that included a ‘Rural,’ an ‘Inter-Urban Roads and Highways’ (TI), or a ‘Roads and Highways’ (TA) sector code (n = 70)” (214). Finally, the electricity access evaluation “assessed both quantitative and qualitative results for individual projects during FY2000–2014. The portfolio review covered all projects for the World Bank, IFC, and MIGA that were approved or closed/matured during [this period]” (see table 1.2 of that report).
  9. See the higher education evaluation report (xi) for an example of this.
  10. This definition is in line with many methodological handbooks and guidance publications. See Vaessen (2018).
  11. See also White (2010), Gorard (2010), Leeuw and Schmeets (2016), and de Vaus (2001).
  12. The interviews asked staff to relate the ways in which the World Bank’s new organizational structure was likely to impact learning and knowledge-sharing in operations.
  13. In this regard, Janesick (1998) refers to such proliferation as “methodolatry.” See also White (2013, 219–20).
  14. See Bamberger et al. (2004), who coined this term to refer to the time, data, and budget constraints under which evaluations are implemented.
  15. See Strauss and Smith (2009) and DFID (2012).
  16. Bamberger et al. (2012, 219ff). This conceptualization was first presented in Campbell and Stanley (1963) and later revised by Cook and Campbell (1979) and Shadish et al. (2002). Construct validity is here defined as “the degree to which inferences are warranted from the observed persons, settings, and cause-and-effect operations included in a study to the constructs that these instances might represent” (Shadish et al. 2002, 38). For more on the Campbellian approach to construct validity, see Lund (2020).
  17. See World Bank (2018), Conducting a Structured Literature Review in the Framework of IEG (Major) Evaluations.
  18. The EAST acronym is derived from the following: “If you want to encourage a behavior, make it Easy, Attractive, Social and Timely.”
  19. Although construct validity originally emerged from psychological research, Strauss and Smith (2009) showed how this concept can be broadened to cover the definition and operationalization of key concepts in studies, as well as the relations between concepts and variables.
  20. This was particularly valuable for evaluations that spanned multiple years, projects, interventions, and institutional layers.
  21. In the report, the mechanism concept is referred to only in connection with issues of tracing, funding, and quality assurance.
  22. Two “evaluative lenses” are presented: one on behavioral change and the other on service delivery.
  23. The literature review that underpinned the evaluation also cited mechanisms such as trust and raising awareness.
  24. See Lee and Fidler (2007).
  25. The model was dubbed CRI2SP, standing for communication, resources, incentives, information, society, and psychology (figure 4.1).
  26. Evidence gap maps are evidence collections that map existing and ongoing systematic reviews or primary studies on a particular set of interventions onto a framework of policy-relevant interventions and outcomes.
  27. Specifically, it is important to ensure that there are feedback loops between theory and empirical evidence: the theory determines how evidence is brought in, and the evidence can in turn be used to iteratively refine the theory.
  28. The Maryland Scientific Methods Scale is one example of such a design hierarchy. The Cochrane Collaboration, the Campbell Collaboration, and several other organizations have developed publications, protocols, and other guidance documents on this topic.
  29. For instance, the evaluation on World Bank Group support to electricity access (World Bank 2015d).
  30. For example, the learning and results evaluation explicitly included a country case study that was not intended to be representative of the Bank Group portfolio (World Bank 2015b). Findings were based on evidence gathered from a pre-2014 organizational structure, whereas recommendations were framed around the perceived needs of a post-2014 reformed structure in which power had shifted from countries and regions to sector and thematic practices.
  31. For example, “the evaluation also included some interviews with IFC comparator institutions to benchmark IFC’s approaches to client engagement,” and “a comprehensive assessment of IFC’s investment and advisory portfolio … to derive characteristics and patterns of performance” (World Bank 2018f, 5).
  32. The health services evaluation provides an example of this phenomenon.
  33. As noted in the Results and Performance of the World Bank Group 2020, the World Bank Group collects limited systematic evidence on its contribution to higher-level outcomes. Higher-level outcomes result from the interplay of different projects and types of World Bank Group engagements—lending, knowledge, and convening—over time (World Bank 2020b). In response, the Board requested more evidence on how interventions help achieve Sustainable Development Goals. “Better evidence on higher level outcomes would also help with learning, reflections on strategy, and course corrections where needed.” See https://ieg.worldbankgroup.org/blog/what-world-bank-groups-performance-results-cannot-tell-us-about-development-outcomes.
  34. For example, this can involve the opportunity structure of a community: the more opportunities (such as jobs) are present, the greater the chance that any given individual will find work. Another example is the demographic composition of families and societies (including the Easterlin mechanism, which links the size of birth cohorts to job opportunities).
  35. Examples include cognitive dissonance, fundamental attribution errors, and other cognitive biases. Crowding out, stress levels, relative deprivation, reactance, and incentive-response mechanisms also fall into this category.
  36. Examples include threshold effects (also referred to as tipping points or critical mass models of collective action).