Meta-Evaluation of IEG Evaluations

Chapter 6 | Conclusions and Suggestions

Evaluation question 6. What conclusions may be derived from the inventory, in-depth review, and interviews? What suggestions can be made for future IEG evaluations?

The meta-evaluation examined the quality and credibility of IEG evaluations based on their methodological characteristics. The analysis distinguished between the inventory of methods (assessing the full universe of IEG evaluations published between FY15 and FY19) and an in-depth assessment of a sample of eight evaluations. The latter assessed the evaluation reports and their corresponding Approach Papers against a framework of seven attributes of methodological clarity and rigor. The inventory revealed the breadth of methodological approaches featured across the full set of evaluation reports, comparing the range of methodologies used in the reports with those proposed in their respective Approach Papers. The total number of methods tended to be higher in the final evaluation reports than initially proposed in the Approach Papers. The prevalence of more innovative methods also increased in more recent evaluations, and the use of at least one innovative method per report appears to have become the norm. Overall, IEG evaluations scored very well on the attributes of scope and focus and of consistency. Evaluations also performed quite well on the attributes of construct validity and data analysis validity. A more mixed picture emerged for the attributes of reliability, internal validity, and external validity; on each of these, both good and weaker examples of evaluations were identified.

The sections below present six conclusions from the meta-evaluation. These are supplemented with suggestions for future IEG evaluations, highlighting some of the strengths and weaknesses identified in the assessment of programmatic and corporate evaluations.

Scope and Focus of IEG Evaluations

Conclusions

Overall, the information presented on scope, rationale, and goals in the evaluation reports and Approach Papers was elaborate, relevant, and thorough. At the same time, the scope of some IEG evaluations tended to be overambitious and diluted. This was mainly due to two factors: the complexity of the evaluand (multisite, multilevel, and multiactor in nature) and the number and clarity of evaluation questions. While one or more overarching questions were usually formulated, certain evaluations subsequently added more than 10 subquestions, resulting in a "bag of questions" approach.

Suggestions

The meta-evaluation offers two suggestions for improvement in this area. First, the use of portfolio analysis as a standard operational procedure should be reconsidered. Specifically, Approach Papers should explicitly discuss whether the full diversity of interventions underlying a (thematic or sectoral) portfolio needs to be addressed.1 Such an analysis will help formulate more precise evaluation questions. Moreover, less time and fewer resources would need to be spent on the identification and descriptive analysis of the portfolio.2 Second, evaluators should refrain from formulating bags of questions and instead devote more time to refining the focus of evaluations.

Use of Conceptual Frameworks and Theories of Change

Conclusions

Overall, IEG evaluations adequately defined concepts (though they did not always operationalize them). More recent evaluations systematically incorporated evidence from the literature and made adequate use of theories of change. However, the function of the theory of change was not always clearly articulated; its relation to the empirical parts of the evaluative analysis could have been strengthened.

The evaluations in the sample usually employed one (or more) of three approaches for applying theories of change. In the first, the conceptual framework would capture the inputs, activities, outputs, and outcomes of a body of work alongside major enabling or restricting contextual factors.3 This usually served as a sense-making framework to better understand and define the often complex scope of the evaluation. The second approach involved the development of a substantive theory of change, disaggregating specific packages of interventions and confronting the theory with empirical evidence.4 The third approach involved a combination of a more general theory of change underlying macro-level Bank Group categories of activities and one or more nested theories within this broader framework. Though all evaluations applied theories of change, more attention could have been paid to the ways in which they interact with the empirical part of the evaluation. Some evaluations studied intervention mechanisms, but relatively less attention was paid to how such mechanisms operate in specific contexts.5

Suggestions

The meta-evaluation offers three suggestions in this area. First, evaluations should more explicitly articulate the role theories of change play in data collection and analysis, assessing their relationship to the relevant empirical work. Where possible, the analysis should link back to the theory of change, providing an assessment of its veracity as well as its potential shortcomings. Second, evaluations could be more precise about the content of their theories of change. Specifically, the adoption of a context-mechanism-outcome model or comparable analogs from the field of realist evaluation is recommended.6,7 Finally, greater attention to operationalizing concepts into variables and measurement instruments could improve construct validity.

Clarity of Research Methods and Design

Conclusions

Overall, clarity in evaluation design has improved in IEG evaluations over the past five years. The use of tools such as the evaluation design matrix (EDM) is widespread. However, the EDM sometimes presents only a list of evaluative instruments. A number of evaluations still do not show sufficient clarity on how different methods help answer specific evaluation questions and how evidence from different sources is triangulated and used to substantiate evaluation findings.

As shown in the inventory of 28 evaluations (see chapter 3), the EDM is an increasingly important tool for enhancing the reliability of evaluations, with more recent evaluations paying closer attention to its formulation. However, despite their role in clarifying the evaluation design, certain EDMs (and their supporting narratives) did not go beyond a listing of the individual methods used. Designs are “not about the logistics of research—how the data are collected, for example—but rather about the logic of inquiry, the links between questions, data and conclusions” (White 2013).

Suggestions

Two suggestions are provided for this area. First, more attention should be paid to distinguishing between data collection and data analysis methods, fully articulating the ways in which the two complement each other. Approach Papers (and the methodology sections of the reports) should clarify the logic of the design rather than merely listing the methods (to be) used. Second, guidance should be developed on good practices for implementing the principles of triangulation and synthesis in evaluation.

Validity

Conclusions

While there are good examples of evaluations with high internal, external, and data analysis validity of findings, there are ongoing challenges that merit further attention.8

Internal validity assesses the extent to which a study establishes a trustworthy causal relationship (either attribution or contribution).9 As noted previously, theories of change play an important role in this area. However, the reviewed evaluations made limited reference to conventional threats to validity or to how they would be addressed. The complexity of evaluands exacerbates this challenge, especially where evaluations covered dozens of countries, hundreds of projects, and several years of implementation. While the sample yielded mixed results on the attribute of external validity (or generalizability), its treatment was generally consistent with and reflective of the nature of each evaluation. Some evaluations explicitly discussed the limits of generalizability across different contexts but offered few mitigation strategies. Finally, the meta-evaluation’s assessment of data analysis validity was quite positive across the sample. However, two common challenges were noted, relating to transparency and triangulation. First, some evaluations had difficulty clearly demonstrating the stream of evidence that supported some of the key findings. Second, the triangulation of evidence was insufficiently applied (or insufficiently explained) in some evaluations.

Suggestions

The meta-evaluation proposes three suggestions for improvement in this area. First, the suggestions already presented on the use of theories of change also apply here: improvements in that area can strengthen internal validity. Second, a dedicated section on the diagnosis and treatment of internal and external validity issues could help mitigate some of the challenges posed by the complexity of evaluands. Finally, guidance (as suggested previously) on how to triangulate evidence within and across sources would be helpful.

Consistency

Conclusions

Overall, IEG evaluation reports fared quite well with respect to the consistency among rationale, scope, questions, methods, findings, and recommendations. There was a generally strong fit among the methods used, the data sources, and the evaluation questions.

In most cases, recommendations from the reports logically followed from the findings. Less evident in some cases was the added value of individual methods within a given evaluation. The consistency between questions, levels of data collection and analysis, and synthesis of findings was not always clear. Furthermore, the nature of macro-meso-micro links tended to be implicit rather than explicit in most of the evaluations assessed.10

Suggestions

To further strengthen analytical rigor, IEG evaluations should consider developing a more systematic approach to assessing how contextual (macro and meso) characteristics may or may not influence the behavior of the beneficiaries of Bank Group-supported interventions. This would include clarifying how and under what conditions different levels of analysis are linked. Apart from the use of multilevel EDMs, the literature provides several analytical models to tackle this issue: the Coleman Boat Model, for example, could provide a useful framework in this context.11

Innovation in Evaluation

Conclusions

During FY15 to FY19, IEG evaluations demonstrated a broadening range of methods used to respond to evaluation questions. While innovation in methods used for data collection and analysis should be applauded, such innovation should not become an end in itself. Evaluation teams should always carefully consider the cost-benefit ratio of innovation and the logic of using specific methods to address evaluation questions.

Suggestions

The meta-evaluation proposes the following suggestions on innovation. IEG could benefit from a more strategic view of methodological innovation in evaluation. Among other things, this would involve distinguishing between innovations that (potentially) significantly change the evaluation approach as a whole (or a large part thereof) and boutique studies. Systems of innovation should be seen as “a way of summarizing the patterns of interactions and interdependencies [that are] evolving and changing” between and within organizations (Eig 2014). If a collaborative social environment for innovation can be fostered, the quality of evaluations can be improved through the integration of innovative approaches and greater interaction among them. We suggest that IEG further stimulate experimentation and collaboration on innovative approaches across the organization.

Finally, as Jewitt et al. (2017) note, “the digital is a catalyst for innovation.” Given the recent challenges posed by the COVID-19 pandemic, digital tools and approaches will undoubtedly grow in relevance in the work of the Bank Group generally and IEG specifically. IEG should therefore be ready to learn from recent experiences in innovation (especially in the field of data science) and make informed decisions to adapt its practices where needed.

  1. This is particularly relevant for evaluations whose scope spans across multiple countries, long time horizons, and the three Bank Group institutions (World Bank, International Finance Corporation, Multilateral Investment Guarantee Agency) in both lending and nonlending operations.
  2. This will also improve the value added by investments in portfolio review and analysis.
  3. Such characteristics were sometimes presented in a manner similar to a logical framework approach.
  4. Attention was sometimes paid to the mechanisms that made interventions work.
  5. This is critical given that there is often no a priori evidence that a theory of change will be valid in different contexts.
  6. See Lemire et al. (2020).
  7. See Pawson (2013).
  8. Regarding construct validity, please refer to the points made above under the heading “Use of Conceptual Frameworks and Theories of Change.”
  9. Given the complexity of evaluands and issues of equifinality in attributing formal causal relationships, contributory causal relationships (those that support the outcome but are not the sole determinant of causation) are mainly considered here.
  10. “Macro” in this context pertains to country-level characteristics such as infrastructure, connectivity, investment climate, social inclusion/exclusion, fragility or conflict situations, economic or financial context, demography, and so forth. “Meso” refers to the role played by intermediary organizations and institutions. Finally, “micro” concerns the behavior of beneficiaries and end users. In most if not all logic models (theories of change) examined in the sample of eight evaluations, these links were not clearly articulated.
  11. See, for example, Hedström and Ylikoski (2010), Raub et al. (2012), and Astbury and Leeuw (2010).