Evaluation of International Development Interventions

Chapter 2 | Methodological Principles of Evaluation Design

Evaluation approaches and methods do not exist in a vacuum. Stakeholders who commission or use evaluations and those who manage or conduct evaluations all have their own ideas and preferences about which approaches and methods to use. An individual’s disciplinary background, experience, and institutional role influence such preferences; other factors include internalized ideas about rigor and applicability of methods. This guide will inform evaluation stakeholders about a range of approaches and methods that are used in evaluative analysis and provide a quick overview of the key features of each. In doing so, it will help them judge which approaches and methods work best in a given situation.

Before we present the specific approaches and methods in chapter 3, let us consider some of the key methodological principles of evaluation design that provide the foundations for the selection, adaptation, and use of evaluation approaches and methods in an independent evaluation office (IEO) setting. To be clear, we focus only on methodological issues here and do not discuss other key aspects of design, such as particular stakeholders’ intended use of the evaluation. The principles discussed in this chapter also pertain to evaluation in general, but they are especially pertinent for designing independent evaluations in an international development context. We consider the following methodological principles to be important for developing high-quality evaluations:

  1. Giving due consideration to methodological aspects of evaluation quality in design: focus, consistency, reliability, and validity
  2. Matching evaluation design to the evaluation questions
  3. Using effective tools for evaluation design
  4. Balancing scope and depth in multilevel, multisite evaluands
  5. Mixing methods for analytical depth and breadth
  6. Dealing with institutional opportunities and constraints of budget, data, and time
  7. Building on theory

Let us briefly review each of these in turn.

Giving Due Consideration to Methodological Aspects of Evaluation Quality in Design

Evaluation quality is complex. It may be interpreted in different ways and refer to one or more aspects of quality in terms of process, use of methods, team composition, findings, and so on. Here we will talk about quality of inference: the quality of the findings of an evaluation as underpinned by clear reasoning and reliable evidence. We can differentiate among four broad, interrelated sets of determinants:

  • The budget, data, and time available for an evaluation (see the Dealing with Institutional Opportunities and Constraints of Budget, Data, and Time section);
  • The institutional processes and incentives for producing quality work;
  • The expertise available within the evaluation team in terms of different types of knowledge and experience relevant to the evaluation: institutional, subject matter, contextual (for example, country), methodological, project management, communication; and
  • Overarching principles of quality of inference in evaluation research based on our experience and the methodological literature in the social and behavioral sciences (see note 6).

Here we briefly discuss the final bullet point. From a methodological perspective, quality can be broken down into four aspects: focus, consistency, reliability, and validity.

Focus concerns the scope of the evaluation. Given the nature of the evaluand and the type of questions, how narrowly or widely does one cast the net? Does one look at both relevance and effectiveness issues? How far down the causal chain does the evaluation try to capture the causal contribution of an intervention? Essentially, the narrower the focus of an evaluation, the greater the concentration of financial and human resources on a particular aspect and consequently the greater the likelihood of high-quality inference.

Consistency here refers to the extent to which the different analytical steps of an evaluation are logically connected. The quality of inference is enhanced if there are logical connections among the initial problem statement, rationale and purpose of the evaluation, questions and scope, use of methods, data collection and analysis, and conclusions of an evaluation.

Reliability concerns the transparency and replicability of the evaluation process (see note 7). The more systematic the evaluation process and the higher the levels of clarity and transparency of design and implementation, the higher the confidence of others in the quality of inference.

Finally, validity is a property of findings. There are many classifications of validity. A widely used typology is the one developed by Cook and Campbell (1979) and slightly refined by Hedges (2017):

  • Internal validity: To what extent is there a causal relationship between, for example, outputs and outcomes?
  • External validity: To what extent can we generalize findings to other contexts, people, or time periods?
  • Construct validity: To what extent is the element that we have measured a good representation of the phenomenon we are interested in?
  • Data analysis validity: To what extent are methods applied correctly and the data used in the analysis adequate for drawing conclusions?

Matching Evaluation Design to the Evaluation Questions

Although it may seem obvious that evaluation design should be matched to the evaluation questions, in practice evaluation design is still too often methods driven. Evaluation professionals have implicit and explicit preferences and biases toward the approaches and methods they favor. The rise in randomized experiments for causal analysis is largely the result of a methods-driven movement. Although this guide is not the place to discuss whether methods-driven evaluation is justified, there are strong arguments against it. One such argument is that in IEOs (and in many similar institutional settings), one does not have the luxury of being too methods driven. The evaluation questions, types of evaluands, and types of outcomes that decision makers or other evaluation stakeholders are interested in are diverse and do not lend themselves to any single approach or method.

Even for a subset of causal questions, no single approach or method is always better than the others, given the nature of the evaluands and outcomes of interest (for example, the effect of technical assistance on institutional reform versus the effect of microgrants on health-seeking behavior of poor women), the availability and cost of data, and many other factors. For particular types of questions there are usually several methodological options, with different requirements and characteristics, that are better suited than others. Classifications of questions can help evaluators think more systematically about this link: causal versus noncausal questions, descriptive versus analytical questions, normative versus nonnormative questions, intervention-focused versus systems-based questions, and so on. Throughout this guide, each guidance note presents what we take to be the most relevant questions that the approach or method addresses.

Using Effective Tools for Evaluation Design

Over the years, the international evaluation community in general and institutionalized evaluation functions (such as IEOs) in particular have developed and used a number of tools to improve the quality and efficiency of evaluation design (see note 8). Let us briefly discuss four prevalent tools.

First, a common tool in IEOs (and similar evaluation functions) is some type of multicriteria approach to justify the strategic selectivity of topics or interventions for evaluation. This could include demand-driven criteria such as potential stakeholder use or supply-driven criteria such as the financial volume or size of a program or portfolio of interventions. Strategic selectivity often goes hand in hand with evaluability assessment (Wholey 1979), which covers such aspects as stakeholder interest and potential use, data availability, and clarity of the evaluand (for example, whether a clear theory of change underlies the evaluand).
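
As an illustration only, a multicriteria approach of this kind can be sketched as a weighted score per candidate topic. The criteria names, weights, scores, and candidate topics below are hypothetical and are not drawn from any actual IEO procedure. Using the clarity of the theory of change as a hard screen is also an assumption made for the sketch; in practice an evaluability assessment is a matter of judgment rather than a yes/no filter.

```python
from dataclasses import dataclass

# Hypothetical criteria and weights, chosen only for illustration.
WEIGHTS = {
    "stakeholder_use": 0.4,    # demand-driven: potential use by stakeholders
    "financial_volume": 0.3,   # supply-driven: size of program or portfolio
    "data_availability": 0.3,  # evaluability: is there enough data to work with?
}


@dataclass
class Candidate:
    name: str
    scores: dict                  # criterion -> score from 1 (low) to 5 (high)
    clear_theory_of_change: bool  # assumed evaluability screen


def weighted_score(candidate: Candidate) -> float:
    """Return the weighted sum of a candidate's criterion scores."""
    return sum(WEIGHTS[c] * candidate.scores[c] for c in WEIGHTS)


def rank_candidates(candidates):
    """Drop candidates that fail the evaluability screen, then rank the rest."""
    evaluable = [c for c in candidates if c.clear_theory_of_change]
    return sorted(evaluable, key=weighted_score, reverse=True)


candidates = [
    Candidate("Rural roads portfolio",
              {"stakeholder_use": 4, "financial_volume": 5, "data_availability": 3},
              True),
    Candidate("Tax policy advisory",
              {"stakeholder_use": 5, "financial_volume": 2, "data_availability": 2},
              True),
    Candidate("Pilot innovation fund",
              {"stakeholder_use": 3, "financial_volume": 1, "data_availability": 4},
              False),
]

ranking = rank_candidates(candidates)
for c in ranking:
    print(f"{c.name}: {weighted_score(c):.2f}")
```

The value of such a sketch lies less in the numbers than in making the selection criteria and their relative weights explicit and open to discussion.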

A second important tool is the use of approach papers or inception reports. These are stand-alone documents that describe key considerations and decisions regarding the rationale, scope, and methodology of an evaluation. When evaluations are contracted out, the terms of reference for external consultants often contain similar elements. Terms of reference are, however, never a substitute for approach papers or inception reports.

As part of approach papers and inception reports, a third tool is the use of a design matrix. For each of the main evaluation questions, this matrix specifies the sources of evidence and the use of methods. Design matrixes may also be structured to reflect the multilevel nature (for example, global, selected countries, selected interventions) of the evaluation.
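
One way to picture a design matrix is as a simple table with one row per evaluation question. The sketch below is a minimal illustration under stated assumptions: the field names, the example questions, and the `find_gaps` and `by_level` helpers are hypothetical and not a prescribed template. It shows how a matrix can be checked for questions that still lack a source of evidence or a method, and how rows can be grouped by level of analysis.

```python
from dataclasses import dataclass, field


@dataclass
class MatrixRow:
    question: str
    level: str  # for example "global", "country", "intervention"
    evidence_sources: list = field(default_factory=list)
    methods: list = field(default_factory=list)


def find_gaps(matrix):
    """Return questions that lack either a source of evidence or a method."""
    return [row.question for row in matrix
            if not row.evidence_sources or not row.methods]


def by_level(matrix):
    """Group questions by level of analysis to show the multilevel structure."""
    grouped = {}
    for row in matrix:
        grouped.setdefault(row.level, []).append(row.question)
    return grouped


# Hypothetical rows; the third is deliberately left incomplete.
design_matrix = [
    MatrixRow("How relevant is the portfolio to sector needs?", "global",
              ["portfolio documents"], ["desk review"]),
    MatrixRow("How did advisory services contribute to reform?", "intervention",
              ["key informant interviews", "project reports"],
              ["theory-based causal analysis"]),
    MatrixRow("How do government counterparts perceive the support?", "country",
              [], ["semistructured interviews"]),
]

gaps = find_gaps(design_matrix)
levels = by_level(design_matrix)
print("Questions with gaps:", gaps)
print("Questions by level:", levels)
```

A check like `find_gaps` is a simple test of the consistency principle discussed earlier: every question should be linked to evidence and methods before data collection starts.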

A fourth tool is the use of external peer reviewers or a reference group. Including external methodological and substantive experts in the evaluation design process can effectively reduce bias and enhance quality.

Balancing Scope and Depth in Multilevel, Multisite Evaluands

Although project-level evaluation continues to be important, international organizations and national governments are, for multiple reasons, increasingly commissioning and conducting evaluations at higher programmatic levels of intervention. Examples are sector-level evaluations, country program evaluations, and regional or global thematic evaluations. These evaluations tend to have the following characteristics:

  • They often cover multiple levels of intervention, multiple sites (communities, provinces, countries), and multiple stakeholder groups at different levels and sites.
  • They are usually more summative and are useful for accountability purposes, but they may also contain important lessons for oversight bodies, management, operations, or other stakeholders.
  • They are characterized by elaborate evaluation designs.

A number of key considerations for evaluation design are specific to higher-level programmatic evaluations. The multilevel nature of the intervention (portfolio) requires a multilevel design with multiple methods applied at different levels of analysis (such as country or intervention type). For example, a national program to support the health sector in a given country may have interventions relating to policy dialogue, policy advisory support, and technical capacity development at the level of the line ministry while supporting particular health system and health service delivery activities across the country.

Multilevel methods choice goes hand in hand with multilevel sampling and selection issues. A global evaluation of an international organization’s support to private sector development may involve data collection and analysis at the global level (for example, global institutional mapping), the level of the organization’s portfolio (for example, desk review), the level of selected countries (for example, interviews with representatives of selected government departments or agencies and industry leaders), and the level of selected interventions (for example, theory-based causal analysis of advisory services in the energy sector). For efficiency, designs are often “nested”; for example, the evaluation covers selected interventions in selected countries. Evaluation designs may encompass different case study levels, with within-case analysis in a specific country (or regarding a specific intervention) and cross-case (comparative) analysis across countries (or interventions).

A key constraint in this type of evaluation is that one cannot cover everything. Even for one evaluation question, decisions on selectivity and scope are needed. Consequently, strategic questions should address the desired breadth and depth of analysis. In general, the need for depth of analysis (determined by, for example, the time, resources, and triangulation among methods needed to understand and assess one particular phenomenon) must be balanced by the need to generate generalizable claims (through informed sampling and selection). In addition to informed sampling and selection, generalizability of findings is influenced by the degree of convergence of findings from one or more cases with available existing evidence or of findings across cases. In addition, there is a clear need for breadth of analysis in an evaluation (looking at multiple questions, phenomena, and underlying factors) to adequately cover the scope of the evaluation. All these considerations require careful reflection in what can be a quite complicated evaluation design process.
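
The idea of a nested design can be sketched as a two-stage selection: countries first, then interventions within the selected countries. The sketch below is illustrative only. The portfolio, the function name `nested_selection`, and its parameters are hypothetical, and simple random selection is used as a stand-in; an actual evaluation would often select purposively, guided by criteria such as portfolio size or context.

```python
import random

# Hypothetical portfolio: country -> list of intervention identifiers.
PORTFOLIO = {
    "Country A": ["A1", "A2", "A3"],
    "Country B": ["B1", "B2"],
    "Country C": ["C1", "C2", "C3", "C4"],
    "Country D": ["D1"],
    "Country E": ["E1", "E2", "E3"],
}


def nested_selection(portfolio, n_countries, n_per_country, seed=0):
    """Select countries first, then interventions within each selected country."""
    rng = random.Random(seed)  # a fixed seed keeps the selection replicable
    countries = rng.sample(sorted(portfolio), n_countries)
    selection = {}
    for country in countries:
        interventions = portfolio[country]
        # A country may have fewer interventions than requested.
        k = min(n_per_country, len(interventions))
        selection[country] = rng.sample(interventions, k)
    return selection


selection = nested_selection(PORTFOLIO, n_countries=3, n_per_country=2)
for country, interventions in selection.items():
    print(country, interventions)
```

Within-case analysis would then be carried out for each selected intervention, and cross-case analysis across the selected countries. Recording the seed (or, for purposive selection, the selection criteria) supports the transparency and replicability discussed under reliability.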

Mixing Methods for Analytical Depth and Breadth

Multilevel, multisite evaluations are by definition multimethod evaluations. But the idea of informed evaluation design, or the strategic mixing of methods, applies to essentially all evaluations. According to Bamberger (2012, 1), “Mixed methods evaluations seek to integrate social science disciplines with predominantly quantitative and predominantly qualitative approaches to theory, data collection, data analysis and interpretation. The purpose is to strengthen the reliability of data, validity of the findings and recommendations, and to broaden and deepen our understanding of the processes through which program outcomes and impacts are achieved, and how these are affected by the context within which the program is implemented.” The evaluator should always strive to identify and use the best-suited methods for the specific purposes and context of the evaluation and consider how other methods may compensate for any limitations of the selected methods. Although it is difficult to truly integrate different methods within a single evaluation design, the benefits of mixed methods designs are worth pursuing in most situations. The benefits are not just methodological; through mixed designs and methods, evaluations are better able to answer a broader range of questions and more aspects of each question.

There is an extensive and growing literature on mixed methods in evaluation. One of the seminal articles on the subject (by Greene, Caracelli, and Graham) provides a clear framework for using mixed methods in evaluation that is as relevant as ever. Greene, Caracelli, and Graham (1989) identify the following five principles and purposes of mixing methods:

  1. Triangulation: Using different methods to compare findings. Convergence of findings from multiple methods strengthens the validity of findings. For example, a survey on investment behavior administered to a random sample of owners of small enterprises could confirm the findings obtained from semistructured interviews for a purposive sample of representatives of investment companies supporting the enterprises.
  2. Initiation: Using different methods to critically question a particular position or line of thought. For example, an evaluator could test two rival theories (with different underlying methods) on the causal relationships between promoting alternative livelihoods in buffer zones of protected areas and protecting biodiversity.
  3. Complementarity: Using one method to build on the findings from another method. For example, in-depth interviews with selected households and their individual members could deepen the findings from a quasi-experimental analysis on the relationship between advocacy campaigns and health-seeking behavior.
  4. Development: Using one method to inform the development of another. For example, focus groups could be used to develop a contextualized understanding of women’s empowerment, and that understanding could then inform the design of a survey questionnaire.
  5. Expansion: Using multiple methods to look at complementary areas. For example, social network analysis could be used to understand an organization’s position in the financial landscape of all major organizations supporting a country’s education sector, while semistructured interviews with officials from the education ministry and related agencies could be used to assess the relevance of the organization’s support to the sector.

Dealing with Institutional Opportunities and Constraints of Budget, Data, and Time

Evaluation is applied social science research conducted within specific institutional requirements, constraints, and opportunities, along with a range of other practical constraints. Addressing these all-too-common constraints (budget, data, time, political, and others) involves balancing rigor and depth of analysis with feasibility. In this sense, evaluation clearly distinguishes itself from academic research in several ways:

  • It is strongly linked to an organization’s accountability and learning processes, and there is some explicit or implicit demand-orientation in evaluation.
  • It is highly normative, and evidence is used to underpin normative conclusions about the merit and worth of an evaluand.
  • It puts the policy intervention (for example, the program, strategy, project, corporate process, thematic area of work) at the center of the analysis.
  • It is subject to institutional constraints of budget, time, and data. Even in more complicated evaluations of larger programmatic evaluands, evaluation (especially by IEOs) is essentially about “finding out fast” without unduly compromising the quality of the analysis.
  • It is shaped in part by the availability of data already in the organizational system. Such data may include corporate data (financial, human resources, procurement, and so on), existing reporting (financial appraisal, monitoring, [self-] evaluation), and other data and background research conducted by the organization or its partners.

Building on Theory

Interventions are theories, and evaluation is the test (Pawson and Tilley 2001). This well-known reference indicates an influential school of thought and practice in evaluation, often called theory-driven or theory-based evaluation. Policy interventions (programs and projects) rely on underlying theories regarding how they are intended to work and contribute to processes of change. These theories (usually called program theories, theories of change, or intervention theories) are often made explicit in documents but sometimes exist only in the minds of stakeholders (for example, decision makers, evaluation commissioners, implementing staff, beneficiaries). Program theories (whether explicit or tacit) guide the design and implementation of policy interventions and also constitute an important basis for evaluation.

The important role of program theory (or variants thereof) is well established in evaluation. By describing the inner workings of how programs operate (or at least are intended to operate), the use of program theory is a fundamental step in evaluation planning and design. Regardless of the evaluation question or purpose, a central step will always be to develop a thorough understanding of the intervention that is evaluated. To this end, the development of program theories should always be grounded in stakeholder knowledge and informed to the extent possible by social scientific theories from psychology, sociology, economics, and other disciplines. Building program theories on the basis of stakeholder knowledge and social scientific theory supports more relevant and practice-grounded program theories, improves the conceptual clarity and precision of the theories, and ultimately increases the credibility of the evaluation.

Depending on the level of complexity of the evaluand (for example, a complex global portfolio on urban infrastructure support versus a specific road construction project), a program theory can serve as an overall sense-making framework; a framework for evaluation design by linking particular causal steps and assumptions to methods and data; or a framework for systematic causal analysis (for example, using qualitative comparative analysis or process tracing; see chapter 3). Program theories can be nested; more detailed theories of selected (sets of) interventions can be developed and used for guiding data collection, analysis, and the interpretation of findings, while the broader theory can be used to connect the different strands of intervention activities and to make sense of the broader evaluand (see also appendix B).


References

Bamberger, M. 2012. Introduction to Mixed Methods in Impact Evaluation. Impact Evaluation Notes 3 (August), InterAction and the Rockefeller Foundation.

Bamberger, M., J. Rugh, and L. Mabry. 2006. RealWorld Evaluation: Working under Budget, Time, Data, and Political Constraints. Thousand Oaks, CA: SAGE.

Cook, T. D., and D. T. Campbell. 1979. Quasi-Experimentation: Design and Analysis Issues for Field Settings. Boston: Houghton Mifflin.

Greene, J., V. Caracelli, and W. Graham. 1989. “Toward a Conceptual Framework for Mixed-Method Evaluation Designs.” Educational Evaluation and Policy Analysis 11 (3): 209–21.

Hedges, L. V. 2017. “Design of Empirical Research.” In Research Methods and Methodologies in Education, 2nd ed., edited by R. Coe, M. Waring, L. V. Hedges, and J. Arthur, 25–33. Thousand Oaks, CA: SAGE.

Morra Imas, L., and R. Rist. 2009. The Road to Results. Washington, DC: World Bank.

Pawson, R., and N. Tilley. 2001. “Realistic Evaluation Bloodlines.” American Journal of Evaluation 22 (3): 317–24.

Wholey, J. 1979. Evaluation: Promise and Performance. Washington, DC: Urban Institute.

Notes

  1. For simplification purposes we define method as a particular technique involving a set of principles to collect or analyze data, or both. The term approach can be situated at a more aggregate level, that is, at the level of methodology, and usually involves a combination of methods within a unified framework. Methodology provides the structure and principles for developing and supporting a particular knowledge claim.
  2. Development evaluation is not to be confused with developmental evaluation. The latter is a specific evaluation approach developed by Michael Patton.
  3. Especially in independent evaluations conducted by independent evaluation units or departments in national or international nongovernmental, governmental, and multilateral organizations. Although a broader range of evaluation approaches may be relevant to the practice of development evaluation, we consider the current selection to be at the core of evaluative practice in independent evaluation.
  4. Evaluation functions of organizations that are (to a large extent), structurally, organizationally and behaviorally independent from management. Structural independence, which is the most distinguishing feature of independent evaluation offices, includes such aspects as independent budgets, independent human resource management, and no reporting line to management, but some type of oversight body (for example, an executive board).
  5. The latter are not fully excluded from this guide but are not widely covered.
  6. Evaluation is defined as applied policy-oriented research and builds on the principles, theories, and methods of the social and behavioral sciences.
  7. Both reliability and validity are covered by a broad literature. Many of the ideas about these two principles are contested, and perspectives differ according to different schools of thought (with different underlying ontological and epistemological foundations).
  8. A comprehensive discussion of the evaluation process, including tools, processes, and standards for designing, managing, quality assuring, disseminating, and using evaluations is effectively outside of the scope of this guide (see instead, for example, Bamberger, Rugh, and Mabry 2006; Morra Imas and Rist 2009).