Process Tracing Method in Program Evaluation
Chapter 2 | Tracing a Process Theory of Change Empirically
Once they have developed a preliminary disaggregated pToC, evaluators can engage in further fieldwork to more systematically assess whether the pToC worked as theorized or, if it did not, how they should revise the pToC. This involves what can be termed “operationalizing” the theory, in which testable hypotheses are developed for what types of empirical observables might be left if the activities and links played out as theorized in the pToC. In simple terms, operationalizing a pToC involves asking questions such as “If activity and link A took place, what type of empirical traces might we expect that they left in the case?” The hypothesized empirical observables can also be considered the “fingerprints” that might be left by the activities and links in a case.
In process tracing, empirical evidence can be any form of empirical material that changes evaluators’ confidence in how a particular theory worked in the selected case. Evidence can be sequences of events in a case, patterns in the empirical record (for example, the number of downloads of a report), traces whose mere existence provides proof, or accounts from interviews and the content of documents (Beach and Pedersen 2019). Different research techniques are relevant for collecting and assessing different types of evidence. Note that this can include statistical analysis of patterns in the empirical record, if relevant.
Process tracing often involves different modes of empirical research through the course of an evaluation. As discussed in chapter 1, the development of an initial pToC for an intervention involves an initial round of considering and probing the empirical record. A more systematic testing and revision phase then follows that involves (i) operationalizing the pToC in the form of expected observable traces that are tested empirically and (ii) assessing the collected evidence and, if necessary, revising the pToC.
Step 1: Operationalization of Expected Observables
The working pToC for a particular case being studied should be operationalized by asking what empirical observables the actions and links in the case might have left. Process-tracing methods build on Bayesian logic, in which evaluators update prior confidence in the workings of a theory based on new evidence they have gathered (Beach and Pedersen 2019; Befani 2021; Befani and Stedman-Bryce 2017). Evaluators can increase or decrease their degree of confidence in a theory based on this updating, depending on whether they have found confirmatory or disconfirmatory evidence. Bayesian logic suggests that some empirical observables can be characterized as “need to find” because the action or link for which they are expected to provide evidence should have left a particular fingerprint in a case (Befani 2021). Not finding that fingerprint would disconfirm to some degree evaluators’ confidence that the action or link is present in the case being studied.
Other empirical observables can be characterized as “love to find,” meaning that if found, they provide relatively strong confirmation of the actions and links involved in the pToC for the case being studied because no plausible alternative explanations for finding the evidence exist. These observables can therefore be thought of as a confirmatory “signature” that the part of the process for which they provide confirmation is working as theorized (Befani 2021). However, if other explanations for finding an observable are equally plausible, then finding the observable provides little or no confirmation of the actions and links involved in the pToC. In addition, if not found, love-to-find observables do not necessarily invalidate the pToC. These terms are defined further, alongside examples from the IEG evaluation, in box 2.1.
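The updating logic described above can be illustrated with a minimal sketch of Bayes' rule applied to a single observable. The function name and all probabilities below are hypothetical values chosen for illustration; they are not taken from the IEG evaluation, and in practice such likelihoods can rarely be quantified this precisely.

```python
def update_confidence(prior, p_found_if_true, p_found_if_false, found):
    """Return updated confidence that a part of the pToC operated as theorized.

    prior             -- confidence before looking for the observable (0 to 1)
    p_found_if_true   -- chance of finding the observable if the part operated
    p_found_if_false  -- chance of finding it anyway under alternative explanations
    found             -- True if the observable was found, False if it was not
    """
    if found:
        like_true, like_false = p_found_if_true, p_found_if_false
    else:
        like_true, like_false = 1 - p_found_if_true, 1 - p_found_if_false
    numerator = prior * like_true
    return numerator / (numerator + (1 - prior) * like_false)


prior = 0.5  # assumed starting point: "as likely as not"

# Hypothetical need-to-find observable: almost always left behind if the
# part operated (0.95), but also fairly likely under alternatives (0.60).
need_found = update_confidence(prior, 0.95, 0.60, found=True)
need_not_found = update_confidence(prior, 0.95, 0.60, found=False)

# Hypothetical love-to-find observable: not often left behind (0.30),
# but very hard to explain in any other way (0.02).
love_found = update_confidence(prior, 0.30, 0.02, found=True)
love_not_found = update_confidence(prior, 0.30, 0.02, found=False)

for label, value in [
    ("need to find, found", need_found),
    ("need to find, not found", need_not_found),
    ("love to find, found", love_found),
    ("love to find, not found", love_not_found),
]:
    print(f"{label}: {value:.2f}")
```

With these assumed inputs, not finding the need-to-find observable lowers confidence sharply (to roughly 0.11), whereas finding it raises confidence only modestly (to roughly 0.61). Finding the love-to-find observable raises confidence strongly (to roughly 0.94), whereas not finding it lowers confidence only slightly (to roughly 0.42).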
Box 2.1. The Confirmatory and Disconfirmatory Power of Evidence
Need-to-find evidence = disconfirmatory evidence. Need-to-find evidence is empirical observables that should be observed as a result of activities associated with a part of a process. If such empirical observables are not found in the case being studied, the lack of expected evidence disconfirms, to some degree, that that particular part of the process took place, with the degree of disconfirmation depending on how likely it was that the evidence would be found. Need-to-find evidence is related to terms such as rate of false negatives, sensitivity, and certainty used in other research traditions.
- Example from the Independent Evaluation Group evaluation: If World Bank ideas helped shape policy reforms, we should expect at least some overlap between the final reforms and the ideas. Not finding any overlap at all would be very disconfirmatory.
Love-to-find evidence = confirmatory evidence. Love-to-find evidence is empirical observables that ideally would be observed as a result of activities associated with a part of a process and whose presence is difficult to explain in alternative ways. If such empirical observables are found in the case being studied, the evidence may confirm that that particular part of the process took place, depending on how unlikely alternative explanations for the evidence are. Love-to-find evidence is related to the terms rate of false positives, specificity, and uniqueness used in other research traditions.
- Example from the Independent Evaluation Group evaluation: Particular wording from a World Bank diagnostic work is found in a final reform document. If it can be demonstrated that no other actors put forward similar proposals, and the work was put forward before the final reform, finding such wording is highly confirmatory because it would be highly implausible that the group preparing the reform document would have reached such similar language purely by coincidence.
Source: Independent Evaluation Group.
Empirical observables can be both need to find and love to find, just one or the other, or neither. Evidence that is both high need to find and high love to find is confirmatory if found and disconfirmatory if not found. Evidence that is neither need to find nor love to find has little probative value by itself—although it might play a role in corroborating other evidence. Figure 2.1 illustrates the two dimensions and their relationship with each other. The dimensions should be understood as continuums, with some types of evidence offering stronger confirmation (if found) than others.
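One way to make the two dimensions concrete is to express each as a likelihood ratio: one ratio for what finding the observable would tell us and one for what not finding it would tell us. The sketch below assumes that the two probabilities can be estimated at all, which is often not the case; the function name and the four sets of values are hypothetical.

```python
def probative_profile(p_found_if_true, p_found_if_false):
    """Return likelihood ratios for finding and for not finding an observable.

    A ratio far above 1 is confirmatory, a ratio far below 1 is
    disconfirmatory, and a ratio near 1 has little probative value.
    """
    for p in (p_found_if_true, p_found_if_false):
        if not 0.0 < p < 1.0:
            raise ValueError("probabilities must lie strictly between 0 and 1")
    return {
        "ratio_if_found": p_found_if_true / p_found_if_false,
        "ratio_if_not_found": (1 - p_found_if_true) / (1 - p_found_if_false),
    }


# Hypothetical observables, one per combination of the two dimensions:
# (chance of finding it if the part operated, chance of finding it otherwise)
corners = {
    "need to find and love to find": (0.95, 0.05),
    "need to find only": (0.95, 0.80),
    "love to find only": (0.20, 0.01),
    "neither": (0.50, 0.45),
}

profiles = {label: probative_profile(*probs) for label, probs in corners.items()}

for label, profile in profiles.items():
    print(
        f"{label}: if found x{profile['ratio_if_found']:.2f}, "
        f"if not found x{profile['ratio_if_not_found']:.2f}"
    )
```

Because both ratios vary continuously with the underlying probabilities, the sketch also reflects the point that the dimensions are continuums rather than fixed categories.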
Figure 2.1. The Two Dimensions of Probative Value of Empirical Evidence

Source: Independent Evaluation Group.
For critical assessment of whether evidence exists that particular actions took place and were linked in the way theorized, the specifics of the pToC determine what types of sources evaluators are interested in and what questions they ask these sources. Evaluators will want to interview people who either were directly involved in the actions depicted in the pToC or can provide important (unbiased) accounts of those actions. Similarly, the questions evaluators ask in interviews should directly relate to the content of the pToC (Camacho et al. 2025). As a result, evaluators will often tailor interview questions to a particular respondent, depending on what elements of the pToC that respondent is able to shed light on.
When hypothesizing what empirical observables might be found, it is important to cast the net widely for different observables that a given activity and link might have left. Because the most probative confirmatory or disconfirmatory evidence is often not available, in most situations evaluators have to settle for second-best evidence in which each individual piece tells them little. However, if those pieces are independent of one another, when combined, they can have greater weight. Independence of evidence relates to whether sources could have influenced each other or not. For example, if we interview two colleagues in an organization and we find similar reconstructions of a set of events, it could be that these are two independent eyewitness accounts or, alternatively, that the colleagues have colluded to harmonize their answers to the evaluators’ questions. Only when independence can be demonstrated can we treat the evidence as independent. Working with evidence, therefore, often involves relatively painstaking piecing together of different types of evidence (Beach and Pedersen 2019).
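The reason independent pieces of weak evidence can add up is that, under independence, their likelihood ratios multiply. The following minimal sketch shows this in odds form; the likelihood ratio of 1.5 per piece is a hypothetical value, and the multiplication is valid only if the pieces really are independent.

```python
from math import prod


def combine_independent(prior, likelihood_ratios):
    """Update a prior with several pieces of evidence assumed to be independent.

    Each likelihood ratio is P(evidence | part operated) divided by
    P(evidence | part did not operate).
    """
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * prod(likelihood_ratios)
    return posterior_odds / (1 + posterior_odds)


prior = 0.5

# Three weakly confirming pieces, each only slightly more likely
# if the part operated than if it did not (hypothetical ratio of 1.5).
weak_pieces = [1.5, 1.5, 1.5]

one_piece = combine_independent(prior, weak_pieces[:1])
three_independent = combine_independent(prior, weak_pieces)

# If the three accounts were harmonized by the sources, they carry
# roughly the weight of a single account, not of three.
three_colluded = combine_independent(prior, weak_pieces[:1])

print(f"one piece: {one_piece:.2f}")
print(f"three independent pieces: {three_independent:.2f}")
print(f"three colluded pieces: {three_colluded:.2f}")
```

In this sketch a single weak piece moves confidence from 0.50 to 0.60, and three independent pieces move it to about 0.77. Treating colluded accounts as independent would therefore overstate confidence.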
Returning to the IEG evaluation example, we operationalized expected observables for the action “World Bank officials engage key domestic officials during drafting to ensure feedback and ownership” by thinking about the different types of empirics that might have been left in the case by the activities involved in this action. Our expected observables involved the following:
- We expected to find, in interviews with World Bank officials, that
  - They spent time trying to identify and cultivate relevant contacts in the government of the client country;
  - They met relatively frequently with these national officials;
  - They were interested in getting candid feedback from national officials and in promoting their feelings of ownership of the issue.
- If it proved possible to access it, we expected to find, in project documentation for the production of the diagnostic report, information about meetings with national officials where ideas and potential formulations of draft text were discussed.
- We expected to find, in interviews with national officials, that they
  - Met with World Bank officials to discuss the diagnostic report, and
  - Provided feedback on the report draft.
(Despite many efforts, we were unfortunately unable to interview the national officials involved; how we managed to deal with this is discussed later in the chapter.) We characterized most of our expected observables as need to find, although interviews with national officials might have provided us with love-to-find evidence.
Step 2: Fieldwork to Test and Revise the Initial Process Theory of Change
In step 2, evaluators go again into the field to collect empirical material related to the expected empirical observables of the pToC. In some instances, this material does not confirm the initial pToC, with evidence suggesting that the whole process, or parts of it, worked in different ways than theorized. Such an outcome should lead evaluators to revise the pToC and thereafter produce a new set of expected empirical observables that can then be rigorously assessed through further fieldwork.
Returning to the IEG evaluation example, we broke down the second pathway of the policy dialogue episode, depicted in figure 2.2, into actors, actions, and links. In the second pathway, World Bank officials were theorized to have engaged in dialogue with a domestic think tank, which might have led the think tank to adopt World Bank ideas in its own report. Engagement was theorized to involve the domestic think tank providing feedback about the suitability of World Bank ideas for potential reforms. We theorized that, by feeling “heard,” the domestic think tank officials might have taken some of the World Bank ideas on board and included them in their own report. In this sense, the World Bank ideas might have gained more credence than they otherwise would have because they were also being promoted by a trusted national voice.
In attempting to track down evidence to assess this part of our pToC, we could not access some sources that would have been able to shed more light on how the engagement actually played out. We were able to interview the World Bank officials involved, giving us evidence of whom they met with and how frequently (once a month). Given that the information from the interviews was collected several years after the events took place, it unfortunately lacked precise details on the feedback received, as well as the specifics of the dialogue. This meant that we needed either to treat the evidence as weaker than ideal or to seek corroboration through other sources of evidence. Figure 2.2 documents the evidence for each part of the episode.
Figure 2.2. Policy Dialogue Episode (Engagement with Domestic Think Tank)

Source: Independent Evaluation Group.
In pursuing and assessing evidence for the domestic think tank side of the interactions, we sent the think tank a set of questions and received a brief written statement in response. The statement provided some evidence regarding one particular topic on which feedback was provided, but it was vague about the rest, and it was not very specific about the nature of interactions between World Bank officials and the think tank staff. Ideally, for our evaluation, we would have received more documentation regarding the domestic think tank’s written and oral interactions with World Bank officials, but we were unable to gain access to such documentation despite repeated attempts. As a second-best option, our evaluation relied instead on indirect evidence of the interactions between the think tank and the World Bank (with weaker confirmatory power), where we used knowledge products produced by the domestic think tank before the World Bank report as a baseline for comparing the think tank’s diagnosis of problems in the client country prior to engagement. We then compared early drafts of the World Bank diagnostic report with later drafts and the report the domestic think tank produced for the client country government. We mapped when data and ideas first appeared in the reports and whether they were included or dropped in subsequent versions. In this way, we were able to establish that early think tank documents had not included many of the ideas that were found in early World Bank drafts of the diagnostic report and that were subsequently reflected in the domestic think tank’s report.
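The mapping exercise described above can be sketched as a simple comparison of dated documents. The document titles, dates, and idea labels below are hypothetical placeholders, not the actual material from the evaluation, and the manual coding of which ideas appear in which document is assumed to have been done already.

```python
from datetime import date

# Hypothetical, manually coded documents: who produced each one, when,
# and which ideas it contains.
documents = [
    {"author": "think_tank", "title": "baseline knowledge product",
     "date": date(2015, 3, 1), "ideas": {"idea_a"}},
    {"author": "world_bank", "title": "diagnostic report, early draft",
     "date": date(2016, 1, 15), "ideas": {"idea_a", "idea_b", "idea_c"}},
    {"author": "world_bank", "title": "diagnostic report, later draft",
     "date": date(2016, 6, 1), "ideas": {"idea_b", "idea_c", "idea_d"}},
    {"author": "think_tank", "title": "report to government",
     "date": date(2017, 2, 1), "ideas": {"idea_a", "idea_b", "idea_c"}},
]


def first_appearances(docs):
    """Map each idea to the earliest document in which it appears."""
    first = {}
    for doc in sorted(docs, key=lambda d: d["date"]):
        for idea in doc["ideas"]:
            first.setdefault(idea, doc)  # keep only the earliest document
    return first


def ideas_first_seen_elsewhere(docs, source, adopter):
    """List ideas used by `adopter` that first appeared in a `source` document."""
    first = first_appearances(docs)
    adopter_ideas = set().union(
        *(d["ideas"] for d in docs if d["author"] == adopter)
    )
    return sorted(i for i in adopter_ideas if first[i]["author"] == source)


candidates = ideas_first_seen_elsewhere(documents, "world_bank", "think_tank")
print(candidates)
```

In this hypothetical data set, `idea_b` and `idea_c` are absent from the think tank's baseline, first appear in a World Bank draft, and are later reflected in the think tank's report. Such a pattern is indirect evidence only: it is consistent with influence through engagement but does not by itself rule out other routes by which the ideas could have reached the think tank.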
If evaluators find evidence confirming a particular aspect of their pToC, they need to assess whether the evidence is enough to enable them to stop collecting data, especially if the aspect is an important causal element in the pToC. In the IEG evaluation, because our initial pToC described interactions among diverse actors, we would typically have conducted multiple interviews from different sides of the interactions (if we had gained enough access). If the accounts from different sides of the interaction had been similar and consistent, we might have concluded that we had confirmed the workings of that part of our pToC.
However, interviews in evaluations are typically with stakeholders, all of whom may potentially show bias because they have a stake in the project in some form or other. Therefore, evaluators should ideally also corroborate interview findings with other types of evidence (Camacho et al. 2025). If evaluators’ sources are weak (that is, if we cannot necessarily trust their veracity), the evaluators should try to collect more information through other sources of evidence. Finding multiple pieces of confirming evidence helps corroborate the veracity of particular sources. In other situations where we can trust the evidence, finding one piece of confirming evidence can be enough. In the IEG evaluation, for example, we would not have expected the client country’s head of state to cite the World Bank as a justification for a change in the country’s policy direction. However, we did find that he had done just this, in an official speech, along with providing quite specific details about some of the problems the World Bank had identified. We took this to be a strong confirming piece of evidence that World Bank ideas were being listened to at the highest level in the country.
If evaluators do not find evidence confirming one or more aspects of their pToC, three different situations may apply: (i) they have found disconfirming evidence and should revise their pToC, (ii) they have found contradictory evidence, or (iii) they have not found any relevant evidence.
In the first case, the evidence might be very straightforward—for example, that a particular theorized action did not take place—in which case the evaluators would want to revise their pToC accordingly.
In the second case, evaluators need to figure out why the evidence they have found is contradictory. This can involve trying to collect other independent evidence that can corroborate one of the alternative interpretations of what happened and why. It is important to consider, however, that contradictory accounts do not necessarily mean one source of the accounts is wrong. It might be, instead, that the activities and links occurred in a way that was more complex than initially theorized, making both sources of evidence consistent with a revised pToC. Reconciliation of this type often requires significant detective work in piecing together the activities and links and how they worked in the case being studied. Returning to the IEG example, at the beginning of our inquiry—that is, before we identified the domestic think tank pathway—we found some contradictory evidence that suggested that the domestic think tank’s ideas might have been more important in shaping the revised reform agenda than the World Bank’s diagnostic report. However, with further digging, we discovered that there had been significant engagement between the think tank and World Bank officials. By tracing who put forward what ideas and when, we were able to reconcile the initial contradictory findings, uncovering evidence that through the World Bank’s work with the think tank behind the scenes, World Bank officials’ ideas had actually shaped the think tank’s report (that is, we discovered the second policy dialogue pathway).
In the third case, not finding relevant evidence can mean different things. If evaluators cannot obtain need-to-find evidence for something after systematically searching for it, one possibility is that this absence of evidence means that the evidence does not exist at all and therefore disconfirms the corresponding part of the pToC. A second possibility is that there are important pieces of evidence that evaluators cannot access or that do not exist (for example, activities in a meeting were not recorded). Even if no evidence is available for a particular activity and link in a pToC, evaluators can expand the search by asking whether there might be indirect or circumstantial evidence for that activity and link, for example, if the inputs (actions) correspond closely with the outputs (expected responses). If even this is not possible for a particular part of the pToC, evaluators will want to establish evidence for this part even more indirectly. They can do this by finding strong evidence for the part before or after, which may shed light on the part for which evidence is lacking. Ultimately, it is still good to know what one does not know. If the lack of evidence is related to something evaluators would have loved to find rather than something they needed to find, not finding the evidence tells us little.
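The difference between evidence that does not exist and evidence that cannot be accessed can be illustrated by extending Bayes' rule with an access term. This is a simplified sketch under stated assumptions: the chance of finding a trace is taken to be the chance that the trace was left multiplied by the chance that evaluators could access it, and all values are hypothetical.

```python
def confidence_after_not_finding(prior, p_trace_if_true, p_trace_if_false,
                                 p_accessible):
    """Return confidence in a part of the pToC after a search finds nothing.

    p_trace_if_true   -- chance the part left the trace, if it operated
    p_trace_if_false  -- chance the trace exists anyway, if it did not operate
    p_accessible      -- chance evaluators can access the trace if it exists
    """
    p_found_if_true = p_trace_if_true * p_accessible
    p_found_if_false = p_trace_if_false * p_accessible
    numerator = prior * (1 - p_found_if_true)
    return numerator / (numerator + (1 - prior) * (1 - p_found_if_false))


prior = 0.5

# Hypothetical need-to-find trace (0.95 if the part operated, 0.20 otherwise).
good_access = confidence_after_not_finding(prior, 0.95, 0.20, p_accessible=0.9)
poor_access = confidence_after_not_finding(prior, 0.95, 0.20, p_accessible=0.1)
no_access = confidence_after_not_finding(prior, 0.95, 0.20, p_accessible=0.0)

print(f"good access: {good_access:.2f}")
print(f"poor access: {poor_access:.2f}")
print(f"no access: {no_access:.2f}")
```

Under these assumptions, a fruitless search with good access is clearly disconfirming (confidence falls to about 0.15), whereas a fruitless search with poor access leaves confidence almost unchanged (about 0.48). With no access at all, the absence of evidence carries no information.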
In the IEG evaluation, we were not able to talk directly to the national officials who advised the client country’s head of state and so could not confirm that the World Bank’s diagnostic report had actually made it to the advisers’ ears and that they had used it in their own advice for the head of state’s speech to the nation. We did, however, have interview data from World Bank officials suggesting that they had met with these national officials. Further, we found more indirect, circumstantial evidence of the link between the World Bank diagnostic report and the head of state’s speech, such as the similar framing of issues in official country documents and those of the World Bank, the fact that the World Bank framing preceded the national one, and the fact that similar frames were not found in any other published report or the like. Further, we gathered information from press coverage of interactions between the World Bank team and high-level country authorities that shed light on the extent to which World Bank ideas had framed how national officials understood the problems facing them.
Regarding the impact of World Bank ideas on the country’s final policy reform document, we found relatively strong confirming evidence that indicated that a high-level authority in charge of adopting an alternative policy framework for the country’s future had used the conclusions of the World Bank’s diagnostic report a few years after its publication. To ascertain this, we had examined the levels of correspondence between the diagnostic report, other reports and relevant knowledge present in the country, and the final policy reform framework the country adopted. In particular, we assessed whether the other reports and the final framework showed signatures of World Bank influence in the form of particular formulations or combinations of ideas that would suggest that national officials relied heavily on the World Bank diagnostic report.
Evaluators will assess the evidence they have collected and will continue with fieldwork until they have evidence strong enough to confirm each key episode of their pToC to some degree. If significant amounts of disconfirming evidence are found, the pToC should be revised, after which the revised version should be tested systematically. Typically, a follow-up round of fieldwork is necessary in the final stages of an evaluation to fill in evidential gaps.
Bayesian logic offers an intuitive framework for assessing the degree of confidence evaluators can hold in their pToC based on the evidence they have assembled. Table 2.1 depicts how varying degrees of confidence in a theory can be expressed in words, using language that intelligence agencies use widely in presenting assessments and that US courts employ to summarize the strength of evidence behind their conclusions. Numerical equivalents are also presented, although we recommend that they not be used in final evaluation reports. The exception would be if the likelihoods of finding or not finding evidence can be meaningfully quantified, as some scholars suggest (for example, Befani 2021; Fairfield and Charman 2017). In real-world evaluation settings, however, formalized Bayesian updating is not always practical or useful.
Table 2.1. Strength of Evidence Expressed Linguistically
| Strength of Evidence | Linguistic Expression | Numerical Equivalent |
| --- | --- | --- |
| Strongly confirming evidence (high internal validity) | “Beyond reasonable doubt,” “almost certainly” | >90% (greater than 9-in-10 chance) |
| | “Very probably” | 80% (8-in-10 chance) |
| | “Probably” | 70% (7-in-10 chance) |
| | “Somewhat more likely than not” | 60% (6-in-10 chance) |
| | “Neutral,” “as likely as not” | 50% (1-in-2 chance) |
| | “Somewhat less than even chance” | 40% (4-in-10 chance) |
| | “Probably not” | 30% (3-in-10 chance) |
| | “Very probably not” | 20% (2-in-10 chance) |
| Strongly disconfirming evidence (low internal validity) | “Almost certainly not” | 10% (1-in-10 chance) |
Source: Adapted from CIA 1968.
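Where a numerical confidence has been estimated during analysis, it can be translated back into the wording of table 2.1 for reporting. The sketch below uses a nearest-value rule, which is our assumption: the table lists only point values and does not specify where one expression ends and the next begins.

```python
# Anchor values and expressions from table 2.1 (the top row is listed as >90%).
TABLE_2_1 = [
    (0.90, "almost certainly"),
    (0.80, "very probably"),
    (0.70, "probably"),
    (0.60, "somewhat more likely than not"),
    (0.50, "as likely as not"),
    (0.40, "somewhat less than even chance"),
    (0.30, "probably not"),
    (0.20, "very probably not"),
    (0.10, "almost certainly not"),
]


def linguistic_expression(confidence):
    """Return the table 2.1 expression whose anchor value is closest.

    The nearest-anchor rule is an illustrative assumption, not part of
    the table itself.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    _, phrase = min(TABLE_2_1, key=lambda row: abs(row[0] - confidence))
    return phrase


print(linguistic_expression(0.77))
print(linguistic_expression(0.62))
```

Reporting the phrase rather than the number is in line with the recommendation above that numerical equivalents not be used in final evaluation reports unless the underlying likelihoods can be meaningfully quantified.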
Different evaluations can require different evidential thresholds, depending on their purposes. An evaluation that seeks actionable knowledge or strong evidence-based conclusions typically requires multiple rounds of research that revise a pToC based on emerging trends in the evidence found, especially if the pToC does not operate in a way that produces empirical observables that can be collected easily. When relatively strong confirming evidence is found, the evaluation report can state that the evidence suggests that the pToC “very probably” worked as theorized in the case(s) studied. In other situations, lighter (that is, weaker) evidence supporting a relatively simple pToC might be enough for the evaluation report to conclude that it is “more likely than not” that the pToC worked as theorized. In the example IEG evaluation, we significantly updated our confidence in our theorized claim, from being rather skeptical of the World Bank’s impact through a particular channel to concluding that the intervention we examined very probably worked as we had theorized based on the cluster of different types of confirming evidence that we found.