Collateral damage: the pitfalls of quantitative measures of success
An evaluator’s perspective on "Weapons of Math Destruction"
Given our accountability mandate, evaluators are naturally drawn to “objective” indicators—we love to assign scores and ratings, and hunger for more sophisticated ways of generating metrics to assess performance. But complexity does not equal rigor, correlation is not causality, and a number is not a fact.
I am an economist by training and a skeptic by nature. That combination underpins a healthy suspicion of how mathematics is sometimes applied to social science. Recently, I was encouraged to read Weapons of Math Destruction, a best-selling book by Cathy O’Neil. I usually balk at books with inflammatory titles (WMDs? Really?!?), but this one was recommended by a person I respect, so I moved it to the top of the pile of books I had accumulated via recommendations from Amazon’s “Recommended for You” algorithm.
O’Neil is a mathematician and data scientist, with impressive academic credentials (including a PhD in math from Harvard) and extensive experience in business and finance. She was motivated to write Weapons of Math Destruction out of concern that “sloppy use of statistics and biased models” were entrenching inequalities and perpetuating biases, with collateral damage to people’s lives. She attributes this to the “separation between technical models and real people, and the moral repercussions of that separation”. She illustrates this with reference to the application of mathematical models and statistics to education, banking, insurance, marketing and employment.
For example, data might reveal a correlation between an individual’s zip code or race and the probability of default on a personal loan, and this correlation could be used to inform a credit risk model and influence the likelihood of that person being able to obtain a loan. Clearly there is no causal relationship between race/residence and creditworthiness, but there is evidence that these variables have been used as proxies for more relevant (but perhaps harder to measure) factors like trustworthiness or quality of education. When such correlations are used to deny loans to people of a particular race living in a particular neighborhood, they have the effect of perpetuating inequality and injustice.
O’Neil expresses dismay at what she sees as growing incentives to develop increasingly complex models and formulas “to impress rather than clarify”. At no point does she suggest that the analysts behind these models and metrics are anything but well-intentioned. Her concern is that they, and those who make use of their increasingly opaque models, may be unaware either of hidden biases in underlying assumptions or that these models are often used for purposes for which they are not well suited.
I see a parallel in the way new technologies are often used. Take text analytics (or “text mining”) for example. Text analytics is a powerful tool to extract information about people’s attitudes and preoccupations. It is for that reason that marketers, political strategists, and sociologists make extensive use of it. However, its use to extract less impressionistic information is more problematic.
From personal experience, I know how easy it is to adopt new terminology and jargon (MFD, inclusion, resilience, disruptive technology, digital economy, mobilization, transformative engagements etc.) into strategy papers and program documents without meaningfully integrating those concepts into the design of specific strategies and operations. Even the most sophisticated text analytics software would have a hard time distinguishing this “ornamentation” from meaningful integration of these concepts in the design of operations. That distinction requires critical reading and judgement (generally associated with intelligence of the “non-artificial” kind). This is not to say that we cannot use text analytics to help focus our work and narrow the range of material we have to work with, but the technology, as impressive as it may be, is no substitute for critical and thoughtful review. It provides only the illusion of rigor.
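To make the point concrete, here is a minimal sketch of the simplest kind of text analytics, a keyword count. The concept list and both sample sentences are invented for illustration, and real text-mining tools are far more sophisticated than this, so treat it as a toy rather than a description of any actual product.

```python
import re

# Hypothetical list of "concepts" an analyst might search for.
CONCEPTS = ["resilience", "inclusion"]

def concept_counts(text):
    """Count how often each concept word appears in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    return {concept: words.count(concept) for concept in CONCEPTS}

# Invented example: the concepts appear only as ornamentation.
ornamental = "This transformative program promotes resilience and inclusion."

# Invented example: the concepts are tied to specific design choices.
integrated = (
    "To build resilience, the program funds flood defenses; to support "
    "inclusion, it reserves half of the grants for women-led firms."
)

print(concept_counts(ornamental))
print(concept_counts(integrated))
```

Both sentences produce identical counts, one mention of each concept, even though only the second says anything about how the concepts shape the operation. Telling the two apart is a job for a reader.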
Another example O’Neil provided, which struck me as particularly relevant to the work of an evaluator, was of a well-intentioned effort by a group of journalists to create a ranking of the educational quality of U.S. universities. Since “educational quality” could not be directly measured, the journalists picked proxies that seemed to them to correlate with success. These included the share of incoming students who made it to their second year, the share who graduated, and the share of living alumni who donated money to their alma mater (surmising that this was evidence that they were more appreciative of the education they received). They then developed an algorithm using these proxies, which was combined with the subjective views of college officials across the country.
As their ranking gained national attention, it began to influence college choice among prospective students, raising the stakes for college administrators to improve their rankings. For any of us who have used proxy-based indicators to assess performance (a good description of many of the results indicators in our operations, particularly development policy financing), or had our own performance measured by seemingly objective measures, it’s easy to see where this led: incentives quickly coalesced around influencing proxies rather than achieving ultimate objectives which, in our case, would be development impact.
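A small sketch shows how easily this can happen. The weights, the two colleges, and the "true_quality" figure below are all invented for illustration; this is not the journalists' actual formula, and it leaves out the subjective views of college officials that were part of their ranking.

```python
# Hypothetical weights on the three proxies described above.
WEIGHTS = {"retention": 0.4, "graduation": 0.4, "alumni_giving": 0.2}

def proxy_score(college):
    """Weighted sum of proxy indicators, each a share between 0 and 1."""
    return sum(WEIGHTS[name] * college[name] for name in WEIGHTS)

def rank(colleges):
    """Return college names ordered from highest to lowest proxy score."""
    return sorted(colleges, key=lambda name: proxy_score(colleges[name]), reverse=True)

# Invented data. "true_quality" stands in for the thing we actually care
# about, which the ranking never sees.
colleges = {
    "A": {"retention": 0.90, "graduation": 0.80, "alumni_giving": 0.10, "true_quality": 0.8},
    "B": {"retention": 0.85, "graduation": 0.75, "alumni_giving": 0.10, "true_quality": 0.7},
}

before = rank(colleges)

# College B runs a donation drive: alumni giving rises, teaching is unchanged.
colleges["B"]["alumni_giving"] = 0.50

after = rank(colleges)

print(before, after)
```

In this toy example College B overtakes College A in the ranking without any change in the quality of education it offers. The proxy moved; the objective did not.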
So what does all this have to do with evaluation, you might ask?
Given our accountability mandate, evaluators are naturally drawn to “objective” indicators—we love to assign scores and ratings, and hunger for more sophisticated ways of generating metrics to assess performance. But complexity does not equal rigor, correlation is not causality, and a number is not a fact. While that may not make us more confident in our findings, it will keep us more honest and humble. I see these attributes as essential if we are also to fulfill the other key part of our mandate—identifying and disseminating lessons from experience.
A final note: I haven’t yet talked about data quality, which can make the difference between a valid and a true conclusion. Stay tuned on this front….
This post is the first of IEG's new Measuring Up series, designed to spark greater knowledge sharing in the development community about what success looks like in global development, as well as the ways we measure, assess, and evaluate it. We hope you will share your views and perspectives in the comments to this blog post.