Collateral damage: the pitfalls of quantitative measures of success
An evaluator’s perspective on "Weapons of Math Destruction"
By: Jeff Chelsky

I am an economist by training and a skeptic by nature. That combination underpins a healthy suspicion of how mathematics is sometimes applied to social science. Recently, I was encouraged to read Weapons of Math Destruction, a best-selling book by Cathy O’Neil. I usually balk at books with inflammatory titles (WMDs? Really?!?), but this one was recommended by a person I respect, so I moved it to the top of the pile of books I had accumulated via recommendations from Amazon’s “Recommended for You” algorithm.
O’Neil is a mathematician and data scientist, with impressive academic credentials (including a PhD in math from Harvard) and extensive experience in business and finance[1]. She was motivated to write Weapons of Math Destruction out of concern that “sloppy use of statistics and biased models” were entrenching inequalities and perpetuating biases, with collateral damage to people’s lives. She attributes this to the “separation between technical models and real people, and the moral repercussions of that separation”. She illustrates this with reference to the application of mathematical models and statistics to education, banking, insurance, marketing and employment.
For example, data might reveal a correlation between an individual’s zip code or race and the probability of default on a personal loan, and this correlation could be used to inform a credit risk model and influence the likelihood of that person being able to obtain a loan. Clearly there is no causal relationship between race or residence and creditworthiness, but there is evidence that these variables have been used as proxies for more relevant (but perhaps harder to measure) factors like trustworthiness or quality of education. When such correlations are used to deny loans to people of a particular race living in a particular neighborhood, they have the effect of perpetuating inequality and injustice.
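To make the mechanism concrete, here is a minimal, hypothetical sketch of the kind of decision rule O’Neil warns about. It is not taken from the book; the zip codes, default rates, and threshold below are invented for illustration.

```python
# A toy "credit risk model" that scores applicants by where they live.
# All figures are invented for illustration.

# Historical default rates observed by zip code. The correlation may be real,
# but it reflects underlying conditions (schools, job markets), not anything
# causal about an individual applicant.
DEFAULT_RATE_BY_ZIP = {"10001": 0.04, "60629": 0.12, "90210": 0.03}

def approve_loan(zip_code: str, threshold: float = 0.10) -> bool:
    """Approve only if the zip-level default rate is below the threshold."""
    return DEFAULT_RATE_BY_ZIP.get(zip_code, threshold) < threshold

# Two equally creditworthy applicants get different answers purely because
# of their neighborhood: the proxy, not the person, makes the decision.
print(approve_loan("10001"))  # True
print(approve_loan("60629"))  # False
```

The individual’s actual repayment behavior never enters the rule; the neighborhood average does all the work.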
O’Neil expresses dismay at what she sees as growing incentives to develop increasingly complex models and formulas “to impress rather than clarify”. At no point does she suggest that the analysts behind these models and metrics are anything but well-intentioned. Her concern is that they, and those who make use of their increasingly opaque models, may be unaware either of hidden biases in the underlying assumptions, or that the models are often used for purposes for which they are not well suited.
I see a parallel in the way new technologies are often used. Take text analytics (or “text mining”) for example. Text analytics is a powerful tool to extract information about people’s attitudes and preoccupations. It is for that reason that marketers, political strategists, and sociologists make extensive use of it. However, its use to extract less impressionistic information is more problematic.
From personal experience, I know how easy it is to adopt new terminology and jargon (MFD, inclusion, resilience, disruptive technology, digital economy, mobilization, transformative engagements etc.) into strategy papers and program documents without meaningfully integrating those concepts into the design of specific strategies and operations. Even the most sophisticated text analytics software would have a hard time distinguishing this “ornamentation” from meaningful integration of these concepts in the design of operations. That distinction requires critical reading and judgement (generally associated with intelligence of the “non-artificial” kind). This is not to say that we cannot use text analytics to help focus our work and narrow the range of material we have to work with, but the technology, as impressive as it may be, is no substitute for critical and thoughtful review. It provides only the illusion of rigor.
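A simple, invented example of what I mean: a keyword-based text miner can count how often fashionable terms appear in a document, but the counts say nothing about whether the concepts actually shaped the design. The buzzword list and document snippets below are made up for illustration.

```python
from collections import Counter
import re

# An invented list of strategy buzzwords.
BUZZWORDS = {"resilience", "inclusion", "transformative"}

def buzzword_counts(text: str) -> Counter:
    """Count buzzword mentions -- all a simple keyword-based text miner sees."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t in BUZZWORDS)

ornamental = ("This transformative, inclusion-focused program builds resilience "
              "through transformative, resilience-oriented engagement.")
substantive = ("Component 2 finances drainage upgrades in flood-prone districts, "
               "with maintenance budgets secured through 2030.")

print(buzzword_counts(ornamental))   # high counts, little design content
print(buzzword_counts(substantive))  # zero counts, despite concrete design choices
```

The ornamental paragraph scores high and the substantive one scores zero; only a critical reader can tell which document meaningfully integrates the concepts.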
Another example O’Neil provided, which struck me as particularly relevant to the work of an evaluator, was a well-intentioned effort by a group of journalists to create a ranking of the educational quality of U.S. universities. Since “educational quality” could not be directly measured, the journalists picked proxies that seemed to them to correlate with success. These included the share of incoming students who made it to their second year, the share who graduated, and the share of living alumni who donated money to their alma mater (surmising that this was evidence that they were more appreciative of the education they received). They then developed an algorithm based on these proxies, combined with the subjective views of college officials across the country.
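The general shape of such a ranking is easy to sketch. The example below is hypothetical: the proxies, weights, and colleges are invented, and the journalists’ actual formula was certainly more elaborate.

```python
# A toy proxy-based ranking: a weighted sum of measurable stand-ins for
# "educational quality". All names, weights, and figures are invented.
WEIGHTS = {"retention_rate": 0.4, "graduation_rate": 0.4, "alumni_giving_rate": 0.2}

colleges = {
    "College A": {"retention_rate": 0.93, "graduation_rate": 0.85, "alumni_giving_rate": 0.18},
    "College B": {"retention_rate": 0.88, "graduation_rate": 0.80, "alumni_giving_rate": 0.35},
}

def quality_score(proxies: dict) -> float:
    """A single number standing in for something that cannot be measured directly."""
    return sum(WEIGHTS[name] * value for name, value in proxies.items())

ranking = sorted(colleges, key=lambda c: quality_score(colleges[c]), reverse=True)
print(ranking)  # the order reflects the chosen proxies and weights, not quality itself
```

Once the score exists, every proxy in it becomes a target that administrators can work on directly.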
As their ranking gained national attention, it began to influence college choice among prospective students, raising the stakes for college administrators to improve their rankings. For any of us who have used proxy-based indicators to assess performance (a good description of many of the results indicators in our operations, particularly development policy financing), or have had our own performance measured by seemingly objective measures, it’s easy to see where this led: incentives quickly coalesced around influencing the proxies rather than achieving the ultimate objective, which, in our case, would be development impact.
So what does all this have to do with evaluation, you might ask?
Given our accountability mandate, evaluators are naturally drawn to “objective” indicators—we love to assign scores and ratings, and hunger for more sophisticated ways of generating metrics to assess performance. But complexity does not equal rigor, correlation is not causality, and a number is not a fact. Keeping this in mind may not make us more confident in our findings, but it will keep us more honest and humble. I see these attributes as essential if we are also to fulfill the other key part of our mandate: identifying and disseminating lessons from experience.
A final note: I haven’t yet talked about data quality, which can make the difference between a valid conclusion and a true one. Stay tuned on this front….
[1] For those interested in her work, she writes a regular blog under the somewhat provocative name “Math Babe: Exploring and Venting about Quantitative Issues” at https://mathbabe.org/.
This post is the first of IEG's new Measuring Up series, designed to spark greater knowledge sharing in the development community about what success looks like in Global Development, as well as the ways we measure, assess, and evaluate it. We hope you will share your views and perspectives in the comments to this blog post.
Comments
Maybe in development…
Maybe in developed countries, the numerical output of big data becomes a problem through the distortion of reality caused by the indiscriminate use of theoretical models, with parameters inserted that correspond to other contexts far removed from local social needs and that strengthen policies favoring the wealthy minority. By contrast, in developing countries the use of big data software is still incipient.
Greetings from Guatemala
This is a great piece…
This is a great piece shedding light on why some development interventions are reported as successful while others are not. It is unfair to pretend to measure reality in complex development contexts using numbers or statistics. In the developing world, for example, there are respondents or beneficiaries who are not numerically literate. Evaluators or data collectors put words (in the form of numbers) in the mouths of respondents, sometimes (not all the time) misrepresenting their reality or experience. No valid conclusion can be reached when other facets of reality are not taken into account. A single method, mainly quantitative methods and their models alone, is unhelpful.
In a two-day pre-conference workshop that I will be facilitating on behalf of the African Capacity Building Foundation at the African Evaluation Association (AfrEA) International Conference in March 2019, I will take participants through the use of mixed methods in evaluation, gathering all types of data (numerical, audio, video, pictorial, narrative, etc.) to evaluate development interventions. I strongly believe that "all statistics are data, but not all data are statistics", hence the need to embrace multiple methods in evaluation.
Expanding my previous…
Expanding my previous comment: in Guatemala, the use of software for the analysis of large volumes of data is incipient, mainly because of the cost it represents for a small market (from a microeconomic point of view). However, the abuse of econometric models in the public sector, models that hold up only within the theoretical framework of their mathematical requirements but are completely divorced from reality, can, as the article well establishes, not only entrench but also widen the inequality gap, besides serving as an instrument to perpetuate perverse institutions such as corruption, nepotism, marginalization, etc. In summary, the discretionary handling of the figures becomes an end in itself; I call it "programmed bias".

We must still use them for the establishment of public policies, because they give a certain sense of direction, but their management must be evaluated and monitored exhaustively.
Greetings!
Completely agree wrt the new…
Completely agree wrt the new "fad" that has yet to deliver on its promise. I interact frequently with "data scientists" (I am an econometrician with a PhD in economics, so imagine I am a bit skeptical), and we have agreed that the terminology and methods used should be handled very, very cautiously. So if included in evaluation, I would use levels of evidence strength. I talk more only with statisticians (without PhDs) who know a little more about the problems that "fanciness" may pose (let alone the statistics behind those problems, which are not well posed). All of them silently disregard the term "Big Data", since most data we will ever use will not approach what is thought to be in that category. Before that, at the risk of being old-fashioned, I would address problems from the field: are questionnaires really measuring what we say they measure? This may be easy for an LSMS type of survey, but for experimental data, soft variables related to risk aversion, behavioral-economics-related variables... hmmm... it seems we have a lot to learn. So this page is very, very welcome!
Dear Carlos, many thanks for…
Dear Carlos, many thanks for the comment. I share your concern with the rigor with which questionnaires are designed and used. Perhaps this would be a good topic for a future blog?
Thumbs up for more humility…
Thumbs up for more humility in the evaluator's toolkit. But let's not throw away text analytics just yet. Sure, we need to be mindful of its limitations, but judicious use of it can help save time, identify patterns in big data, and focus the evaluator's attention on the areas of vast qualitative data that need human (evaluator) investigation. We used it in that way in a recent IEG evaluation, "Growth for the Bottom 40 Percent: World Bank Group's Support for Shared Prosperity" (https://ieg.worldbankgroup.org/evaluations/shared-prosperity). The approach helped us identify exactly the kind of gap you describe: after its introduction, the WBG's shared prosperity corporate goal was operationalized much more in the SCDs and knowledge work than in country-level strategies, lending, and project work. New rhetoric versus operational focus and impact.
Dear Zeljko, no one is…
Dear Zeljko, no one is suggesting throwing any babies out. Indeed, text analytics, used judiciously and with informed judgment, is a powerful tool. But how often have you seen the use of text analytics, or proposals for its use, accompanied by an informed discussion of its limitations? I once sat through a full-day "Introduction to Text Analytics" workshop. The day was all about how to use text analytics; not once did we address when, or when not, to use it. Regrettably, this is more the rule than the exception, as too many people are seduced by the technology and apply it without sufficient understanding of its limitations.