Why Data Science Needs More Data Scientists: Diversity In The Data Supply Chain
As a student in the Executive MBA program at MIT’s Sloan School of Management, I am always reminded by my professors, “bad data in; bad data out.” In my scientific career as a biophysicist and now in my role in product management, this is an oft-quoted phrase, but what does it mean to take action to remedy the problem? I commonly hear the phrase after experiments, surveys or algorithms are shown to have a built-in error. We use it to discard the flawed test, but we often neglect to understand the mechanism that got us there. Let’s take a journey to discover why.
Go to any scientific conference, whether the topic is cancer research, forensics or chemical analysis, and you will see a rush of people heading into an Artificial Intelligence (AI) presentation. I recently found myself in such a room in Anaheim at the American Association for Clinical Chemistry conference. It was standing-room only. Hours later I sought out a lecture on race in genomics and medicine. It was practically empty; the hundreds had become tens. As I looked around, my frustration built as it dawned on me that the sizes of the audiences should have been switched. The question that’s been nagging me since is, “If we aren’t in the rooms to learn about the ways we create flawed inputs, how can we ever hope to improve the outputs?”
The propensity for algorithms to show bias is increasingly discussed, and growing awareness has prompted a conversation about the role that scientists must play in ensuring equity and inclusion in the data they gather. As an example, Joy Buolamwini of MIT’s Media Lab has published research showing that commercial facial-analysis software identifies the gender of white men with about 99% accuracy. Beyond that demographic, however, performance fell sharply, with error rates approaching 35% for darker-skinned women.
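To make that concrete, here is a minimal Python sketch of why a single headline accuracy figure can hide this kind of gap. Everything in it is an assumption for illustration: the `accuracy_by_group` helper, the group labels "A" and "B", and the toy records are hypothetical and are not drawn from any study mentioned here.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return (overall_accuracy, {group: accuracy}).

    `records` is an iterable of (group, true_label, predicted_label) tuples.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, truth, prediction in records:
        total[group] += 1
        if truth == prediction:
            correct[group] += 1
    if not total:
        raise ValueError("no records to evaluate")
    overall = sum(correct.values()) / sum(total.values())
    per_group = {group: correct[group] / total[group] for group in total}
    return overall, per_group


# Hypothetical toy data, NOT real measurements: group "A" makes up 90% of
# the evaluation set, group "B" only 10%.
records = (
    [("A", 1, 1)] * 89    # group A, predicted correctly
    + [("A", 1, 0)] * 1   # group A, predicted incorrectly
    + [("B", 1, 1)] * 6   # group B, predicted correctly
    + [("B", 1, 0)] * 4   # group B, predicted incorrectly
)

overall, per_group = accuracy_by_group(records)
print(f"overall accuracy: {overall:.2f}")
for group in sorted(per_group):
    print(f"group {group} accuracy: {per_group[group]:.2f}")
```

With these made-up numbers the overall figure looks healthy, yet the smaller group fares far worse. Reporting results per group, rather than one aggregate number, is one simple way to surface the problem before the output is trusted.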
Even our wearable devices are not immune to bias. A recent article by Ruth Hailu states that “some people of color may be at risk of getting inaccurate readings” when getting heart-rate measurements. There is a very real concern that “the growing body of scientific research that relies on these wearables” will be compromised. This could have a massive ripple effect when we think about how studies build upon and reference each other. If flawed heart-rate data were then used as a parameter to study the risk of heart attacks, the results could be compromised right from the beginning.
When we group data according to race, for example, we sacrifice true understanding. As Jay Kaufman of McGill University stated at the opening of his talk, Uses and Abuses of Racial and Ethnic Identification in Medical Research, “Human variation is continuous, not discrete.” This idea, which originated in anthropology, asserts that the differences between humans are gradual and nuanced, not blunt and dramatic. Jonathan Marks gave an excellent illustration in an interview with PBS: our historical “understanding of race is that there’s a small number of basically different kinds of people, perhaps localized to continents… but the human species simply doesn’t come patterned.”
Where do we go from here? Yes, we need to fix the algorithms. But I would argue that we need to change our mindset about the entire supply chain of data, both vertically (across the groups that produce and manipulate data) and horizontally (along the chain that the data travels).
If the experiments are designed with any level of bias to begin with, then no group of analysts, however well informed and diverse, can analyze that bias out. We are simply thinking about data in the wrong way, overlooking or ignoring many of the lessons we’ve learned. We have to start with informed, scientifically developed inputs; otherwise the output will be compromised.
This means we need help everywhere, at the same time: from experimental design to data analysis, to reviewers at major scientific and medical journals, and on to the larger media, which ultimately inform the public about the conclusions of research and experiments. If we can fix our inputs, we will be happier with the outputs. Making any amount of progress anywhere in the data supply chain will make a difference. Though Big Data and AI make up a relatively new field, that is no excuse, because the algorithms we’re designing today will be used to generate plans for how our world is structured tomorrow.
And, yes, some disciplines that use this data will find it harder to change rapidly; you cannot lead a clinical study overnight. However, in certain areas the bar is easier to clear. There are places to learn to code, like Coding Dojo; conferences where diversity is discussed; and classes on Artificial Intelligence and Machine Learning offered online. You can also simply support groups like AI4ALL, whose mission statement reads, “Opens Doors to Artificial Intelligence for Underrepresented Talent Through Education and Mentorship,” and help ensure changes stay for the long term.
We need to remain focused and act with a sense of urgency, because the time for thoughtful action is now. Let’s be sure that all the presentations teaching us how to design, collect and analyze our data are standing-room only, and then we can walk to that AI lecture together.