10 ways to integrate Social Justice into teaching Data Science

Yim Register (they/them)
8 min read · Sep 1, 2020

Whether it’s a formal classroom or a corporate team meeting, here’s a list of ways I’ve found to integrate Social Justice into teaching Data Science.

[Image: a cardboard protest sign in a crowd that says “Fight today for a better tomorrow”]

Use real-world data… carefully

If you follow my writing you know I have a personal vendetta against the iris dataset and mtcars. Students can learn better when working in personally relevant domains, and I have simply never had a justice-oriented conversation about flowers (I’m sure I could think of one though). When we teach Data Science we automatically fall back on what’s available, which makes a ton of sense. But there is so much more data out there than NYCFlights or the Titanic dataset. If we want to teach people to be Data Scientists in the world, they need to learn in social contexts with real consequences and conflicting stakeholders. But this doesn’t mean immediately throwing your classroom into police brutality data or abortion statistics, because…

Assume someone in the room has experienced the data you’re discussing

Autism birthrates. Rape and assault. Police brutality. Racial profiling. Abortion rates. Teen pregnancy. Refugee data. Incarceration. Cancer diagnosis. Natural Disasters. Child abuse statistics.

These are all datasets I’ve seen discussed. I think we all have a basic level of compassion when discussing sensitive topics, but I challenge you to assume there is a student sitting there in the classroom who has personally experienced what you’re talking about. Because it’s likely there’s someone who has. I’ve been a rape victim in a room discussing rape statistics as casually as the weather. I’ve been someone with a high “Adverse Childhood Experiences” (ACE) score while the professor goes on and on about how ACEs correlate with early death. It’s important to discuss real data with real consequences, while also keeping in mind that all of us carry several identities with us that need to be kindly and carefully navigated in a classroom. In practice, that means giving warning about the data you’ll be discussing, asking for feedback, speaking kindly about those affected in the dataset, and demonstrating the basic level of respect needed to discuss complex issues in a classroom.

Every dataset has a backstory; talk about it!

Where did the data come from? Who collected it and why? Even the iris dataset isn’t safe. Ronald Fisher was a eugenicist who believed that the different races differed “in their innate capacity for intellectual and emotional development”. So this guy who made the flower measurements famous also spread racist views around the academic sphere, and we never talk about that!

How about the Titanic? The poorer passengers died at much higher rates! Why? Because the rich people were allowed to get onto lifeboats first. Let’s talk about it and the other classism in society and data!

How is census data used? Who decides the categories? Was everyone asked to participate? Was the data collection accessible? Was it voluntary or collected without participants realizing? Was the original data collection purpose different from how it was later used?

Discuss data for what it is: real human lives being affected

It’s really easy to think in the abstract when you’re running R scripts and getting opaque numbers back. After all, you can run a lot of tests you really shouldn’t be allowed to run and they’ll still give you a “result”! It’s also easy to set a reasonable threshold in our models for something like “accept” or “reject” without thinking too much about it. But what about the model that predicts if someone will default on a loan? If the threshold for accepting is 85% sure they won’t default on the loan, what about the young woman trying to buy her first home, working for everything she’s got, who comes in at 84.5%? We set thresholds for a reason; I’m not saying it’s totally unreasonable. But are you sure 85% is a good threshold? Imagine it’s you. Imagine it’s your friend. Real human lives are affected by the models we run and the conclusions we draw. Just keep that in mind.
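Here is a minimal Python sketch of that loan example, just to make the cutoff concrete. Everything in it is an assumption for illustration: the `Applicant` class, the `decide` and `near_misses` helpers, the 1-point margin, and the three made-up scores are not from any real lending model. The idea is simply to show how a hard threshold treats 84.5% exactly like 60%, and how you might surface the people who were rejected by a hair.

```python
from dataclasses import dataclass


@dataclass
class Applicant:
    name: str
    p_repay: float  # hypothetical model estimate that the person will NOT default


def decide(applicant: Applicant, threshold: float = 0.85) -> str:
    """Hard cutoff: anything below the threshold is rejected, no matter how close."""
    return "accept" if applicant.p_repay >= threshold else "reject"


def near_misses(applicants, threshold: float = 0.85, margin: float = 0.01):
    """Applicants who were rejected by less than `margin` (an arbitrary choice here)."""
    return [a for a in applicants if threshold - margin <= a.p_repay < threshold]


# Invented scores for illustration only.
applicants = [
    Applicant("A", 0.91),
    Applicant("B", 0.845),  # the first-time home buyer from the paragraph above
    Applicant("C", 0.60),
]

decisions = {a.name: decide(a) for a in applicants}
flagged = [a.name for a in near_misses(applicants)]

print(decisions)  # B and C get the same answer from the model
print(flagged)    # but only B was half a point away
```

A good classroom question from here: who should review the flagged cases, and who picked the margin?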

Recognize that categorization of humans can be oppressive

So you got a dataset and you’re visualizing it and doing some exploratory data analysis. Especially when we didn’t collect the data ourselves, we forget to question the data itself. Speaking from my own experience as a nonbinary person, do you have any idea how many times I’m only given the option “Male” or “Female” on a form? Yes, I understand not every form will update with societal change and not every form even needs to know my gender identity. That’s why I’ve started randomly choosing between male and female instead of shamefully always defaulting to my birth sex. Because if the system doesn’t work for me, then I won’t work for the system.

Think about how we choose racial categories, gender categories, poverty categories, marital status (yep, even that. Look into polyamory), etc. These choices can be oppressive, uninclusive, confusing, and “othering” for those of us who don’t fit in one of the boxes. This doesn’t mean stop doing Data Science or stop collecting any information. Why don’t you ask your class what they think we should do? A lot of us are trying to solve these problems.

Make it personal and ask your students to critique models

Have students take on the role of someone being affected by the model you’re discussing. In the Data Science introduction series Emma Spiro and I are working on, we use the case study of when Target ads predicted someone was pregnant before they even knew. We are being particularly careful to remember the above tips: being kind, remembering someone in the room might be personally affected by this content, giving backstory, critiquing the data, and giving adequate warning. Next, we are having students take on the role of various stakeholders for this model (excluding the pregnant individual because it’s just not appropriate to put someone in that role randomly). We are asking people to advocate as if they are the company CEO, someone who is looking for target pregnancy ads, friends and family of the individual, journalists covering the story, and Data Scientists at the company. Once the students are situated, they will critique the model and/or advocate for their position in the scenario, followed by a group discussion between all the stakeholders. Give students a chance to immerse themselves in the scenario and critique why models aren’t working for them. You’ll be amazed at what students come up with.

Point out how and why models are “gamed”

One natural response to unfair models is to “game” them. For example, Google sorts search results using an algorithm called PageRank. PageRank doesn’t rank pages by how often they’re clicked on, because then you could just set up a bot to click the page over and over and increase the page’s status. Instead, PageRank relies on how many other sites link to a page (and how highly ranked those sites are themselves). It’s based on the assumption that if other pages are linking somewhere, that page is likely reputable. It’s a fine assumption and tends to work out. But something called link farms got around this by setting up a bunch of websites that all link to one another in order to increase their page ranks. Pretty clever, but annoying.
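If you want students to see the link farm trick for themselves, here is a small Python sketch. It uses the simplified textbook version of PageRank (power iteration with a damping factor), not whatever Google actually runs in production. The tiny four-page web, the page names, the damping value of 0.85, and the five-page farm are all made up for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified textbook PageRank via power iteration.

    `links` maps each page to the list of pages it links to.
    Every linked-to page must also appear as a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share of rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            # A page with no outlinks spreads its rank over all pages.
            targets = outlinks if outlinks else pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# A toy web: nobody links to "spam".
honest_web = {
    "news": ["blog", "shop"],
    "blog": ["news"],
    "shop": ["news"],
    "spam": ["news"],
}

# Add a hypothetical link farm: five pages that link to each other and to "spam".
farm = [f"farm{i}" for i in range(5)]
farmed_web = dict(honest_web)
for page in farm:
    farmed_web[page] = ["spam"] + [other for other in farm if other != page]

honest_rank = pagerank(honest_web)
farmed_rank = pagerank(farmed_web)

print(f"spam without farm: {honest_rank['spam']:.4f}")
print(f"spam with farm:    {farmed_rank['spam']:.4f}")
```

In this toy setup the spam page's score goes up once the farm exists, even though nothing about the page itself got any better. Ask students what the search engine could do about it, and who gets hurt by each fix.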

When we flood hashtags on purpose or include a selfie to get some information more attention online, we are gaming the system. Most models can be gamed. Talk about it.

Always point out assumptions baked into models

We’ve talked a bit about categories and data history. But this one goes a little deeper. Sometimes a seemingly “helpful” model is actually founded on poor assumptions. In the disability space this is huge. So many models, data explorations, and results refer to disability as something to correct; a deviation from the norm that can be accommodated for, an unfortunate mistake. Within the Autism community, we often regard our neurodivergence as a special gift, a “different but not less” category, and something that should not be corrected or shifted towards neurotypical. This sentiment is similar in the Deaf community and many others.

Another example is “beauty filters”. Beauty is equated with whiteness and thinness, literally baked into the design of the algorithm. The beauty filter lightens the skin, opens the eyes, and thins the face. I’m not entirely sure if this fits in with “Data Science” but it’s a good example of how seemingly innocuous parts of everyday life contribute to oppression. What other assumptions get baked into our models?

Provide community and representation for minority students

If you’re the “only one” of any kind of identity in a classroom, it affects your learning. We all seek connection, solidarity, and community. Always provide resources to your class, even if it “looks” homogeneous to you. You never know. Here are some: Black in AI, Women in Machine Learning, LatinX in AI, Queer in AI, Data Science for Social Good, Visa information for hiring Data Scientists outside of the US, AINow, Data Feminism, Diversity in Tech from Information is Beautiful. Please contribute more with a comment!

Be kind

Learning Data Science is hard. Existing in an unjust society is harder. Especially in the hellscape that is 2020, just be kind. Many of us can’t afford to give tons of effort and energy for anyone but ourselves and our immediate loved ones right now, but a little bit can go a long way. My teaching motto is: “People don’t care what you know until they know that you care”.

*Icons from Icons8.



Yim Register (they/them)

Attending PhD School. Radical optimist. Machine learning literacy for self-advocacy and algorithmic resistance