Teaching Machine Learning with Social Media Examples
If you’ve ever tried to learn machine learning (ML), you’ve probably seen images like this:
You may have seen examples like this:
I’m here to tell you that I think we are stuck in the past when it comes to how we teach machine learning. Many ML courses are still taught in an entirely theoretical way: think mathematical proofs and matrix operations. Personally, I find these foundational skills important for mitigating the irresponsible plug-and-play practices we have today, but they aren’t the most accessible or meaningful deterrent to negligent machine learning. Toy datasets like the iris dataset or the Boston housing dataset also have their benefits: they are small, and they give you clean, reliable patterns, such as the clear classification boundaries in iris. But so often we pass up the opportunity to discuss how the 1936 iris dataset is linked to the eugenicist philosophies of the scientist who created it, or how the 1970 Boston housing dataset implies correlation between race and crime.
Once upon a time, it made sense to use both theory and toy datasets. Especially back in the 1950s when neural networks were still theoretical, we needed to rely on mathematical proofs and simple examples to demonstrate machine learning ideas. But here we are, in 2022, relying on datasets from decades ago to teach data science and machine learning.
So often in ML courses we rely on abstract descriptions of a problem, like in the overfitting example. We refer to the input features as x₁, x₂, x₃, and rely on mental modeling of geometric relationships that, quite frankly, no human can do. So many ML lessons aren’t framed with a concrete scenario in mind: we talk about Model A and Model B and feature x and outcome y, and it’s completely detached from anything a student can hold on to. But why? We are living in the information age! Machine learning is all around us: in our pockets and laptops and cars and public transport and doctor’s offices and banks, and every day in the news!
My PhD research specializes in algorithmic literacy on social media platforms: facilitating public understanding of the machine learning that underlies Feed Organization, Content Recommendation, and Content Moderation. I am particularly interested in the ways that social media algorithms shape our realities (what we believe about the world and ourselves), often with unexpected and harmful consequences. Bias in these algorithms can quickly turn into threats to the safety of marginalized communities, warped political landscapes, and the promotion of restrictive and normative ideals about the ways we should look, speak, and behave. Through my work on social media algorithms, I have explored the technical details of ethical dilemmas, something I want every Data Science classroom to grapple with.
So here is a non-exhaustive list of ways to teach ML algorithms with social media examples. My goals are to give students actual examples they can hold on to, and to bring today’s ethical questions into the classroom. As teachers, we are training the future Data Science leaders of the world. We want them to be able to critically engage with ML as a practice, always looking out for the ways that ML can affect real human lives.
Linear Regression: Likes vs Comments
Let’s start off simple. What is the relationship between the number of Likes on a post and the number of Comments? I once ran a pilot study where I had students manually collect this data for the last 10 posts on their Facebook feed. To be painstakingly clear, we must be careful when asking students to open their social media in class: social media is personal, sensitive, and private. Students should never be expected to collect data on themselves in a group setting, nor to share their results. People deal with all kinds of complex issues online, such as mental health crises, the death of family members, or being the target of bullying or harassment. Even getting a low number of Likes can be embarrassing for some people. So for this exercise, I’ve gone ahead and done my own little data collection on 50 posts in my Instagram Feed (you could use your own posts as well, up to you). You can Hide or Unhide Like Counts on Instagram, a feature introduced because seeing Like counts often contributes to social comparison and depression, a topic you can discuss with your class!
What is the relationship between Likes and Comments?
Some of these are ads. Can you tell which ones?
What features would make someone more likely to Comment?
Is this data collection fair? Representative?
How have Like counts affected mental health?
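If you want to show the fit in code rather than on a whiteboard, here is a minimal sketch of ordinary least squares in plain Python. The values in `posts` are made-up placeholders for illustration, not the 50 posts described above, and the function names `fit_line` and `r_squared` are simply ones chosen for this sketch.

```python
# Hypothetical (likes, comments) pairs. Placeholder values for illustration only,
# NOT the data collected for this article.
posts = [(120, 4), (340, 11), (95, 2), (780, 30),
         (410, 15), (60, 1), (1500, 52), (230, 9)]


def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(points, slope, intercept):
    """Fraction of the variance in y that the fitted line explains."""
    mean_y = sum(y for _, y in points) / len(points)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in points)
    ss_tot = sum((y - mean_y) ** 2 for _, y in points)
    return 1 - ss_res / ss_tot


slope, intercept = fit_line(posts)
print(f"comments = {slope:.4f} * likes + {intercept:.2f}")
print(f"R^2 = {r_squared(posts, slope, intercept):.3f}")
```

Swapping in your own numbers is a one-line change to `posts`. A low R^2 is a good opening to ask the class what else, besides Likes, might drive Comments.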
Naive Bayes: Hate Speech Detection
We often use spam detection as the example for Naive Bayes. It’s a fine example, but here’s another option: hate speech detection on social media. I’ve put together a video that explains using Naive Bayes for hate speech, including some discussion prompts for your students to think about. Hate speech detection is a daunting task: who decides what counts as hate speech? What about when marginalized communities reclaim slurs for use within their own in-group? What if you’re posting about slurs someone else said to you, in order to educate about antiracism or disability justice? What if you’re being bullied in the comments, but then you are the one who gets reported? Content moderation in general is a difficult task; try guiding your class through some conversations about how ML is used online to try to automate moderation efforts.
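To make the mechanics concrete for students, here is a small sketch of a multinomial Naive Bayes classifier with add-one (Laplace) smoothing, written in plain Python. Everything in it is an assumption for illustration: the six training posts are invented, the labels `flag` and `ok` are arbitrary, and `slur_a` and `slur_b` are placeholder tokens standing in for abusive terms. It is not how any real platform moderates content.

```python
import math
from collections import Counter

# Hypothetical labeled posts. "slur_a" / "slur_b" are placeholder tokens;
# this is not a real moderation dataset.
training = [
    ("you are a slur_a", "flag"),
    ("slur_b go away", "flag"),
    ("all slur_a are awful", "flag"),
    ("have a lovely day", "ok"),
    ("you are so kind", "ok"),
    ("great photo love it", "ok"),
]


def train(examples):
    """Count words per label, posts per label, and the overall vocabulary."""
    word_counts = {}          # label -> Counter of words
    label_counts = Counter()  # label -> number of posts
    vocab = set()
    for text, label in examples:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
        vocab.update(words)
    return word_counts, label_counts, vocab


def predict(text, model):
    """Return (best_label, log_scores) for a new post."""
    word_counts, label_counts, vocab = model
    total_posts = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        score = math.log(label_counts[label] / total_posts)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # words never seen in training carry no evidence
            # Add-one smoothing so an unseen (word, label) pair is not probability 0
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        scores[label] = score
    return max(scores, key=scores.get), scores


model = train(training)
print(predict("have a lovely day", model)[0])
print(predict("proud to be slur_a", model)[0])
```

With this toy training set, a post that reclaims the placeholder term, such as "proud to be slur_a", is still labeled `flag`, because the model only sees word counts and has no notion of who is speaking or why. That failure is a useful starting point for the questions above.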
K Nearest Neighbors: Suggested Content and Radicalization
So I’m the green circle… and you’re the blue square? Are we friends?
Facebook has denied using location data to suggest ads and friends, though I wouldn’t be surprised if they still do. However, there are other features that probably work ‘better’ for predicting who you’re likely to add to your network. Work together with your class to think through a KNN example about Suggested Accounts to Follow on various social media platforms. Perhaps the feature is the number of messages people have exchanged with your most contacted friends. Perhaps it’s the number of ad interests in common, or your age, or your political affiliation. What are the dangers of ‘birds of a feather flock together’? Especially in terms of political polarization or radicalization, how can KNN go wrong? This is especially clear on YouTube, where one can be pulled into right-wing rabbit holes. (Please note: this NYT interactive article contains many disturbing, transphobic, sexist, and racist themes.) I wouldn’t use this article specifically in a classroom setting, but it may be good for your own background knowledge as an educator. Feel free to skip it if you’re already familiar with YouTube radicalization.
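For a hands-on version of the Suggested Accounts idea, here is a minimal KNN sketch in plain Python. The two features (shared ad interests and messages exchanged with mutual friends), the labels, and all the numbers are hypothetical choices for this sketch; nothing here reflects how any platform actually builds its suggestions.

```python
import math
from collections import Counter

# Hypothetical past suggestions:
# (shared ad interests, messages exchanged with mutual friends) -> what the user did
labeled = [
    ((12, 40), "followed"),
    ((10, 35), "followed"),
    ((9, 50), "followed"),
    ((2, 5), "ignored"),
    ((1, 0), "ignored"),
    ((3, 8), "ignored"),
]


def knn_predict(query, examples, k=3):
    """Majority vote among the k labeled examples closest to the query."""
    by_distance = sorted(examples, key=lambda ex: math.dist(query, ex[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# A new candidate account that looks a lot like accounts the user already follows
print(knn_predict((11, 38), labeled))
```

The sketch skips feature scaling, so the feature with the larger range dominates the distance; that is worth pointing out in class. It also makes the ‘birds of a feather’ problem visible: the prediction can only ever echo what the nearest, most similar examples already did.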
Decision Tree/Random Forest: Targeted Ads and Underage Drinking
I don’t have data for this one, but you can imagine using part of the lesson on Decision Trees to discuss targeted ads. Here’s an article talking about how Facebook has approved alcohol and gambling ads directed at teens, something they claimed to disallow. Alcohol ads are rampant on social media, as are dieting ads. What role do you think social media ads play in underage drinking and body image? Beyond teens, many recovering alcoholics also struggle when exposed to alcohol content on social media. Instagram has since introduced the ability to block alcohol ads.
Imagine a decision tree that pays attention to gender, age, income, past purchases, etc. How can we discuss Decision Trees in the context of predatory and harmful advertising on social media? Here’s an example from Kaggle to get you started.
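To show what a single node of such a tree is doing, here is a small sketch that picks an age threshold by minimizing weighted Gini impurity, which is one common splitting criterion. The records are invented for illustration (there is no real ad data behind them), and `best_age_split` is a hypothetical helper, not part of the Kaggle example or any library.

```python
from collections import Counter

# Hypothetical ad-impression records: (age, clicked_alcohol_ad)
records = [
    (15, True), (16, True), (17, False), (19, True),
    (24, False), (31, False), (45, False), (52, False),
]


def gini(labels):
    """Gini impurity: 0.0 for a pure group, 0.5 for an even two-class mix."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_age_split(rows):
    """Try each midpoint between consecutive ages; return (threshold, weighted_gini)."""
    ages = sorted({age for age, _ in rows})
    best = None
    for lo, hi in zip(ages, ages[1:]):
        threshold = (lo + hi) / 2
        left = [label for age, label in rows if age < threshold]
        right = [label for age, label in rows if age >= threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
        if best is None or weighted < best[1]:
            best = (threshold, weighted)
    return best


threshold, impurity = best_age_split(records)
print(f"split at age < {threshold}, weighted Gini = {impurity:.4f}")
```

On this toy data the lowest-impurity threshold falls between ages 19 and 24. In other words, the tree has effectively learned "younger users click alcohol ads" without anyone writing that rule down, which is a good prompt for discussing who is responsible when a model discovers a predatory targeting strategy on its own.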
K-Means/Clustering: Misinformation Spread
Hate “Clusters” Spread Disinformation Across Social Media. Mapping Their Networks Could Disrupt Their Reach.
We can discuss clustering in terms of pockets of social networks that spread mis- and dis-information. Here is a short primer on misinformation and techniques to study it. One issue regarding misinformation is how easily it can spread amongst trusted peers. We see a friend who posts something, and may assume it’s trustworthy and well-intentioned. This kind of word-of-mouth information spread can quickly get out of hand, and the affordances of social media (think Retweets and Reshares) complicate it even further. Could we use clustering to identify sources of misinformation online? What about bridges between clusters? How is it that a conspiracy theory becomes mainstream? These are all things you can think about when discussing clustering analysis in the classroom.
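If you would like students to see the algorithm itself, here is a bare-bones k-means sketch in plain Python. The six points stand for accounts described by two hypothetical, already-scaled behavioral features; the data, the feature meaning, and the choice of starting centroids are all assumptions made for this sketch.

```python
import math

# Hypothetical accounts, each described by two made-up, already-scaled features
accounts = [
    (1.0, 1.2), (0.8, 1.0), (1.1, 0.9),
    (5.0, 5.2), (5.3, 4.9), (4.8, 5.1),
]


def assign(points, centroids):
    """Put each point in the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters


def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points and moving centroids to cluster means."""
    for _ in range(iterations):
        clusters = assign(points, centroids)
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroid  # keep an empty cluster's centroid where it was
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, assign(points, centroids)


# Start from two of the data points; real implementations usually randomize this
centroids, clusters = kmeans(accounts, [accounts[0], accounts[3]])
for centroid, members in zip(centroids, clusters):
    print(centroid, members)
```

A useful extension is to add one account that sits halfway between the two groups and ask which cluster it lands in. That account is the "bridge" from the discussion above, and k-means will force it into one side even though its real role is to connect both.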
Collaborative Filtering: Recommended Interests and Privacy Leaks
Here is one of my projects that explores how Facebook decides what you might be interested in, using collaborative filtering. Facebook gives us access to the Ad Interests it thinks we have. These interests are likely curated using a collaborative filtering process: comparing us to similar friends and mining their interests for potential recommendations. My particular project explored how information may “leak” through the collaborative filter: family may discover you are queer, your political ideals may be shaped by those around you more than you realize, misinformation can easily spread between trusted connections, or you may be overly targeted with dieting ads. Not only can we use our personal data as an interesting look ‘behind the curtain’ of popular social media platforms, but we can also critically engage with complex topics about privacy, community, and safety.
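Here is a toy sketch of user-based collaborative filtering that shows how such a "leak" can happen. It scores interests you don't have by how similar you are to the people who do have them, using Jaccard similarity. The users, the interest labels, and the scoring rule are all hypothetical; this is a classroom illustration, not Facebook's actual method.

```python
# Hypothetical users and the ad interests a platform has inferred for them
interests = {
    "you": {"hiking", "cooking"},
    "friend_a": {"hiking", "cooking", "lgbtq_events"},
    "friend_b": {"hiking", "dieting"},
    "stranger": {"cars", "golf"},
}


def jaccard(a, b):
    """Overlap between two sets: size of intersection divided by size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(user, data):
    """Score interests the user lacks by the similarity of the users who hold them."""
    scores = {}
    for other, other_interests in data.items():
        if other == user:
            continue
        similarity = jaccard(data[user], other_interests)
        if similarity == 0:
            continue  # users with nothing in common contribute no evidence
        for interest in other_interests - data[user]:
            scores[interest] = scores.get(interest, 0.0) + similarity
    # Highest score first; ties broken alphabetically for a stable order
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


for interest, score in recommend("you", interests):
    print(f"{interest}: {score:.2f}")
```

In this toy data, the top recommendation for "you" comes entirely from one close friend's interests, and anyone who can see your recommended interests can infer something about that friend, and about you. The second recommendation, dieting, arrives the same way. Neither was ever something you told the platform.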
In Conclusion
Hopefully some of these lesson ideas inspire you! Just to reiterate, social media data itself is personal, sensitive, and private. How you use social media may be very different to how your students use it or have experienced it. But as long as you keep the conversation as compassionate and consensual as possible, there is a lot of opportunity in using social media examples when teaching about machine learning algorithms.