Interning at RStudio: Statistical Literacy for Software Engineers

Yim Register (they/them)
5 min read · Jul 15, 2019


I’ve been encouraged to document a bit about my 2019 internship at RStudio. After going on a hike yesterday where I explained, one by one, each lesson I’ve been creating (in possibly painful detail for anyone listening), I’m delighted to say that I must really like what I’m doing!

This summer, I’m working on creating lessons that introduce computer science students to statistics. You might ask: why don’t they just take a statistics course? If they have the option, they certainly should. But imagine integrating statistical questions into computer science coursework, prompting learners to think about their code not only as software, but as data that can help answer questions. One thing missing from many statistics curricula is the idea that how you set up a problem, and what you choose to measure, changes the result.

We live in a world where everyone claims to be able to predict the future with their Big Data Cognitive Machine Learning Models (+ blockchain!). There is an argument for everything out there, with people rapidly sharing “studies” and “statistics” on social media. As I’ve mentioned before, I dream of a world where consumers of science media can critically question methods, results, and visualizations. I find it incredibly empowering to have the vocabulary and skills to question what I come across on Google; to not be continuously pulled in different directions by study after study telling me something completely different (though sometimes that actually happens in earnest, of course).

“how you set up a problem, and what you choose to measure, changes the result.”

In my experience, statistics courses teach how to set up problems on clean and irrelevant data. This is, of course, for a reason. I don’t claim to have all the perfect datasets to teach every concept, perfectly curated and unproblematic in every way. Additionally, I’ve talked a lot about teaching statistics via personally relevant datasets. But should someone with no statistics background be asked to demonstrate the gender gap in CS, or the effects of climate change? Probably not. This internship is helping me refine what I mean by solving real problems and using personally meaningful data. I personally will still chuck out the iris dataset any day, though I’m not an ecologist. But I’m learning a lot from my mentor about how to slowly build curiosity that leads to a genuine desire to answer the bigger questions somewhere down the line. The middle ground is exactly where my work is centered: personally meaningful questions and data that can demonstrate the nuance that goes into obtaining statistical results.

Have you ever wondered…

I don’t know about you, but I’ve gotten into so many stupid-but-smart arguments. By that I mean deeply engaging in the setup, definition, and operationalization of idiotic concepts. My most recent one was about “cooking” vs. “assembly”. Is making a sandwich considered “cooking”? How about heating up a pizza? Does the definition of “cooking” require heat? Surely making an intricate sushi platter from scratch is cooking, right? Another idiotic argument I recall was about the phrase “nook and cranny” and trying to define the difference between the two. Perhaps your arguments are much more fun, like ranking Marvel superheroes or discussing what would happen if a vampire bit a werewolf. Maybe you use your time and cognitive effort in much more productive ways. Don’t worry, I’m not creating a curriculum about what counts as a “sandwich”. But I am harnessing that “hmm” intrigue that leads us down a rabbit hole to different caveats, arguments, and conclusions.

Statistics revolves around that “hmm” intrigue. Different paths, different tests, different setups; they can all lead to different answers. Sometimes we regard a statistic as fact far too easily: someone reports a number and that’s that. This is especially true in scenarios like an expert witness at a trial or a consultant at a business meeting, and even more so in all corners of the media we consume. My lessons let the learner follow the trail of “well, what about this?!” questions, developing their ability to critically design and test claims of positivist truth in the world.

How fast is GitHub growing?

Can we fit a model to show how fast GitHub is growing? What does it even mean for it to be growing? Number of users? Number of commits? Storage space being used? Do we count non-code? Do we count repositories that haven’t been committed to in years? Do we count forked repos? All of these decisions are part of operationalizing a problem, a core statistical literacy skill.
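To make that concrete: suppose we operationalize “growth” as registered users per year and try an exponential fit. Here’s a minimal sketch in Python, with placeholder numbers rather than real GitHub figures; every choice in it (users as the measure, an exponential as the model) is one of the decisions above.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical yearly counts of registered GitHub users, in millions.
# Placeholder values for illustration, not real GitHub data.
years = np.array([2012, 2013, 2014, 2015, 2016, 2017, 2018], dtype=float)
users = np.array([2.8, 4.0, 8.0, 12.0, 18.0, 26.0, 31.0])

def exponential(t, a, b):
    # users(t) = a * exp(b * (t - 2012)): a is the starting size,
    # b is the continuous yearly growth rate.
    return a * np.exp(b * (t - years[0]))

params, _ = curve_fit(exponential, years, users, p0=(3.0, 0.3))
print(f"fitted yearly growth rate: {params[1]:.2f}")
```

Swap “users” for “commits” or “active repositories” and the fitted curve, and the story it tells, can change entirely.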

How bad is sleep deprivation for your code?

If I told you that developers who lost a night of sleep performed 46% worse on a coding task, how would you think I came to that conclusion? Well, Fucci et al. actually did it: Need for Sleep: the Impact of a Night of Sleep Deprivation on Novice Developers’ Performance. But reading academic papers can be daunting for anyone, so this lesson walks through how to perform statistical tests like the Shapiro-Wilk test, the Kruskal-Wallis one-way analysis of variance, and even a Bonferroni correction. It also walks through how to critically read empirical papers, defining vocabulary and explaining what warrants certain designs, tests, and controls.
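As a taste of the mechanics, here is a minimal sketch of those three tools using scipy, run on made-up scores rather than the study’s actual data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Made-up task scores for illustration; the real data come from the paper.
rested   = rng.normal(70, 10, size=20)
deprived = rng.normal(55, 12, size=20)

# Shapiro-Wilk: checks whether each group's scores look normally distributed.
print(stats.shapiro(rested))
print(stats.shapiro(deprived))

# Kruskal-Wallis: a rank-based test for a difference between groups,
# a common choice when normality is in doubt.
stat, p = stats.kruskal(rested, deprived)

# Bonferroni correction: when running k tests, multiply each p-value by k
# (equivalently, divide the significance threshold by k).
k = 3  # e.g., if this comparison were one of three
print(f"H = {stat:.2f}, raw p = {p:.4f}, adjusted p = {min(p * k, 1.0):.4f}")
```

Which test to run, and how many comparisons to correct for, are exactly the design decisions the lesson asks learners to justify.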

How do Abstract Syntax Trees compare between different languages?

Everyone loves to say, “Oh, you should use this language because it’s better than that language…” It’s like we all have some weird folk knowledge of why our preferred language is better, without a ton of evidence. This problem looks into some real comparisons between languages’ ASTs, using JavaScript’s acorn.js and Python’s ast, and encourages learners to collect their own data on their programming language of choice.
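On the Python side, a first measurement can be as simple as tallying node types with the standard ast module; the acorn.js side would need an analogous walk in JavaScript, and deciding which counts are even comparable across languages is itself part of the problem. A rough sketch:

```python
import ast
from collections import Counter

# A tiny snippet to measure; any Python source string works here.
source = """
def greet(name):
    return "Hello, " + name
"""

tree = ast.parse(source)

# Tally how often each node type appears in the tree: one crude way
# to start comparing the "shape" of programs across languages.
counts = Counter(type(node).__name__ for node in ast.walk(tree))
print(counts.most_common())
```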

As I delve into each of these problems, more and more statistical caveats arise. At what point do we stop? At what point are the basic concepts “good enough” to teach a generation of statistically literate thinkers? I believe that my mentor and I are moving ahead with the idea that teaching to question is actually better than teaching the answers, and that is a very powerful idea.
