In my last year of teaching, I taught three AP STEM courses at the same time. AP Statistics and AP Computer Science Principles were both new to me that year, while I had taught AP Computer Science A the year before. I took a risk and invested time in learning how to do statistics with Python, in an attempt to create an overlap between two of my preps. In the end, it paid off: I found a new passion and am now pursuing a career in data science.
This post is the first in a series of standalone posts dedicated to solving basic AP-style statistics problems from scratch using Python. I will keep the use of specialized libraries to a minimum. It is important to keep in mind that only parts of the statistical pipeline can (or should) be automated, and that computing specific values is just one step in problem-solving; interpreting and applying your results is the ultimate goal. Many students get so far into the weeds with number-crunching that they lose sight of the bigger picture. By automating the calculations with code, we can focus on what those calculations actually tell us.
Please keep in mind that this post is not an introduction to Python, nor is it an introduction to statistics. Instead, it is an introduction to merging the two to enhance your understanding of both. It could be used by students, teachers, or self-studiers.
Here is an example of an AP Statistics problem involving categorical variables:
This is the global distribution of roller coasters found in the Roller Coaster Database.
We’ll tackle these individually. Our approach will be to create a set of functions we can use to solve problems of this type, rather than this problem alone. The goal is not to solve this individual problem — it is to understand how problems that look like this are solved in general.
First, we need some sort of object that will tie our data to our levels. We could accomplish this with two `list` objects (and there’s absolutely nothing wrong with that), but I’m going to take advantage of one of my favorite Python objects: a `dict`. Dictionaries are nice for a variety of reasons, and many libraries such as `pandas` allow you to pass in dictionaries as you construct more sophisticated objects like a `DataFrame`.
Here is the dictionary we will work with:
coasters = {'Africa': 90, 'Asia': 2649, 'Australia': 27, 'Europe': 1329, 'North America': 905, 'South America': 175}
In particular, I like how the `value` (the frequency) is tied directly to the `key` (the continent). Keep in mind that as we progress, our implementation will be using the methods and attributes of a `dict`, so if you are unfamiliar with this type of object, you may want to review the [dict](https://www.w3schools.com/python/python_dictionaries.asp) type before proceeding.
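To make the comparison with the two-list approach concrete, here is a minimal sketch. The names `continents` and `counts` are my own and exist only for illustration; the dictionary is the same `coasters` object defined above.

```python
# Two parallel lists: the link between a continent and its count is
# only implied by position, so the lists must always stay in sync.
continents = ['Africa', 'Asia', 'Australia', 'Europe',
              'North America', 'South America']
counts = [90, 2649, 27, 1329, 905, 175]

# Looking up Europe means finding its position in one list first.
europe_from_lists = counts[continents.index('Europe')]

# The dictionary stores each count directly under its continent.
coasters = {'Africa': 90, 'Asia': 2649, 'Australia': 27,
            'Europe': 1329, 'North America': 905, 'South America': 175}
europe_from_dict = coasters['Europe']

print(europe_from_lists, europe_from_dict)

# zip pairs the two lists up, so converting between the forms is easy.
coasters_from_lists = dict(zip(continents, counts))
print(coasters_from_lists == coasters)
```

Both approaches hold the same information. The dictionary simply makes the pairing explicit, which is why I lean on it here.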
At this point, it’s worth double-checking that the values in my dictionary are correct. If they are, this is the only time we’ll need to do this. (Compare that to entering a long string of numbers into your calculator, only to discover you missed a digit on one of them.)
Similar to how the table is divided into a left column (the name of the continent) and a right column (the number of roller coasters), our dictionary naturally separates continents into `keys` and frequencies into `values`, and each “row” in the table becomes its own key-value pair.
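As a quick illustration of that correspondence, the sketch below pulls the two "columns" out of the dictionary and then prints it back one "row" at a time, which also makes it easy to compare against the original table. It uses only the built-in `keys()`, `values()`, and `items()` methods; the column widths in the format string are an arbitrary choice of mine.

```python
coasters = {'Africa': 90, 'Asia': 2649, 'Australia': 27,
            'Europe': 1329, 'North America': 905, 'South America': 175}

# The "left column" of the table: the continents.
continent_names = list(coasters.keys())
print(continent_names)

# The "right column" of the table: the frequencies.
frequencies = list(coasters.values())
print(frequencies)

# Each "row" of the table is one key-value pair.
for continent, count in coasters.items():
    print(f'{continent:<15}{count:>6}')
```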
On to our first question!
To solve this problem by hand, we can look at our table and add up all of the numbers in the column on the right:
90 + 2649 + 27 + 1329 + 905 + 175 = 5175
Unfortunately, this approach doesn’t generalize to other problems. If I’m given another table of counts, I’d have to pull out my TI-84 Plus (in true AP Statistics fashion) and enter all of the numbers all over again from the beginning, hoping that I don’t make any typos or accidentally miss a number. With a dataset this small, that might not be an issue, but as the number of categories increases, so do the odds of me fat-fingering the number pad.
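One way to make the hand calculation reusable is to wrap it in a small function. The sketch below is only one possible version: the name `total_count` and its parameter `freq_table` are hypothetical, and it assumes the table is a dictionary that maps each category to its count, like `coasters`.

```python
def total_count(freq_table):
    """Return the total number of observations in a frequency table.

    freq_table is assumed to be a dict mapping each category to its count.
    """
    return sum(freq_table.values())


coasters = {'Africa': 90, 'Asia': 2649, 'Australia': 27,
            'Europe': 1329, 'North America': 905, 'South America': 175}

print(total_count(coasters))  # 5175, matching the sum computed by hand
```

Because the function only relies on the dictionary's values, the same call works for any other table of counts without retyping a single number.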