In my last year of teaching, I had three AP STEM courses at the same time. AP Statistics and AP Computer Science Principles were both new to me that year, while I had taught AP Computer Science A the year before. I took a risk and invested time in learning how to do statistics with Python, in an attempt to create an overlap between two of my preps. In the end, it definitely paid off: I found a new passion and am now pursuing a career in data science.

This post is the first in a series of standalone posts dedicated to solving basic AP-style statistics problems from scratch using Python. I will keep the use of specialized libraries to a minimum. It is important to keep in mind that only parts of the statistical pipeline can (or should) be automated, and that computing specific values is just one step in problem-solving; interpreting and applying your results is the ultimate goal. Many students get so far into the weeds with number-crunching that they lose sight of the bigger picture. By automating the calculations with code, we can focus on what those calculations actually tell us.

Please keep in mind that this post is not an introduction to Python, nor is it an introduction to statistics. Instead, it is an introduction to merging the two to enhance your understanding of both. It could be used by students, teachers, or self-studiers.

Analyzing A Single Categorical Variable

Here is an example of an AP Statistics problem involving categorical variables:

Example Problem: Analyzing the Distribution of a Categorical Variable

This is the global distribution of roller coasters found in the Roller Coaster Database:

[Image: table of roller coaster counts by continent]

1. According to this table, how many roller coasters are there in the world in total?
2. (a) Create a relative frequency table of these data. Give your answers in percents, rounded to one decimal place. (b) Do your percents add up to 100? Why or why not?
3. What percent of roller coasters are in either North America or South America?
4. Construct a bar chart and a pie chart of these data. Don’t forget to include labels and a descriptive title!

We’ll tackle these individually. Our approach will be to create a set of functions we can use to solve problems of this type, rather than this problem alone. The goal is not to solve this individual problem — it is to understand how problems that look like this are solved in general.

First, we need some sort of object that ties our data to our levels. We could accomplish this with two list objects (and there’s absolutely nothing wrong with that), but I’m going to take advantage of one of my favorite Python objects: a dict. Dictionaries are nice for a variety of reasons, and many libraries, such as pandas, allow you to pass in dictionaries as you construct more sophisticated objects like a DataFrame.

Here is the dictionary we will work with:

coasters = {'Africa': 90, 'Asia': 2649, 'Australia': 27, 'Europe': 1329, 'North America': 905, 'South America': 175}

In particular, I like how the value (the frequency) is tied directly to the key (the continent). Keep in mind that as we progress, our implementation will be using the methods and attributes of a dict, so if you are unfamiliar with this type of object, you may want to review the [dict](https://www.w3schools.com/python/python_dictionaries.asp) type before proceeding.

At this point, it’s worth checking and double-checking that the values in my dictionary are correct. If they are, this is the only time we’ll need to do this. (Compare that to entering a long string of numbers into your calculator, only to discover you missed a digit on one of them.)

Just as the table is divided into a left column (the name of the continent) and a right column (the number of roller coasters), our dictionary naturally separates continents into keys and frequencies into values, so each “row” in the table becomes its own key-value pair.
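To make the table-to-dictionary correspondence concrete, here is a minimal sketch that pulls the two "columns" back out of the dictionary and prints each key-value pair as a row. It relies only on the built-in dict methods keys(), values(), and items(). The helper name format_rows is a hypothetical choice for illustration, not part of any library.

```python
coasters = {'Africa': 90, 'Asia': 2649, 'Australia': 27,
            'Europe': 1329, 'North America': 905, 'South America': 175}

# The "left column" of the table: the levels of the categorical variable.
continents = list(coasters.keys())

# The "right column" of the table: the frequency for each level.
counts = list(coasters.values())


def format_rows(freq):
    """Return one formatted string per key-value pair (one per table row).

    Hypothetical helper: the level is left-aligned, the count right-aligned.
    """
    return [f"{level:<15}{count:>6}" for level, count in freq.items()]


for row in format_rows(coasters):
    print(row)
```

Because dictionaries preserve insertion order (Python 3.7 and later), the rows print in the same order they were entered.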

On to our first question!

1. How many roller coasters are there in the world, in total?

To solve this problem by hand, we can look at our table and add up all of the numbers in the column on the right:

90 + 2649 + 27 + 1329 + 905 + 175 = 5175

Unfortunately, this approach doesn’t generalize to other problems. If I’m given another table of counts, I’d have to pull out my TI-84 Plus (in true AP Statistics fashion) and enter all of the numbers all over again from the beginning, hoping that I don’t make any typos or accidentally miss a number. With a dataset this small, that might not be an issue, but as the number of categories increases, so do the odds of me fat-fingering the number pad.
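One way to sidestep the retyping is to wrap the addition in a small function that accepts any dictionary of counts. The sketch below is an assumption about how such a helper might look, not necessarily the function this series goes on to use, and the name total_count is hypothetical. It adds the values with an explicit loop to stay close to "from scratch", though Python's built-in sum() gives the same result.

```python
def total_count(freq):
    """Add up the frequencies stored as the values of a dict of counts."""
    total = 0
    for count in freq.values():
        total += count
    return total


coasters = {'Africa': 90, 'Asia': 2649, 'Australia': 27,
            'Europe': 1329, 'North America': 905, 'South America': 175}

print(total_count(coasters))  # 5175, matching the hand calculation above

# The built-in sum() over the values is an equivalent one-liner.
print(sum(coasters.values()))
```

Given a different table of counts, only the dictionary needs to change; the function stays the same.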

