Many bike share systems make their trip data available to anyone who wants to understand how the systems are used. The bike share system in New York City, Citi Bike, is one of them, but it doesn’t provide much beyond the data itself. I have some experience obtaining and preparing this data for visualization, so in this article I will show you how to get started with this rich data source.
In the Before Times I commuted from suburban New Jersey to my job as a Product Manager in New York City at an office, now shuttered, above Penn Station. To get around in the City at lunch or after work I often relied on Citi Bike, New York’s bike share system. I found I could get to destinations in midtown and even further afield faster than walking and cheaper than the bus or subway. When I discovered that Citi Bike made trip data publicly available I thought that it might provide an interesting use case for the data preparation product that I managed.
Working with real data turned out to be much more interesting than working with the sample files we had been using: there were actual anomalies that needed to be cleaned up to make the data useful for analysis, and there were interesting stories to tell from the data.
Citi Bike is a traditional bike share system with fixed stations: a user picks up a bike at one dock, using a key fob or a code, and returns it at another. The trip data files contain one record for each ride, around two million records per month, depending on the season. The station and time at which each ride started and stopped are recorded.
Some limited information about the rider is also recorded: their gender and year of birth. Citi Bike also distinguishes between what they call Subscribers, who buy an annual pass (current cost is $179 for unlimited rides up to 45 minutes), and Customers, who buy a day pass ($15 for unlimited 30-minute rides) or a single ride pass ($3).
For each user type there are overage fees for longer rides. For Customers it’s $4 per 15 minutes; for Subscribers it’s $0.15 per minute. These fees seem designed more to discourage long rides than to increase revenue.
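To make the fee arithmetic concrete, here is a minimal sketch that estimates the overage charge for a single ride from the figures quoted above. The function name `overage_fee` is hypothetical, and the rounding rules are my assumptions: I assume Customers are billed for each started 15-minute block, Subscribers for each started minute, and that a Customer is on the 30-minute day pass. Citi Bike's actual billing rules may differ.

```python
import math

# Free ride time in minutes included with each pass type (figures quoted in the article).
FREE_MINUTES = {"Subscriber": 45, "Customer": 30}


def overage_fee(user_type: str, ride_minutes: float) -> float:
    """Estimate the overage fee in dollars for one ride (illustrative only)."""
    extra = max(0.0, ride_minutes - FREE_MINUTES[user_type])
    if extra == 0:
        return 0.0
    if user_type == "Subscriber":
        # Assumption: $0.15 for every started minute beyond the free period.
        return round(0.15 * math.ceil(extra), 2)
    # Assumption: $4 for every started 15-minute block beyond the free period.
    return 4.0 * math.ceil(extra / 15)


print(overage_fee("Subscriber", 55))  # 10 extra minutes
print(overage_fee("Customer", 50))    # 20 extra minutes, two started blocks
```

Under these assumptions a Customer who keeps a bike 20 minutes too long pays far more than a Subscriber does, which fits the idea that the fees are there to discourage long rides.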
The Citi Bike System Data page describes the information provided. The specific information for each ride is:
The kinds of questions we wanted to answer included ones like these: What’s the most common ride duration? What times of the day does the system get the most usage? How much does ridership vary over the course of a month? What are the most used stations? How old are the riders?
While the answers to these questions can be found in the trip data files, the data needs to be augmented to provide easy answers. For example, the trip duration in seconds is too granular; minutes would be more useful.
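As a rough illustration of this kind of augmentation, the sketch below uses Pandas to derive a duration in minutes, the hour the ride started, and the rider's age. The three-row DataFrame is made-up stand-in data, not real Citi Bike records, and the column names (`tripduration`, `starttime`, `birth year`) are assumptions about the file layout; adjust them to match the file you download.

```python
import pandas as pd

# Made-up stand-in rows; column names are assumed, not taken from a real file.
trips = pd.DataFrame({
    "tripduration": [300, 905, 3600],  # seconds
    "starttime": [
        "2020-03-01 08:15:00",
        "2020-03-01 17:40:30",
        "2020-03-02 12:05:10",
    ],
    "birth year": [1985, 1992, 1970],
})

# Parse the timestamp strings so date parts can be extracted.
trips["starttime"] = pd.to_datetime(trips["starttime"])

# Seconds are too granular; keep the duration rounded to whole minutes.
trips["trip_minutes"] = (trips["tripduration"] / 60).round().astype(int)

# Hour of day the ride started, useful for usage-by-time questions.
trips["start_hour"] = trips["starttime"].dt.hour

# Approximate rider age: ride year minus birth year.
trips["age"] = trips["starttime"].dt.year - trips["birth year"]

print(trips[["trip_minutes", "start_hour", "age"]])
```

With columns like these in place, questions such as the most common ride duration or the busiest hours reduce to a `value_counts()` or a `groupby()` on the derived column.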
Over the years I used this data for numerous presentations to customers and at user group meetings. The cleansed data I created was also used by the product managers for a visualization tool in their own presentations.
When I happened to use Jupyter Notebook, Python and Pandas for another project, I decided to see what it would take to prepare the Citi Bike trip data using these tools.
Jupyter Notebook is an open-source web-based application that allows you to create and share documents that contain code, visualizations and narrative text. It’s commonly used for data preparation and visualization but has many other uses as well. **Python** is the programming language used by default and Pandas is a software library widely used for data manipulation and analysis. I also used **Seaborn** as an easy way to visualize the data.
The Jupyter Notebook with all the code and output can be found on GitHub.