When we’re doing data analysis with Python, we might sometimes want to add a column to a pandas DataFrame based on the values in other columns of the DataFrame.
Although this sounds straightforward, it can get a bit complicated if we try to do it using an if-else conditional. Thankfully, there’s a simple way to do this using numpy!
To learn how to use it, let’s look at a specific data analysis question. We’ve got a dataset of more than 4,000 Dataquest tweets. Do tweets with attached images get more likes and retweets? Let’s do some analysis to find out!
We’ll start by importing pandas and numpy, and loading up our dataset to see what it looks like. (If you’re not already familiar with using pandas and numpy for data analysis, check out our interactive numpy and pandas course.)
import pandas as pd
import numpy as np
df = pd.read_csv('dataquest_tweets_csv.csv')
df.head()
We can see that our dataset contains a bit of information about each tweet, including:
- date: the date the tweet was posted
- time: the time of day the tweet was posted
- tweet: the actual text of the tweet
- mentions: any other Twitter users mentioned in the tweet
- photos: the URL of any images included in the tweet
- replies_count: the number of replies on the tweet
- retweets_count: the number of retweets of the tweet
- likes_count: the number of likes on the tweet

We can also see that the photos data is formatted a bit oddly.
For our analysis, we just want to see whether tweets with images get more interactions, so we don’t actually need the image URLs. Let’s try to create a new column called hasimage that will contain Boolean values: True if the tweet included an image and False if it did not.
To accomplish this, we’ll use numpy’s built-in [where()](https://numpy.org/doc/stable/reference/generated/numpy.where.html) function. This function takes three arguments in sequence: the condition we’re testing for, the value to assign to our new column if that condition is true, and the value to assign if it is false. It looks like this:
np.where(condition, value if condition is true, value if condition is false)
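As a quick illustration of the three-argument form, here is a minimal sketch on a small made-up array. The numbers and the 'popular' and 'quiet' labels are invented for this example and are not taken from the tweet dataset.

```python
import numpy as np

# Hypothetical like counts, invented for this example
likes = np.array([0, 12, 3, 40])

# Element by element: 'popular' where the condition is true, 'quiet' where it is false
labels = np.where(likes >= 10, 'popular', 'quiet')
print(labels)  # ['quiet' 'popular' 'quiet' 'popular']
```

The condition is evaluated for every element at once, so there’s no need to write a loop or an if-else block.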
In our data, we can see that tweets without images always have the value [] in the photos column (read from the CSV as the string '[]'). We can use this information and np.where() to create our new column, hasimage, like so:
df['hasimage'] = np.where(df['photos'] != '[]', True, False)
df.head()
Above, we can see that our new column has been appended to our data set, and it has correctly marked tweets that included images as True and others as False.
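If you’d like to try the pattern without the original CSV file, here is a minimal sketch on a toy DataFrame. The rows, the example URL, and the hasimage_direct column name are made up for illustration, and the sketch assumes, as in our data, that photos holds the string '[]' for tweets without images.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the tweet data; the rows and URL are invented
toy = pd.DataFrame({
    'tweet': ['first', 'second', 'third'],
    'photos': ['[]', "['https://example.com/a.png']", '[]'],
})

# Same pattern as before: True where photos is not the string '[]'
toy['hasimage'] = np.where(toy['photos'] != '[]', True, False)

# The comparison alone already yields a Boolean Series with the same values
toy['hasimage_direct'] = toy['photos'] != '[]'

print(toy[['photos', 'hasimage', 'hasimage_direct']])
```

For a plain True/False column, the comparison on its own gives the same result. np.where() is most useful when the two outcomes are values other than True and False, such as text labels or numbers.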