The methods team at Pew Research Center regularly works with survey data in R, and we’ve written many functions to simplify daily tasks like cleaning, weighting and analyzing data. The pewmethods R package, available on the Center’s GitHub page, evolved as a way to reuse and maintain this sort of code and share it with other researchers around the Center. Since many of the problems these functions were designed to solve are not unique to our own projects, we’ve made the package publicly available for other researchers who might find the functions useful, too.

This post walks through the process of exploring survey data in R using pewmethods, including recoding and collapsing data and displaying weighted estimates of categorical variables. Throughout these examples, I’ll make extensive use of the tidyverse set of R packages, which are great tools for data manipulation that we highly recommend using along with pewmethods. You can learn more about using tidyverse in this blog post.

The example dataset

The package includes a survey dataset called dec13_excerpt, which contains selected variables from a survey conducted by Pew Research Center in December 2013. The data contains demographic and some outcome variables, as well as survey weights. You can learn more about the details by calling ?dec13_excerpt.

> dec13_excerpt
# A tibble: 2,001 x 14
   psraid cregion q1    q2    q45   sex   recage receduc racethn2
    <dbl> <fct>   <fct> <fct> <fct> <fct> <fct>  <fct>   <fct>   
 1 100005 Northe… Disa… Very… Disa… Male  55-64  HS gra… White~H…
 2 100007 South   Appr… Very… Appr… Fema… 55-64  Coll+   White~H…
 3 100019 South   Appr… Very… Appr… Fema… 55-64  Some c… Black~H…
 4 100020 Midwest Appr… Not … Appr… Fema… 65+    Some c… White~H…
 5 100021 Northe… Appr… Very… Appr… Fema… 65+    HS gra… Black~H…
 6 100023 Midwest Don'… NA    Disa… Male  45-54  HS gra… White~H…
 7 100027 Northe… Appr… Very… Appr… Fema… 65+    Some c… White~H…
 8 100031 South   Disa… Very… Disa… Fema… 55-64  Coll+   White~H…
 9 100034 Midwest Disa… Very… Disa… Fema… 55-64  Some c… White~H…
10 100037 South   Disa… Very… Disa… Male  35-44  Coll+   White~H…
# … with 1,991 more rows, and 5 more variables: party <fct>,
#   partyln <fct>, weight <dbl>, llweight <dbl>, cellweight <dbl>

Most Pew Research Center survey datasets, as well as those from other organizations, will have one or more variables for the survey weight. This weight is crucial for obtaining accurate estimates from the survey data, since it adjusts the sample to resemble the overall U.S. adult population more closely. In dec13_excerpt, the weight variable is simply called weight, and we’ll be using it to look at weighted cross-tabulations of other variables in the dataset.
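
To make the role of the weight concrete, here is a minimal base R sketch on made-up data. The toy data frame, its answer column and its weight values are all hypothetical and are not taken from dec13_excerpt; the point is only to show how a weighted share can differ from an unweighted one.

```r
# Toy data: hypothetical responses and weights, NOT values from dec13_excerpt
toy <- data.frame(
  answer = c("Approve", "Approve", "Disapprove", "Disapprove"),
  weight = c(0.5, 0.5, 1.5, 1.5)
)

# Unweighted shares: every respondent counts once
unweighted <- prop.table(table(toy$answer))

# Weighted shares: every respondent counts in proportion to their weight
weighted <- prop.table(xtabs(weight ~ answer, data = toy))

round(100 * unweighted)
round(100 * weighted)
```

In this toy example the two answers are tied in the raw counts, but the respondents who disapprove carry more weight, so their weighted share is larger.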

Cleaning and editing survey data

Let’s look at some outcome variables:

> names(dec13_excerpt)
 [1] "psraid"     "cregion"    "q1"         "q2"        
 [5] "q45"        "sex"        "recage"     "receduc"   
 [9] "racethn2"   "party"      "partyln"    "weight"    
[13] "llweight"   "cellweight"

We see three variables that look like survey outcomes: q1, q2 and q45. Let’s take a look at q1. As dec13_excerpt was originally stored as an IBM SPSS file, we can use the get_spss_label() function to view the label associated with q1. For Pew Research Center survey data, this will either be the question wording or a brief description:

> get_spss_label(dec13_excerpt, "q1")
[1] "Q.1 Do you approve or disapprove of the way Barack Obama is handling his job as President?"

q1 is an Obama approval question, so let’s run a quick table:

> tablena(dec13_excerpt$q1)
dec13_excerpt$q1, a factor
                  Approve                Disapprove 
                      839                      1042 
Don't know/Refused (VOL.) 
                      120

The tablena() function in pewmethods works the same way as base R’s table() function, except that it tells you the specific object you just ran a table on (along with its class), and will always display any NA values rather than hiding them by default.
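
For comparison, the NA-handling part of this behavior can be approximated in base R with the useNA argument to table(). This is only a sketch of the difference on a made-up vector, not the implementation of tablena():

```r
# Hypothetical toy vector with one missing value
x <- factor(c("Approve", "Disapprove", NA, "Approve"))

# Default table(): the NA is silently dropped from the counts
default_counts <- table(x)

# useNA = "always": an NA column is shown even when there are no NAs
counts_with_na <- table(x, useNA = "always")

default_counts
counts_with_na
```

Showing the NA count by default makes it harder to overlook missing responses when checking a variable.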

Now let’s look at q2:

> get_spss_label(dec13_excerpt, "q2")
[1] "Q.2 Do you [approve/disapprove] very strongly, or not so strongly?"

q2 is a direct follow-up to q1. After asking respondents whether they approved or disapproved of Obama, we asked them whether they did so very strongly or not so strongly. So q1 and q2 are best analyzed together.

We can do this by creating a new variable, which we’ll call obama_approval_scale, that combines the two into a single variable with the categories Approve very strongly, Approve not so strongly, Disapprove not so strongly and Disapprove very strongly, as well as Don’t know/Refused (VOL.). The fct_case_when() function is a straightforward and readable way to create that combined variable. fct_case_when() is a wrapper around the case_when() function from dplyr that coerces its output into a factor whose levels are in the order that they were passed into the function.
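
The idea behind such a wrapper can be sketched with dplyr alone: call case_when() and then convert the result to a factor with explicitly ordered levels. The toy data and the q2 response labels below are assumptions made for illustration; check the actual levels in dec13_excerpt before adapting this, and note that this is not the package's own implementation.

```r
library(dplyr)

# Hypothetical toy data; the q2 labels are assumed, not copied from dec13_excerpt
toy <- tibble(
  q1 = c("Approve", "Approve", "Disapprove", "Disapprove",
         "Don't know/Refused (VOL.)"),
  q2 = c("Very strongly", "Not so strongly", "Not so strongly",
         "Very strongly", NA)
)

# Desired level order, from strongest approval to strongest disapproval
scale_levels <- c(
  "Approve very strongly",
  "Approve not so strongly",
  "Disapprove not so strongly",
  "Disapprove very strongly",
  "Don't know/Refused (VOL.)"
)

toy <- toy %>%
  mutate(
    obama_approval_scale = factor(
      case_when(
        q1 == "Approve" & q2 == "Very strongly" ~ "Approve very strongly",
        q1 == "Approve" & q2 == "Not so strongly" ~ "Approve not so strongly",
        q1 == "Disapprove" & q2 == "Not so strongly" ~ "Disapprove not so strongly",
        q1 == "Disapprove" & q2 == "Very strongly" ~ "Disapprove very strongly",
        q1 == "Don't know/Refused (VOL.)" ~ "Don't know/Refused (VOL.)"
      ),
      levels = scale_levels
    )
  )

# Rows matching none of the conditions above would be left as NA
levels(toy$obama_approval_scale)
```

Without the explicit levels argument, factor() would sort the categories alphabetically, which is the step that a wrapper ordering levels by the sequence of conditions saves you from writing by hand.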
