This post is a companion piece to our basic introduction for researchers interested in using the new pewmethods R package. That introduction includes tips on how to recode and collapse data and display weighted estimates of categorical variables. Here, I’ll go through the process of weighting and analyzing a survey dataset. Along the way, I’ll show you how to use pewmethods to clean and recode the variables we’ll use for weighting, create weighting parameters from external data sources, and rake and trim survey weights.

These examples make extensive use of the tidyverse set of R packages. You can learn more about using the tidyverse with this post.

The example dataset

The package includes a survey dataset called dec13_excerpt, which contains selected variables from a survey conducted by Pew Research Center in December 2013. The data contains demographics and some outcome variables, as well as survey weights. You can learn more about the details by calling ?dec13_excerpt.

> dec13_excerpt
# A tibble: 2,001 x 14
   psraid cregion q1    q2    q45   sex   recage receduc
    <dbl> <fct>   <fct> <fct> <fct> <fct> <fct>  <fct>  
 1 100005 Northe… Disa… Very… Disa… Male  55-64  HS gra…
 2 100007 South   Appr… Very… Appr… Fema… 55-64  Coll+  
 3 100019 South   Appr… Very… Appr… Fema… 55-64  Some c…
 4 100020 Midwest Appr… Not … Appr… Fema… 65+    Some c…
 5 100021 Northe… Appr… Very… Appr… Fema… 65+    HS gra…
 6 100023 Midwest Don'… NA    Disa… Male  45-54  HS gra…
 7 100027 Northe… Appr… Very… Appr… Fema… 65+    Some c…
 8 100031 South   Disa… Very… Disa… Fema… 55-64  Coll+  
 9 100034 Midwest Disa… Very… Disa… Fema… 55-64  Some c…
10 100037 South   Disa… Very… Disa… Male  35-44  Coll+  
# … with 1,991 more rows, and 6 more variables:
#   racethn2 <fct>, party <fct>, partyln <fct>,
#   weight <dbl>, llweight <dbl>, cellweight <dbl>

For simplicity, let’s assume we want to weight our survey by the marginal distribution of age and the cross-classification of sex and education. (In practice, we use a number of additional variables and cross-classifications beyond these.) Let’s run some basic tables on these variables in the dec13_excerpt dataset:

> table(dec13_excerpt$sex)

  Male Female 
   968   1033
> table(dec13_excerpt$recage)

 18-24  25-34  35-44  45-54  55-64    65+ DK/Ref 
   172    216    255    362    406    550     40
> table(dec13_excerpt$receduc)

       HS grad or less Some coll/Assoc degree 
                   646                    571 
                 Coll+                 DK/Ref 
                   779                      5
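To make the idea of a "cross-classification of sex and education" concrete, here is a minimal base R sketch. It uses a small invented data frame (`toy`) rather than dec13_excerpt, and the column name `sex_educ` is just an illustrative choice. It shows how the two variables combine into a single factor, and how a DK/Ref response carries over into the combined categories.

```r
# Toy respondent-level data; values are invented for illustration only
toy <- data.frame(
  sex     = factor(c("Male", "Female", "Female", "Male", "Female", "Male")),
  receduc = factor(
    c("Coll+", "HS grad or less", "Coll+", "DK/Ref", "Coll+", "HS grad or less"),
    levels = c("HS grad or less", "Coll+", "DK/Ref")
  )
)

# Combine sex and education into one factor with levels such as "Male:Coll+"
toy$sex_educ <- interaction(toy$sex, toy$receduc, sep = ":")

# Unweighted counts and percentage shares for the combined variable
counts <- table(toy$sex_educ)
shares <- prop.table(counts)
print(counts)
print(round(100 * shares, 1))
```

Any category that includes DK/Ref has no counterpart in population benchmarks, which is one reason the weighting variables need cleaning before they can be matched to targets.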

Creating weighting targets from population data

Before doing anything with the survey data itself, we need to determine our weighting target parameters. That is, we need to know what the marginal distribution of age and the cross-classification of sex and education look like in the population that we’re trying to represent with the survey. We use external benchmark data to create weighting targets that reflect the population distribution for our chosen weighting variables. These targets are typically derived from population data published by the U.S. Census Bureau or other government agencies. For example, we can download public use microdata from the American Community Survey and use that data to obtain target distributions.

For this demonstration, we’ll use a condensed American Community Survey (ACS) dataset called acs_2017_excerpt. This is not the original ACS dataset (which can be found here), but a summary table created using the 2017 one-year Public Use Microdata Sample (PUMS). It has columns for sex, age and education variables that have been recoded into the categories that Pew Research Center typically uses in its survey weighting. It has a total of 36 rows, one for every combination of sex (two categories), age (six categories) and education (three categories). Each row is associated with a weight proportional to that row’s share of the non-institutionalized U.S. adult population:

> acs_2017_excerpt
# A tibble: 36 x 4
   sex   recage receduc                weight
   <fct> <fct>  <fct>                   <dbl>
 1 Male  18-24  HS grad or less         1.09 
 2 Male  18-24  Some coll/Assoc degree  0.955
 3 Male  18-24  Coll+                   0.208
 4 Male  25-34  HS grad or less         1.17 
 5 Male  25-34  Some coll/Assoc degree  0.986
 6 Male  25-34  Coll+                   1.04 
 7 Male  35-44  HS grad or less         1.12 
 8 Male  35-44  Some coll/Assoc degree  0.815
 9 Male  35-44  Coll+                   0.964
10 Male  45-54  HS grad or less         1.24 
# … with 26 more rows
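Because each row of a summary table like this carries a weight, the population share of any category can be recovered by summing weights within that category and dividing by the total. The base R sketch below illustrates this on a small invented table (`bench`) with the same shape as acs_2017_excerpt. The weights, the reduced set of categories, and the helper `weighted_pct()` are all assumptions made for illustration; this is not the pewmethods implementation.

```r
# Toy benchmark table shaped like acs_2017_excerpt; all weights are invented
bench <- expand.grid(
  sex     = c("Male", "Female"),
  recage  = c("18-34", "35-64", "65+"),
  receduc = c("HS grad or less", "Coll+"),
  stringsAsFactors = TRUE
)
bench$weight <- c(1.0, 1.1, 1.5, 1.6, 0.6, 0.8,
                  0.7, 0.9, 1.2, 1.3, 0.5, 0.8)

# Hypothetical helper: weighted share (in percent) of each category of x
weighted_pct <- function(x, w) {
  totals <- sapply(split(w, x), sum)
  100 * totals / sum(totals)
}

# Marginal distribution of age
age_target <- weighted_pct(bench$recage, bench$weight)

# Distribution of the sex-by-education cross-classification
sex_educ_target <- weighted_pct(
  interaction(bench$sex, bench$receduc, sep = ":"),
  bench$weight
)

print(round(age_target, 1))
print(round(sex_educ_target, 1))
```

Each result is a named vector of percentages that sums to 100, which is the general form a weighting target takes.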

When you begin this process from scratch, you’ll need to acquire the benchmark data, recode variables into your desired categories, and apply the appropriate weight that comes with the benchmark dataset. All of that work has already been done in this post…

We can use the function create_raking_targets() to create summaries of these demographic distributions from the benchmark dataset. “Raking” refers to a procedure in which the marginal distributions of a selected set of variables in the sample are iteratively adjusted to match target distributions.
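To give a feel for what "iteratively adjusted" means, here is a minimal base R sketch of raking on two variables. It is not the pewmethods implementation: the function `rake_sketch()`, the toy sample `samp` and the target proportions are all invented for illustration, and the sketch assumes every category in the targets appears at least once in the sample.

```r
# Toy sample: invented data, not the dec13_excerpt survey
samp <- data.frame(
  sex = factor(c("Male", "Male", "Male", "Female",
                 "Female", "Male", "Female", "Male")),
  age = factor(c("18-49", "50+", "50+", "18-49",
                 "50+", "50+", "50+", "18-49"))
)

# Hypothetical population targets, as proportions that sum to 1
targets <- list(
  sex = c(Female = 0.52, Male = 0.48),
  age = c("18-49" = 0.55, "50+" = 0.45)
)

rake_sketch <- function(data, targets, max_iter = 100, tol = 1e-8) {
  w <- rep(1, nrow(data))  # start with equal weights
  for (i in seq_len(max_iter)) {
    max_gap <- 0
    for (v in names(targets)) {
      # Current weighted share of each category of variable v
      current <- sapply(split(w, data[[v]]), sum) / sum(w)
      target  <- targets[[v]][names(current)]
      max_gap <- max(max_gap, abs(current - target))
      # Scale each respondent's weight by target share / current share
      adj <- target / current
      w <- w * unname(adj[as.character(data[[v]])])
    }
    # Stop once every margin was already within tolerance on this pass
    if (max_gap < tol) break
  }
  # Rescale so the weights average to 1
  w * nrow(data) / sum(w)
}

w <- rake_sketch(samp, targets)

# Weighted margins should now be close to the targets
print(round(sapply(split(w, samp$sex), sum) / sum(w), 3))
print(round(sapply(split(w, samp$age), sum) / sum(w), 3))
```

Adjusting one variable tends to knock the previous one slightly off target, which is why the loop cycles through the variables repeatedly until the margins settle.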

medium.com

Weighting survey data with the pewmethods R package