Prepare data for modeling

This function shapes data for use in a dgirt or dgmrp model. Most arguments give the name or names of key variables in the data. These arguments end in _name or _names and should be character vectors.

shape(item_data = NULL, item_names = NULL, time_name, geo_name,
  group_names = NULL, id_vars = NULL, time_filter = NULL,
  geo_filter = NULL, min_t_filter = 1L, min_survey_filter = 1L,
  survey_name = NULL, modifier_data = NULL, modifier_names = NULL,
  t1_modifier_names = NULL, standardize = TRUE, target_data = NULL,
  raking = NULL, max_raked_weight = NULL, weight_name = NULL,
  proportion_name = "proportion", aggregate_data = NULL,
  aggregate_item_names = NULL, constant_item = TRUE, ...)

Arguments

item_data	A table in which items appear in columns and each row represents an individual's responses in some time period and local geographic area.
item_names	Item response variables.
time_name	A time variable with numeric values.
geo_name	A geographic variable representing local areas.
group_names	Discrete grouping variables, usually demographic. Using numeric variables is allowed but not recommended.
id_vars	Additional variables that should be included in the result, other than those specified elsewhere.
time_filter	A numeric vector giving possible values of the time variable. Observed and unobserved time periods can be given. Defaults to observed values.
geo_filter	A character vector giving values of the geographic variable. Defaults to observed values.
min_t_filter	The minimum number of time periods in which an item must appear to be included, given as an integer.
min_survey_filter	The minimum number of surveys in which an item must appear to be included, given as an integer.
survey_name	A survey identifier.
modifier_data	Table giving characteristics of local geographic areas in time periods. See details below.
modifier_names	Variables giving modifiers of geographic hierarchical parameters in `modifier_data`.
t1_modifier_names	Variables to be used instead of those in `modifier_names`, only in the first period.
standardize	Whether to standardize the variables given by `modifier_names` and `t1_modifier_names` to be zero-mean and unit-variance for performance gains. (For discussion see the Stan Language Reference section "Standardizing Predictors and Outputs.")
target_data	A table giving population proportions for groups by local geographic area and time period. See details below.
raking	A formula or list of formulas specifying the variables on which to rake survey weights.
max_raked_weight	A maximum over which raked weights will be trimmed. Only applied after raking. To trim unraked weights, manipulate the input data directly.
weight_name	A variable giving survey weights.
proportion_name	The variable giving population proportions for strata in `target_data`.
aggregate_data	A table of trial and success counts by group and item. See details below.
aggregate_item_names	A subset of values of the `item` variable in `aggregate_data`, for restricting the aggregate data.
constant_item	Whether item difficulty parameters should be constant over time.
...	Further arguments.

Value

An object of class dgirtIn, as expected by dgirt() and dgmrp().

Item Response Data

Individual-level data giving item responses is expected as argument item_data. Required arguments time_name and geo_name give the names of variables in item_data that indicate time period and local geographic area. Optional argument group_names gives other respondent characteristics to be modeled. item_data is optional if argument aggregate_data is used. Note that the dgirt() model assumes consistent coding of the polarity of item responses for identification.
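As an illustration of the expected layout and of consistent polarity, the following is a minimal base R sketch. The table and the item gun_rights are hypothetical and are not part of any dataset shipped with the package; only the column roles (time, geography, group, items) follow the description above.

```r
# Hypothetical toy data: one row per respondent, items in columns.
item_data <- data.frame(
  year  = c(2006, 2006, 2007, 2007),
  state = c("CA", "GA", "CA", "GA"),
  race3 = c("white", "black", "other", "white"),
  # Coded so that 1 is the liberal response.
  abortion = c(1, 0, 1, NA),
  # Hypothetical item coded in the opposite direction: 1 is the
  # conservative response.
  gun_rights = c(0, 1, NA, 1),
  stringsAsFactors = FALSE
)

# Recode so that 1 has the same polarity for every item.
# Missing responses (NA) stay missing.
item_data$gun_rights <- 1 - item_data$gun_rights
```

After recoding, both item columns could be named together in item_names, with "year" as time_name, "state" as geo_name, and "race3" in group_names.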

Modifier Data

Data for modeling geographic hierarchical parameters can be given with argument modifier_data, in which case argument modifier_names is required and arguments t1_modifier_names and standardize are optional.
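A table suitable for modifier_data has one row per local geographic area and time period. The sketch below builds a hypothetical table in base R and shows what zero-mean, unit-variance standardization means for one column. The variable prop_evangelicals and its values are invented for illustration, and the use of scale() is an assumption about how such standardization can be done, not a description of what shape() does internally.

```r
# Hypothetical area-by-period characteristics.
modifier_data <- data.frame(
  year  = c(2006, 2006, 2007, 2007),
  state = c("CA", "GA", "CA", "GA"),
  prop_evangelicals = c(0.10, 0.30, 0.12, 0.28),
  stringsAsFactors = FALSE
)

# Center on the mean and divide by the standard deviation.
standardized <- as.numeric(scale(modifier_data$prop_evangelicals))
```

With a table like this, modifier_names would be "prop_evangelicals", and time_name and geo_name would name the year and state columns.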

Aggregate Item Response Data

shape() aggregates the individual-level item response data given as item_data for modeling. Data already aggregated to the group level can be provided with argument aggregate_data.

The data given by aggregate_data must be in a long table of trial and success counts indexed by item, group, and time period. The variable names given by arguments group_names, geo_name, and time_name should exist in aggregate_data. Three fixed variable names must also appear in aggregate_data: item giving item identifiers, n_grp giving counts of item-response trials, and s_grp giving counts of item-response successes. These counts should be adjusted consistently with the transformations applied during the aggregation by shape() of the individual item_data.
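The required layout can be sketched in base R by counting trials and successes in a hypothetical individual-level table. This produces raw, unweighted counts only. It does not reproduce any adjustment that shape() applies when it aggregates item_data, so it shows the table structure rather than a substitute for that aggregation.

```r
# Hypothetical individual-level responses (1 = success, NA = not asked).
responses <- data.frame(
  year     = c(2006, 2006, 2006, 2007, 2007),
  state    = c("CA", "CA", "GA", "CA", "CA"),
  race3    = c("white", "white", "black", "white", "white"),
  abortion = c(1, 0, 1, NA, 1),
  stringsAsFactors = FALSE
)

# Keep rows with an observed response, then count trials and successes
# within each group-by-geography-by-time cell.
observed <- responses[!is.na(responses$abortion), ]
keys <- observed[c("year", "state", "race3")]
n_grp <- aggregate(list(n_grp = observed$abortion), keys, length)
s_grp <- aggregate(list(s_grp = observed$abortion), keys, sum)

# Long table with the three fixed names: item, n_grp, s_grp.
aggregate_data <- merge(n_grp, s_grp, by = c("year", "state", "race3"))
aggregate_data$item <- "abortion"
```

Each row of the result describes one item in one group, area, and period, with s_grp never exceeding n_grp.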

Reweighting

Use argument target_data to adjust the weighting of groups toward population targets via raking, using an adaptation of rake() from the survey package. To adjust existing survey weights in item_data, provide argument weight_name. Otherwise, observations in item_data will be assigned equal starting weights. Argument raking defines strata. If you pass it a list of formulas like list(~ x, ~ y), raking is first over x, then over y. Given an additive formula like ~ x + y, raking is over the combinations of x and y. So, list(~ x, ~ y + z) rakes first over x, then over the combinations of y and z. Argument proportion_name is optional and defaults to "proportion".
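To make the idea of raking over successive margins concrete, here is a minimal base R sketch of iterative proportional fitting over two variables, corresponding loosely to list(~ x, ~ y). The data, the targets, and the helper functions rake_margin() and weighted_share() are hypothetical. This is not the package's implementation and it omits weight trimming.

```r
# Hypothetical sample with equal starting weights.
sample_data <- data.frame(
  x = c("a", "a", "a", "b", "b"),
  y = c("u", "v", "v", "u", "v"),
  weight = rep(1, 5),
  stringsAsFactors = FALSE
)

# Hypothetical population proportions for each raking margin.
targets <- list(
  x = c(a = 0.5, b = 0.5),
  y = c(u = 0.4, v = 0.6)
)

# Weighted share of each level of `var`.
weighted_share <- function(data, var) {
  totals <- vapply(split(data$weight, data[[var]]), sum, numeric(1))
  totals / sum(data$weight)
}

# Scale weights so the weighted shares of `var` match `target`.
rake_margin <- function(data, var, target) {
  share <- weighted_share(data, var)
  ratio <- target[names(share)] / share
  data$weight <- data$weight * unname(ratio[data[[var]]])
  data
}

# Rake first over x, then over y, repeating until the weights settle.
for (iteration in 1:100) {
  before <- sample_data$weight
  for (var in names(targets)) {
    sample_data <- rake_margin(sample_data, var, targets[[var]])
  }
  if (max(abs(sample_data$weight - before)) < 1e-10) break
}
```

Raking over an additive formula such as ~ y + z would, in this sketch, amount to treating interaction(y, z) as a single margin with targets for each combination.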

Restrictions

For convenience, data in item_data, modifier_data, aggregate_data, and target_data can be restricted (subsetted) row-wise to the time periods given by argument time_filter and the local geographic areas given by argument geo_filter.

Data can also be filtered column-wise to retain item variables that appear in a minimum of time periods, using argument min_t_filter, or a minimum of surveys, with argument min_survey_filter. Argument survey_name is required when filtering by survey.

If both row-wise and column-wise restrictions are specified, shape() applies them repeatedly, in passes, until a pass leaves the data unchanged.
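The reason for repeated passes is that dropping rows can change which items meet a minimum, and dropping items can leave rows with no responses. The base R sketch below illustrates that fixed-point loop on hypothetical data. The variable names (time_filter, min_periods) mirror the arguments described above, but the code is an assumption-laden illustration, not the logic shape() uses.

```r
# Hypothetical wide responses: two items, some unobserved (NA).
responses <- data.frame(
  year  = c(2006, 2007, 2008, 2008),
  item1 = c(1, 0, 1, NA),
  item2 = c(1, NA, 1, 0)
)
item_names  <- c("item1", "item2")
time_filter <- c(2007, 2008)  # row-wise restriction
min_periods <- 2              # plays the role of min_t_filter

pass <- 0
repeat {
  pass <- pass + 1
  before <- list(nrow(responses), item_names)

  # Row-wise: keep only the requested time periods.
  responses <- responses[responses$year %in% time_filter, , drop = FALSE]

  # Column-wise: keep items observed in at least `min_periods` periods.
  n_periods <- vapply(item_names, function(item) {
    length(unique(responses$year[!is.na(responses[[item]])]))
  }, integer(1))
  item_names <- item_names[n_periods >= min_periods]

  # Drop rows left without a response to any remaining item.
  has_response <- rowSums(!is.na(responses[item_names])) > 0
  responses <- responses[has_response, , drop = FALSE]

  # Stop once a full pass changes nothing.
  if (identical(before, list(nrow(responses), item_names))) break
}
```

Here the first pass removes the 2006 row, which leaves item2 observed in only one period, so item2 is dropped and the row that answered only item2 goes with it. The second pass makes no changes and the loop ends.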

Examples

# model individual item responses
shaped_responses <- shape(opinion, item_names = "abortion",
  time_name = "year", geo_name = "state", group_names = "race3")
#> Applying restrictions, pass 1...
#> 	Dropped 5 rows for missingness in covariates
#> 	Dropped 3743 rows for lacking item responses
#> Applying restrictions, pass 2...
#> 	No changes

# summarize result
summary(shaped_responses)
#> Items:
#> [1] "abortion"
#> Respondents:
#>    144,250 in `item_data`
#> Grouping variables:
#> [1] "year"  "state" "race3"
#> Time periods:
#> [1] 2006 2007 2008 2009 2010
#> Local geographic areas:
#>  [1] "AK" "AL" "AR" "AZ" "CA" "CO" "CT" "DC" "DE" "FL" "GA" "HI" "IA" "ID" "IL"
#> [16] "IN" "KS" "KY" "LA" "MA" "MD" "ME" "MI" "MN" "MO" "MS" "MT" "NC" "ND" "NE"
#> [31] "NH" "NJ" "NM" "NV" "NY" "OH" "OK" "OR" "PA" "RI" "SC" "SD" "TN" "TX" "UT"
#> [46] "VA" "VT" "WA" "WI" "WV" "WY"
#> Hierarchical parameters:
#>  [1] "AL"         "AR"         "AZ"         "CA"         "CO"        
#>  [6] "CT"         "DC"         "DE"         "FL"         "GA"        
#> [11] "HI"         "IA"         "ID"         "IL"         "IN"        
#> [16] "KS"         "KY"         "LA"         "MA"         "MD"        
#> [21] "ME"         "MI"         "MN"         "MO"         "MS"        
#> [26] "MT"         "NC"         "ND"         "NE"         "NH"        
#> [31] "NJ"         "NM"         "NV"         "NY"         "OH"        
#> [36] "OK"         "OR"         "PA"         "RI"         "SC"        
#> [41] "SD"         "TN"         "TX"         "UT"         "VA"        
#> [46] "VT"         "WA"         "WI"         "WV"         "WY"        
#> [51] "race3other" "race3white"
#> Modifiers of hierarchical parameters:
#> NULL
#> Constants:
#>   Q   T   P   N   G   H   D 
#>   1   5  52 765 153   1   1 

# check sparseness of data to be modeled
get_item_n(shaped_responses, by = "year")
#>    year abortion
#> 1: 2006    33514
#> 2: 2007     9258
#> 3: 2008    32634
#> 4: 2009    13718
#> 5: 2010    55126