I want to add columns to a survey.design created with the survey package, which can be done as follows:
library(survey)
data(api)
dclus1 <- svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc)
dclus2 <- transform(dclus1,
api00_b = api00 + 1)
svymean(~ api00, design = dclus2)
#> mean SE
#> api00 644.17 23.542
svymean(~ api00_b, design = dclus2)
#> mean SE
#> api00_b 645.17 23.542
For a more complex task, I need to create these variable names dynamically from external vectors. The following produces an error, but I think it illustrates what I want to achieve:
vars <- c("api00_a", "api00_b")
dclus2 <- transform(dclus1,
vars[[2]] = api00 + 1)
How could dynamic names for the new columns be implemented?
I don't think you can use a vector like this on the left-hand side of the equal sign in R. You don't have to use transform, which calls survey:::update.survey.design, though. You could just add your new variable directly:
dclus2 <- dclus1
dclus2$variables[ ,vars[[1]]] <- dclus2$variables[,"api00"] + 1
This is the same as creating the new variable before converting to a survey.design object, as long as you do not use any survey functions for creation of the new variable. Just using Anthony's comment:
apiclus2 <- apiclus1
apiclus2[ , vars[[1]]] <- apiclus2[ , "api00" ] + 1
dclus_prep_2 <- svydesign(id = ~dnum, weights = ~pw, data = apiclus2, fpc = ~fpc)
You might prefer to use srvyr, which allows this kind of programming with dplyr-style !! and := operators:
library(srvyr)
dclus_srvyr_1 <- as_survey_design(.data = apiclus1,
ids = dnum,
weights = pw,
fpc = fpc)
dclus_srvyr_2 <- mutate(dclus_srvyr_1,
!!vars[[1]] := api00 + 1)
All versions have the same result:
lapply(list(dclus2, dclus_prep_2, dclus_srvyr_2),
function(design) svymean(~api00_a, design=design))
[[1]]
mean SE
api00_a 645.17 23.542
[[2]]
mean SE
api00_a 645.17 23.542
[[3]]
mean SE
api00_a 645.17 23.542
Here's a possible solution using purrr:
library(purrr)
vars <- c("api00_a", "api00_b")
transform_func <- function(data, var) {
  # build the `var = api00 + 1` argument dynamically and splice it into transform()
  do.call(transform, c(list(data), setNames(list(quote(api00 + 1)), var)))
}
map(vars, ~transform_func(dclus1, .))
Which gives us the following list:
[[1]]
1 - level Cluster Sampling design
With (15) clusters.
update(`_data`, ...)
[[2]]
1 - level Cluster Sampling design
With (15) clusters.
update(`_data`, ...)
You can do this with bquote. For example
vars <- c("api00_plus_1", "api00_plus_2")
exprs<-list(quote(api00+1),quote(api00+2))
names(exprs)<-vars
bquote(update(dclus1,..(exprs)), splice=TRUE)
eval(bquote(update(dclus1,..(exprs)), splice=TRUE))
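For reference, the first bquote() line builds the call that eval() then runs; it should look like this:
update(dclus1, api00_plus_1 = api00 + 1, api00_plus_2 = api00 + 2)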
Here's another chunk from inside the survey package that converts any string variables mentioned in a formula to factors:
strings_to_factors <- function(formula, design){
  # variables from the formula that are actually in the design
  allv <- intersect(all.vars(formula), colnames(design))
  vclass <- sapply(model.frame(design)[, allv, drop = FALSE], class)
  if (!any(vclass == "character")) return(design)
  # character variables that need fixing
  vfix <- names(vclass)[vclass == "character"]
  # build a named list of factor(<variable>) expressions
  l <- as.list(vfix)
  names(l) <- vfix
  fl <- lapply(l, function(li) bquote(factor(.(as.name(li)))))
  # splice them into an update() call and evaluate it
  expr <- bquote(update(design, ..(fl)), splice = TRUE)
  eval(expr)
}
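As a hypothetical illustration (stype_chr is a made-up variable name, not part of the api data), you could exercise it like this:
# make a character copy of an existing factor variable
dclus_chr <- dclus1
dclus_chr$variables$stype_chr <- as.character(dclus_chr$variables$stype)
# any character variables named in the formula come back as factors
dclus_fct <- strings_to_factors(~api00 + stype_chr, dclus_chr)
class(dclus_fct$variables$stype_chr)  # "factor"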
I am trying to develop a function which will calculate the means, standard error, and confidence intervals of some survey data. I need to do this repeatedly over a number of different variables with a bunch of different filter statements.
DATA
df <- data.frame(address_id = rep(c(1,1,1,2,2,2,3,3,3,4,4,4),5),
person_id = rep(c(1,2,3),20),
sex = as.factor(rep(c("male","female"),30)),
response_var = as.factor(rep(seq(1,6,1))),
weight = runif(60, 50, 200))
Example that works without function
# create survey design
design <- survey::svydesign(data = df,
                            strata = ~ address_id,
                            id = ~ person_id,
                            nest = TRUE,
                            weights = ~ weight)
# calculate weighted mean
mean_se <- survey::svymean(~sex, design)
# calculate confidence intervals
ci <- confint(mean_se)
My function
create_mean_and_cis <- function(data, var){
  design <- survey::svydesign(data = data,
                              strata = ~ address_id,
                              id = ~ person_id,
                              nest = TRUE,
                              weights = ~ weight)
  mean_se <- survey::svymean(~{{var}}, design)
  ci <- confint(mean_se) %>%
    tibble::as_tibble() %>%
    tibble::rownames_to_column("variable")
  output <- mean_se %>%
    tibble::as_tibble() %>%
    tibble::rownames_to_column("variable") %>%
    dplyr::left_join(ci)
  return(output)
}
# function call
create_mean_and_cis(sex)
When I run it, I get an error message saying:
Error in survey::svydesign(data = data, strata = ~address_id, id = ~person_id, :
object 'sex' not found
I can't understand what is going wrong. The tidy evaluation works perfectly when I use the curly-curly "{{var}}" within other functions. Why doesn't it work here? Can anyone help?
I have tried several variations of quasiquotation, including !!enquo(sex), sym(sex), !!sym(sex), {{sex}}, and eval(parse(sex)), none of which have yielded working results.
The actual error you are getting is because you aren't passing the data argument to create_mean_and_cis (you are doing create_mean_and_cis(sex), whereas it should be create_mean_and_cis(df, sex)).
However, this will not on its own fix the problem, since the curly-curly operator won't work inside a formula. Instead, you need to do something like:
create_mean_and_cis <- function(data, var){
  var <- deparse(substitute(var))
  design <- survey::svydesign(data = data,
                              strata = ~ address_id,
                              id = ~ person_id,
                              nest = TRUE,
                              weights = ~ weight)
  mean_se <- survey::svymean(as.formula(paste('~', var)), design)
  ci <- confint(mean_se) %>%
    tibble::as_tibble() %>%
    tibble::rownames_to_column("variable")
  output <- mean_se %>%
    tibble::as_tibble() %>%
    tibble::rownames_to_column("variable") %>%
    dplyr::left_join(ci)
  return(output)
}
Which allows
create_mean_and_cis(df, sex)
#Joining, by = "variable"
## A tibble: 2 x 5
# variable mean SE `2.5 %` `97.5 %`
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 1 0.481 0.163 0.161 0.802
# 2 2 0.519 0.163 0.198 0.839
If you want to use more advanced tidy evaluation features, you can use rlang::eval_tidy:
mean_se <- rlang::eval_tidy(rlang::quo(survey::svymean(~{{var}}, design)))
You wrap your expression in a quosure, inside which you can use the {{ }} and !! style syntax; both quo and eval_tidy recognize it. These features are unique to functions that use the rlang package; you cannot use them directly in any arbitrary R function without some sort of rlang wrapper.
Or, per @Lionel's suggestion, you can use rlang::inject to build the formula and pass that to svymean:
formula <- rlang::inject(~!!substitute(var))
mean_se <- survey::svymean(formula, design)
I want to run logistic regression for multiple parameters and store the different metrics, i.e. the AUC.
I wrote the function below, but I get an error when I call it: Error in eval(predvars, data, env) : object 'X0' not found, even though the variable exists in both my training and testing datasets. Any idea?
new.function <- function(a) {
  model <- glm(extry ~ a, family = binomial("logit"), data = train_df)
  pred.prob <- predict(model, test_df, type = 'response')
  predictFull <- prediction(pred.prob, test_df$extry)
  auc_ROCR <- performance(predictFull, measure = "auc")
  my_list <- list("AUC" = auc_ROCR)
  return(my_list)
}
# Call the function new.function, supplying X0 as an argument.
les <- new.function(X0)
The main reason your function didn't work is that you are trying to insert an object into a formula. You can fix it by pasting the variable name into a formula, but that is ultimately quite limiting.
I suggest instead that you consider using update. This gives you more flexibility to work with multiple variable combinations, or to change the training dataset, without breaking the function.
# fit a base model once (intercept only), then swap in the predictor of interest
model <- glm(extry ~ 1, family = binomial("logit"), data = train_df)
new.model <- update(model, . ~ X0)
new.function <- function(model){
  pred.prob <- predict(model, test_df, type = 'response')
  predictFull <- prediction(pred.prob, test_df$extry)
  auc_ROCR <- performance(predictFull, measure = "auc")
  my_list <- list("AUC" = auc_ROCR)
  return(my_list)
}
les <- new.function(new.model)
The function can be further improved by taking test_df as a separate argument, so that you can evaluate the model on an alternative testing dataset.
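For example, a sketch along those lines (it assumes the ROCR package provides prediction() and performance(), as in the original code):
new.function <- function(model, test_df){
  pred.prob <- predict(model, test_df, type = 'response')
  predictFull <- ROCR::prediction(pred.prob, test_df$extry)
  auc_ROCR <- ROCR::performance(predictFull, measure = "auc")
  list("AUC" = auc_ROCR)
}
les <- new.function(new.model, test_df)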
To run the function in the way you intended, you would need to use non-standard evaluation to capture the symbol and insert it in a formula. This can be done using match.call and as.formula. Here's a fully reproducible example using dummy data:
new.function <- function(a) {
  # Convert symbol to character
  a <- as.character(match.call()$a)
  # Build formula from character strings
  form <- as.formula(paste("extry", a, sep = "~"))
  model <- glm(form, family = binomial("logit"), data = train_df)
  pred.prob <- predict(model, test_df, type = 'response')
  predictFull <- ROCR::prediction(pred.prob, test_df$extry)
  auc_ROCR <- ROCR::performance(predictFull, "auc")
  list("AUC" = auc_ROCR)
}
Now we can call the function in the way you intended:
new.function(X0)
#> $AUC
#> A performance instance
#> 'Area under the ROC curve'
new.function(X1)
#> $AUC
#> A performance instance
#> 'Area under the ROC curve'
If you want to see the actual area under the curve you would need to do:
new.function(X0)$AUC@y.values[[1]]
#> [1] 0.6599759
So you may wish to modify your function so that the list contains auc_ROCR@y.values[[1]] rather than auc_ROCR.
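For example, the list construction near the end of the function would become:
my_list <- list("AUC" = auc_ROCR@y.values[[1]])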
Data used
set.seed(1)
train_df <- data.frame(X0 = sample(100), X1 = sample(100))
train_df$extry <- rbinom(100, 1, (train_df$X0 + train_df$X1)/200)
test_df <- data.frame(X0 = sample(100), X1 = sample(100))
test_df$extry <- rbinom(100, 1, (test_df$X0 + test_df$X1)/200)
Created on 2022-06-29 by the reprex package (v2.0.1)
I'm trying to perform a Wald test on an interaction term in a survey-weighted model with imputed data - see below for a toy reprex.
When I run the last three lines of code with regTermTest, using syntax modeled directly off the function documentation, I get the following error: Error in terms.default(model) : no terms component nor attribute.
A quick Google search seems to suggest this error means I'm passing an unsupported object type to the function; I read this as perhaps regTermTest does not support passing an MIResult to model. However, it seems that as of version 3.6 of the survey package, regTermTest supports MIResult models? This page seems to suggest so as well.
Appreciate any guidance on what I'm doing wrong. Alternatively, I would be happy if someone knows how to get p-values for individual model terms from an MIResult object (e.g., as shown in this post for a regular model object).
# load packages
library(tidyverse)
library(survey)
library(mi)
library(mitools)
# load data on school performance included in survey package
# documentation available here: https://r-survey.r-forge.r-project.org/survey/html/api.html
data(api)
# remove problematic variables that are unnecessary for this example
apisub <- apiclus1 %>% select(-c("name", "sname", "dname", "cname", "flag",
"acs.46", "acs.core"))
# create and update variable types in missing_data.frame
mdf <- missing_data.frame(apisub)
mdf <- change(mdf, "cds", what = "type", to = "irrelevant")
mdf <- change(mdf, "stype", what = "type", to = "irrelevant")
mdf <- change(mdf, "snum", what = "type", to = "irrelevant")
mdf <- change(mdf, "dnum", what = "type", to = "irrelevant")
mdf <- change(mdf, "cnum", what = "type", to = "irrelevant")
mdf <- change(mdf, "fpc", what = "type", to = "irrelevant")
mdf <- change(mdf, "pw", what = "type", to = "irrelevant")
# summarize the missing_data.frame
show(mdf)
# impute missing data
imputations <- mi(mdf)
# create imputation list to pass to svydesign
imp_list <- complete(imputations, m = 5)
# create complex survey design using imputed data
dsn <- svydesign(id = ~dnum,
weights = ~pw,
data = imputationList(imp_list),
fpc = ~fpc)
# subset the survey design to remove schools that did not meet both targets
# just as an example of subsetting
dsn_sub <- subset(dsn, both == "No")
# specify analytic model
anl <- with(dsn_sub,
svyglm(api99 ~ enroll + meals + avg.ed*ell,
family = gaussian(),
design = dsn
)
)
# combine results into a single output
res <- MIcombine(anl)
# perform wald test for main and ixn terms
regTermTest(res, ~meals)
regTermTest(res, ~avg.ed:ell)
regTermTest(res, ~avg.ed*ell)
I didn't figure out how to do this with regTermTest, but I learned that you can run the following to perform a global test for the interaction instead:
library(aod)
aod::wald.test(Sigma = vcov(res),
               b = coef(res),
               Terms = 6)
Wald test:
----------
Chi-squared test:
X2 = 0.031, df = 1, P(> X2) = 0.86
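Note that Terms = 6 assumes the interaction coefficient is the sixth element of coef(res); one way to avoid hard-coding the position is to look it up by name, for example:
# find the position of the interaction coefficient by name
term_idx <- grep("avg.ed:ell", names(coef(res)), fixed = TRUE)
aod::wald.test(Sigma = vcov(res), b = coef(res), Terms = term_idx)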
I am trying to run a multinomial regression with imputed data. I can do this with the nnet package, but I want to use mlogit. Using the mlogit package I keep getting the following error: "Error in 1:nrow(data) : argument of length 0".
So making the data
library(mlogit)
library(nnet)
library(tidyverse)
library(mice)
df <- data.frame(vax = sample(1:6, 500, replace = T),
age = runif(500, 12, 18),
var1 = sample(1:2, 500, replace = T),
var2 = sample(1:5, 500, replace = T))
# Create missing data using the mice package:
df2 <- ampute(df, prop = 0.15)
df3 <- df2$amp
df3$vax <- as.factor(df3$vax)
df3$var1 <- as.factor(df3$var1)
df3$var2 <- as.factor(df3$var2)
# Impute missing data:
df4 <- mice(df3, m = 5, print = T, seed = 123)
It works using nnet's multinom:
multinomtest <- with(df4, multinom(vax ~ age + var1 + var2, data = df, model = T))
summary(pool(multinomtest))
But it throws an error when I try to reshape the data into mlogit format:
test <- with(df4, dfidx(data = df4, choice = "vax", shape = "wide"))
Does anyone have any idea how I can get the imputed data into mlogit format, or even whether mlogit has compatibility with mice or any other imputation package?
Answer
You are using with.mids incorrectly, and thus both lines of code are wrong; the multinom line just doesn't give an error. If you want to apply multiple functions to the imputed datasets, you're better off using something like lapply:
analyses <- lapply(seq_len(df4$m), function(i) {
  data.i <- complete(df4, i)
  data.idx <- dfidx(data = data.i, choice = "vax", shape = "wide")
  mlogit(vax ~ 1 | age + var1 + var2,
         data = data.idx,
         reflevel = "1",
         nests = list(type1 = c("1", "2"), type2 = c("3", "4"), type3 = c("5", "6")))
})
test <- list(call = "", call1 = df4$call, nmis = df4$nmis, analyses = analyses)
oldClass(test) <- c("mira", "matrix")
summary(pool(test))
How with.mids works
When you apply with to a mids object (i.e. the output of mice::mice), you are actually calling with.mids.
If you use getAnywhere(with.mids) (or just type mice:::with.mids), you'll find that it does a couple of things:
It loops over all imputed datasets.
It uses complete to get one dataset.
It runs the expression with the dataset as the environment.
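In code, the core of with.mids looks roughly like this (a simplified sketch, not the exact mice source):
with_mids_sketch <- function(data, expr) {
  analyses <- vector("list", data$m)
  for (i in seq_len(data$m)) {                  # 1. loop over the imputations
    data.i <- mice::complete(data, i)           # 2. extract one completed dataset
    analyses[[i]] <- eval(substitute(expr),     # 3. evaluate the expression with
                          envir = data.i,       #    that dataset as the environment
                          enclos = parent.frame())
  }
  analyses
}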
The third step is the problem. For functions that use formulas (like lm, glm and multinom), you can use that formula within a given environment. If the variables are not in the current environment (but rather in e.g. a data frame), you can specify a new environment by setting the data variable.
The problems
This is where both your problems derive from:
In your multinom call, you set the data variable to be df. Hence, you are actually running your multinom on the original df, NOT the imputed dataset!
In your dfidx call, you are again filling in data directly. This is also wrong. However, leaving it empty also gives an error. This is because with.mids doesn't fill in the data argument, but only the environment. That isn't sufficient for you.
Fixing multinom
The solution for your multinom line is simple: just don't specify data:
multinomtest <- with(df4, multinom(vax ~ age + var1 + var2, model = T))
summary(pool(multinomtest))
As you will see, this will yield very different results! But it is important to realise that this is what you are trying to obtain.
Fixing dfidx (and mlogit)
We cannot do this with with.mids, since it uses the imputed dataset as the environment, but you want to use the modified dataset (after dfidx) as your environment. So, we have to write our own code. You could just do this with any looping function, e.g. lapply:
analyses <- lapply(seq_len(df4$m), function(i) {
data.i <- complete(df4, i)
data.idx <- dfidx(data = data.i, choice = "vax", shape = "wide")
mlogit(vax ~ 1 | age + var1 + var2, data = data.idx, reflevel = "1", nests = list(type1 = c("1", "2"), type2 = c("3","4"), type3 = c("5","6")))
})
From there, all we have to do is make something that looks like a mira object, so that we can still use pool:
test <- list(call = "", call1 = df4$call, nmis = df4$nmis, analyses = analyses)
oldClass(test) <- c("mira", "matrix")
summary(pool(test))
Offering this as a way forward to circumvent the error with dfidx():
df5 <- df4$imp %>%
# work with a list, where each top-element is a different imputation run (imp_n)
map(~as.list(.x)) %>%
transpose %>%
# for each run, impute and return the full (imputed) data set
map(function(imp_n.x) {
df_out <- df4$data
df_out$vax[is.na(df_out$vax)] <- imp_n.x$vax
df_out$age[is.na(df_out$age)] <- imp_n.x$age
df_out$var1[is.na(df_out$var1)] <- imp_n.x$var1
df_out$var2[is.na(df_out$var2)] <- imp_n.x$var2
return(df_out)
}) %>%
# No errors with dfidx() now
map(function(imp_n.x) {
dfidx(data = imp_n.x, choice = "vax", shape = "wide")
})
However, I'm not too familiar with mlogit(), so can't help beyond this.
Update 8/2/21
As @slamballais mentioned in their answer, the issue is with the dataset you refer to when fitting the model. I assume that mldata (from your code in the comments section) is a data.frame? This is probably why you are seeing the same coefficients: you are not referring to the imputed data sets (which I've identified as imp_n.x in the functions). The function purrr::map() is very similar to lapply(), where you apply a function to elements of a list. So to get the code working properly, you would want to change mldata to imp_n.x:
# To fit mlogit() for each imputed data set
df5 %>%
map(function(imp_n.x) {
# form as specified in the comments
mlogit(vax ~ 1 | age + var1 + var2,
data = imp_n.x,
reflevel = "1",
nests = list(type1 = c('1', '2'),
type2 = c('3','4'),
type3 = c('5','6')))
})
I am trying to use the lm.cluster function in the package miceadds to get robust clustered standard errors for a multiply imputed dataset.
I am able to get the standard version of it to run but I get the following error when I try to add a subset or weights:
Error in eval(substitute(subset), data, env) :
..1 used in an incorrect context, no ... to look in
Example that works without subset or weights:
require("mice")
require("miceadds")
data(data.ma01)
# imputation of the dataset: use six imputations
dat <- data.ma01[ , - c(1:2) ]
imp <- mice::mice( dat , maxit=3 , m=6 )
datlist <- miceadds::mids2datlist( imp )
# linear regression with cluster robust standard errors
mod <- lapply(datlist, FUN = function(data){miceadds::lm.cluster( data=data ,
formula=read ~ paredu+ female , cluster = data.ma01$idschool )} )
# extract parameters and covariance matrix
betas <- lapply( mod , FUN = function(rr){ coef(rr) } )
vars <- lapply( mod , FUN = function(rr){ vcov(rr) } )
# conduct statistical inference
summary(pool_mi( qhat = betas, u = vars ))
Example that breaks with subset:
mod <- lapply(datlist, FUN = function(data){miceadds::lm.cluster( data=data ,
formula=read ~ paredu+ female , cluster = data.ma01$idschool, subset=
(data.ma01$urban==1))} )
Error during wrapup: ..1 used in an incorrect context, no ... to look in
Example that breaks with weights:
mod <- lapply(datlist, FUN = function(data){miceadds::lm.cluster( data=data ,
formula=read ~ paredu+ female , cluster = data.ma01$idschool,
weights=data.ma01$studwgt)} )
Error during wrapup: ..1 used in an incorrect context, no ... to look in
From searching, I think I am encountering issues similar to others who pass these arguments through an lm or glm wrapper (such as "Passing Argument to lm in R within Function", "R: Pass argument to glm inside an R function", or "Passing the weights argument to a regression function inside an R function").
However, I am not sure how to address the issue with the imputed datasets and the existing lm.cluster command.
Thanks
This works fine with the estimatr::lm_robust() function from the estimatr package, which is on CRAN. Two notes: (1) you can change the type of standard errors using the se_type argument, and (2) I keep idschool in the data because we like the clusters to be in the same data.frame that we fit the model on.
library(mice)
library(miceadds)
library(estimatr)
# imputation of the dataset: use six imputations
data(data.ma01)
dat <- data.ma01[, -c(1)] # note I keep idschool in data
imp <- mice::mice( dat , maxit = 3, m = 6)
datlist <- miceadds::mids2datlist(imp)
# linear regression with cluster robust standard errors
mod <- lapply(
datlist,
function (dat) {
estimatr::lm_robust(read ~ paredu + female, dat, clusters = idschool)
}
)
# subset
mod <- lapply(
datlist,
function (dat) {
estimatr::lm_robust(read ~ paredu + female, dat, clusters = idschool, subset = urban == 1)
}
)
# weights
mod <- lapply(
datlist,
function (dat) {
estimatr::lm_robust(read ~ paredu + female, dat, clusters = idschool, weights = studwgt)
}
)
# note that you can use the `se_type` argument of lm_robust()
# to change the vcov estimation
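# For example (assumption: "stata" is one of the accepted se_type values
# for clustered standard errors; check ?estimatr::lm_robust):
mod <- lapply(
  datlist,
  function (dat) {
    estimatr::lm_robust(read ~ paredu + female, dat, clusters = idschool, se_type = "stata")
  }
)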
# extract parameters and covariance matrix
betas <- lapply(mod, coef)
vars <- lapply(mod, vcov)
# conduct statistical inference
summary(pool_mi( qhat = betas, u = vars ))
I'm no expert, but there is an issue with how the weights are passed to lm(). I know this is not an ideal situation, but I managed to get it to work by modifying the lm.cluster() function to hard-code the weights argument, and then just used my own version:
lm.cluster <- function (data, formula, cluster, wgts = NULL, ...)
{
  TAM::require_namespace_msg("multiwayvcov")
  if (is.null(wgts)) {
    mod <- stats::lm(data = data, formula = formula)
  } else {
    data$.weights <- wgts
    mod <- stats::lm(data = data, formula = formula, weights = data$.weights)
  }
  if (length(cluster) > 1) {
    v1 <- cluster
  } else {
    v1 <- data[, cluster]
  }
  dfr <- data.frame(cluster = v1)
  vcov2 <- multiwayvcov::cluster.vcov(model = mod, cluster = dfr)
  res <- list(lm_res = mod, vcov = vcov2)
  class(res) <- "lm.cluster"
  return(res)
}
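A hypothetical call with this modified function, reusing the objects from the question, would then be:
mod <- lapply(datlist, FUN = function(data) {
  lm.cluster(data = data, formula = read ~ paredu + female,
             cluster = data.ma01$idschool, wgts = data.ma01$studwgt)
})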