How to use a for loop for the svyttest function in the survey package? - r

I am trying to use the svyttest function in a for loop in the survey package. I want to test for differences in proportions of responses between subpopulations in likert-scale type data. For example, in a survey question (1=strongly disagree, 5 = strongly agree), are there statistically significant differences in the proportion of "strongly disagree" responses between Groups 1 and 2?
I understand that I can also use the svyglm function from the survey package, but I have been unable to successfully use that in a for loop.
I also understand that there is a wtd.t.test in the weights package and that the glm function in the stats package has a weights argument, but neither of these two options gets the correct results. I need to use either the svyttest or the svyglm function in the survey package.
For reference, I have been looking here and here for some help, but have been unable to adapt these examples to my problem.
Thank you for your time and effort.
# create example survey data
ids <- 1:1000
stratas <- rep(c("strata1", "strata2","strata3","strata4"), each=250)
weight <- rep(c(5,2,1,1), each=250)
group <- rep(c(1,2), times=500)
q1 <- sample(1:5, 1000, replace = TRUE)
survey_data <- data.frame(ids, stratas, weight, group, q1)
# create example svydesign
library(survey)
survey_design <- svydesign(ids = ~0,
probs = NULL,
strata = survey_data$stratas,
weights = survey_data$weight,
data = survey_data)
# look at the proportions of q1 responses by group
prop.table(svytable(~q1+group, design = survey_design), margin = 2)
# t-test for significant differences in the proportions of the first item in q1
svyttest(q1== 1 ~ group, design = survey_design)
# trying a for loop for all five items
for(i in c(1:5)){
print(svyttest(q1== i ~ group, design = survey_design))
}
# I receive the following error:
Error in svyglm.survey.design(formula, design, family = gaussian()) :
all variables must be in design= argument

The formula q1 == i ~ group contains the loop variable i, and svyttest (through svyglm) requires every variable in the formula to be in the design, which i is not. Build the formula as a string with the value of i pasted in, then convert it with as.formula(). This should work:
# trying a for loop for all five items
for(i in c(1:5)){
print(svyttest(as.formula(paste("q1==", i, "~group")),
design = survey_design))
}

Another trick: build the formulas with as.formula(), store them in a list, and index into that list inside the loop:
x=c()
for(i in c(1:5)){
x=append(x,as.formula(paste("q1==",i,"~ group")))
print(svyttest(x[[i]], design = survey_design))
}
With regards
Aleksei

I would use bquote
for(i in 1:5){
print(eval(
bquote(svyttest(q1== .(i) ~ group, design = survey_design))
))
}
In this example as.formula works just as well, but bquote is more general.
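If you want to keep the results instead of printing them, one option is to collect the test objects in a list and pull out the pieces you need. This is a minimal sketch assuming the example data from the question; the seed, the `results` data frame and its column names are illustrative and not part of the original, and the design is written with formulas rather than vectors.

```r
library(survey)

# Example data as in the question (the seed is an added assumption)
set.seed(42)
survey_data <- data.frame(
  stratas = rep(c("strata1", "strata2", "strata3", "strata4"), each = 250),
  weight  = rep(c(5, 2, 1, 1), each = 250),
  group   = rep(c(1, 2), times = 500),
  q1      = sample(1:5, 1000, replace = TRUE)
)
survey_design <- svydesign(ids = ~1, strata = ~stratas,
                           weights = ~weight, data = survey_data)

# One test per response level; the value of i is pasted into the formula
tests <- lapply(1:5, function(i) {
  f <- as.formula(paste0("q1 == ", i, " ~ group"))
  svyttest(f, design = survey_design)
})

# Collect the estimated difference and p-value for each level
results <- data.frame(
  level      = 1:5,
  difference = sapply(tests, function(t) as.numeric(t$estimate)),
  p_value    = sapply(tests, function(t) as.numeric(t$p.value))
)
results
```

Each element of `tests` is still the full test object, so `tests[[3]]` prints the same output as the loop does for i = 3.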

Related

How to input matrix data into brms formula?

I am trying to input matrix data into the brm() function to run a signal regression. brm is from the brms package, which provides an interface to fit Bayesian models using Stan. Signal regression is when you model one covariate using another within the bigger model, and you use the by parameter like this: model <- brm(response ~ s(matrix1, by = matrix2) + ..., data = Data). The problem is, I cannot input my matrices using the 'data' parameter because it only allows one data.frame object to be inputted.
Here are my code and the errors I obtained from trying to get around that constraint...
First off, my reproducible code leading up to the model-building:
library(brms)
#100 rows, 4 columns. Each cell contains a number between 1 and 10
Data <- data.frame(runif(100,1,10),runif(100,1,10),runif(100,1,10),runif(100,1,10))
#Assign names to the columns
names(Data) <- c("d0_10","d0_100","d0_1000","d0_10000")
Data$Density <- as.matrix(Data)%*%c(-1,10,5,1)
#the coefficients we are modelling
d <- c(-1,10,5,1)
#Made a matrix with 4 columns with values 10, 100, 1000, 10000 which are evaluation points. Rows are repeats of the same column numbers
Bins <- 10^matrix(rep(1:4,times = dim(Data)[1]),ncol = 4,byrow =T)
Bins
As mentioned above, since 'data' only allows one data.frame object to be inputted, I've tried other ways of inputting my matrix data. These methods include:
1) making the matrix within the brm() function using as.matrix()
signalregression.brms <- brm(Density ~ s(Bins,by=as.matrix(Data[,c(c("d0_10","d0_100","d0_1000","d0_10000"))])),data = Data)
#Error in is(sexpr, "try-error") :
argument "sexpr" is missing, with no default
2) making the matrix outside the formula, storing it in a variable, then calling that variable inside the brm() function
Donuts <- as.matrix(Data[,c(c("d0_10","d0_100","d0_1000","d0_10000"))])
signalregression.brms <- brm(Density ~ s(Bins,by=Donuts),data = Data)
#Error: The following variables can neither be found in 'data' nor in 'data2':
'Bins', 'Donuts'
3) inputting a list containing the matrix using the 'data2' parameter
signalregression.brms <- brm(Density ~ s(Bins,by=donuts),data = Data,data2=list(Bins = 10^matrix(rep(1:4,times = dim(Data)[1]),ncol = 4,byrow =T),donuts=as.matrix(Data[,c(c("d0_10","d0_100","d0_1000","d0_10000"))])))
#Error in names(dat) <- object$term :
'names' attribute [1] must be the same length as the vector [0]
None of the above worked; each had their own errors and it was difficult troubleshooting them because I couldn't find answers or examples online that were of a similar nature in the context of brms.
I was able to use the above techniques just fine for gam(), in the mgcv package - you don't have to define a data.frame using 'data', you can call on variables defined outside of the gam() formula, and you can make matrices inside the gam() function itself. See below:
library(mgcv)
signalregression2 <- gam(Data$Density ~ s(Bins,by = as.matrix(Data[,c("d0_10","d0_100","d0_1000","d0_10000")]),k=3))
#Works!
It seems like brms is less flexible... :(
My question: does anyone have any suggestions on how to make my brm() function run?
Thank you very much!
My understanding of signal regression is limited enough that I'm not convinced this is correct, but I think it's at least a step in the right direction. The problem seems to be that brm() expects everything in its formula to be a column in data. So we can get the model to compile by ensuring all the things we want are present in data:
library(tidyverse)
signalregression.brms = brm(Density ~
s(cbind(d0_10_bin, d0_100_bin, d0_1000_bin, d0_10000_bin),
by = cbind(d0_10, d0_100, d0_1000, d0_10000),
k = 3),
data = Data %>%
mutate(d0_10_bin = 10,
d0_100_bin = 100,
d0_1000_bin = 1000,
d0_10000_bin = 10000))
Writing out each column by hand is a little annoying; I'm sure there are more general solutions.
For reference, here are my installed package versions:
map_chr(unname(unlist(pacman::p_depends(brms)[c("Depends", "Imports")])), ~ paste(., ": ", pacman::p_version(.), sep = ""))
[1] "Rcpp: 1.0.6" "methods: 4.0.3" "rstan: 2.21.2" "ggplot2: 3.3.3"
[5] "loo: 2.4.1" "Matrix: 1.2.18" "mgcv: 1.8.33" "rstantools: 2.1.1"
[9] "bayesplot: 1.8.0" "shinystan: 2.5.0" "projpred: 2.0.2" "bridgesampling: 1.1.2"
[13] "glue: 1.4.2" "future: 1.21.0" "matrixStats: 0.58.0" "nleqslv: 3.3.2"
[17] "nlme: 3.1.149" "coda: 0.19.4" "abind: 1.4.5" "stats: 4.0.3"
[21] "utils: 4.0.3" "parallel: 4.0.3" "grDevices: 4.0.3" "backports: 1.2.1"
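On the point about writing out each column by hand: base R lets a data frame hold a whole matrix as a single column, which keeps everything inside one data object. The sketch below only demonstrates that layout, and checks it with mgcv's gam(), which the question already reports as working with matrix arguments. Whether brm() accepts matrix columns in the same way is an assumption that is not verified here. The simulated values and the seed are illustrative.

```r
library(mgcv)

set.seed(1)
# 100 x 4 matrix of covariate values (the role played by the d0_* columns)
Donuts <- matrix(runif(400, 1, 10), ncol = 4)
# 100 x 4 matrix of evaluation points: 10, 100, 1000, 10000 in every row
Bins <- 10^matrix(rep(1:4, times = 100), ncol = 4, byrow = TRUE)

# Build a data frame whose second and third columns are matrices
Data <- data.frame(Density = as.numeric(Donuts %*% c(-1, 10, 5, 1)))
Data$Bins   <- Bins
Data$Donuts <- Donuts
str(Data)

# gam() can take both matrices from the data frame
fit <- gam(Density ~ s(Bins, by = Donuts, k = 3), data = Data)
summary(fit)
```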

Issue with tabmeans.svy and multi-categorical variables: not recognising variables in the design

I have had an issue analysing survey data in R using the survey and tab packages.
I think I am setting up the survey design object correctly, but when I try to run the tabmeans.svy function comparing the means across more than 2 categories, the function does not recognise the variable in the design.
Here's the example using my data:
svyd<-svydesign(id=~psu, #PSU variable
strata=~strata, #Strata variable
weights=~ca_betaindin_xw, #Weight variable
data=usds)
svyd_emp<-subset(svyd, usds$samp_employ==1) #subset the data to required analytic sample
t1<-tabmeans.svy(age~ethnicity,
design = svyd_emp) #Run tabmeans.svy comparing means of age by ethnicity
Which produces this error:
Error in svyglm.survey.design(Age ~ 1, design = design) :
all variables must be in design= argument
When I try the same function with a binary variable the function works
t2<-tabmeans.svy(age~sex,
design = svyd_emp) #Run tabmeans.svy comparing means of age by sex
#WORKS
Comparing means across multi categorical variables using this function has previously worked. I can't figure out why the function is throwing up an error now. The survey.design object had the variables listed in the object.
I cannot share my data but I have reproduced the same issue using the 'api' dataset in the survey package.
data(api)
sdesign<-svydesign(id=~dnum+snum,
strata=~stype,
weights=~pw,
data=apistrat,
nest = TRUE)
t3<-tabmeans.svy(api00~stype, # stype has 3 categories = DOESNT WORK
design=sdesign)
t4<-tabmeans.svy(api00~sch.wide,
design=sdesign) # sch.wide has 2 categories = WORKS
Appreciate any thoughts or suggestions on how to get around this issue.
Many thanks
Thanks for the reproducible example. When I run it, I get
> t3<-tabmeans.svy(api00~stype, # stype has 3 categories = DOESNT WORK
+ design=sdesign)
Error in svyglm.survey.design(Age ~ 1, design = design) :
all variables must be in design= argument
> traceback()
4: stop("all variables must be in design= argument")
3: svyglm.survey.design(Age ~ 1, design = design)
2: svyglm(Age ~ 1, design = design)
1: tabmeans.svy(api00 ~ stype, design = sdesign)
which is disconcerting, because why is it trying to find an Age variable? (This was masked a little in your example, because you have an age variable).
Looking at the code for tabmeans.svy I see
if (num.groups == 2) {
fit <- svyttest(formula, design = design)
diffmeans <- -fit$estimate
diffmeans.ci <- -rev(as.numeric(fit$conf.int))
p <- fit$p.value
}
else {
fit1 <- svyglm(Age ~ 1, design = design)
fit2 <- svyglm(Age ~ Sex, design = design)
fit <- do.call(anova, c(list(object = fit1, object2 = fit2),
anova.svyglm.list))
p <- as.numeric(fit$p)
}
which explains the problem: if there are more than two groups it ignores your variables and instead tests for an effect of Sex on Age.
I suspect a cut-and-paste error by the maintainer. I have filed a GitHub issue. Unfortunately, I can't see a simple work-around.
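If you only need the group means and an overall test, and not the formatted table from tabmeans.svy, you can call the survey package directly. This is a sketch of that route using the api example from the question. It is not a fix for tabmeans.svy itself, and the Wald test from regTermTest is one reasonable choice of overall test, not necessarily the one the tab package intended.

```r
library(survey)
data(api)

# Same design as in the question
sdesign <- svydesign(id = ~dnum + snum,
                     strata = ~stype,
                     weights = ~pw,
                     data = apistrat,
                     nest = TRUE)

# Weighted mean of api00 within each of the three school types
group_means <- svyby(~api00, ~stype, design = sdesign, FUN = svymean)
group_means

# Overall test of whether mean api00 differs across school types
fit <- svyglm(api00 ~ stype, design = sdesign)
overall_test <- regTermTest(fit, ~stype)
overall_test
```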

How to get around error "factor has new levels" in cross-validation glm?

My goal is to use cross-validation to evaluate the performance of a linear model.
My problem is that my training and testing sets might not always have the same variable levels.
Here is a reproducible data example:
set.seed(1)
x <- rnorm(n = 1000)
y <- rep(x = c("A","B"), times = c(500,500))
z <- rep(x = c("D","E","F"), times = c(997,2,1))
data <- data.frame(x,y,z)
summary(data)
Now let's make a glm model:
model_glm <- glm(x~., data = data)
And let's use cross-validation on this model:
library(boot)
cross_validation_glm <- cv.glm(data = data, glmfit = model_glm, K = 10)
And this is the kind of error output that you will get:
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
factor z has new levels F
If you don't get this error, re-run the cross-validation and at some point you will get a similar error.
The nature of the problem here is that when you do cross-validation, the train and test subsets might not have the exact same variable levels. Here our variable z has three levels (D,E,F).
In our data overall there are many more D's than E's and F's.
Thus whenever you take a small subset of the whole data (to do cross-validation), there is a very good chance that the z values in it are all D.
The E and F levels then get dropped, and we get the error (this answer is helpful for understanding the problem: https://stackoverflow.com/a/51555998/10972294).
My question is: how to avoid the drop in the first place?
If it is not possible, what are the alternatives?
(Keep in mind that this is a reproducible example; the actual data I am using has many variables like z, and I would like to avoid deleting them.)
To answer your question in the comment, I don't know if there is a function or not. Most likely there is one, but I have no idea on which package would contain it. For this example, this function should work:
set.seed(1)
x <- rnorm(n = 1000)
y <- rep(x = c("A","B"), times = c(500,500))
z <- rep(x = c("D","E","F"), times = c(997,2,1))
data <- data.frame(x,y,z)
#optional tag row for later identification:
#data$rowid<-1:nrow(data)
stratified <- function(df, column, percent){
#split dataframe into groups based on column
listdf<-split(df, df[[column]])
testsubgroups<-lapply(listdf, function(x){
#pick the number of samples per group, round up.
numsamples <- ceiling(percent*nrow(x))
#selects the rows
whichones <-sample(1:nrow(x), numsamples, replace = FALSE)
testsubgroup <-x[whichones,]
})
#combine the subgroups into one data frame
testgroup<-do.call(rbind, testsubgroups)
testgroup
}
testgroup<-stratified(data, "z", 0.8)
This will just split the initial data by column z. If you are interested in grouping by multiple columns, this could be extended using the group_by function from the dplyr package, but that would be another question.
Comment on the statistics: If you just have a few examples for any particular factor, what type of fit do you expect? A poor fit with wide confidence limits.
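To show how a stratified sample like this feeds into a train/test split, here is a sketch of a variant that returns row indices instead of a data frame, so the held-out rows are simply the ones not picked. The names stratified_rows, train and test are hypothetical, and this does a single split, not the K-fold procedure of cv.glm.

```r
set.seed(1)
data <- data.frame(
  x = rnorm(1000),
  y = rep(c("A", "B"), times = c(500, 500)),
  z = rep(c("D", "E", "F"), times = c(997, 2, 1))
)

# Return row indices sampled separately within each level of `column`
stratified_rows <- function(df, column, percent) {
  groups <- split(seq_len(nrow(df)), df[[column]])
  picked <- lapply(groups, function(idx) {
    n <- ceiling(percent * length(idx))   # round up so rare levels are kept
    idx[sample.int(length(idx), n)]
  })
  sort(unlist(picked, use.names = FALSE))
}

train_rows <- stratified_rows(data, "z", 0.8)
train <- data[train_rows, ]
test  <- data[-train_rows, ]

# Every level of z is present in the training set, so predict() sees no new levels
fit  <- glm(x ~ ., data = train)
pred <- predict(fit, newdata = test)
mean((test$x - pred)^2)   # held-out mean squared error
```

Because the counts are rounded up, the two E rows and the single F row all land in the training set and the test set contains only D's, which ties in with the comment above about how little can be learned from such rare levels.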

Can I get unwtd.count included when running the svymean from the R Survey package?

I've written an R script to loop through a bunch of variables in a survey and output weighted values, CVs, CIs etc.
I would like it to also output the unweighted observations count.
I know it's a bit of a lazy question because I can calculate unweighted counts on my own and join them back in. I'm just trying to replicate a stata script that would return 'obs'
svy:tab jdvariable, per cv ci obs column format(%14.4g)
This is my calculated values table:
myresult_year_calc <- svyby(make.formula(newmetricname), # variable to pass to function
by = ~year, # grouping
design = subset(csurvey, geoname %in% jv_geo), # design object with subset definition
vartype = c("ci","cvpct"), # report variation as ci, and cv percentage
na.rm.all=TRUE,
FUN = svymean # specify function from survey package
)
By using unwtd.count instead of FUN, I get the counts I want.
myresult_year_obs <- svyby(make.formula(newmetricname), # variable to pass to function
by = ~year, # grouping
design = subset(csurvey, geoname %in% jv_geo), # design object with subset definition
vartype = c("ci","cvpct"), # report variation as ci, and cv percentage
na.rm.all=TRUE,
unwtd.count
)
Honestly in writing this question I made it 98% through a solution, but I'll ask anyway in case someone knows a more efficient way.
myresult_year_calc and myresult_year_obs both return what I expect, and if I use merge(myresult_year_calc, myresult_year_obs, by = "year") I get the table I want. This actually just gives me one count per year in this example, instead of one count for 'Yes' responses and one count for 'No'.
Is there any way to get both means and unweighted counts with a single command?
I figured this out by creating a second design object where weights = ~0. When I ran svyby using the svytotal function with the unweighted design, it followed the formula.
dsgn2 <- svydesign(ids = ~0,
weights = ~0,
data = data,
na.rm = T)
unweighted_n <- svyby(~interaction(group1,group2), ~as.factor(mean_rating), design = dsgn2, FUN = svytotal, na.rm = T)

Calculated values on imputed data

I'd like to do something like the following: (myData is a data table)
#create some data
myData = data.table(invisible.covariate=rnorm(50),
visible.covariate=rnorm(50),
category=factor(sample(1:3,50, replace=TRUE)),
treatment=sample(0:1,50, replace=TRUE))
myData[,outcome:=invisible.covariate+visible.covariate+treatment*as.integer(category)]
myData[,invisible.covariate:=NULL]
#process it
myData[treatment == 0,untreated.outcome:=outcome]
myData[treatment == 1,treated.outcome:=outcome]
myPredictors = matrix(0,ncol(myData),ncol(myData))
myPredictors[5,] = c(1,1,0,0,0,0)
myPredictors[6,] = c(1,1,0,0,0,0)
myImp = mice(myData,predictorMatrix=myPredictors)
fit1 = with(myImp, lm(treated.outcome ~ category)) #this works fine
for_each_imputed_dataset(myImp, #THIS IS NOT A REAL FUNCTION but I hope you get the idea
function(imputed_data_table) {
imputed_data_table[,treatment.effect:=treated.outcome-untreated.outcome]
})
fit2 = with(myImp, lm(treatment.effect ~ category))
#I want fit2 to be an object similar to fit1
...
I would like to add a calculated value to each imputed data set, then do statistics using that calculated value. Obviously the structure above is probably not how you'd do it. I'd be happy with any solution, whether it involves preparing the data table somehow before the mice, a step before the "fit =" as sketched above, or some complex function inside the "with" call.
The complete() function will generate the "complete" imputed data set for each of the requested iterations. But note that mice expects to work with data.frames, so it returns data.frames and not data.tables. (Of course you can convert if you like). But here is one way to fit all those models
imp = mice(myData,predictorMatrix=myPredictors)
fits<-lapply(seq.int(imp$m), function(i) {
lm(I(treated.outcome-untreated.outcome)~category, complete(imp, i))
})
fits
The results will be in a list and you can extract particular lm objects via fits[[1]], fits[[2]], etc
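If you would rather get back an object of the same kind as fit1, the same I() trick can go inside with(), since the expression is evaluated in each completed data set. A minimal sketch, using the nhanes data that ships with mice in place of the question's data; the choice of variables, m and seed are illustrative only.

```r
library(mice)

# Impute the built-in nhanes data (columns: age, bmi, hyp, chl)
imp <- mice(nhanes, m = 5, seed = 1, printFlag = FALSE)

# The derived outcome is computed inside the formula for every imputed data set
fit <- with(imp, lm(I(chl - bmi) ~ age))

# fit holds one lm per imputation and can be pooled as usual
pooled <- pool(fit)
summary(pooled)
```

The individual models are available as fit$analyses[[1]], fit$analyses[[2]], and so on.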
