Issue with tabmeans.svy and multi-categorical variables: not recognising variables in the design - R

I have had an issue analysing survey data in R using the survey and tab packages.
I think I am setting up the survey design object correctly, but when I try to run the tabmeans.svy function comparing means across more than 2 categories, the function does not recognise the variable in the design.
Here's the example using my data:
library(survey)
library(tab)
svyd <- svydesign(id = ~psu,                    # PSU variable
                  strata = ~strata,             # strata variable
                  weights = ~ca_betaindin_xw,   # weight variable
                  data = usds)
svyd_emp <- subset(svyd, usds$samp_employ == 1) # subset to the required analytic sample
t1 <- tabmeans.svy(age ~ ethnicity,
                   design = svyd_emp)           # compare means of age by ethnicity
Which produces this error:
Error in svyglm.survey.design(Age ~ 1, design = design) :
all variables must be in design= argument
When I try the same function with a binary variable, it works:
t2 <- tabmeans.svy(age ~ sex,
                   design = svyd_emp)  # compare means of age by sex
# WORKS
Comparing means across multi-categorical variables with this function has previously worked, and I can't figure out why it is throwing an error now. The variables are all listed in the survey.design object.
I cannot share my data but I have reproduced the same issue using the 'api' dataset in the survey package.
data(api)
sdesign <- svydesign(id = ~dnum + snum,
                     strata = ~stype,
                     weights = ~pw,
                     data = apistrat,
                     nest = TRUE)
t3 <- tabmeans.svy(api00 ~ stype,     # stype has 3 categories = DOESN'T WORK
                   design = sdesign)
t4 <- tabmeans.svy(api00 ~ sch.wide,  # sch.wide has 2 categories = WORKS
                   design = sdesign)
Appreciate any thoughts or suggestions on how to get around this issue.
Many thanks

Thanks for the reproducible example. When I run it, I get
> t3<-tabmeans.svy(api00~stype, # stype has 3 categories = DOESNT WORK
+ design=sdesign)
Error in svyglm.survey.design(Age ~ 1, design = design) :
all variables must be in design= argument
> traceback()
4: stop("all variables must be in design= argument")
3: svyglm.survey.design(Age ~ 1, design = design)
2: svyglm(Age ~ 1, design = design)
1: tabmeans.svy(api00 ~ stype, design = sdesign)
which is disconcerting: why is it trying to find an Age variable? (This was masked a little in your example, because you do have an age variable.)
Looking at the code for tabmeans.svy I see
if (num.groups == 2) {
  fit <- svyttest(formula, design = design)
  diffmeans <- -fit$estimate
  diffmeans.ci <- -rev(as.numeric(fit$conf.int))
  p <- fit$p.value
} else {
  fit1 <- svyglm(Age ~ 1, design = design)
  fit2 <- svyglm(Age ~ Sex, design = design)
  fit <- do.call(anova, c(list(object = fit1, object2 = fit2),
                          anova.svyglm.list))
  p <- as.numeric(fit$p)
}
which explains the problem: if there are more than two groups it ignores your variables and instead tests for an effect of Sex on Age.
I suspect a cut-and-paste error by the maintainer; I have filed a GitHub issue. Unfortunately, I can't see a simple work-around within tabmeans.svy itself.
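That said, if you only need the overall p-value rather than the formatted table, you can run the comparison the function intends yourself. A sketch using the api example (my suggestion, calling survey directly rather than tab):
fit1 <- svyglm(api00 ~ 1, design = sdesign)      # null model: single overall mean
fit2 <- svyglm(api00 ~ stype, design = sdesign)  # model with means varying by stype
anova(fit1, fit2)                                # design-based test across the 3 groups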

Related

problems running coxph in R: Error: all variables must be in design= argument

I am trying to do a weighted survival analysis in R with the NHIS dataset, but while running the coxph function I get the error "all variables must be in design= argument".
I created a sample weight for my dataset, as recommended on the NHIS website, by dividing the sample weight by 18 for the 18 years I use.
Then I created a design object with the svydesign-function:
data <- data %>%
  mutate(sampleweight = SAMPWEIGHT / 18)
data_design <- svydesign(id = ~PSU, weights = ~sampleweight, strata = ~STRATA,
                         nest = TRUE, data = data)
I then created a survival object with the Surv function, where censored is a dummy variable for dead/not dead:
surv <- Surv(data_design$variables$AGE,
             event = 1 - as.numeric(data_design$variables$censored))
Then I plotted a Kaplan-Meier curve and also did a log-rank test, with BMI_NEW being a BMI variable I created myself:
km <- svykm(surv ~ BMI_NEW, design = data_design)
svyjskm(diabetes_km1,...)
logrank_BMI <- svylogrank(surv ~ BMI_NEW, design = data_design)
Everything works well up to this point. I then tried to run a Cox regression, but it doesn't work. Here is my code:
cox_fit <- svycoxph(surv ~ BMI_NEW, design = data_design)
I then get the error message: "all variables must be in design= argument"
I am not sure why this error occurs, because BMI_NEW is part of data_design and it works for svykm and svylogrank.
Does anybody have an idea what I am doing wrong?
Thank you!
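A likely cause, noted here as a sketch rather than a confirmed fix: svycoxph requires every variable in its formula, including the response, to be found in the design object, but surv was built outside it as a free-standing object. Constructing the Surv() response inside the formula keeps everything in the design:
cox_fit <- svycoxph(Surv(AGE, 1 - as.numeric(censored)) ~ BMI_NEW,
                    design = data_design)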

Error while using the weights option in nlme in R

Sorry, this is cross-posted from https://stats.stackexchange.com/questions/593717/nlme-regression-with-weights-syntax-in-r, but I thought it might be more appropriate to post it here.
I am trying to fit a power curve to some observations with nlme. However, I know some observations are less reliable than others (the reliability of each OBSID is reflected in WEIV in the dummy data), relatively independently of the variance; I quantified this beforehand and wish to include it as weights in my model. Moreover, part of my variance is correlated with my independent variable, so I cannot use the variance directly as weights.
This is my model:
coeffs_start <- lm(log(DEPV) ~ log(INDV), filter(testdummy10, DEPV != 0))$coefficients
nlme_fit <- nlme(DEPV ~ a * INDV^b,
                 data = testdummy10,
                 fixed = a + b ~ 1,
                 random = a ~ 1,
                 groups = ~PARTID,
                 start = c(a = exp(coeffs_start[1]), b = coeffs_start[2]),
                 verbose = FALSE,
                 method = "REML",
                 weights = varFixed(~WEIV))
This is some sample dummy data (I know it is not a great fit, but it's fake data anyway): https://github.com/FlorianLeprevost/dummydata/blob/main/testdummy10.csv
This runs fine without the weights argument, but when I add it I get the error below, and I am not sure why, because I believe the syntax is correct:
Error in recalc.varFunc(object[[i]], conLin) :
dims [product 52] do not match the length of object [220]
In addition: Warning message:
In conLin$Xy * varWeights(object) :
longer object length is not a multiple of shorter object length
Thanks in advance!
This looks like a very long-standing bug in nlme. I have a patched version on Github, which you can install via remotes::install_github() as below ...
remotes::install_github("bbolker/nlme")
testdummy10 <- read.csv("testdummy10.csv") |> subset(DEPV>0 & INDV>0)
coeffs_start <- coef(lm(log(DEPV)~log(INDV), testdummy10))
library(nlme)
nlme_fit <- nlme(DEPV ~ a*INDV^b,
data = testdummy10,
fixed=a+b~ 1,
random = a~ 1,
groups = ~ PARTID,
start = c(a=exp(coeffs_start[1]),
b=coeffs_start[2]),
verbose = FALSE,
method="REML",
weights=varFixed(~WEIV))
packageVersion("nlme") ## 3.1.160.9000
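A usage note worth adding (based on nlme's documented semantics, not part of the original fix): varFixed(~WEIV) declares Var(eps_i) = sigma^2 * WEIV_i, so rows with larger WEIV are treated as noisier; if WEIV encodes reliability rather than variance, invert it before fitting. You can confirm the structure was picked up from the fitted object:
nlme_fit$modelStruct$varStruct  # prints the fixed variance-function structure (~WEIV)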

How to use a for loop for the svyttest function in the survey package?

I am trying to use the svyttest function from the survey package in a for loop. I want to test for differences in proportions of responses between subpopulations in Likert-scale type data. For example, in a survey question (1 = strongly disagree, 5 = strongly agree), are there statistically significant differences in the proportion of "strongly disagree" responses between Groups 1 and 2?
I understand that I can also use the svyglm function from the survey package, but I have been unable to use that successfully in a for loop.
I also understand that there is a wtd.t.test in the weights package, and that the glm function in the stats package has a weights argument, but neither of these options gets the correct results. I need to use either the svyttest or the svyglm function from the survey package.
For reference, I have been looking here and here for some help, but have been unable to adapt these examples to my problem.
Thank you for your time and effort.
# create example survey data
ids <- 1:1000
stratas <- rep(c("strata1", "strata2", "strata3", "strata4"), each = 250)
weight <- rep(c(5, 2, 1, 1), each = 250)
group <- rep(c(1, 2), times = 500)
q1 <- sample(1:5, 1000, replace = TRUE)
survey_data <- data.frame(ids, stratas, weight, group, q1)
# create example svydesign
library(survey)
survey_design <- svydesign(ids = ~0,
                           probs = NULL,
                           strata = survey_data$stratas,
                           weights = survey_data$weight,
                           data = survey_data)
# look at the proportions of q1 responses by group
prop.table(svytable(~q1 + group, design = survey_design), margin = 2)
# t-test for significant differences in the proportions of the first item in q1
svyttest(q1 == 1 ~ group, design = survey_design)
# trying a for loop for all five items
for (i in 1:5) {
  print(svyttest(q1 == i ~ group, design = survey_design))
}
# I receive the following error:
Error in svyglm.survey.design(formula, design, family = gaussian()) :
all variables must be in design= argument
The problem is that, inside the loop, i is a free variable in the formula q1 == i ~ group, and svyglm insists that every variable in the formula be present in the design. Pasting the current value of i into the formula text and converting it with as.formula() removes the free variable. This should work:
# a for loop for all five items
for (i in 1:5) {
  print(svyttest(as.formula(paste("q1 ==", i, "~ group")),
                 design = survey_design))
}
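To see what the loop is actually building (a quick illustration): paste() fills in the current value of i as a literal, so the resulting formula no longer references a variable outside the design.
paste("q1 ==", 2, "~ group")              # "q1 == 2 ~ group"
as.formula(paste("q1 ==", 2, "~ group"))  # the same text, now a formula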
Another trick: build the formulas up in a list, which you can then index inside your loop:
x <- c()
for (i in 1:5) {
  x <- append(x, as.formula(paste("q1 ==", i, "~ group")))
  print(svyttest(x[[i]], design = survey_design))
}
With regards
Aleksei
I would use bquote
for (i in 1:5) {
  print(eval(
    bquote(svyttest(q1 == .(i) ~ group, design = survey_design))
  ))
}
In this example as.formula works just as well, but bquote is more general.
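For illustration (my example): bquote() substitutes the current value into the unevaluated call, so you can inspect exactly what will run before evaluating it, and the same trick works for arguments that are not formulas.
i <- 3
bquote(svyttest(q1 == .(i) ~ group, design = survey_design))
## svyttest(q1 == 3 ~ group, design = survey_design)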

Error trying to do cross validation after a classification tree

I am trying to run a simple classification tree using the tree package. I took the code from a textbook and copied it line by line, but it doesn't work, no matter what I do.
library(ISLR)
library(tree)
C <- Carseats
C$HighSales <- ifelse(C$Sales <= 8, "No", "Yes")
C <- C[, -1]
set.seed(2)
train <- sample(1:nrow(C), 200)
carseats.test <- C[-train, ]
high.test <- C$HighSales[-train]
tree.carseats <- tree(HighSales ~ ., C, subset = train)
tree.predict <- predict(tree.carseats, carseats.test, type = "class")
table(tree.predict, high.test)
(93 + 48) / 200
set.seed(3)
cv.cs <- cv.tree(tree.carseats, FUN = prune.misclass)
I am getting the following error:
Error in as.data.frame.default(data, optional = TRUE) :
cannot coerce class ‘"function"’ to a data.frame
I have looked at the help for the function; it requires a tree object, which is what I passed in.
What can the problem be? The code is identical to the textbook and to other websites that quote the book.
There are two problems. One is related to the formula in tree:
formula - A formula expression. The left-hand-side (response) should be either a numerical vector when a regression tree will be fitted or a factor, when a classification tree is produced. The right-hand-side should be a series of numeric or factor variables separated by +; there should be no interaction terms. Both . and - are allowed: regression trees can have offset terms.
So, we should instead have
C$HighSales <- factor(ifelse(C$Sales <= 8, "No", "Yes"))
Next, there's a problem with how cv.tree deals with variables (see here). Doing something like
mydf <- C
tree.carseats <- tree(HighSales ~ ., mydf, subset = train)
works. The issue is that R already has a function called C, and cv.tree ends up referring to that function rather than to your dataset.
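To see the name clash directly (an illustration I am adding, not from the original answer): the stats package, attached in every R session, exports a function named C for setting contrasts, so the name is already taken before you ever create the data frame.
stats::C  # the contrasts-setting function that shares its name with the data frame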

Cross validation help: Error in model.frame.default(as.formula(delete.response(Terms)), newdata, : variable lengths differ (found for 'fun factor')

So I have a specific error that I can't figure out. From searching, it seems the model and the cross-validation set do not have data with the same levels. I am trying to understand this fully for my use case. Basically, I am building a QDA model to predict vehicle country based on numeric values. This code will run for anyone, since it is a public Google Sheets document. For those of you who follow Doug DeMuro on YouTube, you may find this a tad bit interesting.
# load dataset into R
library(gsheet)
url <- 'https://docs.google.com/spreadsheets/d/1KTArYwDWrn52fnc7B12KvjRb6nmcEaU6gXYehWfsZSo/edit'
doug_df <- read.csv(text = gsheet2text(url, format = 'csv'),
                    stringsAsFactors = FALSE, header = FALSE)
# begin cleanup: remove first blank rows of data
doug_df <- doug_df[-c(1, 2, 3), ]
attach(doug_df)
# name columns appropriately
names(doug_df) <- c("year","make","model","styling","acceleration","handling",
                    "fun factor","cool factor","total weekend score","features",
                    "comfort","quality","practicality","value","total daily score",
                    "dougscore","video duration","filming city","filming state",
                    "vehicle country")
# remove categorical columns and columns not used for discriminant analysis,
# including the totals columns
library(dplyr)
doug_df <- doug_df %>%
  dplyr::select(-c(make, model, `total weekend score`, `total daily score`,
                   dougscore, `video duration`, `filming city`, `filming state`))
# convert from character to numeric
num.cols <- c("year","styling","acceleration","handling","fun factor","cool factor",
              "features","comfort","quality","practicality","value")
doug_df[num.cols] <- sapply(doug_df[num.cols], as.numeric)
`vehicle country` <- as.factor(`vehicle country`)
# create a new column to reflect groupings for the response variable
doug_df$country.group <- ifelse(`vehicle country` == 'Germany', 'Germany',
                         ifelse(`vehicle country` == 'Italy', 'Italy',
                         ifelse(`vehicle country` == 'Japan', 'Japan',
                         ifelse(`vehicle country` == 'UK', 'UK',
                         ifelse(`vehicle country` == 'USA', 'USA', 'Other')))))
# remove the initial country column
doug_df <- doug_df %>% dplyr::select(-c(`vehicle country`))
# QDA with multiple predictors
library(MASS)
qdafit1 <- qda(country.group ~ styling + acceleration + handling + `fun factor` +
                 `cool factor` + features + comfort + quality + value,
               data = doug_df)
# predict using the model and compute the error rate
n <- dim(doug_df)[1]
fittedclass <- predict(qdafit1, data = doug_df)$class
table(doug_df$country.group, fittedclass)
Error <- sum(doug_df$country.group != fittedclass) / n; Error
# conduct 10-fold cross-validation
allpredictedCV1 <- rep("NA", n)
cvk <- 10
groups <- c(rep(1:cvk, floor(n / cvk)))
set.seed(4)
cvgroups <- sample(groups, n, replace = TRUE)
for (i in 1:cvk) {
  qdafit1 <- qda(country.group ~ styling + acceleration + handling + `fun factor` +
                   `cool factor` + features + comfort + quality + value,
                 data = doug_df, subset = (cvgroups != i))
  newdata1i <- data.frame(doug_df[cvgroups == i, ])
  allpredictedCV1[cvgroups == i] <- as.character(predict(qdafit1, newdata1i)$class)
}
table(doug_df$country.group, allpredictedCV1)
CVmodel1 <- sum(allpredictedCV1 != doug_df$country.group) / n; CVmodel1
The last part of the code, with the cross-validation, throws this error:
Error in model.frame.default(as.formula(delete.response(Terms)), newdata, : variable lengths differ (found for 'fun factor')
Can someone explain a bit more in depth what is happening? I think the variable fun factor doesn't have the same levels in each fold of the cross-validation as it did in the model. Now I need to know my options for fixing it. Thanks in advance!
EDIT
In addition to the above, I get a very similar error when I try to predict a dummy car review.
# build a dummy review and predict it using multiple models
dummy_review <- data.frame(year = 2014, styling = 8, acceleration = 6, handling = 6,
                           `fun factor` = 8, `cool factor` = 8, features = 4,
                           comfort = 4, quality = 6, practicality = 3, value = 5)
# predict vehicle country for the dummy data using model 1
predict(qdafit1, dummy_review)$class
This returns the following error:
Error in model.frame.default(as.formula(delete.response(Terms)), newdata, : variable lengths differ (found for 'fun factor')
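A likely explanation, offered as a sketch rather than a verified answer: data.frame() runs make.names() on column names by default, so "fun factor" silently becomes "fun.factor" in both newdata1i and dummy_review, and predict() can no longer find a column named `fun factor`. Subsetting directly, or setting check.names = FALSE, preserves the original names:
names(data.frame(`fun factor` = 8))                       # "fun.factor"
names(data.frame(`fun factor` = 8, check.names = FALSE))  # "fun factor"
newdata1i <- doug_df[cvgroups == i, ]                     # subsetting keeps names as-is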
