Can I get unwtd.count included when running the svymean from the R Survey package? - r

I've written an R script to loop through a bunch of variables in a survey and output weighted values, CVs, CIs etc.
I would like it to also output the unweighted observations count.
I know it's a bit of a lazy question because I can calculate unweighted counts on my own and join them back in. I'm just trying to replicate a Stata script that returns 'obs':
svy:tab jdvariable, per cv ci obs column format(%14.4g)
This is my calculated values table:
myresult_year_calc <- svyby(make.formula(newmetricname), # variable to pass to function
by = ~year, # grouping
design = subset(csurvey, geoname %in% jv_geo), # design object with subset definition
vartype = c("ci","cvpct"), # report variation as ci, and cv percentage
na.rm.all=TRUE,
FUN = svymean # specify function from survey package
)
By passing unwtd.count as FUN instead of svymean, I get the counts I want.
myresult_year_obs <- svyby(make.formula(newmetricname), # variable to pass to function
by = ~year, # grouping
design = subset(csurvey, geoname %in% jv_geo), # design object with subset definition
vartype = c("ci","cvpct"), # report variation as ci, and cv percentage
na.rm.all=TRUE,
unwtd.count
)
Honestly in writing this question I made it 98% through a solution, but I'll ask anyway in case someone knows a more efficient way.
myresult_year_calc and myresult_year_obs both return what I expect, and merge(myresult_year_calc, myresult_year_obs, by = "year") gives me almost the table I want. However, it returns only one count per year in this example, instead of one count for 'Yes' responses and one count for 'No'.
Is there any way to get both means and unweighted counts with a single command?

I figured this out by creating a second design object (dsgn2) with weights = ~0. When I ran svyby with the svytotal function on this unweighted design, it followed the formula.
dsgn2 <- svydesign(ids = ~0,
weights = ~0,
data = data,
na.rm = T)
unweighted_n <- svyby(~interaction(group1,group2), ~as.factor(mean_rating), design = dsgn2, FUN = svytotal, na.rm = T)
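If a second design object is not wanted, the unweighted counts can also be tabulated directly from the underlying data frame with base R. The following is only a sketch on made-up data: the data frame dat and its columns year and response are hypothetical stand-ins for the survey data, not names from the question.

```r
# Hypothetical toy data standing in for the survey's underlying data frame.
dat <- data.frame(
  year     = c(2019, 2019, 2019, 2020, 2020, 2020, 2020),
  response = c("Yes", "No", "Yes", "No", "No", "Yes", NA)
)

# Unweighted observation counts per year and response level.
# table() drops rows with a missing response by default.
counts <- as.data.frame(
  table(year = dat$year, response = dat$response),
  responseName = "obs"
)
counts
```

The result is in long format (one row per year and response level), so it can be reshaped or merged with the svyby output by year as needed.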

Related

How to use a for loop for the svyttest function in the survey package?

I am trying to use the svyttest function in a for loop in the survey package. I want to test for differences in proportions of responses between subpopulations in likert-scale type data. For example, in a survey question (1=strongly disagree, 5 = strongly agree), are there statistically significant differences in the proportion of "strongly disagree" responses between Groups 1 and 2?
I understand that I can also use the svyglm function from the survey package, but I have been unable to successfully use that in a for loop.
I also understand that there is a wtd.t.test in the weights package and the glm function in the stats package has a weights argument, but neither of these two options get the correct results. I need to use either the svyttest or the svyglm functions in the survey package.
For reference, I have been looking here and here for some help, but have been unable to adapt these examples to my problem.
Thank you for your time and effort.
# create example survey data
ids <- 1:1000
stratas <- rep(c("strata1", "strata2","strata3","strata4"), each=250)
weight <- rep(c(5,2,1,1), each=250)
group <- rep(c(1,2), times=500)
q1 <- sample(1:5, 1000, replace = TRUE)
survey_data <- data.frame(ids, stratas, weight, group, q1)
# create example svydesign
library(survey)
survey_design <- svydesign(ids = ~0,
probs = NULL,
strata = survey_data$stratas,
weights = survey_data$weight,
data = survey_data)
# look at the proportions of q1 responses by group
prop.table(svytable(~q1+group, design = survey_design), margin = 2)
# t-test for significant differences in the proportions of the first item in q1
svyttest(q1== 1 ~ group, design = survey_design)
# trying a for loop for all five items
for(i in c(1:5)){
print(svyttest(q1== i ~ group, design = survey_design))
}
# I receive the following error:
Error in svyglm.survey.design(formula, design, family = gaussian()) :
all variables must be in design= argument
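svyttest looks up every variable named in the formula in the design object, so the loop index i, which exists only in the calling environment and not in the survey data, triggers the error. When updating a formula dynamically inside a function or a loop, build it with the value of i already substituted in, for example with as.formula(). This should work: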
# trying a for loop for all five items
for(i in c(1:5)){
print(svyttest(as.formula(paste("q1==", i, "~group")),
design = survey_design))
}
Another trick: you can store the formulas in a list and index into it inside your loop:
x=c()
for(i in c(1:5)){
x=append(x,as.formula(paste("q1==",i,"~ group")))
print(svyttest(x[[i]], design = survey_design))
}
With regards
Aleksei
I would use bquote
for(i in 1:5){
print(eval(
bquote(svyttest(q1== .(i) ~ group, design = survey_design))
))
}
In this example as.formula works just as well, but bquote is more general.
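The difference between the failing formula and the two fixes can be checked without the survey package. This is a small base R sketch; the names f_naive, f_paste and f_bquote are made up for illustration.

```r
# The naive formula still refers to the symbol i, which is not a survey variable.
f_naive <- q1 == i ~ group
all.vars(f_naive)
# "q1" "i" "group"

for (i in 1:2) {
  # Build the formula from a string with the value of i pasted in.
  f_paste <- as.formula(paste("q1 ==", i, "~ group"))
  # Substitute the value of i into the expression, then evaluate it to a formula.
  f_bquote <- eval(bquote(q1 == .(i) ~ group))
  print(f_paste)
  print(all.vars(f_bquote))  # only q1 and group remain
}
```

In both fixed versions the formula contains a literal number instead of i, so every remaining variable can be found in the design.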

Run svymean on all variables [duplicate]

------ Short story--------
I would like to run svymean on all variables in the dataset (assuming they are all numeric). I've pulled this narrative from this guide over here: https://stylizeddata.com/how-to-use-survey-weights-in-r/
I know I can run svymean on all the variables by listing them out like this:
svymean(~age+gender, ageDesign, na.rm = TRUE)
However, my real dataset is 500 variables long (they are all numeric), and I need to get the means all at once more efficiently. I tried the following but it does not work.
svymean(~., ageDesign, na.rm = TRUE)
Any ideas?
--------- Long explanation with real data-----
library(haven)
library(survey)
library(dplyr)
Import NHANES demographic data
nhanesDemo <- read_xpt(url("https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.XPT"))
Copy and rename variables so they are more intuitive. "fpl" is percent of the
federal poverty level. It ranges from 0 to 5.
nhanesDemo$fpl <- nhanesDemo$INDFMPIR
nhanesDemo$age <- nhanesDemo$RIDAGEYR
nhanesDemo$gender <- nhanesDemo$RIAGENDR
nhanesDemo$persWeight <- nhanesDemo$WTINT2YR
nhanesDemo$psu <- nhanesDemo$SDMVPSU
nhanesDemo$strata <- nhanesDemo$SDMVSTRA
Since there are 47 variables, we will select only the variables we will use in
this analysis.
nhanesAnalysis <- nhanesDemo %>%
select(fpl,
age,
gender,
persWeight,
psu,
strata)
Survey Weights
Here we use "svydesign" to assign the weights. We will use this new design
variable "nhanesDesign" when running our analyses.
nhanesDesign <- svydesign(id = ~psu,
strata = ~strata,
weights = ~persWeight,
nest = TRUE,
data = nhanesAnalysis)
Here we use "subset" to tell "nhanesDesign" that we want to only look at a
specific subpopulation (i.e., those age between 18-79 years). This is
important to do. If you don't do this and just restrict it in a different way
your estimates won't have correct SEs.
ageDesign <- subset(nhanesDesign, age > 17 &
age < 80)
Statistics
We will use "svymean" to calculate the population mean for age. The na.rm
argument "TRUE" excludes missing values from the calculation. We see that
the mean age is 45.648 and the standard error is 0.5131.
svymean(~age, ageDesign, na.rm = TRUE)
You can use reformulate to construct the formula dynamically.
library(survey)
svymean(reformulate(names(nhanesAnalysis)), ageDesign, na.rm = TRUE)
# mean SE
#fpl 3.0134 0.1036
#age 45.4919 0.5273
#gender 1.5153 0.0065
#persWeight 80773.3847 5049.1504
#psu 1.5102 0.1330
#strata 126.1877 0.1506
This gives the same output as specifying each column individually in the function.
svymean(~age + fpl + gender + persWeight + psu + strata, ageDesign, na.rm = TRUE)
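What reformulate() builds can be inspected on its own, without a survey design. The sketch below uses a hypothetical vector of column names; dropping the design columns with setdiff() is an optional extra, not part of the answer above.

```r
# Hypothetical column names, mirroring the example data set.
all_cols <- c("fpl", "age", "gender", "persWeight", "psu", "strata")

# One-sided formula with every column as a term.
f_all <- reformulate(all_cols)
f_all
# ~fpl + age + gender + persWeight + psu + strata

# Optionally leave out the design variables before building the formula.
analysis_cols <- setdiff(all_cols, c("persWeight", "psu", "strata"))
f_analysis <- reformulate(analysis_cols)
f_analysis
# ~fpl + age + gender
```

Either formula can then be passed as the first argument of svymean() in place of a hand-written one.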

How to apply weights associated with the NIS (National inpatient sample) in R

I am trying to apply weights given with NIS data using the R package "survey", but I have been unsuccessful. I am fairly new to R and survey commands.
This is what I have tried:
library(readstata13) # provides read.dta13()
# Create the unweighted dataset
d <- read.dta13(path)
# This produces the correct weighted amount of cases I need.
sum(d$DISCWT)
library(survey)
# Create survey object
dsvy <- svydesign(id = ~ d$HOSP_NIS, strata = ~ d$NIS_STRATUM, weights = ~ d$DISCWT, nest = TRUE, data = d)
d$count <- 1
svytotal(~d$count, dsvy)
However I get the following error after running the survey total:
Error in onestrat(x[index, , drop = FALSE], clusters[index], nPSU[index][1], :
Stratum (1131) has only one PSU at stage 1
Any help would be greatly appreciated, thank you!
The error indicates that you have specified a design where one of the strata has just a single primary sampling unit. It's not possible to get an unbiased estimate of variance for a design like that: the contribution of stratum 1131 will end up as 0/0.
As you see, R's default response is to give an error, because a reasonably likely explanation is that the data or the svydesign statement is wrong. Sometimes, as here, that's not what you want, and the global option 'survey.lonely.psu' describes other ways to respond. You want to set
options(survey.lonely.psu = "adjust")
This and other options are documented at help(surveyoptions).
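The effect of the option can be reproduced on a tiny made-up design in which one stratum has a single PSU. This is only a sketch: the data frame toy, its columns, and the helper total_count() are hypothetical and unrelated to the NIS data.

```r
library(survey)

# Hypothetical toy data: stratum "c" contains a single PSU.
toy <- data.frame(
  psu     = 1:5,
  stratum = c("a", "a", "b", "b", "c"),
  w       = rep(10, 5),
  count   = 1
)

# Helper (hypothetical) that builds the design and estimates the weighted total.
total_count <- function() {
  des <- svydesign(id = ~psu, strata = ~stratum, weights = ~w, data = toy)
  svytotal(~count, des)
}

# With the default setting "fail", the lonely PSU stops the computation.
old <- options(survey.lonely.psu = "fail")
res_fail <- tryCatch(total_count(), error = function(e) conditionMessage(e))
res_fail

# With "adjust", the estimate and a standard error are returned.
options(survey.lonely.psu = "adjust")
res_adjust <- total_count()
res_adjust

options(old)  # restore the previous setting
```

The point estimate is unaffected by the option; only the way the single-PSU stratum enters the variance calculation changes.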

Plot how the estimated survival depends upon the value of a covariate of interest. Problems with relevel

I want to plot how the estimated survival from a Cox model depends on the value of a covariate of interest, while the remaining variables are fixed at their average values (if they are continuous) or at their lowest values (if they are dummies). Following this example http://www.sthda.com/english/wiki/cox-proportional-hazards-model , I have constructed a new data frame with three rows, one for each value of my variable of interest, with the other covariates held fixed. Two of these covariates are factors. The new data frame is then passed to survfit() via the newdata argument.
When I pass the data frame to survfit(), I get the following error message: Error in relevel.default(occupation) : 'relevel' only for factors. What is the source of the problem? If it is related to the factor vectors, how can I solve it? An example of the code is below. Unfortunately, I cannot share the data or find a dataset that produces the same error message.
I have transformed the factor variables into integer vectors in the Cox model and in the new dataset. It did not work.
I have deleted all the factor variables, and then it works.
I have tried to implement this strategy, but it did not work: Plotting predicted survival curves for continuous covariates in ggplot
fit <- coxph(Surv(entry, exit, event == 1) ~ status_plot +
exp_national + relevel(occupation, 5) + age + gender + EDUCATION , data = data)
data_rank <- with(data,
data.frame(status_plot = c(1,2,3), # factor vector of interest
exp_national=rep(mean(exp_national, na.rm = TRUE), 3),
occupation = c(5,5,5), # factor with 6 categories, number 5 is the category of reference in the cox model
age=rep(mean(age, na.rm = TRUE), 3),
gender = c(1,1,1),
EDUCATION=rep(mean(EDUCATION, na.rm = TRUE), 3) ))
surv.fin <- survfit(fit, newdata=data_rank) # this produces the error
Looking at the code, it appears you probably attempted to take the mean of a factor. Please post at least str(data) as an edit to the body of your question. You should also realize that you can give a single value to a column in a data.frame call and have it recycled to the correct length, so all the means could be entered as single items rather than wrapped in rep().
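Since the original data were not shared, the following base R sketch only illustrates two points on made-up data: single values are recycled in a data.frame() call, and relevel() accepts a factor but not a plain numeric vector. The data frame dat and the assumption that occupation is a six-level factor in the fitting data are hypothetical.

```r
# Hypothetical stand-in for the original data.
dat <- data.frame(
  age        = c(30, 40, 50, 60),
  occupation = factor(c(1, 5, 6, 5), levels = 1:6)
)

# Single values are recycled to the length of the longest column, so rep() is not needed.
# occupation is created as a factor with the same levels as in dat (an assumption).
data_rank <- data.frame(
  status_plot = c(1, 2, 3),
  age         = mean(dat$age, na.rm = TRUE),
  occupation  = factor(5, levels = levels(dat$occupation))
)
str(data_rank)

# relevel() rejects a plain numeric vector ...
msg <- tryCatch(relevel(c(5, 5, 5), ref = "5"), error = function(e) conditionMessage(e))
msg

# ... but works on the factor column.
levels(relevel(data_rank$occupation, ref = "5"))
```

If occupation in the new data frame is numeric while the model formula calls relevel(occupation, 5), that mismatch is one possible source of the reported error, although this cannot be confirmed without str(data).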

Library "TableOne" multiple comparisons. Calculate line by line p-values

I received a comment from a reviewer who wanted to have all the p-values for each line of specific variables levels in a demographic characteristic table (Table 1). Even though the request appears quite strange (and inexact) to me, I would like to agree with his suggestion.
library(tableone)
## Load data
library(survival); data(pbc)
# drop ID from variable list
vars <- names(pbc)[-1]
## Create Table 1 stratified by trt (can add more stratifying variables)
tableOne <- CreateTableOne(vars = vars, strata = c("trt"), data = pbc, factorVars = c("status","edema","stage"))
print(tableOne, nonnormal = c("bili","chol","copper","alk.phos","trig"), exact = c("status","stage"), smd = TRUE)
the output:
I need to have the p-values for each level of the variables status, edema and stage, with Bonferroni correction. I went through the documentation without success.
In addition, is it correct to use chi-squared to compare sample sizes across rows?
UPDATE:
I'm not sure if my approach is correct, but I would like to share it with you. For the variable status, I generated a dummy variable for each level, then I calculated the chi-squared test for each one.
library(tableone)
## Load data
library(survival); data(pbc)
d <- pbc[,c("status", "trt")]
# Convert dummy variables
d$status.0 <- ifelse(d$status==0, 1,0)
d$status.1 <- ifelse(d$status==1, 1,0)
d$status.2 <- ifelse(d$status==2, 1,0)
t <- rbind(
chisq.test(d$status.0, d$trt),
# p-value = 0.7202
chisq.test(d$status.1, d$trt),
# p-value = 1
chisq.test(d$status.2, d$trt)
#p-value = 0.7818
)
t
BONFERRONI ADJ FOR MULTIPLE COMPARISONS:
p <- t[,"p.value"]
p.adjust(p, method = "bonferroni")
This question was posted some time ago, so I suppose you have already answered the reviewer.
I don't really understand why you would compute adjusted p-values for just three variables. Adjusting p-values depends on the number of comparisons made. If you use p.adjust() with a vector of 3 p-values, the results will not really be "adjusted" for the number of comparisons made (you actually made more than a dozen and a half!).
I show how to extract all p-values so you can compute the adjusted ones.
To extract the p-values from a tableone object, there is a way using object attributes (explained first) and two quick and dirty ways (at the bottom).
To extract them, first I copy your code to create your tableOne:
library(tableone)
## Load data
library(survival); data(pbc)
# drop ID from variable list
vars <- names(pbc)[-1]
## Create Table 1 stratified by trt (can add more stratifying variables)
tableOne <- CreateTableOne(vars = vars, strata = c("trt"), data = pbc, factorVars = c("status","edema","stage"))
You can see what your "tableOne" object has via attributes()
attributes(tableOne)
You can see that a tableOne object usually holds one table for continuous variables and one for categorical variables. You can use attributes() on them too
attributes(tableOne$CatTable)
# you can notice $pValues
Now you know "where" the pValues are, you can extract them with attr()
attr(tableOne$CatTable, "pValues")
Something similar with numerical variables:
attributes(tableOne$ContTable)
# $pValues are there
attr(tableOne$ContTable, "pValues")
You have pValues for Normal and NonNormal variables.
As you set them before, you can extract both
mypCont <- attr(tableOne$ContTable, "pValues") # put them in an object
nonnormal = c("bili","chol","copper","alk.phos","trig") # copied from your code
mypCont[rownames(mypCont) %in% c(nonnormal), "pNonNormal"] # extract NonNormal
"%!in%" <- Negate("%in%")
mypCont[rownames(mypCont) %!in% c(nonnormal), "pNormal"] # extract Normal
All that said, and your pValues extracted, I think there are two much more convenient quick and dirty ways to accomplish the same:
Quick and dirty way A: using dput() with your printed tableOne. Then search in the console where the pValues are and copy-paste them to the script, to store them in an object
Quick and dirty way B: If you look in tableOne vignette there is an "Exporting" section, you can use print(tableOne, quote = TRUE) and then just copy and paste to a spreadsheet (like LibreOffice, Excel...).
Then I would select the column with pValue, transpose it, and get it back to R, to compute adjusted p values with p.adjust() and copy them back to the spreadsheet for journal submission
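To make the point about the number of comparisons concrete, here is a small base R sketch. The three p-values are made up for illustration, and n = 18 is only a stand-in for "more than a dozen and a half" comparisons.

```r
# Hypothetical p-values for three of the comparisons in the table.
p_three <- c(status = 0.01, edema = 0.02, stage = 0.04)

# Bonferroni adjustment that only accounts for the 3 values passed in.
adj_3 <- p.adjust(p_three, method = "bonferroni")
adj_3

# Bonferroni adjustment that accounts for all comparisons made (here assumed to be 18).
adj_18 <- p.adjust(p_three, method = "bonferroni", n = 18)
adj_18
```

Passing the full vector of extracted p-values to p.adjust() has the same effect as setting n to the total number of comparisons.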
