Run svymean on all variables [duplicate] - r

This question already has an answer here:
Is there a better alternative than string manipulation to programmatically build formulas?
(1 answer)
Closed 2 years ago.
------ Short story--------
I would like to run svymean on all variables in the dataset (assuming they are all numeric). I've pulled this narrative from this guide over here: https://stylizeddata.com/how-to-use-survey-weights-in-r/
I know I can run svymean on all the variables by listing them out like this:
svymean(~age+gender, ageDesign, na.rm = TRUE)
However, my real dataset is 500 variables long (they are all numeric), and I need to get the means all at once more efficiently. I tried the following but it does not work.
svymean(~., ageDesign, na.rm = TRUE)
Any ideas?
--------- Long explanation with real data-----
library(haven)
library(survey)
library(dplyr)
# Import NHANES demographic data
nhanesDemo <- read_xpt(url("https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.XPT"))
# Copy and rename variables so they are more intuitive. "fpl" is percent of
# the federal poverty level. It ranges from 0 to 5.
nhanesDemo$fpl <- nhanesDemo$INDFMPIR
nhanesDemo$age <- nhanesDemo$RIDAGEYR
nhanesDemo$gender <- nhanesDemo$RIAGENDR
nhanesDemo$persWeight <- nhanesDemo$WTINT2YR
nhanesDemo$psu <- nhanesDemo$SDMVPSU
nhanesDemo$strata <- nhanesDemo$SDMVSTRA
# Since there are 47 variables, we will select only the variables we will use in
# this analysis.
nhanesAnalysis <- nhanesDemo %>%
  select(fpl,
         age,
         gender,
         persWeight,
         psu,
         strata)
# Survey Weights
# Here we use "svydesign" to assign the weights. We will use this new design
# variable "nhanesDesign" when running our analyses.
nhanesDesign <- svydesign(id = ~psu,
                          strata = ~strata,
                          weights = ~persWeight,
                          nest = TRUE,
                          data = nhanesAnalysis)
# Here we use "subset" to tell "nhanesDesign" that we want to only look at a
# specific subpopulation (i.e., those age between 18-79 years). This is
# important to do. If you don't do this and just restrict it in a different way
# your estimates won't have correct SEs.
ageDesign <- subset(nhanesDesign, age > 17 &
                      age < 80)
# Statistics
# We will use "svymean" to calculate the population mean for age. The na.rm
# argument "TRUE" excludes missing values from the calculation. We see that
# the mean age is 45.648 and the standard error is 0.5131.
svymean(~age, ageDesign, na.rm = TRUE)
I know I can run svymean on all the variables by listing them out like this:
svymean(~age+gender, ageDesign, na.rm = TRUE)
However, my real dataset is 500 variables long, and I need to get the means all at once more efficiently. I tried the following but it does not work.
svymean(~., ageDesign, na.rm = TRUE)

You can use reformulate to construct the formula dynamically.
library(survey)
svymean(reformulate(names(nhanesAnalysis)), ageDesign, na.rm = TRUE)
#                  mean        SE
# fpl            3.0134    0.1036
# age           45.4919    0.5273
# gender         1.5153    0.0065
# persWeight 80773.3847 5049.1504
# psu            1.5102    0.1330
# strata       126.1877    0.1506
This gives the same output as specifying each column individually in the function.
svymean(~age + fpl + gender + persWeight + psu + strata, ageDesign, na.rm = TRUE)
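The output above also reports "means" for the weight, PSU, and strata columns, which are usually not of interest. As a minimal sketch, the formula can be built from a filtered vector of names. The vectors all_vars and design_vars are hypothetical stand-ins for names(nhanesAnalysis) and the design columns; only base R's reformulate() and setdiff() are used, so the survey package is not needed to see the resulting formula.

```r
# Hypothetical column names standing in for names(nhanesAnalysis).
all_vars <- c("fpl", "age", "gender", "persWeight", "psu", "strata")

# Design columns to leave out of the formula (an assumption about what
# the analysis should skip).
design_vars <- c("persWeight", "psu", "strata")

# reformulate() turns a character vector into a one-sided formula.
f <- reformulate(setdiff(all_vars, design_vars))
print(f)
# ~fpl + age + gender

# The formula can then be passed on, for example:
# svymean(f, ageDesign, na.rm = TRUE)
```

With 500 variables, the same pattern should apply with all_vars replaced by the real column names.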

Related

r mice - "sample" imputation method not working correctly

I am using mice to impute missing data in a large dataset (24k obs, 98 vars). I am using the "sample" imputation method to impute some variables (and other methods for the others - many categorical). When I check my imputed data, those variables that I've applied "sample" to are not always imputed and I have missingness in them. I know for sure that I'm applying "sample" to them (I double checked the methods), and I made sure to remove all predictors of them in the prediction matrix. From my understanding, where they are in the visit sequence shouldn't matter (but I make sure they're immediately after variables with no missingness).
I can't give you a reprex because when I try to recreate the problem, it doesn't happen and everything is imputed just fine. I tried simulating my own data and I tried subsetting the dataset to a group of the variables that I want to use the sample method on. That's part of why I'm so stumped - I coded everything the same and it worked with the subset. I didn't think that the sample method would be at all dependent on the presence of any other vars.
EDIT:
This is the code I'm using
#produce prediction matrix
pred1 <- quickpred_ext(data1, mincor = 0.08, include = "age")
pred2 <- pred1
# for vars to not be imputed, set all predictors to 0
data_no_impute <- data1 %>%
select(contains(c("exp_", "outcome_"))) %>%
select(sort(names(.))) %>%
names
data_level3 <- data1 %>%
select(contains(c("f4", "f5", "f6")),
k22) %>%
select(sort(names(.))) %>%
names
pred2[data_no_impute,] <- 0
pred2[data_level3,] <- 0
#produce initial methods and visit sequence
initial <- mice(data1, max = 0, print = F, vis = "monotone",
defaultMethod = c("pmm", "logreg", "polyreg", "polr"))
#edit methods to be blank for vars I don't want to impute, "sample" for level 3
meth1 <- initial$meth
meth2 <- meth1
meth2[data_level3] <- "sample"
meth2[data_no_impute] <- ""
visits1 <- initial$visitSequence
visits2 <- visits1
visits2 <- append(visits2, data_level3,22)
#run mice test
mice_test <- mice(data1, m = 2, print = F,
predictorMatrix = pred2,
method = meth2,
vis = visits2,
nnet.MaxNWts = 3000)
#pull second completed dataset
imput1 <- mice::complete(mice_test, 2, include = F)
#look at missingness patterns
missingness_pattern2 <- md.pattern(imput1, plot = F)

Error: The data used by step_impute_linear() did not have any rows where the imputation values were all complete

I am using the recipe function and get an error when using the step_impute_linear() inside of the recipe function to impute NA's. Note that step_impute_median or step_impute_mean work without a problem. Also it does not matter if I use:
step_impute_linear(all_predictors()) or,
step_impute_linear(all_numeric(),.) etc.
None of the combinations work.
Also note that other methods like:
step_impute_knn(all_nominal(),impute_with = all_predictors(),-has_role("ID"))
fail too.
I also checked the data: not every row contains missing values, and neither does every column.
dt_rec <- recipe(
OFFER_STATUS~ ., data = dt_training) %>%
# 1. Define Role
update_role(MO_ID, new_role = "ID") %>%
update_role(SO_ID, new_role = "ID") %>%
# turn dates into decimals
step_mutate_at(where(is.Date), fn = decimal_date) %>%
# impute all numeric columns with their median
# 2. Impute
# step_impute_median(all_numeric(),-has_role("ID"))%>%
step_impute_linear(all_numeric(),impute_with = .,-has_role("ID")) %>%
# ignoring novel factors
# 3. Handle factor levels
step_novel(all_predictors(), -all_numeric()) %>%
# impute all other nominal (character + factor) columns with the value "none"
step_unknown(all_nominal(), new_level = "none") %>%
step_string2factor(all_nominal(), -all_outcomes(), -has_role("ID")) %>%
# remove constant columns
step_zv(all_predictors()) %>%
# 4. Discretize
# remove variables that have a high correlation with each other
# as this will lead to multicollinearity
step_corr(all_numeric(), threshold = 0.99) %>%
# normalization --> centering and scaling numeric variables
# mean = 0 and Sd = 1
step_normalize(all_numeric()) %>%
# 5. Dummy variables
# creating dummary variables for nominal predictors
step_dummy(all_nominal(), -all_outcomes()) %>%
# 6. Normalization
# 7. Multivariate transformation
step_pca(all_numeric_predictors())
dt_rec
dt_rec %>% summary()
When you use a function like step_impute_linear(), you are saying "impute the values for my variable with other variables". If some of those other variables also have missing data, the model is not going to be able to fit successfully. If you have a set of variables, say x, y, and z, that all have some missing data and that you want to impute using each other, I recommend that you:
- impute one or more of the variables (say x) with a method that only depends on that variable, like using the median or similar
- impute the other variables using only the predictors that are now complete with no missing data (say, impute y and z based on x)
It's not going to work out if you try to fit a whole set of linear models for a set of variables using each other, all of which have missing data.
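The two-step idea above can be sketched in plain base R, without recipes. This is only an illustration under stated assumptions: the data frame toy and its columns x and y are made up, and median() plus lm() stand in for step_impute_median() and step_impute_linear().

```r
# Toy data (hypothetical): x and y both contain missing values.
set.seed(1)
toy <- data.frame(x = rnorm(50))
toy$y <- 2 * toy$x + rnorm(50, sd = 0.1)
toy$x[c(3, 10)] <- NA
toy$y[c(5, 10, 20)] <- NA

# Step 1: impute x using only x itself (its median).
toy$x[is.na(toy$x)] <- median(toy$x, na.rm = TRUE)

# Step 2: x is now complete, so a linear model y ~ x can be fit on the
# rows where y is observed and used to predict the missing y values.
fit <- lm(y ~ x, data = toy)  # lm() drops rows with NA in y by default
miss_y <- is.na(toy$y)
toy$y[miss_y] <- predict(fit, newdata = toy[miss_y, , drop = FALSE])

# No missing values remain in either column.
colSums(is.na(toy))
```

Had step 2 been attempted first, row 10 (missing in both x and y) could not have been predicted, which is the situation the error message describes.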

Tidymodels. step_impute_linear(), can it be used when every column contains NAs

My data contain more than 100 columns, every one of which contains NAs, and when I try to use step_impute_linear() it returns a warning:
Warning message:
There were missing values in the predictor(s) used to impute;
imputation did not occur.
Can I somehow make it work?
I think you'll need to use at least two steps of imputation.
First you will need to choose some variables to impute with something very simple, like the median or mode. I would choose the variables with lower rates of missingness for this.
Next you can choose some variables to impute with linear models, using only complete variables (the ones you imputed first with, say, the median). I would choose variables with higher rates of missingness for this, I think.
Here is an example analysis where I took this approach:
bb_rec <-
recipe(is_home_run ~ launch_angle + launch_speed + plate_x + plate_z +
bb_type + bearing + pitch_mph +
is_pitcher_lefty + is_batter_lefty +
inning + balls + strikes + game_date,
data = bb_train
) %>%
step_date(game_date, features = c("week"), keep_original_cols = FALSE) %>%
step_unknown(all_nominal_predictors()) %>%
step_dummy(all_nominal_predictors(), one_hot = TRUE) %>%
step_impute_median(all_numeric_predictors(), -launch_angle, -launch_speed) %>%
step_impute_linear(launch_angle, launch_speed,
impute_with = imp_vars(plate_x, plate_z, pitch_mph)
) %>%
step_nzv(all_predictors())
If you want to try out different imputation strategies, I suggest setting up workflowsets and testing them on resampling folds.

Can I get unwtd.count included when running the svymean from the R Survey package?

I've written an R script to loop through a bunch of variables in a survey and output weighted values, CVs, CIs etc.
I would like it to also output the unweighted observations count.
I know it's a bit of a lazy question because I can calculate unweighted counts on my own and join them back in. I'm just trying to replicate a Stata script that would return 'obs':
svy:tab jdvariable, per cv ci obs column format(%14.4g)
This is my calculated values table:
myresult_year_calc <- svyby(make.formula(newmetricname), # variable to pass to function
by = ~year, # grouping
design = subset(csurvey, geoname %in% jv_geo), # design object with subset definition
vartype = c("ci","cvpct"), # report variation as ci, and cv percentage
na.rm.all=TRUE,
FUN = svymean # specify function from survey package
)
By passing unwtd.count as the function instead of svymean, I get the counts I want.
myresult_year_obs <- svyby(make.formula(newmetricname), # variable to pass to function
by = ~year, # grouping
design = subset(csurvey, geoname %in% jv_geo), # design object with subset definition
vartype = c("ci","cvpct"), # report variation as ci, and cv percentage
na.rm.all=TRUE,
unwtd.count
)
Honestly in writing this question I made it 98% through a solution, but I'll ask anyway in case someone knows a more efficient way.
myresult_year_calc and myresult_year_obs both return what I expect, and if I use merge(myresult_year_calc, myresult_year_obs, by = "year") I get the table I want. However, this gives me only one count per year in this example, instead of one count for 'Yes' responses and one count for 'No'.
Is there any way to get both means and unweighted counts with a single command?
I figured this out by creating a second design object where weights = ~0. When I ran svyby with svytotal as the function on this unweighted design, the counts followed the formula.
dsgn2 <- svydesign(ids = ~0,
weights = ~0,
data = data,
na.rm = T)
unweighted_n <- svyby(~interaction(group1,group2), ~as.factor(mean_rating), design = dsgn2, FUN = svytotal, na.rm = T)

randomForest Categorical Predictor Limits

I understand and appreciate that R's randomForest function can only handle categorical predictors with fewer than 54 categories. However, when I trim my categorical predictor down to fewer than 54 categories, I still get the error. The only questions I've seen about categorical predictor limits on Stack Overflow ask how to get around this category limit, but I'm trying to trim my number of categories to follow the function's limitations and I still get the error.
The following script creates a data frame so we can predict 'profession'. Understandably, I get the "Can not handle categorical predictors with more than 53 categories" error when trying to run randomForest() on 'df' due to the 'college_id' variable.
But when I trim my data set to only include the top 40 college IDs, I get the same error. Am I missing some basic data frame concept that retains all of the categories even though only 40 are now populated in the 'df2' data frame? What is a workaround option that I can use?
library(dplyr)
library(randomForest)
# create data frame
df <- data.frame(profession = sample(c("accountant", "lawyer", "dentist"), 10000, replace = TRUE),
zip = sample(c("32801", "32807", "32827", "32828"), 10000, replace = TRUE),
salary = sample(c(50000:150000), 10000, replace = TRUE),
college_id = as.factor(c(sample(c(1001:1040), 9200, replace = TRUE),
sample(c(1050:9999), 800, replace = TRUE))))
# results in error, as expected
rfm <- randomForest(profession ~ ., data = df)
# arrange college_ids by count and retain the top 40 in the 'df' data frame
sdf <- df %>%
dplyr::group_by(college_id) %>%
dplyr::summarise(n = n()) %>%
dplyr::arrange(desc(n))
sdf <- sdf[1:40, ]
df2 <- dplyr::inner_join(df, sdf, by = "college_id")
df2$n <- NULL
# confirm that df2 only contains 40 categories of 'college_id'
nrow(df2[which(!duplicated(df2$college_id)), ])
# THIS IS WHAT I WANT TO RUN, BUT STILL RESULTS IN ERROR
rfm2 <- randomForest(profession ~ ., data = df2)
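I think you still had all the factor levels in your variable. Subsetting or joining a data frame keeps a factor's unused levels, so college_id in df2 still carries every original level even though only 40 occur. Try adding this line before you fit the forest again: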
df2$college_id <- factor(df2$college_id)
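A minimal base R sketch of why the error persists, assuming a made-up factor with 100 levels in place of the real college_id column. It shows that filtering rows leaves the level set untouched, and that factor() or droplevels() rebuilds it from the values actually present.

```r
# Hypothetical factor with 100 levels, mimicking 'college_id'.
set.seed(42)
college_id <- factor(sample(1001:1100, 1000, replace = TRUE),
                     levels = 1001:1100)

# Keep only the observations belonging to the first 40 IDs.
kept <- college_id[college_id %in% as.character(1001:1040)]

nlevels(kept)         # still 100: subsetting keeps unused levels
length(unique(kept))  # at most 40 distinct values are actually present

# Either call rebuilds the factor with only the levels that occur.
kept_a <- factor(kept)
kept_b <- droplevels(kept)
nlevels(kept_a)
identical(levels(kept_a), levels(kept_b))
```

For a whole data frame, droplevels(df2) should do the same for every factor column at once.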
