Using the survey package, I am having issues creating an imputationList that svydesign will accept. Here is a reproducible example:
library(tibble)
library(survey)
library(mitools)
# Data set 1
# Note that I am excluding the "income" variable from the "df"s and creating
# it separately so that it varies between the data sets. This simulates the
# variation with multiple imputation. Since I am using the same seed
# (i.e., 123), all the other variables will be the same, the only one that
# will vary will be "income."
set.seed(123)
df1 <- tibble(id = seq(1, 100, by = 1),
              gender = as.factor(rbinom(n = 100, size = 1, prob = 0.50)),
              working = as.factor(rbinom(n = 100, size = 1, prob = 0.40)),
              pweight = sample(50:500, 100, replace = TRUE))
# Data set 2
set.seed(123)
df2 <- tibble(id = seq(1, 100, by = 1),
              gender = as.factor(rbinom(n = 100, size = 1, prob = 0.50)),
              working = as.factor(rbinom(n = 100, size = 1, prob = 0.40)),
              pweight = sample(50:500, 100, replace = TRUE))
# Data set 3
set.seed(123)
df3 <- tibble(id = seq(1, 100, by = 1),
              gender = as.factor(rbinom(n = 100, size = 1, prob = 0.50)),
              working = as.factor(rbinom(n = 100, size = 1, prob = 0.40)),
              pweight = sample(50:500, 100, replace = TRUE))
# Create list of imputed data sets
impList <- imputationList(df1,
                          df2,
                          df3)
# Apply NHIS weights
weights <- svydesign(id = ~id,
                     weight = ~pweight,
                     data = impList)
I get the following error:
Error in eval(predvars, data, env) :
numeric 'envir' arg not of length one
To get it to work, I needed to directly add imputationList to svydesign as follows:
weights <- svydesign(id = ~id,
                     weight = ~pweight,
                     data = imputationList(list(df1,
                                                df2,
                                                df3)))
The step-by-step instructions available at http://asdfree.com/national-health-interview-survey-nhis.html walk through exactly how to create a multiply-imputed NHIS design, and the analysis examples below that include svyglm calls. Avoid using library(data.table) and library(dplyr) with library(survey).
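To show the shape of those analysis commands, here is a minimal sketch (my own addition, not code from the linked page) of running an estimate on each imputation of the weights design created above and pooling with MIcombine():
# Pooled proportions of 'working' across the imputed designs
MIcombine(with(weights, svymean(~working)))
# Pooled design-based regression across the imputations
summary(MIcombine(with(weights, svyglm(as.numeric(working == "1") ~ gender,
                                       family = quasibinomial()))))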
How do I make graphs like this in R?
Let's say I have a dataset like this:
library(tibble)
data <- tibble(date = sample(seq(as.Date("2006-01-01"),
                                 as.Date("2019-01-01"), by = "day"),
                             10000, replace = TRUE),
               treatment = sample(c(0, 1), 10000, replace = TRUE),
               after = ifelse(date > as.Date("2015-03-01"), 1, 0),
               score = rnorm(10000) + ifelse(treatment * after == 1, 0.2, 0)
)
and I am doing a difference-in-differences analysis:
did <- lm(score~treatment+after+treatment*after, data=data)
summary(did)
How can I make a plot with placebo tests?
Just use the plot_model function from sjPlot.
library(tibble)
library(ggplot2)   # ylim() comes from ggplot2
data <- tibble(date = sample(seq(as.Date("2006-01-01"),
                                 as.Date("2019-01-01"), by = "day"),
                             10000, replace = TRUE),
               treatment = sample(c(0, 1), 10000, replace = TRUE),
               after = ifelse(date > as.Date("2015-03-01"), 1, 0),
               score = rnorm(10000) + ifelse(treatment * after == 1, 0.2, 0)
)
did <- lm(score ~ treatment + after + treatment*after, data = data)
summary(did)
sjPlot::plot_model(did, vline = 'black', show.values = T) + ylim(-.25, .5)
vline is partially matched to vline.color and sets the color of the vertical "zero effect" reference line;
show.values controls whether the estimate values are printed on the plot.
You can check the details of the arguments to plot_model in its documentation.
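For reference, the same call can be written with the full argument name spelled out (vline is picked up through R's partial argument matching of vline.color):
# equivalent to the call above, with vline.color named explicitly
sjPlot::plot_model(did, vline.color = "black", show.values = TRUE) + ylim(-.25, .5)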
I have a mids object output from MICE that I need to run a series of stratified negative binomial models (glm.nb()) on. The stratification variable ("pressure") is itself imputed. The below code subsets the data within the glm.nb() function.
Is this the correct approach to run the stratified models?
library("mice") # For MI
library("MASS") # For glm.nb()
set.seed(09212020)
df <- data.frame("ethnicity" = sample(1:4, 50, replace = TRUE) ,
"education" = sample(1:4, 50, replace = TRUE) ,
"age" = sample(50:90, 50, replace = TRUE) ,
"pressure" = sample(0:12, 50, replace = TRUE),
"outcome_rate" = sample(0:6, 50, replace = TRUE),
"exposure_quart" = sample(1:4, 50, replace=TRUE,
prob=c(0.1, 0.2, 0.65, 0.05)))
# Insert NAs for MICE
df <- as.data.frame(lapply(df, function(cc) cc[ sample(c(TRUE, NA), prob = c(0.85, 0.15),
size = length(cc), replace = TRUE) ]))
# Run MICE
imp <- mice(data = df, m = 2, maxit = 1)
# Attempt to run stratified model on the mids object
# This is the portion I'm seeking to have confirmed/updated
m1 <- with(imp, (glm.nb(outcome_rate ~ exposure_quart + ethnicity + education + age + pressure,
subset = imp$imp$pressure < 9)))
m1
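For comparison, a hedged sketch of a more common idiom (my own suggestion, not a confirmed answer): imp$imp$pressure contains only the imputed cells for pressure, whereas with() evaluates the expression against each completed data set, so the subset is usually written in terms of the completed variable:
# Sketch: 'pressure' here refers to the completed variable supplied by with()
m1 <- with(imp, glm.nb(outcome_rate ~ exposure_quart + ethnicity + education + age + pressure,
                       subset = pressure < 9))
summary(pool(m1))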
I have code to loop a logistic regression over several selected dependent variables (called outcome1-4). I would like to only run the model if a condition on an independent variable is met. Let's say I want at least two females for each outcome and type combination.
Dummy data:
set.seed(5)
df <- data.frame(
  id = c(1:100),
  age = sample(20:80, 100, replace = TRUE),
  sex = sample(c("M", "F"), 100, replace = TRUE, prob = c(0.7, 0.3)),
  type = sample(letters[1:4], 100, replace = TRUE),
  outcome1 = sample(c(0L, 1L), 100, replace = TRUE, prob = c(0.68, 0.32)),
  outcome2 = sample(c(0L, 1L), 100, replace = TRUE, prob = c(0.65, 0.35)),
  outcome3 = sample(c(0L, 1L), 100, replace = TRUE, prob = c(0.60, 0.40)),
  outcome4 = sample(c(0L, 1L), 100, replace = TRUE, prob = c(0.45, 0.55)))
Code to loop the GLM (credit to https://stats.idre.ucla.edu/r/codefragments/looping_strings/):
outcomelist <- names(df)[5:8]
modelall <- lapply(outcomelist, function(x) {
glm(substitute(i ~ type + sex, list(i = as.name(x))), family = "binomial", data = df)})
I have found lots of questions about the loop itself, but none with an additional condition. I am thinking subset, but not being a pro with lapply yet, I don't know where to put it.
If it is not too much of an additional question, I would also like each model in the list to be named after its outcome variable (instead of 1 to 4), since it will otherwise be difficult to keep track of the models once the condition is added.
Appreciate any help!
One possibility is to clean the data prior to running lapply():
df.new <- df
for(ii in 1:length(outcomelist)){
  temp <- outcomelist[ii]
  # check the condition for outcome variable ii
  condition <- any(aggregate(df$sex=="F", by=list(df$type, df[,temp]), FUN="sum")$x < 2)
  if(condition){
    # some type/outcome cell has fewer than two females,
    # so remove the variable from df.new and outcomelist
    df.new[,temp] <- NULL
    outcomelist[ii] <- NA
  }
}
# lose irrelevant outcomes
outcomelist <- na.omit(outcomelist)
modelall <- lapply(outcomelist, function(x) {
  glm(substitute(i ~ type + sex, list(i = as.name(x))), family = "binomial", data = df.new)})
# name the list
names(modelall) <- outcomelist
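A more compact variant (my own sketch, not part of the answer above) computes the keep/drop decision in one pass and uses it both to subset the loop and to name the results:
outcomelist <- names(df)[5:8]   # start again from the full set of outcomes
# keep an outcome only if every observed type/outcome combination has at least two females
keep <- vapply(outcomelist, function(x) {
  f.per.cell <- aggregate(df$sex == "F", by = list(df$type, df[[x]]), FUN = sum)$x
  all(f.per.cell >= 2)
}, logical(1))
modelall <- lapply(outcomelist[keep], function(x) {
  glm(substitute(i ~ type + sex, list(i = as.name(x))), family = "binomial", data = df)
})
names(modelall) <- outcomelist[keep]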
I am struggling with data manipulation in R. My dataset consists of the variables type (5 factors), intensity (3 factors), and damage (continuous). I want to calculate the mean damage (damage1, damage2 and damage3 separately) with respect to intensity and type. In other words, I want to summarize the average damage by type and intensity. I have created this small reproducible example of my data:
type <- sample(seq(from = 1, to = 5, by = 1), size = 50, replace = TRUE)
intensity <- sample(seq(from = 1, to = 3, by = 1), size = 50, replace = TRUE)
damage1 <- sample(seq(from = 1, to = 50, by = 1), size = 50, replace = TRUE)
damage2 <- sample(seq(from = 1, to = 200, by = 1), size = 50, replace = TRUE)
damage3 <- sample(seq(from = 1, to = 500, by = 1), size = 50, replace = TRUE)
dat <- cbind(type, intensity, damage1, damage2, damage3)
Then, to manipulate the data, I have used the pipe operator %>%, but my commands do not seem to work very well:
dat <- as.data.frame(dat)
dat %>%
  filter(type == 1) %>%
  group_by(intensity, damage) %>%
  summarise(mean_damage = mean(Value))
I have read about multiple useful functions here:
efficient reshaping using data tables
manipulating data tables
Do Faster Data Manipulation using These 7 R Packages
But I wasn't able to make any progress here. My questions are:
What is wrong with my code?
Am I even going in the right direction here?
Is there some alternative how to do this?
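Since neither damage nor Value is a column of dat, the pipe above cannot work as written. A minimal sketch of the intended summary (my own suggestion, assuming dplyr >= 1.0 for across()): group by type and intensity and average the three damage columns instead of putting them in group_by().
library(dplyr)
dat <- as.data.frame(dat)        # cbind() returned a matrix, so convert first
dat %>%
  group_by(type, intensity) %>%
  summarise(across(starts_with("damage"), mean), .groups = "drop")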
In some classification tasks, using the mlr package, I need to deal with a data.frame similar to this one:
set.seed(pi)
# Dummy data frame
df <- data.frame(
  # Repeated values ID
  ID = sort(sample(c(0:20), 100, replace = TRUE)),
  # Some variables
  X1 = runif(10, 1, 10),
  # Some Label
  Label = sample(c(0,1), 100, replace = TRUE)
)
df
I need to cross-validate the model while keeping observations with the same ID together. I know from the tutorial that:
https://mlr-org.github.io/mlr-tutorial/release/html/task/index.html#further-settings
We could include a blocking factor in the task. This would indicate that some observations "belong together" and should not be separated when splitting the data into training and test sets for resampling.
The question is: how can I include this blocking factor in makeClassifTask?
Unfortunately, I couldn't find any example.
What version of mlr do you have? Blocking has been part of it for a while now. You can find it directly as an argument in makeClassifTask.
Here is an example for your data:
library(mlr)
df$ID = as.factor(df$ID)
df2 = df
df2$ID = NULL
df2$Label = as.factor(df$Label)
tsk = makeClassifTask(data = df2, target = "Label", blocking = df$ID)
res = resample("classif.rpart", tsk, resampling = cv10)
# check that the blocking worked
lapply(1:10, function(i) {
  blocks.training = df$ID[res$pred$instance$train.inds[[i]]]
  blocks.testing = df$ID[res$pred$instance$test.inds[[i]]]
  intersect(blocks.testing, blocks.training)
})
# all intersections are empty, so blocking indeed works!
The answer by @jakob-r no longer works; my guess is that something changed with cv10.
The minor edit is to use blocking.cv = TRUE in makeResampleDesc().
Complete working example:
library(mlr)
set.seed(pi)
# Dummy data frame
df <- data.frame(
  # Repeated values ID
  ID = sort(sample(c(0:20), 100, replace = TRUE)),
  # Some variables
  X1 = runif(10, 1, 10),
  # Some Label
  Label = sample(c(0,1), 100, replace = TRUE)
)
df
df$ID = as.factor(df$ID)
df2 = df
df2$ID = NULL
df2$Label = as.factor(df$Label)
resDesc <- makeResampleDesc("CV",iters=10,blocking.cv = TRUE)
tsk = makeClassifTask(data = df2, target = "Label", blocking = df$ID)
res = resample("classif.rpart", tsk, resampling = resDesc)
# check that the blocking worked
lapply(1:10, function(i) {
  blocks.training = df$ID[res$pred$instance$train.inds[[i]]]
  blocks.testing = df$ID[res$pred$instance$test.inds[[i]]]
  intersect(blocks.testing, blocks.training)
})