I'm working on a school project where I need to impute missing data, and after the imputation with mice I'm trying to produce completed data sets with the complete() function.
When I run them one by one everything works fine, but I'd like to use a for loop in case I want more than just m = 5 imputations. However, when I run the for loop, I always get the error
Error in complete(imputation[1]) : Input data must have class 'mids'.
However when I look up the class it is mids, what's going wrong here?
This is my code:
imputation <- mice(data = data, m = 5, method = "norm", maxit = 1, seed = 500)
m <- 5
for (i in 1:m) {
  completeData[m] <- complete(imputation[m])
  print(summary(completeData[m]))
}
Could someone maybe help me out here?
You are getting the error because imputation[1] is not of class mids; single-bracket subsetting turns the mids object into a plain list:
imputation[1]
# $call
# mice(data = walking, m = 5, maxit = 0, seed = 500)
class(imputation[1])
# [1] "list"
From the manual for ?complete:
Usage
complete(x, action = 1, include = FALSE)
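In other words, the number of the completed data set is passed through the action argument, not by subsetting the mids object. For example (a quick sketch against the imputation object from the question, using only documented complete() options):
completeData1 <- complete(imputation, 1)      # first completed data set
completeLong  <- complete(imputation, "long") # all m completed sets stacked in long format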
library(mice)
# dummy data imputation
data(walking)
imputation <- mice(walking, maxit = 0, m = 5, seed = 500)
# using for loop
m <- 5
for (i in 1:m) {
  completeData <- complete(imputation, i) # use i, not m, to get the i-th completed set
  print(summary(completeData))
}
# I prefer to use lapply
lapply(seq(imputation$m), function(i) summary(complete(imputation, i)))
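If you want to keep the completed data sets around rather than just print their summaries, a list is the natural container (a small sketch along the same lines):
completed <- lapply(seq(imputation$m), function(i) complete(imputation, i))
summaries <- lapply(completed, summary)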
I am writing a simple for loop in R as follows; however, only the last value is stored in err_cv.
code:
err_cv <- numeric(3)
k <- as.numeric(list(2, 5, 10))
for (i in length(k)){
cv.folds <- cvFolds(30, k[i])
cv.out <- cvFit(lm, formula = y~., data = data, folds = cv.folds, cost = rmspe)
err_cv[i] <- cv.out$cv
}
Result:
> err_cv
[1] 0.000000 0.000000 1.126309
What did I do wrong? I am just starting to learn R, coming from Python. Thanks!
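The likely culprit is the loop header: length(k) is the single number 3, so for (i in length(k)) runs exactly one iteration with i = 3, which is why only err_cv[3] gets filled in. Looping over the indices fixes it; a sketch, keeping the cvFolds()/cvFit() calls from the snippet above (cvTools):
err_cv <- numeric(3)
k <- c(2, 5, 10)          # c() is the idiomatic way to build this vector
for (i in seq_along(k)) { # iterates over 1, 2, 3 rather than just 3
  cv.folds <- cvFolds(30, k[i])
  cv.out <- cvFit(lm, formula = y ~ ., data = data, folds = cv.folds, cost = rmspe)
  err_cv[i] <- cv.out$cv
}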
I have been trying to make the output of a wfe model tidy so I can easily incorporate it into ggplot etc. This is a problem I've had with other packages and statistical models that are not covered by broom.
So let's say I create a dataset like this (taken from wfe's help file):
library(wfe)
## generate panel data with number of units = N, number of time = Time
N <- 10 # number of distinct units
Time <- 15 # number of distinct time
## treatment effect
beta <- 1
## generate treatment variable
treat <- matrix(rbinom(N*Time, size = 1, 0.25), ncol = N)
## make sure at least one observation is treated for each unit
while ((sum(apply(treat, 2, mean) == 0) > 0) | (sum(apply(treat, 2, mean) == 1) > 0) |
(sum(apply(treat, 1, mean) == 0) > 0) | (sum(apply(treat, 1, mean) == 1) > 0)) {
treat <- matrix(rbinom(N*Time, size = 1, 0.25), ncol = N)
}
treat.vec <- c(treat)
## unit fixed effects
alphai <- rnorm(N, mean = apply(treat, 2, mean))
## generate two random covariates
x1 <- matrix(rnorm(N*Time, 0.5,1), ncol=N)
x2 <- matrix(rbeta(N*Time, 5,1), ncol=N)
x1.vec <- c(x1)
x2.vec <- c(x2)
## generate outcome variable
y <- matrix(NA, ncol = N, nrow = Time)
for (i in 1:N) {
y[, i] <- alphai[i] + treat[, i] + x1[,i] + x2[,i] + rnorm(Time)
}
y.vec <- c(y)
## generate unit and time index
unit.index <- rep(1:N, each = Time)
time.index <- rep(1:Time, N)
Data.obs <- as.data.frame(cbind(y.vec, treat.vec, unit.index, time.index, x1.vec, x2.vec))
colnames(Data.obs) <- c("y", "tr", "unit", "time", "x1", "x2")
Now I run a model from the function wfe (again, code from the package's help file):
mod.did <- wfe(y~ tr+x1+x2, data = Data.obs, treat = "tr",
unit.index = "unit", time.index = "time", method = "unit",
qoi = "ate", estimator ="did", hetero.se=TRUE, auto.se=TRUE,
White = TRUE, White.alpha = 0.05, verbose = TRUE)
## summarize the results
summary(mod.did)
My question is how to turn this output into a tidy object I could plot.
If I call tidy(mod.did) I get the following error:
Error: No tidy method for objects of class wfedid
Which I understand, but I am unsure as to how to solve. I tried mapping the individual parameters (coefficient, se, etc.) into a new list object but that did not work, so I hope that someone here knows of a more systematic way of doing this.
In case it helps, here's a dput of the output: https://pastebin.com/HTkKEUUQ
Thanks!
Here's a start at a tidy method:
library(dplyr); library(tibble)
tidy.wfedid <- function(x, conf.int=FALSE, conf.level=0.95, ...) {
cc <- (coef(summary(x))
%>% as.data.frame()
%>% setNames(c("estimate","std.error","statistic","p.value"))
%>% tibble::rownames_to_column("term")
%>% as_tibble()
)
return(cc)
}
Note that (1) I haven't implemented the confidence-interval piece; you could do this by using mutate to add columns conf.low and conf.high equal to estimate ± std.error*qnorm((1+conf.level)/2). (2) This gives the standard "tidy" method, which returns a coefficient table. If you want predictions and confidence intervals on predictions, you will need to write an augment method ...
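For what it's worth, a minimal sketch of that confidence-interval piece, assuming normal-theory (Wald) intervals are acceptable for wfe estimates:
tidy.wfedid <- function(x, conf.int = FALSE, conf.level = 0.95, ...) {
  cc <- (coef(summary(x))
    %>% as.data.frame()
    %>% setNames(c("estimate", "std.error", "statistic", "p.value"))
    %>% tibble::rownames_to_column("term")
    %>% as_tibble()
  )
  if (conf.int) {
    q <- qnorm((1 + conf.level) / 2)  # 1.96 for conf.level = 0.95
    cc <- cc %>% mutate(conf.low  = estimate - q * std.error,
                        conf.high = estimate + q * std.error)
  }
  cc
}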
I have a df as follows:
t r
1 0 100.00000
2 1 135.86780
3 2 149.97868
4 3 133.77316
5 4 97.08129
6 5 62.15988
7 6 50.19177
and so on...
I want to apply a rolling regression using lm(r~t).
However, I want to estimate one model per iteration over a rolling window of fixed length k. For example, with k = 5 the first model should be estimated on t=0, t=1, ..., t=5, the second on t=1, t=2, ..., t=6, and so on.
In other words, the procedure starts at some point, fits lm over the window from t to t+k, then slides the window forward one step and repeats.
I have tried using lapply like this:
mdls = lapply(df, function(x) lm(r[x,]~t))
However, I got the following error:
Error in r[x, ] : incorrect number of dimensions
If I remove the [x,], each iteration gives me the same model, i.e. one fit using all the observations.
If I use rollapply:
coefs = rollapply(df, 3, FUN = function(x) coef(lm(r~t, data = as.data.frame(x))),
                  by.column = FALSE, align = "right")
res = rollapply(df, 3, FUN = function(z) residuals(lm(r~t, data = as.data.frame(z))),
                by.column = FALSE, align = "right")
Where:
t = seq(0,15,1)
r = (100+50*sin(0.8*t))
df = as.data.frame(t,r)
I get 15 models, but they are all estimated over the entire dataset, giving the same intercepts and coefficients. This is strange, as I managed to make rollapply work just before testing it in a new script. For some reason it does not work now, so I am perplexed as to whether R is playing tricks on me or whether there is something wrong with my code.
How can I adjust these methods to make sure they iterate according to my wishes?
Here is a possible solution. The idea is to pass the vector 1:nrow(df) to rollapply and use it to index the rows we want to select.
df = data.frame(t = 0:6, r = c(100.00000, 135.86780, 149.97868, 133.77316, 97.08129, 62.15988, 50.19177))
N = nrow(df)
require(zoo)
# Coefficients
coefs <- rollapply(data = 1:N, width = 3, FUN = function(x){
r = df$r[x]
t = df$t[x]
out <- coef(lm(r~t))
return(out)
})
# Residuals
res <- rollapply(data = 1:N, width = 3, FUN = function(x){
r = df$r[x]
t = df$t[x]
out <- residuals(lm(r~t))
return(out)
})
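A quick sanity check on the shapes: with the 7-row example and a window width of 3 there are N - 2 = 5 windows, so
dim(coefs) # 5 2: one (Intercept, t) pair per window
dim(res)   # 5 3: one residual per observation in each window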
I have a data set with many missing observations and I used the Amelia package to create imputed data sets. I'd like to know if it's possible to run the same model in parallel with a different data set per chain and combine the results into a single Stan object.
# Load packages
library(Amelia)
library(rstan)
# Load built-in data
data(freetrade)
# Create 2 imputed data sets (polity is an ordinal variable)
df.imp <- amelia(freetrade, m = 2, ords = "polity")
# Check the first data set
head(df.imp$imputations[[1]])
# Run the model in Stan
code <- '
data {
int<lower=0> N;
vector[N] tariff;
vector[N] polity;
}
parameters {
real b0;
real b1;
real<lower=0> sigma;
}
model {
b0 ~ normal(0,100);
b1 ~ normal(0,100);
tariff ~ normal(b0 + b1 * polity, sigma);
}
'
# Create a list from the first and second data sets
df1 <- list(N = nrow(df.imp$imputations[[1]]),
tariff = df.imp$imputations[[1]]$tariff,
polity = df.imp$imputations[[1]]$polity)
df2 <- list(N = nrow(df.imp$imputations[[2]]),
tariff = df.imp$imputations[[2]]$tariff,
polity = df.imp$imputations[[2]]$polity)
# Run the model
m1 <- stan(model_code = code, data = df1, chains = 1, iter = 1000)
My question is how to run the last line of code on both data sets at the same time, running 2 chains and combining the output with the same stan() function. Any suggestions?
You can run the models separately, and then combine them using sflist2stanfit().
E.g.
seed <- 12345
s1 <- stan_model(model_code = code) # compile the model
m1 <- sampling(object = s1, data = df1, chains = 1,
seed = seed, chain_id = 1, iter = 1000)
m2 <- sampling(object = s1, data = df2, chains = 1,
seed = seed, chain_id = 2, iter = 1000)
f12 <- sflist2stanfit(list(m1, m2))
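The combined object then behaves like an ordinary two-chain fit, so the usual downstream tools work on it, e.g.:
print(f12)                # summary across both chains
posterior <- extract(f12) # pooled posterior draws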
You will have to use one of R's parallel computing packages.
According to this post, it should then work:
Will RStan run on a supercomputer?
Here is an example that may work (I use this code with JAGS, will test it with Stan later):
library( doParallel )
cl <- makeCluster( 2 ) # for 2 processes
registerDoParallel( cl )
library(rstan)
# make a function to combine the results
stan.combine <- function(...) { return( sflist2stanfit( list(...) ) ) }
mydatalist <- list(df1 , df2)
myseeds <- c(123, 456)
# now start the chains
nchains <- 2
m_both <- foreach(i=1:nchains ,
.packages = c( 'rstan' ),
.combine = "stan.combine") %dopar% {
result <- stan(model_code = code,
data = mydatalist[[i]], # use the right dataset
seed=myseeds[i], # use different seeds
chains = 1, iter = 1000)
return(result) }
Let me know whether it works with Stan. As I said, I haven't tested it yet.
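One housekeeping detail: once the chains have finished, release the worker processes (stopCluster() comes from the parallel package that doParallel loads):
stopCluster(cl)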
I'm trying to apply the solution I found here to generate machine learning models:
Best way to name objects programmatically using R?
Here's a dummy data set:
data_pred <- data.frame(x1 = 1:10, x2 = 11:20, x3 = 21:30)
data_resp <- data.frame(y1 = c(1:5, NA, 7:10), y2 = c(NA, 2, NA, 4:10))
Here was my for() loop method of modeling the predictors in data_pred on each individual column of measured responses in data_resp using the caret package:
# data_pred contains predictors
# data_resp contains one column per measurement
# 1 matching row per observation in both data_pred and data_resp
for (i in 1:ncol(data_resp)) {
  train(x = data_pred[!is.na(data_resp[, i]), ],
        y = data_resp[!is.na(data_resp[, i]), i],
        ... )
}
Now I'm trying to do the same with lapply, which I think has numerous advantages. I'm having an issue with translating the !is.na() criteria on the fly so that I'm only modeling with non-NA cases for each response. Here was my initial function to test the lapply method:
rf_func <- function(y) {
train(x = data_pred,
y = y,
method = "rf",
tuneGrid = data.frame(.mtry = 3:6),
nodesize = 3,
ntrees = 500,
trControl = trControl) }
Then create an empty list to store results and apply the function to data_resp:
models <- list(NULL)
models$rf <- lapply(as.list(data_resp), rf_func)
That works fine since randomForest can handle NAs, but other methods cannot, so I need to remove those rows from each data_resp element as well as the corresponding rows from my predictors.
I tried this without success:
train(x = data_pred_scale[!is.na(y), ],
      y = y[!is.na(y)],
      ... )
I also tried y[[!is.na(y)]]
How do I translate the data.frame method (df[!is.na(df2), ]) to lapply?
There are several different ways to go about it. A simple approach is an anonymous function:
lapply(data_resp, function(x) rf_func(x[!is.na(x)]))
In fiddling around quite a bit with a single element of my as.list(data_frame) to simulate what lapply would be passing, I came up with this, which I think is working:
rf_func <- function(y) {
train(x = data_pred_scale[!(unlist(lapply(y, is.na))), ],
y = y[!(unlist(lapply(y, is.na)))],
method = "rf",
tuneGrid = data.frame(.mtry = 3:6),
nodesize = 3,
ntrees = 500,
trControl = trControl) }
models$rf <- lapply(as.list(data_resp), rf_func)
It does seem to be working. I [hackishly] compared the non-NA data set to the trainingData results in each caret model like so:
nas <- NULL
for(i in 1:ncol(data_resp)) {nas <- c(nas, length(data_resp[!is.na(data_resp[, i]), i]))}
model_nas <- NULL
for(i in 1:length(nas)) {model_nas <- c(model_nas, nrow(models$rf[[i]]$trainingData))}
identical(nas, model_nas)
[1] TRUE
So, is y[!unlist(lapply(y, is.na))] the best/most elegant way to do this sort of thing? It's pretty ugly...
Edit: Based on @Ricardo Saporta's answer, I was able to come up with this (probably obvious to the veterans, but bear with me):
rf_func <- function(x, y) {
train(x = x,
y = y,
method = "rf",
tuneGrid = data.frame(.mtry = 3:6),
nodesize = 3,
ntrees = 500,
trControl = trControl) }
models$rf <- lapply(data_resp, function(y) {
  rf_func(data_pred_scale[!is.na(y), ], y[!is.na(y)])
})
Is there still a better way, or is that fairly decent? (Certainly prettier than my first mess-up above.)
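One last polish, assuming nothing beyond the code above: compute the !is.na(y) index once inside the anonymous function so the subsetting logic isn't repeated:
models$rf <- lapply(data_resp, function(y) {
  ok <- !is.na(y) # rows with an observed response
  rf_func(data_pred_scale[ok, ], y[ok])
})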