Loop through various data subsets in lm() in R

I would like to run a series of regressions on different subsets of my data, but I can't work out how to reference the subsets correctly. For example:
dat <- data.frame(y = rnorm(10), x1 = rnorm(10), x2 = rnorm(10), x3 = rnorm(10) )
x.list <- list(dat$x1,dat$x2,dat$x3)
dat1 <- dat[-9,]
fit <- list()
for(i in 1:length(x.list)){ fit[[i]] <- summary(lm(y ~ x.list[[i]], data = dat))}
for(i in 1:length(x.list)){ fit[[i]] <- summary(lm(y ~ x.list[[i]], data = dat1))}
Is there a way to pass "dat1" so that the predictors are subset along with it? Thanks for any recommendations.

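I'm not sure it makes sense to copy your covariates into a new list like that. The vectors in x.list are taken from dat, so the data argument has no effect on them. Here's a way to loop over column names and build the formulas dynamically with reformulate(), so that lm() looks each variable up in whichever data frame you pass: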
dat <- data.frame(y = rnorm(10), x1 = rnorm(10), x2 = rnorm(10), x3 = rnorm(10) )
dat1 <- dat[-9,]
#x.list not used
fit <- list()
for(i in c("x1","x2","x3")){ fit[[i]] <- summary(lm(reformulate(i,"y"), data = dat))}
for(i in c("x1","x2","x3")){ fit[[i]] <- summary(lm(reformulate(i,"y"), data = dat1))}
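Both loops above write into the same `fit` list, so the results for `dat1` replace those for `dat`. A minimal sketch of one way to keep them apart, assuming you are happy to hold the data sets in a named list (the names `datasets`, `preds` and `fits` are illustrative and not from the question):

```r
set.seed(42)
dat <- data.frame(y = rnorm(10), x1 = rnorm(10), x2 = rnorm(10), x3 = rnorm(10))
dat1 <- dat[-9, ]

datasets <- list(dat = dat, dat1 = dat1)  # hypothetical container for the subsets
preds <- c("x1", "x2", "x3")

# outer lapply over data sets, inner sapply over predictor names
fits <- lapply(datasets, function(d) {
  sapply(preds,
         function(p) summary(lm(reformulate(p, "y"), data = d)),
         simplify = FALSE)
})

# e.g. the summary of y ~ x2 fitted on dat1
fits$dat1$x2
```

Each summary is then addressed by data set name and predictor name, so nothing is overwritten.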

How about this?
dat <- data.frame(y = rnorm(10), x1 = rnorm(10), x2 = rnorm(10), x3 = rnorm(10) )
dat1 <- dat[-9,]
mods <- lapply(list(y ~ x1, y ~ x2, y ~ x3), lm, data = dat1)
If you have lots of predictors, create the formulas something like this:
lapply(paste('y ~ ', 'x', 1:10, sep = ''), as.formula)
If your data were in long format, it would be just as simple: use lapply on a split data.frame.
dat <- data.frame(y = rnorm(30), x = rnorm(30), f = rep(1:3, each = 10))
lapply(split(dat, dat$f), function(d) lm(y ~ x, data = d))
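Whichever way the list of models is built, the coefficients can be gathered into one table afterwards. A small sketch under the same long-format assumption (`mods` and `coef_tbl` are names chosen here for illustration):

```r
set.seed(1)
dat <- data.frame(y = rnorm(30), x = rnorm(30), f = rep(1:3, each = 10))

# one model per level of f
mods <- lapply(split(dat, dat$f), function(d) lm(y ~ x, data = d))

# stack the coefficient vectors: one row per group, one column per term
coef_tbl <- do.call(rbind, lapply(mods, coef))
coef_tbl
```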

Sorry for being late, but have you tried applying a data.table solution similar to the one in:
R data.table loop subset by factor and do lm()
I applied the linked solution to a modified version of your data, which should illustrate how I understood your question:
library(data.table)
set.seed(1)
df <- data.frame(x1 = letters[1:3],
                 x2 = sample(c("a","b","c"), 30, replace = TRUE),
                 x3 = sample(c(20:50), 30, replace = TRUE),
                 y = sample(c(20:50), 30, replace = TRUE))
dt <- data.table(df, key = "x1")
# one model per level of x1, returned as a list
fits <- lapply(unique(dt$x1),
               function(z) lm(y ~ x2 + x3, data = dt[J(z), ], y = TRUE))
# a single model on the full data
fit <- dt[, lm(y ~ x2 + x3)]
# Using x1 as a "by" variable you get one row of coefficients per group
coef_tbl <- dt[, as.list(coef(lm(y ~ x2 + x3))), by = x1]
# coefficients from the list of fits
sapply(fits, coef)
# ANOVA table per group, with the term names added as a column
anova_tbl <- dt[, as.list(anova(lm(y ~ x2 + x3))), by = x1]
row_names <- dt[, row.names(anova(lm(y ~ x2 + x3))), by = x1]
anova_tbl[, variable := row_names$V1]
It extends your solution.

Related

Estimating separate regressions and generating predictions by group for different data sets

I want to use the same regression model for each group (id) using data "train" and use the estimated coefficients to create predicted values in data "test". I can do it using a loop, but how can I do it with lapply or something else?
train <- data.frame(x = 1:100, y = rnorm(100), id = rep(c('a', 'b'), 50))
test <- data.frame(x = 1:50, id = rep(c('a', 'b'), 25))
for (i in c('a', 'b')){
model <- lm(y ~ x, data = subset(train, id == i))
test$pred[test$id == i] <- predict(model, subset(test, id == i))
}
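If you would rather use lapply, you can split the test data by id and fit one model per piece. The test data is sorted first because unlist returns the predictions group by group:
test <- test[order(test$id),]
test$pred <- unlist(lapply(split(test, test$id), function(dt) {
  # fit on the training rows that belong to this group only
  model <- lm(y ~ x, data = subset(train, id == dt$id[1]))
  predict(model, dt)
}))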
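A variant that avoids re-sorting `test`: fit one model per group first, predict group by group with Map(), and let unsplit() put the predictions back in the original row order. This is a sketch in base R rather than the only way to do it; `models`, `pieces` and `preds` are names chosen here for illustration.

```r
set.seed(1)
train <- data.frame(x = 1:100, y = rnorm(100), id = rep(c("a", "b"), 50))
test <- data.frame(x = 1:50, id = rep(c("a", "b"), 25))

# one model per group, estimated on the training data
models <- lapply(split(train, train$id), function(d) lm(y ~ x, data = d))

# split the test data the same way and predict each piece with its own model
pieces <- split(test, test$id)
preds <- Map(function(m, d) predict(m, newdata = d),
             models[names(pieces)], pieces)

# unsplit() reverses split(), restoring the original row order of test
test$pred <- unsplit(preds, test$id)
head(test)
```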

R Imputation With MICE

set.seed(1)
library(data.table)
data <- data.table(STUDENT = 1:1000,
                   OUTCOME = sample(20:90, 1000, replace = TRUE),
                   X1 = runif(1000),
                   X2 = runif(1000),
                   X3 = runif(1000))
data[, X1 := fifelse(X1 > .9, NA_real_, X1)]
data[, X2 := fifelse(X2 > .78 & X2 < .9, NA_real_, X2)]
data[, X3 := fifelse(X3 < .1, NA_real_, X3)]
Say you have the data shown above and you wish to impute values for X1, X2 and X3 while leaving STUDENT and OUTCOME out of the imputation.
I can do
library(mice)
dataIMPUTE <- mice(data[, c("X1", "X2", "X3")], m = 1)
but how do I put the imputed values from dataIMPUTE back together with STUDENT and OUTCOME? I am afraid of merging them incorrectly, which is why I am asking for advice.
One possibility is to use the complete data set in the imputation, but change the predictorMatrix so that STUDENT and OUTCOME are not used in the imputation model.
First, run mice with maxit = 0 to obtain the default predictorMatrix without performing any imputation. Then set to 0 every column that should not be used as a predictor in the imputation model. All of your variables, including STUDENT and OUTCOME, are still contained in the resulting dataIMPUTE object:
set.seed(1)
library(data.table)
data <- data.table(STUDENT = 1:1000,
                   OUTCOME = sample(20:90, 1000, replace = TRUE),
                   X1 = runif(1000),
                   X2 = runif(1000),
                   X3 = runif(1000))
# set 100 randomly chosen values per variable to missing
index_1 <- sample(1:1000, 100)
index_2 <- sample(1:1000, 100)
index_3 <- sample(1:1000, 100)
data[index_1, X1 := NA_real_]
data[index_2, X2 := NA_real_]
data[index_3, X3 := NA_real_]
library(mice)
# dry run: maxit = 0 sets everything up without imputing
init <- mice(data, maxit = 0, printFlag = FALSE)
# extract the predictor matrix
pred_mat <- init$predictorMatrix
# remove STUDENT and OUTCOME as predictors
pred_mat[, c("STUDENT", "OUTCOME")] <- 0
# do the imputation
dataIMPUTE <- mice(data, predictorMatrix = pred_mat, m = 1)
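To get the imputed values back alongside STUDENT and OUTCOME, no merge should be needed: mice's complete() returns the original data with the missing cells filled in. A minimal sketch, assuming a plain data.frame rather than a data.table and a single imputation (m = 1); `imp` and `completed` are names chosen here.

```r
library(mice)

set.seed(1)
data <- data.frame(STUDENT = 1:1000,
                   OUTCOME = sample(20:90, 1000, replace = TRUE),
                   X1 = runif(1000),
                   X2 = runif(1000),
                   X3 = runif(1000))
# set 100 randomly chosen values per variable to missing
data$X1[sample(1000, 100)] <- NA
data$X2[sample(1000, 100)] <- NA
data$X3[sample(1000, 100)] <- NA

# dry run to get the default predictor matrix, then drop two predictors
init <- mice(data, maxit = 0, printFlag = FALSE)
pred_mat <- init$predictorMatrix
pred_mat[, c("STUDENT", "OUTCOME")] <- 0

imp <- mice(data, predictorMatrix = pred_mat, m = 1, printFlag = FALSE)

# first (and here only) completed data set, same rows and columns as `data`
completed <- mice::complete(imp, 1)
head(completed)
```

Because the rows are never separated from STUDENT and OUTCOME, there is no alignment step that could go wrong.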

Linear regression with ongoing data, in R

Model
y ~ x1 + x2 + x3
about 1000 rows
What I want to do is make the predictions step by step:
use rows 1:20 to predict y for rows 21:30, then use rows 11:30 to predict y for rows 31:40, and so on.
You can use the predict function:
mod = lm(y ~ ., data=df[1:990,])
pred = predict(mod, newdata=df[991:1000,2:4])
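Edit: to change the range of training data in a loop (this assumes y is the first column of df and the predictors are columns 2 to 4):
index = seq(10,990,10)
pred = matrix(nrow=10, ncol=length(index))
MSE = numeric(length(index))
for(i in index){
  # train on rows 1 to i, then predict the next 10 rows
  mod = lm(y ~ ., data=df[1:i,])
  pred[,i/10] = predict(mod, newdata=df[(i+1):(i+10),2:4])
  # mean squared error of this block of 10 predictions
  MSE[i/10] = mean((df$y[(i+1):(i+10)]-pred[,i/10])^2)
}
mean(MSE)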
Are you looking for something like this?
# set up mock data
set.seed(1)
df <- data.frame(y = rnorm(1000),
                 x1 = rnorm(1000),
                 x2 = rnorm(1000),
                 x3 = rnorm(1000))
# for loop
prd <- list()
for(i in 1:971){
  # training data: 20 rows starting at row i
  trn <- df[i:(i+19), ]
  # test data: the 10 rows that follow
  tst <- df[(i+20):(i+29), ]
  # lm model
  mdl <- lm(y ~ x1 + x2 + x3, trn)
  # append a data.frame with both predicted and actual values
  # for later comparison
  prd[[i]] <- data.frame(prd = predict(mdl, tst[-1]),
                         act = tst[[1]])
}
# your list
prd
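Once the predictions are collected in a list like `prd`, each window can be scored and the scores averaged. A minimal sketch on mock data, assuming mean squared error as the metric and, to keep it short, windows that advance 10 rows at a time as described in the question; `starts` and `mse` are illustrative names.

```r
set.seed(1)
df <- data.frame(y = rnorm(1000), x1 = rnorm(1000),
                 x2 = rnorm(1000), x3 = rnorm(1000))

n_trn <- 20  # rows used for fitting
n_tst <- 10  # rows predicted after each fit

# first row of every window: 1, 11, 21, ...
starts <- seq(1, nrow(df) - n_trn - n_tst + 1, by = n_tst)

prd <- lapply(starts, function(i) {
  trn <- df[i:(i + n_trn - 1), ]
  tst <- df[(i + n_trn):(i + n_trn + n_tst - 1), ]
  mdl <- lm(y ~ x1 + x2 + x3, data = trn)
  data.frame(prd = predict(mdl, newdata = tst), act = tst$y)
})

# mean squared error per window, then the average over all windows
mse <- sapply(prd, function(p) mean((p$act - p$prd)^2))
mean(mse)
```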
You can also try something fancier with the package slider:
# define here your model and how you want to handle the predictions
sliding_lm <- function(..., frm, n_trn, n_tst){
  # rebuild the current window as a data.frame from the column slices
  df <- data.frame(...)
  trn <- df[1:n_trn, ]
  tst <- df[n_trn+1:n_tst, ]
  mdl <- lm(frm, trn)
  data.frame(prd = predict(mdl, tst[-1]),
             act = tst[[1]])
}
n_trn <- 20 # number of training obs
n_tst <- 10 # number of test obs
frm <- y ~ x1 + x2 + x3 # formula of your model
prd <- slider::pslide(df, sliding_lm,
                      frm = frm,
                      n_trn = n_trn,
                      n_tst = n_tst,
                      .after = n_trn + n_tst,
                      .complete = TRUE)
Note that the last 30 entries in the list are NULL, because you look only at complete windows [30 observations with training and test]

Regression of variables in a dataframe

I have a dataframe:
df = data.frame(x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50), x4 = rnorm(50))
I would like to regress each variable versus all the other variables, for instance:
fit1 <- lm(x1 ~ ., data = df)
fit2 <- lm(x2 ~ ., data = df)
etc. (Of course, the real dataframe has a lot more variables).
I tried putting them in a loop, but it didn't work. I also tried using lapply but couldn't produce the desired result either. Does anyone know the trick?
You can use reformulate to dynamically build formulas:
df = data.frame(x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50), x4 = rnorm(50))
vars <- names(df)
result <- lapply(vars, function(resp) {
  lm(reformulate(".",resp), data=df)
})
Alternatively, you could use do.call to get "prettier" formulas in the call of each model:
vars <- names(df)
result <- lapply(vars, function(resp) {
  do.call("lm", list(reformulate(".",resp), data=quote(df)))
})
Each of these methods returns a list. You can extract individual models with result[[1]], result[[2]], etc.
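If you would rather look models up by response name than by position, the list can be named when it is built. A small sketch using setNames() from base R:

```r
set.seed(1)
df <- data.frame(x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50), x4 = rnorm(50))
vars <- names(df)

# fit one model per response and name each list element after its response
result <- setNames(
  lapply(vars, function(resp) lm(reformulate(".", resp), data = df)),
  vars
)

# the model with x4 as the response
result[["x4"]]
```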
Or you can try this:
df = data.frame(x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50), x4 = rnorm(50))
models = list()
for (i in (1: ncol(df))){
  formula = paste(colnames(df)[i], "~ .", sep="")
  models[[i]] = lm(formula, data = df)
}
This saves all models in a list.
To retrieve a stored model, for example the one with x4 as the response:
# retrieve model - replace modelName with the name of the required column
modelName = "x4"
out = models[[which( colnames(df)== modelName )]]
Output:
Call:
lm(formula = formula, data = df)
Coefficients:
(Intercept) x1 x2 x3
-0.17383 0.07602 -0.09759 -0.23920

Rename terms in fitted model object

I have a list of regression models which all have the same number of terms (that is, the same number of predictor variables). Substantively, it is correct that each model has different terms. But when it comes to putting them in a regression table, I want all the models to share a single formula, simply for the sake of presentation.
Some indicative data
library(plyr)
d1 <- data.frame(y = rnorm(100),
x1 = runif(100),
x2 = runif(100),
x3 = runif(100),
x4 = runif(100))
Fit the models
mods.form <- paste("y ~ x", 1:4, sep = "")
mod.list <- llply(mods.form, function(i) lm(i, d1))
Here are the terms I want to modify
llply(mod.list, function(i) attr(terms(i), "variables"))
[[1]]
list(y, x1)
[[2]]
list(y, x2)
[[3]]
list(y, x3)
[[4]]
list(y, x4)
I want every model in the list to have the same variable names as the first model, so I tried:
mod.list2 <- llply(mod.list, function(i) attr(terms(i), "variables") = list("y", "x1"))
which provides this error
Error in attr(terms(i), "variables") = list("y", "x1") :
could not find function "terms<-"
Is there a simple solution here?
Perhaps this is what you are looking for:
Using the dataframe that you provided
d1 <- data.frame(y = rnorm(100),
x1 = rnorm(100),
x2 = rnorm(100),
x3 = rnorm(100),
x4 = rnorm(100))
First, rename each x variable to some desired name "x"
names(d1) <- c("y", rep("x", times=length(d1)-1))
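Then, use lapply on d1 for each x variable, passing y as an argument to an anonymous function. The model is fitted on a small data.frame built from the current column, because d1 now has several columns called "x" and referring to d1 directly would always pick the first one:
mod.list <- lapply(d1[2:ncol(d1)], function(x,y){
  lm(y ~ x, data = data.frame(y = y, x = x))
}, y=d1[, 'y'])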
Finally, calling llply on the mod.list we get:
> llply(mod.list, function(x){
+ attr(terms(x), "variables")
+ })
$x
list(y, x)
$x.1
list(y, x)
$x.2
list(y, x)
$x.3
list(y, x)
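If the goal is only a regression table with a shared row label, another option is to leave the fitted models untouched and rename the rows of each coefficient table instead. This is a sketch of that alternative, not a way to edit the terms object itself; the label "x" and the names `mods` and `tabs` are assumptions made here.

```r
set.seed(1)
d1 <- data.frame(y = rnorm(100),
                 x1 = runif(100), x2 = runif(100),
                 x3 = runif(100), x4 = runif(100))

# fit y ~ x1, y ~ x2, ... on the original, uniquely named columns
mods <- lapply(paste0("x", 1:4),
               function(v) lm(reformulate(v, "y"), data = d1))

# coefficient tables with the slope row relabelled to a common name
tabs <- lapply(mods, function(m) {
  tab <- coef(summary(m))
  rownames(tab)[2] <- "x"
  tab
})

tabs[[3]]
```

The models keep their real variable names, so predict() and diagnostics still refer to the correct columns; only the presentation layer changes.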
