I have a list of regression models that all have the same number of terms (that is, the same number of predictor variables). Substantively, it is correct that each model has different terms. But when it comes to putting them in a regression table, I want all the models to share a single formula, simply for the sake of presentation.
Some indicative data
library(plyr)
d1 <- data.frame(y = rnorm(100),
x1 = runif(100),
x2 = runif(100),
x3 = runif(100),
x4 = runif(100))
Fit the models
mods.form <- paste("y ~ x", 1:4, sep = "")
mod.list <- llply(mods.form, function(i) lm(i, d1))
Here are the terms I want to modify
llply(mod.list, function(i) attr(terms(i), "variables"))
[[1]]
list(y, x1)
[[2]]
list(y, x2)
[[3]]
list(y, x3)
[[4]]
list(y, x4)
I want every model in the list to have the same variable names as the first model, so I tried:
mod.list2 <- llply(mod.list, function(i) attr(terms(i), "variables") = list("y", "x1"))
which provides this error
Error in attr(terms(i), "variables") = list("y", "x1") :
could not find function "terms<-"
Is there a simple solution here?
Perhaps this is what you are looking for:
Using a data frame like the one you provided
d1 <- data.frame(y = rnorm(100),
x1 = rnorm(100),
x2 = rnorm(100),
x3 = rnorm(100),
x4 = rnorm(100))
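First, rename each x variable to some common name "x" (subsetting the columns below makes the names unique again: x, x.1, x.2, x.3)
names(d1) <- c("y", rep("x", times=length(d1)-1))
Then, use lapply on d1 for each x column, passing y as an argument
to an anonymous function. The function has to use its own x and y arguments; calling lm("y ~ x", d1) inside it would fit every model on the first column named x.
mod.list <- lapply(d1[2:ncol(d1)], function(x,y){
lm(y ~ x)
}, y=d1[, 'y'])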
Finally, calling llply on the mod.list we get:
> llply(mod.list, function(x){
+ attr(terms(x), "variables")
+ })
$x
list(y, x)
$x.1
list(y, x)
$x.2
list(y, x)
$x.3
list(y, x)
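If the goal is only that every model reports the same term name in a table, another option is to refit each model on a small copy of the data in which the predictor has been renamed. This is a minimal sketch, not the original poster's method; the common name "x" and the helper variable preds are assumptions made for illustration.

```r
# Sketch: give every single-predictor model the same term name "x"
# by fitting on a two-column copy of the data.
set.seed(1)
d1 <- data.frame(y  = rnorm(100),
                 x1 = runif(100),
                 x2 = runif(100),
                 x3 = runif(100),
                 x4 = runif(100))

preds <- paste0("x", 1:4)  # hypothetical helper: the predictor columns

mod.list <- lapply(preds, function(p) {
  # copy the response and one predictor, renaming the predictor to "x"
  d <- data.frame(y = d1$y, x = d1[[p]])
  lm(y ~ x, data = d)
})
names(mod.list) <- preds   # keep track of which column each model used

# every model now has the variables y and x
lapply(mod.list, function(m) all.vars(formula(m)))
```

The list names still record which original column went into each model, so nothing is lost when the table shows a single shared term.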
I'm trying to run a regression with a constraint that forces all coefficients to be greater than or equal to zero. To do this, I am using the nls function. However, I am getting an error:
"Error in nls(formula = y ~ . - 1, data = X, start = low, lower = low, :
parameters without starting value in 'data': ."
I believe everything is correct here. I tried to set a lower and upper bound on all variables, so I am not sure what is wrong.
Attempt 1:
library(magrittr)
X <- data.frame(
x1 = seq(10),
x2 = seq(10),
x3 = seq(10),
x4 = seq(10),
x5 = seq(10),
y = seq(10)
)
low <- dplyr::select(X, -y) %>% names %>% lapply( function(e) 0)
up <- dplyr::select(X, -y) %>% names %>% lapply( function(e) Inf)
names(low) <- dplyr::select(X, -y) %>% names -> names(up)
fit1 <- nls(formula = y ~ . -1 , data = X,
start = low,
lower = low,
upper = up,
algorithm = "port"
)
Attempt 2:
Here I try to set the formula manually but then I get a new error:
"Error in qr(.swts * gr) :
dims [product 5] do not match the length of object [10]"
library(magrittr)
X <- data.frame(
x1 = seq(10),
x2 = seq(10),
x3 = seq(10),
x4 = seq(10),
x5 = seq(10),
y = seq(10)
)
n <- X %>% dplyr::select( -y ) %>% names %>% paste0( collapse = " + " )
f <- "y ~ %s -1" %>% sprintf( n ) %>% as.formula
low <- dplyr::select(X, -y) %>% names %>% lapply( function(e) 0)
up <- dplyr::select(X, -y) %>% names %>% lapply( function(e) Inf)
names(low) <- dplyr::select(X, -y) %>% names -> names(up)
fit1 <- nls(formula = f , data = X,
start = low,
lower = low,
upper = up,
algorithm = "port"
)
How can I fix this? Thanks!
1) There are several problems here:
nls does not use the same formula notation as lm: the parameters must be written out explicitly in the formula, so y ~ . - 1 will not work. This is fixed below.
The example does not have identifiable parameters (x1 to x5 are identical columns, so the coefficients are not unique), and the calculation will fail. Below we change the example.
Although starting values of 0 seem to work here, in general numeric optimization with constraints tends to work better if the starting values are in the interior of the feasible region.
Using the above we have
set.seed(123)
X <- data.frame(
x1 = rnorm(10),
x2 = rnorm(10),
x3 = rnorm(10),
x4 = rnorm(10),
x5 = rnorm(10),
y = rnorm(10)
)
fo <- y ~ b1 * x1 + b2 * x2 + b3 * x3 + b4 * x4 + b5 * x5
st <- c(b1 = 1, b2 = 1, b3 = 1, b4 = 1, b5 = 1)
nls(fo, X, start = st, lower = numeric(5), algorithm = "port")
giving:
Nonlinear regression model
model: y ~ b1 * x1 + b2 * x2 + b3 * x3 + b4 * x4 + b5 * x5
data: X
b1 b2 b3 b4 b5
0.0000 0.1222 0.0000 0.2338 0.1457
residual sum-of-squares: 6.477
Algorithm "port", convergence message: relative convergence (4)
2) The nnls (non-negative least squares) package can do this directly. We use X defined in (1).
library(nnls)
nnls(as.matrix(X[-6]), X$y)
giving the following
Nonnegative least squares model
x estimates: 0 0.1221646 0 0.2337857 0.1457373
residual sum-of-squares: 6.477
reason terminated: The solution has been computed sucessfully.
This is a partial answer: you can combine it with @G.Grothendieck's answer to answer your question about "what if you have too many variables to type out by hand".
As implied by the comment thread, the model you're trying to set up doesn't include an intercept by default. The easiest way to handle this is probably to add a column of 1s to your data frame (mydata <- data.frame(x0 = 1, mydata)).
## define variable names and parameter names
nx <- ncol(X)-1
vars <- names(X)[1:nx] ## assumes response is *last* column
pars <- gsub("x", "b", vars)
## construct formula
form <- reformulate(response = "y",
sprintf("%s*%s", pars, vars))
lwr <- setNames(rep(0, nx), pars)
upr <- setNames(rep(Inf, nx), pars)
start <- setNames(rep(1, nx), pars)
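For completeness, here is a sketch of how the pieces built above might be passed to nls. It assumes the random data X from G. Grothendieck's answer (same seed), and the object name fit is my own.

```r
# Sketch: programmatically built formula and bounds passed to nls().
set.seed(123)
X <- data.frame(x1 = rnorm(10), x2 = rnorm(10), x3 = rnorm(10),
                x4 = rnorm(10), x5 = rnorm(10), y = rnorm(10))

nx   <- ncol(X) - 1
vars <- names(X)[1:nx]        # assumes the response is the *last* column
pars <- gsub("x", "b", vars)  # b1, ..., b5

# y ~ b1 * x1 + b2 * x2 + ... with the parameters written out explicitly
form  <- reformulate(sprintf("%s*%s", pars, vars), response = "y")
lwr   <- setNames(rep(0, nx), pars)
upr   <- setNames(rep(Inf, nx), pars)
start <- setNames(rep(1, nx), pars)  # interior starting values

# the "port" algorithm is the one that supports lower/upper bounds
fit <- nls(form, data = X, start = start,
           lower = lwr, upper = upr, algorithm = "port")
coef(fit)
```

Starting at 1 rather than 0 keeps the initial point inside the feasible region, as recommended in (1).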
I'm fitting several linear models in R in the following way:
library(purrr) # map() and %>%
library(stargazer)
set.seed(12345)
n = 100
x1 = rnorm(n)
x2 = rnorm(n)+0.1
y = x1 + rnorm(n)
df <- data.frame(x1, x2, y)
x_str <- c("x1", "x1+x2")
regf_lm <- function(df,y_var, x_str ) {
frmla <- formula(paste0(y_var," ~ ", x_str ))
fit <- lm(frmla, data = df )
summary(fit) #fit
}
gbind_lm <- function(vv) {
n <- vv %>% length()
fits <- list()
coefs <- list()
ses <- list()
for (i in 1:n ) {
coefs[[i]] <- vv[[i]]$coefficients[,1]
ses[[i]] <- vv[[i]]$coefficients[,2]
fits[[i]] <- vv[[i]]
}
list("fits" = fits, "coefs" = coefs, "ses" = ses)
}
stargazer_lm <- function(mylist, fname, title_str,m_type = "html",...) {
stargazer(mylist$fits, coef = mylist$coefs,
se = mylist$ses,
type = m_type, title = title_str,
out = paste0("~/projects/outputs/",fname), single.row = T ,...)
}
p_2 <- map(x_str,
~ regf_lm (df = df ,
y_var = "y", x_str = .))
m_all <- do.call(c, list(p_2)) %>% gbind_lm()
stargazer_lm(m_all,"name.html","My model", m_type = "html")
In regf_lm, if I use summary(fit) on the last line, I'm able to generate regression output with columns for estimated coefficients, standard errors, etc. But stargazer() does not work with summary(lm()) objects (it returns the error "$ operator is invalid for atomic vectors"). However, if I just return fit on the last line of regf_lm, the output shows only the estimated coefficients and not the standard errors, R-squared, etc., and gbind_lm() won't work because I cannot extract ses or fits.
Any advice is greatly appreciated.
You can directly export model statistics in tidy format with the package broom:
library(broom)
library(dplyr) # for select()
library(stargazer)
set.seed(12345)
n = 100
x1 = rnorm(n)
x2 = rnorm(n)+0.1
y = x1 + rnorm(n)
df <- data.frame(x1, x2, y)
x_str <- c("x1", "x1+x2")
regf_lm <- function(df,y_var, x_str ) {
frmla <- formula(paste0(y_var," ~ ", x_str ))
fit <- lm(frmla, data = df )
return(list(fit,select(broom::tidy(fit),std.error))) #fit
}
exm_model <- regf_lm(iris,'Sepal.Width','Sepal.Length')
# the second element holds the standard errors, so pass it to se, not coef
stargazer(exm_model[[1]], se = list(exm_model[[2]]$std.error), title = 'x_model',
out ='abc', single.row = T)
This piece of code worked on my machine with no problem; I think you can apply it in your workflow.
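A related option that avoids returning summary(fit) at all: keep the lm objects themselves in a list and pull the standard errors out of summary() only when they are needed. This is a sketch under the question's own setup; the names fits, coefs and ses are chosen for illustration, and the stargazer call is left commented out because it is an untested assumption here.

```r
# Sketch: keep lm objects (which table packages understand) and
# extract coefficients / standard errors separately.
set.seed(12345)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n) + 0.1
y  <- x1 + rnorm(n)
df <- data.frame(x1, x2, y)
x_str <- c("x1", "x1+x2")

# one lm fit per right-hand side
fits <- lapply(x_str, function(rhs) {
  lm(as.formula(paste("y ~", rhs)), data = df)
})

coefs <- lapply(fits, coef)
# coef(summary(fit)) is the coefficient matrix; take its "Std. Error" column
ses <- lapply(fits, function(f) coef(summary(f))[, "Std. Error"])

# Assumption: with stargazer installed, something like this should work
# stargazer::stargazer(fits, se = ses, type = "text")
```

Because fits holds plain lm objects, the R-squared and other statistics remain available to whatever builds the table.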
Model
y ~ x1 + x2 + x3
about 1000 rows
What I want to do is a "step-by-step" prediction:
using rows 1:20 to predict y for rows 21:30, then using rows 11:30 to predict y for rows 31:40, and so on.
You can use the predict function:
mod = lm(y ~ ., data=df[1:990,])
pred = predict(mod, newdata=df[991:1000,2:4])
Edit: to change the range of training data in a loop:
index = seq(10,990,10)
pred = matrix(nrow=10, ncol=length(index))
MSE = numeric(length(index)) # initialise before filling it in the loop
for(i in index){
mod = lm(y ~ ., data=df[1:i,])
pred[,i/10] = predict(mod, newdata=df[(i+1):(i+10),2:4])
# mean squared error of the 10 predictions in this block
MSE[i/10] = mean((df$y[(i+1):(i+10)]-pred[,i/10])^2)
}
mean(MSE)
Are you looking for something like this?
# set up mock data
set.seed(1)
df <- data.frame(y = rnorm(1000),
x1 = rnorm(1000),
x2 = rnorm(1000),
x3 = rnorm(1000))
# for loop
prd <- list()
for(i in 1:971){
# training data: 20 rows
trn <- df[i:(i+19), ]
# test data: the following 10 rows
tst <- df[(i+20):(i+29), ]
# lm model
mdl <- lm(y ~ x1 + x2 + x3, trn)
# append a data.frame with both predicted and actual values
# for later comparison
prd[[i]] <- data.frame(prd = predict(mdl, tst[-1]),
act = tst[[1]])
}
# your list
prd
You can also try something fancier with the package slider:
# define your model here and how you want to handle the predictions
sliding_lm <- function(..., frm, n_trn, n_tst){
df <- data.frame(...)
trn <- df[1:n_trn, ]
tst <- df[n_trn+1:n_tst, ]
mdl <- lm(frm, trn)
data.frame(prd = predict(mdl, tst[-1]),
act = tst[[1]])
}
n_trn <- 20 # number of training obs
n_tst <- 10 # number of test obs
frm <- y ~ x1 + x2 + x3 # formula of your model
prd <- slider::pslide(df, sliding_lm,
frm = frm,
n_trn = n_trn,
n_tst = n_tst,
.after = n_trn + n_tst,
.complete = TRUE)
Note that the last 30 entries in the list are NULL, because you look only at complete windows [30 observations with training and test]
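The question describes windows that move forward 10 rows at a time (train on 1:20, predict 21:30, then train on 11:30, predict 31:40). A base R sketch of exactly that stepping, assuming mock data like the above; starts, res and mse are names I made up.

```r
# Sketch: rolling 20-row training window, 10-row test block, step of 10.
set.seed(1)
df <- data.frame(y  = rnorm(1000),
                 x1 = rnorm(1000),
                 x2 = rnorm(1000),
                 x3 = rnorm(1000))

n_trn <- 20  # training rows per window
n_tst <- 10  # test rows per window (also the step size)

# first row of each training window: 1, 11, 21, ...
starts <- seq(1, nrow(df) - n_trn - n_tst + 1, by = n_tst)

prd <- lapply(starts, function(s) {
  trn_rows <- s:(s + n_trn - 1)
  tst_rows <- (s + n_trn):(s + n_trn + n_tst - 1)
  mdl <- lm(y ~ x1 + x2 + x3, data = df[trn_rows, ])
  data.frame(row = tst_rows,
             prd = predict(mdl, newdata = df[tst_rows, ]),
             act = df$y[tst_rows])
})

res <- do.call(rbind, prd)           # one row per predicted observation
mse <- mean((res$act - res$prd)^2)   # overall out-of-sample MSE
```

Each test row is predicted exactly once, so res lines up with rows 21 onward of df.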
I have a dataframe:
df = data.frame(x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50), x4 = rnorm(50))
I would like to regress each variable versus all the other variables, for instance:
fit1 <- lm(x1 ~ ., data = df)
fit2 <- lm(x2 ~ ., data = df)
etc. (Of course, the real dataframe has a lot more variables).
I tried putting them in a loop, but it didn't work. I also tried using lapply but couldn't produce the desired result either. Does anyone know the trick?
You can use reformulate to dynamically build formulas:
df = data.frame(x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50), x4 = rnorm(50))
vars <- names(df)
result <- lapply(vars, function(resp) {
lm(reformulate(".",resp), data=df)
})
Alternatively, you could use do.call to get "prettier" formulas in each of the models:
vars <- names(df)
result <- lapply(vars, function(resp) {
do.call("lm", list(reformulate(".",resp), data=quote(df)))
})
Each of these methods returns a list. You can extract individual models with result[[1]], result[[2]], etc.
Or you can try this...
df = data.frame(x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50), x4 = rnorm(50))
models = list()
for (i in (1: ncol(df))){
formula = paste(colnames(df)[i], "~ .", sep="")
models[[i]] = lm(formula, data = df)
}
This will save all models as a list.
To retrieve a stored model,
e.g. the model with x4 as the response:
#retrieve model - replace modelName with the name of the required column
modelName = "x4"
out = models[[which( colnames(df)== modelName )]]
Output :
Call:
lm(formula = formula, data = df)
Coefficients:
(Intercept) x1 x2 x3
-0.17383 0.07602 -0.09759 -0.23920
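Looking a model up by position works, but naming the list by response variable makes retrieval more direct. A small sketch along the lines of the reformulate approach; the name models and the seed are my own choices.

```r
# Sketch: one model per column, stored in a list named by the response.
set.seed(42)
df <- data.frame(x1 = rnorm(50), x2 = rnorm(50),
                 x3 = rnorm(50), x4 = rnorm(50))

vars <- names(df)
models <- setNames(
  lapply(vars, function(resp) {
    # builds e.g. x4 ~ . so every other column becomes a predictor
    lm(reformulate(".", response = resp), data = df)
  }),
  vars
)

# the model with x4 as response, regressed on x1, x2 and x3
names(coef(models[["x4"]]))
```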
I would like to loop over various regressions referencing different data subsets, however I'm unable to appropriately call different subsets. For example:
dat <- data.frame(y = rnorm(10), x1 = rnorm(10), x2 = rnorm(10), x3 = rnorm(10) )
x.list <- list(dat$x1,dat$x2,dat$x3)
dat1 <- dat[-9,]
fit <- list()
for(i in 1:length(x.list)){ fit[[i]] <- summary(lm(y ~ x.list[[i]], data = dat))}
for(i in 1:length(x.list)){ fit[[i]] <- summary(lm(y ~ x.list[[i]], data = dat1))}
Is there a way to call in "dat1" such that it subsets the other variables accordingly? Thanks for any recs.
I'm not sure it makes sense to copy your covariates into a new list like that. Here's a way to loop over columns and to dynamically build formulas
dat <- data.frame(y = rnorm(10), x1 = rnorm(10), x2 = rnorm(10), x3 = rnorm(10) )
dat1 <- dat[-9,]
#x.list not used
fit <- list()
for(i in c("x1","x2","x3")){ fit[[i]] <- summary(lm(reformulate(i,"y"), data = dat))}
for(i in c("x1","x2","x3")){ fit[[i]] <- summary(lm(reformulate(i,"y"), data = dat1))}
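If you want to keep the results for both data sets rather than overwrite one with the other, you can loop over a named list of data frames as well as over the predictors. A sketch, with subsets, preds and n_used as illustrative names:

```r
# Sketch: fit y ~ predictor for every (data subset, predictor) pair.
set.seed(1)
dat <- data.frame(y = rnorm(10), x1 = rnorm(10),
                  x2 = rnorm(10), x3 = rnorm(10))

subsets <- list(full = dat, drop9 = dat[-9, ])  # named data subsets
preds   <- c("x1", "x2", "x3")

fits <- lapply(subsets, function(d) {
  # inner list is named by predictor, so fits$drop9$x2 is y ~ x2 on dat[-9, ]
  lapply(setNames(preds, preds), function(p) {
    lm(reformulate(p, response = "y"), data = d)
  })
})

# number of observations used by each fit (rows: predictor, cols: subset)
n_used <- sapply(fits, function(f) sapply(f, nobs))
n_used
```

Because each formula refers to columns by name, passing a different data frame subsets the response and the predictor together.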
How about this?
dat <- data.frame(y = rnorm(10), x1 = rnorm(10), x2 = rnorm(10), x3 = rnorm(10) )
dat1 <- dat[-9,]
mods <- lapply(list(y ~ x1, y ~ x2, y ~ x3), lm, data = dat1)
If you have lots of predictors, create the formulas something like this:
lapply(paste('y ~ ', 'x', 1:10, sep = ''), as.formula)
If your data were in long format, it would be similarly simple to do by using lapply on a split data.frame.
dat <- data.frame(y = rnorm(30), x = rnorm(30), f = rep(1:3, each = 10))
lapply(split(dat, dat$f), function(x) lm(y ~ x, data = x))
Sorry for being late, but have you tried applying a data.table solution similar to yours, as in:
R data.table loop subset by factor and do lm()
I have just applied the linked solution to an altered version of your data, which should illustrate how I understood your question:
library(data.table)
set.seed(1)
df <- data.frame(x1 = letters[1:3],
x2 = sample(c("a","b","c"), 30, replace = TRUE),
x3 = sample(c(20:50), 30, replace = TRUE),
y = sample(c(20:50), 30, replace = TRUE))
dt <- data.table(df,key="x1")
fits <- lapply(unique(dt$x1),
function(z)lm(y~x2+x3, data=dt[J(z),], y=T))
fit <- dt[, lm(y ~ x2 + x3)]
# Using x1 as a "by" variable you get a model per level of x1
coef_tbl <- dt[, as.list(coef(lm(y ~ x2 + x3))), by=x1]
# coefficients
sapply(fits,coef)
anova_tbl = dt[, as.list(anova(lm(y ~ x2 + x3))), by=x1]
row_names = dt[, row.names(anova(lm(y ~ x2 + x3))), by=x1]
anova_tbl[, variable := row_names$V1]
It extends your solution.