library("partykit")
library(data.table) ## optional, but what I'll use to coerce the list into a DT
library(future.apply) ## for parallel stuff
library(glmnet)
plan(multisession) ## use all available cores
#sample size
N <- 200
#coefficients
betas <- list()
betas$b0 <- 1
betas$b1_up <- 2.4
betas$b1_down <- 2
betas$b2_up <- 2.4
betas$b2_down <- 2
ols_formula <- y ~ a | x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10
# lmtree-style fit function for the mob tree
# "0 +" ---> suppress the 'double' intercept
ols <- function(y, a, start = NULL, weights = NULL, offset = NULL, ...){
  lm(y ~ 0 + a)
}
reg_simulation_mob <- function(...){
  # data
  data <- data.frame(
    a   = rbinom(N, 1, 0.5),
    x1  = rnorm(N),
    x2  = rnorm(N),
    x3  = rnorm(N),
    x4  = rnorm(N),
    x5  = rnorm(N),
    x6  = rnorm(N),
    x7  = rnorm(N),
    x8  = rnorm(N),
    x9  = rnorm(N),
    x10 = rnorm(N),
    e   = rnorm(N))
  # dependent variable
  data$y <- betas$b0 + with(data, ifelse(a > 0,
                                         betas$b1_up * x1 + betas$b2_up * x2,
                                         betas$b1_down * x1 + betas$b2_down * x2) + e)
  # Estimate mob()-OLS
  ols_mob <- lm(ols_formula, data = data, fit = ols)
  # return(ols$coefficients)
  return(ols_mob)
}
# N repetitions
nreps <- 4
## Parallel version
results <- future_lapply(1:nreps, reg_simulation_mob, future.seed = TRUE)  # TRUE: parallel-safe RNG
results
The trees are not splitting. Moreover, I intend to regress y on a treatment a, with x1...x10 as covariates. The result I obtained is given below:
Call:
lm(formula = ols_formula, data = data, fit = ols)
Coefficients:
(Intercept)
0.8658
a | x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10TRUE
NA
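For reference: lm() evaluates the | in ols_formula as a logical OR, which is why the output above shows a single ...TRUE term rather than a tree. A minimal sketch (not the poster's original code) of how the same pieces could be handed to partykit so the part after | is actually used for partitioning:
# illustrative only: fit the node model y ~ 0 + a via partykit::mob(),
# with x1..x10 as partitioning variables (assumes ols_formula, ols and
# data from the question)
ols_mob <- mob(ols_formula, data = data, fit = ols)
# or, without a custom fit function:
# ols_mob <- lmtree(y ~ 0 + a | x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10, data = data)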
I'm trying to run a regression with a constraint that all coefficients be greater than zero. To do this, I am using the nls function. However, I am getting an error:
"Error in nls(formula = y ~ . - 1, data = X, start = low, lower = low, :
parameters without starting value in 'data': ."
I believe everything is correct here; I tried to set a lower and an upper bound on all variables, so I am not sure what is wrong.
Attempt 1:
library(magrittr)
X <- data.frame(
x1 = seq(10),
x2 = seq(10),
x3 = seq(10),
x4 = seq(10),
x5 = seq(10),
y = seq(10)
)
low <- dplyr::select(X, -y) %>% names %>% lapply( function(e) 0)
up <- dplyr::select(X, -y) %>% names %>% lapply( function(e) Inf)
names(low) <- dplyr::select(X, -y) %>% names -> names(up)
fit1 <- nls(formula = y ~ . -1 , data = X,
start = low,
lower = low,
upper = up,
algorithm = "port"
)
Attempt 2:
Here I try to set the formula manually, but then I get a new error:
"Error in qr(.swts * gr) :
dims [product 5] do not match the length of object [10]"
library(magrittr)
X <- data.frame(
x1 = seq(10),
x2 = seq(10),
x3 = seq(10),
x4 = seq(10),
x5 = seq(10),
y = seq(10)
)
n <- X %>% dplyr::select( -y ) %>% names %>% paste0( collapse = " + " )
f <- "y ~ %s -1" %>% sprintf( n ) %>% as.formula
low <- dplyr::select(X, -y) %>% names %>% lapply( function(e) 0)
up <- dplyr::select(X, -y) %>% names %>% lapply( function(e) Inf)
names(low) <- dplyr::select(X, -y) %>% names -> names(up)
fit1 <- nls(formula = f , data = X,
start = low,
lower = low,
upper = up,
algorithm = "port"
)
How can I fix this? Thanks!
1) There are several problems here:
nls does not use the same formula notation as lm. This is fixed below.
the example does not have identifiable parameters, i.e. they are not unique, so the calculation will fail. Below we change the example.
although 0 starting values seem to work here, in general numeric optimization with constraints tends to work better if the starting values are in the interior of the feasible region.
Using the above we have
set.seed(123)
X <- data.frame(
x1 = rnorm(10),
x2 = rnorm(10),
x3 = rnorm(10),
x4 = rnorm(10),
x5 = rnorm(10),
y = rnorm(10)
)
fo <- y ~ b1 * x1 + b2 * x2 + b3 * x3 + b4 * x4 + b5 * x5
st <- c(b1 = 1, b2 = 1, b3 = 1, b4 = 1, b5 = 1)
nls(fo, X, start = st, lower = numeric(5), algorithm = "port")
giving:
Nonlinear regression model
model: y ~ b1 * x1 + b2 * x2 + b3 * x3 + b4 * x4 + b5 * x5
data: X
b1 b2 b3 b4 b5
0.0000 0.1222 0.0000 0.2338 0.1457
residual sum-of-squares: 6.477
Algorithm "port", convergence message: relative convergence (4)
2) The nnls (non-negative least squares) package can do this directly. We use X defined in (1).
library(nnls)
nnls(as.matrix(X[-6]), X$y)
giving the following
Nonnegative least squares model
x estimates: 0 0.1221646 0 0.2337857 0.1457373
residual sum-of-squares: 6.477
reason terminated: The solution has been computed sucessfully.
This is a partial answer: you can combine it with @G.Grothendieck's answer to address your question about "what if you have too many variables to type out by hand".
As implied by the comment thread, the model you're trying to set up doesn't include an intercept by default. The easiest way to handle this is probably to add a column of 1s to your data frame (mydata <- data.frame(x0 = 1, mydata)).
## define variable names and parameter names
nx <- ncol(X)-1
vars <- names(X)[1:nx] ## assumes response is *last* column
pars <- gsub("x", "b", vars)
## construct formula
form <- reformulate(response = "y",
                    sprintf("%s*%s", pars, vars))
lwr <- setNames(rep(0, nx), pars)
upr <- setNames(rep(Inf, nx), pars)
start <- setNames(rep(1, nx), pars)
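A minimal sketch of how these constructed objects might then be passed to nls (assuming the X from answer (1) above; the suggested intercept column is omitted here for brevity):
fit <- nls(form, data = X,
           start = start,
           lower = lwr,
           upper = upr,
           algorithm = "port")
coef(fit)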
I have the following toy data -
data<-data.frame(y=rnorm(1000),x1=rnorm(1000), x2=rnorm(1000), x3=rnorm(1000), x4=rnorm(1000))
On this data, I am creating two models as below -
fit1 = lm(y ~ x1 + x3, data)
fit2 = lm(y ~ x2 + x3 + x4, data)
Finally, I am comparing these models using anova
anova(fit1, fit2)
Since I am running these tests on ~1 million separate datasets, I want to improve the performance and don't want to use the lm and anova functions.
To speed up computation, I am using RcppArmadillo::fastLm instead of lm, but there is no available anova function to compare models. Is there a faster function (compared to lm) which also has a corresponding anova function?
Any suggestions to improve performance will be appreciated.
Here are the performance results of the various lm versions used below. For plain linear regression, fastLm is a clear winner over base lm. Function anova2 by @Jay is much faster than the anova function and works on fastLm fits. Function f by @Donald is also faster than anova and works on fastLm fits.
microbenchmark::microbenchmark(
base_lm_1 = fit1 <- lm(y ~ x1 + x3, data),
base_lm_2 = fit2 <- lm(y ~ x2 + x3 + x4, data),
lm.fit_1 = lm.fit1 <- with(data, .lm.fit(y = y, x = cbind(1, x1, x3))),
lm.fit_2 = lm.fit2 <- with(data, .lm.fit(y = y, x = cbind(1, x2, x3, x4))),
lm.fit_3 = lm.fit3 <- lm_fun(y ~ x1 + x3, data),
lm.fit_4 = lm.fit4 <- lm_fun(y ~ x2 + x3 + x4, data),
fastLm1 = fastLM1 <- with(data, RcppArmadillo::fastLm(y = y, X = cbind(1, x1, x3))),
fastLm2 = fastLM2 <- with(data, RcppArmadillo::fastLm(y = y, X = cbind(1, x2, x3, x4))),
anova_base = anova(fit1, fit2),
Jay_fastLM = anova2(fastLM1, fastLM2),
Jay_lm.fit = anova2(lm.fit3, lm.fit4),
Donald = f(fastLM1, fastLM2),
times = 100L,
control=list(warmup=100L)
)
Unit: microseconds
expr min lq mean median uq max neval cld
base_lm_1 1472.480 1499.2560 1817.2659 1532.3840 1582.2615 28370.870 100 e
base_lm_2 1657.745 1706.5505 1796.3631 1744.3945 1825.7435 4761.020 100 e
lm.fit_1 94.212 106.9020 112.3093 111.2235 116.7010 147.192 100 a
lm.fit_2 124.220 129.8080 134.4455 132.9830 138.2510 156.166 100 a
lm.fit_3 853.906 873.9035 902.5856 889.9715 917.9375 1028.415 100 cd
lm.fit_4 991.238 1006.7015 1213.7061 1021.5325 1045.8980 19379.399 100 d
fastLm1 368.289 390.7805 434.1467 422.0855 476.9085 584.761 100 a c
fastLm2 416.645 441.8660 481.0027 462.8850 514.0755 617.619 100 a c
anova_base 2021.982 2099.8755 2322.2707 2190.3340 2246.7800 15345.093 100 f
Jay_fastLM 202.026 218.2580 229.6244 226.3405 238.9490 303.964 100 ab
Jay_lm.fit 200.028 216.0805 234.0143 229.7580 246.1870 292.268 100 ab
Donald 549.425 582.8105 612.6990 605.4400 625.5340 1079.989 100 bc
We could hack stats:::anova.lmlist so that it works for lists produced by .lm.fit (notice the dot) and RcppArmadillo::fastLm. Please check stats:::anova.lmlist to make sure I didn't delete anything you need!
anova2 <- function(object, ...) {
  objects <- list(object, ...)
  ## all models must be fitted to the same number of observations
  ns <- vapply(objects, function(x) length(x$residuals), 0L)
  stopifnot(!any(ns != ns[1L]))
  resdf <- vapply(objects, df.residual, 0L)
  ## residual sums of squares
  resdev <- vapply(objects, function(x) crossprod(residuals(x)), 0)
  ## model with the fewest residual df, i.e. the larger model
  bigmodel <- order(resdf)[1L]
  dfs <- c(NA, -diff(resdf))[bigmodel]
  ssq <- c(NA, -diff(resdev))[bigmodel]
  df.scale <- resdf[bigmodel]
  scale <- resdev[bigmodel]/resdf[bigmodel]
  ## F statistic and p-value of the model comparison
  fstat <- (ssq/dfs)/scale
  p <- pf(fstat, abs(dfs), df.scale, lower.tail=FALSE)
  return(c(F=fstat, p=p))
}
These wrappers make .lm.fit and RcppArmadillo::fastLm a little more convenient:
lm_fun <- function(fo, dat) {
  X <- model.matrix(fo, dat)
  ## .lm.fit returns a bare list, so store the residual df ourselves
  fit <- .lm.fit(X, dat[[all.vars(fo)[1]]])
  fit$df.residual <- dim(X)[1] - dim(X)[2]
  return(fit)
}
fastLm_fun <- function(fo, dat) {
  fit <- RcppArmadillo::fastLm(model.matrix(fo, dat), dat[[all.vars(fo)[1]]])
  return(fit)
}
Use it
fit1 <- lm_fun(y ~ x1 + x3, data)
fit2 <- lm_fun(y ~ x2 + x3 + x4, data)
anova2(fit1, fit2)
# F p
# 0.3609728 0.5481032
## Or use `fastLm_fun` here, but it's slower.
Compare
anova(lm(y ~ x1 + x3, data), lm(y ~ x2 + x3 + x4, data))
# Analysis of Variance Table
#
# Model 1: y ~ x1 + x3
# Model 2: y ~ x2 + x3 + x4
# Res.Df RSS Df Sum of Sq F Pr(>F)
# 1 997 1003.7
# 2 996 1003.4 1 0.36365 0.361 0.5481
lm_fun (which outperforms fastLm_fun) combined with anova2 appears to be around 60% faster than the conventional approach:
microbenchmark::microbenchmark(
conventional={
fit1 <- lm(y ~ x1 + x3, data)
fit2 <- lm(y ~ x2 + x3 + x4, data)
anova(fit1, fit2)
},
anova2={ ## using `.lm.fit`
fit1 <- lm_fun(y ~ x1 + x3, data)
fit2 <- lm_fun(y ~ x2 + x3 + x4, data)
anova2(fit1, fit2)
},
anova2_Fast={ ## using `RcppArmadillo::fastLm`
fit1 <- fastLm_fun(y ~ x1 + x3, data)
fit2 <- fastLm_fun(y ~ x2 + x3 + x4, data)
anova2(fit1, fit2)
},
anova_Donald={
fit1 <- lm_fun(y ~ x1 + x3, data)
fit2 <- lm_fun(y ~ x2 + x3 + x4, data)
anova_Donald(fit1, fit2)
},
times=1000L,
control=list(warmup=100L)
)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# conventional 2.885718 2.967053 3.205947 3.018668 3.090954 40.15720 1000 d
# anova2 1.180683 1.218121 1.285131 1.233335 1.267833 23.81955 1000 a
# anova2_Fast 1.961897 2.012756 2.179458 2.037854 2.087893 26.65279 1000 c
# anova_Donald 1.368699 1.409198 1.561751 1.430562 1.472148 33.12247 1000 b
Data:
set.seed(42)
data <- data.frame(y=rnorm(1000), x1=rnorm(1000), x2=rnorm(1000), x3=rnorm(1000),
x4=rnorm(1000))
Model:
y ~ x1 + x2 + x3
about 1000 rows
What I want to do is make a prediction "step-by-step":
using rows 0:20 to predict y for rows 21:30, then using rows 11:30 to predict y for rows 31:40, and so on.
You can use the predict function:
mod = lm(y ~ ., data=df[1:990,])
pred = predict(mod, newdata=df[991:1000,2:4])
Edit: to change the range of training data in a loop:
index = seq(10, 990, 10)
pred = matrix(nrow=10, ncol=length(index))
MSE = numeric(length(index))  # initialise before the loop
for(i in index){
  mod = lm(y ~ ., data=df[1:i,])
  pred[,i/10] = predict(mod, newdata=df[(i+1):(i+10),2:4])
  MSE[i/10] = sum((df$y[(i+1):(i+10)]-pred[,i/10])^2)}
mean(MSE)
Are you looking for something like this?
# set up mock data
set.seed(1)
df <- data.frame(y = rnorm(1000),
x1 = rnorm(1000),
x2 = rnorm(1000),
x3 = rnorm(1000))
# for loop
prd <- list()
for(i in 1:970){
# training data
trn <- df[i:(i+20), ]
# test data
tst <- df[(i+21):(i+30), ]
# lm model
mdl <- lm(y ~ x1 + x2 + x3, trn)
# append a data.frame with both predicted and actual values to the list
# for later comparison
prd[[i]] <- data.frame(prd = predict(mdl, tst[-1]),
act = tst[[1]])
}
# your list
prd
You can also try something fancier with the package slider:
# define here your model and how you want to handle the predictions
sliding_lm <- function(..., frm, n_trn, n_tst){
  df <- data.frame(...)
  trn <- df[1:n_trn, ]
  tst <- df[n_trn + 1:n_tst, ]
  mdl <- lm(frm, trn)  # use the formula passed in via `frm`
  data.frame(prd = predict(mdl, tst[-1]),
             act = tst[[1]])
}
n_trn <- 20 # number of training obs
n_tst <- 10 # number of test obs
frm <- y ~ x1 + x2 + x3 # formula of your model
prd <- slider::pslide(df, sliding_lm,
frm = frm,
n_trn = n_trn,
n_tst = n_tst,
.after = n_trn + n_tst,
.complete = TRUE)
Note that the last 30 entries in the list are NULL, because only complete windows (30 observations, training plus test) are evaluated.
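If the trailing NULL entries get in the way, a small follow-up sketch for dropping them and stacking the remaining data frames (illustrative, assuming prd from above):
# keep only the complete windows and bind them into one data frame
prd_ok <- Filter(Negate(is.null), prd)
all_prd <- do.call(rbind, prd_ok)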
In R I made a multiple linear regression model with 10 variables. Now that I have the model and its coefficients, I would like to know how to calculate the value of each independent variable that gives the global minimum of the model output "y".
I tried the optim function, which I read does this, but I get an error:
Error in fn(par, ...) : could not find function "fn"
### calculate model
fit <- lm(Y ~., data=df2)
### create the fitted equation as a function, using the estimated coefficients, to get the
### global minimum and the values at which that minimum is reached
f <- function(X,X1,X2,X3,X4,X5,X6,X7,X8,X9) -210.50200438 - 1.08319034 * X + 0.07467384 * X1 + 0.02106764 * X2 + 0.08280658 * X3 + 2.07585559 * X4 + 5.09223561 * X5 + -7.57592770 * X6 + 1.62970544 * X7 + 0.45169896 * X8 - 0.32157245 * X9
### Variables
X <- seq(19,75,by=1)
X1 <- seq(3780,9183,by=100)
X2 <- seq(7534,15840,by=100)
X3 <- seq(2810,5100,by=100)
X4 <- seq(185,596,by=1)
X5 <- seq(1.20,48.1,by=0.1)
X6 <- seq(38,78,by=1)
X7 <- seq(33,100,by=1)
X8 <- seq(160,358,by=1)
X9 <- seq(42,458,by=1)
z <- outer(X,X1,X2,X3,X4,X5,X6,X7,X8,X9,f)
optim(X,X1,X2,X3,X4,X5,X6,X7,X8,X9,f)
I want to know the global minimum of "y" and the values of the independent variables that reach it. However, I get that error with optim; if optim is not what I should use, please tell me how I could achieve this.
Thanks
As shared in the comments by @Luis: take the minimum value of a variable if its coefficient is positive and the maximum value if its coefficient is negative.
### Variables
X <- seq(19, 75, by = 1)
X1 <- seq(3780, 9183, by = 100)
X2 <- seq(7534, 15840, by = 100)
X3 <- seq(2810, 5100, by = 100)
X4 <- seq(185, 596, by = 1)
X5 <- seq(1.20, 48.1, by = 0.1)
X6 <- seq(38, 78, by = 1)
X7 <- seq(33, 100, by = 1)
X8 <- seq(160, 358, by = 1)
X9 <- seq(42, 458, by = 1)
f <- function(X, X1, X2, X3, X4, X5, X6, X7, X8, X9)
(- 210.50200438
- 1.08319034 * X
+ 0.07467384 * X1
+ 0.02106764 * X2
+ 0.08280658 * X3
+ 2.07585559 * X4
+ 5.09223561 * X5
- 7.57592770 * X6
+ 1.62970544 * X7
+ 0.45169896 * X8
- 0.32157245 * X9)
f(max(X), min(X1), min(X2), min(X3), min(X4), min(X5), max(X6), min(X7), min(X8), max(X9))
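The same sign rule can be applied programmatically; a hedged sketch using the coefficients and variable ranges above (the object names below are illustrative, not from the original post):
# positive coefficient -> take the variable's minimum, negative -> its maximum
b <- c(X = -1.08319034, X1 = 0.07467384, X2 = 0.02106764, X3 = 0.08280658,
       X4 = 2.07585559, X5 = 5.09223561, X6 = -7.57592770, X7 = 1.62970544,
       X8 = 0.45169896, X9 = -0.32157245)
ranges <- list(X, X1, X2, X3, X4, X5, X6, X7, X8, X9)
x_at_min <- mapply(function(beta, r) if (beta > 0) min(r) else max(r), b, ranges)
y_min <- -210.50200438 + sum(b * x_at_min)
y_min  # same value as the manual call above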
I have been able to run a regression with some coefficients constrained to be positive, but I'm doing a lot of rolling regressions where I face the problem. Here is my sample code:
library(penalized)
set.seed(1)
x1=rnorm(100)*10
x2=rnorm(100)*10
x3=rnorm(100)*10
y=sin(x1)+cos(x2)-x3+rnorm(100)
data <- data.frame(y, x1, x2, x3)
win <- 10
coefs <- matrix(NA, ncol=4, nrow=length(y))
for(i in 1:(length(y)-win)) {
d <- data[(1+i):(win+i),]
p <- win+i
# Linear Regression
coefs[p,] <- as.vector(coef(penalized(y, ~ x1 + x2 + x3, ~1,
lambda1=0, lambda2=0, positive = c(F, F, T), data=data)))}
This is how I usually populate the matrix with coefficients from the rolling regression, and now I receive the error:
Error in coefs[p, ] <- as.vector(coef(penalized(y, ~x1 + x2 + x3, ~1, :
number of items to replace is not a multiple of replacement length
I assume this error is produced because the intercept plus 3 coefficients do not always come out of that penalized regression function. Is there a way to get the penalized function to report 0 coefficients as well, or another way to populate the matrix / data.frame?
Perhaps you are unaware of the which argument of coef for "penfit" objects. Have a look at:
getMethod(coef, "penfit")
#function (object, ...)
#{
# .local <- function (object, which = c("nonzero", "all", "penalized",
# "unpenalized"), standardize = FALSE)
# {
# coefficients(object, which, standardize)
# }
# .local(object, ...)
#}
#<environment: namespace:penalized>
We can set which = "all" to report all coefficients. The default is which = "nonzero", which is what causes the "number of items to replace is not a multiple of replacement length" issue.
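A quick illustrative check of the difference (a single fit on the full data from the question, not the rolling window version):
pen <- penalized(y, ~ x1 + x2 + x3, ~1, lambda1 = 0, lambda2 = 0,
                 positive = c(F, F, T), data = data)
coef(pen)                  # default which = "nonzero": a coefficient pinned at exactly 0 is dropped
coef(pen, which = "all")   # always reports (Intercept), x1, x2 and x3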
The following works:
library(penalized)
set.seed(1)
x1 = rnorm(100)*10
x2 = rnorm(100)*10
x3 = rnorm(100)*10
y = sin(x1) + cos(x2) - x3 + rnorm(100)
data <- data.frame(y, x1, x2, x3)
win <- 10
coefs <- matrix(NA, ncol=4, nrow=length(y))
for(i in 1:(length(y)-win)) {
  d <- data[(1+i):(win+i),]
  p <- win + i
  pen <- penalized(y, ~ x1 + x2 + x3, ~1, lambda1 = 0, lambda2 = 0,
                   positive = c(F, F, T), data = d)  ## fit on the rolling window `d`
  beta <- coef(pen, which = "all")
  coefs[p,] <- unname(beta)
}