I'm trying to run a regression constrained so that all coefficients are greater than zero. To do this, I am using the nls function. However, I am getting an error:
"Error in nls(formula = y ~ . - 1, data = X, start = low, lower = low, :
parameters without starting value in 'data': ."
I believe everything is correct here: I set a lower and an upper bound on all variables, so I am not sure what is wrong.
Attempt 1:
library(magrittr)
X <- data.frame(
x1 = seq(10),
x2 = seq(10),
x3 = seq(10),
x4 = seq(10),
x5 = seq(10),
y = seq(10)
)
low <- dplyr::select(X, -y) %>% names %>% lapply( function(e) 0)
up <- dplyr::select(X, -y) %>% names %>% lapply( function(e) Inf)
names(low) <- dplyr::select(X, -y) %>% names -> names(up)
fit1 <- nls(formula = y ~ . -1 , data = X,
start = low,
lower = low,
upper = up,
algorithm = "port"
)
Attempt 2:
Here I try to construct the formula manually, but then I get a new error:
"Error in qr(.swts * gr) :
dims [product 5] do not match the length of object [10]"
library(magrittr)
X <- data.frame(
x1 = seq(10),
x2 = seq(10),
x3 = seq(10),
x4 = seq(10),
x5 = seq(10),
y = seq(10)
)
n <- X %>% dplyr::select( -y ) %>% names %>% paste0( collapse = " + " )
f <- "y ~ %s -1" %>% sprintf( n ) %>% as.formula
low <- dplyr::select(X, -y) %>% names %>% lapply( function(e) 0)
up <- dplyr::select(X, -y) %>% names %>% lapply( function(e) Inf)
names(low) <- dplyr::select(X, -y) %>% names -> names(up)
fit1 <- nls(formula = f , data = X,
start = low,
lower = low,
upper = up,
algorithm = "port"
)
How can I fix this? Thanks!
1) There are several problems here:
- nls does not use the same formula notation as lm; this is fixed below.
- the example does not have identifiable parameters, i.e. they are not unique, so the calculation will fail; the example is changed below.
- although 0 starting values seem to work here, in general numeric optimization with constraints tends to work better if the starting values are in the interior of the feasible region.
Using the above we have:
set.seed(123)
X <- data.frame(
x1 = rnorm(10),
x2 = rnorm(10),
x3 = rnorm(10),
x4 = rnorm(10),
x5 = rnorm(10),
y = rnorm(10)
)
fo <- y ~ b1 * x1 + b2 * x2 + b3 * x3 + b4 * x4 + b5 * x5
st <- c(b1 = 1, b2 = 1, b3 = 1, b4 = 1, b5 = 1)
nls(fo, X, start = st, lower = numeric(5), algorithm = "port")
giving:
Nonlinear regression model
model: y ~ b1 * x1 + b2 * x2 + b3 * x3 + b4 * x4 + b5 * x5
data: X
b1 b2 b3 b4 b5
0.0000 0.1222 0.0000 0.2338 0.1457
residual sum-of-squares: 6.477
Algorithm "port", convergence message: relative convergence (4)
2) The nnls (non-negative least squares) package can do this directly. We use X defined in (1).
library(nnls)
nnls(as.matrix(X[-6]), X$y)
giving the following
Nonnegative least squares model
x estimates: 0 0.1221646 0 0.2337857 0.1457373
residual sum-of-squares: 6.477
reason terminated: The solution has been computed sucessfully.
This is a partial answer: you can combine it with @G. Grothendieck's answer to address your question about "what if you have too many variables to type out by hand".
As implied by the comment thread, the model you're trying to set up doesn't include an intercept by default. The easiest way to handle this is probably to add a column of 1s to your data frame (mydata <- data.frame(x0 = 1, mydata)).
## define variable names and parameter names
nx <- ncol(X)-1
vars <- names(X)[1:nx] ## assumes response is *last* column
pars <- gsub("x", "b", vars)
## construct formula
form <- reformulate(sprintf("%s*%s", pars, vars), response = "y")
lwr <- setNames(rep(0, nx), pars)
upr <- setNames(rep(Inf, nx), pars)
start <- setNames(rep(1, nx), pars)
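With those pieces in place, the fit itself can be run exactly as in answer (1). A minimal sketch, assuming X is the random-data example from (1) and form, lwr, upr, and start are the objects built above:
## sketch: fit with the programmatically built formula and bounds
fit <- nls(form, data = X,
           start = start,
           lower = lwr, upper = upr,
           algorithm = "port")
coef(fit)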
Related
library("partykit")
library(data.table) ## optional, but what I'll use to coerce the list into a DT
library(future.apply) ## for parallel stuff
library(glmnet)
plan(multisession) ## use all available cores
# sample size
N <- 200
# coefficients
betas <- list()
betas$b0 <- 1
betas$b1_up <- 2.4
betas$b1_down <- 2
betas$b2_up <- 2.4
betas$b2_down <- 2
ols_formula <- y ~ a | x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10
# lmtree: function to fit the mob tree
# "0 +" ---> suppress the 'double' intercept
ols <- function(y, a, start = NULL, weights = NULL, offset = NULL, ...) {
  lm(y ~ 0 + a)
}
reg_simulation_mob <- function(...){
  # data
  data <- data.frame(
    a  = rbinom(N, 1, 0.5),
    x1 = rnorm(N),
    x2 = rnorm(N),
    x3 = rnorm(N),
    x4 = rnorm(N),
    x5 = rnorm(N),
    x6 = rnorm(N),
    x7 = rnorm(N),
    x8 = rnorm(N),
    x9 = rnorm(N),
    x10 = rnorm(N),
    e  = rnorm(N))
  # dependent variable
  data$y <- betas$b0 + with(data, ifelse(a > 0,
    betas$b1_up * x1 + betas$b2_up * x2,
    betas$b1_down * x1 + betas$b2_down * x2) + e)
  # Estimate mob()-OLS
  ols_mob <- lm(ols_formula, data = data, fit = ols)
  # return(ols$coefficients)
  return(ols_mob)
}
# N repetitions
nreps <- 4
## Parallel version
results <- future_lapply(1:nreps, reg_simulation_mob, future.seed = NULL)
results
The trees are not splitting; moreover, I intend to regress y on a treatment with x1...x10 as covariates. The result I obtained is given below:
Call:
lm(formula = ols_formula, data = data, fit = ols)
Coefficients:
(Intercept)
0.8658
a | x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10TRUE
NA
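For what it's worth, the NA row above hints at the problem: lm() gives | no special meaning, so a | x1 + ... + x10 is evaluated as a logical OR and collapses into the single TRUE/FALSE term printed above. A model-based tree would instead go through partykit; a minimal sketch of the presumed intent inside reg_simulation_mob (the exact call is an assumption based on the "mob()-OLS" comment):
# assumption: fit y ~ a within each node, partitioning on x1..x10
ols_mob <- partykit::lmtree(ols_formula, data = data)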
I am trying to plot only a select few coefficients from a regression, but my regression has 100s of variables, so I'm trying to find a way to extract the coefficients programmatically.
I have this regression below:
n <- 1e3
d <- data.frame(
# Covariates
x1 = rnorm(n),
x2 = rnorm(n),
x3 = rnorm(n),
x4 = rnorm(n),
# Individuals and firms
id = factor(sample(20, n, replace=TRUE)),
firm = factor(sample(13, n, replace=TRUE)),
# Noise
u = rnorm(n)
)
id.eff <- rnorm(nlevels(d$id))
firm.eff <- rnorm(nlevels(d$firm))
d$y <- d$x1 + 0.5*d$x2 + id.eff[d$id] + firm.eff[d$firm] + d$u
library(lfe)
est <- felm(y ~ x1 + x2 + x3 + x4 | id + firm, data = d)
But I only want to plot x1 & x2, so I extract the column names I want:
d_names = names(d)
N <- -4
d_cleaned_names <- head(d_names, -N)
d_cleaned_names_filter <- d_cleaned_names[1:2]
I then add quotation marks and paste them into one continuous character string:
d_cleaned_names_filter_quote <- shQuote(d_cleaned_names_filter)
d_cleaned_filter_quote_names = paste(d_cleaned_names_filter_quote, collapse = ',' )
and when I plot this:
jtools::plot_coefs(est,plot.distributions = TRUE, inner_ci_level = .9, coefs = c(d_cleaned_names_filter_quote))
I get the error:
Error in if (rescale.distributions == FALSE && max(heights)/min(heights) > :
missing value where TRUE/FALSE needed
In addition: Warning messages:
1: In max(heights) : no non-missing arguments to max; returning -Inf
2: In min(heights) : no non-missing arguments to min; returning Inf
but writing 'x1' & 'x2' out manually works perfectly. However, this is not feasible for a large dataset:
jtools::plot_coefs(est,plot.distributions = TRUE, inner_ci_level = .9, coefs = c('x1', 'x2'))
Any advice would be very welcome.
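For what it's worth, the shQuote() step is the likely culprit: it embeds literal quote characters in the strings, so they no longer match the model's coefficient names and jtools finds nothing to plot. A minimal sketch of the check, and of passing the bare character vector instead (which is exactly what c('x1', 'x2') is):
shQuote(c("x1", "x2"))
# [1] "'x1'" "'x2'"   <- embedded quotes, so no coefficient name matches
jtools::plot_coefs(est, plot.distributions = TRUE, inner_ci_level = .9,
                   coefs = d_cleaned_names_filter)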
I have been trying to create a series of coplots using a nested for loop, but the loop takes too long to run (the original data set is very big). I have looked at similar questions and they suggest using the sapply function, but I am still unclear about how to convert between the two. I understand I need to create a plotting function to use (see below), but what I don't understand is how the i's and j's of the nested for loop translate into sapply arguments.
I have made some sample data, the nested for loop that I have been using, and the plotting function I created, all below. Could someone walk me through how to convert my nested for loop into sapply arguments? I have been doing all of this in R. Many thanks.
y = rnorm(n = 200, mean = 10, sd = 2)
x1 = rnorm(n = 200, mean = 5, sd = 2)
x2 = rnorm(n = 200, mean = 2.5, sd = 2)
x3 = rep(letters[1:4], each = 50)
x4 = rep(LETTERS[1:8], each = 25)
dat = data.frame(y = y, x1 = x1, x2 = x2, x3 = x3, x4 = x4)
for(i in dat[, 2:3]){
for(j in dat[, 4:5]){
coplot(y ~ i | j, rows = 1, data = dat)
}
}
coplot_fun = function(data, x, y, z, na.rm = TRUE){
  coplot(.data[[y]] ~ .data[[x]] | .data[[z]], data = data, rows = 1)
}
I think you might be able to use mapply here rather than sapply. mapply is similar to sapply but allows you to pass two inputs instead of one.
y = rnorm(n = 200, mean = 10, sd = 2)
x1 = rnorm(n = 200, mean = 5, sd = 2)
x2 = rnorm(n = 200, mean = 2.5, sd = 2)
x3 = rep(letters[1:4], each = 50)
x4 = rep(LETTERS[1:8], each = 25)
dat = data.frame(y = y, x1 = x1, x2 = x2, x3 = x3, x4 = x4)
mapply(function(x, j) coplot(dat[["y"]] ~ x | j, rows = 1), dat[, 2:3], dat[, 4:5])
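One caveat: mapply() pairs its inputs elementwise, so the call above draws coplots only for the pairs (x1, x3) and (x2, x4), while the nested loop draws all four combinations. A sketch of one way to recover all four, still with mapply(), by building the name pairs first:
# enumerate every x/given combination, then draw one coplot per row
combos <- expand.grid(x = c("x1", "x2"), g = c("x3", "x4"),
                      stringsAsFactors = FALSE)
mapply(function(x, g) coplot(dat[["y"]] ~ dat[[x]] | factor(dat[[g]]), rows = 1),
       combos$x, combos$g)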
We can use a combination of the functions expand.grid, formula, and apply to pass character column names to coplot.
# combinations of column names for plotting
vars <- expand.grid(y = "y", x = c("x1", "x2"), z = c("x3", "x4"))
# cycle through column name variations, construct formula for each combination
apply(vars, MARGIN = 1,
      FUN = function(x) coplot(
        formula = formula(paste(x[1], "~", x[2], "|", x[3])),
        data = dat, rows = 1
      ))
Here's a tidyverse version of @nya's solution with expand.grid() and apply(). Each row in ds_plot_parameters represents a single plot. The equation variable is the string eventually passed to coplot().
Each equation is passed to purrr::walk(), which then calls coplot() to produce one graph each. as.formula() converts the string to a formula object.
ds_plot_parameters <-
tidyr::expand_grid(
v = c("x1", "x2"),
w = c("x3", "x4")
) |>
dplyr::mutate(
equation = paste0("y ~ ", v, " | ", w),
)
ds_plot_parameters$equation |>
purrr::walk(
\(e) coplot(as.formula(e), rows = 1, data = dat)
)
Gravy:
If you want to pass more inputs to the graph, then expand ds_plot_parameters to include other things like graph & axis titles.
ds_plot_parameters <-
tidyr::expand_grid(
v = c("x1", "x2"),
w = c("x3", "x4")
) |>
dplyr::mutate(
equation = paste0("y ~ ", v, " | ", w),
label_y = "Outcome (mL)",
label_x = paste0(v, " (log 10)")
)
ds_plot_parameters |>
dplyr::select(
# Make sure this order exactly matches the function signature
equation,
label_x,
label_y,
) |>
purrr::pwalk(
.f = \(equation, label_x, label_y) {
coplot(
formula = as.formula(equation),
xlab = label_x,
ylab = label_y,
rows = 1,
data = dat
)
}
)
ds_plot_parameters
# # A tibble: 4 x 5
# v w equation label_y label_x
# <chr> <chr> <chr> <chr> <chr>
# 1 x1 x3 y ~ x1 | x3 Outcome (mL) x1 (log 10)
# 2 x1 x4 y ~ x1 | x4 Outcome (mL) x1 (log 10)
# 3 x2 x3 y ~ x2 | x3 Outcome (mL) x2 (log 10)
# 4 x2 x4 y ~ x2 | x4 Outcome (mL) x2 (log 10)
I have been able to run regressions with some coefficients constrained to positive territory, but I'm doing a lot of rolling regressions, which is where I face the problem. Here is my sample code:
library(penalized)
set.seed(1)
x1=rnorm(100)*10
x2=rnorm(100)*10
x3=rnorm(100)*10
y=sin(x1)+cos(x2)-x3+rnorm(100)
data <- data.frame(y, x1, x2, x3)
win <- 10
coefs <- matrix(NA, ncol=4, nrow=length(y))
for(i in 1:(length(y)-win)) {
  d <- data[(1+i):(win+i),]
  p <- win+i
  # Linear Regression
  coefs[p,] <- as.vector(coef(penalized(y, ~ x1 + x2 + x3, ~1,
    lambda1=0, lambda2=0, positive = c(F, F, T), data=data)))
}
This is how I usually populate a matrix with coefficients from a rolling regression, but now I receive this error:
Error in coefs[p, ] <- as.vector(coef(penalized(y, ~x1 + x2 + x3, ~1, :
number of items to replace is not a multiple of replacement length
I assume this error is produced because the penalized regression function does not always return the intercept plus 3 coefficients. Is there a way to get the penalized function to report 0 coefficients as well? Or another way to populate the matrix / data.frame?
Perhaps you are unaware of the which argument of coef for "penfit" objects. Have a look at:
getMethod(coef, "penfit")
#function (object, ...)
#{
# .local <- function (object, which = c("nonzero", "all", "penalized",
# "unpenalized"), standardize = FALSE)
# {
# coefficients(object, which, standardize)
# }
# .local(object, ...)
#}
#<environment: namespace:penalized>
We can set which = "all" to report all coefficients. The default is which = "nonzero", which drops coefficients that are exactly zero and is what causes the "number of items to replace is not a multiple of replacement length" error.
The following works:
library(penalized)
set.seed(1)
x1 = rnorm(100)*10
x2 = rnorm(100)*10
x3 = rnorm(100)*10
y = sin(x1) + cos(x2) - x3 + rnorm(100)
data <- data.frame(y, x1, x2, x3)
win <- 10
coefs <- matrix(NA, ncol=4, nrow=length(y))
for(i in 1:(length(y)-win)) {
  d <- data[(1+i):(win+i),]    # current rolling window
  p <- win + i
  pen <- penalized(y, ~ x1 + x2 + x3, ~1, lambda1 = 0, lambda2 = 0,
                   positive = c(F, F, T), data = d)
  beta <- coef(pen, which = "all")   # report zero coefficients too
  coefs[p,] <- unname(beta)
}
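To see why the default broke the assignment only intermittently, compare the two extractor calls on a single fit (a sketch; pen here is the last fit left over from the loop above):
length(coef(pen))                 # "nonzero" drops exact zeros, so length varies
length(coef(pen, which = "all"))  # always 4: intercept + 3 slopes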
I have a list of regression models which all have the same number of terms (that is, the same number of predictor variables). Substantively, it is right that they all have different model terms. But when it comes to putting them in a regression table, I want all the models to share a single formula, simply for the sake of presentation.
Some indicative data
library(plyr)
d1 <- data.frame(y = rnorm(100),
x1 = runif(100),
x2 = runif(100),
x3 = runif(100),
x4 = runif(100))
Fit the models
mods.form <- paste("y ~ x", 1:4, sep = "")
mod.list <- llply(mods.form, function(i) lm(i, d1))
Here are the terms I want to modify
llply(mod.list, function(i) attr(terms(i), "variables"))
[[1]]
list(y, x1)
[[2]]
list(y, x2)
[[3]]
list(y, x3)
[[4]]
list(y, x4)
I want every model in the list to have the same variable names as the first model, so I tried:
mod.list2 <- llply(mod.list, function(i) attr(terms(i), "variables") = list("y", "x1"))
which produces this error:
Error in attr(terms(i), "variables") = list("y", "x1") :
could not find function "terms<-"
Is there a simple solution here?
Perhaps this is what you are looking for.
Using the data frame that you provided:
d1 <- data.frame(y = rnorm(100),
x1 = rnorm(100),
x2 = rnorm(100),
x3 = rnorm(100),
x4 = rnorm(100))
First, rename each x variable to the same desired name "x":
names(d1) <- c("y", rep("x", times=length(d1)-1))
Then, use lapply on the list d1 for each x variable, passing y as an argument into an anonymous function:
mod.list <- lapply(d1[2:ncol(d1)], function(x,y){
lm("y ~ x",d1)
}, y=d1[, 'y'])
Finally, calling llply on the mod.list we get:
> llply(mod.list, function(x){
+ attr(terms(x), "variables")
+ })
$x
list(y, x)
$x.1
list(y, x)
$x.2
list(y, x)
$x.3
list(y, x)