When I fit a linear model in R via the lm() function, I can pass a character vector of variable names into the lm() formula (e.g. as described here or here). However, if I apply the same method to the selection() function of the sampleSelection package, the following error appears:
Error in detectModelType(selection, outcome) :
argument 'selection' must be a formula in function 'selection()'
Question: Is there a way to pass a character vector of variables into the selection() formula?
Below, you can find a reproducible example, which illustrates the problem:
# Example data
N <- 1000
y <- rnorm(N, 2000, 200)
y_prob <- c(rep(0, N / 2), rep(1, N / 2)) == 1
x1 <- y + rnorm(N, 0, 300)
x2 <- y + rnorm(N, 0, 300)
x3 <- y + rnorm(N, 0, 300)
x4 <- y + rnorm(N, 0, 300)
x5 <- y + rnorm(N, 0, 300)
y[1:(N / 2)] <- 0
data <- data.frame(y, x1, x2, x3, x4, x5, y_prob)
x_vars <- colnames(data)[colnames(data) %in% c("y", "y_prob") == FALSE]
# Estimate linear model via lm() --> works without any problems
lm(paste("y", "~", paste(x_vars, collapse = " + ")))
# Estimate Heckman model via selection()
library("sampleSelection")
# Passing of vector does not work
selection(paste("y_prob", "~", paste(x_vars[1:4], collapse = " + ")),
paste("y", "~", paste(x_vars[3:5], collapse = " + ")), data)
# Formula has to be written manually
selection(y_prob ~ x1 + x2 + x3 + x4, y ~ x3 + x4 + x5, data)
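Wrap your paste() calls in as.formula(). lm() coerces a character string to a formula itself, whereas selection() checks that its arguments already are formula objects: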
selection(as.formula(paste("y_prob", "~", paste(x_vars[1:4], collapse = " + "))),
as.formula(paste("y", "~", paste(x_vars[3:5], collapse = " + "))), data)
Call:
selection(selection = as.formula(paste("y_prob", "~", paste(x_vars[1:4], collapse = " + "))), outcome = as.formula(paste("y", "~", paste(x_vars[3:5], collapse = " + "))), data = data)
Coefficients:
S:(Intercept) S:x1 S:x2 S:x3 S:x4 O:(Intercept) O:x3 O:x4 O:x5 sigma
-1.936e-01 -5.851e-05 7.020e-05 5.475e-05 2.811e-05 2.905e+02 2.286e-01 2.437e-01 2.165e-01 4.083e+02
rho
1.000e+00
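As an alternative to as.formula(paste(...)), base R's reformulate() builds a formula object directly from a character vector of term labels. The following is a minimal sketch, assuming the same variable names as in the question; the selection() call is left as a comment because it needs the sampleSelection package and the data frame defined above.

```r
# Variable names as they would come from colnames(data) in the question
x_vars <- c("x1", "x2", "x3", "x4", "x5")

# reformulate() returns a formula object, so no as.formula() wrapper is needed
sel_formula <- reformulate(x_vars[1:4], response = "y_prob")
out_formula <- reformulate(x_vars[3:5], response = "y")

print(sel_formula)
print(out_formula)

# With sampleSelection loaded and `data` defined as in the question:
# selection(sel_formula, out_formula, data = data)
```

Because both objects already have class "formula", they should pass the type check that raised the original error.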
I'm trying to run a regression with all coefficients constrained to be greater than zero. To do this, I am using the nls function. However, I am getting an error:
"Error in nls(formula = y ~ . - 1, data = X, start = low, lower = low, :
parameters without starting value in 'data': ."
I believe everything is correct here. I tried to set a lower and an upper bound on all variables, so I am not sure what is wrong.
Attempt 1:
library(magrittr)
X <- data.frame(
x1 = seq(10),
x2 = seq(10),
x3 = seq(10),
x4 = seq(10),
x5 = seq(10),
y = seq(10)
)
low <- dplyr::select(X, -y) %>% names %>% lapply( function(e) 0)
up <- dplyr::select(X, -y) %>% names %>% lapply( function(e) Inf)
names(low) <- dplyr::select(X, -y) %>% names -> names(up)
fit1 <- nls(formula = y ~ . -1 , data = X,
start = low,
lower = low,
upper = up,
algorithm = "port"
)
Attempt 2:
Here I try to set the formula manually but then I get a new error:
"Error in qr(.swts * gr) :
dims [product 5] do not match the length of object [10]"
library(magrittr)
X <- data.frame(
x1 = seq(10),
x2 = seq(10),
x3 = seq(10),
x4 = seq(10),
x5 = seq(10),
y = seq(10)
)
n <- X %>% dplyr::select( -y ) %>% names %>% paste0( collapse = " + " )
f <- "y ~ %s -1" %>% sprintf( n ) %>% as.formula
low <- dplyr::select(X, -y) %>% names %>% lapply( function(e) 0)
up <- dplyr::select(X, -y) %>% names %>% lapply( function(e) Inf)
names(low) <- dplyr::select(X, -y) %>% names -> names(up)
fit1 <- nls(formula = f , data = X,
start = low,
lower = low,
upper = up,
algorithm = "port"
)
How can I fix this? Thanks!
1) There are several problems here:
nls does not use the same formula notation as lm: the parameters have to be written out explicitly in the formula. This is fixed below.
The example does not have identifiable parameters, i.e. they are not unique (all five predictors are the same sequence), so the calculation will fail. Below we change the example.
Although starting values of 0 seem to work here, in general numeric optimization with constraints tends to work better if the starting values are in the interior of the feasible region.
Using the above we have
set.seed(123)
X <- data.frame(
x1 = rnorm(10),
x2 = rnorm(10),
x3 = rnorm(10),
x4 = rnorm(10),
x5 = rnorm(10),
y = rnorm(10)
)
fo <- y ~ b1 * x1 + b2 * x2 + b3 * x3 + b4 * x4 + b5 * x5
st <- c(b1 = 1, b2 = 1, b3 = 1, b4 = 1, b5 = 1)
nls(fo, X, start = st, lower = numeric(5), algorithm = "port")
giving:
Nonlinear regression model
model: y ~ b1 * x1 + b2 * x2 + b3 * x3 + b4 * x4 + b5 * x5
data: X
b1 b2 b3 b4 b5
0.0000 0.1222 0.0000 0.2338 0.1457
residual sum-of-squares: 6.477
Algorithm "port", convergence message: relative convergence (4)
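The identifiability problem mentioned in (1) can be checked before fitting by looking at the rank of the predictor matrix. This is a small sketch in base R; the names X_bad and X_ok are only illustrative and stand for the predictors of the original and the revised example.

```r
# Original example: every predictor is the same sequence 1..10
# (stored as doubles here; the values are those of seq(10))
s <- as.numeric(seq(10))
X_bad <- data.frame(x1 = s, x2 = s, x3 = s, x4 = s, x5 = s)
rank_bad <- qr(as.matrix(X_bad))$rank

# Revised example: independent random predictors
set.seed(123)
X_ok <- data.frame(x1 = rnorm(10), x2 = rnorm(10), x3 = rnorm(10),
                   x4 = rnorm(10), x5 = rnorm(10))
rank_ok <- qr(as.matrix(X_ok))$rank

# A rank below the number of columns means the coefficients are not unique
c(bad = rank_bad, ok = rank_ok, columns = ncol(X_ok))
```

When the rank is smaller than the number of columns, infinitely many coefficient vectors give the same fit, which is why the original data cannot be used to estimate five separate parameters.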
2) The nnls (non-negative least squares) package can do this directly. We use X defined in (1).
library(nnls)
# column 6 is the response y; the remaining columns are the predictors
nnls(as.matrix(X[-6]), X$y)
giving the following
Nonnegative least squares model
x estimates: 0 0.1221646 0 0.2337857 0.1457373
residual sum-of-squares: 6.477
reason terminated: The solution has been computed sucessfully.
This is a partial answer: you can combine it with @G.Grothendieck's answer to address your question about "what if you have too many variables to type out by hand".
As implied by the comment thread, the model you're trying to set up doesn't include an intercept by default. The easiest way to handle this is probably to add a column of 1s to your data frame (mydata <- data.frame(x0 = 1, mydata)).
## define variable names and parameter names
nx <- ncol(X)-1
vars <- names(X)[1:nx] ## assumes response is *last* column
pars <- gsub("x", "b", vars)
## construct formula
form <- reformulate(response = "y",
sprintf("%s*%s", pars, vars))
lwr <- setNames(rep(0, nx), pars)
upr <- setNames(rep(Inf, nx), pars)
start <- setNames(rep(1, nx), pars)
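The pieces built above can then be handed to nls(). The following is a sketch under the assumption that X is the randomized data frame from G. Grothendieck's answer (same seed); it repeats the construction so that it runs on its own.

```r
set.seed(123)
X <- data.frame(x1 = rnorm(10), x2 = rnorm(10), x3 = rnorm(10),
                x4 = rnorm(10), x5 = rnorm(10), y = rnorm(10))

nx   <- ncol(X) - 1
vars <- names(X)[1:nx]          # assumes the response is the last column
pars <- gsub("x", "b", vars)    # b1, ..., b5

# y ~ b1 * x1 + ... + b5 * x5, built from the names
form  <- reformulate(sprintf("%s*%s", pars, vars), response = "y")
lwr   <- setNames(rep(0, nx), pars)
upr   <- setNames(rep(Inf, nx), pars)
start <- setNames(rep(1, nx), pars)

# Bounds are only honoured by the "port" algorithm
fit <- nls(form, data = X, start = start,
           lower = lwr, upper = upr, algorithm = "port")
coef(fit)
```

The parameter names in the formula must match the names of start, lwr and upr, which is why all four are derived from the same pars vector.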
I have been trying to create a series of coplots using a nested for loop, but the loop takes too long to run (the original data set is very big). I have looked at similar questions and they suggest using the sapply function, but I am still unclear about how to convert between the two. I understand I need to create a plotting function to use (see below), but what I don't understand is how the i's and j's of the nested for loop translate into sapply arguments.
Below are some sample data, the nested for loop that I have been using, and the plotting function I created. Could someone walk me through how to convert my nested for loop into sapply arguments? I have been doing all of this in R. Many thanks.
y = rnorm(n = 200, mean = 10, sd = 2)
x1 = rnorm(n = 200, mean = 5, sd = 2)
x2 = rnorm(n = 200, mean = 2.5, sd = 2)
x3 = rep(letters[1:4], each = 50)
x4 = rep(LETTERS[1:8], each = 25)
dat = data.frame(y = y, x1 = x1, x2 = x2, x3 = x3, x4 = x4)
for(i in dat[, 2:3]){
for(j in dat[, 4:5]){
coplot(y ~ i | j, rows = 1, data = dat)
}
}
coplop_fun = function(data, x, y, z, na.rm = TRUE){
coplot(.data[[y]] ~ .data[[x]] | .data[[z]], data = data, rows = 1)
}
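I think you might be able to use mapply here rather than sapply. mapply is similar to sapply but lets you iterate over two (or more) inputs in parallel instead of one. Note that it pairs the inputs element by element (x1 with x3, x2 with x4) rather than running through every combination as the nested loop does.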
y = rnorm(n = 200, mean = 10, sd = 2)
x1 = rnorm(n = 200, mean = 5, sd = 2)
x2 = rnorm(n = 200, mean = 2.5, sd = 2)
x3 = rep(letters[1:4], each = 50)
x4 = rep(LETTERS[1:8], each = 25)
dat = data.frame(y = y, x1 = x1, x2 = x2, x3 = x3, x4 = x4)
for(i in dat[, 2:3]){
for(j in dat[, 4:5]){
coplot(y ~ i | j, rows = 1, data = dat)
}
}
mapply(function(x,j){coplot(dat[["y"]]~x|j,rows =1)}, dat[,2:3],dat[,4:5])
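To see which plots each approach would produce without drawing anything, the same iteration can be run on the column names alone. This is an illustrative sketch in base R; the vectors x_names and z_names are stand-ins for the columns used above.

```r
x_names <- c("x1", "x2")
z_names <- c("x3", "x4")

# mapply() walks both vectors in parallel: (x1, x3) and (x2, x4)
paired <- mapply(function(x, z) paste("y ~", x, "|", z),
                 x_names, z_names, USE.NAMES = FALSE)

# expand.grid() enumerates every combination, like the nested loop
grid <- expand.grid(x = x_names, z = z_names, stringsAsFactors = FALSE)
all_combos <- mapply(function(x, z) paste("y ~", x, "|", z),
                     grid$x, grid$z, USE.NAMES = FALSE)

paired
all_combos
```

The paired version yields two formulas, while the grid version yields the same four combinations as the nested for loop (in a different order).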
We can use a combination of functions expand.grid, formula and apply to accept character column names into coplot.
# combinations of column names for plotting
vars <- expand.grid(y = "y", x = c("x1", "x2"), z = c("x3", "x4"))
# cycle through column name variations, construct formula for each combination
apply(vars, MARGIN = 1,
FUN = function(x) coplot(
formula = formula(paste(x[1], "~", x[2], "|", x[3])),
data = dat, rows = 1
)
)
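The plotting function from the question can be written along the same lines, taking column names as strings and turning them into a formula. The helper names coplot_formula and coplot_by_name are hypothetical, and the sample data below are a reduced version of the question's data with the conditioning variable stored as a factor.

```r
# Hypothetical helper: build the coplot formula from column names
coplot_formula <- function(y, x, z) {
  as.formula(paste(y, "~", x, "|", z))
}

# Hypothetical wrapper around coplot() taking column names as strings
coplot_by_name <- function(data, y, x, z, ...) {
  coplot(coplot_formula(y, x, z), data = data, rows = 1, ...)
}

set.seed(1)
dat <- data.frame(
  y  = rnorm(200, mean = 10, sd = 2),
  x1 = rnorm(200, mean = 5, sd = 2),
  x3 = factor(rep(letters[1:4], each = 50))
)

pdf(NULL)                       # null device, so no file is written
coplot_by_name(dat, "y", "x1", "x3")
invisible(dev.off())
```

In an interactive session the pdf(NULL) and dev.off() lines can be dropped so that the plot appears on the default device.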
Here's a tidyverse version of @nya's solution with expand.grid() and apply(). Each row in ds_plot_parameters represents a single plot. The equation variable is the string eventually passed to coplot().
Each equation is passed to purrr::walk(), which then calls coplot() to produce one graph each. as.formula() converts the string to a formula.
ds_plot_parameters <-
tidyr::expand_grid(
v = c("x1", "x2"),
w = c("x3", "x4")
) |>
dplyr::mutate(
equation = paste0("y ~ ", v, " | ", w),
)
ds_plot_parameters$equation |>
purrr::walk(
\(e) coplot(as.formula(e), rows = 1, data = dat)
)
Gravy:
If you want to pass more inputs to the graph, expand ds_plot_parameters to include other things such as graph and axis titles.
ds_plot_parameters <-
tidyr::expand_grid(
v = c("x1", "x2"),
w = c("x3", "x4")
) |>
dplyr::mutate(
equation = paste0("y ~ ", v, " | ", w),
label_y = "Outcome (mL)",
label_x = paste0(v, " (log 10)")
)
ds_plot_parameters |>
dplyr::select(
# Keep only the columns the function takes; pwalk() matches them to its arguments by name
equation,
label_x,
label_y,
) |>
purrr::pwalk(
.f = \(equation, label_x, label_y) {
coplot(
formula = as.formula(equation),
xlab = label_x,
ylab = label_y,
rows = 1,
data = dat
)
}
)
ds_plot_parameters
# # A tibble: 4 x 5
# v w equation label_y label_x
# <chr> <chr> <chr> <chr> <chr>
# 1 x1 x3 y ~ x1 | x3 Outcome (mL) x1 (log 10)
# 2 x1 x4 y ~ x1 | x4 Outcome (mL) x1 (log 10)
# 3 x2 x3 y ~ x2 | x3 Outcome (mL) x2 (log 10)
# 4 x2 x4 y ~ x2 | x4 Outcome (mL) x2 (log 10)
In this part of my Shiny app, I'll do a 'linear model' (lm()) regression, using the variables the user selects. There are three inputs:
input$lmTrendFun is a selectInput(), with the options c("Linear", "Exponential", "Logarithmic", "Quadratic", "Cubic"):
selectInput("lmTrendFun", "Select the model for your trend line.",
choices = c("Linear", "Exponential", "Logarithmic", "Quadratic", "Cubic"))
The second input is input$lmDep, which is also a selectInput(). I call updateSelectInput() inside an observe() so that the choices are the column names of the imported tibble.
The third input is input$lmInd, a checkboxGroupInput() whose choices are all the column names other than the one already selected as input$lmDep.
From that I want this output: the lm() (or rather, summary.lm() or summary(lm())) result for those variables. If I knew which they were, it would be simple:
if(input$lmTrendFun == "Linear"){
form <- yname ~ x1 + x2
}else if(input$lmTrendFun == "Exponential"){
form <- yname~ exp(x1) + exp(x2)
}else if(input$lmTrendFun == "Logarithmic"){
form <- yname~ log(x1) + log(x2)
}else if(input$lmTrendFun == "Quadratic"){
form <- yname ~ poly(x1, 2) + poly(x2, 2)
}else if(input$lmTrendFun == "Cubic"){
form <- yname ~ poly(x1, 3) + poly(x2, 3)
}
[...]
lm(form, data = .)
where the data (.) has the columns yname, x1 and x2.
However, I don't. So I believe I need some more generic function that can create the formula. How can this be done?
formulizer <- function() as.formula(paste0( input$lmDep, "~", switch(input$lmTrendFun,
Linear = paste0(input$lmInd, collapse=" + "),
Exponential = paste0("exp(", input$lmInd,")", collapse=" + "),
Logarithmic = paste0("log(", input$lmInd,")", collapse=" + "),
Quadratic = paste0("poly(", input$lmInd,", 2)", collapse=" + "),
Cubic = paste0("poly(", input$lmInd,", 3)", collapse=" + ") )))
> input <- list(lmInd=paste0("V", 1:5), lmTrendFun="Linear", lmDep="Vp")
> formulizer()
Vp ~ V1 + V2 + V3 + V4 + V5
<environment: 0x7fad1cf63d48>
> input <- list(lmInd=paste0("V", 1:5), lmTrendFun="Exponential", lmDep="Vp")
> formulizer()
Vp ~ exp(V1) + exp(V2) + exp(V3) + exp(V4) + exp(V5)
<environment: 0x7fad01e694d0>
> input <- list(lmInd=paste0("V", 1:5), lmTrendFun="Quadratic", lmDep="Vp")
> formulizer()
Vp ~ poly(V1, 2) + poly(V2, 2) + poly(V3, 2) + poly(V4, 2) +
poly(V5, 2)
<environment: 0x7fad01f51d20>
> input <- list(lmInd=paste0("V", 1:5), lmTrendFun="Cubic", lmDep="Vp")
> formulizer()
Vp ~ poly(V1, 3) + poly(V2, 3) + poly(V3, 3) + poly(V4, 3) +
poly(V5, 3)
<environment: 0x7fad01f59690>
Consider switch with a vectorized paste0 to build the terms with their transformations, and then pass the terms into reformulate. Adjust the inputs below to the actual Shiny variables:
dep_term <- ...
ind_terms <- ...
form <- switch(input$lmTrendFun,
Linear = reformulate(ind_terms, response=dep_term),
Exponential = reformulate(paste0("exp(", ind_terms, ")"), response=dep_term),
Logarithmic = reformulate(paste0("log(", ind_terms, ")"), response=dep_term),
Quadratic = reformulate(paste0("poly(", ind_terms, ", 2)"), response=dep_term),
Cubic = reformulate(paste0("poly(", ind_terms, ", 3)"), response=dep_term)
)
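Outside of Shiny, the same idea can be tried as a plain function. This is a sketch only: build_formula is a hypothetical helper, its arguments stand in for input$lmDep, input$lmInd and input$lmTrendFun, and the built-in mtcars data set is used purely for illustration.

```r
# Hypothetical helper; in a Shiny server, dep, ind and trend would come from
# input$lmDep, input$lmInd and input$lmTrendFun
build_formula <- function(dep, ind, trend) {
  terms <- switch(trend,
    Linear      = ind,
    Exponential = paste0("exp(", ind, ")"),
    Logarithmic = paste0("log(", ind, ")"),
    Quadratic   = paste0("poly(", ind, ", 2)"),
    Cubic       = paste0("poly(", ind, ", 3)"),
    stop("Unknown trend: ", trend)   # default branch for unexpected input
  )
  reformulate(terms, response = dep)
}

f <- build_formula("mpg", c("wt", "hp"), "Quadratic")
f
summary(lm(f, data = mtcars))
```

Column names that are not syntactic R names (for example, names containing spaces) would need to be wrapped in backticks before being pasted into the terms.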
I have a problem with forward stepwise regression, and my understanding is that I am not passing the argument Data correctly.
I have the function:
ForwardStep <- function(df,yName, Xs, XsMin) {
Data <- df[, c(yName,Xs)]
fit <- glm(formula = paste(yName, " ~ ", paste0(XsMin, collapse = " + ")),
data = Data, family = binomial(link = "logit") )
ScopeFormula <- list(lower = paste(yName, " ~ ", paste0(XsMin, collapse = " + ")),
upper = paste(yName, " ~ ", paste0(Xs, collapse = " + ")))
result <- step(fit, direction = "forward", scope = ScopeFormula, trace = 1 )
return(result)
}
When I try to run it with following arguments
df <- data.frame(Y= rep(c(0,1),25),time = rpois(50,2), x1 = rnorm(50, 0,1),
x2 = rnorm(50,.5,2), x3 = rnorm(50,0,1))
yName = "Y"
Xs <- c("x1","x2","x3")
XsMin <- 1
res <- ForwardStep(df,yName,Xs,XsMin)
I am getting an Error:
Error in is.data.frame(data) : object 'Data' not found
But if I first define Data in Global Env it works perfectly fine.
Data <- df[, c(yName,Xs)]
res <- ForwardStep(df,yName,Xs,XsMin)
I guess that I am using the step function incorrectly, but I don't know exactly how to do it the right way.
You need to realize that formulas always have an associated environment, see help("formula"). One should never pass text to the formula parameter of model functions, never ever. If you do that, you will encounter scoping issues sooner or later. Usually, I'd recommend computing on the language instead, but you can also create the formulas from text in the correct scope:
ForwardStep <- function(df,Yname, Xs, XsMin) {
Data <- df[, c(Yname,Xs)]
f1 <- as.formula(paste(Yname, " ~ ", paste0(XsMin, collapse = " + ")))
fit <- glm(formula = f1,
data = Data, family = binomial(link = "logit") )
f2 <- as.formula(paste(Yname, " ~ ", paste0(XsMin, collapse = " + ")))
f3 <- as.formula(paste(Yname, " ~ ", paste0(Xs, collapse = " + ")))
ScopeFormula <- list(lower = f2,
upper = f3)
step(fit, direction = "forward", scope = ScopeFormula, trace = 1)
}
df <- data.frame(Y= rep(c(0,1),25),time = rpois(50,2), x1 = rnorm(50, 0,1),
x2 = rnorm(50,.5,2), x3 = rnorm(50,0,1))
YName = "Y"
Xs <- c("x1","x2","x3")
XsMin <- 1
res <- ForwardStep(df,YName,Xs,XsMin)
#Start: AIC=71.31
#Y ~ 1
#
# Df Deviance AIC
#<none> 69.315 71.315
#+ x1 1 68.661 72.661
#+ x3 1 68.797 72.797
#+ x2 1 69.277 73.277
(Public service announcement: step-wise regression is a garbage generator. There are better statistical techniques available.)
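The point about formulas carrying an environment can be illustrated in isolation. In this sketch, make_formula and local_data are made-up names; the example only shows that a formula created inside a function keeps a reference to that function's frame, whereas a plain string has no environment attached.

```r
make_formula <- function() {
  # local_data exists only inside this function call
  local_data <- data.frame(y = 1:3, x = c(2, 4, 7))
  # as.formula() attaches the calling frame, i.e. this function's frame
  as.formula("y ~ x")
}

f <- make_formula()

# The formula remembers the frame in which local_data lives
exists("local_data", envir = environment(f), inherits = FALSE)

# A bare string carries no environment at all
is.null(environment("y ~ x"))
```

This is why creating the formula inside ForwardStep matters: functions that later look variables up through the formula's environment can then reach objects such as Data that exist only inside the function.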
I have been able to run a regression with some coefficients constrained to be positive, but I run into a problem when doing a lot of rolling regressions. Here is my sample code:
library(penalized)
set.seed(1)
x1=rnorm(100)*10
x2=rnorm(100)*10
x3=rnorm(100)*10
y=sin(x1)+cos(x2)-x3+rnorm(100)
data <- data.frame(y, x1, x2, x3)
win <- 10
coefs <- matrix(NA, ncol=4, nrow=length(y))
for(i in 1:(length(y)-win)) {
d <- data[(1+i):(win+i),]
p <- win+i
# Linear Regression
coefs[p,] <- as.vector(coef(penalized(y, ~ x1 + x2 + x3, ~1,
lambda1=0, lambda2=0, positive = c(F, F, T), data=data)))}
This is how I usually populate a matrix with the coefficients from a rolling regression, and now I receive this error:
Error in coefs[p, ] <- as.vector(coef(penalized(y, ~x1 + x2 + x3, ~1, :
number of items to replace is not a multiple of replacement length
I assume that this error is produced because the penalized regression function does not always return an intercept plus 3 coefficients. Is there a way to get the penalized function to show the zero coefficients as well, or another way to populate the matrix / data.frame?
Perhaps you are unaware of the which argument of the coef method for "penfit" objects. Have a look at:
getMethod(coef, "penfit")
#function (object, ...)
#{
# .local <- function (object, which = c("nonzero", "all", "penalized",
# "unpenalized"), standardize = FALSE)
# {
# coefficients(object, which, standardize)
# }
# .local(object, ...)
#}
#<environment: namespace:penalized>
We can set which = "all" to report all coefficients. The default is which = "nonzero", which drops zero coefficients and so causes the "number of items to replace is not a multiple of replacement length" error.
The following works:
library(penalized)
set.seed(1)
x1 = rnorm(100)*10
x2 = rnorm(100)*10
x3 = rnorm(100)*10
y = sin(x1) + cos(x2) - x3 + rnorm(100)
data <- data.frame(y, x1, x2, x3)
win <- 10
coefs <- matrix(NA, ncol=4, nrow=length(y))
for(i in 1:(length(y)-win)) {
d <- data[(1+i):(win+i),]
p <- win + i
# fit on the rolling window d rather than on the full data set
pen <- penalized(y, ~ x1 + x2 + x3, ~1, lambda1 = 0, lambda2 = 0,
positive = c(FALSE, FALSE, TRUE), data = d)
beta <- coef(pen, which = "all")
coefs[p,] <- unname(beta)
}
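Another way to populate the matrix, which does not depend on every fit returning the same number of coefficients, is to give the matrix column names and assign by name. The coefficient vectors below are invented values used only to illustrate the indexing; they are not output from penalized().

```r
coef_names <- c("(Intercept)", "x1", "x2", "x3")
coefs <- matrix(NA_real_, nrow = 3, ncol = length(coef_names),
                dimnames = list(NULL, coef_names))

# Hypothetical coefficient vectors, as if a zero coefficient had been dropped
beta_full    <- c("(Intercept)" = 0.4, x1 = 0.1, x2 = -0.2, x3 = 0.9)
beta_partial <- c("(Intercept)" = 0.5, x1 = 0.3, x2 = 0.2)

coefs[1, names(beta_full)]    <- beta_full
coefs[2, names(beta_partial)] <- beta_partial  # x3 stays NA in this row

coefs
```

Rows that were never filled, and coefficients that a fit did not report, remain NA, so they are easy to tell apart from estimated values.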