Different Prediction Result for Formula Containing "%>%" in R

In R, I use "predict" to get a confidence interval for a certain x (x = 42) under the model y = (centered x) + (centered x)^2. I found two possible ways to write the formula:
model1 = lm(y ~ scale(x, center=T, scale=F) + I( (scale(x, center=T, scale=F))^2 ), data=data)
model2 = lm(y ~ (x %>% scale(center=T, scale=F)) + I( (x %>% scale(center=T, scale=F))^2 ), data=data)
The summary results for the two models are the same. But when I ran:
predict(model1, data.frame(x=42), interval="confidence", level=0.95)
predict(model2, data.frame(x=42), interval="confidence", level=0.95)
The results are different, and I am wondering why. Does R treat the two formulas differently because of the use of "%>%"?
The dataset is a practice dataset (SENIC.txt) from Kutner's textbook; y is the 11th column and x is the 12th column.

The issue here is with scale and not %>%: scale returns a one-column matrix rather than a plain vector, and that seems to affect the outcome.
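For instance, a quick check (a minimal illustration, independent of the SENIC data) shows that scale() returns a one-column matrix even for a plain numeric vector:
x <- c(1, 2, 3, 4, 5)
scale(x, center = TRUE, scale = FALSE)
#>      [,1]
#> [1,]   -2
#> [2,]   -1
#> [3,]    0
#> [4,]    1
#> [5,]    2
#> attr(,"scaled:center")
#> [1] 3
class(scale(x, center = TRUE, scale = FALSE))
#> [1] "matrix" "array"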
One solution is to write a vector-in, vector-out equivalent of scale and use that instead:
library("magrittr") # for `%>%`
set.seed(1)
dataset = data.frame(x = rnorm(30))
dataset[["y"]] = 1 + (3 * dataset[["x"]]) + rnorm(30, mean = 0, sd = 0.1)
scale_vector = function(x, ...){
  stopifnot(inherits(x, "numeric"))
  scale(x, ...)[, 1]
}
lm(y ~ scale_vector(x, center=T, scale=F) + I( (scale_vector(x, center=T, scale=F))^2 ), data=dataset)
#>
#> Call:
#> lm(formula = y ~ scale_vector(x, center = T, scale = F) + I((scale_vector(x,
#> center = T, scale = F))^2), data = dataset)
#>
#> Coefficients:
#> (Intercept)
#> 1.27423
#> scale_vector(x, center = T, scale = F)
#> 2.99296
#> I((scale_vector(x, center = T, scale = F))^2)
#> -0.01645
lm(y ~ (x %>% scale_vector(center=T, scale=F)) + I( (x %>% scale_vector(center=T, scale=F))^2 ), data=dataset)
#>
#> Call:
#> lm(formula = y ~ (x %>% scale_vector(center = T, scale = F)) +
#> I((x %>% scale_vector(center = T, scale = F))^2), data = dataset)
#>
#> Coefficients:
#> (Intercept)
#> 1.27423
#> x %>% scale_vector(center = T, scale = F)
#> 2.99296
#> I((x %>% scale_vector(center = T, scale = F))^2)
#> -0.01645
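With the vector-returning version, the two ways of writing the formula also agree at prediction time. A quick check on the toy data above (x = 1 is just an arbitrary new value; note that scale_vector is re-applied to the new data, so the point here is only that the two specifications now behave identically):
model1 = lm(y ~ scale_vector(x, center = TRUE, scale = FALSE) +
              I(scale_vector(x, center = TRUE, scale = FALSE)^2), data = dataset)
model2 = lm(y ~ (x %>% scale_vector(center = TRUE, scale = FALSE)) +
              I((x %>% scale_vector(center = TRUE, scale = FALSE))^2), data = dataset)
predict(model1, data.frame(x = 1), interval = "confidence", level = 0.95)
predict(model2, data.frame(x = 1), interval = "confidence", level = 0.95)
# both calls now return the same fitted value and interval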
Besides, if you do not mind using the tidyverse, this might be cleaner:
library("magrittr") # for `%>%`
set.seed(1)
dataset = tibble::tibble(x = rnorm(30),
                         y = 1 + (3 * x) + rnorm(30, mean = 0, sd = 0.1))
dataset %>%
  dplyr::mutate(x_scaled = scale(x, center = TRUE, scale = FALSE)[, 1]) %>%
  lm(y ~ x_scaled + I(x_scaled^2), data = .)
#>
#> Call:
#> lm(formula = y ~ x_scaled + I(x_scaled^2), data = .)
#>
#> Coefficients:
#>   (Intercept)       x_scaled  I(x_scaled^2)
#>       1.27423        2.99296       -0.01645

Related

Using boxcox inside a user-defined function / object is not a matrix error

I'm trying to create a function to (visually) compare the distribution of a variable with that of the same variable after a Box-Cox transformation.
The variable is a single column pulled out of my entire data frame.
library(EnvStats)
library(ggpubr)   # also needed: ggdensity(), ggarrange(), annotate_figure(), text_grob()
bc_compare_1 <- function(var){
  bc_var <- boxcox(lm(var ~ 1))
  lambda <- bc_var$x[which.max(bc_var$y)]
  var_T <- (var^lambda - 1)/lambda
  g <- ggarrange(
    ggdensity(var, fill = "grey", alpha = 0.3) +
      geom_histogram(colour = 1, fill = "white",
                     position = "identity", alpha = 0) +
      ggtitle("original") +
      theme(plot.title = element_text(size = 11)),
    ggdensity(var_T, fill = "grey", alpha = 0.3) +
      geom_histogram(colour = 1, fill = "white",
                     position = "identity", alpha = 0) +
      ggtitle("transformed") +
      theme(plot.title = element_text(size = 11)))
  g <- annotate_figure(g, top = text_grob(substring(deparse(substitute(var)), 3), size = 11))
  l <- list(g, paste("lambda = ", lambda))
  return(l)
}
This unfortunately doesn't work:
Error in model.frame.default(formula = var ~ 1, drop.unused.levels = TRUE) :
object is not a matrix
I tried several things, but nothing works. It seems the problem is somehow with boxcox() not being able to deal with a linear model created within the function, because I get the same error even in this simple example:
library(EnvStats)
testt <- function(var){
  boxcox(lm(var ~ 1))
}
Edit:
Trying to include the data argument in the lm() call also didn't seem to work:
testt <- function(data, var){
  data %>%
    pull(var) -> dvar
  lmvar <- lm(data = data, formula = dvar ~ 1)
  boxcox(lmvar)
}
This is also no good:
Error in model.frame.default(formula = (data %>% pull(var)) ~ 1, data = data, :
'data' must be a data.frame, environment, or list
(the data is a dataframe)
Any ideas?
Thanks a lot in advance!
Guy
Here is a solution.
The following function uses reformulate to put the formula's pieces together and corrects the lm call: apparently boxcox needs the model matrix x and the response y, and lm does not keep these by default (the x and y arguments default to FALSE; see the Arguments section of the lm documentation).
The column name var can be quoted or not.
suppressPackageStartupMessages(
  library(EnvStats)
)
testt <- function(data, var){
  fmla <- reformulate("1", var)
  lmvar <- lm(formula = fmla, data = data, x = TRUE, y = TRUE)
  boxcox(lmvar)
}
testt(iris, "Sepal.Length")
#> $lambda
#> [1] -2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 2.0
#>
#> $objective
#> [1] 0.9839965 0.9881195 0.9910533 0.9927322 0.9931009 0.9921171 0.9897540
#> [8] 0.9860027 0.9808736
#>
#> $objective.name
#> [1] "PPCC"
#>
#> $optimize
#> [1] FALSE
#>
#> $optimize.bounds
#> lower upper
#> NA NA
#>
#> $eps
#> [1] 2.220446e-16
#>
#> $lm.obj
#>
#> Call:
#> lm(formula = fmla, data = data, x = TRUE, y = TRUE)
#>
#> Coefficients:
#> (Intercept)
#> 5.843
#>
#>
#> $sample.size
#> [1] 150
#>
#> $data.name
#> [1] "lmvar"
#>
#> attr(,"class")
#> [1] "boxcoxLm"
# same output, omitted
testt(iris, Sepal.Length)
Created on 2022-08-26 by the reprex package (v2.0.1)
Edit
An optimize argument could be useful whenever the optimal lambda is needed.
testt <- function(data, var, optimize = FALSE){
  var <- as.character(substitute(var))
  fmla <- reformulate("1", var)
  lmvar <- lm(formula = fmla, data = data, x = TRUE, y = TRUE)
  boxcox(lmvar, optimize = optimize)
}
testt(iris, Sepal.Length)$lambda
#[1] -2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 2.0
testt(iris, Sepal.Length, optimize = TRUE)$lambda
#[1] -0.1117881
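If the goal from the original question is to then actually transform the variable with that optimal lambda, EnvStats ships a helper for the transformation itself. A small sketch (lambda_opt and sl_transformed are just names introduced here):
lambda_opt <- testt(iris, Sepal.Length, optimize = TRUE)$lambda
sl_transformed <- EnvStats::boxcoxTransform(iris$Sepal.Length, lambda = lambda_opt)
head(sl_transformed)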

How to generate 1000 data sets and use Ridge and Lasso regression for all of them in R?

So I need to generate 1000 data sets with 200 observations each in R from the model shown in the image in the original post, and use Lasso and Ridge regression for all of them. Then I need to get the beta_j coefficients for Lasso and Ridge. Can anyone help? Thank you already!
The setup is as you described in the image:
library(magrittr)
library(tidyverse)
library(glmnet)
M <- 9
beta <- c(c(0, 3, 2, 1, 0.5, 0.3),
          rep(0, 10 - 6))
beta <- beta[-1] #glmnet contains the intercept
sigma <- diag(M) + 0.5 - 0.5 * diag(M)
sigma
N <- 200
G <- 1000
Now, make X and y (and check that beta matches the number of columns of X):
Xj <- mvtnorm::rmvnorm(n = N, sigma = sigma) %>%
  set_colnames(paste0("x_", seq_len(ncol(.))))
# X <- cbind(intercept = 1, Xj) # glmnet contains the intercept
X <- Xj
epsilon <- rnorm(n = N, sd = 0.5)
beta %>% length()
X %>% ncol()
# drop() turns the 1 x N result of tcrossprod() into a plain numeric vector
y <- drop(tcrossprod(beta, X)) + epsilon
y
For each dataset, the two sets of model estimates have to be found:
list(
  lasso =
    glmnet::cv.glmnet(
      X, y, family = "gaussian",
      alpha = 1,
      intercept = FALSE
    ),
  ridge =
    glmnet::cv.glmnet(
      X, y, family = "gaussian",
      alpha = 0,
      intercept = FALSE
    )
) %>%
  print() %>%
  map_df(. %>% coef() %>% as.matrix() %>% t() %>% as_tibble(), .id = "type")
Now, one could use replicate, but the number of datasets is very large, so we will have to use parallel programming here:
library(furrr)
plan(multisession, workers = 4)

seq_len(G) %>%
  # seq_len(50) %>%
  furrr::future_map_dfr(
    ~ {
      Xj <- mvtnorm::rmvnorm(n = N, sigma = sigma) %>%
        set_colnames(paste0("x_", seq_len(ncol(.))))
      # X <- cbind(intercept = 1, Xj) # glmnet contains the intercept
      X <- Xj
      epsilon <- rnorm(n = N, sd = 0.5)
      # drop() turns the 1 x N result of tcrossprod() into a plain numeric vector
      y <- drop(tcrossprod(beta, X)) + epsilon
      list(
        lasso =
          glmnet::cv.glmnet(
            X, y, family = "gaussian",
            alpha = 1,
            intercept = FALSE,
            parallel = FALSE
          ),
        ridge =
          glmnet::cv.glmnet(
            X, y, family = "gaussian",
            alpha = 0,
            intercept = FALSE,
            parallel = FALSE
          )
      ) %>%
        # print() %>%
        map_df(. %>%
                 coef() %>%
                 as.matrix() %>%
                 t() %>%
                 as_tibble(), .id = "type") %>%
        mutate(rep = .x)
    },
    .progress = TRUE,
    .options = furrr_options(seed = TRUE)
  ) ->
  results
This will give a progress bar and a rep column that records which simulated dataset each set of model estimates belongs to.
Let us try to summarise these results somehow:
results %>%
  glimpse() %>%
  pivot_longer(c(`(Intercept)`, starts_with("x_")),
               names_to = "parameter", values_to = "estimate") %>%
  glimpse() %>%
  # ggplot(aes(estimate, group = interaction(type, parameter))) +
  ggplot(aes(estimate)) +
  geom_vline(data = tibble(true_beta = beta, parameter = paste0("x_", 1:9)) %>%
               add_row(true_beta = 0, parameter = "(Intercept)"),
             aes(xintercept = true_beta)) +
  # geom_density() +
  stat_bin(geom = "step", aes(y = after_stat(density))) +
  facet_grid(type ~ parameter, scales = "free") +
  ggpubr::theme_pubclean()
For each parameter there is a collection of estimates; these are plotted as histograms, with the true values drawn as vertical lines. The results are quite surprising, to me at least.
Instead of coef one can use glmnet::coef.glmnet and provide s = c("lambda.1se", "lambda.min"). Just for fun, the plot could also be drawn with both of these hyper-parameter lambdas used.
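A rough sketch of pulling both sets of coefficients out of a single fit (hedged: here the two lambdas are passed as numeric values taken from the fitted object, and cv_fit stands for any one of the cv.glmnet fits produced above):
cv_fit <- glmnet::cv.glmnet(X, y, family = "gaussian", alpha = 1, intercept = FALSE)
lambdas <- c(lambda.1se = cv_fit$lambda.1se, lambda.min = cv_fit$lambda.min)
# one column of coefficients per value of s
coef(cv_fit, s = lambdas) %>%
  as.matrix() %>%
  t()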

How to run stratified bootstrapped linear regression in R?

In my model, x is a categorical variable with 3 categories: 0, 1 & 2, where 0 is the reference category. However, category 0 is larger than the others (1, 2), so to avoid a biased sample I want to do stratified bootstrapping, but I could not find any relevant method for that.
df <- data.frame(x = c(0,0,0,0,0,1,1,2,2),
                 y = c(10,11,10,10,12,17,16,20,19),
                 m = c(6,5,6,7,2,10,14,8,11))
df$x <- as.factor(df$x)
df$x <- relevel(df$x,ref = "0")
fit <- lm(y ~ x*m, data = df)
summary(fit)
Expanding on Roland's answer in the comments, you can harvest the confidence intervals from bootstrapping using boot.ci:
library(boot)
b <- boot(df, \(DF, i) coef(lm(y ~ x*m, data = DF[i,])), strata = df$x, R = 999)
result <- do.call(rbind, lapply(seq_along(b$t0), function(i) {
  m <- boot.ci(b, type = 'norm', index = i)$normal
  data.frame(estimate = b$t0[i], lower = m[2], upper = m[3])
}))
result
#>               estimate      lower       upper
#> (Intercept) 12.9189189 10.7166127 15.08403731
#> x1           6.5810811  2.0162637  8.73184665
#> x2           9.7477477  6.9556841 11.37390826
#> m           -0.4459459 -0.8010925 -0.07451434
#> x1:m         0.1959459 -0.1842914  0.55627896
#> x2:m         0.1126126 -0.2572955  0.48352616
And even plot the results like this:
library(ggplot2)
ggplot(within(result, var <- rownames(result)), aes(estimate, var)) +
  geom_vline(xintercept = 0, color = 'gray') +
  geom_errorbarh(aes(xmin = lower, xmax = upper), height = 0.1) +
  geom_point(color = 'red') +
  theme_light()
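If percentile intervals are preferred over the normal approximation, the same harvesting pattern works; only the positions of the bounds in the boot.ci output change (a sketch reusing b from above):
result_perc <- do.call(rbind, lapply(seq_along(b$t0), function(i) {
  m <- boot.ci(b, type = 'perc', index = i)$percent
  # for percentile intervals the bounds sit in columns 4 and 5
  data.frame(estimate = b$t0[i], lower = m[4], upper = m[5])
}))
result_perc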

Data is too long Error in R FlexmixNL package

I tried to search this online, but couldn't exactly figure out what my issue was. Here is my code:
n = 10000
x1 <- runif(n,0,100)
x2 <- runif(n,0,100)
y1 <- 10*sin(x1/10) + 10 + rnorm(n, sd = 1)
y2 <- x2 * cos(x2) - 2 * rnorm(n, sd = 2)
x <- c(x1, x2)
y <- c(x1, x2)
start1 = list(a = 10, b = 5)
start2 = list(a = 30, b = 5)
library(flexmix)
library(flexmixNL)
modelNL <- flexmix(y~x, k = 2,
                   model = FLXMRnlm(formula = y ~ a*x/(b+x),
                                    family = "gaussian",
                                    start = list(start1, start2)))
plot(x, y, col = clusters(modelNL))
and before the plot, it gives me this error:
Error in matrix(1, nrow = sum(groups$groupfirst)) : data is too long
I checked google for similar errors, but I don't quite understand what is wrong with my own code that results in this error.
As you can already tell, I am very new to R, so please explain it in the most layman terms possible. Thank you in advance.
Ironically (in the context of an error message saying the data is "too long"), I think the proximate cause of that error is the missing data argument. If you give it the data in the form of a data frame, you still get an error, but it's not the same one you are experiencing. When you plot the data, you get a rather bizarre set of values, at least from a statistical distribution standpoint, and it's not clear why you are trying to model this with this formula. Nonetheless, with those starting values and a data frame passed to the data argument, one sees results.
> modelNL <- flexmix(y~x, k =2, data=data.frame(x=x,y=y),
+ model = FLXMRnlm(formula = y ~ a*x/(b+x),
+ family = "gaussian",
+ start = list(start1, start2)))
> modelNL
Call:
flexmix(formula = y ~ x, data = data.frame(x = x, y = y), k = 2, model = FLXMRnlm(formula = y ~
a * x/(b + x), family = "gaussian", start = list(start1, start2)))
Cluster sizes:
   1     2
6664 13336
convergence after 20 iterations
> summary(modelNL)
Call:
flexmix(formula = y ~ x, data = data.frame(x = x, y = y), k = 2, model = FLXMRnlm(formula = y ~
a * x/(b + x), family = "gaussian", start = list(start1, start2)))
       prior  size post>0 ratio
Comp.1 0.436  6664  20000 0.333
Comp.2 0.564 13336  16306 0.818
'log Lik.' -91417.03 (df=7)
AIC: 182848.1 BIC: 182903.4
Most R regression functions first check for the matching names in formulae within the data= argument. Apparently this function fails when it needs to go out to the global environment to match formula tokens.
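A minimal illustration of that lookup order with base lm (nothing flexmix-specific; the variable names are made up):
x_glob <- runif(20)
y_glob <- 2 * x_glob + rnorm(20, sd = 0.1)
# no data argument: lm() falls back to the calling environment and still works
lm(y_glob ~ x_glob)
# with data=, the formula names are looked up in the data frame first
lm(y_glob ~ x_glob, data = data.frame(x_glob = x_glob, y_glob = y_glob))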
I tried a formula suggested by the plot of the data and got convergent results:
> modelNL <- flexmix(y~x, k =2, data=data.frame(x=x,y=y),
+ model = FLXMRnlm(formula = y ~ a*x*cos(x+b),
+ family = "gaussian",
+ start = list(start1, start2)))
> modelNL
Call:
flexmix(formula = y ~ x, data = data.frame(x = x, y = y), k = 2, model = FLXMRnlm(formula = y ~
a * x * cos(x + b), family = "gaussian", start = list(start1, start2)))
Cluster sizes:
    1     2
 9395 10605
convergence after 17 iterations
> summary(modelNL)
Call:
flexmix(formula = y ~ x, data = data.frame(x = x, y = y), k = 2, model = FLXMRnlm(formula = y ~
a * x * cos(x + b), family = "gaussian", start = list(start1, start2)))
       prior  size post>0 ratio
Comp.1 0.521  9395  18009 0.522
Comp.2 0.479 10605  13378 0.793
'log Lik.' -78659.85 (df=7)
AIC: 157333.7 BIC: 157389
The reduction in AIC seems huge compared to the first formula.
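To see that gap directly, AIC() can also be called on the fitted flexmix objects themselves (a sketch; modelNL_mm and modelNL_cos are hypothetical names for the two fits above, which would otherwise overwrite each other as modelNL):
AIC(modelNL_mm)   # fit with y ~ a*x/(b+x)
AIC(modelNL_cos)  # fit with y ~ a*x*cos(x+b)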

ggplot exponential smooth with tuning parameter inside exp

ggplot provides various "smoothing methods" or "formulas" that determine the form of the trend line. However, it is unclear to me how the parameters of the formula are specified and how I can get the exponential formula to fit my data. In other words, how do I tell ggplot that it should fit the parameter inside the exp?
df <- data.frame(x = c(65,53,41,32,28,26,23,19))
df$y <- c(4,3,2,8,12,8,20,15)
   x  y
1 65  4
2 53  3
3 41  2
4 32  8
5 28 12
6 26  8
7 23 20
8 19 15
p <- ggplot(data = df, aes(x = x, y = y)) +
  geom_smooth(method = "glm", se = FALSE, color = "black", formula = y ~ exp(x)) +
  geom_point()
p
Problematic fit:
However if the parameter inside the exponential is fit then the form of the trend line becomes reasonable:
p <- ggplot(data = df, aes(x = x, y = y)) +
  geom_smooth(method = "glm", se = FALSE, color = "black", formula = y ~ exp(-0.09 * x)) +
  geom_point()
p
Here is an approach with method nls instead of glm.
You can pass additional parameters to nls with a list supplied to method.args =. Here we define starting values for the a and r coefficients that the fit starts from.
library(ggplot2)
ggplot(data = df, aes(x = x, y = y)) +
  geom_smooth(method = "nls", se = FALSE,
              formula = y ~ a * exp(r * x),
              method.args = list(start = c(a = 10, r = -0.01)),
              color = "black") +
  geom_point()
As discussed in the comments, the best way to get the coefficients on the graph is by fitting the model outside the ggplot call.
model.coeff <- coef(nls(y ~ a * exp(r * x), data = df, start = c(a = 50, r = -0.04)))
ggplot(data = df, aes(x = x, y = y)) +
  geom_smooth(method = "nls", se = FALSE,
              formula = y ~ a * exp(r * x),
              method.args = list(start = c(a = 50, r = -0.04)),
              color = "black") +
  geom_point() +
  geom_text(x = 40, y = 15,
            label = as.expression(substitute(italic(y) == a %.% italic(e)^(r %.% x),
                                             list(a = format(unname(model.coeff["a"]), digits = 3),
                                                  r = format(unname(model.coeff["r"]), digits = 3)))),
            parse = TRUE)
Firstly, to pass additional parameters to the function passed to the method param of geom_smooth, you can pass a list of named parameters to method.args.
Secondly, the problem you're seeing is that glm is placing the coefficient in front of the whole term: y ~ coef * exp(x) instead of inside: y ~ exp(coef * x) like you want. You could use optimization to solve the latter outside of glm, but you can fit it into the GLM paradigm by a transformation: a log link. This works because it's like taking the equation you want to fit, y = exp(coef * x), and taking the log of both sides, so you're now fitting log(y) = coef * x, which is equivalent to what you want to fit and works with the GLM paradigm. (This ignores the intercept. It also ends up in transformed link units, but it's easy enough to convert back if you like.)
You can run this outside of ggplot to see what the models look like:
df <- data.frame(
  x = c(65,53,41,32,28,26,23,19),
  y = c(4,3,2,8,12,8,20,15)
)
bad_model <- glm(y ~ exp(x), family = gaussian(link = 'identity'), data = df)
good_model <- glm(y ~ x, family = gaussian(link = 'log'), data = df)
# this is bad
summary(bad_model)
#>
#> Call:
#> glm(formula = y ~ exp(x), family = gaussian(link = "identity"),
#> data = df)
#>
#> Deviance Residuals:
#>     Min      1Q  Median      3Q     Max
#> -7.7143 -2.9643 -0.8571  3.0357 10.2857
#>
#> Coefficients:
#>               Estimate Std. Error t value Pr(>|t|)
#> (Intercept)  9.714e+00  2.437e+00   3.986  0.00723 **
#> exp(x)      -3.372e-28  4.067e-28  -0.829  0.43881
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for gaussian family taken to be 41.57135)
#>
#> Null deviance: 278.00 on 7 degrees of freedom
#> Residual deviance: 249.43 on 6 degrees of freedom
#> AIC: 56.221
#>
#> Number of Fisher Scoring iterations: 2
# this is better
summary(good_model)
#>
#> Call:
#> glm(formula = y ~ x, family = gaussian(link = "log"), data = df)
#>
#> Deviance Residuals:
#>    Min     1Q Median     3Q    Max
#> -3.745 -2.600  0.046  1.812  6.080
#>
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept)  3.93579    0.51361   7.663 0.000258 ***
#> x           -0.05663    0.02054  -2.757 0.032997 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for gaussian family taken to be 12.6906)
#>
#> Null deviance: 278.000 on 7 degrees of freedom
#> Residual deviance: 76.143 on 6 degrees of freedom
#> AIC: 46.728
#>
#> Number of Fisher Scoring iterations: 6
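Since the log-link fit reports coefficients on the link (log) scale, converting back to the response scale is just an exponential. A small sketch using good_model from above:
exp(coef(good_model))
# (Intercept) ~ 51.2 and x ~ 0.945, i.e. roughly y = 51.2 * exp(-0.0566 * x),
# the same functional form asked about in the question
predict(good_model, newdata = data.frame(x = 40), type = "response")
# predictions with type = "response" are already back on the y scale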
From here, you can reproduce what geom_smooth is going to do: make a sequence of x values across the domain and use the predictions as the y values for the line:
# new data is a sequence across the domain of the model
new_df <- data.frame(x = seq(min(df$x), max(df$x), length = 501))
# `type = 'response'` because we want values for y back in y units
new_df$bad_pred <- predict(bad_model, newdata = new_df, type = 'response')
new_df$good_pred <- predict(good_model, newdata = new_df, type = 'response')
library(tidyr)
library(ggplot2)
new_df %>%
  # reshape to long form for ggplot
  gather(model, y, contains('pred')) %>%
  ggplot(aes(x, y)) +
  geom_line(aes(color = model)) +
  # plot original points on top
  geom_point(data = df)
Of course, it's a lot easier to let ggplot handle all that for you:
ggplot(df, aes(x, y)) +
geom_smooth(
method = 'glm',
formula = y ~ x,
method.args = list(family = gaussian(link = 'log'))
) +
geom_point()
