I use the formula y=x^3+3 to generate a data.frame df with variables x and y,
but when I use lm() to describe the relation between x and y, I get y=81450x-5463207.2. This is very different from the original y=x^3+3.
How can I make lm(), or some other method, recover the original formula?
library(tidyverse)
mf <- function(x){
y=x^3+3
}
df=data.frame()
for (i in 1:300){
df[i,1]=i
df[i,2]=mf(i)
}
names(df) <- c('x','y')
model <- lm(y~x,data = df)
model$coefficients
#DarrenTsai answered first in the comments; if he also writes an answer, consider accepting his first.
lm(y ~ x, data = df) searches for a solution in the form of y = b0 + b1*x which is not how the data were generated. You can tell lm to include x^n using I() as in
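lm(y ~ x + I(x^2) + I(x^3) + I(x^4), data = df)
A short form for x + I(x^2) + I(x^3) + ... + I(x^n) is poly(x, n, raw = TRUE); poly() is also what user2554330 used in the comments. Without raw = TRUE, poly() uses orthogonal polynomials, so the coefficients are not those of the plain powers of x.
Let me make a few changes to your code for better coding style: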
# library(tidyverse) -- you did not use any of this so there is no need to load it
mf <- function(x){ # -- writing this in one without curly braces is an option
y=x^3+3 # -- this will be retrieved as Intercept 3 plus 1*x^3
}
#for (i in 1:300){ -- there is really no need for a loop here
# df[i,1]=i
# df[i,2]=mf(i)
#}
#names(df) <- c('x','y')
df <- data.frame(x = 1:300, #-- this is shorter and faster than the loop
y = mf(1:300))
model <- lm(y ~ poly(x, 5, raw = TRUE), data = df)
round(coef(model), 4)
#> (Intercept) poly(x, 5, raw = TRUE)1
#> 3 0
#> poly(x, 5, raw = TRUE)2 poly(x, 5, raw = TRUE)3
#> 0 1
#> poly(x, 5, raw = TRUE)4 poly(x, 5, raw = TRUE)5
#> 0 0
Created on 2022-09-24 with reprex v2.0.2
(Intercept) is 3 and the coefficient of I(x^3), labelled here as poly(x, 5, raw = TRUE)3, is 1, exactly as coded in mf.
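If you already suspect a purely cubic relation, a single I(x^3) term is enough. A minimal sketch, assuming the same data as in the question (the name fit_cubic is only illustrative):

```r
# Same data as in the question: y = x^3 + 3 for x = 1..300
df <- data.frame(x = 1:300)
df$y <- df$x^3 + 3

# Only an intercept and a cubic term; I() keeps ^ from being
# interpreted as a formula operator
fit_cubic <- lm(y ~ I(x^3), data = df)
round(coef(fit_cubic), 4)
```

The two coefficients correspond to the constant 3 and the factor 1 in front of x^3 in mf.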
Related
Forewarning: I am a complete noob, so I'm sorry for the dumb question. I've tried everything trying to figure out how to write out the actual polynomial function given these coefficients.
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 89.6131 0.8525 105.119 < 2e-16
poly(log(x), 3, raw = TRUE)1 -36.8351 2.3636 -15.584 1.13e-10
poly(log(x), 3, raw = TRUE)2 6.9735 1.6968 4.110 0.000928
poly(log(x), 3, raw = TRUE)3 -0.7105 0.3124 -2.274 0.038063
I thought that it would just be f(x) = 89.6131 - 36.8351*log(x) + 6.9735*log(x^2) - 0.7105*log(x^3).
I've tried a bunch of variations of this but nothing seems to work. I'm trying to plug my polynomial function and my x-values in to Desmos and get it to return what I'm getting in R which is:
1 2 3 4 5 6
9.806469 15.028672 20.317227 25.669588 28.757896 35.816853
7 8 9 10 11 12
41.334623 43.919057 49.267966 53.880519 60.862101 63.830004
13 14 15 16 17 18
70.390727 79.412081 80.416065 85.214063 86.165068 98.187744
19
96.723278
My x values are:
x = c(49.64,34.61,23.76,16.31,13.23,8.47,6.19,5.4,4.15,3.37,2.53,2.26,1.79,1.34,1.3,1.13,1.1,0.8,0.83)
Modeling code:
#data
x = c(49.64,34.61,23.76,16.31,13.23,8.47,6.19,5.4,4.15,3.37,2.53,2.26,1.79,1.34,1.3,1.13,1.1,0.8,0.83)
y = c(10,15,20,25,30,35,40,45,50,55,60,65,70,75,80,85,90,95,100)
#fitting the model
model1 <- lm(y~poly(log(x),3,raw=TRUE))
new.distance <- data.frame(
distance = c(49.64,34.61,23.76,16.31,13.23,8.47,6.19,5.4,4.15,3.37,2.53,2.26,1.79,1.34,1.3,1.13,1.1,0.83,0.8)
)
predict(model1, newdata = new.distance)
summary(model1)
Libraries
library(tidyverse)
Sample data
x <- c(49.64,34.61,23.76,16.31,13.23,8.47,6.19,5.4,4.15,3.37,2.53,2.26,1.79,1.34,1.3,1.13,1.1,0.8,0.83)
y <- c(10,15,20,25,30,35,40,45,50,55,60,65,70,75,80,85,90,95,100)
df <-
tibble(
x = x,
y = y
) %>%
mutate(
lx = log(x)
)
Fitting model
model1 <- lm(y~poly(log(x),3,raw=TRUE))
Predicting data
df_to_pred <-
data.frame(
x_to_pred = c(49.64,34.61,23.76,16.31,13.23,8.47,6.19,5.4,4.15,3.37,2.53,2.26,1.79,1.34,1.3,1.13,1.1,0.83,0.8)
)
Original data x predicted
Predicted data - function x manual
df %>%
cbind(y_pred_model = predict(model1, newdata = df)) %>%
mutate(y_pred_manual = 89.6131 - 36.8351*log(x) + 6.9735*log(x)^2 - 0.7105*log(x)^3) %>%
ggplot(aes(y_pred_manual,y_pred_model))+
geom_abline(intercept = 0,slope = 1,size = 1, col = "red")+
geom_point()
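The key point is that the raw polynomial terms are powers of log(x), that is log(x)^2 and log(x)^3, not log(x^2) and log(x^3). A small sketch that rebuilds the fitted values from coef() instead of retyping rounded numbers (b and manual are illustrative names, not from the question):

```r
x <- c(49.64, 34.61, 23.76, 16.31, 13.23, 8.47, 6.19, 5.4, 4.15, 3.37,
       2.53, 2.26, 1.79, 1.34, 1.3, 1.13, 1.1, 0.8, 0.83)
y <- seq(10, 100, by = 5)

model1 <- lm(y ~ poly(log(x), 3, raw = TRUE))

# Coefficients in order: intercept, log(x), log(x)^2, log(x)^3
b <- unname(coef(model1))
manual <- b[1] + b[2] * log(x) + b[3] * log(x)^2 + b[4] * log(x)^3

# TRUE if the hand-written polynomial reproduces predict()
all.equal(manual, unname(predict(model1)))
```

The same expression, with log(x)^2 and log(x)^3, is what has to be typed into Desmos.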
Consider the following data:
set.seed(42)
y <- runif(100)
df <- data.frame("Exp" = rexp(100), "Norm" = rnorm(100), "Wei" = rweibull(100, 1))
I want to perform a linear regression where the right-hand side of the formula is given as a string:
form <- "Exp + Norm + Wei"
I thought I only had to use:
as.formula(lm(y~form, data = df))
However, it's not working. The error is about the variables having different lengths (it seems that form is still treated as a character vector of length 1, but I have no idea why).
Do you know how I can do this?
We can use paste to construct the formula as a string and pass it directly to lm, which converts it to a formula:
lm(paste('y ~', form), data = df)
-output
#Call:
#lm(formula = paste("y ~", form), data = df)
#Coefficients:
#(Intercept) Exp Norm Wei
# 0.495861 0.026988 0.046689 0.003612
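As an alternative sketch, the formula object can be built explicitly before calling lm, either with as.formula() on the pasted string or with stats::reformulate() on a vector of term names. The names f1, f2, fit1 and fit2 are only illustrative:

```r
set.seed(42)
y <- runif(100)
df <- data.frame(Exp = rexp(100), Norm = rnorm(100), Wei = rweibull(100, 1))
form <- "Exp + Norm + Wei"

# Option 1: paste the string together, then convert it to a formula
f1 <- as.formula(paste("y ~", form))
fit1 <- lm(f1, data = df)

# Option 2: build the formula from a character vector of term names
f2 <- reformulate(names(df), response = "y")
fit2 <- lm(f2, data = df)

# Both routes describe the same model
all.equal(coef(fit1), coef(fit2))
```

With an explicit formula object, the Call shown when printing the model refers to f1 or f2 instead of the paste() expression.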
In the modelr package, the function gather_predictions can be used to add predictions from multiple models to a data frame. However, I'm unsure how to specify these models in the function call. The help documentation gives the following example:
df <- tibble::data_frame(
x = sort(runif(100)),
y = 5 * x + 0.5 * x ^ 2 + 3 + rnorm(length(x))
)
m1 <- lm(y ~ x, data = df)
grid <- data.frame(x = seq(0, 1, length = 10))
grid %>% add_predictions(m1)
m2 <- lm(y ~ poly(x, 2), data = df)
grid %>% spread_predictions(m1, m2)
grid %>% gather_predictions(m1, m2)
Here the models are named explicitly in the function call. That works fine if we have a few models we want predictions for, but what if we have a large or unknown number of models? In that case manually specifying the models isn't really workable anymore.
The way the help documentation phrases the arguments section seems to suggest you need to add every model as a separate argument.
gather_predictions and spread_predictions take multiple models. The
name will be taken from either the argument name of the name of the
model.
And for example inputting a list of models into gather_predictions doesn't work.
Is there some easy way to input a list / large amount of models to gather_predictions?
example for 10 models in a list:
modelslist <- list()
for (N in 1:10) {
modelslist[[N]] <- lm(y ~ poly(x, N), data = df)
}
If having the models stored some other way than a list works better, that's fine as well.
m <- grid %>% gather_predictions(lm(y ~ poly(x, 1), data = df))
for (N in 2:10) {
m <- rbind(m, grid %>% gather_predictions(lm(y ~ poly(x, N), data = df)))
}
There are workarounds to solve this problem. My approach was to:
1. build a list of models with specific names
2. use a tweaked version of modelr::gather_predictions() to apply all models in the list to data
# prerequisites
library(tidyverse)
set.seed(1363)
# I'll use generic name 'data' throughout the code, so you can easily try other datasets.
# for this example I'll use your data df
data=df
# data visualization
ggplot(data, aes(x, y)) +
geom_point(size=3)
your sample data
# build a list of models
models <-vector("list", length = 5)
model_names <- vector("character", length=5)
for (i in 1:5) {
modelformula <- str_c("y ~ poly(x,", i, ")", sep="")
models[[i]] <- lm(as.formula(modelformula), data = data)
model_names[[i]] <- str_c('model', i) # remember we name the models here sequentially
}
# apply names to the models list
names(models) <- model_names
# this is a modified version of modelr::gather_predictions() that accepts a list of models
gather.predictions <- function (data, models, .pred = "pred", .model = "model")
{
df <- map2(models, .pred, modelr::add_predictions, data = data)
names(df) <- names(models)
bind_rows(df, .id = .model)
}
# the rest is the same as modelr's function...
grids <- gather.predictions(data = data, models = models, .pred = "y")
ggplot(data, aes(x, y)) +
geom_point() +
geom_line(data = grids, colour = "red") +
facet_wrap(~ model)
example of polynomial models (degree 1:5) applied to your sample data
side note: there are good reasons why I chose strings to build the model...to discuss.
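If modelr is not a hard requirement, the same long-format table can be sketched with base R only. This is an alternative under my own assumptions, not modelr's implementation; gather_list is a hypothetical helper name:

```r
set.seed(1363)
x <- sort(runif(100))
df <- data.frame(x = x, y = 5 * x + 0.5 * x^2 + 3 + rnorm(length(x)))
grid <- data.frame(x = seq(0, 1, length.out = 10))

# Named list of polynomial models of degree 1 to 5
models <- lapply(1:5, function(n) {
  lm(as.formula(paste0("y ~ poly(x, ", n, ")")), data = df)
})
names(models) <- paste0("model", 1:5)

# Hypothetical helper: one block of predictions per model, stacked row-wise
gather_list <- function(data, models) {
  per_model <- lapply(names(models), function(nm) {
    data.frame(model = nm, data,
               pred = unname(predict(models[[nm]], newdata = data)))
  })
  do.call(rbind, per_model)
}

long <- gather_list(grid, models)
head(long)
```

The result has one row per combination of model and grid point, with the model name in the first column.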
Thanks to this post regarding the failure of stepwise variable selection in lm,
I have data that looks like the example below, as described in that post:
set.seed(1) # for reproducible example
x <- sample(1:500,500) # need this so predictors are not perfectly correlated.
x <- matrix(x,nc=5) # 100 rows, 5 cols
y <- 1+ 3*x[,1]+2*x[,2]+4*x[,5]+rnorm(100) # y depends on variables 1, 2, 5 only
# you start here...
df <- data.frame(y,as.matrix(x))
full.model <- lm(y ~ ., df) # include all predictors
step(full.model,direction="backward")
What I need is to select only the 5 best variables, and then the 6 best variables, out of these 20. Does anyone know how to impose this constraint?
MuMIn::dredge() has an m.lim argument that sets lower and upper limits on the number of terms in each model. Note: the number of combinations, and hence the time required, grows exponentially with the number of predictors.
set.seed(1) # for reproducible example
x <- sample(100*20)
x <- matrix(x, nc = 20) # 20 predictor
y <- 1 + 2*x[,1] + 3*x[,2] + 4*x[,3] + 5*x[,7] + 6*x[,8] + 7*x[,9] + rnorm(100) # y depends on variables 1,2,3,7,8,9 only
df <- data.frame(y, as.matrix(x))
full.model <- lm(y ~ ., df) # include all predictors
library(MuMIn)
options(na.action = "na.fail") # dredge() refuses to run with the default na.omit
dredge(full.model, m.lim = c(5, 5), trace = 2) # trace = 2 displays a progress bar; result: x2, x3, x7, x8, x9
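For a small number of predictors, the "best k variables" constraint can also be sketched by brute force in base R. This is not how dredge() works internally; best_subset is a hypothetical helper, the data set is deliberately smaller than the one above, and AIC is just one possible ranking criterion:

```r
set.seed(1)
x <- matrix(sample(100 * 8), ncol = 8)   # 8 predictors, 100 rows
y <- 1 + 2 * x[, 1] + 3 * x[, 2] + 4 * x[, 5] + rnorm(100)
df <- data.frame(y, x)                   # columns: y, X1, ..., X8

# Hypothetical helper: fit every model with exactly k predictors
# and return the predictor names of the one with the lowest AIC
best_subset <- function(data, response, k) {
  predictors <- setdiff(names(data), response)
  combos <- combn(predictors, k, simplify = FALSE)
  aics <- vapply(combos, function(vars) {
    AIC(lm(reformulate(vars, response = response), data = data))
  }, numeric(1))
  combos[[which.min(aics)]]
}

best_subset(df, "y", 3)
```

The number of fits is choose(p, k), so this only scales to modest p; with 20 predictors and k = 5 it already means 15504 models.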
In R, one can build an lm() or glm() object with fixed coefficients, using the offset parameter in a formula.
x=seq(1,100)
y=x^2+3*x+7
# Forcing to fit the polynomial: 2x^2 + 4x + 8
fixed_model = lm(y ~ 0 + offset(8 + 4*x + 2*I(x^2) ))
Is it possible to do the same thing using poly()? I tried the code below but it doesn't seem to work.
fixed_model_w_poly <- lm(y ~ 0 + offset(poly(x, order=2, raw = TRUE, coefs= c(8, 4, 2))))
Error : number of offsets is 200, should equal 100 (number of observations)
I want to use poly() as a convenient interface to run iterations with a high number of fixed coefficients or order values, rather than having to manually code: offset(8 + 4*x + 2*I(x^2) ) for each order/coefficient combination.
P.S: Further but not essential information: This is to go inside an MCMC routine. So an example usage would be to generate (and then compare) model_current to model_next in the below code:
library(MASS)
coeffs_current <- c(8, 4, 2)
model_current <- lm(y ~ 0 + offset(poly(x, order=2, raw = TRUE, coefs= coeffs_current )))
cov <- diag(rep(1,3))
coeffs_next <- mvrnorm(1, mu = as.numeric(coeffs_current ),
Sigma = cov )
model_next <- lm(y ~ 0 + offset(poly(x, order=2, raw = TRUE, coefs= coeffs_next )))
This demonstrates what I suggested. (Not to use poly.)
library(MASS)
# coeffs_current <- c(8, 4, 2) Name change for compactness.
cc <- c(8, 4, 2)
form <- as.formula(bquote(y~x+offset(.(cc[1])+x*.(cc[2])+.(cc[3])*I(x^2) )))
model_current <- lm(form, data=dat)
I really have no idea what you intend to do with this next code. Looks like you want something based on the inputs to the prior function, but doesn't look like you want it based on the results.
cov <- diag(rep(1,3))
coeffs_next <- mvrnorm(1, mu = as.numeric(cc ),
Sigma = cov )
The code works (at least as I intended) with a simple test case. The bquote function substitutes values into expressions (well actually calls) and the as.formula function evaluates its argument and then dresses the result up as a proper formula-object.
dat <- data.frame(x=rnorm(20), y=rnorm(20) )
cc <- c(8, 4, 2)
form <- as.formula( bquote(y~x+offset(.(cc[1])+x*.(cc[2])+.(cc[3])*I(x^2) )))
model_current <- lm(form, data=dat)
#--------
> model_current
Call:
lm(formula = form, data = dat)
Coefficients:
(Intercept) x
-9.372 -5.326 # Bizarre results due to the offset.
#--------
form
#y ~ x + offset(8 + x * 4 + 2 * I(x^2))
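To avoid hand-writing the offset for every coefficient vector, one option is to compute the offset as a plain numeric vector first and pass that to offset(). A minimal sketch, assuming the coefficients are ordered from the intercept upward; poly_offset is a hypothetical helper, not part of stats:

```r
x <- seq(1, 100)
y <- x^2 + 3 * x + 7

# Hypothetical helper: evaluates coefs[1] + coefs[2]*x + coefs[3]*x^2 + ...
# for a coefficient vector of any length
poly_offset <- function(x, coefs) {
  powers <- seq_along(coefs) - 1           # 0, 1, 2, ...
  drop(outer(x, powers, "^") %*% coefs)    # columns are x^0, x^1, x^2, ...
}

off <- poly_offset(x, c(8, 4, 2))          # the fixed polynomial 2x^2 + 4x + 8

# All coefficients are fixed through the offset; nothing is estimated
fixed_model <- lm(y ~ 0 + offset(off))
head(residuals(fixed_model))
```

Inside an MCMC loop, only the coefficient vector passed to poly_offset() would change between iterations.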