How to convert log function in RStudio?

fit1 = lm(price ~ . , data = car)
fit2 = lm(log(price) ~ . , data = car)
I'm not sure how to convert log(price) back to price in fit2. Won't it just become the same thing as fit1 if I do convert it? Please help.

Let's take a very simple example. Suppose I have some data points like this:
library(ggplot2)
df <- data.frame(x = 1:10, y = (1:10)^2)
(p <- ggplot(df, aes(x, y)) + geom_point())
I want to try to fit a model to them, but don't know what form this should take. I try a linear regression first and plot the resultant prediction:
mod1 <- lm(y ~ x, data = df)
(p <- p + geom_line(aes(y = predict(mod1)), color = "blue"))
Next I try a linear regression on log(y). Any predictions from this model will be predicted values of log(y). But I don't want log(y) predictions, I want y predictions, so I need to take the 'anti-log' of the prediction. We do this in R with exp():
mod2 <- lm(log(y) ~ x, data = df)
(p <- p + geom_line(aes(y = exp(predict(mod2))), color = "red"))
But we can see that we have different regression lines. That's because when we took the log of y, we were effectively fitting a straight line on the plot of log(y) against x. When we transform the axis back to a non-log axis, our straight line becomes an exponential curve. We can see this more clearly by drawing our plot again with a log-transformed y axis:
p + scale_y_log10(limits = c(1, 500))
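Tying this back to the original question (a minimal sketch of the same idea, assuming car has a numeric price column): predict() on fit2 returns values of log(price), so wrap it in exp() to get predictions on the price scale. The model is not the same as fit1; only the predictions are mapped back onto the original units.
log_pred <- predict(fit2)      # predictions of log(price)
price_pred <- exp(log_pred)    # back-transformed onto the price scale
head(price_pred)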

Related

how to force a polynomial regression to be very flexible for big sharp turns while reducing flexibility for small turns

Is it possible to force a polynomial regression to be very flexible for big sharp turns while reducing flexibility for small turns?
The reason is that I have a variable y which depends on x, and I am sure there is a positive correlation, although my dataset contains some noise. This is why I do not want a wiggly line for x between 1 and 75 as in the graph below.
library(ggplot2)
library(dplyr)
x <- c(1:100)
y <- c(1,3,6,12,22,15,13,11,5,1,3,6,12,22,11,5,1,3,6,12,22,11,5,5,9,10,1,6,12,22,15,13,11,5,-1,-12,-23,6,12,22,11,5,1,3,6,12,22,11,5,-11,-22,-9,12,22,11,5,9,10,18,1,3,6,12,22,15,13,11,5,-5, -9, -12,6,12,22,11,5,1,3,6,12,22,11,5,1,3,6,12,22,11,5,9,10,18,28,37,50,90,120,150,200)
y <- y + x
df <- data.frame(x, y)
model <- lm(y ~ poly(x, 6, raw = TRUE), data = df)
predictions <- model %>% predict(df)
df <- cbind(df, predictions)
ggplot() +
geom_point(data = df, aes(x = x, y = y), size = 0.1) +
geom_line(data = df, aes(x = x, y = predictions), colour="blue", size=0.1)
I can alter the model to:
model <- lm(y ~ poly(x, 2, raw = TRUE), data = df)
Which gives this graph:
In this case the model has no wiggliness for x between 0 and 90, but it lacks the flexibility to make the turn around x = 90.
I am not looking for a specific solution for this example dataset. I am looking for a way to make a polynomial regression flexible enough to make sharp (big) turns, while reducing wiggliness for small turns at the same time. (Maybe this could be solved by limiting the fit to a maximum of n turns?)
I want to run this in an automated way on several datasets. For this reason I do not want to specify different ranges of x for different models.
Thank you!
I have also tried using gam from mgcv, although this gives similar results:
mygam <- gam(y ~ s(x, k = 7), data = df)
mygam <- gam(y ~ s(x, k = 3), data = df)
This graph is based on pmax(p1, p2), where p1 and p2 are two polynomials.
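As an aside (my own sketch, not part of the question): one way to see how the basis dimension k trades flexibility against wiggliness is to overlay the two gam fits above on the data, reusing df from the earlier code. A larger k permits a wigglier curve; a smaller k forces a smoother one.
library(mgcv)
library(ggplot2)
mygam7 <- gam(y ~ s(x, k = 7), data = df)
mygam3 <- gam(y ~ s(x, k = 3), data = df)
df$fit_k7 <- predict(mygam7)
df$fit_k3 <- predict(mygam3)
ggplot(df, aes(x = x, y = y)) +
  geom_point(size = 0.1) +
  geom_line(aes(y = fit_k7), colour = "blue") +
  geom_line(aes(y = fit_k3), colour = "red")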

Plotting multiple lm() models in one plot

I have fitted 6 lm() models and 1 gam() model on the same dataset.
Now I want to plot them all in one plot on top of each other. Can I do this without defining the models again in ggplot?
My case is this
I have
model1 <- lm(y~1, data = data) %>% coef()
model2 <- lm(y~x, data = data) %>% coef()
model3 <- lm(y~abs(x), data = data) %>% coef()
...
model7 <- gam(y~s(x), data = data) %>% coef()
Can I feed the stored coefficients of my models to ggplot?
ggplot(data, mapping = aes(x = x, y = y)) +
geom_point() +
geom_abline(model1) +
geom_abline(model2) +
....
Or is the only way to plot the model prediction lines to manually fill in the parameters, like this:
ggplot(data, mapping = aes(x = x, y = y)) +
geom_point() +
geom_abline(intercept = model1[1]) +
geom_abline(slope = model2[2], intercept = model2[1]) +
geom_abline(slope = model3[2], intercept = model3[1]) +
...
Example code
set.seed(123)
x <- rnorm(50)
y <- rweibull(50,1)
d <- as.data.frame(cbind(x,y))
model1 <- coef(lm(y~1, data = d))
model2 <- coef(lm(y~x, data = d))
model3 <- coef(lm(y~abs(x), data = d))
Including the SE for each line/model and a legend would be welcome as well.
In order for this to work, you really need to save the whole model objects, not just their coefficients. So let's assume you have the full models:
# set.seed(101) used for sample data
model1 <- lm(y~1, data = d)
model2 <- lm(y~x, data = d)
model3 <- lm(y~abs(x), data = d)
We can write a helper function to predict new values from these models over the given range of x values. Here's such a function:
newvalsforx <- function(x) {
  # 100 evenly spaced values spanning the observed range of x
  xrng <- seq(min(x), max(x), length.out = 100)
  # return a function that predicts from a given model over that grid
  function(m) data.frame(x = xrng, y = predict(m, data.frame(x = xrng)))
}
pred <- newvalsforx(d$x)
This pred() function will make predictions from the models over the observed range of x. We can then use these as new data to pass to geom_line() layers that we add to the plot. For example:
ggplot(d, aes(x,y)) +
geom_point() +
geom_line(data=pred(model1), color="red") +
geom_line(data=pred(model2), color="blue") +
geom_line(data=pred(model3), color="green")
This gives me
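The question also asked for standard errors and a legend. One way to extend the idea (my own sketch, not part of the answer above) is to build a prediction data frame per model with se.fit = TRUE, stack them with a model label, and map that label to colour and fill:
pred_se <- function(m, label, x) {
  xrng <- seq(min(x), max(x), length.out = 100)
  p <- predict(m, data.frame(x = xrng), se.fit = TRUE)
  data.frame(x = xrng, fit = p$fit,
    lwr = p$fit - 1.96 * p$se.fit,
    upr = p$fit + 1.96 * p$se.fit,
    model = label)
}
allpred <- rbind(pred_se(model1, "y ~ 1", d$x),
  pred_se(model2, "y ~ x", d$x),
  pred_se(model3, "y ~ abs(x)", d$x))
ggplot(d, aes(x, y)) +
  geom_point() +
  geom_ribbon(data = allpred, aes(x = x, ymin = lwr, ymax = upr, fill = model),
    alpha = 0.2, inherit.aes = FALSE) +
  geom_line(data = allpred, aes(x = x, y = fit, colour = model), inherit.aes = FALSE)
The approximate 95% bands here come straight from the standard errors returned by predict().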

How to create a 2nd order trendline in R [duplicate]

I have a simple polynomial regression which I do as follows
attach(mtcars)
fit <- lm(mpg ~ hp + I(hp^2))
Now, I plot as follows
> plot(mpg~hp)
> points(hp, fitted(fit), col='red', pch=20)
This gives me the following
I want to connect these points into a smooth curve; using lines() gives me the following:
> lines(hp, fitted(fit), col='red', type='b')
What am I missing here? I want the output to be a smooth curve which connects the points.
I like to use ggplot2 for this because it's usually very intuitive to add layers of data.
library(ggplot2)
fit <- lm(mpg ~ hp + I(hp^2), data = mtcars)
prd <- data.frame(hp = seq(from = range(mtcars$hp)[1], to = range(mtcars$hp)[2], length.out = 100))
err <- predict(fit, newdata = prd, se.fit = TRUE)
prd$lci <- err$fit - 1.96 * err$se.fit
prd$fit <- err$fit
prd$uci <- err$fit + 1.96 * err$se.fit
ggplot(prd, aes(x = hp, y = fit)) +
theme_bw() +
geom_line() +
geom_smooth(aes(ymin = lci, ymax = uci), stat = "identity") +
geom_point(data = mtcars, aes(x = hp, y = mpg))
Try:
lines(sort(hp), fitted(fit)[order(hp)], col='red', type='b')
This is because the observations in your dataset are not ordered by hp; lines() connects the points in the order they appear, so without sorting the result is a mess.
Generally a good way to go is to use the predict() function. Pick some x values, use predict() to generate corresponding y values, and plot them. It can look something like this:
newdat = data.frame(hp = seq(min(mtcars$hp), max(mtcars$hp), length.out = 100))
newdat$pred = predict(fit, newdata = newdat)
plot(mpg ~ hp, data = mtcars)
with(newdat, lines(x = hp, y = pred))
See Roman's answer for a fancier version of this method, where confidence intervals are calculated too. In both cases the actual plotting of the solution is incidental - you can use base graphics or ggplot2 or anything else you'd like - the key is just to use the predict() function to generate the proper y values. It's a good method because it extends to all sorts of fits, not just polynomial linear models. You can use it with non-linear models, GLMs, smoothing splines, etc. - anything with a predict method.
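For instance, the same recipe works with a loess smoother (an illustrative sketch of my own, not from the answers above); the only part that changes is the model you hand to predict():
# same predict-then-plot pattern, with a loess fit instead of a polynomial lm
lo <- loess(mpg ~ hp, data = mtcars)
newdat <- data.frame(hp = seq(min(mtcars$hp), max(mtcars$hp), length.out = 100))
newdat$pred <- predict(lo, newdata = newdat)
plot(mpg ~ hp, data = mtcars)
with(newdat, lines(x = hp, y = pred))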

Adding R squared value to orthogonal regression line in R

I have produced a scatter plot in R of expected/observed values. I calculated orthogonal regression and added the line using the following:
library(ggplot2)
library(readr)     # for read_csv()
library(MethComp)
r <- read_csv("Uni/MSci/Project/DATA/new data sheets/comparisons/for comarison graphs/R Regression/GCdNi.csv")
x<-r[1]
y<-r[2]
P<-ggplot()+geom_point(aes(x=x,y=y))+
scale_size_area()+xlab("Expected")+ylab("Observed")+ggtitle("G - Cd x Ni")+
xlim(0, 40)+ylim(0, 40)
# Orthogonal, total least squares or Deming regression
deming <- Deming(y=r$Observed, x=r$Expected)[1:2]
deming
R <- prcomp( ~ r$Expected + r$Observed )
slope <- R$rotation[2,1] / R$rotation[1,1]
slope
intercept <- R$center[2] - slope*R$center[1]
intercept
#Plot orthogonal regression
P+geom_abline(intercept = deming[1], slope = deming[2])
This gives me the following plot:
Is there a way I can calculate and add an R squared value to the graph?
Here's some of the data frame to allow for reproduction:
Expected Observed
2.709093153 1.37799781
2.611562579 1.410720257
2.22411805 1.287685907
3.431914392 1.906787706
3.242018129 1.823698676
3.46139841 1.767857729
2.255673738 1.111307235
2.400606765 1.294583377
1.818447253 0.995226256
2.528992184 1.173159775
2.46829393 1.101852756
1.826044939 0.883336715
1.78702201 1.050122993
2.37226253 1.025298403
2.140921846 1.094761918
I could not reproduce your data, but here's how you could do something like that with linear regression.
library(ggplot2)
set.seed(1)
x <- rnorm(20, 1, 100)
y <- x + rnorm(20, 50, 10)
regression <- lm(y ~ x)
r2 <- summary(regression)$r.squared
ggplot() + geom_point(aes(x, y)) +
geom_line(aes(x, regression$fitted.values)) +
annotate("text", x = -100, y = 200, label = paste0("r squared = ", r2))
In the future, you should provide a reproducible example.
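If you do want a single number for the orthogonal fit itself (my own suggestion, not part of the answer above), the squared Pearson correlation between Expected and Observed is a common goodness-of-fit summary, and it can be annotated in the same way:
# sketch, assuming r, P and deming from the question's code
r2_orth <- cor(r$Expected, r$Observed)^2
P + geom_abline(intercept = deming[1], slope = deming[2]) +
  annotate("text", x = 10, y = 35, label = paste0("r squared = ", round(r2_orth, 3)))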

Obtaining predictions from an mgcv::gam fit that contains a matrix "by" variable to a smooth

I just discovered that mgcv::s() permits one to supply a matrix to its by argument, allowing one to smooth a continuous variable with separate smooths for each of a combination of variables (and their interactions if so desired). However, I'm having trouble getting sensible predictions from such models, for example:
library(mgcv) #for gam
library(ggplot2) #for plotting
#Generate some fake data
set.seed(1) #for replicability of this example
myData = expand.grid(
var1 = c(-1,1)
, var2 = c(-1,1)
, z = -10:10
)
myData$y = rnorm(nrow(myData)) + (myData$z^2 + myData$z*4) * myData$var1 +
(3*myData$z^2 + myData$z) * myData$var2
#note additive effects of var1 and var2
#plot the data
ggplot(
data = myData
, mapping = aes(
x = z
, y = y
, colour = factor(var1)
, linetype = factor(var2)
)
)+
geom_line(
alpha = .5
)
#reformat to matrices
zMat = matrix(rep(myData$z,times=2),ncol=2)
xMat = matrix(c(myData$var1,myData$var2),ncol=2)
#get the fit
fit = gam(
formula = myData$y ~ s(zMat,by=xMat,k=5)
)
#get the predictions and plot them
predicted = myData
predicted$value = predict(fit)
ggplot(
data = predicted
, mapping = aes(
x = z
, y = value
, colour = factor(var1)
, linetype = factor(var2)
)
)+
geom_line(
alpha = .5
)
Yields this plot of the input data:
And this obviously awry plot of the predicted values:
Whereas replacing the gam fit above with:
fit = gam(
formula = y ~ s(z,by=var1,k=5) + s(z,by=var2,k=5)
, data = myData
)
but otherwise running the same code yields this reasonable plot of predicted values:
What am I doing wrong here?
The use of vector-valued inputs to mgcv smooths is taken up here. It seems to me that you are misunderstanding these model types.
Your first formula
myData$y ~ s(zMat,by=xMat,k=5)
fits the model
y ~ f(z)*x_1 + f(z)*x_2
That is, mgcv estimates a single smooth function f(). This function is evaluated at each covariate, with the weightings supplied to the by argument.
Your second formula
y ~ s(z,by=var1,k=5) + s(z,by=var2,k=5)
fits the model
y ~ f_1(z)*x_1 + f_2(z)*x_2
where f_1() and f_2() are two different smooth functions. Your data model is essentially the second formula, so it is not surprising that it gives a more sensible looking fit.
The first formula is useful when you want an additive model where a single function is evaluated on each variable, with given weightings.
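To make that concrete, here is a small illustration (my own sketch, not from the answer above) of data that really does follow the single-function model; with such data the matrix by form behaves sensibly:
library(mgcv)
set.seed(1)
n <- 200
z <- runif(n, -1, 1)
x1 <- runif(n)
x2 <- runif(n)
# a single shared f(), here f(z) = z^2, weighted by x1 and x2
y <- (x1 + x2) * z^2 + rnorm(n, sd = 0.1)
zMat <- cbind(z, z)
xMat <- cbind(x1, x2)
fit <- gam(y ~ s(zMat, by = xMat, k = 5))
# fitted values should track the noise-free truth closely
plot((x1 + x2) * z^2, fitted(fit))
abline(0, 1)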
