how to force a polynomial regression to be very flexible for big sharp turns while reducing flexibility for small turns - r

Is it possible to force a polynomial regression to be very flexible for big sharp turns while reducing flexibility for small turns?
The reason is that I have a variable y that depends on x, and I am sure the two are positively correlated, although my dataset contains some noise. This is why I do not want a wiggly line for x between 1 and 75 as in the graph below.
library(ggplot2)
library(dplyr)
x <- c(1:100)
y <- c(1,3,6,12,22,15,13,11,5,1,3,6,12,22,11,5,1,3,6,12,22,11,5,5,9,10,1,6,12,22,15,13,11,5,-1,-12,-23,6,12,22,11,5,1,3,6,12,22,11,5,-11,-22,-9,12,22,11,5,9,10,18,1,3,6,12,22,15,13,11,5,-5, -9, -12,6,12,22,11,5,1,3,6,12,22,11,5,1,3,6,12,22,11,5,9,10,18,28,37,50,90,120,150,200)
y <- y + x
df <- data.frame(x, y)
model <- lm(y ~ poly(x, 6, raw = TRUE), data = df)
predictions <- model %>% predict(df)
df <- cbind(df, predictions)
ggplot() +
geom_point(data = df, aes(x = x, y = y), size = 0.1) +
geom_line(data = df, aes(x = x, y = predictions), colour="blue", size=0.1)
I can alter the model to:
model <- lm(y ~ poly(x, 2, raw = TRUE), data = df)
Which gives this graph:
In this case the model shows no wiggliness for x between 0 and 90, but it lacks the flexibility to make the turn around x = 90.
I am not looking for a specific solution for this example dataset. I am looking for a way to make a polynomial regression flexible enough to follow sharp (big) turns while reducing wiggliness for small turns at the same time. (Maybe this can be solved by limiting the fit to a maximum of n turns?)
I want to apply it automatically to several datasets. For this reason I do not want to specify different ranges of x for different models.
Thank you!
I have also tried using gam from mgcv, although this gives similar results:
library(mgcv)
mygam <- gam(y ~ s(x, k = 7), data = df)
mygam <- gam(y ~ s(x, k = 3), data = df)
This graph is based on pmax(p1, p2) where p1 and p2 are two polynomials:
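One way a pmax(p1, p2) fit could be set up is sketched below. This is only an illustration, not the code behind the graph: the simulated data, the choice of a linear p1 and a quadratic p2, and the names sse_pmax and start are all assumptions. The coefficients of both polynomials are estimated jointly by minimising the squared error of their pointwise maximum with optim, which may only reach a local optimum.

```r
# Simulated data (assumption): noisy linear trend with one sharp upturn near x = 90
set.seed(42)
x <- 1:100
y <- x + 2 * pmax(0, x - 90)^2 + rnorm(100, sd = 8)

# Hypothetical helper: squared error of pmax(p1, p2),
# with p1 linear and p2 quadratic in x
sse_pmax <- function(par) {
  p1 <- par[1] + par[2] * x
  p2 <- par[3] + par[4] * x + par[5] * x^2
  sum((y - pmax(p1, p2))^2)
}

# Start from the separate least-squares fits of each polynomial
start <- unname(c(coef(lm(y ~ x)),
                  coef(lm(y ~ poly(x, 2, raw = TRUE)))))

# Nelder-Mead copes with the kink introduced by pmax()
fit <- optim(start, sse_pmax, method = "Nelder-Mead",
             control = list(maxit = 5000))

# Fitted curve: the larger of the two polynomials at every x
pred <- pmax(fit$par[1] + fit$par[2] * x,
             fit$par[3] + fit$par[4] * x + fit$par[5] * x^2)
```

Because each polynomial has a low degree, the fitted curve cannot wiggle much on its own; the only sharp turn it can make is where one polynomial overtakes the other.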

Related

How to convert log function in RStudio?

fit1 = lm(price ~ . , data = car)
fit2 = lm(log(price) ~ . , data = car)
I'm not sure how to convert log(price) back to price in fit2. Won't it just become the same thing as fit1 if I do convert it? Please help.
Let's take a very simple example. Suppose I have some data points like this:
library(ggplot2)
df <- data.frame(x = 1:10, y = (1:10)^2)
(p <- ggplot(df, aes(x, y)) + geom_point())
I want to try to fit a model to them, but don't know what form this should take. I try a linear regression first and plot the resultant prediction:
mod1 <- lm(y ~ x, data = df)
(p <- p + geom_line(aes(y = predict(mod1)), color = "blue"))
Next I try a linear regression on log(y). Whatever results I get from predictions from this model will be predicted values of log(y). But I don't want log(y) predictions, I want y predictions, so I need to take the 'anti-log' of the prediction. We do this in R by doing exp:
mod2 <- lm(log(y) ~ x, data = df)
(p <- p + geom_line(aes(y = exp(predict(mod2))), color = "red"))
But we can see that we have different regression lines. That's because when we took the log of y, we were effectively fitting a straight line on the plot of log(y) against x. When we transform the axis back to a non-log axis, our straight line becomes an exponential curve. We can see this more clearly by drawing our plot again with a log-transformed y axis:
p + scale_y_log10(limits = c(1, 500))
Created on 2020-08-04 by the reprex package (v0.3.0)

plot lower-level interactions with predicted values in ggplot2

sub <- c(1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10,11,11,12,12,13,13,14,14,15,15,16,16,17,17,18,18,19,19,20,20)
f1 <- c("f","f","f","f","f","f","f","f","f","f","f","f","f","f","f","f","f","f","f","f","m","m","m","m","m","m","m","m","m","m","m","m","m","m","m","m","m","m","m","m")
f2 <- c("c1","c1","c1","c1","c1","c1","c1","c1","c1","c1","c2","c2","c2","c2","c2","c2","c2","c2","c2","c2","c1","c1","c1","c1","c1","c1","c1","c1","c1","c1","c2","c2","c2","c2","c2","c2","c2","c2","c2","c2")
f3 <- c(0.03,0.03,0.49,0.49,0.33,0.33,0.20,0.20,0.13,0.13,0.05,0.05,0.47,0.47,0.30,0.30,0.22,0.22,0.15,0.15, 0.03,0.03,0.49,0.49,0.33,0.33,0.20,0.20,0.13,0.13,0.05,0.05,0.47,0.47,0.30,0.30,0.22,0.22,0.15,0.15)
y <- c(0.9,1,98,96,52,49,44,41,12,19,5,5,89,92,65,56,39,38,35,33, 87,83,5,7,55,58,67,61,70,80,88,90,0.8,0.9,55,52,55,58,70,69)
dat <- data.frame(sub=sub, f1=f1, f2=f2, f3=f3, y=y)
library(lme4)    # lmer
library(ggplot2) # plotting below
m <- lmer(y ~ f1*f2*f3 + (1|sub), data=dat)
Only the f1*f3 interaction is significant so now I'd like to plot this interaction using the predicted values from model m. I tried
X <- with(dat, expand.grid(f1=unique(f1), f3=range(f3)))
X$Predicted <- predict(m, newdata=X, re.form=NA)
but get an error...
If I add f2 and plot the results
X <- with(dat, expand.grid(f1=unique(f1), f3=range(f3), f2=unique(f2)))
X$Predicted <- predict(m, newdata=X, re.form=NA)
ggplot(X, aes(f3, Predicted)) + geom_path(aes(color=f2)) + facet_wrap(~f1)
I get two slopes in each panel corresponding to the levels of f2, but I just want the f1*f3 interaction from model m (without f2). Does anybody know how can I solve this?
The effects package is useful:
library(effects)
fit <- effect('f1:f3', m) # add xlevels = 100 for higher resolution CI's
fit_df <- as.data.frame(fit)
ggplot() +
geom_point(aes(f3, y, color = f1), dat) +
geom_ribbon(aes(f3, ymin = lower, ymax = upper, fill = f1), fit_df, alpha = 0.3) +
geom_line(aes(f3, fit, color = f1), fit_df)
The package prints a NOTE warning you that the requested term is part of a higher order interaction. Proceed at own risk. I'm pretty sure the confidence intervals here are asymptotic.
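A rough alternative, if you would rather not depend on the effects package, is to predict over the full grid (including f2) and then average the predictions over the levels of f2. This is a different technique from effect(), it gives no confidence intervals, and the name avg is only illustrative. The data are the same as in the question, written more compactly with rep().

```r
library(lme4)

# Same data as in the question, built with rep() for brevity
dat <- data.frame(
  sub = rep(1:20, each = 2),
  f1  = rep(c("f", "m"), each = 20),
  f2  = rep(rep(c("c1", "c2"), each = 10), 2),
  f3  = rep(rep(c(0.03, 0.49, 0.33, 0.20, 0.13,
                  0.05, 0.47, 0.30, 0.22, 0.15), each = 2), 2),
  y   = c(0.9, 1, 98, 96, 52, 49, 44, 41, 12, 19,
          5, 5, 89, 92, 65, 56, 39, 38, 35, 33,
          87, 83, 5, 7, 55, 58, 67, 61, 70, 80,
          88, 90, 0.8, 0.9, 55, 52, 55, 58, 70, 69)
)
m <- lmer(y ~ f1 * f2 * f3 + (1 | sub), data = dat)

# Predict on the full grid; the model needs every fixed-effect variable
X <- with(dat, expand.grid(f1 = unique(f1), f3 = range(f3), f2 = unique(f2)))
X$Predicted <- predict(m, newdata = X, re.form = NA)

# Average over f2 (equal weights) to get one line per level of f1
avg <- aggregate(Predicted ~ f1 + f3, data = X, FUN = mean)
avg
```

The avg data frame can then be passed to ggplot(avg, aes(f3, Predicted, color = f1)) + geom_line(). Equal weighting of the f2 levels is an assumption; other weights may suit unbalanced data better.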

Making surface plot of regression estimates from multiple continuous variables

I have a multi-level model with categorical and continuous variables and splines. Nice and complex. Anyhow I am trying to visualize model fit.
For example, here is some toy data:
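library(lme4)      # lmer, sleepstudy
library(rms)       # rcs
library(ggplot2)   # ggplot
library(gridExtra) # grid.arrange
library(lattice)   # wireframe
library(plotly)    # plot_ly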
## Make model using sleepstudy data
head(sleepstudy)
# Add some extra vars
sleepstudy$group <- factor( sample(c(1,2), nrow(sleepstudy), replace=TRUE) )
sleepstudy$x1 <- jitter(sleepstudy$Days, factor=5)^2 * jitter(sleepstudy$Reaction)
# Set up a mixed model with spline
fm1 <- lmer(Reaction ~ rcs(Days, 4) * group + (rcs(Days, 4) | Subject), sleepstudy)
# Now add continuous covar
fm2 <- lmer(Reaction ~ rcs(Days, 4) * group + x1 + (rcs(Days, 4) | Subject), sleepstudy)
# Plot fit
new.df <- sleepstudy
new.df$pred1 <- predict(fm1, new.df, allow.new.levels=TRUE, re.form=NA)
new.df$pred2 <- predict(fm2, new.df, allow.new.levels=TRUE, re.form=NA)
g1 <- ggplot(data=new.df, aes(x=Days)) +
geom_line(aes(y=pred1, col=group), size=2) +
ggtitle("Model 1")
g2 <- ggplot(data=new.df, aes(x=Days)) +
geom_line(aes(y=pred2, col=group), size=2) +
ggtitle("Model 2")
grid.arrange(g1, g2, nrow=1)
Plot 1 is smooth, but plot 2 is jagged due to the effect of x1. So I would like to make a surface plot with x = Days, y = x1 and z = pred2 and stratified by group. Not having experience of surface plots I've started out with the wireframe command:
wireframe(pred2 ~ Days * x1, data = new.df[new.df$group==1,],
xlab = "Days", ylab = "x1", zlab="Predicted fit"
)
However although this command does not give an error, my plot is blank:
Questions:
Where am I going wrong with my wireframe?
Is there a better way to visualize my model fit?
I figured out that the data format needed for a wireframe or plot_ly surface is a 2D matrix of x rows by y columns holding the corresponding z values (I got a hint towards this from the question "Plotly 3d surface graph has incorrect x and y axis values"). I also realised I could use expand.grid to build a grid covering the range of possible x and y values and use it to predict z as follows:
days <- 0:9
x1_range <- range(sleepstudy$x1)[2] * c(0.05, 0.1, 0.15, 0.2, 0.25, 0.3)
new.data2 <- expand.grid(Days = days, x1 = x1_range, group = unique(sleepstudy$group) )
new.data2$pred <- predict(fm2, new.data2, allow.new.levels=TRUE, re.form=NA)
I can then stuff those into two different matrices to represent the z-surface for each group in my model:
surf1 <- ( matrix(new.data2[new.data2$group == 1, ]$pred, nrow = length(days), ncol = length(x1_range)) )
surf2 <- ( matrix(new.data2[new.data2$group == 2, ]$pred, nrow = length(days), ncol = length(x1_range)) )
group <- c(rep(1, nrow(surf1)), rep(2, nrow(surf2) ))
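The reshaping step above relies on how expand.grid orders its rows: the first variable varies fastest, so filling a matrix column by column puts Days along the rows and x1 along the columns. A small sketch with made-up values (the stand-in prediction Days + x1 is purely illustrative and replaces the predict() call) shows the layout:

```r
# Toy grid: 3 days x 2 values of x1 (illustrative values only)
days <- 0:2
x1_range <- c(10, 20)
grid <- expand.grid(Days = days, x1 = x1_range)

# Stand-in for predict(): any value computed row by row on the grid
grid$pred <- grid$Days + grid$x1

# Column-major fill: rows follow Days, columns follow x1
surf <- matrix(grid$pred, nrow = length(days), ncol = length(x1_range))
dimnames(surf) <- list(Days = days, x1 = x1_range)
surf
```

Each entry surf[i, j] then corresponds to days[i] and x1_range[j], which is what determines the x and y vectors to pass alongside the z matrix.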
Finally I can use plot_ly to plot each surface:
plot_ly (z=surf1, x = x1_range, y = days, type="surface") %>%
add_surface (z = surf2, surfacecolor=surf2,
color=c('red','yellow'))
The resulting plot:
So the resulting plot is what I wanted (albeit not very useful in this made-up example, but useful with real data). The only thing I can't figure out is how to show two different color scales. I can suppress the scale altogether, but if anyone knows how to show two scales for different surfaces, please let me know and I will edit the answer.

R: GAM with fit on subset of data

I fit a Generalized Additive Model using gam from the mgcv package. I have a data table containing my dependent variable Y, an independent variable X, other independent variables Oth and a two-level factor Fac. I would like to fit the following model
Y ~ s(X) + Oth
BUT with the additional constraint that the s(X) term is fit only on one of the two levels of the factor, say Fac==1. The other terms Oth should be fit with the whole data.
I tried exploring s(X,by=Fac) but this biases the fit for Oth. In other words, I would like to express the belief that X relates to Y only if Fac==1, otherwise it does not make sense to model X.
Cheap trick: use an auxiliary variable that is X if Fac == 1 and 0 elsewhere.
library("mgcv")
library("ggplot2")
# simulate data
N <- 1e3
dat <- data.frame(covariate = runif(N),
predictor = runif(N),
group = factor(sample(0:1, N, TRUE)))
dat$outcome <- rnorm(N,
1 * dat$covariate +
ifelse(dat$group == 1,
.5 * dat$predictor +
1.5 * sin(dat$predictor * pi),
0), .1)
# some plots
ggplot(dat, aes(x = predictor, y = outcome,
col = group, group = group)) +
geom_point()
ggplot(dat, aes(x = covariate, y = outcome,
col = group, group = group)) +
geom_point()
# create auxiliary variable
dat$aux <- ifelse(dat$group == 1,
dat$predictor,
0)
# fit the data
fit1 <- gam(outcome ~ covariate + s(predictor, by = group),
data = dat)
fit2 <- gam(outcome ~ covariate + s(aux, by = group),
data = dat)
# compare fits
summary(fit1)
summary(fit2)
If I understand it right, you're thinking about some model with interaction like this:
Y ~ Oth + (Fac==1)*s(X)
If you want to "express the belief that X relates to Y only if Fac==1", don't treat Fac as a factor but as a numeric variable. In this case you will get a numeric interaction and only one set of coefficients (when it is a factor there were two). This type of model is a varying coefficient model.
# some data
data <- data.frame(th = runif(100),
X = runif(100),
Y = runif(100),
Fac = sample(0:1, 100, TRUE))
data$Fac<-as.numeric(as.character(data$Fac)) #change to numeric
# then run model
gam(Y~s(X, by=Fac)+th,data=data)
See the description of the by argument in the documentation, ?s
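To see what the numeric by variable does, here is a small sketch on simulated data (the data-generating model and all names such as d, nd and p0 are assumptions for illustration). With a numeric by, the smooth is multiplied by Fac, so for rows with Fac equal to 0 the smooth contributes nothing and the prediction does not depend on X.

```r
library(mgcv)

# Simulated data (assumption): X matters only when Fac == 1
set.seed(1)
n <- 500
d <- data.frame(X = runif(n), Oth = runif(n), Fac = sample(0:1, n, TRUE))
d$Y <- d$Oth + d$Fac * sin(2 * pi * d$X) + rnorm(n, sd = 0.1)

# Fac is numeric, so s(X, by = Fac) is a single smooth multiplied by Fac
fit <- gam(Y ~ Oth + s(X, by = Fac), data = d)

# With Fac = 0, predictions are the same whatever the value of X
nd <- data.frame(X = c(0.25, 0.75), Oth = 0.5, Fac = 0)
p0 <- unname(predict(fit, newdata = nd))
p0
```

The Oth coefficient is still estimated from all rows, which is the behaviour asked for in the question.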

A replacement for method = 'loess'

This is where I'm at so far:
I have a data frame df with two columns A and B (both containing real numbers), where B depends on A. I plot the columns against each other:
p = ggplot(df, aes(A, B)) + geom_point()
and see that the relationship is non-linear. Adding:
p = p + geom_smooth(method = 'loess', span = 1)
gives a 'good' line of best fit. Given a new value a of A I then use the following method to predict the value of B:
B.loess = loess(B ~ A, span = 1, data = df)
predict(B.loess, newdata = a)
So far, so good. However, I then realise I can't extrapolate using loess (presumably because it is non-parametric?!). The extrapolation seems fairly natural - the relationship looks something like a power type thing is going on e.g:
x = c(1:10)
y = 2^x
df = data.frame(A = x, B = y)
This is where I get unstuck. Firstly, what methods can I use to plot a line of best fit to this kind of ('power') data without using loess? Pathetic attempts such as:
p = ggplot(df, aes(A, B)) + geom_point() +
geom_smooth(method = 'lm', formula = log(y) ~ x)
give me errors. Also, assuming I am actually able to plot a line of best fit that I am happy with, I am having trouble using predict in the same way as I did with loess. For example's sake, suppose I am happy with the line of best fit:
p = ggplot(df, aes(A, B)) + geom_point() +
geom_smooth(method = 'lm', formula = y ~ x)
then if I want to predict what value B would take if A was equal to 11 (theoretically 2^11), the following method does not work:
B.lm = lm(B ~ A)
predict(B.lm, newdata = 11)
Any help much appreciated. Cheers.
First, to answer your last question: you need to provide a data.frame whose column names are the names of the predictors.
B.lm <- lm(B ~ A,data=df)
predict(B.lm, newdata = data.frame(A=11))
1
683.3333
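For the power-type data in the question (B = 2^A), one option that does extrapolate is a linear model on log(B), with the prediction transformed back using exp. This is a sketch of that idea rather than part of the original answer; the names B.loglm and pred11 are illustrative.

```r
# Data from the question: B grows as a power of 2
df <- data.frame(A = 1:10, B = 2^(1:10))

# Fit a straight line to log(B); this is linear in A for this data
B.loglm <- lm(log(B) ~ A, data = df)

# Predict on the log scale, then back-transform to the scale of B
pred11 <- exp(predict(B.loglm, newdata = data.frame(A = 11)))
pred11
```

Here the result is very close to 2^11 = 2048 because the example data are noise-free. With noisy data the back-transformed value estimates a typical (median-like) B rather than its mean.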
As an alternative to loess, you can try a higher-order polynomial regression. In this plot I compare a degree-3 polynomial (poly(x, 3)) to loess using latticeExtra (where it is easier to add the xspline interpolation), with a syntax similar to ggplot2 (layer).
xyplot(A ~ B,data=df,par.settings = ggplot2like(),
panel = function(x,y,...){
panel.xyplot(x,y,...)
grid.xspline(x,y,..., default.units = "native") ## xspline interpolation
})+
layer(panel.smoother(y ~ poly(x, 3), method = "lm"), style = 1)+ ## poly
layer(panel.smoother(y ~ x, span = 0.9),style=2) ### loess
The default surface for loess.control is interpolate, which, unsurprisingly, doesn't allow extrapolation. The alternative, direct, allows you to extrapolate, though the question remains whether this is meaningful.
predict(loess(hp~disp,mtcars),newdata=1000)
[1] NA
predict(loess(hp~disp,mtcars,control=loess.control(surface="direct")),newdata=1000)
[1] -785.0545
