Coefficients of my polynomial model in R don't match graph

Using Greg's helpful answer here, I fit a second order polynomial regression line to my dataset:
poly.fit <- lm(y ~ poly(x, 2), df)
When I plot the line, I get the graph below:
The coefficients are:
# Coefficients:
# (Intercept) poly(x, 2)1 poly(x, 2)2
# 727.1 362.4 -269.0
I then wanted to find the x-value of the peak. I assume there is an easy way to do so in R but I did not know it,* so I went to Wolfram Alpha. I entered the equation:
y=727.1+362.4x-269x^2
Wolfram Alpha returned the following:
As you can see, the function intersects the x-axis at approximately x=2.4. This is obviously different from my plot in R, which ranges from 0≤x≤80. Why are these different? Does R interpret my x-values as a fraction of some backroom variable?
*I would also appreciate answers on how to find this peak. Obviously I could take the derivative, but how do I set it to zero?

Use predict. By default poly(x, 2) builds orthogonal polynomials, so the reported coefficients are not the coefficients of 1, x and x^2 on the raw scale; that is why plugging them into Wolfram Alpha gives a different curve. Let predict handle the transformation:
plot( 40:90, predict( poly.fit, list(x=40:90) ) )

In the case of a quadratic polynomial, you can of course use a little calculus and algebra (once you have friendly coefficients).
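For instance, a minimal sketch (assuming you refit on the raw scale, so the usual vertex formula applies):
# refit with raw (non-orthogonal) coefficients: y = b0 + b1*x + b2*x^2
raw.fit <- lm(y ~ poly(x, 2, raw = TRUE), df)
b <- coef(raw.fit)
x.peak <- -b[2] / (2 * b[3])   # dy/dx = b1 + 2*b2*x = 0 at the vertex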
Somewhat more generally, you can get an estimate by evaluating your model over a range of candidate values and determining which one gives you the maximum response value.
Here is a (only moderately robust) function which will work here.
xmax <- function(fit, startx, endx, x='x', within=NA){
  ## find the approximate value of variable x where the model
  ## specified by fit takes its maximum value, inside the interval
  ## [startx, endx]; precision specified by within
  within <- ifelse(is.na(within), (endx - startx)/100, within)
  testx <- seq(startx, endx, by=within)
  testlist <- list(testx)
  names(testlist)[1] <- x
  testy <- predict(fit, testlist)
  testx[which.max(testy)]
}
Note that if your predictor variable is called something other than x, you have to pass its name as a string via the x parameter.
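For example, with a hypothetical model whose predictor is called dose:
xmax(dose.fit, 0, 10, x = 'dose')   # dose.fit and dose are hypothetical stand-ins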
So to find the x value where your curve has its peak:
xmax(poly.fit, 50, 80, within=0.1)

Related

R absolute value of residuals with log transformation

I have a linear model in R of the form
lm(log(num_encounters) ~ log(distance)*sampling_effort, data=df)
I want to interpret the residuals but get them back on the scale of num_encounters. I have seen residuals.lm(x, type="working") and residuals.lm(x, type="response"), but I'm not sure about the values they return. Do I, for instance, still need to use exp() to get the residuals back on the num_encounters scale, or are they already on that scale? I then want to plot these absolute values, both in a histogram and in a raster map.
EDIT:
Basically my confusion is that the following code results in 3 different histograms, while I was expecting the first 2 to be identical.
# x is the fitted model from the lm call above; histogram() is from the lattice package
library(lattice)
df$predicted <- exp(predict(x, newdata=df))
histogram(df$num_encounters - df$predicted)
histogram(exp(residuals(x, type="response")))
histogram(residuals(x, type="response"))
I want to interpret the residuals but get them back on the scale of num_encounters.
You can easily calculate them:
mod <- lm(log(num_encounters) ~ log(distance)*sampling_effort, data=df)
res <- df$num_encounters - exp(predict(mod))
In addition to what @Roland suggests, which indeed is correct and works, my confusion came down to basic high-school logarithm algebra.
Indeed, the absolute response residuals (on the scale of the original dependent variable) can be calculated as @Roland says with
mod <- lm(log(num_encounters) ~ log(distance)*sampling_effort, data=df)
res <- df$num_encounters - exp(predict(mod))
If you want to calculate them from the model residuals, you need to take the logarithm subtraction rule into account:
log(a)-log(b)=log(a/b)
The residual is calculated from the original model. So in my case, the model predicts log(num_encounters). So the residual is log(observed)-log(predicted).
What I was trying to do was
exp(resid) = exp(log(obs)-log(pred)) = exp(log(obs/pred)) = obs/pred
which is clearly not the number I was looking for. To get the absolute response residual from the model response residual, this is what I needed:
obs-obs/exp(resid)
So in R code, this is what you could also do:
mod <- lm(log(num_encounters) ~ log(distance)*sampling_effort, data=df)
abs_resid <- df$num_encounters - df$num_encounters/exp(residuals(mod, type="response"))
This resulted in the same number as the method described by @Roland, which is of course much easier. But at least I got my brain lined up again.
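A quick sanity check (with the same df and mod as above) confirms the two computations agree:
res <- df$num_encounters - exp(predict(mod))
all.equal(res, abs_resid)   # both are response-scale residuals, so this should be TRUE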

Finding x for f(x) loess function

I have a couple of loess models using similar data to....
set.seed(123)
y <- runif(100, -20, 20)
z <- seq(-12.75, 12, .25) * rnorm(100, 1, 3)
x <- seq(1, 100, 1)
df <- data.frame(y, x, z)
model <- loess(y ~ x, data = df)
model2 <- loess(z ~ x, data = df)
What I am trying to accomplish (without any luck) is to identify where the smoothed lines do 2 things:
1) I want to identify at what value(s) of x do the lines cross y=0
2) I want to identify at what value(s) of x the 2 loess lines cross each other.
I've been looking for similar problems and solutions to those problems for too long now with no success. Any help would be greatly appreciated.
library(ggplot2)
ggplot(df, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "loess", se = FALSE) +
  geom_smooth(aes(y = z), method = "loess", se = FALSE)
You can use predict to get the y value for any x, and then optimise to find the specific x value that solves for the y value you want.
For example, to find the zero crossing of model, we can optimise to find where the square of its fitted value is a minimum:
zero1 <- optimize(function(x, m) predict(m, x)^2, range(x), model)
zero1
# $minimum
# [1] 67.89191
Note that this will only find a single local minimum. If your model crosses zero several times, you will need to solve like this in each of the ranges where there is a zero (by changing the second argument of optimize, which specifies the range to search within).
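Here is a sketch for handling multiple crossings: scan a grid for sign changes of the fitted values, then refine each one with optimize on its own sub-range.
xs <- seq(min(x), max(x), length.out = 400)
fx <- predict(model, xs)
crossings <- which(diff(sign(fx)) != 0)   # grid cells where the fitted value changes sign
zeros <- sapply(crossings, function(i)
  optimize(function(x, m) predict(m, x)^2, c(xs[i], xs[i + 1]), model)$minimum)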
Exactly the same approach can find where the models intersect. For that case you minimise the square of the difference between the two models:
intersection <- optimize(function(x, m1, m2) (predict(m1, x) - predict(m2, x))^2,
                         range(x), model, model2)
intersection
# $minimum
# [1] 45.65295
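As a quick visual check in base graphics:
plot(x, predict(model, x), type = "l", ylab = "fitted")
lines(x, predict(model2, x), col = "red")
abline(v = intersection$minimum, lty = 2)   # the two smooths meet here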

Finding model predictor values that maximize the outcome

How do you find the set of values for model predictors (a mixture of linear and non-linear terms) that yields the highest value for the response?
Example Model:
library(lme4); library(splines)
summary(lmer(formula = Solar.R ~ 1 + bs(Ozone) + Wind + Temp + (1 | Month),
             data = airquality, REML = FALSE))
Here I am interested in what conditions (predictors) produce the highest solar radiation (outcome).
This question seems simple, but I've failed to find a good answer using Google.
If the model was simple, I could take the derivatives to find the maximum or minimum. Someone has suggested that if the model function can be extracted, the stats::optim() function might be used. As a last resort, I could simulate all the reasonable variations of input values and plug it into the predict() function and look for the maximum value.
The last approach mentioned doesn't seem very efficient and I imagine that this is a common enough task (e.g., finding optimal customers for advertising) that someone has built some tools for handling it. Any help is appreciated.
There are some conceptual issues here.
For the simple terms (Wind and Temp), the response is a linear (and hence both monotonic and unbounded) function of the predictors. Thus if these terms have positive parameter estimates, increasing their values to infinity (Inf) will give you an infinite response (Solar.R); if the coefficients are negative, values should be as small as possible (negative infinity). Practically speaking, then, you want to set these predictors to the minimum or maximum reasonable value if the parameter estimate is respectively negative or positive; a tiny sketch of this rule follows below.
For the bs term, I'm not sure what the properties of the B-spline are beyond the boundary knots, but I'm pretty sure the curves go off to positive or negative infinity, so you've got the same issue. However, for the case of bs, it's also possible that there are one or more interior maxima. For this case I would probably try to extract the basis terms and evaluate the spline over the range of the data (see the sketch after the optim example) ...
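A tiny sketch of that sign rule (assuming the m1 and aq objects defined in the code below):
b <- fixef(m1)[c("Wind", "Temp")]
best_wind <- if (b["Wind"] > 0) max(aq$Wind) else min(aq$Wind)   # bound picked by coefficient sign
best_temp <- if (b["Temp"] > 0) max(aq$Temp) else min(aq$Temp)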
Alternatively, your mentioning optim makes me think that this is a possibility:
data(airquality)
library(lme4)
library(splines)
m1 <- lmer(formula = Solar.R ~ 1 + bs(Ozone) + Wind + Temp + (1 | Month),
           data = airquality, REML = FALSE)
predval <- function(x) {
  newdata <- data.frame(Ozone=x[1], Wind=x[2], Temp=x[3])
  ## return population-averaged prediction (no Month effect)
  return(predict(m1, newdata=newdata, re.form=~0))
}
aq <- na.omit(airquality)
sval <- with(aq, c(mean(Ozone), mean(Wind), mean(Temp)))
predval(sval)
opt1 <- optim(fn=predval,
              par=sval,
              lower=with(aq, c(min(Ozone), min(Wind), min(Temp))),
              upper=with(aq, c(max(Ozone), max(Wind), max(Temp))),
              method="L-BFGS-B",        ## for constrained optimization
              control=list(fnscale=-1)) ## fnscale=-1 turns minimization into maximization
## opt1
## $par
## [1] 70.33851 20.70000 97.00000
##
## $value
## [1] 282.9784
As expected, this is intermediate in the range of Ozone (1-168), and at the min/max for Wind (2.3-20.7) and Temp (57-97).
This brute force solution could be made much more efficient by automatically selecting the min/max values for the simple terms and optimizing only over the complex (polynomial/spline/etc.) terms.
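A rough sketch of the basis-extraction idea mentioned earlier (assuming the m1 and aq objects from above; this evaluates only the bs(Ozone) fixed-effect contribution on a grid over the observed range):
oz_grid <- seq(min(aq$Ozone), max(aq$Ozone), length.out = 200)
B <- predict(bs(aq$Ozone), oz_grid)                        # spline basis evaluated on the grid
beta <- fixef(m1)[grep("bs\\(Ozone\\)", names(fixef(m1)))]
spline_part <- drop(B %*% beta)                            # the spline term's contribution alone
oz_grid[which.max(spline_part)]                            # Ozone value maximizing that term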

Plotting fitted values vs observed ones in R or winbugs

I want to plot the fitted values versus the observed ones, and I want to add a straight line showing the goodness of fit. However, I do not want to use abline(), because I did not calculate the fitted values using the lm command; I used a model that R does not cover, so I calculated the coefficients myself and used them to calculate the fitted values. So, what can I do to obtain such a plot in R or in WinBUGS?
Here is what I want
Still no data provided, but maybe this simple example using the curve function will inform the process:
x <- 1:10
y <- 2 + 3*(1:10) + rnorm(10)
plot(1:10, y)
curve(2 + 3*x, 0, 10, add=TRUE)
Note to new R users: the expression y_i = 1 - xbeta + delta_i + e_i would fail in R, in part because the x and beta are not separated by an operator. But if you do understand R's matrix syntax, it might be a very compact expression even if "X" were multidimensional. All of this depends on the specifics, which we are so far lacking.
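For the fitted-versus-observed plot itself, a minimal sketch (fit below is a stand-in for the fitted values you computed from your own coefficients); abline(0, 1) draws the reference line directly, so no lm object is needed:
fit <- 2 + 3*(1:10)            # stand-in for your hand-computed fitted values
plot(fit, y, xlab = "Fitted", ylab = "Observed")
abline(0, 1, lty = 2)          # 45-degree line: a perfect fit falls on it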

Error returned predicting new data using GAM with periodic smoother

Apologies if this is better suited in CrossValidated.
I am fitting GAM models to binomial data using the mgcv package in R. One of the covariates is periodic, so I am specifying the bs = "cc" cyclic cubic spline. I am doing this in a cross validation framework, but when I go to fit my holdout data using the predict function I get the following error:
Error in pred.mat(x, object$xp, object$BD) :
can't predict outside range of knots with periodic smoother
Here is some code that should replicate the error:
library(mgcv)
# generate data:
x <- runif(100, min=-pi, max=pi)
linPred <- 2*cos(x)                # value of the linear predictor
theta <- 1 / (1 + exp(-linPred))   # inverse logit gives success probabilities
y <- rbinom(100, 1, theta)
plot(x, theta)
df <- data.frame(x=x, y=y)
# fit gam with periodic smoother:
gamFit <- gam(y ~ s(x, bs="cc", k=5), data=df, family=binomial())
summary(gamFit)
plot(gamFit)
# predict y values for new data:
x.2 <- runif(100, min=-pi, max=pi)
df.2 <- data.frame(x=x.2)
predict(gamFit, newdata=df.2)
Any suggestions on where I'm going wrong would be greatly appreciated. Maybe manually specifying knots to fall on -pi and pi?
I did not get an error on the first run, but I did replicate the error on the second try; compare set.seed(123) (no error) with set.seed(223) (produces the error). I think you are just seeing some variation with a relatively small number of points in your derivation and validation datasets; 100 points is not particularly "generous" for a GAM fit.
Looking at the gamFit object, it appears that the range of the knots is encoded in gamFit$smooth[[1]]$xp, so this should restrict your inputs to the proper range:
x.2 <- runif(100, min=-pi, max=pi)
x.2 <- x.2[findInterval(x.2, range(gamFit$smooth[[1]]$xp)) == 1]
# Removes the errors in all the situations I tested;
# there were three points outside the range in the set.seed(223) case
The problem is that your test set contains values that were not in the range of your training set. Since you used a spline, knots were created at the minimum and maximum value of x, and your fitted function is not defined outside of that range. So, when you test the model, you should exclude those points that are outside the range. Here is how you would exclude the points in the test set:
set.seed(2)
... <Your code>
predict(gamFit, newdata=df.2[df.2$x >= min(df$x) & df.2$x <= max(df$x), , drop=FALSE])
Or, you could specify the "outer" knot points in the model to the min and max of your whole data. I don't know how to do that offhand.
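One sketch of that approach (my understanding is that mgcv accepts the two boundary knots of a cyclic smooth via the knots argument, which is worth verifying against the mgcv documentation):
gamFit2 <- gam(y ~ s(x, bs="cc", k=5), data=df, family=binomial(),
               knots=list(x=c(-pi, pi)))   # pin the outer knots to the full cycle
predict(gamFit2, newdata=df.2)             # new x values in [-pi, pi] are now in range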
