R absolute value of residuals with log transformation

I have a linear model in R of the form
lm(log(num_encounters) ~ log(distance)*sampling_effort, data=df)
I want to interpret the residuals but get them back on the scale of num_encounters. I have seen residuals.lm(x, type="working") and residuals.lm(x, type="response"), but I'm not sure about the values they return. Do I, for instance, still need to use exp() to get the residual values back on the num_encounters scale, or are they already on that scale? I want to plot these absolute residuals, both in a histogram and afterwards in a raster map.
EDIT:
Basically my confusion is that the following code results in 3 different histograms, while I was expecting the first 2 to be identical.
df$predicted <- exp(predict(x, newdata=df))
histogram(df$num_encounters-df$predicted)
histogram(exp(residuals(x, type="response")))
histogram(residuals(x, type="response"))

I want to interpret the residuals but get them back on the scale of num_encounters.
You can easily calculate them:
mod <- lm(log(num_encounters) ~ log(distance)*sampling_effort, data=df)
res <- df$num_encounters - exp(predict(mod))

In addition to what @Roland suggests, which is indeed correct and works, my confusion came down to basic high-school logarithm algebra.
Indeed, the absolute response residuals (on the scale of the original dependent variable) can be calculated as @Roland says with
mod <- lm(log(num_encounters) ~ log(distance)*sampling_effort, data=df)
res <- df$num_encounters - exp(predict(mod))
If you want to calculate them from the model residuals, you need to take the logarithm subtraction rule into account:
log(a) - log(b) = log(a/b)
The residual is calculated from the original model. In my case the model predicts log(num_encounters), so the residual is log(observed) - log(predicted).
What I was trying to do was
exp(resid) = exp(log(obs) - log(pred)) = exp(log(obs/pred)) = obs/pred
which is clearly not the number I was looking for. To get the absolute response residual from the model response residual, this is what I needed:
obs - obs/exp(resid)
So in R code, this is what you could also do:
mod <- lm(log(num_encounters) ~ log(distance)*sampling_effort, data=df)
abs_resid <- df$num_encounters - df$num_encounters/exp(residuals(mod, type="response"))
This resulted in the same numbers as the method described by @Roland, which is of course much easier. But at least I got my brain lined up again.
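For anyone who wants to verify the equivalence numerically, here is a minimal self-contained check on simulated data; the data-generating values below are made up purely for illustration and are not from the original question.
set.seed(42)
df <- data.frame(distance = runif(200, 1, 100),
                 sampling_effort = runif(200, 1, 10))
df$num_encounters <- exp(4 - 0.5*log(df$distance) + 0.1*df$sampling_effort + rnorm(200, sd=0.3))
mod <- lm(log(num_encounters) ~ log(distance)*sampling_effort, data=df)
res1 <- df$num_encounters - exp(predict(mod))                        # @Roland's method
res2 <- df$num_encounters - df$num_encounters/exp(residuals(mod))    # via the model residuals
all.equal(res1, res2)  # TRUE: both give the same response-scale residuals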

Related

exponential regression with R (and negative values)

I am trying to fit a curve to a set of data points but have not succeeded, so I'm asking here.
plot(time, val) # look at the data
exponential.model <- lm(log(val) ~ time) # compute the model
fit <- exp(predict(exponential.model, list(time=time))) # create the fitted curve
plot(time, val) # plot it again
lines(time, fit, lwd=2) # show the fitted line
My only problem is that my data contains negative values, so log(val) produces a lot of NAs, making the model computation fail.
I know that my data does not necessarily look exponential, but I want to see the fit anyway. I also used another program, which tells me that val = 27.1331*exp(-time/2.88031) is a nice fit, but I do not know what I am doing wrong.
I want to compute it with R.
I had the idea to shift the data so that no negative values remain, but the result is poor and almost certainly wrong.
plot(time, val+20) # look at the shifted data
exponential.model <- lm(log(val+20) ~ time) # compute the model
fit <- exp(predict(exponential.model, list(time=time))) # create the fitted curve
plot(time, val) # plot the original data again
lines(time, fit-20, lwd=2) # show the (BAD) fitted line
Thank you!
I figured some things out and have a satisfying solution.
exponential.model <- lm(log(val) ~ time) # compute model
The log(val) term is there to rescale the values so that a linear model can be applied. Since this is not possible with my values (some are negative), you have to use a non-linear model, nls(), instead.
exponential.model <- nls(val ~ a*exp(b*time), start=list(a=30, b=-0.1))
This worked fine for me.
[plot: satisfying fit]
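For completeness, here is a minimal self-contained sketch of the nls() approach on made-up data that contains negative values; the true parameters, noise level, and start values below are invented purely for illustration.
set.seed(1)
time <- seq(0, 20, by=0.5)
val <- 27*exp(-time/3) + rnorm(length(time), sd=2)  # noise pushes the tail below zero
# log(val) would produce NaNs here, so fit the exponential directly with nls()
exponential.model <- nls(val ~ a*exp(b*time), start=list(a=30, b=-0.1))
plot(time, val)
lines(time, predict(exponential.model), lwd=2)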

Finding Mean Squared Error?

I have produced a linear data set and have used lm() to fit a model to it. I am now trying to find the MSE using mse().
I know the formula for MSE, but I'm trying to use this function. What would be the proper way to do so? I have looked at the documentation, but I'm either dumb or it's just worded for people who actually know what they're doing.
library(hydroGOF)
x.linear <- seq(0, 200, by=1) # x data
error.linear <- rnorm(n=length(x.linear), mean=0, sd=1) # Error (0, 1)
y.linear <- x.linear + error.linear # y data
training.data <- data.frame(x.linear, y.linear)
training.model <- lm(training.data)
training.mse <- mse(training.model, training.data)
plot(training.data)
mse() needs two data frames. I'm not sure how to get a data frame out of lm(). Am I even on the right track to finding a proper MSE for my data?
Try this:
mean((training.data - predict(training.model))^2)
#[1] 0.4467098
You can also use the Metrics package, which provides a clean way to get the mean squared error:
install.packages("Metrics")
library(Metrics)
mse(actual, predicted)
The first argument is the actual (observed) data, here the response column training.data$y.linear. The second argument is the predicted values, for example:
pd <- predict(training.model, training.data)
mse(training.data$y.linear, pd)
It seems you have not done the prediction yet, so first predict the data based on your model and then calculate the MSE.
You can also use the residuals component of the lm output to find the MSE in this manner:
mse = mean(training.model$residuals^2)
Note: if you come from another program (like SAS), the mean is taken using the sum of squared residuals and the residual degrees of freedom. I recommend doing the same if you want a more accurate estimate of the error:
mse = sum(training.model$residuals^2)/training.model$df.residual
I found this while trying to figure out why mean(my_model$residuals^2) gave a different value in R than the MSE did in SAS.
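As a minimal side-by-side sketch of the two denominators, here is the same comparison with an explicit formula instead of the asker's lm(training.data) call; the variable names are just for illustration.
set.seed(1)
x <- seq(0, 200, by=1)
y <- x + rnorm(length(x), mean=0, sd=1)
fit <- lm(y ~ x)
mean(fit$residuals^2)                   # MSE with n in the denominator
sum(fit$residuals^2)/fit$df.residual    # SAS-style: n - p in the denominator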

Trouble with 'fitdistrplus' package, t-distribution

I am trying to fit t-distributions to my data but am unable to do so. My first try was
fitdistr(myData, "t")
There are 41 warnings, all saying that NaNs are produced. I don't know how; logarithms seem to be involved. So I adjusted my data somewhat so that all of it is > 0, but I still have the same problem (9 fewer warnings, though...). The same thing happens with sstdFit(): it produces NaNs.
So instead I tried fitdist, which I've seen on Stack Overflow and Cross Validated:
fitdist(myData, "t")
I then get
Error in mledist(data, distname, start, fix.arg, ...) :
'start' must be defined as a named list for this distribution
What does this mean? I tried looking into the documentation, but that told me nothing. I just want to fit a t-distribution if possible; this is so frustrating :P
Thanks!
start is the initial guess for the parameters of your distribution. There are logs involved because the fitting uses maximum likelihood, and hence log-likelihoods.
library(fitdistrplus)
dat <- rt(100, df=10)
fit <- fitdist(dat, "t", start=list(df=2))
I think it's worth adding that in most cases, using the fitdistrplus package to fit a t-distribution to real data will lead to a very bad fit, which is actually quite misleading. This is because the default t-distribution functions in R are used, and they don't support shifting or scaling. That is, if your data has a mean other than 0, or is scaled in some way, then the fitdist function will simply lead to a bad fit.
In real life, if data fits a t-distribution, it is usually shifted (i.e. has a mean other than 0) and / or scaled. Let's generate some data like that:
data = 1.5*rt(10000,df=5) + 0.5
Given this data has been sampled from the t-distribution with 5 degrees of freedom, you'd think that trying to fit a t-distribution to this should work quite nicely. But actually, here is the result. It estimates a df of 2, and provides a bad fit as shown in the qq plot.
> fit_bad <- fitdist(data,"t",start=list(df=3))
> fit_bad
Fitting of the distribution ' t ' by maximum likelihood
Parameters:
estimate Std. Error
df 2.050967 0.04301357
> qqcomp(list(fit_bad)) # generates plot to show fit
When you fit to a t-distribution you want to not only estimate the degrees of freedom, but also a mean and scaling parameter.
The metRology package provides a version of the t-distribution called t.scaled that has a mean and sd parameter in addition to the df parameter. Now let's fit it again:
> library("metRology")
> fit_good <- fitdist(data,"t.scaled",
start=list(df=3,mean=mean(data),sd=sd(data)))
> fit_good
Fitting of the distribution ' t.scaled ' by maximum likelihood
Parameters:
estimate Std. Error
df 4.9732159 0.24849246
mean 0.4945922 0.01716461
sd 1.4860637 0.01828821
> qqcomp(list(fit_good)) # generates plot to show fit
Much better :-) The parameters are very close to how we generated the data in the first place! And the QQ plot shows a much nicer fit.
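If you would rather not depend on metRology, a hand-rolled location-scale t-distribution works as well. This is only a sketch: the names dt_ls and pt_ls are made up here, and it relies on fitdist() picking up user-defined d/p functions from the workspace (the same mechanism the fitdistrplus vignette uses for custom distributions).
dt_ls <- function(x, df, mean, sd) dt((x - mean)/sd, df)/sd  # density of a shifted/scaled t
pt_ls <- function(q, df, mean, sd) pt((q - mean)/sd, df)     # matching CDF
fit_ls <- fitdist(data, "t_ls", start=list(df=3, mean=mean(data), sd=sd(data)))
summary(fit_ls)  # estimates should again be close to df=5, mean=0.5, sd=1.5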

Predict.lm() in R - how to get nonconstant prediction bands around fitted values

So I am currently trying to draw the confidence interval for a linear model. I found out I should use predict.lm() for this, but I have a few problems really understanding the function, and I do not like using functions without knowing what's happening. I found several how-tos on this subject, but only with the corresponding R code, no real explanation.
This is the function itself:
## S3 method for class 'lm'
predict(object, newdata, se.fit = FALSE, scale = NULL, df = Inf,
interval = c("none", "confidence", "prediction"),
level = 0.95, type = c("response", "terms"),
terms = NULL, na.action = na.pass,
pred.var = res.var/weights, weights = 1, ...)
Now, what I've trouble understanding:
1) newdata
An optional data frame in which to look for variables
with which to predict. If omitted, the fitted values are used.
Everyone seems to use newdata for this, but I cannot quite understand why. For calculating the confidence interval I obviously need the data which this interval is for (like the number of observations, mean of x, etc.), so that cannot be what is meant by it. But then: what does it mean?
2) interval
Type of interval calculation.
okay.. but what is "none" for?
3a) type
Type of prediction (response or model term).
3b) terms
If type="terms", which terms (default is all terms)
3a: Can I use that to get the confidence interval for one specific variable in my model? And if so, what is 3b for then? If I can specify the term in 3a, it wouldn't make sense to do it again in 3b... so I guess I'm wrong again, but I cannot figure out why.
I guess some of you might think: why not just try it out? And I would (even if it would maybe not solve everything here), but right now I don't know how to. As I do not know what newdata is for, I don't know how to use it, and if I try, I do not get the right confidence interval. Somehow it is very important how you choose that data, but I just don't understand!
EDIT: I want to add that my intention is to understand how predict.lm works. By that I mean I don't understand whether it works the way I think it does. That is, does it calculate y-hat (the predicted values) and then add/subtract the upr/lwr bounds of the interval for each of them to produce several data points (which then look like a confidence line)?? Then I would understand why it is necessary for newdata to have the same length as the data in the linear model.
Make up some data:
d <- data.frame(x=c(1,4,5,7),
y=c(0.8,4.2,4.7,8))
Fit the model:
lm1 <- lm(y~x,data=d)
Confidence and prediction intervals with the original x values:
p_conf1 <- predict(lm1,interval="confidence")
p_pred1 <- predict(lm1,interval="prediction")
Conf. and pred. intervals with new x values (extrapolation and more finely/evenly spaced than original data):
nd <- data.frame(x=seq(0,8,length=51))
p_conf2 <- predict(lm1,interval="confidence",newdata=nd)
p_pred2 <- predict(lm1,interval="prediction",newdata=nd)
Plotting everything together:
par(las=1,bty="l") ## cosmetics
plot(y~x,data=d,ylim=c(-5,12),xlim=c(0,8)) ## data
abline(lm1) ## fit
matlines(d$x,p_conf1[,c("lwr","upr")],col=2,lty=1,type="b",pch="+")
matlines(d$x,p_pred1[,c("lwr","upr")],col=2,lty=2,type="b",pch=1)
matlines(nd$x,p_conf2[,c("lwr","upr")],col=4,lty=1,type="b",pch="+")
matlines(nd$x,p_pred2[,c("lwr","upr")],col=4,lty=2,type="b",pch=1)
Using new data allows for extrapolation beyond the original data; also, if the original data are sparsely or unevenly spaced, the prediction intervals (which are not straight lines) may not be well approximated by linear interpolation between the original x values ...
I'm not quite sure what you mean by the "confidence interval for one specific variable in my model"; if you want confidence intervals on a parameter, then you should use confint. If you want predictions for the changes based only on some of the parameters changing (ignoring the uncertainty due to the other parameters), then you do indeed want to use type="terms".
interval="none" (the default) just tells R not to bother computing any confidence or prediction intervals, and to return just the predicted values.
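As a small illustration of that distinction, reusing the lm1 fit from above (output omitted):
confint(lm1)                  # confidence intervals for the coefficients themselves
predict(lm1, type="terms")    # per-term contributions to the fitted values
predict(lm1)                  # same as interval="none": just the predicted values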

inverse of 'predict' function

Using predict() one can obtain the predicted value of the dependent variable (y) for a certain value of the independent variable (x) for a given model. Is there any function that predicts x for a given y?
For example:
kalythos <- data.frame(x = c(20,35,45,55,70),
n = rep(50,5), y = c(6,17,26,37,44))
kalythos$Ymat <- cbind(kalythos$y, kalythos$n - kalythos$y)
model <- glm(Ymat ~ x, family = binomial, data = kalythos)
If we want to know the predicted value of the model for x=50:
predict(model, data.frame(x=50), type = "response")
I want to know which x makes y=30, for example.
I saw that the previous answer was deleted. In your case, given n=50 and a binomial model, you would calculate x given y using:
f <- function(y, m) {
  # logit() is not in base R; use qlogis(), or logit() from e.g. the boot package
  (logit(y/50) - coef(m)[["(Intercept)"]]) / coef(m)[["x"]]
}
> f(30,model)
[1] 48.59833
But when doing so, you had better consult a statistician to show you how to calculate the inverse prediction interval. And please take @VitoshKa's considerations into account.
I came across this old thread but thought I would add some other info. The MASS package has the function dose.p for logit/probit models. The SE is obtained via the delta method.
> dose.p(model,p=.6)
Dose SE
p = 0.6: 48.59833 1.944772
Fitting the inverse model (x ~ y) would not make sense here because, as @VitoshKa says, we assume x is fixed and y (the 0/1 response) is random. Besides, if the data weren't grouped you'd have only 2 values of the explanatory variable: 0 and 1. But even though we assume x is fixed, it still makes sense to calculate a confidence interval for the dose x for a given p, contrary to what @VitoshKa says. Just as we can reparameterize the model in terms of ED50, we can do so for ED60 or any other quantile. Parameters are fixed, but we still calculate CIs for them.
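To make that reparameterization point concrete, dose.p accepts a vector of probabilities, so the dose and its delta-method SE can be obtained for ED50, ED60, ED90 and so on in one call (a quick sketch using the model fitted above; output omitted):
library(MASS)
dose.p(model, p=c(0.5, 0.6, 0.9))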
The chemCal package has an inverse.predict() function, which works for fits of the form y ~ x and y ~ x - 1.
You just have to rearrange the regression equation, but as the comments above state, this may prove tricky and does not necessarily have a meaningful interpretation.
However, for the case you presented you can use:
(1/coef(model)[2])*(model$family$linkfun(30/50)-coef(model)[1])
Note I did the division by the x coefficient first to allow the name attribute to be correct.
For just a quick view (without intervals and without considering the additional issues) you could use the TkPredict function in the TeachingDemos package. It does not do this directly, but it allows you to dynamically change the x value(s) and see the predicted y-value, so it would be fairly simple to move x until the desired y is found (for given values of any additional x's). This will also reveal possible problems where multiple combinations of x's give the same y.
