Exponential regression with R (and negative values)

I am trying to fit a curve to a set of data points but have not succeeded, so I am asking here.
plot(time, val)                                             # look at the data
exponential.model <- lm(log(val) ~ time)                    # compute the model
fit <- exp(predict(exponential.model, list(time = time)))   # create the fitted curve
plot(time, val)                                             # plot the data again
lines(time, fit, lwd = 2)                                   # show the fitted line
My only problem is that my data contains negative values, so log(val) produces a lot of NAs and the model computation fails.
I know that my data does not necessarily look exponential, but I want to see the fit anyway. Another program shows me that val = 27.1331*exp(-time/2.88031) is a nice fit, but I do not know what I am doing wrong. I want to compute it with R.
I had the idea of shifting the data so that no negative values remain, but the result is poor and almost certainly wrong.
plot(time, val + 20)                                        # look at the shifted data
exponential.model <- lm(log(val + 20) ~ time)               # compute the model on shifted values
fit <- exp(predict(exponential.model, list(time = time)))   # create the fitted curve
plot(time, val)                                             # plot the original data again
lines(time, fit - 20, lwd = 2)                              # show the (BAD) fitted line
Thank you!

I figured some things out and have a satisfactory solution.
exponential.model <- lm(log(val) ~ time) # compute model
The log(val) term rescales the values so that a linear model can be applied. Since this is not possible with my (partly negative) values, a non-linear model (nls) has to be used instead.
exponential.model <- nls(val ~ a*exp(b*time), start = list(a = 30, b = -0.1))
This worked fine for me.
[plot: satisfying fit]
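For reference, here is a minimal self-contained sketch of the nls approach on simulated data (the data, the parameter names a and b, and the start values are illustrative assumptions, not the original dataset):
set.seed(1)
time <- seq(0, 10, length.out = 50)
val  <- 27 * exp(-time / 2.9) + rnorm(50, sd = 2)                                  # noisy exponential decay, can dip below zero
exponential.model <- nls(val ~ a * exp(b * time), start = list(a = 30, b = -0.1))  # fit on the original scale
coef(exponential.model)                                                            # estimated a and b
plot(time, val)                                                                    # the data
lines(time, predict(exponential.model), lwd = 2)                                   # the fitted curve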

How to extract a single Y variable value using an X variable without using a plot

me again!
In one of my assignments I have to create a plot with a regression line in it and then simply read values off this plot.
Question: "At 80 degrees F, what is the wind speed?"
By simply looking at the plot you can state that it is ~9 m/s at 80 F. That would suffice, but knowing what R can do, I would like to know how to compute it properly, both now and for future reference.
How does one, using only the data frame (in the picture), extract a Y value for a given X value using linear regression?
Linear regression, because the value itself isn't in the data, but it can be estimated if you assume a linear relationship.
So in essence, instead of reading the value off the plot (pic 2), I would like to use a function that, given an X (temp) value from the data frame, prints out a Y (wind) value using linear regression.
I tried other things I found on Stack Overflow, using
lm(data ~ data, dataframe), but that didn't give me the result I wanted.
You might be looking for the predict function.
First fit a linear regression and then calculate the predicted value with predict. Just keep in mind that you pass your X value inside a data.frame.
datasets::airquality                      # view the built-in airquality dataset (has Wind and Temp)
lm_air <- lm(Wind ~ Temp, airquality)     # fit the linear regression
predict(lm_air, data.frame(Temp = 80))    # predicted wind speed at 80 degrees F
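If an uncertainty band around that value is also of interest, predict() for lm objects can return one directly; for example:
predict(lm_air, data.frame(Temp = 80), interval = "prediction")   # fit plus lower/upper prediction bounds
predict(lm_air, data.frame(Temp = c(70, 80, 90)))                 # several temperatures at once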

R absolute value of residuals with log transformation

I have a linear model in R of the form
lm(log(num_encounters) ~ log(distance)*sampling_effort, data=df)
I want to interpret the residuals but get them back on the scale of num_encounters. I have seen residuals.lm(x, type="working") and residuals.lm(x, type="response"), but I'm not sure about the values they return. Do I, for instance, still need to use exp() to get the residual values back on the num_encounters scale, or are they already on that scale? I want to plot these absolute residuals afterwards, both in a histogram and in a raster map.
EDIT:
Basically, my confusion is that the following code results in three different histograms, while I was expecting the first two to be identical.
library(lattice)                                  # histogram() comes from lattice
df$predicted <- exp(predict(x, newdata = df))     # back-transform the predictions to the original scale
histogram(df$num_encounters - df$predicted)
histogram(exp(residuals(x, type = "response")))
histogram(residuals(x, type = "response"))
I want to interpret the residuals but get them back on the scale of num_encounters.
You can easily calculate them:
mod <- lm(log(num_encounters) ~ log(distance)*sampling_effort, data=df)
res <- df$num_encounters - exp(predict(mod))
In addition to what @Roland suggests, which is indeed correct and works, my confusion was just a matter of basic high-school logarithm algebra.
Indeed, the absolute response residuals (on the scale of the original dependent variable) can be calculated as @Roland says with
mod <- lm(log(num_encounters) ~ log(distance)*sampling_effort, data=df)
res <- df$num_encounters - exp(predict(mod))
If you want to calculate them from the model residuals instead, you need to take the logarithm subtraction rule into account:
log(a) - log(b) = log(a/b)
The residual is calculated on the scale of the fitted model. In my case the model predicts log(num_encounters), so the residual is log(observed) - log(predicted).
What I was trying to do was
exp(resid) = exp(log(obs) - log(pred)) = exp(log(obs/pred)) = obs/pred
which is clearly not the number I was looking for. Since exp(resid) = obs/pred, it follows that pred = obs/exp(resid), so the absolute response residual I needed is
obs - pred = obs - obs/exp(resid)
So in R code, this is what you could also do:
mod <- lm(log(num_encounters) ~ log(distance)*sampling_effort, data=df)
abs_resid <- df$num_encounters - df$num_encounters/exp(residuals(mod, type="response"))
This gives the same numbers as the method described by @Roland, which is of course much easier. But at least I got my brain lined up again.
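A quick sanity check on simulated data that the two calculations agree (the data-generating process and the object names sim and mod_sim below are made up purely for the illustration):
set.seed(42)
sim <- data.frame(distance = runif(100, 1, 10), sampling_effort = runif(100, 1, 5))
sim$num_encounters <- exp(1 + 0.5 * log(sim$distance) + 0.2 * sim$sampling_effort + rnorm(100, sd = 0.3))
mod_sim <- lm(log(num_encounters) ~ log(distance) * sampling_effort, data = sim)
res_direct   <- sim$num_encounters - exp(predict(mod_sim))                                           # @Roland's calculation
res_from_mod <- sim$num_encounters - sim$num_encounters / exp(residuals(mod_sim, type = "response")) # via the model residuals
all.equal(res_direct, res_from_mod)                                                                  # TRUE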

How to plot a comparison of two fixed categorical values for linear regression of another continuous variable

So I want to plot this:
lmfit = lm (y ~ a + b)
but, "b" only has the values of zero and one. So, I want to plot two separate regression lines, that are paralel to one one another to show the difference that b makes to the y-intercept. So after plotting this:
plot(b,y)
I want to then use abline(lmfit,col="red",lwd=2) twice, once with the x value of b set to zero, and once with it set to one. So once without the term included, and once where b is just 1b.
To restate: b is categorical, 0 or 1. a is continuous with a slight linear trend.
Thank you.
You might want to consider using predict(...) with b=0 and b=1, as follows. Since you didn't provide any data, I'm using the built-in mtcars dataset.
lmfit <- lm(mpg ~ wt + cyl, mtcars)                 # additive model: continuous wt plus cyl
plot(mpg ~ wt, mtcars, col = mtcars$cyl, pch = 20)
curve(predict(lmfit, newdata = data.frame(wt = x, cyl = 4)), col = 4, add = TRUE)
curve(predict(lmfit, newdata = data.frame(wt = x, cyl = 6)), col = 6, add = TRUE)
curve(predict(lmfit, newdata = data.frame(wt = x, cyl = 8)), col = 8, add = TRUE)
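A legend may help tell the three lines apart; a possible addition to the plot above:
legend("topright", legend = c("cyl = 4", "cyl = 6", "cyl = 8"),
       col = c(4, 6, 8), lty = 1, pch = 20)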
Given you have an additive lm model to begin with, drawing the lines is pretty straightforward, even though not completely intuitive. I tested it with the following simulated data:
y <- rnorm(30)
a <- rep(1:10,times=3)
b <- rep(c(1,0),each=15)
LM <- lm(y~a+b)
You have to access the coefficient values stored in the lm object:
LM$coefficients
Here comes the tricky part: you have to build the intercept and slope for each line from these coefficients.
The first one (b = 0) is easy:
abline(LM$coef[1], LM$coef[2])
The other one is a bit more involved, because R reports the b coefficient as an additive contrast, so for the second line (b = 1) you have:
abline(LM$coef[1] + LM$coef[3], LM$coef[2])
I hope this is what you were expecting.
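Putting the pieces of this answer together into one runnable sketch (the plot() call is added here, since abline() needs an open plot):
set.seed(1)
y <- rnorm(30)
a <- rep(1:10, times = 3)
b <- rep(c(1, 0), each = 15)
LM <- lm(y ~ a + b)
plot(a, y, col = b + 1, pch = 20)                       # points coloured by group b
abline(LM$coef[1], LM$coef[2])                          # line for b = 0
abline(LM$coef[1] + LM$coef[3], LM$coef[2], lty = 2)    # parallel line for b = 1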
Unless I've misunderstood the question, all you have to do is run abline again but on a model without the b term.
abline(lm(y~a),col="red",lwd=2)

Error returned predicting new data using GAM with periodic smoother

Apologies if this is better suited to Cross Validated.
I am fitting GAM models to binomial data using the mgcv package in R. One of the covariates is periodic, so I am specifying a cyclic cubic spline with bs = "cc". I am doing this in a cross-validation framework, but when I fit my holdout data using the predict function I get the following error:
Error in pred.mat(x, object$xp, object$BD) :
can't predict outside range of knots with periodic smoother
Here is some code that should replicate the error:
library(mgcv)                          # for gam() and the "cc" cyclic basis
# generate data:
x <- runif(100, min = -pi, max = pi)
linPred <- 2*cos(x)                    # value of the linear predictor
theta <- 1 / (1 + exp(-linPred))       # inverse logit: success probability
y <- rbinom(100, 1, theta)
plot(x, theta)
df <- data.frame(x = x, y = y)
# fit gam with periodic smoother:
gamFit <- gam(y ~ s(x, bs = "cc", k = 5), data = df, family = binomial())
summary(gamFit)
plot(gamFit)
# predict y values for new data:
x.2 <- runif(100, min = -pi, max = pi)
df.2 <- data.frame(x = x.2)
predict(gamFit, newdata = df.2)
Any suggestions on where I'm going wrong would be greatly appreciated. Maybe manually specifying knots to fall on -pi and pi?
I did not get an error on the first run, but I did replicate it on the second try. You can reproduce both cases with set.seed(123) (no error) and set.seed(223) (produces the error). I think you are just seeing sampling variation with a relatively small number of points in your derivation and validation datasets; 100 points is not particularly "generous" for a GAM fit.
Looking at the gamFit object it appears that the range of the knots is encoded in gamFit$smooth[[1]]['xp'], so this should restrict your inputs to the proper range:
x.2 <- runif(100, min = -pi, max = pi)
x.2 <- x.2[findInterval(x.2, range(gamFit$smooth[[1]]['xp'])) == 1]  # keep only points inside the knot range
# Removes the errors in all the situations I tested
# There were three points outside the range in the set.seed(223) case
The problem is that your test set contains values that were not in the range of your training set. Since you used a spline, knots were created at the minimum and maximum value of x, and your fitted function is not defined outside of that range. So, when you test the model, you should exclude those points that are outside the range. Here is how you would exclude the points in the test set:
set.seed(2)
# ... <your code from the question> ...
predict(gamFit, newdata = df.2[df.2$x >= min(df$x) & df.2$x <= max(df$x), , drop = FALSE])
Or, you could specify the "outer" knot points in the model to the min and max of your whole data. I don't know how to do that offhand.
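One untested sketch of that idea: mgcv's gam() accepts a knots argument (a named list), so the knot locations for the cyclic smooth could be pinned to span all of [-pi, pi] instead of just the training data's range. Whether k evenly spaced knots behave exactly as intended should be checked against ?smooth.construct.cc.smooth.spec:
# assumption: supplying k equally spaced knots from -pi to pi makes the cyclic
# smooth valid over the whole interval, so holdout points near the boundaries
# no longer fall outside the knot range
gamFit2 <- gam(y ~ s(x, bs = "cc", k = 5), data = df, family = binomial(),
               knots = list(x = seq(-pi, pi, length.out = 5)))
predict(gamFit2, newdata = df.2)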

Calculating ROC/AUC for MaxEnt and BIOMOD

Thanks very much, Winchester, for the kind help! I also saw the tutorial and that worked for me! In the past two days I explored the output of both MaxEnt and BIOMOD, and I think I am still a little confused by the terms used in the two.
From Phillips' code, it seems that he used the sample points and background points to calculate the ROC, while in BIOMOD there are only predictions for the presence and pseudo-absence points. This means that, for the same dataset, I have the same number of presence/sample data but different absence/background data for the two models. And when I recalculate the ROC, it is usually inconsistent with the values reported by the models themselves.
I think I still do not understand some points of model evaluation: what exactly is being evaluated, how the evaluation dataset (i.e. the confusion matrix) is generated, and which part of the data is used for evaluation.
Thanks, everybody, for the kind replies! I am very sorry for the inconvenience. I appended a few more lines to the BIOMOD code to make it runnable; as for MaxEnt, you can use the tutorial data.
Actually, the intent of my post is to find someone who has experience working with both presence/absence datasets and presence-only datasets. I probably know how to deal with them separately, but not together.
I am using both MaxEnt and a few algorithms under BIOMOD for the distribution of my species, and I would like to plot the ROC/AUC in the same figure; has anybody done this before?
As far as I know, for MaxEnt the ROC can be plotted using the ROCR and vcd libraries, as given in the MaxEnt tutorial by Phillips:
install.packages("ROCR", dependencies=TRUE)
install.packages("vcd", dependencies=TRUE)
library(ROCR)
library(vcd)
library(boot)
setwd("c:/maxent/tutorial/outputs")
presence <- read.csv("bradypus_variegatus_samplePredictions.csv")
background <- read.csv("bradypus_variegatus_backgroundPredictions.csv")
pp <- presence$Logistic.prediction # get the column of predictions
testpp <- pp[presence$Test.or.train=="test"] # select only test points
trainpp <- pp[presence$Test.or.train=="train"] # select only training points
bb <- background$logistic
combined <- c(testpp, bb) # combine into a single vector
label <- c(rep(1,length(testpp)),rep(0,length(bb))) # labels: 1=present, 0=random
pred <- prediction(combined, label) # labeled predictions
perf <- performance(pred, "tpr", "fpr") # True / false positives, for ROC curve
plot(perf, colorize=TRUE) # Show the ROC curve
performance(pred, "auc")#y.values[[1]] # Calculate the AUC
BIOMOD, on the other hand, requires presence/absence data, so I used 1000 pseudo-absence points and there is no background. I found another script given by Thuiller himself:
library(BIOMOD)
library(PresenceAbsence)
data(Sp.Env)
Initial.State(Response=Sp.Env[,12:13], Explanatory=Sp.Env[,4:10],
IndependentResponse=NULL, IndependentExplanatory=NULL)
Models(GAM = TRUE, NbRunEval = 1, DataSplit = 80,
Yweights=NULL, Roc=TRUE, Optimized.Threshold.Roc=TRUE, Kappa=F, TSS=F, KeepPredIndependent = FALSE, VarImport=0,
NbRepPA=0, strategy="circles", coor=CoorXY, distance=2, nb.absences=1000)
load("pred/Pred_Sp277")
data=cbind(Sp.Env[,1], Sp.Env[,13], Pred_Sp277[,3,1,1]/1000)
plotroc <- roc.plot.calculate(data)
plot(plotroc$threshold, plotroc$sensitivity, type="l", col="blue")
lines(plotroc$threshold, plotroc$specificity)
lines(plotroc$threshold, (plotroc$specificity+plotroc$sensitivity)/2, col="red")
Now, the problem is: how can I plot them together? I have tried both approaches; each works well separately, but not combined. Maybe I need someone to help me understand the underlying philosophy of ROC.
Thanks in advance~
Marco
Ideally, if you are going to compare methods, you should probably generate predictions from MaxEnt and BIOMOD for each location of the testing portion of your data set (observed presences and absences). As Christian mentioned, pROC is a nice package, especially for comparing ROC curves. Although I don't have access to the data, I've generated a dummy data set which should illustrate plotting two ROC curves and calculating the difference in AUC.
library(pROC)
#Create dummy data set for test observations
obs<-rep(0:1, each=50)
pred1<-c(runif(50,min=0,max=0.8),runif(50,min=0.3,max=0.6))
pred2<-c(runif(50,min=0,max=0.6),runif(50,min=0.4,max=0.9))
roc1<-roc(obs~pred1) # Calculate ROC for each method
roc2<-roc(obs~pred2)
#Plot roc curves for each method
plot(roc1)
lines(roc2,col="red")
#Compare differences in area under ROC
roc.test(roc1,roc2,method="bootstrap",paired=TRUE)
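If the AUC values themselves are wanted as well, e.g. to annotate the figure, pROC's auc() returns them; the legend below is just one possible way of labelling the two curves:
auc(roc1)                                                   # AUC for method 1
auc(roc2)                                                   # AUC for method 2
legend("bottomright",
       legend = c(paste("Method 1, AUC =", round(auc(roc1), 3)),
                  paste("Method 2, AUC =", round(auc(roc2), 3))),
       col = c("black", "red"), lty = 1)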
I still couldn't get your code to work, but here is an example with the demonstration data from the PresenceAbsence package. I've plotted your lines, then added a bold line for the ROC curve. If you were labelling it, the false positive rate (1 - specificity) would be on the x-axis, with the true positive rate (sensitivity) on the y-axis, but those labels would not be accurate for the other lines that are present. Is this what you wanted to do?
data(SIM3DATA)
plotroc <- roc.plot.calculate(SIM3DATA,which.model=2, xlab = NULL, ylab = NULL)
plot(plotroc$threshold, plotroc$sensitivity, type="l", col="blue")
lines(plotroc$threshold, plotroc$specificity)
lines(plotroc$threshold, (plotroc$specificity+plotroc$sensitivity)/2, col="red")
lines(1 - plotroc$specificity, plotroc$sensitivity, lwd = 2, lty = 5)
I have been using the pROC package. It has a lot of nice features when it comes to plotting ROC and AUC in the same graph. Furthermore, it is very easy to use.
