Curve-fitting with nls() in R

I have some data to which I want to fit a curve with a certain formula.
To do this, I use the nls() function, like so:
fitmodel <- nls(y ~ a+b/(1+exp(-((x-c)/d))),
data = combined,
start=list(a=200,b=2000, c=80, d=10.99),
trace=TRUE)
This works, but gives the warnings
"1: In min(x) : no non-missing arguments to min; returning Inf
2: In max(x) : no non-missing arguments to max; returning -Inf"
When I draw the function with the fitted parameters on my plot, it does not fit at all, as shown here:
jpw <- coef(fitmodel)
logfit <- function(x,jpw) {jpw[1] + jpw[2]/(1+exp(-((x-jpw[3])/jpw[4])))
}
points(logfit(x, jpw),type='l')
[plot: the fitted curve does not follow the data at all]
My friend tried to fit this data in another program. It finds the same parameters, and the function the other program draws fits the data beautifully.
Also, it is really easy to find parameters manually that make the curve follow the data well.
Where did I mess up? I am a beginner, so it might be something stupid.
Thank you in advance!
Edit: data file

Your problem is in your plotting: you are only giving one vector to points(), so it is used as the y values, and the x values default to the index 1, 2, 3, ... (if you look at your original plot, you can see it ends at 439, which is the number of points you have).
You can fix this by plotting with x too:
plot(combined$V1~combined$V3)
points(x,logfit(x,jpw), type = 'l')
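To see the pitfall in isolation, here is a minimal sketch with made-up values:
# With a single vector, plot()/points() use the index 1..n as the x coordinates.
y <- c(10, 20, 30)
plot(y)                  # x axis runs 1, 2, 3 (the index)
plot(c(5, 50, 500), y)   # x axis uses the actual x values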

It would be interesting to see your dataset.
Anyway, here is a working example; I hope it helps.
logfit <- function(x,jpw) {
jpw[1] + jpw[2]/(1+exp(-((x-jpw[3])/jpw[4])))
}
jpw <- c(-2,1,0,.5)
x <- runif(100, -3, 3)
y <- logfit(x, jpw)+rnorm(100, sd=0.01)
df <- data.frame(x,y)
curve(logfit(x, jpw), from = -3, to = 3, type = 'l')
points(x,y)
fitmodel <- nls(y ~ a + b/(1+exp(-((x-c)/d))),
data = df,
start=list(a=1, b=2, c=1, d=1),
trace=TRUE)
fitmodel
The output is:
Nonlinear regression model
model: y ~ a + b/(1 + exp(-((x - c)/d)))
data: df
        a         b         c         d 
-1.999901  1.002425  0.006527  0.498689 
residual sum-of-squares: 0.009408
Number of iterations to convergence: 6
Achieved convergence tolerance: 1.732e-06
Here I use the data given by @jpw.
df <- dget(file="data.txt")
names(df) <- c("y","v2","x")
fitmodel <- nls(y ~ a + b/(1+exp(-((x-c)/d))),
data = df,
start=list(a=200,b=2000, c=80, d=10.99),
trace=TRUE)
summary(fitmodel)
The estimated parameters are:
Formula: y ~ a + b/(1 + exp(-((x - c)/d)))
Parameters:
   Estimate Std. Error t value Pr(>|t|)    
a  231.6587     2.8498   81.29   <2e-16 ***
b 1893.0646     6.3528  297.99   <2e-16 ***
c  151.5405     0.2016  751.71   <2e-16 ***
d   17.2068     0.1779   96.72   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 37.19 on 473 degrees of freedom
Number of iterations to convergence: 10
Achieved convergence tolerance: 3.9e-06
And now I plot the results.
plot(df$x, df$y)
jpw.est <- coef(fitmodel)
curve(logfit(x,jpw.est), from=0, to=300, col="red", lwd=2, add=T)

Related

How to obtain Poisson's distribution "lambda" from R glm() coefficients

My R script produces the glm() coefficients below.
What is the Poisson lambda, then? It should be ~3.0, since that's what I used to create the distribution.
Call:
glm(formula = h_counts ~ ., family = poisson(link = log), data = pois_ideal_data)
Deviance Residuals:
    Min       1Q   Median       3Q      Max  
-22.726  -12.726   -8.624    6.405   18.515  
Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  8.222532   0.015100  544.53   <2e-16 ***
h_mids      -0.363560   0.004393  -82.75   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 11451.0 on 10 degrees of freedom
Residual deviance: 1975.5 on 9 degrees of freedom
AIC: 2059
Number of Fisher Scoring iterations: 5
random_pois = rpois(10000,3)
h=hist(random_pois, breaks = 10)
mean(random_pois) #verifying that the mean is close to 3.
h_mids = h$mids
h_counts = h$counts
pois_ideal_data <- data.frame(h_mids, h_counts)
pois_ideal_model <- glm(h_counts ~ ., pois_ideal_data, family=poisson(link=log))
summary_ideal=summary(pois_ideal_model)
summary_ideal
What are you doing here?! You used a glm to fit a distribution?!
Well, it is not impossible to do so, but it is done like this:
set.seed(0)
x <- rpois(10000,3)
fit <- glm(x ~ 1, family = poisson())
i.e., we fit data with an intercept-only regression model.
fit$fitted[1]
# 3.005
This is the same as:
mean(x)
# 3.005
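Equivalently, since the log link means the intercept is log(lambda), you can exponentiate the coefficient directly:
exp(coef(fit))
# (Intercept)
#       3.005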
It looks like you're trying to do a Poisson fit to aggregated or binned data; that's not what glm does. I took a quick look for canned ways of fitting distributions to binned data but couldn't find one; it looks like earlier versions of the bda package might have offered this, but not now.
At root, what you need to do is set up a negative log-likelihood function that sums (# counts in each bin) × log(prob(count | lambda)) over the bins and minimize it using optim(); the solution given below using the bbmle package is a little more complex up-front but gives you added benefits like easily computing confidence intervals.
Set up data:
set.seed(101)
random_pois <- rpois(10000,3)
tt <- table(random_pois)
dd <- data.frame(counts=unname(c(tt)),
val=as.numeric(names(tt)))
Here I'm using table rather than hist because histograms on discrete data are fussy (having integer cutpoints often makes things confusing because you have to be careful about right- vs. left-closure).
Set up density function for binned Poisson data (to work with bbmle's formula interface, the first argument must be called x, and it must have a log argument).
dpoisbin <- function(x,val,lambda,log=FALSE) {
probs <- dpois(val,lambda,log=TRUE)
r <- sum(x*probs)
if (log) r else exp(r)
}
Fit lambda (log link helps prevent numerical problems/warnings from negative lambda values):
library(bbmle)
m1 <- mle2(counts~dpoisbin(val,exp(loglambda)),
data=dd,
start=list(loglambda=0))
all.equal(unname(exp(coef(m1))),mean(random_pois),tol=1e-6) ## TRUE
exp(confint(m1))
## 2.5 % 97.5 %
## 2.972047 3.040009
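For comparison, the plain optim() route mentioned above takes only a few lines; a minimal sketch using the dd data frame built earlier:
# Binned negative log-likelihood, parameterized on the log scale
# to keep lambda positive:
nll <- function(loglambda) {
  -sum(dd$counts * dpois(dd$val, exp(loglambda), log = TRUE))
}
opt <- optim(0, nll, method = "BFGS")
exp(opt$par)  # should be very close to mean(random_pois)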

Different results, using same data and method(?), when using WordMat and R

I am interested in reproducing, in R, results calculated by WordMat (a GNU plugin for MS Word), but I can't get the two to arrive at similar results (I am not looking for identical results, just similar ones).
I have some y and x values and a power function, y = bx^a
Using the following data,
x <- c(15,31,37,44,51,59)
y <- c(126,71,61,53,47,42)
I get a = -0.8051 and b = 1117.7472 in WordMat, but a = -0.8026 and b = 1108.2533 in R; slightly different values.
Am I using the nls function in some wrong way or is there a better (more transparent) way to calculate it in R?
Data and R code,
# x <- c(15,31,37,44,51,59)
# y <- c(126,71,61,53,47,42)
df <- data.frame(x,y)
moD <- nls(y~a*x^b, df, start = list(a = 1,b=1))
summary(moD)
Formula: y ~ a * x^b
Parameters:
    Estimate Std. Error t value  Pr(>|t|)    
a  1.108e+03  1.298e+01   85.35  1.13e-07 ***
b -8.026e-01  3.626e-03 -221.36  2.50e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3296 on 4 degrees of freedom
Number of iterations to convergence: 19
Achieved convergence tolerance: 5.813e-06
It looks like WordMat is estimating the parameters of y=b*x^a by doing the log-log regression rather than by solving the nonlinear least-squares problem:
> x <- c(15,31,37,44,51,59)
> y <- c(126,71,61,53,47,42)
>
> (m1 <- lm(log(y)~log(x)))
Call:
lm(formula = log(y) ~ log(x))
Coefficients:
(Intercept) log(x)
7.0191 -0.8051
> exp(coef(m1)[1])
(Intercept)
1117.747
To explain what's going on here a little bit more: if y = b*x^a, taking the log of both sides gives log(y) = log(b) + a*log(x), which has the form of a linear regression (lm() in R). However, log-transforming also affects the variance of the errors (which are implicitly included on the right-hand side of the equation), meaning that you're actually solving a different problem. Which one is correct depends on exactly how you state the problem. This question on CrossValidated gives more details.
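To make the difference concrete, here is a small sketch comparing the two criteria on the x and y vectors above (nls start values chosen near the reported estimates):
m.nls <- nls(y ~ b*x^a, start = list(b = 1000, a = -0.8))
m.log <- lm(log(y) ~ log(x))
sum(residuals(m.nls)^2)          # RSS on the original scale: minimized by nls
sum((y - exp(fitted(m.log)))^2)  # slightly larger for the back-transformed log-log fit
# On the log scale the ordering reverses: each fit wins on its own criterion.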

R: Finding the coefficients of an expression which produce the largest R-squared value?

Let's say I've got some data entered into a data frame like so:
df = data.frame(x = c(1,2,3,4,5,10,15,25,50),
y = c(.57,.75,.82,0.87,.89,.95,.97,.98,.99))
df
and I wish to fit the expression:
y = ((x/a)^b)/(1+(x/a)^b)
where a and b are unknown parameters.
I have plotted the points and drawn a fitted line by guessing the values of a and b:
library(ggplot2)
graph <- ggplot(df, aes(x=x, y=y))
graph <- graph + geom_point()
a = 0.50
b = 1.00
guesstimate <- function(x){((x/a)^b)/(1+(x/a)^b)}
graph <- graph + stat_function(fun = guesstimate)
graph
However, I'd like to find the values of a and b that produce the highest R-squared value; i.e., the best possible mathematical fit to the data.
Question:
Short of guessing through the values of a and b manually and checking with the naked eye which fit is best, is there a way to get R to find the 'best' a and b values along with providing the R-squared value which confirms to me that the chosen a and b values are indeed the best possible fit?
You can use the nls (non-linear least squares) function:
m1 = nls(y ~ (x/a)^b/(1+(x/a)^b), list(a=1, b=1), data=df)
summary(m1)
Formula: y ~ (x/a)^b/(1 + (x/a)^b)
Parameters:
  Estimate Std. Error t value Pr(>|t|)    
a 0.779291   0.009444   82.51 1.01e-11 ***
b 1.145174   0.012733   89.94 5.53e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.003086 on 7 degrees of freedom
Number of iterations to convergence: 4
Achieved convergence tolerance: 5.949e-08
ggplot(df, aes(x,y)) +
geom_point() +
geom_line(data=data.frame(x=seq(1,50,0.1), y=predict(m1, newdata=list(x=seq(1,50,0.1)))),
aes(x,y), colour="red")
nls does not provide an r-squared value, because, as discussed in this thread on R-help, r-squared is not necessarily meaningful for a non-linear model. nls does, however, find the parameter values that minimize the residual sum-of-squares, so in that sense these parameters provide the best fit for the given data and model. That doesn't mean that there isn't another model specification that gives a better fit, though in this case the model fit is virtually perfect.
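If you still want a descriptive number, you can compute a pseudo-R-squared by hand, keeping the caveats above in mind:
rss <- sum(residuals(m1)^2)
tss <- sum((df$y - mean(df$y))^2)
1 - rss/tss  # close to 1 here, since the fit is nearly perfect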
Even if it is not obvious, a linear model can be applied here using basic algebra. Indeed, starting from 1/y = (1+(x/a)^b)/(x/a)^b, a little manipulation gets you to:
log(1/y - 1) = -b*log(x) + b*log(a)
which is basically a linear model in the variables Y = log(1/y - 1) and X = log(x). From here, you can use lm:
df2 <- data.frame(Y = log(1/df$y - 1), X = log(df$x))
model <- lm(Y ~ X, data = df2)$coefficients
a <- exp(-model[1]/model[2])
# 0.7491387
b <- -model[2]
# 1.116111
which are similar to those obtained with nls.
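As a practical aside, these linearized estimates also make good starting values for nls, e.g.:
m2 <- nls(y ~ (x/a)^b/(1+(x/a)^b), data = df,
          start = list(a = unname(a), b = unname(b)))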

fit exponential decay in increase form in R

I want to fit a function in the increase form of exponential decay (or asymptotic curve), such that:
Richness = C*(1-exp(k*Abundance)) # k < 0
I've read on this page about the expn() function, but simply can't find it (or an nls package). All I found was the nlstools package, but it has no expn(). I tried the usual nls with the exp function, but I only get increasing exponentials...
I want to fit the graph like below (drawn in Paint), and I don't know where the curve should stabilize (Richness = C). Thanks in advance.
This should get you started. Read the documentation on nls(...) (type ?nls at the command prompt). Also look up ?summary and ?predict.
set.seed(1) # so the example is reproducible
df <- data.frame(Abundance=sort(sample(1:70,30)))
df$Richness <- with(df, 20*(1-exp(-0.03*Abundance))+rnorm(30))
fit <- nls(Richness ~ C*(1-exp(k*Abundance)),data=df,
algorithm="port",
start=c(C=10,k=-1),lower=c(C=0,k=-Inf), upper=c(C=Inf,k=0))
summary(fit)
# Formula: Richness ~ C * (1 - exp(k * Abundance))
#
# Parameters:
#    Estimate Std. Error t value Pr(>|t|)    
# C 20.004173   0.726344   27.54  < 2e-16 ***
# k -0.030183   0.002334  -12.93  2.5e-13 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.7942 on 28 degrees of freedom
#
# Algorithm "port", convergence message: relative convergence (4)
df$pred <- predict(fit)
plot(df$Abundance,df$Richness)
lines(df$Abundance,df$pred, col="blue",lty=2)
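As an alternative, base R ships a self-starting model for exactly this asymptotic-through-the-origin form, which removes the need to guess starting values; a sketch:
# SSasympOrig fits Asym*(1 - exp(-exp(lrc)*input)), so Asym plays the
# role of C and k = -exp(lrc).
fit2 <- nls(Richness ~ SSasympOrig(Abundance, Asym, lrc), data = df)
coef(fit2)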
Thanks, jlhoward. I've got to something similar after reading the link sent by shujaa.
R <- function(a, b, abT) a*(1 - exp(-b*abT))
form <- Richness ~ R(a,b,Abundance)
fit <- nls(form, data=d, start=list(a=20,b=0.01))
plot(d$Abundance,d$Richness, xlab="Abundance", ylab="Richness")
lines(d$Abundance, predict(fit, list(Abundance = d$Abundance)))
I've found the initial values by trial and error, though. So your solution looks better :)
EDIT: The result: [plot of the fitted curve over the data]

Estimating a model of the form z=k(x^a)(y^b) in R

Given arrays of data x, y, z I need to estimate the constants k, a and b in
z = k x^a y^b
Some of the z data contain zeros, which makes taking logs of both sides difficult. Following discussion on Cross Validated, I think I need to fit a generalized linear model with a Poisson link function. But how do I express this in R? I have tried
glm(z ~ x*y,data,family=poisson)
But this outputs estimates for x, y and x*y separately, so I suspect they don't relate to the power terms I want to estimate. Meanwhile
glm(log(z) ~ log(x) + log(y), data, family=gaussian)
complains of NaN/Inf in the fit, presumably because of the zeros in z.
How do I encode this model correctly?
So there are (at least) two ways to do this. The first uses glm(...).
fit <- glm(z~log(x)+log(y),data,family=poisson)
Poisson glm fits a function of the form
z = exp[ b0 + b1 × x1 + b2 × x2 ]
Substituting x1 = log(x), x2 = log(y) gives
z = exp[ b0 + log(x^b1) + log(y^b2) ] = exp(b0) × x^b1 × y^b2
from which we can see by inspection that k = exp(b0), a = b1, and b = b2. In R code:
set.seed(1)
# generate some "data"
s <- seq(1,10,.2)
x <- sample(s,25)
y <- sample(s,25)
z <- 2*x^2.5*y^-1.5+rnorm(25,0,5)
z[z<0] <- 0 # OP's data has zeros
df <- data.frame(x,y,z)
# generalized linear model
fit.glm <- glm(z~log(x)+log(y),df,family=poisson)
summary(fit.glm)
# ...
#
# Coefficients:
#              Estimate Std. Error z value Pr(>|z|)    
# (Intercept)  0.65005    0.24973    2.603  0.00924 ** 
# log(x)       2.51848    0.12318   20.446  < 2e-16 ***
# log(y)      -1.47706    0.05288  -27.932  < 2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# (Dispersion parameter for poisson family taken to be 1)
#
# Null deviance: 1236.554 on 24 degrees of freedom
# Residual deviance: 25.509 on 22 degrees of freedom
# AIC: Inf
# ...
Note that in this artificial example, k=2, a=2.5, and b=-1.5. Since exp(0.65) = 1.92, we can see that glm does a good job of recovering these parameters. One problem with glm is that we are estimating log(k), not k, so if we want confidence limits on k we need another approach.
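One simple workaround is to back-transform an interval for the intercept, e.g. a Wald interval (a sketch):
ci.logk <- confint.default(fit.glm)["(Intercept)", ]
exp(ci.logk)  # approximate confidence limits for k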
As @BenBolker says, non-linear regression is also an option. Here we would use
# non-linear regression
fit.nls <- nls(z~k*x^a*y^b,df)
summary(fit.nls)
# Formula: z ~ k * x^a * y^b
#
# Parameters:
#    Estimate Std. Error t value Pr(>|t|)    
# k  1.95561    0.21705    9.01 7.77e-09 ***
# a  2.50086    0.05104   49.00  < 2e-16 ***
# b -1.45590    0.02651  -54.93  < 2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 3.153 on 22 degrees of freedom
#
# Number of iterations to convergence: 7
# Achieved convergence tolerance: 8.344e-07
The statistics of the fit are a bit easier to interpret here: since we are estimating k directly, the confidence limits are for k itself. The disadvantage of nls is that it can sometimes be difficult to get convergence; generally, one must supply starting estimates that are fairly close to the final values.
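In this case the glm fit itself supplies good starting values, so one option is to chain the two; a sketch:
cf <- coef(fit.glm)
fit.nls2 <- nls(z ~ k*x^a*y^b, data = df,
                start = list(k = exp(cf[[1]]), a = cf[[2]], b = cf[[3]]))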
We can plot z-predicted vs. z-actual to get a sense of how good the fit(s) are. This variation on the residuals plot is helpful when there is more than one predictor. The reference line, pred = actual, represents a perfect fit.
par(mfrow=c(1,2))
plot(z,predict(fit.glm,type="response"), main="glm",ylab="pred.Z",xlab="act.Z")
abline(0,1,lty=2,col="blue")
plot(z,predict(fit.nls), main="nls",ylab="pred.Z",xlab="act.Z")
abline(0,1,lty=2,col="blue")
Finally, we need to check the residuals for normality using the Q-Q plot.
par(mfrow=c(1,2))
resid <- residuals(fit.glm,type="response")
qqnorm(resid,main="glm")
qqline(resid, lty=2, col="blue")
resid <- residuals(fit.nls)
qqnorm(resid,main="nls")
qqline(resid, lty=2, col="blue")
Note that there is a greater deviation from normality than would be expected, given that the dataset was created by adding normally distributed error. This is mostly due to the addition of artificial zeros.
