How can I exclude certain values for curve fitting in R?

I have the following plot of some experimental data (see below). The red line is a fitted curve through the black dots, which are experimental values. The first three dots, at 0, 0.583, and 1.916, form a baseline, and the next two, at 2.083 and 2.416, appear to be outliers. How can I program the fit so that it does not take the baseline and the outliers into account? At the moment, R is clearly trying to optimize for those irrelevant values as well.
x <- c(0, 0.583333, 1.916666, 2.083333, 2.416666, 2.5,
       3.666666, 5.916666, 9, 16.75, 20)
y <- c(0.05464, 0.05453, 0.0544, 0.18043, 0.18151, 0.12551,
       0.18792, 0.2497, 0.28359, 0.31734, 0.3263)
plot(x, y, ylim = c(0, 0.45), pch = 1)
fit <- nls(y ~ -p1 / exp(x) + p1, start = list(p1 = 1))
xx <- seq(0, 20, length = 200)
lines(xx, predict(fit, data.frame(x = xx)), col = "red")

To avoid fitting the first 5 points, use the subset= argument of nls, giving a vector of the negative positions to exclude:
nls(y ~ -p1 / exp(x) + p1, start = list(p1 = 1), subset = -seq(5))
Note that this model is actually linear in its single parameter, since it can be rewritten as y = p1 * (1 - exp(-x)), so we could use lm instead of nls:
lm(y ~ I(1-exp(-x)) - 1, subset = -seq(5))
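As a quick sanity check (a sketch reusing the data above), the two subsetted fits should agree on the single parameter, and the refitted curve can be overlaid on the full data:
fit_nls <- nls(y ~ -p1 / exp(x) + p1, start = list(p1 = 1), subset = -seq(5))
fit_lm  <- lm(y ~ I(1 - exp(-x)) - 1, subset = -seq(5))
coef(fit_nls)  # p1 from nls
coef(fit_lm)   # the same value, since the model is linear in p1
plot(x, y, ylim = c(0, 0.45), pch = 1)
xx <- seq(0, 20, length = 200)
lines(xx, predict(fit_nls, data.frame(x = xx)), col = "red")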

Related

Fitting a sigmoid curve using a logistic function in R

I have data that follow a sigmoid curve, and I would like to fit a logistic function to extract the three (or two) parameters for each participant. I have found some methods online, but I'm not sure which is the correct option.
This tutorial explains that you should use the nls() function like this:
fitmodel <- nls(y ~ a / (1 + exp(-b * (x - c))), start = list(a = 1, b = 0.5, c = 25))
## get the coefficients using the coef function
params <- coef(fitmodel)
... where you apparently need to provide starting values for it to find the best-fitting values (?).
And then this post explains that, to get the starting values, you can use a self-starting model, which "can estimate good starting values for you, so you don't have to specify them":
fit <- nls(y ~ SSlogis(x, Asym, xmid, scal), data = data.frame(x, y))
However, somewhere else I also read that you should use the SSlogis function for fitting a logistic function. Could someone please confirm whether these two steps are the best way to go about it? Or should I use values extracted from previous, similar data as the starting values?
Additionally, what should I do if I don't want the logistic function to be defined by the asymptote at all?
Thank you!
There isn't a single best way: SSlogis eliminates having to set starting values, whereas specifying the formula yourself gives you more control over the parameterization.
If the question is really how to fix a at a predetermined level (here the value 1) without rewriting the formula, then set a before running nls and omit it from the starting values:
a <- 1
fo <- y ~ a / (1 + exp(-b * (x-c)))
nls(fo, start = list(b = 0.5, c = 25))
Alternately, this substitutes a = 1 into the formula fo, giving fo2, without you having to rewrite the formula yourself:
fo2 <- do.call("substitute", list(fo, list(a = 1)))
nls(fo2, start = list(b = 0.5, c = 25))
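To make the comparison concrete, here is a sketch on toy data (the data and the true parameter values below are assumptions, not the poster's), showing the self-starting fit and the fixed-asymptote fit via substitute side by side:
set.seed(1)
x <- seq(0, 50)
y <- 1 / (1 + exp(-0.3 * (x - 25))) + rnorm(length(x), 0, 0.02)  # toy sigmoid
fit_ss <- nls(y ~ SSlogis(x, Asym, xmid, scal), data = data.frame(x, y))
fo <- y ~ a / (1 + exp(-b * (x - c)))
fo2 <- do.call("substitute", list(fo, list(a = 1)))  # pin the asymptote at 1
fit_fixed <- nls(fo2, start = list(b = 0.5, c = 25))
coef(fit_ss)     # Asym, xmid, scal (scal corresponds to 1/b)
coef(fit_fixed)  # b and c, with the asymptote fixed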
As @G. Grothendieck writes, there is no general "best way"; it always depends on your particular aims. Using SSlogis is a good idea because you don't need to specify starting values, but defining your own function is more flexible. See the following example, where we use heuristics to derive the starting values ourselves instead of specifying them manually. We then fit a logistic model and, as a small bonus, the Baranyi growth model with an explicit lag phase.
# time (t)
x <- c(0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
# Algae cell counts (Mio cells per ml)
y <- c(0.88, 1.02, 1.43, 2.79, 4.61, 7.12,
       6.47, 8.16, 7.28, 5.67, 6.91)
## we now plot the data linearly and logarithmically
## the layout function is another way to subdivide the plotting area
nf <- layout(matrix(c(1,2,3,3), 2, 2, byrow = TRUE), respect = TRUE)
layout.show(nf) # this shows how the plotting area is subdivided
plot(x, y)
plot(x, log(y))
## we see that the first points show the steepest increase,
## so we can estimate a start value of the growth rate
r <- (log(y[5]) - log(y[1])) / (x[5] - x[1])
abline(a=log(y[1]), b=r)
## this way, we have a heuristics for all start parameters:
## r: steepest increase of y in log scale
## K: maximum value
## N0: first value
## we can check this by plotting the function with the start values
f <- function(x, r, K, N0) { K / (1 + (K/N0 - 1) * exp(-r * x)) }
plot(x, y, pch=16, xlab="time (days)", ylab="algae (Mio cells)")
lines(x, f(x, r=r, K=max(y), N0=y[1]), col="blue")
pstart <- c(r=r, K=max(y), N0=y[1])
aFit <- nls(y ~ f(x, r, K, N0), start = pstart, trace = TRUE)
x1 <- seq(0, 25, length = 100)
lines(x1, predict(aFit, data.frame(x = x1)), col = "red")
legend("topleft",
legend = c("data", "start parameters", "fitted parameters"),
col = c("black", "blue", "red"),
lty = c(0, 1, 1),
pch = c(16, NA, NA))
summary(aFit)
(Rsquared <- 1 - var(residuals(aFit))/var(y))
## =============================================================================
## Approach with Baranyi-Roberts model
## =============================================================================
## sometimes, a logistic is not good enough. In this case, use another growth
## model
baranyi <- function(x, r, K, N0, h0) {
  A <- x + 1/r * log(exp(-r * x) + exp(-h0) - exp(-r * x - h0))
  y <- exp(log(N0) + r * A - log(1 + (exp(r * A) - 1) / exp(log(K) - log(N0))))
  y
}
pstart <- c(r = 0.5, K = 7, N0 = 1, h0 = 2)
fit2 <- nls(y ~ baranyi(x, r, K, N0, h0), start = pstart, trace = TRUE)
lines(x1, predict(fit2, data.frame(x = x1)), col = "forestgreen", lwd=2)
legend("topleft",
legend = c("data", "logistic model", "Baranyi-Roberts model"),
col = c("black", "red", "forestgreen"),
lty = c(0, 1, 1),
pch = c(16, NA, NA))
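To check whether the extra lag-phase parameter h0 actually pays off, one can compare the two fits; a quick sketch using the aFit and fit2 objects from above:
deviance(aFit)   # residual sum of squares of the logistic fit
deviance(fit2)   # residual sum of squares of the Baranyi-Roberts fit
AIC(aFit, fit2)  # penalizes the additional parameter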

Creating a Smooth Line in 3D R

I have a set of 3-dimensional points, like the sample data below, and I would like to create a smooth line from it. There is information out there about how to smooth a 2D surface in 3D space, but how would I smooth a 1D line in 3D space?
Z <- seq(0, 1, 0.01)
X <- rnorm(length(Z), mean = 0, sd = 0.1)
Y <- 2 * Z^2 + rnorm(length(Z), mean = 0, sd = 0.1)
data <- data.frame(X = X, Y = Y, Z = Z)
This is an example of multivariate regression. If you happen to know that the relationship with Z should be quadratic, you can do
fit <- lm(cbind(X, Y) ~ poly(Z, 2))
But I'm assuming you don't know that, and want some kind of general smoother. I don't think loess, lowess, or gam handle multivariate regression, but you can use natural splines in lm:
library(splines)
fit <- lm(cbind(X, Y) ~ ns(Z, df = 4))
The fitted values will be returned in a two-column matrix by predict(fit).
To plot the result, you can use rgl:
library(rgl)
plot3d(X, Y, Z, col = "red")
lines3d(cbind(predict(fit), Z))
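If the line should be smoother than the raw Z locations allow, the fit can be evaluated on a finer grid first; a small sketch, assuming the ns() fit from above:
## predict() on the multivariate lm returns a two-column (X, Y) matrix
zz <- seq(0, 1, length.out = 200)
pred <- predict(fit, newdata = data.frame(Z = zz))
lines3d(cbind(pred, Z = zz), col = "blue")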

"lines()" acting like "polygon()" R?

I'm trying to draw two black lines around a red regression line, but the lines() command draws something that looks more like a polygon() than a simple line (see the code below). Is there a fix to simply draw two lines around the regression line (i.e., uncertainty intervals), or am I missing something?
library(rstanarm)
data(kidiq)
d <- kidiq
fit <- stan_glm(kid_score ~ mom_iq,
                data = d,
                prior = normal(0, 2.5),
                prior_intercept = normal(0, 10),
                prior_aux = cauchy(0, 100))
plot(kid_score ~ mom_iq, data = d, type = "n")
abline(fit, col = 2)
pred_lin <- posterior_linpred(fit)
loop <- length(d$mom_iq)
I <- matrix(NA, loop, 2)
for (i in 1:loop) {
  I[i, ] <- quantile(pred_lin[, i], c(.025, .975))
}
lines(d$mom_iq, I[, 1], lty = 2)
lines(d$mom_iq, I[, 2])
lines() connects points in the order they are given, and d$mom_iq is not sorted, so the path doubles back on itself and looks like a polygon. Order the data first, e.g.:
a <- cbind(d$mom_iq, I[,1])
a <- a[order(a[,1]),]
lines(a)
So you can also write:
lines(sort(d$mom_iq), I[,2][order(d$mom_iq)])
or simply:
apply(I, 2, function(x) lines(sort(d$mom_iq), x[order(d$mom_iq)]))
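A small variation on the same idea: after computing the ordering once, matlines() draws all columns of I in a single call (ord is just a helper variable introduced here):
ord <- order(d$mom_iq)
matlines(d$mom_iq[ord], I[ord, ], lty = 2, col = "black")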

How are the "plot.gam" confidence intervals calculated?

If a model is fitted using mgcv and then the smooth terms are plotted,
m <- gam(y ~ s(x))
plot(m, shade = TRUE)
then you get a plot of the curve with a confidence band. These are, I presume, pointwise confidence intervals. How are they computed?
I tried to write
object <- plot(m, shade = TRUE)
object[[1]]$fit + 2 * object[[1]]$se
object[[1]]$fit - 2 * object[[1]]$se
in order to extract the lower and upper bounds using the standard errors and a multiplier of 2, but when I plot them, they look a bit different from the confidence intervals plotted by plot.gam.
So, how are those calculated?
I do not use seWithMean = TRUE or anything like that.
It is 1 standard error: the band is drawn at fit ± se, using the fit and se components returned by plot.gam.
oo <- plot.gam(m)
oo <- oo[[1]]
points(oo$x, oo$fit, pch = 20)
points(oo$x, oo$fit - oo$se, pch = 20)
points(oo$x, oo$fit + oo$se, pch = 20)
Reproducible example:
library(mgcv)
set.seed(0)
x <- seq(0, 2 * pi, length = 100)
y <- x * sin(x) + rnorm(100, 0, 0.5)
m <- gam(y ~ s(x))
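To see exactly what multiplier plot.gam folds into its se component, one can compare it with the raw standard errors from predict(); a sketch, assuming the model m from the reproducible example above:
oo <- plot(m, shade = TRUE)[[1]]
pr <- predict(m, newdata = data.frame(x = oo$x),
              type = "terms", se.fit = TRUE)
head(oo$se / pr$se.fit[, "s(x)"])     # ratio reveals the multiplier plot.gam used
lines(oo$x, oo$fit + oo$se, lty = 2)  # band edges, on top of the shaded plot
lines(oo$x, oo$fit - oo$se, lty = 2)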

How to perform a non-linear regression for my data

I have a set of temperature values and a calculated discomfort-index value for each temperature. When I plot temperature (x axis) against the calculated discomfort-index value (y axis), I get a reversed-U-shaped curve. I want to perform non-linear regression on it and convert the result into a PMML model. My aim is to get the predicted discomfort value when I supply a certain temperature.
Please find the dataset below:
Temp <- c(0, 5, 10, 6, 9, 13, 15, 16, 20, 21, 24, 26,
          29, 30, 32, 34, 36, 38, 40, 43, 44, 45, 50, 60)
Disc <- c(0.00, 0.10, 0.25, 0.15, 0.24, 0.26, 0.30, 0.31, 0.40, 0.41, 0.49, 0.50,
          0.56, 0.80, 0.90, 1.00, 1.00, 1.00, 0.80, 0.50, 0.40, 0.20, 0.15, 0.00)
How can I do non-linear regression (possibly with nls?) on this dataset?
Having taken a look at this, I don't think it is as simple as using nls, as most of us first thought. nls fits a parametric model, but from your data (the scatter plot) it is hard to propose a reasonable model form, so I would suggest non-parametric smoothing instead.
There are many scatter-plot smoothing methods, like kernel smoothing (ksmooth), smoothing splines (smooth.spline), and LOESS (loess). I prefer smooth.spline, and here is what we can do with it:
fit <- smooth.spline(Temp, Disc)
Please read ?smooth.spline for what it takes and what it returns. We can check the fitted spline curve by
plot(Temp, Disc)
lines(fit, col = 2)
Should you want to make predictions elsewhere, use the predict function (predict.smooth.spline). For example, to predict at Temp = 20 and Temp = 44, we can use
predict(fit, c(20,44))$y
# [1] 0.3940963 0.3752191
Prediction outside range(Temp) is not recommended, as it can suffer from bad extrapolation.
Before resorting to a non-parametric method, I also tried non-linear regression with regression splines and an orthogonal polynomial basis, but they don't give satisfying results. The major reason is that there is no penalty on smoothness. As an example, here is an attempt with poly:
try1 <- lm(Disc ~ poly(Temp, degree = 3))
try2 <- lm(Disc ~ poly(Temp, degree = 4))
try3 <- lm(Disc ~ poly(Temp, degree = 5))
plot(Temp, Disc, ylim = c(-0.3,1.0))
x <- seq(min(Temp), max(Temp), length = 50)
newdat <- list(Temp = x)
lines(x, predict(try1, newdat), col = 2)
lines(x, predict(try2, newdat), col = 3)
lines(x, predict(try3, newdat), col = 4)
We can see that the fitted curves are artificial.
We can also fit polynomials directly as follows, but they will overfit the data as the degree increases:
ord <- order(Temp)  # Temp is not sorted, so order it before drawing lines
m4 <- nls(Disc ~ a + b*Temp + c*Temp^2 + d*Temp^3 + e*Temp^4,
          start = list(a = 0, b = 1, c = 1, d = 1, e = 1))
plot(Temp, Disc, pch = 19)
lines(Temp[ord], predict(m4)[ord], lty = 2, col = "red", lwd = 3)
m5 <- nls(Disc ~ a + b*Temp + c*Temp^2 + d*Temp^3 + e*Temp^4 + f*Temp^5,
          start = list(a = 0, b = 1, c = 1, d = 1, e = 1, f = 1))
lines(Temp[ord], predict(m5)[ord], lty = 2, col = "blue", lwd = 3)
m6 <- nls(Disc ~ a + b*Temp + c*Temp^2 + d*Temp^3 + e*Temp^4 + f*Temp^5 + g*Temp^6,
          start = list(a = 0, b = 1, c = 1, d = 1, e = 1, f = 1, g = 1))
lines(Temp[ord], predict(m6)[ord], lty = 2, col = "green", lwd = 3)
m15 <- lm(Disc ~ poly(Temp, degree = 15))
lines(Temp[ord], predict(m15)[ord], lty = 2, col = "yellow", lwd = 3)
legend("topleft", legend = c("Deg 4", "Deg 5", "Deg 6", "Deg 15"),
       col = c("red", "blue", "green", "yellow"), lty = 2)
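To see the overfitting numerically rather than visually, track the residual sum of squares as the degree grows; a quick sketch with the same data:
## RSS shrinks monotonically with the degree even as the curve gets wilder
sapply(c(4, 5, 6, 15), function(d) deviance(lm(Disc ~ poly(Temp, d))))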
