R Fitting polynomial to data through (a) fixed point(s) - r

I'm stuck at a very specific problem where I have to find a function describing the (normalized) leaf shape of a plant. The problem is not just to find the polynomial that best describes the data, but also that it starts at (0,0) ends at (1,0) and moves through the point of maximum width (x_ymax, 1) without ever going wider.
An alternate option I tried is Hermite interpolation, using those 3 specific points as control points but the function it provides is way off the actual shape of the leaf, unless I provide more control points.
Is there a specific function for this or do I need to make some manual conversion? Or would there be better or alternate options to tackling this problem?
Thanks in advance!

I'm not sure if this would always work, but here is an example of a "Generalized Additive Model" that uses a cyclic spline. When you specify that the model should not have an intercept (i.e. include -1 in formula, then it should pass through y=0. You will have to scale your predictor variable to be between 0 and 1 in order for the ends to pass through the points you mentioned (see here for more info.).
Example
# required model
library(mgcv)
# make data
n <- 200
tmp <- seq(0,20*pi,,n)
x <- tmp / (2*pi)
mon <- x%%1
err <- rnorm(n, sd=0.5)
y <- sin(tmp) + err + 1
plot(x, y, t="l")
df <- data.frame(x, y, mon)
# GAM with intercept
fit1 <- gam(y ~ s(mon, bs = "cc", k = 12), data=df)
summary(fit1)
plot(fit1)
# GAM without intercept
fit2 <- gam(y ~ s(mon, bs = "cc", k = 12) - 1, data=df) # note "-1" for no intercept
summary(fit2)
plot(fit2)

Related

Constrained Spline Function in r

hope all is well.
I have been exploring a few options for constraining a spline function so that it not only stays positive, but, so that it stays above the lowest value of y in the dataframe. I am assuming there is a penalized spline function out there where one can readily adapt the shape, though I have not found easily or tried yet. I have also tried nls with an exponential decay function which works, however, the last estimated point is much higher than desired (would like it to pass through, or be closer to the final value of y). see code below with the options i have tried. The ultimate goal however is to fit a spline that passes through all points and never decreases below the lowest value of y at any point while also acknowledging that yes there are only 5 data points. thanks in advance for the help.
library(tidyverse)
library(broom)
library(gnm)
library(cobs)
library(zoo)
DF <- data.frame(x = seq(1,5,1),y=c(26419753,9511111,3566667,57993,52194))
t=1:5
# option 1a and 1b: preferred method which is fitting a spline function
mod1a <- splinefun(DF$x,DF$y)
curve(mod1a, 1,5)
pred_interval_mod1a <- seq(1,5,length = 40)
interp(pred_interval_mod1a) # has that dip to negative near the end which should remain larger than y= 52,194
mod1b <- cobs(x= DF$x,y = DF$y,pointwise=rbind(c(0,52194,-1),c(0,26419753,1)))
pred_interval_mod1b <- seq(1,5,length = 40)
interp(pred_interval_mod1b)
# option 2: NLS for exponential decay with starting values
mod2 <- nls(y ~ SSasymp(t, yf, y0, log_alpha), data = DF)
qplot(t, y, data = augment(mod2)) + geom_line(aes(y = .fitted))
# option 3: similar NLS premise but with lower defined
mod3 <- nls(y ~ yf + (y0 - yf) * exp(-alpha * t), data = DF,
start = list(y0 = 26419753, yf = 52194, alpha = 1),
lower= c(-Inf,52194,-Inf),algorithm="port")
# option 4: similar to 2 and 3
a=log(52194)
mod4 <- gnm(y ~ Exp(1 + t) -1, verbose = FALSE, constrain="Exp(.+x).Intercept",
constrainTo=a, start=c(a,-0.05), data=DF)
mod4_df <- data.frame(t = seq(1,5,by=1))
mod4_pred <- predict(mod4,newdata=mod4_df)
mod4_pred

Fitting non-linear Langmuir Isotherm in R

I want to fit Isotherm models for the following data in R. The simplest isotherm model is Langmuir model given here model is given in the bottom of the page. My MWE is given below which throw the error. I wonder if there is any R package for Isotherm models.
X <- c(10, 30, 50, 70, 100, 125)
Y <- c(155, 250, 270, 330, 320, 323)
Data <- data.frame(X, Y)
LangIMfm2 <- nls(formula = Y ~ Q*b*X/(1+b*X), data = Data, start = list(Q = 1, b = 0.5), algorith = "port")
Error in nls(formula = Y ~ Q * b * X/(1 + b * X), data = Data, start = list(Q = 1, :
Convergence failure: singular convergence (7)
Edited
Some nonlinear models can be transform to linear models. My understanding is that there might be one-to-one relationship between the estimates of nonlinear model and its linear model form but their corresponding standard errors are not related to each other. Is this assertion true? Are there any pitfalls in fitting Nonlinear Models by transforming to linearity?
I am not aware of such packages and personally I don't think that you need one as the problem can be solved using a base R.
nls is sensitive to the starting parameters, so you should begin with a good starting guess. You can easily evaluate Q because it corresponds to the asymptotic limit of the isotherm at x-->Inf, so it is reasonable to begin with Q=323 (which is the last value of Y in your sample data set).
Next, you could do plot(Data) and add a line with an isotherm that corresponds to your starting parameters Q and b and tweak b to come up with a reasonable guess.
The plot below shows your data set (points) and a probe isotherm with Q = 323 and b = 0.5, generated by with(Data,lines(X,323*0.5*X/(1+0.5*X),col='red')) (red line). It seemed a reasonable starting guess to me, and I gave it a try with nls:
LangIMfm2 <- nls(formula = Y ~ Q*b*X/(1+b*X), data = Data, start = list(Q = 300, b = 1), algorith = "port")
# Nonlinear regression model
# model: Y ~ Q * b * X/(1 + b * X)
# data: Data
# Q b
# 366.2778 0.0721
# residual sum-of-squares: 920.6
#
# Algorithm "port", convergence message: relative convergence (4)
and plotted predicted line to make sure that nls found the right solution:
lines(Data$X,predict(LangIMfm2),col='green')
Having said that, I would suggest to use a more effective strategy, based on the linearization of the model by rewriting the isotherm equation in reciprocal coordinates:
z <- 1/Data
plot(Y~X,z)
abline(lm(Y~X,z))
M <- lm(Y~X,z)
Q <- 1/coef(M)[1]
# 363.2488
b <- coef(M)[1]/coef(M)[2]
# 0.0741759
As you could see, both approaches produce essentially the same result, but the linear model is more robust and doesn't require starting parameters (and, as far as I remember, it is the standard way of the isotherm analysis in the experimental physical chemistry).
You can use the SSmicmen self-starter function (see Ritz and Streibig, 2008, Nonlinear Regression with R) in the nlme package for R, which calculates initial parameters from the fit of the linearized form of the Michaelis-Menten (MM) equation. Fortunately, the MM equation possesses a form that can be adapted for the Langmuir equation, S = Smax*x/(KL + x). I've found the nlshelper and tidyverse packages useful for modeling and exporting the results of the nls command into tables and plots, particularly when modeling sample groups. Here's my code for modeling a single set of sorption data:
library(tidyverse)
library(nlme)
library(nlshelper)
lang.fit <- nls(Y ~ SSmicmen(X,Smax,InvKL), data=Data)
fit.summary <- tidy(lang.fit)
fit.coefs <- coef(lang.fit)
For simplicity, the Langmuir affinity constant is modeled here as 1/KL. Applying this code, I get the same parameter estimates as #Marat given above.
The simple code below allows for wrangling the data in order to create a ggplot object, containing the original points and fitted line (i.e., geom_point would represent the original X and Y data, geom_line would represent the original X plus YHat).
FitY <- tibble(predict(lang.fit))
YHat <- FitY[,1]
Data2 <- cbind(Data, YHat)
If you want to model multiple groups of data (say, based on a "Sample_name" column, then the lang.fit variable would be calculated as below, this time using the nlsList command:
lang.fit <- nlsList(Y ~ SSmicmen(X,Smax,InvKL) | Sample_name, data=Data)
The problem is the starting values. We show two approaches to this as well as an alternative that converges even using the starting values in the question.
1) plinear The right hand side is linear in Q*b so it would be better to absorb b into Q and then we have a parameter that enters linearly so it is easier to solve. Also with the plinear algorithm no starting values are needed for the linear parameter so only the starting value for b need be specified. With plinear the right hand side of the nls formula should be specified as the vector that multiplies the linear parameter. The result of running nls giving fm0 below will be coefficients named b and .lin where Q = .lin / b.
We already have our answer from fm0 but if we want a clean run in terms of b and Q rather than b and .lin we can run the original formula in the question using the starting values implied by the coefficients returned by fm0 as shown.
fm0 <- nls(Y ~ X/(1+b*X), Data, start = list(b = 0.5), alg = "plinear")
st <- with(as.list(coef(fm0)), list(b = b, Q = .lin/b))
fm <- nls(Y ~ Q*b*X/(1+b*X), Data, start = st)
fm
giving
Nonlinear regression model
model: Y ~ Q * b * X/(1 + b * X)
data: Data
b Q
0.0721 366.2778
residual sum-of-squares: 920.6
Number of iterations to convergence: 0
Achieved convergence tolerance: 9.611e-07
We can display the result. The points are the data and the red line is the fitted curve.
plot(Data)
lines(fitted(fm) ~ X, Data, col = "red")
(contineud after plot)
2) mean Alternately, using a starting value of mean(Data$Y) for Q seems to work well.
nls(Y ~ Q*b*X/(1+b*X), Data, start = list(b = 0.5, Q = mean(Data$Y)))
giving:
Nonlinear regression model
model: Y ~ Q * b * X/(1 + b * X)
data: Data
b Q
0.0721 366.2779
residual sum-of-squares: 920.6
Number of iterations to convergence: 6
Achieved convergence tolerance: 5.818e-06
The question already had a reasonable starting value for b which we used but if one were needed one could set Y to Q*b so that they cancel and X to mean(Data$X) and solve for b to give b = 1 - 1/mean(Data$X) as a possible starting value. Although not shown using this starting value for b with mean(Data$Y) as the starting value for Q also resulted in convergence.
3) optim If we use optim the algorithm converges even with the initial values used in the question. We form the residual sum of squares and minimize that:
rss <- function(p) {
Q <- p[1]
b <- p[2]
with(Data, sum((Y - b*Q*X/(1+b*X))^2))
}
optim(c(1, 0.5), rss)
giving:
$par
[1] 366.27028219 0.07213613
$value
[1] 920.62
$counts
function gradient
249 NA
$convergence
[1] 0
$message
NULL

Drawing a 3D decision boundary of logistic regression

I have fitted a logistic regression model that takes 3 variables into account. I would like to make a 3D plot of the datapoints and draw the decision boundary (which I suppose would be a plane here).
I found an online example that applies to the case (so that you can load the data directly)
mydata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
mylogit <- glm(admit ~ gre + gpa + rank, data = mydata, family = "binomial")
I was thinking of using the 3Dscatterplot package, but I am not sure what equation I should write to draw the boundary. Any ideas?
Many thanks,
The decision boundary will be a 3-d plane, which you could plot with any 3-d plotting package in R. I'll use persp by defining an x-y grid and then calculating the corresponding z value with the outer function:
# Use iris dataset for example logistic regression
data(iris)
iris$long <- as.numeric(iris$Sepal.Length > 6)
mod <- glm(long~Sepal.Width+Petal.Length+Petal.Width, data=iris, family="binomial")
# Plot 50% decision boundary; another cutoff can be achieved by changing the intercept term
x <- seq(2, 5, by=.1)
y <- seq(1, 7, by=.1)
z <- outer(x, y, function(x, y) (-coef(mod)[1] - coef(mod)[2]*x - coef(mod)[3]*y) /
coef(mod)[4])
persp(x, y, z, col="lightblue")

Modifying a curve to prevent singular gradient matrix at initial parameter estimates

I want to use y=a^(b^x) to fit the data below,
y <- c(1.0385, 1.0195, 1.0176, 1.0100, 1.0090, 1.0079, 1.0068, 1.0099, 1.0038)
x <- c(3,4,5,6,7,8,9,10,11)
data <- data.frame(x,y)
When I use the non-linear least squares procedure,
f <- function(x,a,b) {a^(b^x)}
(m <- nls(y ~ f(x,a,b), data = data, start = c(a=1, b=0.5)))
it produces an error: singular gradient matrix at initial parameter estimates. The result is roughly a = 1.1466, b = 0.6415, so there shouldn't be a problem with intial parameter estimates as I have defined them as a=1, b=0.5.
I have read in other topics that it is convenient to modify the curve. I was thinking about something like log y=log a *(b^x), but I don't know how to deal with function specification. Any idea?
I will expand my comment into an answer.
If I use the following:
y <- c(1.0385, 1.0195, 1.0176, 1.0100, 1.0090, 1.0079, 1.0068, 1.0099, 1.0038)
x <- c(3,4,5,6,7,8,9,10,11)
data <- data.frame(x,y)
f <- function(x,a,b) {a^b^x}
(m <- nls(y ~ f(x,a,b), data = data, start = c(a=0.9, b=0.6)))
or
(m <- nls(y ~ f(x,a,b), data = data, start = c(a=1.2, b=0.4)))
I obtain:
Nonlinear regression model
model: y ~ f(x, a, b)
data: data
a b
1.0934 0.7242
residual sum-of-squares: 0.0001006
Number of iterations to convergence: 10
Achieved convergence tolerance: 3.301e-06
I always obtain an error if I use 1 as a starting value for a, perhaps because 1 raised to anything is 1.
As for automatically generating starting values, I am not familiar with a procedure to do that. One method I have read about is to simulate curves and use starting values that generate a curve that appears to approximate your data.
Here is the plot generated using the above parameter estimates using the following code. I admit that maybe the lower right portion of the line could fit a little better:
setwd('c:/users/mmiller21/simple R programs/')
jpeg(filename = "nlr.plot.jpeg")
plot(x,y)
curve(1.0934^(0.7242^x), from=0, to=11, add=TRUE)
dev.off()

Issues with predict function in R

I'm having issues with using the predict() function in R and I hope that I can get some help. Consider a dataset with two columns - 1) Y, 2) X
My goal is to fit a natural spline fit and get a 95% CI and to mark points outside of the 95% CI as outlier. Here is what I do:
1) Initially no point in the dataset is marked as outlier.
2) I fit my ns fit and using its 95% CI, I mark the points outside of the CI as outlier
3) I, then, exclude the initially marked outliers, and fit another ns and using it's 95% CI, I mark outliers.
* Issue: *
Suppose my initial dataset has 1000 obs. I mark some outliers in the first round and I get 23 outliers. Then I fit another ns (call it fit.ns) using the remaining 977 non-outliers. I then use ALL X's (all 1000) to get predicted values based on this new fit but I get warning AND error that newdata in my predict function has 1000 obs but fit has 977. The predicted values returned has also 977 values and NOT 1000.
* My predict() code *
# Fitting a Natural Spline Fit (df = 3 by default)
fit.ns <- lm(data.ns$IBI ~ ns(data.ns$Time, knots = data.ns$Time[knots]))
# Getting Fitted Values and 95% CI:
fit.ns.values <- predict(fit.ns, newdata = data.frame(Time = data.temp$Time),
interval="prediction", level = 1 - 0.05) # ??? PROBLEM
I really appreciate your help.
Seems that I cannot upload the dataset, but my code is:
library(splines)
ns.knot <- 10
for (i in 1:2){
# I exclude outliers so that my ns.fit does not get affected my outliers
data.ns <- data.temp[data.temp$OutlierInd == 0,]
data.ns$BeatNum <- 1:nrow(data.ns) # BeatNum is like a row number for me and is an auxilary variable
# Place Holder for Natural Spline results:
data.temp$IBI.NSfit <- rep(NA, nrow(data.temp))
data.temp$IBI.NSfit.L95 <- rep(NA, nrow(data.temp))
data.temp$IBI.NSfit.U95 <- rep(NA, nrow(data.temp))
# defining the knots in n.s.:
knots <- (data.ns$BeatNum)[seq(ns.knot, (length(data.ns$BeatNum) - ns.knot), by = ns.knot)]
# Fitting a Natural Spline Fit (df = 3 by default)
fit.ns <- lm(data.ns$IBI ~ ns(data.ns$Time, knots = data.ns$Time[knots]))
# Getting Fitted Values and 95% CI:
fit.ns.values <- predict(fit.ns, newdata = data.frame(Time = data.temp$Time), interval="prediction", level = 1 - 0.05) # ??? PROBLEM
data.temp$IBI.NSfit <- fit.ns.values[,1]
data.temp$IBI.NSfit.L95 <- fit.ns.values[,2]
data.temp$IBI.NSfit.U95 <- fit.ns.values[,3]
# Updating OutlierInd based on Natural Spline 95% CI:
data.temp$OutlierInd <- ifelse(data.temp$IBI < data.temp$IBI.NSfit.U95 & data.temp$IBI > data.temp$IBI.NSfit.L95, 0, 1)
}
Finally, I found the solution:
When I fit the model, I should use the "data =" option. In other words, instead of the command below,
# Fitting a Natural Spline Fit (df = 3 by default)
fit.ns <- lm(data.ns$IBI ~ ns(data.ns$Time, knots = data.ns$Time[knots]))
I should use the command below instead:
# Fitting a Natural Spline Fit (df = 3 by default)
fit.ns <- lm(IBI ~ ns(Time, knots = Time[knots]), data = data.ns)
Then the predict function will work.
I wanted to add a comment but my rep level doesnt allow that.
Anyways, I think this is a well documented point that predict uses the exact variables names used in the fit function. So, naming your variables is the best way to get around this error in my experience.
So, in the case above, please redefine a data frame just for your fit purposes like this
library(splines)
#Fit part
fit.data <- data.frame(y=rnorm(30),x=rnorm(30))
fit.ns <- lm(y ~ ns(x,3),data=fit.data)
#Predict
pred.data <- data.frame(y=rnorm(10),x=rnorm(10))
pred.fit <- predict(fit.ns,interval="confidence",limit=0.95,data.frame(x=pred.data$x))
IMHO, this should get rid of your error

Resources