Predicting data from a power curve manually - r

I have a series of data to which I have fit a power curve, and I use the predict function in R to predict y values from additional x values.
set.seed(1485)
len <- 24
x <- runif(len)
y <- x^3 + rnorm(len, 0, 0.06)
mydata <- data.frame(x = x, y = y)
z <- nls(y ~ a * x^b, data = mydata, start = list(a = 1, b = 1))
# exponent estimate and its standard error, from row "b" of the coefficient table
power <- round(summary(z)$coefficients["b", "Estimate"], 3)
power.se <- round(summary(z)$coefficients["b", "Std. Error"], 3)
plot(y ~ x, main = "Fitted power model", sub = "Blue: fit; green: known")
s <- seq(0, 1, length = 100)
lines(s, s^3, lty = 2, col = "green")
lines(s, predict(z, list(x = s)), lty = 1, col = "blue")
text(0, 0.5, paste0("y = x^(", power, " +/- ", power.se, ")"), pos = 4)
Instead of using the predict function here, how could I manually calculate estimated y values for additional x values from this power function? If this were a simple linear regression, I would take the slope and y-intercept and calculate my y values with
y = mx + b
Is there a similar equation I can build from the output of z that will let me estimate y values for additional x values?
> z
Nonlinear regression model
model: y ~ a * x^b
data: mydata
a b
1.026 3.201
residual sum-of-squares: 0.07525
Number of iterations to convergence: 5
Achieved convergence tolerance: 5.162e-06

You would do it the same way, except you use the power equation you modeled. You can access the estimated parameters with z$m$getPars().
Here is a simple example to illustrate:
predict(z, list(x = 1))
Results in: 1.026125
Which equals the result of
z$m$getPars()["a"] * 1 ^ z$m$getPars()["b"]
Which is equivalent to y = a * x^b with x = 1.
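The same arithmetic works for a whole vector of new x values; a small sketch (the new x values here are made up for illustration):
pars <- z$m$getPars()              # named vector with the fitted a and b
new_x <- c(0.25, 0.5, 0.75)        # hypothetical new x values
pars[["a"]] * new_x ^ pars[["b"]]  # manual y = a * x^b, same as predict(z, list(x = new_x))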

Here are some ways.
1) with: This evaluates the formula with respect to the coefficients:
x <- 1:2 # input
with(as.list(coef(z)), a * x^b)
## [1] 1.026125 9.437504
2) attach: We could also use attach, although it is generally frowned upon:
attach(as.list(coef(z)))
a * x^b
## [1] 1.026125 9.437504
3) explicit: Explicit definition of the coefficients:
a <- coef(z)[["a"]]; b <- coef(z)[["b"]]
a * x^b
## [1] 1.026125 9.437504
4) eval: This one extracts the formula from z so that we don't have to specify it again. formula(z)[[3]] is the right-hand side of the formula used to produce z. Use of eval is sometimes frowned upon, but this does avoid the redundant specification of the formula.
eval(formula(z)[[3]], as.list(coef(z)))
## [1] 1.026125 9.437504
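If you need this repeatedly, any of the approaches above can be wrapped in a small helper; a minimal sketch using approach (1) (the function name is made up, and it is tied to this particular a * x^b formula):
predict_power <- function(fit, newx) with(as.list(coef(fit)), a * newx^b)
predict_power(z, 1:2)
## [1] 1.026125 9.437504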

Related

Errors in nls() - singular gradient or NaNs produced

I am trying to fit my photosynthesis data with nls to a non-rectangular hyperbola function. So far I have had issues finding the right start values for nls, and therefore I am getting errors such as 'singular gradient', 'NaNs produced', or 'step factor 0.000488281 reduced below 'minFactor' of 0.000976562'. Would you please give some suggestions for finding the best starting values? Thanks in advance!
The code and data are below:
#Dataframe
PPFD <- c(0,0,0,50,50,50,100,100,100,200,200,200,400,400,400,700,700,700,1000,1000,1000,1500,1500,1500)
Cultivar <- c(-0.7,-0.8,-0.6,0.6,0.5,0.8,2.0,2.0,2.3,3.6,3.7,3.7,5.7,5.5,5.8,9.7,9.6,10.0,14.7,14.4,14.9,20.4,20.6,20.9)
NLRC <-data.frame(PPFD,Cultivar)
#nls regression
reg_nrh <- nls(Cultivar ~ (1/(2*Theta))*(AQY*PPFD+Am-sqrt((AQY*PPFD+Am)^2-4*AQY*Theta*Am*PPFD))-Rd, data = NLRC, start=list(Am = max(NLRC$Cultivar)-min(NLRC$Cultivar), AQY = 0.05, Rd=-min(NLRC$Cultivar), Theta = 1))
#estimated parameters for plotting
Amnrh <- coef(reg_nrh)[1]
AQYnrh <- coef(reg_nrh)[2]
Rdnrh <- coef(reg_nrh)[3]
Theta <- coef(reg_nrh)[4]
#plot
plot(NLRC$PPFD, NLRC$Cultivar, main = "Cultivar", xlab = "", ylab = "",
     ylim = c(-2, 40), cex.lab = 1.2, cex.axis = 1.5, cex = 2)
mtext(expression("PPFD ("*mu*"mol photons "*m^-2*s^-1*")"), side = 1, line = 3.3, cex = 1.5)
mtext(expression(P[net]*" ("*mu*"mol "*CO[2]*" "*m^-2*s^-1*")"), side = 2, line = 2.5, cex = 1.5)
#simulated value
ppfd = seq(from = 0, to = 1500)
pnnrh <- (1/(2*Theta))*(AQYnrh*ppfd+Amnrh-sqrt((AQYnrh*ppfd+Amnrh)^2-4*AQYnrh*Theta*Amnrh*ppfd))- Rdnrh
lines(ppfd, pnnrh, col="Green")
If we (1) take the maximum of 0 and the expression within the sqrt to avoid taking square roots of negative numbers, (2) fix Theta at 0.8, and (3) use lm to get starting values for AQY and Am, then it converges:
Theta <- 0.8
fm <- lm(Cultivar ~ PPFD, NLRC)
st <- list(AQY = coef(fm)[[2]], Rd = -min(NLRC$Cultivar), Am = coef(fm)[[1]])
fo <- Cultivar ~
(1/(2*Theta))*(AQY*PPFD+Am-sqrt(pmax(0, (AQY*PPFD+Am)^2-4*AQY*Theta*Am*PPFD)))-Rd
reg <- nls(fo, data = NLRC, start = st)
deviance(reg) # residual sum of squares
## [1] 5.607943
plot(Cultivar ~ PPFD, NLRC)
lines(fitted(reg) ~ PPFD, NLRC, col = "red")
Note that the first model below has only two parameters yet has lower residual sum of squares (lower is better).
reg2 <- nls(Cultivar ~ a * PPFD^b, NLRC, start = list(a = 1, b = 1))
deviance(reg2)
## [1] 5.098796
These have higher residual sum of squares but do have the advantage that they are very simple.
deviance(fm) # fm defined above
## [1] 6.938648
fm0 <- lm(Cultivar ~ PPFD + 0, NLRC) # same as fm except no intercept
deviance(fm0)
## [1] 7.381632
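Not part of the answer as such, just a convenience: the residual sums of squares of the four fits above can be collected in one call for comparison:
sapply(list(nrh = reg, power = reg2, linear = fm, no_intercept = fm0), deviance)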

loop function with nls

I am struggling with looping nls fits. Here is an example data set for a single sample:
dat<-read.table(text="time y
1 4.62
2 13.55
3 30.82
6 93.97
12 145.93
24 179.93", header = TRUE)
plot(dat); lines(dat)
model <- nls(y ~ Max * (1-exp(-k * (time - Lag))),data=dat,start=list(Max = 200, k = 0.1, Lag = 0.5))
But what if I want to apply the model to multiple columns of samples?
For example:
dat <- read.table(text = "time gluc starch solka
1 6.32 7.51 1.95
2 20.11 25.49 6.43
3 36.03 47.53 10.39
6 107.52 166.31 27.01
12 259.28 305.19 113.72
24 283.40 342.56 251.14
48 297.55 353.66 314.22", header = TRUE)
How can I get R to solve for Max, k, and Lag for each sample (gluc, starch, solka)?
In all the alternatives below we use these values:
long <- tidyr::pivot_longer(dat, -1, values_to = "y")
long$name <- factor(long$name)
st0 <- list(Max = 200, k = 0.1, Lag = 0.5)
1) nls grouped data: Convert dat to long form and then use the grouped-data feature of nls. Of the alternatives presented here this one is the most suitable for testing whether certain parameters are common among the three names, since it is easy to simply remove the subscript on a parameter that should be common across names (a sketch of this follows at the end of this alternative). The fitting itself does not use any packages, but we show ggplot2 and lattice graphics for plotting.
# get better starting values
model0 <- nls(y ~ Max * (1-exp(-k * (time - Lag))), long, start = st0)
st <- with(as.list(coef(model0)),
list(Max = rep(Max, 3), k = rep(k, 3), Lag = rep(Lag, 3)))
model <- nls(y ~ Max[name] * (1-exp(-k[name] * (time - Lag[name]))),
long, start = st)
model
giving:
Nonlinear regression model
model: y ~ Max[name] * (1 - exp(-k[name] * (time - Lag[name])))
data: long
Max1 Max2 Max3 k1 k2 k3 Lag1 Lag2
306.48737 389.84657 361.82290 0.12214 0.03857 0.13747 1.38072 2.02205
Lag3
1.31770
residual sum-of-squares: 7167
Number of iterations to convergence: 8
Achieved convergence tolerance: 9.186e-06
ggplot2 graphics could be done like this.
library(ggplot2)
fitdf <- transform(long, fit = fitted(model))
ggplot(fitdf, aes(x = time, y = y, color = name)) +
geom_point() +
geom_line(aes(y = fit))
A slightly different looking plot can be generated using lattice graphics, which comes with R, so the package does not have to be installed. The code is particularly compact.
library(lattice)
xyplot(fit + y ~ time | name, fitdf, type = c("l", "p"), auto.key = TRUE)
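As noted above, testing whether a parameter is common across the names just means dropping its subscript. A hedged sketch for a common Lag, reusing the single-curve estimates as starting values (assuming the reduced fit converges):
st_common <- with(as.list(coef(model0)),
  list(Max = rep(Max, 3), k = rep(k, 3), Lag = Lag))
model_common_lag <- nls(y ~ Max[name] * (1 - exp(-k[name] * (time - Lag))),
  long, start = st_common)
anova(model_common_lag, model)   # compare the reduced (common Lag) and full models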
2) nlsList: If you don't need to investigate whether parameters are common among the names, another possibility is nlsList in the nlme package (which comes with R, so you don't have to install it). long and st0 are from above.
library(nlme)
fit <- nlsList(y ~ Max * (1-exp(-k * (time - Lag))) | name, long, start = st0)
giving an nlsList object whose 3 components are the three nls objects obtained by running nls for each name.
> fit
Call:
Model: y ~ Max * (1 - exp(-k * (time - Lag))) | name
Data: long
Coefficients:
Max k Lag
gluc 306.4875 0.12214330 1.380713
solka 389.8449 0.03856544 2.022057
starch 361.8231 0.13747402 1.317698
Degrees of freedom: 21 total; 12 residual
Residual standard error: 24.43858
We can plot the data and fit:
levs <- levels(long$name)
col <- setNames(rainbow(length(levs)), levs)
plot(y ~ time, long, col = col[name], pch = 20, cex = 1.5)
for(lv in levs) lines(fitted(fit[[lv]]) ~ time, dat, col = col[lv])
legend("bottomright", leg = levs, col = col, pch = 20, cex = 1.5)
3) subset: An approach similar to (2) is to perform three nls runs, using subset= to select the data. This returns a named list of nls objects. st0 and long are from above. No packages are used.
fit <- Map(function(nm) nls(y ~ Max * (1-exp(-k * (time - Lag))), data = long,
start = st0, subset = name == nm), levels(long$name))
The graphics code in (2) also works here.
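To see the estimated parameters side by side, one option with the fit list just created:
sapply(fit, coef)   # one column of Max, k, Lag per name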
Build the formulas you want to use as strings:
outcomes = c("gluc", "starch", "solka")
my_formulas = paste(outcomes, "~ Max * (1-exp(-k * (time - Lag)))")
model_list = list()
for(i in seq_along(outcomes)) {
  model_list[[outcomes[i]]] = nls(
    as.formula(my_formulas[i]),
    data = dat,
    start = list(Max = 200, k = 0.1, Lag = 0.5)
  )
}
This will create a list of models, which you can then access with, e.g., summary(model_list[[1]]) or summary(model_list[["solka"]]).

Adding self starting values to an nls regression in R

I have existing code for fitting a sigmoid curve to data in R. How can I use selfStart (or another method) to automatically find starting values for the regression?
sigmoid = function(params, x) {
params[1] / (1 + exp(-params[2] * (x - params[3])))
}
dataset = data.frame("x" = 1:53, "y" =c(0,0,0,0,0,0,0,0,0,0,0,0,0,0.1,0.18,0.18,0.18,0.33,0.33,0.33,0.33,0.41,0.41,0.41,0.41,0.41,0.41,0.5,0.5,0.5,0.5,0.68,0.58,0.58,0.68,0.83,0.83,0.83,0.74,0.74,0.74,0.83,0.83,0.9,0.9,0.9,1,1,1,1,1,1,1) )
x = dataset$x
y = dataset$y
# fitting code
fitmodel <- nls(y~a/(1 + exp(-b * (x-c))), start=list(a=1,b=.5,c=25))
# visualization code
# get the coefficients using the coef function
params=coef(fitmodel)
y2 <- sigmoid(params,x)
plot(y2,type="l")
points(y)
This is a common (and interesting) problem in non-linear curve fitting.
Background
We can find sensible starting values if we take a closer look at the sigmoid function f(x) = a / (1 + exp(-b * (x - c))).
We first note that exp(-b * (x - c)) tends to 0 as x becomes large.
So for large values of x, the function approaches a. In other words, as a starting value for a we may choose the value of y at the largest value of x.
In R language, this translates to y[which.max(x)].
Now that we have a starting value for a, we need to decide on starting values for b and c. To do that, we can make use of the geometric series 1 / (1 + z) = 1 - z + z^2 - ... (with z = exp(-b * (x - c))) and expand f(x) = y by keeping only the first two terms:
y ≈ a * (1 - exp(-b * (x - c)))
We now set a = 1 (our starting value for a), re-arrange the equation and take the logarithm on both sides:
log(1 - y) ≈ -b * x + b * c
We can now fit a linear model of the form log(1 - y) ~ x to obtain estimates for the slope (-b) and intercept (b * c), which in turn provide the starting values for b and c.
R implementation
Let's define a function that takes as an argument the values x and y and returns a list of parameter starting values
start_val_sigmoid <- function(x, y) {
fit <- lm(log(y[which.max(x)] - y + 1e-6) ~ x)
list(
a = y[which.max(x)],
b = unname(-coef(fit)[2]),
c = unname(-coef(fit)[1] / coef(fit)[2]))
}
Based on the data for x and y you give, we obtain the following starting values
start_val_sigmoid(x, y)
#$a
#[1] 1
#
#$b
#[1] 0.2027444
#
#$c
#[1] 15.01613
Since start_val_sigmoid returns a list we can use its output directly as the start argument in nls
nls(y ~ a / ( 1 + exp(-b * (x - c))), start = start_val_sigmoid(x, y))
#Nonlinear regression model
# model: y ~ a/(1 + exp(-b * (x - c)))
# data: parent.frame()
# a b c
# 1.0395 0.1254 29.1725
# residual sum-of-squares: 0.2119
#
#Number of iterations to convergence: 9
#Achieved convergence tolerance: 9.373e-06
Sample data
dataset = data.frame("x" = 1:53, "y" =c(0,0,0,0,0,0,0,0,0,0,0,0,0,0.1,0.18,0.18,0.18,0.33,0.33,0.33,0.33,0.41,0.41,0.41,0.41,0.41,0.41,0.5,0.5,0.5,0.5,0.68,0.58,0.58,0.68,0.83,0.83,0.83,0.74,0.74,0.74,0.83,0.83,0.9,0.9,0.9,1,1,1,1,1,1,1) )
x = dataset$x
y = dataset$y
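Not used in the answer above, but since the question mentions self-starting models: the stats package ships SSlogis, a self-starting logistic whose parametrisation Asym / (1 + exp((xmid - x)/scal)) matches this sigmoid with a = Asym, c = xmid and b = 1/scal, so no explicit start values are needed (assuming it converges on this data):
fit_ss <- nls(y ~ SSlogis(x, Asym, xmid, scal), data = dataset)
coef(fit_ss)   # Asym corresponds to a, xmid to c, 1/scal to b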

Glmnet is different with intercept=TRUE compared to intercept=FALSE and with penalty.factor=0 for an intercept in x

I am new to glmnet and playing with the penalty.factor option. The vignette says that it "Can be 0 for some variables, which implies no shrinkage, and that variable is always included in the model." And the longer PDF document has code. So I expected that running a regression with intercept = TRUE and no constant in x would be the same as with intercept = FALSE and a constant in x with penalty.factor = 0. But the code below shows that it is not: the latter case has an intercept of 0 and the other two coefficients are 20% larger than in the former.
library("glmnet")
set.seed(7)
# penalty for the intercept
intercept_penalty <- 0
# Simulate data with 2 features
num_regressors <- 2
num_observations <- 100
X <- matrix(rnorm(num_regressors * num_observations),
ncol = num_regressors,
nrow = num_observations)
# Add an intercept in the right-hand side matrix: X1 = (intercept + X)
X1 <- cbind(matrix(1, ncol = 1, nrow = num_observations), X)
# Set random parameters for the features
beta <- runif(1 + num_regressors)
# Generate observations for the left-hand side
Y <- X1 %*% beta + rnorm(num_observations) / 10
# run OLS
ols <- lm(Y ~ X)
coef_ols <- coef(ols)
# Run glmnet with an intercept in the command, not in the matrix
fit <- glmnet(y = Y,
x = X,
intercept = T,
penalty.factor = rep(1, num_regressors),
lambda = 0)
coef_intercept_equal_true <- coef(fit)
# run glmnet with an intercept in the matrix with a penalty
# factor of intercept_penalty for the intercept and 1 for the rest
fit_intercept_equal_false <- glmnet(y = Y,
x = X1,
intercept = F,
penalty.factor = c(intercept_penalty, rep(1, num_regressors)),
lambda = 0)
coef_intercept_equal_false <- coef(fit_intercept_equal_false)
# Compare all three methods in a data frame
# For lasso_intercept_equal_false, the index starts at 2 because
# position 1 is reserved for intercepts, which is missing in this case
comparison <- data.frame(original = beta,
ols = coef_ols,
lasso_intercept_equal_true = coef_intercept_equal_true[1:length(coef_intercept_equal_true)],
lasso_intercept_equal_false = coef_intercept_equal_false[2:length(coef_intercept_equal_false)]
)
comparison$difference <- comparison$lasso_intercept_equal_false - comparison$lasso_intercept_equal_true
comparison
Furthermore, the discrepancy for this example is the same with different penalty factors for the intercept term, whether intercept_penalty equals 0, 1, 3000, -10, etc. The discrepancy is similar with a positive penalty, e.g. lambda = 0.01.
If this is not a bug, what is the proper usage of penalty.factor?
I contacted the author, who confirmed that this is a bug and added that it is on his list of bug fixes. In the meantime, a workaround is to center the regressors, e.g. with
fit_centered <- glmnet(y = Y,
x = scale(X1, T, F),
intercept = F,
lambda = 0)
And in this case, the penalty factor does not matter. Here is a revised script that compares OLS, LASSO with intercept, LASSO without intercept, and LASSO with centered regressors:
library("glmnet")
set.seed(7)
# Simulate data with 2 features
num_regressors <- 2
num_observations <- 100
X <- matrix(rnorm(num_regressors * num_observations),
ncol = num_regressors,
nrow = num_observations)
# Add an intercept in the right-hand side matrix: X1 = (intercept + X)
X1 <- cbind(matrix(1, ncol = 1, nrow = num_observations), X)
# Set random parameters for the features
beta <- runif(1 + num_regressors)
# Generate observations for the left-hand side
Y <- X1 %*% beta + rnorm(num_observations) / 10
# run OLS
ols <- lm(Y ~ X)
coef_ols <- coef(ols)
# Run glmnet with an intercept in the command, not in the matrix
fit <- glmnet(y = Y,
x = X,
intercept = T,
penalty.factor = rep(1, num_regressors),
lambda = 0)
coef_intercept <- coef(fit)
# run glmnet with the intercept column included in X1 and intercept = FALSE
# (no penalty.factor argument this time)
fit_no_intercept <- glmnet(y = Y,
x = X1,
intercept = F,
lambda = 0)
coef_no_intercept <- coef(fit_no_intercept)
# run glmnet with the intercept column in the matrix, intercept = FALSE,
# and centered regressors
# If x is centered, it works (even though y is not centered). Center it with:
# X1 - matrix(colMeans(X1), nrow = num_observations, ncol = 1 + num_regressors, byrow = T)
# or with
# X1_centered = scale(X1, T, F)
fit_centered <- glmnet(y = Y,
x = scale(X1, T, F),
intercept = F,
lambda = 0)
coef_centered <- coef(fit_centered)
# Compare all four methods in a data frame
# For lasso_no_intercept and lasso_centered the index starts at 2 because
# position 1 is glmnet's intercept slot, which is zero in those fits
comparison <- data.frame(ols = coef_ols,
lasso_intercept = coef_intercept[1:length(coef_intercept)],
lasso_no_intercept = coef_no_intercept[2:length(coef_no_intercept)],
lasso_centered = coef_centered[2:length(coef_centered)]
)
comparison$diff_intercept <- comparison$lasso_intercept - comparison$lasso_no_intercept
comparison$diff_centered <- comparison$lasso_centered - comparison$lasso_intercept
comparison
The answer:
ols lasso_intercept lasso_no_intercept lasso_centered diff_intercept diff_centered
(Intercept) 0.9748302 0.9748302 0.0000000 0.0000000 0.9748302 -9.748302e-01
X1 0.6559541 0.6559541 0.7974851 0.6559541 -0.1415309 2.220446e-16
X2 0.7986957 0.7986957 0.9344306 0.7986957 -0.1357348 4.440892e-16
For LASSO with centered regressors, the estimated intercept is 0 but the other coefficients are the same as the LASSO with intercept.
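If the intercept on the original scale is needed after centering, it can be recovered from the column means; a small sketch for this lambda = 0 case (this is the usual least-squares identity, not a glmnet feature):
b_cen <- as.matrix(coef(fit_centered))[-1, 1]   # slopes from the centered fit; the constant column gets 0
mean(Y) - sum(colMeans(X1) * b_cen)             # recovered intercept, close to coef_ols[1]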

maximum likelihood estimation

I am a new user of R and hope you will bear with me if my question is silly. I want to estimate the following model using maximum likelihood in R.
y = a + b*(ln(x) - α)
where a, b, and α are parameters to be estimated and x and y are my data. I tried to use the following code that I got from the web:
library(foreign)
maindata <- read.csv("C:/Users/NUNU/Desktop/maindata/output2.csv")
h <- subset(maindata, cropid==10)
library(likelihood)
modelfun <- function (a, b, x) { b *(x-a)}
par <- list(a = 0, b = 0)
var<-list(x = "x")
par_lo <- list(a = 0, b = 0)
par_hi <- list(a = 50, b = 50)
var$y <- "y"
var$mean <- "predicted"
var$sd <- 0.815585
var$log <- TRUE
results <- anneal(model = modelfun, par = par, var = var,
source_data = h, par_lo = par_lo, par_hi = par_hi,
pdf = dnorm, dep_var = "y", max_iter = 20000)
The results I get are nearly the same even though the data are different, i.e., even when I change the cropid. Also, the predicted values generated are for x rather than y.
I do not know what I missed or where I went wrong. Your help is highly appreciated.
I am not sure your model formula leads to a unique solution (a and α are not separately identifiable, since a + b*(ln(x) - α) = (a - b*α) + b*ln(x)), but in general you can find the MLE with the optim function.
Here is a simple example for linear regression with optim:
fn <- function(beta, x, y) {
  a <- beta[1]
  b <- beta[2]
  sum((y - (a + b * log(x)))^2)   # residual sum of squares
}
# generate some data for testing
x <- 1:100
# true values: a = 10, b = 3.5
y <- 10 + 3.5 * log(x)
optim(c(0, 0), fn, x = x, y = y, method = "BFGS")  # two parameters, so two starting values
you can change the function "fn" to reflect your model formula e.g.
sum( (y - (YOUR MODEL FORMULA) )^2 )
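The sum-of-squares objective above gives the maximum likelihood point estimates for a and b under normal errors; if you want an actual likelihood (and an estimate of the error standard deviation), you can minimise the negative log-likelihood instead. A hedged sketch with its own simulated noisy data (the noise level and object names are made up for illustration):
negll <- function(par, x, y) {
  a <- par[1]; b <- par[2]
  sigma <- exp(par[3])                           # optimise log(sigma) so sigma stays positive
  -sum(dnorm(y, mean = a + b * log(x), sd = sigma, log = TRUE))
}
set.seed(1)
xx <- 1:100
yy <- 10 + 3.5 * log(xx) + rnorm(100, sd = 0.5)  # noisy version of the test data
fit <- optim(c(0, 0, 0), negll, x = xx, y = yy, method = "BFGS")
fit$par                                          # estimates of a, b and log(sigma)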
EDIT
I am just giving a simple example of using optim in case you have a custom model formula to optimize. I did not mean to use it for simple linear regression, since lm is sufficient.
I was a bit surprised that iTech used optim for a problem that is linear in its parameters. With that data for x and y:
> lm(y ~ log(x) )
Call:
lm(formula = y ~ log(x))
Coefficients:
(Intercept) log(x)
10.0 3.5
For linear models with normally distributed errors, the least squares solution is the maximum likelihood solution.
