I am struggling with looping nls functions. So here is an example data set for a single sample
dat<-read.table(text="time y
1 4.62
2 13.55
3 30.82
6 93.97
12 145.93
24 179.93", header = TRUE)
plot(dat); lines(dat)
model <- nls(y ~ Max * (1-exp(-k * (time - Lag))),data=dat,start=list(Max = 200, k = 0.1, Lag = 0.5))
but what if I want to apply model to multiple columns of samples?
for example
dat<-read.table(text="time gluc starch solka
+ 1 6.32 7.51 1.95
+ 2 20.11 25.49 6.43
+ 3 36.03 47.53 10.39
+ 6 107.52 166.31 27.01
+ 12 259.28 305.19 113.72
+ 24 283.40 342.56 251.14
+ 48 297.55 353.66 314.22", header = TRUE)
How can I get R to solve for Max, k, and Lag for each sample (gluc, starch, solka)?
In all the alternatives below we use these values:
long <- tidyr::pivot_longer(dat, -1, values_to = "y")
long$name <- factor(long$name)
st0 <- list(Max = 200, k = 0.1, Lag = 0.5)
1) nls grouped data. Convert dat to long form and then use the grouped-data feature of nls. This solution is the most suitable of those presented here for testing whether certain parameters are common among the three names, since it is easy to simply remove the subscript on any parameter that is to be common among the names. The fitting itself does not use any packages, but we show ggplot2 and lattice package graphics for plotting.
# get better starting values
model0 <- nls(y ~ Max * (1-exp(-k * (time - Lag))), long, start = st0)
st <- with(as.list(coef(model0)),
list(Max = rep(Max, 3), k = rep(k, 3), Lag = rep(Lag, 3)))
model <- nls(y ~ Max[name] * (1-exp(-k[name] * (time - Lag[name]))),
long, start = st)
model
giving:
Nonlinear regression model
model: y ~ Max[name] * (1 - exp(-k[name] * (time - Lag[name])))
data: long
Max1 Max2 Max3 k1 k2 k3 Lag1 Lag2
306.48737 389.84657 361.82290 0.12214 0.03857 0.13747 1.38072 2.02205
Lag3
1.31770
residual sum-of-squares: 7167
Number of iterations to convergence: 8
Achieved convergence tolerance: 9.186e-06
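As mentioned above, testing whether a parameter is common across the names just means dropping its subscript. A minimal sketch (assuming model0, long and model from above) that fits a single common Lag and compares it to the full model:
# Sketch: one common Lag across gluc, starch and solka
st_common <- with(as.list(coef(model0)),
  list(Max = rep(Max, 3), k = rep(k, 3), Lag = Lag))
model_commonLag <- nls(y ~ Max[name] * (1 - exp(-k[name] * (time - Lag))),
  long, start = st_common)
anova(model_commonLag, model)  # F-test of the two extra Lag parameters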
ggplot2 graphics could be done like this.
library(ggplot2)
fitdf <- transform(long, fit = fitted(model))
ggplot(fitdf, aes(x = time, y = y, color = name)) +
geom_point() +
geom_line(aes(y = fit))
A slightly different-looking plot can be generated using lattice graphics, which comes with R, so the package does not have to be installed. The code is particularly compact.
library(lattice)
xyplot(fit + y ~ time | name, fitdf, type = c("l", "p"), auto.key = TRUE)
2) nlsList. If you don't need to investigate common settings of parameters among the names, then another possibility is to use nlsList in the nlme package (which comes with R, so you don't have to install it). long and st0 are from above.
library(nlme)
fit <- nlsList(y ~ Max * (1-exp(-k * (time - Lag))) | name, long, start = st0)
giving an nlsList object whose 3 components are the three nls objects obtained by running nls for each name.
> fit
Call:
Model: y ~ Max * (1 - exp(-k * (time - Lag))) | name
Data: long
Coefficients:
Max k Lag
gluc 306.4875 0.12214330 1.380713
solka 389.8449 0.03856544 2.022057
starch 361.8231 0.13747402 1.317698
Degrees of freedom: 21 total; 12 residual
Residual standard error: 24.43858
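The per-sample estimates can also be extracted programmatically; coef() on an nlsList object returns a data frame with one row per name, and the individual fits can be indexed by name:
coef(fit)               # Max, k and Lag, one row per name
summary(fit[["gluc"]])  # full nls summary for a single sample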
We can plot the data and fit:
levs <- levels(long$name)
col <- setNames(rainbow(length(levs)), levs)
plot(y ~ time, long, col = col[name], pch = 20, cex = 1.5)
for(lv in levs) lines(fitted(fit[[lv]]) ~ time, dat, col = col[lv])
legend("bottomright", leg = levs, col = col, pch = 20, cex = 1.5)
3) subset. An approach similar to (2) is to perform three nls runs using subset= to select the data. This returns a named list of nls objects. st0 and long are from above. No packages are used.
fit <- Map(function(nm) nls(y ~ Max * (1-exp(-k * (time - Lag))), data = long,
start = st0, subset = name == nm), levels(long$name))
The graphics code in (2) also works here.
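Since the result here is an ordinary named list of nls objects, the coefficients can be collected in one step, e.g.:
sapply(fit, coef)  # matrix of Max, k and Lag, one column per name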
Build the formulas you want to use as strings:
outcomes = c("gluc", "starch", "solka")
my_formulas = paste(outcomes, "~ Max * (1-exp(-k * (time - Lag)))")
model_list = list()
for(i in seq_along(outcomes)) {
  model_list[[outcomes[i]]] = nls(
    as.formula(my_formulas[i]),
    data = dat,
    start = list(Max = 200, k = 0.1, Lag = 0.5)
  )
}
This will create a list of models, which you can then access with, e.g., summary(model_list[[1]]) or summary(model_list[["solka"]]).
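To collect the estimates from all models at once, a small usage sketch building on the loop above:
do.call(rbind, lapply(model_list, coef))  # one row of Max, k and Lag per outcome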
I have a multivariate binomial GLM with one quadratic (dist_Roads) and multiple linear terms:
GLM <- glm(formula = Presence ~ dist_NP_boundary + dist_Villages + dist_Water + dist_Grassland + dist_Roads + I(dist_Roads^2), family = "binomial", data = DF)
All explanatory variables are continuous (in metres), the response variable is binomial (0/1).
The prediction plot generated by code plot(ggeffects::ggpredict(GLM), facets = TRUE) looks as follows:
(prediction plot not reproduced here)
How do I get the value of the quadratic explanatory variable (dist_Roads) at the point where the response variable is at its maximum? I.e. what is the X value at the peak (vertex) of the curve?
How can I calculate it in R?
We really need some data to demonstrate here, so let's create a similar set, focusing only on Roads and Outcome:
set.seed(1)
DF <- data.frame(Roads = seq(0, 10000, 100))
DF$Outcome <- rbinom(101, 1, exp(0.005 * DF$Roads - 5e-7 * DF$Roads^2 - 11) /
(exp(0.005 * DF$Roads - 5e-7 * DF$Roads^2 - 11) + 1))
Now let's create our model and plot it to see the shape:
model <- glm(Outcome ~ Roads + I(Roads^2), data = DF, family = binomial)
plot(ggeffects::ggpredict(model, "Roads"))
#> Loading required namespace: ggplot2
To find where the peak occurs, let's look at the coefficients:
coef(model)
#> (Intercept) Roads I(Roads^2)
#> -1.026831e+01 4.506460e-03 -4.361942e-07
Remember that the maximum probability occurs where the log odds are at their maximum, and the log odds are given by these coefficients in the form:
#> y = -1.026831e+01 + 4.506460e-03 x - 4.361942e-07 x^2
Which means the derivative is given by:
#> dy/dx = 4.506460e-03 - 2 * 4.361942e-07 x
The only place where dy/dx is 0 is at the maximum point, so we can find the x value at the maximum by rearranging:
#> 0 = 4.506460e-03 - 2 * 4.361942e-07 x
#> 2 * 4.361942e-07 x = 4.506460e-03
#> x = 4.506460e-03 / (2 * 4.361942e-07)
Writing the quadratic coefficient with its sign (it is negative here), this is x = -b/(2c), so the bottom line is that we can simply get the maximum by doing:
maximum_x <- -coef(model)[2] / (2 * coef(model)[3])
maximum_x
#> Roads
#> 5165.658
To confirm this is the case, let's predict the y value at this x value.
maximum_y <- predict(model, newdata = data.frame(Roads = maximum_x),
type = "response")
If we have this right, a point plotted at [maximum_x, maximum_y] should be at the peak of our curve:
plot(ggeffects::ggpredict(model, "Roads")) +
ggplot2::annotate(geom = "point", x = maximum_x, y = maximum_y, size = 3)
Created on 2022-08-21 with reprex v2.0.2
To answer your follow-up question in the comments: you can calculate adjusted predictions at specific values by putting those values in brackets:
# example from Allan's answer
library(ggeffects)
set.seed(1)
DF <- data.frame(Roads = seq(0, 10000, 100))
DF$Outcome <- rbinom(101, 1, exp(0.005 * DF$Roads - 5e-7 * DF$Roads^2 - 11) /
(exp(0.005 * DF$Roads - 5e-7 * DF$Roads^2 - 11) + 1))
model <- glm(Outcome ~ Roads + I(Roads^2), data = DF, family = binomial)
# value 5165 - see Allan's answer
ggpredict(model, "Roads [5165]")
#> # Predicted probabilities of Outcome
#>
#> Roads | Predicted | 95% CI
#> --------------------------------
#> 5165 | 0.80 | [0.63, 0.90]
Created on 2022-08-23 by the reprex package (v2.0.1)
There's a vignette showing all possibilities to calculate predictions at specific values: https://strengejacke.github.io/ggeffects/articles/introduction_effectsatvalues.html
I have made an ODE model in R using the package deSolve. Currently the output of the model gives me the "observed" prevalence of a disease (i.e. the prevalence not accounting for diagnostic imperfection).
However, I want to adjust the model to output the "true" prevalence, using a simple adjustment formula called the Rogan-Gladen estimator (http://influentialpoints.com/Training/estimating_true_prevalence.htm):
True prevalence =
(Apparent prev. + (Specificity-1)) / (Specificity + (Sensitivity-1))
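For reference, the estimator is just an algebraic correction that can be applied to a prevalence value after the fact. A small illustrative sketch (the apparent prevalence of 0.30 below is made up):
rogan_gladen <- function(apparent, se, sp) (apparent + sp - 1) / (se + sp - 1)
rogan_gladen(apparent = 0.30, se = 0.95, sp = 0.92)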
As you will see in the code below, I have attempted to adjust only one of the differential equations (diggP).
Running the model without adjustment gives an expected output (a proportion between 0 and 1). However, attempting to adjust the model using the RG-estimator gives a spurious output (a proportion less than 0).
Any advice on what might be going wrong here would be very much appreciated.
# Load required packages
library(tidyverse)
library(broom)
library(deSolve)
# Set time (age) for function
time = 1:80
# Defining exponential decay of lambda over age
y1 = 0.003 + (0.15 - 0.003) * exp(-0.05 * time) %>% jitter(10)
df <- data.frame(t = time, y = y1)
fit <- nls(y ~ SSasymp(time, yf, y0, log_alpha), data = df)
fit
# Values of lambda over ages 1-80 years
data <- as.matrix(0.003 + (0.15 - 0.003) * exp(-0.05 * time))
lambda<-as.vector(data[,1])
t<-as.vector(seq(1, 80, by=1))
foi<-cbind(t, lambda)
foi[,1]
# Making lambda varying by time useable in the ODE model
input <- approxfun(x = foi[,1], y = foi[,2], method = "constant", rule = 2)
# Model
ab <- function(time, state, parms) {
with(as.list(c(state, parms)), {
# lambda, changing by time
import<-input(time)
# Derivatives
# RG estimator:
#True prevalence = (apparent prev + (sp-1)) / (sp + (se-1))
diggP<- (((import * iggN) - iggR * iggP) + (sp_igg-1)) / (sp_igg + (se_igg-1))
diggN<- (-import*iggN) + iggR*iggP
dtgerpP<- (0.5*import)*tgerpN -tgerpR*tgerpP
dtgerpN<- (0.5*-import)*tgerpN + tgerpR*tgerpP
# Return results
return(list(c(diggP, diggN, dtgerpP, dtgerpN)))
})
}
# Initial values
yini <- c(iggP=0, iggN=1,
tgerpP=0, tgerpN=1)
# Parameters
pars <- c(iggR = 0, tgerpR = (1/8)/12,
se_igg = 0.95, sp_igg = 0.92)
# Solve model
results<- ode(y=yini, times=time, func=ab, parms = pars)
# Plot results
plot(results, xlab="Time (years)", ylab="Proportion")
I would like to fit an exponential decay function in R to the following data:
data <- structure(list(x = 0:38, y = c(0.991744340878828, 0.512512332368168,
0.41102449265681, 0.356621905557202, 0.320851602373477, 0.29499198506227,
0.275037747162642, 0.25938850981822, 0.245263623938863, 0.233655093612007,
0.224041426946405, 0.214152907133301, 0.207475138903635, 0.203270738895484,
0.194942528735632, 0.188107106969046, 0.180926819430008, 0.177028560207711,
0.172595416846822, 0.166729221891201, 0.163502461048814, 0.159286528409165,
0.156110097827889, 0.152655498715612, 0.148684858095915, 0.14733605355542,
0.144691873223729, 0.143118852619617, 0.139542186417186, 0.137730138713745,
0.134353615271572, 0.132197800438632, 0.128369567159113, 0.124971834736476,
0.120027536018095, 0.117678812415655, 0.115720611113327, 0.112491329844252,
0.109219168085624)), class = "data.frame", row.names = c(NA,
-39L), .Names = c("x", "y"))
I've tried fitting with nls but the generated curve is not close to the actual data.
It would be very helpful if anyone could explain how to work with such nonlinear data and find a function of best fit.
Try y ~ .lin / (b + x^c). Note that when using "plinear" one omits the .lin linear parameter when specifying the formula to nls and also omits a starting value for it.
Also note that the .lin and b parameters are approximately 1 at the optimum so we could also try the one parameter model y ~ 1 / (1 + x^c). This is the form of a one-parameter log-logistic survival curve. The AIC for this one parameter model is worse than for the 3 parameter model (compare AIC(fm1) and AIC(fm3)) but the one parameter model might still be preferable due to its parsimony and the fact that the fit is visually indistinguishable from the 3 parameter model.
opar <- par(mfcol = 2:1, mar = c(3, 3, 3, 1), family = "mono")
# data = data.frame with x & y col names; fm = model fit; main = string shown above plot
Plot <- function(data, fm, main) {
plot(y ~ x, data, pch = 20)
lines(fitted(fm) ~ x, data, col = "red")
legend("topright", bty = "n", cex = 0.7, legend = capture.output(fm))
title(main = paste(main, "- AIC:", round(AIC(fm), 2)))
}
# 3 parameter model
fo3 <- y ~ 1/(b + x^c) # omit .lin parameter; plinear will add it automatically
fm3 <- nls(fo3, data = data, start = list(b = 1, c = 1), alg = "plinear")
Plot(data, fm3, "3 parameters")
# one parameter model
fo1 <- y ~ 1 / (1 + x^c)
fm1 <- nls(fo1, data, start = list(c = 1))
Plot(data, fm1, "1 parameter")
par(opar)
AIC
Adding the solutions from the other answers, we can compare the AIC values. We have labelled each solution by the number of parameters it uses (the degrees of freedom are one greater than that) and have reworked the log-log solution to use nls instead of lm, with y on the left-hand side, since one cannot compare AIC values of models having different left-hand sides or fitted by different optimization routines (the log-likelihood constants used could differ).
fo2 <- y ~ exp(a + b * log(x+1))
fm2 <- nls(fo2, data, start = list(a = 1, b = 1))
fo4 <- y ~ SSbiexp(x, A1, lrc1, A2, lrc2)
fm4 <- nls(fo4, data)
aic <- AIC(fm1, fm2, fm3, fm4)
aic[order(aic$AIC), ]
giving from best AIC (i.e. fm3) to worst AIC (i.e. fm2):
df AIC
fm3 4 -329.35
fm1 2 -307.69
fm4 5 -215.96
fm2 3 -167.33
A biexponential model would fit much better, though still not perfect. This would indicate that you might have two simultaneous decay processes.
fit <- nls(y ~ SSbiexp(x, A1, lrc1, A2, lrc2), data = data)
#A1*exp(-exp(lrc1)*x)+A2*exp(-exp(lrc2)*x)
plot(y ~x, data = data)
curve(predict(fit, newdata = data.frame(x)), add = TRUE)
If the measurement error depends on magnitude, you could consider using it for weighting.
However, you should consider carefully what kind of model you'd expect from your domain knowledge. Just selecting a non-linear model empirically is usually not a good idea. A non-parametric fit might be a better option.
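For instance, if the error is roughly proportional to y (relative error), the observations could be down-weighted accordingly; a hedged sketch reusing the biexponential model above:
# assumes relative error; weights are evaluated within `data`
fit_w <- nls(y ~ SSbiexp(x, A1, lrc1, A2, lrc2), data = data, weights = 1/y^2)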
# data as defined in the question above
# Do this because the log of 0 is not possible to calculate
data$x = data$x +1
fit = lm(log(y) ~ log(x), data = data)
plot(data$x, data$y)
lines(data$x, exp(fit$coefficients[1]) * data$x ^ fit$coefficients[2], col = "red")
This did a lot better than the nls formula, and when plotted the fit seems to do fairly well.
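To get predictions on the original y scale from this log-log fit, back-transform with exp(); a small sketch (the new x values are hypothetical, and remember x was shifted by 1 above):
new_x <- c(5, 10, 20) + 1
exp(predict(fit, newdata = data.frame(x = new_x)))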
I am trying to understand how to use mixed linear models to analyse my data by simulating a model, but I can't reproduce the input parameters. What am I missing?
I want to start by simulating a model with a random intercept for each subject. The model I want to simulate and recover (written out from the generating code below) is:
y_ij = gamma00 + gamma01 * w_j + u0_j + beta1 * x_ij + e_ij
If beta1 (< 11) is small, I find gamma00 as the intercept in the fixed-effects section, but I am completely unable to retrieve the slope (beta1). Also, the linear effect is not significant. Where is my conceptual mistake?
library(lmerTest)
# Generating data set
# General values and variables
numObj <- 20
numSub <- 100
e <- rnorm(numObj * numSub, mean = 0, sd = 0.1)
x <- scale(runif(numObj * numSub, min = -100, max = 100))
y <- c()
index <- 1
# Coefficients
gamma00 <- 18
gamma01 <- 0.5
beta1 <- -100
w <- runif(numSub, min = -3, max = 3)
uo <- rnorm(numSub, mean = 0, sd = 0.1)
meanBeta0 <- mean(gamma00 + gamma01*w + uo) # I should be able to retrieve that parameter.
for(j in 1:numSub){
for(i in 1:numObj){
y[index] <- gamma00 + gamma01*w[j]+ uo[j] + beta1*x[i] + e[index]
index <- index + 1
}
}
dataFrame2 <- data.frame(y = y, x = x, subNo = factor(rep(1:numSub, each = numObj)), objNum = factor(rep(1:numObj, numSub)))
model2 <- lmer(y ~ x +
(1 | subNo), data = dataFrame2)
summary(model2)
anova(model2)
No conceptual mistake here, just a mixed-up index: you should be using index rather than i to index x in your data-generation loop.
Because of the mix-up you were using the first subject's x values to generate data for all the subjects, while the model used the individual x values; the corrected line is sketched below.
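Concretely, the corrected line inside the loop is (everything else stays the same):
y[index] <- gamma00 + gamma01*w[j] + uo[j] + beta1*x[index] + e[index]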
I have a series of data I have fit a power curve to, and I use the predict function in R to allow me predict y values based on additional x values.
set.seed(1485)
len <- 24
x <- runif(len)
y <- x^3 + rnorm(len, 0, 0.06)
ds <- data.frame(x = x, y = y)
mydata=data.frame(x,y)
z <- nls(y ~ a * x^b, data = mydata, start = list(a=1, b=1))
#z is same as M!
power <- round(summary(z)$coefficients[2, 1], 3)     # exponent b
power.se <- round(summary(z)$coefficients[2, 2], 3)  # standard error of b
plot(y ~ x, main = "Fitted power model", sub = "Blue: fit; green: known")
s <- seq(0, 1, length = 100)
lines(s, s^3, lty = 2, col = "green")
lines(s, predict(z, list(x = s)), lty = 1, col = "blue")
text(0, 0.5, paste("y =x^ (", power, " +/- ", power.se,")", sep = ""), pos = 4)
Instead of using the predict function here, how could I manually calculate estimated y values for additional x values from this power function? If this were just a simple linear regression, I would take the slope and y-intercept and calculate my y values with
y= mx + b
Is there a similar equation I can use from the output of z that will allow me to estimate y values from additional x values?
> z
Nonlinear regression model
model: y ~ a * x^b
data: mydata
a b
1.026 3.201
residual sum-of-squares: 0.07525
Number of iterations to convergence: 5
Achieved convergence tolerance: 5.162e-06
You would do it the same way except you use the power equation you modeled. You can access the parameters the model calculated using z$m$getPars()
Here is a simple example to illustrate:
predict(z, list(x = 1))
Results in: 1.026125
Which equals the results of
z$m$getPars()["a"] * 1 ^ z$m$getPars()["b"]
Which is equivalent to y = a * x^b.
Here are some ways.
1) with. This evaluates the right-hand side of the formula with respect to the coefficients:
x <- 1:2 # input
with(as.list(coef(z)), a * x^b)
## [1] 1.026125 9.437504
2) attach. We could also use attach, although it is generally frowned upon:
attach(as.list(coef(z)))
a * x^b
## [1] 1.026125 9.437504
3) explicit. Explicit definition of the coefficients:
a <- coef(z)[["a"]]; b <- coef(z)[["b"]]
a * x^b
## [1] 1.026125 9.437504
4) eval. This one extracts the formula from z so that we don't have to specify it again. formula(z)[[3]] is the right-hand side of the formula used to produce z. Use of eval is sometimes frowned upon, but this does avoid the redundant specification of the formula.
eval(formula(z)[[3]], as.list(coef(z)))
## [1] 1.026125 9.437504