R: Using "group_by" with "nls()" - r

I have a dataset to which I want to fit a Gompertz model, grouped by four factors (subject, race, target and distractor). The Gompertz model works when applied to the entire dataset (i.e., without "group_by"), and group_by works when I use a (much simpler) linear regression. However, when I try to use group_by with the Gompertz model, I get the following error:
Error in chol2inv(object$m$Rmat()) :
element (3, 3) is zero, so the inverse cannot be computed
In addition: Warning messages:
1: In nls(yt ~ ymin + ymax * (exp(-exp((alpha * 2.718282/ymax) * (lambda - :
Convergence failure: false convergence (8)
2: In nls(yt ~ ymin + ymax * (exp(-exp((alpha * 2.718282/ymax) * (lambda - :
Convergence failure: singular convergence (7)
Here is the code:
grouped_data = all_merged %>%
  group_by(subject, race, target, distractor)

gomp_fits = do(grouped_data,
               tidy(nls(yt ~ ymin + ymax * (exp(-exp((alpha * 2.718282 / ymax) * (lambda - time) + 1))),
                        data = .,
                        start = list(lambda = 0.480, alpha = 5.8, ymin = 0, ymax = 1.6),
                        control = list(warnOnly = TRUE),
                        algorithm = "port",
                        lower = c(0, -Inf, -Inf, 0),
                        upper = c(2, Inf, Inf, 2))))
Thank you!

TLDR
Consider nlsLM() with a self-starting Gompertz model (or another method of calculating starting values), and use it in a group_modify() workflow.
Maybe something like this (upper and lower limits may not be necessary; if you do add them to the nlsLM() call, they must match the three SSgompertz parameters):
fit_gomp <- function(data, ...) {
  # SSgompertz() supplies its own starting values for Asym, b2, and b3;
  # any lower/upper bounds passed through ... must match those three parameters
  nlsLM(formula = y ~ SSgompertz(x, Asym, b2, b3),
        data = data,
        ...) %>%
    tidy()
}
data %>%
  group_by(subject, race, target, distractor) %>%
  group_modify(~ fit_gomp(data = .x), .keep = TRUE)
Getting starting values
While I haven't used a Gompertz model myself, consider whether you can find a way to derive starting values mathematically.
For example, let's say I want to fit a quadratic-plateau model (it only has 3 starting parameters however). First I have a function that defines the equation, which will go inside nls later.
# y = b0 + b1*x + b2*x^2
# b0 = intercept
# b1 = slope
# b2 = quadratic term
# jp = join point = critical concentration
quadp <- function(x, b0, b1, jp) {
  b2 <- -0.5 * b1 / jp
  # if_else() is from dplyr; base ifelse() would also work
  if_else(
    condition = x < jp,
    true  = b0 + (b1 * x) + (b2 * x * x),
    false = b0 + (b1 * jp) + (b2 * jp * jp)
  )
}
The second part is to make a fitting function that fits a quadratic polynomial, uses those coefficients as starting values in the nls portion, and fits the nls model.
fit_quadp <- function(data, ...) {
  # get starting values from simple quadratic
  start <- lm(y ~ poly(x, 2, raw = TRUE), data = data)
  start_values <- list(b0 = start$coef[[1]],   # intercept
                       b1 = start$coef[[2]],   # slope
                       jp = median(data$x))    # join-point
  # nls model that uses those starting values
  nlsLM(formula = y ~ quadp(x, b0, b1, jp),
        data = data,
        start = start_values,
        ...) %>%
    tidy()
}
The ... lets you pass additional arguments, such as control settings, to nlsLM() if needed.
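For example (a minimal sketch; my_data is a hypothetical data frame with x and y columns, and nls.lm.control() is the control helper that nlsLM() uses):
library(minpack.lm)
library(broom)
library(dplyr)
# pass a control object through the dots of fit_quadp()
fit_quadp(data = my_data, control = nls.lm.control(maxiter = 1024))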
Analyzing grouped data
As for analyzing grouped data, I use group_modify() because it returns a data frame whereas group_map() returns a list. So my basic workflow looks like:
dataset %>%
  group_by(grouping_variable_1, grouping_variable_2, ...) %>%
  group_modify(~ fit_quadp(data = .x), .keep = TRUE)
Then out comes a table with all the tidy statistics, because tidy() was used in the function. You can also consider wrapping the nls() portion of the function in try(), so that if the fit succeeds on the first two groups but fails on the third, the workflow still continues and you still get some results; see the sketch below.
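A minimal sketch of that idea (fit_quadp_safe is a hypothetical name; it assumes minpack.lm, broom and dplyr are loaded as above, and it returns an empty tibble for groups where the fit fails):
fit_quadp_safe <- function(data, ...) {
  # same starting-value logic as fit_quadp()
  start <- lm(y ~ poly(x, 2, raw = TRUE), data = data)
  start_values <- list(b0 = start$coef[[1]],
                       b1 = start$coef[[2]],
                       jp = median(data$x))
  fit <- try(nlsLM(formula = y ~ quadp(x, b0, b1, jp),
                   data = data,
                   start = start_values,
                   ...),
             silent = TRUE)
  # failed groups contribute nothing instead of stopping the pipeline
  if (inherits(fit, "try-error")) return(tibble::tibble())
  tidy(fit)
}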
nlsLM()
Also, if you want to use nlsLM() from minpack.lm, its Levenberg-Marquardt algorithm succeeds more often than the algorithms available in nls(). Some worry about false convergence, but I haven't seen it yet in my applications. With nlsLM() you may also not need to bother with upper and lower limits, though they can still be set.

Related

Offset interaction term

I want to fit a coxph model in R and offset one main effect and the interaction term.
I tried to use the offset() command in front of the variables, but it gives the error: Error in model.frame.default(formula = Surv(time = X, event = Delta) ~ :
variable lengths differ (found for 'offset(D:Y)')
We can generate some toy data like this:
require(survival)
D = rbinom(100, 1, 0.5)
Y = rbinom(100, 1, 0.5)
X = rexp(100, 1)
Delta = rbinom(100, 1, 0.5)
coxph(Surv(time = X, event = Delta) ~ D + offset(Y) + offset(D:Y))
I want to offset Y and D:Y, but it keeps giving me the error. Maybe I am using "offset" incorrectly.
Although you are using the expression D:Y within a formula, it is first evaluated by offset(), which does not actually "know" what to do with the ":" operator in the R formula-parsing sense. Instead, you get a warning message implying that it is being parsed as the integer-sequence operator.
Error in model.frame.default(formula = Surv(time = X, event = Delta) ~ :
variable lengths differ (found for 'offset(D:Y)')
In addition: Warning messages:
1: In D:Y : numerical expression has 100 elements: only the first used
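You can see this directly by evaluating D:Y outside a formula (a small illustration using the toy vectors above; the exact result depends on the simulated values):
D[1]:Y[1]     # ":" acts as the sequence operator here, e.g. 1:0 gives c(1, 0)
length(D:Y)   # only the first elements of D and Y are used, so this is not 100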
If you want to use the product of D and Y as an additional "interaction" offset, then you could have done this:
> coxph(Surv(time = X, event = Delta) ~ D + offset(Y) + offset(I(D*Y)))
Call:
coxph(formula = Surv(time = X, event = Delta) ~ D + offset(Y) +
offset(I(D * Y)))
     coef exp(coef) se(coef)      z        p
D -0.9959    0.3694   0.2825 -3.525 0.000423
Likelihood ratio test=12.26 on 1 df, p=0.0004635
n= 100, number of events= 53
Looking at it, I thought the I() could perhaps even be dropped, since offset() was not using formula-parsing logic anyway. And that proved to be the case.
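So this gives the same fit (a sketch with the same toy data; because offset() evaluates its argument as an ordinary R expression, the multiplication needs no I() wrapper):
coxph(Surv(time = X, event = Delta) ~ D + offset(Y) + offset(D * Y))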

Adjusting ODE model output using a Rogan-Gladen estimator in R

I have made an ODE model in R using the package deSolve. Currently the output of the model gives me the "observed" prevalence of a disease (i.e. the prevalence not accounting for diagnostic imperfection).
However, I want to adjust the model to output the "true" prevalence, using a simple adjustment formula called the Rogan-Gladen estimator (http://influentialpoints.com/Training/estimating_true_prevalence.htm):
True prevalence =
(Apparent prev. + (Specificity-1)) / (Specificity + (Sensitivity-1))
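Written as a standalone R function, the estimator would look something like this (a minimal sketch; rg_adjust is a hypothetical name and is not part of the model code below):
rg_adjust <- function(apparent, se, sp) {
  (apparent + (sp - 1)) / (sp + (se - 1))
}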
As you will see in the code below, I have attempted to adjust only one of the differential equations (diggP).
Running the model without adjustment gives an expected output (a proportion between 0 and 1). However, attempting to adjust the model using the RG-estimator gives a spurious output (a proportion less than 0).
Any advice on what might be going wrong here would be very much appreciated.
# Load required packages
library(tidyverse)
library(broom)
library(deSolve)
# Set time (age) for function
time = 1:80
# Defining exponential decay of lambda over age
y1 = 0.003 + (0.15 - 0.003) * exp(-0.05 * time) %>% jitter(10)
df <- data.frame(t = time, y = y1)
fit <- nls(y ~ SSasymp(time, yf, y0, log_alpha), data = df)
fit
# Values of lambda over ages 1-80 years
data <- as.matrix(0.003 + (0.15 - 0.003) * exp(-0.05 * time))
lambda<-as.vector(data[,1])
t<-as.vector(seq(1, 80, by=1))
foi<-cbind(t, lambda)
foi[,1]
# Making lambda varying by time useable in the ODE model
input <- approxfun(x = foi[,1], y = foi[,2], method = "constant", rule = 2)
# Model
ab <- function(time, state, parms) {
  with(as.list(c(state, parms)), {
    # lambda, changing by time
    import <- input(time)
    # Derivatives
    # RG estimator:
    # True prevalence = (apparent prev + (sp-1)) / (sp + (se-1))
    diggP   <- (((import * iggN) - iggR * iggP) + (sp_igg - 1)) / (sp_igg + (se_igg - 1))
    diggN   <- (-import * iggN) + iggR * iggP
    dtgerpP <- (0.5 * import) * tgerpN - tgerpR * tgerpP
    dtgerpN <- (0.5 * -import) * tgerpN + tgerpR * tgerpP
    # Return results
    return(list(c(diggP, diggN, dtgerpP, dtgerpN)))
  })
}
# Initial values
yini <- c(iggP=0, iggN=1,
tgerpP=0, tgerpN=1)
# Parameters
pars <- c(iggR = 0, tgerpR = (1/8)/12,
se_igg = 0.95, sp_igg = 0.92)
# Solve model
results<- ode(y=yini, times=time, func=ab, parms = pars)
# Plot results
plot(results, xlab="Time (years)", ylab="Proportion")

Exponential decay fit in R

I would like to fit an exponential decay function in R to the following data:
data <- structure(list(x = 0:38, y = c(0.991744340878828, 0.512512332368168,
0.41102449265681, 0.356621905557202, 0.320851602373477, 0.29499198506227,
0.275037747162642, 0.25938850981822, 0.245263623938863, 0.233655093612007,
0.224041426946405, 0.214152907133301, 0.207475138903635, 0.203270738895484,
0.194942528735632, 0.188107106969046, 0.180926819430008, 0.177028560207711,
0.172595416846822, 0.166729221891201, 0.163502461048814, 0.159286528409165,
0.156110097827889, 0.152655498715612, 0.148684858095915, 0.14733605355542,
0.144691873223729, 0.143118852619617, 0.139542186417186, 0.137730138713745,
0.134353615271572, 0.132197800438632, 0.128369567159113, 0.124971834736476,
0.120027536018095, 0.117678812415655, 0.115720611113327, 0.112491329844252,
0.109219168085624)), class = "data.frame", row.names = c(NA,
-39L), .Names = c("x", "y"))
I've tried fitting with nls but the generated curve is not close to the actual data.
It would be very helpful if anyone could explain how to work with such nonlinear data and find a function of best fit.
Try y ~ .lin / (b + x^c). Note that when using "plinear" one omits the .lin linear parameter when specifying the formula to nls and also omits a starting value for it.
Also note that the .lin and b parameters are approximately 1 at the optimum so we could also try the one parameter model y ~ 1 / (1 + x^c). This is the form of a one-parameter log-logistic survival curve. The AIC for this one parameter model is worse than for the 3 parameter model (compare AIC(fm1) and AIC(fm3)) but the one parameter model might still be preferable due to its parsimony and the fact that the fit is visually indistinguishable from the 3 parameter model.
opar <- par(mfcol = 2:1, mar = c(3, 3, 3, 1), family = "mono")

# data = data.frame with x & y col names; fm = model fit; main = string shown above plot
Plot <- function(data, fm, main) {
  plot(y ~ x, data, pch = 20)
  lines(fitted(fm) ~ x, data, col = "red")
  legend("topright", bty = "n", cex = 0.7, legend = capture.output(fm))
  title(main = paste(main, "- AIC:", round(AIC(fm), 2)))
}

# 3 parameter model
fo3 <- y ~ 1/(b + x^c) # omit .lin parameter; plinear will add it automatically
fm3 <- nls(fo3, data = data, start = list(b = 1, c = 1), alg = "plinear")
Plot(data, fm3, "3 parameters")

# one parameter model
fo1 <- y ~ 1 / (1 + x^c)
fm1 <- nls(fo1, data, start = list(c = 1))
Plot(data, fm1, "1 parameter")

par(opar)  # restore the original graphics parameters
AIC
Adding the solutions from the other answers, we can compare the AIC values. We have labelled each solution by the number of parameters it uses (the degrees of freedom are one greater than that). The log-log solution has been reworked to use nls instead of lm, with y as the left-hand side, since one cannot compare the AIC values of models that have different left-hand sides or that use different optimization routines (the log-likelihood constants used could differ).
fo2 <- y ~ exp(a + b * log(x+1))
fm2 <- nls(fo2, data, start = list(a = 1, b = 1))
fo4 <- y ~ SSbiexp(x, A1, lrc1, A2, lrc2)
fm4 <- nls(fo4, data)
aic <- AIC(fm1, fm2, fm3, fm4)
aic[order(aic$AIC), ]
giving from best AIC (i.e. fm3) to worst AIC (i.e. fm2):
    df     AIC
fm3  4 -329.35
fm1  2 -307.69
fm4  5 -215.96
fm2  3 -167.33
A biexponential model would fit much better, though still not perfect. This would indicate that you might have two simultaneous decay processes.
fit <- nls(y ~ SSbiexp(x, A1, lrc1, A2, lrc2), data = data)
#A1*exp(-exp(lrc1)*x)+A2*exp(-exp(lrc2)*x)
plot(y ~x, data = data)
curve(predict(fit, newdata = data.frame(x)), add = TRUE)
If the measurement error depends on magnitude, you could consider using it for weighting.
However, you should consider carefully what kind of model you'd expect from your domain knowledge. Just selecting a non-linear model empirically is usually not a good idea. A non-parametric fit might be a better option.
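For instance, a simple non-parametric smoother can be overlaid on the data for comparison (a minimal sketch using loess() from base R's stats package, with its default span):
fit_np <- loess(y ~ x, data = data)
plot(y ~ x, data = data)
lines(data$x, predict(fit_np), col = "blue")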
data <- structure(list(x = 0:38, y = c(0.991744340878828, 0.512512332368168,
0.41102449265681, 0.356621905557202, 0.320851602373477, 0.29499198506227,
0.275037747162642, 0.25938850981822, 0.245263623938863, 0.233655093612007,
0.224041426946405, 0.214152907133301, 0.207475138903635, 0.203270738895484,
0.194942528735632, 0.188107106969046, 0.180926819430008, 0.177028560207711,
0.172595416846822, 0.166729221891201, 0.163502461048814, 0.159286528409165,
0.156110097827889, 0.152655498715612, 0.148684858095915, 0.14733605355542,
0.144691873223729, 0.143118852619617, 0.139542186417186, 0.137730138713745,
0.134353615271572, 0.132197800438632, 0.128369567159113, 0.124971834736476,
0.120027536018095, 0.117678812415655, 0.115720611113327, 0.112491329844252,
0.109219168085624)), class = "data.frame", row.names = c(NA,
-39L), .Names = c("x", "y"))
# Shift x by 1 because the log of 0 cannot be calculated
data$x = data$x + 1
fit = lm(log(y) ~ log(x), data = data)
plot(data$x, data$y)
# back-transform the fit: y = exp(intercept) * x^slope
lines(data$x, exp(fit$coefficients[1]) * data$x ^ fit$coefficients[2], col = "red")
This did a lot better than using the nls formula, and when plotted the fit seems to do fairly well.

What am I doing wrong when trying to convert an nlmrt object to an nls object?

I am trying to convert an "nlmrt" object to an "nls" object using nls2. However, I can only manage to do it if I write the names of the parameters explicitly in the call. Can't I define the parameter names programmatically? See the reproducible example:
library(nlmrt)
scale_vector <- function(vector, ranges_in, ranges_out){
t <- (vector - ranges_in[1, ])/(ranges_in[2, ]-ranges_in[1, ])
vector <- (1-t) * ranges_out[1, ] + t * ranges_out[2, ]
}
shobbs.res <- function(x) {
# UNSCALED Hobbs weeds problen -- coefficients are rescaled internally using
# scale_vector
ranges_in <- rbind(c(0, 0, 0), c(100, 10, 0.1))
ranges_out <- rbind(c(0, 0, 0), c(1, 1, 1))
x <- scale_vector(x, ranges_in, ranges_out)
tt <- 1:12
res <- 100*x[1]/(1+10*x[2]*exp(-0.1*x[3]*tt)) - y }
y <- c(5.308, 7.24, 9.638, 12.866, 17.069, 23.192, 31.443,
38.558, 50.156, 62.948, 75.995, 91.972)
st <- c(b1=100, b2=10, b3=0.1)
ans1n <- nlfb(st, shobbs.res)
print(coef(ans1n))
This works:
library(nls2)
ans_nls2 <- nls2(y ~ shobbs.res(c(b1, b2, b3)) + y, start = coef(ans1n), alg = "brute")
However, this forces me to hard-code the parameter names in the call to nls2. For reasons related to my actual code, I would like to be able to do something like
ans_nls2 <- nls2(y ~ shobbs.res(names(st)) + y, start = coef(ans1n), alg = "brute")
But this returns an error:
Error in vector - ranges_in[1, ] :
non-numeric argument to binary operator
Is it possible to fix this without having to hard-code the parameter names explicitly in the call to nls2?
nls2 will accept a string as a formula:
co <- coef(ans1n)
fo_str <- sprintf("y ~ shobbs.res(c(%s)) + y", toString(names(co)))
nls2(fo_str, start = co, alg = "brute")
giving:
Nonlinear regression model
model: y ~ shobbs.res(c(b1, b2, b3)) + y
data: NULL
b1 b2 b3
196.1863 49.0916 0.3136
residual sum-of-squares: 2.587
Number of iterations to convergence: 3
Achieved convergence tolerance: NA

Meaning of "trait" in MCMCglmm

Like in this post, I'm struggling with the notation of MCMCglmm, especially what is meant by trait. My code is the following:
library("MCMCglmm")
set.seed(123)
y <- sample(letters[1:3], size = 100, replace = TRUE)
x <- rnorm(100)
id <- rep(1:10, each = 10)
dat <- data.frame(y, x, id)
mod <- MCMCglmm(fixed = y ~ x, random = ~ us(x):id,
                data = dat,
                family = "categorical")
This gives me the error message "For error structures involving catgeorical data with more than 2 categories pleasue use trait:units or variance.function(trait):units." (sic!). If I generated dichotomous data with letters[1:2], everything would work fine. So what is meant by this error message in general, and by "trait" in particular?
Edit 2016-09-29:
From the linked question I copied rcov = ~ us(trait):units into my call of MCMCglmm, and from https://stat.ethz.ch/pipermail/r-sig-mixed-models/2010q3/004006.html I took (and slightly modified) the prior
prior <- list(R = list(V = diag(2), fix = 1),
              G = list(G1 = list(V = diag(2), nu = 1, alpha.mu = c(0, 0),
                                 alpha.V = diag(2) * 100)))
Now my model actually gives results:
MCMCglmm(fixed = y ~ 1 + x, random = ~ us(1 + x):id,
         rcov = ~ us(trait):units, prior = prior, data = dat,
         family = "categorical")
But I still lack an understanding of what is meant by trait (and by units, the notation of the prior, how us() compares to idh(), and so on).
Edit 2016-11-17:
I think trait is synonymous with "target variable" or "response" in general, or y in this case. In the formula for random there is nothing on the left side of ~ "because the response is known from the fixed effect specification." So the rationale behind specifying that rcov needs trait:units could be that the fixed formula already defines what trait is (y in this case).
units indexes the observations (the rows of the response), and trait indexes the response variables (the columns), which here correspond to the categories. By specifying rcov = ~us(trait):units, you are allowing the residual variance to be heterogeneous across "traits" (response categories), so that all elements of the residual variance-covariance matrix will be estimated.
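To make the us()/idh() distinction raised in the question concrete, these are the two residual specifications one might compare (a sketch; rcov_us and rcov_idh are just illustrative labels for the one-sided formulas passed to rcov):
rcov_us  <- ~ us(trait):units   # full residual (co)variance matrix across the trait categories
rcov_idh <- ~ idh(trait):units  # heterogeneous variances only; covariances fixed at zero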
In Section 5.1 of Hadfield's MCMCglmm Course Notes (vignette("CourseNotes", "MCMCglmm")) you can read an explanation for the reserved variables trait and units.
