I want to fit a coxph model in R and offset one main effect and the interaction term.
I tried wrapping the variables in offset(), but it gives the error:
Error in model.frame.default(formula = Surv(time = X, event = Delta) ~ :
variable lengths differ (found for 'offset(D:Y)')
We can generate some toy data like this:
require(survival)
D = rbinom(100, 1, 0.5)
Y = rbinom(100, 1, 0.5)
X = rexp(100, 1)
Delta = rbinom(100, 1, 0.5)
coxph(Surv(time = X, event = Delta) ~ D + offset(Y) + offset(D:Y))
I want to offset Y and D:Y, but it keeps giving me this error. Perhaps I am using "offset" incorrectly.
Although you write D:Y inside a formula, the expression is evaluated by offset(), which does not "know" about the special meaning of ":" in R's formula-parsing context. Instead, ":" is evaluated as the integer-sequence operator, which is what the warning messages below are telling you:
Error in model.frame.default(formula = Surv(time = X, event = Delta) ~ :
variable lengths differ (found for 'offset(D:Y)')
In addition: Warning messages:
1: In D:Y : numerical expression has 100 elements: only the first used
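You can reproduce the misparse directly outside the formula (a small illustrative check with the same toy data, not part of the original output):
# Evaluated as ordinary R code, ":" is the sequence operator, so D:Y collapses
# the two length-100 vectors to seq(D[1], Y[1]) and warns that only the first
# element of each is used -- hence the length mismatch inside model.frame().
D:Y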
If you want to use the product of D and Y as an additional "interaction" offset, you could do this instead:
> coxph(Surv(time = X, event = Delta) ~ D + offset(Y) + offset(I(D*Y)))
Call:
coxph(formula = Surv(time = X, event = Delta) ~ D + offset(Y) +
offset(I(D * Y)))
       coef exp(coef) se(coef)      z        p
D   -0.9959    0.3694   0.2825 -3.525 0.000423
Likelihood ratio test=12.26 on 1 df, p=0.0004635
n= 100, number of events= 53
Looking at that, I thought the I() could probably be dropped as well, since offset() is not applying formula-parsing logic anyway, and that proved to be the case.
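For completeness, the I()-free call looks like this (same toy data as above; it should give the same fit as the I(D * Y) version):
# offset() evaluates its argument as ordinary arithmetic, so plain D * Y works
coxph(Surv(time = X, event = Delta) ~ D + offset(Y) + offset(D * Y))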
I have made an ODE model in R using the package deSolve. Currently the output of the model gives me the "observed" prevalence of a disease (i.e. the prevalence not accounting for diagnostic imperfection).
However, I want to adjust the model to output the "true" prevalence, using a simple adjustment formula called the Rogan-Gladen estimator (http://influentialpoints.com/Training/estimating_true_prevalence.htm):
True prevalence = (Apparent prevalence + (Specificity - 1)) / (Specificity + (Sensitivity - 1))
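For reference, the estimator can be written as a small standalone R function (a minimal sketch; rg_true_prev is just an illustrative name, not part of my model code):
# Rogan-Gladen adjustment: apparent prevalence corrected for test imperfection
rg_true_prev <- function(apparent, se, sp) {
  (apparent + (sp - 1)) / (sp + (se - 1))
}
rg_true_prev(apparent = 0.30, se = 0.95, sp = 0.92)  # about 0.253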
As you will see in the code below, I have attempted to adjust only one of the differential equations (diggP).
Running the model without adjustment gives an expected output (a proportion between 0 and 1). However, attempting to adjust the model using the RG-estimator gives a spurious output (a proportion less than 0).
Any advice on what might be going wrong here would be very much appreciated.
# Load required packages
library(tidyverse)
library(broom)
library(deSolve)
# Set time (age) for function
time = 1:80
# Defining exponential decay of lambda over age
y1 = 0.003 + (0.15 - 0.003) * exp(-0.05 * time) %>% jitter(10)
df <- data.frame(t = time, y = y1)
fit <- nls(y ~ SSasymp(time, yf, y0, log_alpha), data = df)
fit
# Values of lambda over ages 1-80 years
data <- as.matrix(0.003 + (0.15 - 0.003) * exp(-0.05 * time))
lambda <- as.vector(data[, 1])
t <- as.vector(seq(1, 80, by = 1))
foi <- cbind(t, lambda)
foi[, 1]
# Making lambda varying by time useable in the ODE model
input <- approxfun(x = foi[,1], y = foi[,2], method = "constant", rule = 2)
# Model
ab <- function(time, state, parms) {
  with(as.list(c(state, parms)), {
    # lambda, changing by time
    import <- input(time)
    # Derivatives
    # RG estimator:
    # True prevalence = (apparent prev + (sp - 1)) / (sp + (se - 1))
    diggP <- (((import * iggN) - iggR * iggP) + (sp_igg - 1)) / (sp_igg + (se_igg - 1))
    diggN <- (-import * iggN) + iggR * iggP
    dtgerpP <- (0.5 * import) * tgerpN - tgerpR * tgerpP
    dtgerpN <- (0.5 * -import) * tgerpN + tgerpR * tgerpP
    # Return results
    return(list(c(diggP, diggN, dtgerpP, dtgerpN)))
  })
}
# Initial values
yini <- c(iggP = 0, iggN = 1,
          tgerpP = 0, tgerpN = 1)
# Parameters
pars <- c(iggR = 0, tgerpR = (1/8)/12,
          se_igg = 0.95, sp_igg = 0.92)
# Solve model
results <- ode(y = yini, times = time, func = ab, parms = pars)
# Plot results
plot(results, xlab="Time (years)", ylab="Proportion")
I would like to fit an exponential decay function in R to the following data:
data <- structure(list(x = 0:38, y = c(0.991744340878828, 0.512512332368168,
0.41102449265681, 0.356621905557202, 0.320851602373477, 0.29499198506227,
0.275037747162642, 0.25938850981822, 0.245263623938863, 0.233655093612007,
0.224041426946405, 0.214152907133301, 0.207475138903635, 0.203270738895484,
0.194942528735632, 0.188107106969046, 0.180926819430008, 0.177028560207711,
0.172595416846822, 0.166729221891201, 0.163502461048814, 0.159286528409165,
0.156110097827889, 0.152655498715612, 0.148684858095915, 0.14733605355542,
0.144691873223729, 0.143118852619617, 0.139542186417186, 0.137730138713745,
0.134353615271572, 0.132197800438632, 0.128369567159113, 0.124971834736476,
0.120027536018095, 0.117678812415655, 0.115720611113327, 0.112491329844252,
0.109219168085624)), class = "data.frame", row.names = c(NA,
-39L), .Names = c("x", "y"))
I've tried fitting with nls but the generated curve is not close to the actual data.
It would be very helpful if anyone could explain how to work with such nonlinear data and find a function of best fit.
Try y ~ .lin / (b + x^c). Note that when using "plinear" one omits the .lin linear parameter when specifying the formula to nls and also omits a starting value for it.
Also note that the .lin and b parameters are approximately 1 at the optimum so we could also try the one parameter model y ~ 1 / (1 + x^c). This is the form of a one-parameter log-logistic survival curve. The AIC for this one parameter model is worse than for the 3 parameter model (compare AIC(fm1) and AIC(fm3)) but the one parameter model might still be preferable due to its parsimony and the fact that the fit is visually indistinguishable from the 3 parameter model.
opar <- par(mfcol = 2:1, mar = c(3, 3, 3, 1), family = "mono")
# data = data.frame with x & y col names; fm = model fit; main = string shown above plot
Plot <- function(data, fm, main) {
  plot(y ~ x, data, pch = 20)
  lines(fitted(fm) ~ x, data, col = "red")
  legend("topright", bty = "n", cex = 0.7, legend = capture.output(fm))
  title(main = paste(main, "- AIC:", round(AIC(fm), 2)))
}
# 3 parameter model
fo3 <- y ~ 1/(b + x^c) # omit .lin parameter; plinear will add it automatically
fm3 <- nls(fo3, data = data, start = list(b = 1, c = 1), alg = "plinear")
Plot(data, fm3, "3 parameters")
# one parameter model
fo1 <- y ~ 1 / (1 + x^c)
fm1 <- nls(fo1, data, start = list(c = 1))
Plot(data, fm1, "1 parameter")
par(opar)  # restore the original graphics parameters
AIC
Adding the solutions from the other answers, we can compare the AIC values. Each solution is labelled by the number of parameters it uses (the degrees of freedom are one greater than that). The log-log solution has been reworked to use nls instead of lm, with y as the left-hand side, because AIC values cannot be compared across models that have different left-hand sides or that use different optimization routines, since the log-likelihood constants involved can differ.
fo2 <- y ~ exp(a + b * log(x+1))
fm2 <- nls(fo2, data, start = list(a = 1, b = 1))
fo4 <- y ~ SSbiexp(x, A1, lrc1, A2, lrc2)
fm4 <- nls(fo4, data)
aic <- AIC(fm1, fm2, fm3, fm4)
aic[order(aic$AIC), ]
giving from best AIC (i.e. fm3) to worst AIC (i.e. fm2):
      df     AIC
fm3    4 -329.35
fm1    2 -307.69
fm4    5 -215.96
fm2    3 -167.33
A biexponential model would fit much better, though still not perfect. This would indicate that you might have two simultaneous decay processes.
fit <- nls(y ~ SSbiexp(x, A1, lrc1, A2, lrc2), data = data)
#A1*exp(-exp(lrc1)*x)+A2*exp(-exp(lrc2)*x)
plot(y ~ x, data = data)
curve(predict(fit, newdata = data.frame(x)), add = TRUE)
If the measurement error depends on magnitude, you could consider using it for weighting.
However, you should consider carefully what kind of model you'd expect from your domain knowledge. Just selecting a non-linear model empirically is usually not a good idea. A non-parametric fit might be a better option.
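Two rough sketches of those directions (the 1/y^2 weighting scheme and the loess span are illustrative assumptions, not something derived from your measurement process):
# Weighted biexponential fit: if the measurement error grows with y, one simple
# (assumed) scheme is to weight each observation by 1 / y^2
fitw <- nls(y ~ SSbiexp(x, A1, lrc1, A2, lrc2), data = data, weights = 1/y^2)

# Non-parametric alternative: a loess smoother (span chosen arbitrarily here)
fitl <- loess(y ~ x, data = data, span = 0.5)
plot(y ~ x, data = data, pch = 20)
lines(data$x, predict(fitl), col = "blue")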
data <- structure(list(x = 0:38, y = c(0.991744340878828, 0.512512332368168,
0.41102449265681, 0.356621905557202, 0.320851602373477, 0.29499198506227,
0.275037747162642, 0.25938850981822, 0.245263623938863, 0.233655093612007,
0.224041426946405, 0.214152907133301, 0.207475138903635, 0.203270738895484,
0.194942528735632, 0.188107106969046, 0.180926819430008, 0.177028560207711,
0.172595416846822, 0.166729221891201, 0.163502461048814, 0.159286528409165,
0.156110097827889, 0.152655498715612, 0.148684858095915, 0.14733605355542,
0.144691873223729, 0.143118852619617, 0.139542186417186, 0.137730138713745,
0.134353615271572, 0.132197800438632, 0.128369567159113, 0.124971834736476,
0.120027536018095, 0.117678812415655, 0.115720611113327, 0.112491329844252,
0.109219168085624)), class = "data.frame", row.names = c(NA,
-39L), .Names = c("x", "y"))
# Shift x by 1 because the log of 0 cannot be calculated
data$x = data$x + 1
fit = lm(log(y) ~ log(x), data = data)
plot(data$x, data$y)
# back-transform the fit: fitted y = exp(intercept) * x^slope
lines(data$x, exp(fit$coefficients[1]) * data$x ^ fit$coefficients[2], col = "red")
This did a lot better than the nls formula, and when plotted the fit seems to follow the data fairly well.
I am trying to convert an "nlmrt" object to an "nls" object using nls2. However, I can only manage to do it if I write the names of the parameters explicitly in the call. Can't I define the parameter names programmatically? See the reproducible example:
library(nlmrt)
scale_vector <- function(vector, ranges_in, ranges_out){
  t <- (vector - ranges_in[1, ]) / (ranges_in[2, ] - ranges_in[1, ])
  vector <- (1 - t) * ranges_out[1, ] + t * ranges_out[2, ]
}
shobbs.res <- function(x) {
  # UNSCALED Hobbs weeds problem -- coefficients are rescaled internally using
  # scale_vector
  ranges_in <- rbind(c(0, 0, 0), c(100, 10, 0.1))
  ranges_out <- rbind(c(0, 0, 0), c(1, 1, 1))
  x <- scale_vector(x, ranges_in, ranges_out)
  tt <- 1:12
  res <- 100 * x[1] / (1 + 10 * x[2] * exp(-0.1 * x[3] * tt)) - y
}
y <- c(5.308, 7.24, 9.638, 12.866, 17.069, 23.192, 31.443,
38.558, 50.156, 62.948, 75.995, 91.972)
st <- c(b1=100, b2=10, b3=0.1)
ans1n <- nlfb(st, shobbs.res)
print(coef(ans1n))
This works:
library(nls2)
ans_nls2 <- nls2(y ~ shobbs.res(c(b1, b2, b3)) + y, start = coef(ans1n), alg = "brute")
However, this forces me to hard-code the parameter names in the call to nls2. For reasons related to my actual code, I would like to be able to do something like
ans_nls2 <- nls2(y ~ shobbs.res(names(st)) + y, start = coef(ans1n), alg = "brute")
But this returns an error:
Error in vector - ranges_in[1, ] :
non-numeric argument to binary operator
Is it possible to fix this without explicitly hard-coding the parameter names in the call to nls2?
nls2 will accept a string as a formula:
co <- coef(ans1n)
fo_str <- sprintf("y ~ shobbs.res(c(%s)) + y", toString(names(co)))
nls2(fo_str, start = co, alg = "brute")
giving:
Nonlinear regression model
model: y ~ shobbs.res(c(b1, b2, b3)) + y
data: NULL
      b1       b2       b3
196.1863  49.0916   0.3136
residual sum-of-squares: 2.587
Number of iterations to convergence: 3
Achieved convergence tolerance: NA
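Equivalently (this variant is my own addition, not part of the original answer), the string can be converted to a formula object explicitly before calling nls2:
fo <- as.formula(fo_str)  # turn the constructed string into a formula object
nls2(fo, start = co, alg = "brute")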
Like in this post, I'm struggling with the notation of MCMCglmm, especially what is meant by trait. My code is the following:
library("MCMCglmm")
set.seed(123)
y <- sample(letters[1:3], size = 100, replace = TRUE)
x <- rnorm(100)
id <- rep(1:10, each = 10)
dat <- data.frame(y, x, id)
mod <- MCMCglmm(fixed = y ~ x, random = ~ us(x):id,
                data = dat,
                family = "categorical")
This gives me the error message For error structures involving catgeorical data with more than 2 categories pleasue use trait:units or variance.function(trait):units. (sic). If I generate dichotomous data with letters[1:2], everything works fine. So what is meant by this error message in general, and by "trait" in particular?
Edit 2016-09-29:
From the linked question I copied rcov = ~ us(trait):units into my call to MCMCglmm. And from https://stat.ethz.ch/pipermail/r-sig-mixed-models/2010q3/004006.html I took (and slightly modified) the prior
prior <- list(R = list(V = diag(2), fix = 1),
              G = list(G1 = list(V = diag(2), nu = 1, alpha.mu = c(0, 0), alpha.V = diag(2) * 100)))
Now my model actually gives results:
MCMCglmm(fixed = y ~ 1 + x, random = ~ us(1 + x):id,
         rcov = ~ us(trait):units, prior = prior, data = dat,
         family = "categorical")
But I still lack an understanding of what is meant by trait (and what by units, the notation of the prior, how us() compares to idh(), and so on).
Edit 2016-11-17:
I think trait is synonymous with "target variable" or "response" in general, or y in this case. In the formula for random there is nothing on the left side of ~ "because the response is known from the fixed effect specification." So the rationale behind specifying that rcov needs trait:units could be that what trait is (y in this case) is already defined by the fixed formula.
units is the response variable value, and trait is the response variable name, which corresponds to the categories. By specifying rcov = ~us(trait):units, you are allowing the residual variance to be heterogeneous across "traits" (response categories) so that all elements of the residual variance-covariance matrix will be estimated.
In Section 5.1 of Hadfield's MCMCglmm Course Notes (vignette("CourseNotes", "MCMCglmm")) you can read an explanation for the reserved variables trait and units.