Using mle2 for parameter estimates with error and predict - r

I'm using mle2 to estimate the parameters of a non-linear model, and I want standard errors around the parameter estimates. I'd also like to use the model to predict with newdata, but I'm running into errors at a couple of steps in this process.
Here's the data:
table<- "ac.temp performance
1 2.17 47.357923
4 2.17 234.255317
7 2.17 138.002633
10 2.17 227.545902
13 2.17 28.118072
16 9.95 175.638448
19 9.95 167.392218
22 9.95 118.162747
25 9.95 102.770622
28 9.95 191.874867
31 16.07 206.897159
34 16.07 74.741619
37 16.07 127.219884
40 16.07 208.231559
42 16.07 89.544612
44 20.14 314.946107
47 20.14 290.994063
50 20.14 243.322497
53 20.14 192.335133
56 20.14 133.841776
58 23.83 139.746673
61 23.83 224.135993
64 23.83 126.726493
67 23.83 246.443386
70 23.83 163.019896
83 28.04 4.614154
84 28.04 2.851866
85 28.04 2.935584
86 28.04 153.868415
87 28.04 103.884295
88 30.60 0.000000
89 29.60 0.000000
90 30.30 0.000000
91 29.90 0.000000
92 30.80 0.000000
93 28.90 0.000000
94 30.00 0.000000
95 30.20 0.000000
96 30.40 0.000000
97 30.70 0.000000
98 27.90 0.000000
99 28.60 0.000000
100 28.60 0.000000
101 30.40 0.000000
102 30.60 0.000000
103 29.70 0.000000
104 30.50 0.000000
105 29.30 0.000000
106 28.90 0.000000
107 30.40 0.000000
108 30.20 0.000000
109 30.10 0.000000
110 29.50 0.000000
111 31.00 0.000000
112 30.60 0.000000
113 30.90 0.000000
114 31.10 0.000000"
perfdat<- read.table(text=table, header=T)
First I have to set a couple of the fixed parameters for my non-linear model of animal performance with respect to temperature:
pi = mean(subset(perfdat, ac.temp<5)$performance)
ti = min(perfdat$ac.temp)
define the x variable (temperature)
t = perfdat$ac.temp
create a function for non-linear model
tpc = function(tm, topt, pmax) {
perfmu = pi+(pmax-pi)*(1+((topt-t)/(topt-tm)))*(((t-ti)/(topt-ti))^((tm-ti)/(topt-tm)))
perfmu[perfmu<0] = 0.00001
return(perfmu)
}
create the negative log likelihood function
LL1 = function(tm, topt, pmax, performance=perfdat$performance) {
perfmu = tpc(tm=tm, topt=topt, pmax=pmax)
loglike = -sum(dnorm(x=performance, mean=perfmu, log=TRUE))
return(loglike)
}
model performance using mle2 - maximum likelihood estimator
m1<- mle2(minuslogl = LL1, start=list(tm=15, topt=20, pmax=150), control=list(maxit=5000))
summary(m1)
This gives me parameter estimates but no standard errors, with the warning message In sqrt(diag(object@vcov)) : NaNs produced. However, the parameter estimates are good and give me predictions that make sense.
[Plot of the non-linear curve using the parameter estimates]
I have tried many different optimizers and methods and get the same error about not being able to calculate standard errors, usually with warnings about not being able to invert the Hessian, or I get really wonky estimates of my parameters that don't make sense.
If I use:
confint(m1)
I get 95% intervals for each of my parameters, but I can't incorporate those into a prediction method that I could use for making a graph like the one below, which I made using an nls model and predict():
[Non-linear model with error, graphed]
If I recreate my mle2() model by embedding the model formula into the mle2 function:
tpcfun<- function(t, tm.f, topt.f, pmax.f) {
perfmu = pi+(pmax.f-pi)*(1+((topt.f-t)/(topt.f-tm)))*(((t-ti)/(topt.f-ti))^((tm.f-ti)/(topt.f-tm.f)))
perfmu[perfmu<0] = 0.00001
return(perfmu)
}
m2 <- mle2(performance ~ dnorm(mean=-sum(log(tpcfun(t=ac.temp, tm.f, topt.f, pmax.f))), sd=1),
           method='L-BFGS-B', lower=c(tm.f=1, topt.f=1, pmax.f=1),
           control=list(maxit=5000, parscale=c(tm.f=1e1, topt.f=1e1, pmax.f=1e2)),
           start=list(tm.f=15, topt.f=20, pmax.f=150), data=perfdat)
summary(m2)
I get nonsensical estimates for my parameters, and I still don't get standard errors.
My question is: can anyone see anything wrong with either of my models (the model functions and likelihood functions), or anything else that I am doing wrong? I have a feeling that I may be writing the likelihood function incorrectly; I've tried all sorts of distributions and approaches, but I may be totally screwing it up.
Or is there a way to get error estimates around my parameters that I can use to visualize error around my model prediction in graphs?
Thanks,
Rylee
PS. I decided to make a graph with just the point estimates and a trend line without error around it. I wanted to put 95% CI bars on each of the point estimates, but confint() is giving me ridiculously small CIs which don't even show up on the graph, because they are smaller than the point character I'm using, ha.

I think the problem is in the maxit argument. Try to use good start values and avoid a high iteration limit. The second problem is the "L-BFGS-B" algorithm instead of the default. When using the confint function, it is normal to obtain the intervals only if the mle2 optimization converged. Check whether the profiles can be plotted (with the plotprofmle function); that is safer.
A "NaNs produced" warning is normal if your data contain zero values when the log is applied. I suggest using this line:
loglike = -sum(dnorm(x=performance, mean=perfmu, log=TRUE), na.rm=TRUE)
Then verify that the result is plausible.
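One more thing worth checking (an assumption on my part, since the likelihood above never sets sd): dnorm() defaults to sd = 1, so LL1 effectively assumes unit error variance, which can leave the Hessian badly conditioned and the standard errors as NaN. Estimating the standard deviation as a free parameter often fixes this. Below is a minimal sketch with base R's stats4::mle (which mle2 wraps), using a toy Gaussian curve in place of the thermal-performance model; the point is the extra logsd parameter, not the mean function.

```r
library(stats4)

set.seed(1)
x <- runif(50, 0, 30)
y <- 200 * exp(-((x - 18) / 8)^2) + rnorm(50, sd = 20)  # toy "performance" data

# Negative log-likelihood with the error sd estimated on the log scale,
# so it stays positive during optimization.
nll <- function(a, b, c, logsd) {
  mu <- a * exp(-((x - b) / c)^2)
  -sum(dnorm(y, mean = mu, sd = exp(logsd), log = TRUE))
}

fit <- mle(nll, start = list(a = 150, b = 15, c = 5, logsd = log(10)))
summary(fit)            # now reports a Std. Error for every parameter
sqrt(diag(vcov(fit)))   # the same standard errors, via the inverse Hessian
```

With a finite vcov(fit), a delta-method or parametric-bootstrap band around the predicted curve becomes possible, which addresses the plotting question as well.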

Related

Do results of survival analysis only pertain to the observations analyzed?

Hey guys, so I taught myself time-to-event analysis recently and I need some help understanding it. I made some Kaplan-Meier survival curves.
Sure, the number of observations within each node is small but let's pretend that I have plenty.
K <- HF %>%
filter(serum_creatinine <= 1.8, ejection_fraction <= 25)
## Call: survfit(formula = Surv(time, DEATH_EVENT) ~ 1, data = K)
##
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 20 36 5 0.881 0.0500 0.788 0.985
## 45 33 3 0.808 0.0612 0.696 0.937
## 60 31 3 0.734 0.0688 0.611 0.882
## 80 23 6 0.587 0.0768 0.454 0.759
## 100 17 1 0.562 0.0776 0.429 0.736
## 110 17 0 0.562 0.0776 0.429 0.736
## 120 16 1 0.529 0.0798 0.393 0.711
## 130 14 0 0.529 0.0798 0.393 0.711
## 140 14 0 0.529 0.0798 0.393 0.711
## 150 13 1 0.488 0.0834 0.349 0.682
If someone were to ask me about the third node, would the following statements be valid?
For any new patient that walks into this hospital with <= 1.8 in Serum_Creatine & <= 25 in Ejection Fraction, their probability of survival is 53% after 140 days.
What about:
The survival distributions for the samples analyzed, and no other future incoming samples, are visualized above.
I want to make sure these statements are correct.
I would also like to know whether logistic regression could be used to predict the binary variable DEATH_EVENT. Since the TIME variable determines how much weight one patient's death at 20 days carries relative to another patient's death at 175 days, I understand that this needs to be accounted for.
If logistic regression can be used, does that imply anything over keeping/removing variable TIME?
Here are some thoughts:
Logistic regression is not appropriate in your case, as it is not the correct method for time-to-event analysis.
If the clinical outcome observed is “either-or,” such as if a patient suffers an MI or not, logistic regression can be used.
However, if the information on the time to MI is the observed outcome, data are analyzed using statistical methods for survival analysis.
Text from here
If you want to use a regression model in survival analysis then you should use a Cox proportional hazards model. To understand the difference between a Kaplan-Meier analysis and a Cox proportional hazards model, you should understand both of them.
The next step would be to understand what a univariable Cox proportional hazards model is, in contrast to a multivariable one.
In the end you should understand all three methods (Kaplan-Meier, univariable Cox, and multivariable Cox); then you can answer your question of whether this is a valid statement:
For any new patient that walks into this hospital with <= 1.8 in Serum_Creatine & <= 25 in Ejection Fraction, their probability of survival is 53% after 140 days.
There is nothing wrong with stating the results of a subgroup from a Kaplan-Meier analysis, but the statement carries a different weight if it comes from a multivariable Cox regression analysis.
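To make the suggestion concrete, here is a minimal sketch of a multivariable Cox model, assuming the survival package and using its built-in lung dataset in place of the heart-failure data (so the variable names are illustrative, not from the question):

```r
library(survival)

# Multivariable Cox proportional hazards model
fit <- coxph(Surv(time, status) ~ age + sex, data = lung)
summary(fit)   # hazard ratios (exp(coef)) with confidence intervals

# Predicted survival curve for a hypothetical new patient:
newpat <- data.frame(age = 60, sex = 1)
sf <- survfit(fit, newdata = newpat)
summary(sf, times = 140)  # survival probability at 140 days for that patient
```

survfit(fit, newdata = ...) is the piece that turns the fitted model into a survival probability for a new patient, which is what a statement about "any new patient" actually requires; a Kaplan-Meier subgroup estimate only describes the analyzed sample.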

Kaplan Meier survival plot

Good morning,
I am having trouble understanding some of my outputs for my Kaplan Meier analyses.
I have managed to produce the following plots and outputs using ggsurvplot and survfit.
I first made a plot of survival time for 55 nests over time, and then did the same with the top predictors of nest failure, one being microtopography, as seen in this example.
Call: npsurv(formula = (S) ~ 1, data = nestdata, conf.type = "log-log")
26 observations deleted due to missingness
records n.max n.start events median 0.95LCL 0.95UCL
55 45 0 13 29 2 NA
Call: npsurv(formula = (S) ~ Microtopography, data = nestdata, conf.type = "log-log")
29 observations deleted due to missingness
records n.max n.start events median 0.95LCL 0.95UCL
Microtopography=0 14 13 0 1 NA NA NA
Microtopography=1 26 21 0 7 NA 29 NA
Microtopography=2 12 8 0 5 3 2 NA
So, I have two primary questions.
1. The survival curves are for a ground-nesting bird with an egg incubation time of 21-23 days (incubation time is the number of days the hen sits on the eggs before they hatch). Knowing that, how is it possible that the median survival time in plot #1 is 29 days? It seems to fit with the literature I have read on this same species; I assume it has something to do with the left censoring in my models, but I am honestly at a loss. If anyone has any insight, or even any literature that could help me understand this concept, I would really appreciate it.
2. I am also wondering how I can compare median survival times for the 2nd plot. Because microtopography survival curves 1 and 2 never cross the 0.5 point, the median survival times returned are NA. I understand I can choose another quantile, such as 0.75, but in this example that still wouldn't help me, because microtopography 0 never drops below 0.9 or so. How would one go about reporting these data? Would the workaround be to choose a set of survival intervals, using:
summary(s,times=c(7,14,21,29))
Call: npsurv(formula = (S) ~ Microtopography, data = nestdata, conf.type = "log-log")
29 observations deleted due to missingness
Microtopography=0
time n.risk n.event censored survival std.err lower 95% CI upper 95% CI
7 3 0 0 1.000 0.0000 1.000 1.000
14 7 0 0 1.000 0.0000 1.000 1.000
21 13 0 0 1.000 0.0000 1.000 1.000
29 8 1 5 0.909 0.0867 0.508 0.987
Microtopography=1
time n.risk n.event censored survival std.err lower 95% CI upper 95% CI
7 9 0 0 1.000 0.0000 1.000 1.000
14 17 1 0 0.933 0.0644 0.613 0.990
21 21 3 0 0.798 0.0909 0.545 0.919
29 15 3 7 0.655 0.1060 0.409 0.819
Microtopography=2
time n.risk n.event censored survival std.err lower 95% CI upper 95% CI
7 1 2 0 0.333 0.272 0.00896 0.774
14 7 1 0 0.267 0.226 0.00968 0.686
21 8 1 0 0.233 0.200 0.00990 0.632
29 3 1 5 0.156 0.148 0.00636 0.504
Late to the party...
The median survival time of 29 days is the median incubation time that birds of this species are expected to spend in the egg before hatching, based on your data. Your median of 21-24 days (based on what?) is probably based on many experiments/studies of eggs that have hatched, ignoring those that haven't hatched yet (those that failed?).
From your overall survival curve, it is clear that some eggs have not yet hatched, even after more than 35 days. These are taken into account when calculating the expected survival times. If you think that these eggs will fail, then omit them. Otherwise, the software cannot possibly know that they will eventually fail. But how can anyone know for sure if an egg is going to fail, even after 30 days? Is there a known maximum hatching time? The record-breaker of all hatched eggs?
These are not really R questions, so this might be more appropriate for the statistics site. But the following might help.
how is it possible that the median survival time in plot #1 is 29 days?
The median survival is where the survival curve passes the 50% mark. Eyeballing it, 29 days looks right.
I am also wondering how I can compare median survival times for the 2nd plot. Because microtopography survival curves 1 and 2 never cross the 0.5 point.
Given your data, you cannot compare the median. You can compare the 75% or 90%, if you must. You can compare the point survival at, say, 30 days. You can compare the truncated average survival in the first 30 days.
In order to compare the medians, you would have to make an assumption. A reasonable assumption would be an exponential decay after some tenure point that includes at least one failure.
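A minimal sketch of two of the comparisons suggested above (point survival at a fixed time, and the truncated/restricted mean), assuming the survival package and using its built-in aml dataset in place of the nest data:

```r
library(survival)

fit <- survfit(Surv(time, status) ~ x, data = aml)

# Point survival (with confidence intervals) at a fixed time, e.g. 30 weeks:
summary(fit, times = 30)

# Restricted mean survival time up to 30 weeks, reported per group:
print(fit, rmean = 30)
```

The restricted mean is comparable across groups even when a group's median is NA because its curve never crosses 0.5.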

How to pass nlpr (n parameter logistic regression) to stat_smooth in ggplot?

My data is as follows
# A tibble: 24 x 3
time OD600 strain
<dbl> <dbl> <chr>
1 0.0001 0.0001 M12-611020
2 1.0000 0.0880 M12-611020
3 3.0000 0.2110 M12-611020
4 4.0000 0.2780 M12-611020
5 4.5000 0.4040 M12-611020
6 5.0000 0.6060 M12-611020
7 5.5000 0.7780 M12-611020
8 6.0000 0.9020 M12-611020
9 6.5000 1.0240 M12-611020
10 8.0000 1.1000 M12-611020
11 0.0001 0.0001 M12-611025
12 1.0000 0.0770 M12-611025
13 3.0000 0.0880 M12-611025
14 4.0000 0.1250 M12-611025
15 5.0000 0.3040 M12-611025
16 5.5000 0.4210 M12-611025
17 6.0000 0.5180 M12-611025
18 6.5000 0.6160 M12-611025
19 7.0000 0.7180 M12-611025
20 7.5000 0.8520 M12-611025
21 8.0000 0.9400 M12-611025
22 8.5000 0.9500 M12-611025
23 9.0000 0.9680 M12-611025
I have two "strains" in the data frame, each with their own sets of values for "time" and "OD600".
I have so far been plotting with ggplot as follows (removing aesthetics for simplicity), using "loess" to fit a curve:
growth_curve_SE <- growth_curve +
stat_smooth(aes(group=strain,fill=strain, colour = strain) ,method = "loess", se = T, alpha=0.2 , span = 0.8) +
geom_point(aes(fill=factor(strain)),alpha=0.5 , size=3,shape = 21,colour = "black", stroke = 1)
What I ultimately want is to fit a 5-parameter logistic regression rather than "loess", as it is a better model for the data and fits a more accurate curve.
I used the package "nplr" to fit the regression for multiple strains, using a list split by strain:
strain_list <- split(multi_strain, multi_strain$strain)
np2 <- lapply(strain_list, function(tmp) {nplr(tmp$time, tmp$OD600, useLog = F)})
Which fits the regression:
$`M12-611020`
Instance of class nplr
Call:
nplr(x = tmp$time, y = tmp$OD600, useLog = F)
weights method: residuals
5-P logistic model
Bottom asymptote: 0.03026607
Top asymptote: 1.104278
Inflexion point at (x, y): 5.297454 0.6920488
Goodness of fit: 0.9946967
Weighted Goodness of fit: 0.9998141
Standard error: 0.0308006 0.01631115
$`M12-611025`
Instance of class nplr
Call:
nplr(x = tmp$time, y = tmp$OD600, useLog = F)
weights method: residuals
5-P logistic model
Bottom asymptote: -0.0009875526
Top asymptote: 0.9902298
Inflexion point at (x, y): 6.329304 0.5919818
Goodness of fit: 0.9956551
Weighted Goodness of fit: 0.9998606
Standard error: 0.02541948 0.01577407
Any ideas how I can achieve the same in ggplot using the "stat_smooth" command to use the 5 parameter logistic regression, either with or without the "nplr" package?
For anyone interested, I found a work around for this.
The nplr package allows you to output the fitted curve as a series of x and y coordinates, as follows:
x <- getXcurve(data)
y <-getYcurve(data)
From these I used the "geom_line" function with those x and y values, and that gave me the (n-)parameter logistic regression curve I was after in the form of a line. Layer on the "geom_point" as above and you get a good-looking graph.

R - two types of prediction in cross validation

When I use cross-validation with my data, it gives me two types of prediction: cvpred and Predicted. I guess cvpred is the cross-validation prediction, but what is the other?
Here is some of my code:
crossvalpredict <- cv.lm(data = total,form.lm = formula(verim~X4+X4.1),m=5)
And this is the result:
fold 1
Observations in test set: 5
3 11 15 22 23
Predicted 28.02 32.21 26.53 25.1 21.28
cvpred 20.23 40.69 26.57 34.1 26.06
verim 30.00 31.00 28.00 24.0 20.00
CV residual 9.77 -9.69 1.43 -10.1 -6.06
Sum of squares = 330 Mean square = 66 n = 5
fold 2
Observations in test set: 5
2 7 21 24 25
Predicted 28.4 32.0 26.2 19.95 25.9
cvpred 52.0 81.8 36.3 14.28 90.1
verim 30.0 33.0 24.0 21.00 24.0
CV residual -22.0 -48.8 -12.3 6.72 -66.1
Sum of squares = 7428 Mean square = 1486 n = 5
fold 3
Observations in test set: 5
6 14 18 19 20
Predicted 34.48 36.93 19.0 27.79 25.13
cvpred 37.66 44.54 16.7 21.15 7.91
verim 33.00 35.00 18.0 31.00 26.00
CV residual -4.66 -9.54 1.3 9.85 18.09
Sum of squares = 539 Mean square = 108 n = 5
fold 4
Observations in test set: 5
1 4 5 9 13
Predicted 31.91 29.07 32.5 32.7685 28.9
cvpred 30.05 28.44 54.9 32.0465 11.4
verim 32.00 27.00 31.0 32.0000 30.0
CV residual 1.95 -1.44 -23.9 -0.0465 18.6
Sum of squares = 924 Mean square = 185 n = 5
fold 5
Observations in test set: 5
8 10 12 16 17
Predicted 27.8 30.28 26.0 27.856 35.14
cvpred 50.3 33.92 45.8 31.347 29.43
verim 28.0 30.00 24.0 31.000 38.00
CV residual -22.3 -3.92 -21.8 -0.347 8.57
Sum of squares = 1065 Mean square = 213 n = 5
Overall (Sum over all 5 folds)
ms
411
You can check that by reading the help page of the function you are using, cv.lm. There you will find this paragraph:
The input data frame is returned, with additional columns
‘Predicted’ (Predicted values using all observations) and ‘cvpred’
(cross-validation predictions). The cross-validation residual sum
of squares (‘ss’) and degrees of freedom (‘df’) are returned as
attributes of the data frame.
This says that Predicted is a vector of predicted values made using all the observations; in other words, these are predictions made on your "training" data, or "in sample".
To check whether this is so, you can fit the same model using lm:
fit <- lm(verim~X4+X4.1, data=total)
And see if the predicted values from this model:
predict(fit)
are the same as those returned by cv.lm
When I tried it on the iris dataset in R, the Predicted column from cv.lm() returned the same values as predict(lm()). So in that case they are in-sample predictions, where the model is fitted to and evaluated on the same observations.
lm() does not give "better results." I am not sure how predict() and cv.lm() could give the same values. predict() returns the expected value of Y for each sample, estimated from the fitted model (the covariates X and their corresponding estimated beta values). Those betas, and the model error E, were estimated from the original data, so predict() gives an overly optimistic estimate of model performance; that is why it seems better. You get a better (more realistic) estimate of model performance from an iterated sample-holdout technique such as cross-validation (CV). The least biased estimate comes from leave-one-out CV, and the estimate with the least uncertainty (prediction error) comes from 2-fold (K = 2) CV.
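The distinction can be demonstrated in a few lines of base R. Here is a sketch using the built-in mtcars data and a hand-rolled 5-fold split as a stand-in for cv.lm (the formula and data are illustrative, not from the question):

```r
set.seed(42)
d <- mtcars
folds <- sample(rep(1:5, length.out = nrow(d)))

# "Predicted": fit on all rows, predict the same rows (in-sample)
insample <- predict(lm(mpg ~ wt + hp, data = d))

# "cvpred": each row predicted by a model that never saw it
cvpred <- numeric(nrow(d))
for (k in 1:5) {
  fit <- lm(mpg ~ wt + hp, data = d[folds != k, ])   # fit without fold k
  cvpred[folds == k] <- predict(fit, newdata = d[folds == k, ])
}

# In-sample error is optimistic; the CV error is (almost always) larger:
mean((d$mpg - insample)^2)
mean((d$mpg - cvpred)^2)
```

The first number is what Predicted measures; the second is what cvpred measures, and it is the more honest estimate of out-of-sample error.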

Using nls in R, error message: Missing value or an infinity produced when evaluating the model

> lenss
xd yd zd
1 0.0000 0.0000 2.44479
2 0.0937 0.0000 2.73183
3 0.3750 0.0000 2.97785
4 0.8437 0.0000 3.18626
5 1.5000 0.0000 3.36123
6 2.3437 0.0000 3.50624
7 3.3750 0.0000 3.62511
8 4.5937 0.0000 3.72124
9 5.9999 0.0000 3.79778
10 7.5936 0.0000 3.85744
11 9.3749 0.0000 3.90241
12 11.3436 0.0000 3.93590
13 13.4998 0.0000 3.96011
14 15.8435 0.0000 3.97648
15 18.3748 0.0000 3.98236
16 21.0935 0.0000 3.99406
17 23.9997 0.0000 3.99732
18 27.0934 0.0000 3.99911
19 30.3746 0.0000 4.00004
20 33.8433 0.0000 4.00005
21 37.4995 0.0000 4.00006
22 0.0663 0.0663 3.99973
23 0.2652 0.2652 3.99988
24 0.5966 0.5966 3.99931
25 1.0606 1.0606 3.99740
26 1.6573 1.6573 3.99375
27 2.3865 2.3865 3.98732
28 3.2482 3.2482 3.97640
29 4.2426 4.2426 3.95999
30 5.3695 5.3695 3.93598
31 6.6290 6.6290 3.90258
32 8.0211 8.0211 3.85171
33 9.5458 9.5458 3.79754
34 11.2031 11.2031 3.72156
35 12.9929 12.9929 3.62538
36 14.9153 14.9153 3.50636
37 16.9703 16.9703 3.36129
38 19.1579 19.1579 3.18622
39 21.4781 21.4781 2.97802
40 23.9308 23.9308 2.73206
41 26.5162 26.5162 2.44464
> rd=sqrt(xd^2+yd^2)
> fit=nls(zd~(rd^2/R)/(1+sqrt(1-(1+k)*rd^2/R^2))+d,start=list(R=75,k=-1,d=1))
Error in numericDeriv(form[[3L]], names(ind), env) :
Missing value or an infinity produced when evaluating the model
In addition: Warning message:
In sqrt(1 - (1 + k) * rd^2/R^2) : NaNs produced
The function for that model is given above. The question states that there are a few inaccurate measurements in the data and that I need to find them. My plan was to fit the model first and then examine the residuals for each measurement.
The argument of sqrt must be non-negative, but there is no assurance that it is in the setup shown in the question. Furthermore, even if that is fixed, it seems unlikely that the model can be fit as shown, since the data consist of two distinct curves (see the graphic below), which will likely have to be fit separately.
Using the drc package we can get a reasonable fit using its LL2.5 model like this:
library(drc)
plot(zd ~ rd)
g <- rep(1:2, c(21, 20))
fit1 <- drm(zd ~ rd, fct = LL2.5(), subset = g == 1)
fit2 <- drm(zd ~ rd, fct = LL2.5(), subset = g == 2)
lines(fitted(fit1) ~ rd[g == 1])
lines(fitted(fit2) ~ rd[g == 2])
This involves 10 parameters (5 for each curve). You might try the other models available there, by passing different models as the fct argument, to see if you can find something more parsimonious.
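Separately from drc, one way to address the original NaN is a bounded fit: nls(algorithm = "port") with box constraints chosen so that (1 + k) * rd^2 / R^2 never exceeds 1, keeping sqrt()'s argument non-negative throughout the search. A sketch on synthetic data generated from the same conic formula (the bounds and start values here are assumptions about plausible parameter ranges):

```r
set.seed(7)
rd <- seq(0, 25, length.out = 40)
zd <- (rd^2 / 75) / (1 + sqrt(1 - (1 + (-1)) * rd^2 / 75^2)) + 1 +
  rnorm(40, sd = 0.01)                  # true values: R = 75, k = -1, d = 1

# With R >= 30 and k <= 0, (1 + k) * rd^2 / R^2 <= 625/900 < 1 for rd <= 25,
# so the sqrt() argument stays positive everywhere inside the box.
fit <- nls(zd ~ (rd^2 / R) / (1 + sqrt(1 - (1 + k) * rd^2 / R^2)) + d,
           start = list(R = 60, k = -0.5, d = 0),
           algorithm = "port",
           lower = c(R = 30, k = -2, d = -5),
           upper = c(R = 200, k = 0, d = 5))
coef(fit)        # parameter estimates
residuals(fit)   # unusually large residuals flag the suspect measurements
```

On the real data the bounds would need to account for the two distinct branches noted above, presumably by fitting each branch separately as with drc.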
