I am trying to generate an inverse Weibull distribution using parameters estimated from survreg in R. By this I mean I would like to, for a given probability (which will be a random number in a small simulation model implemented in MS Excel), return the expected time to failure using my parameters. I understand the general form for the inverse Weibull distribution to be:
X=b[-ln(1-rand())]^(1/a)
where a and b are shape and scale parameters respectively and X is the time to failure I want. My problem is in the interpretation of the intercept and covariate parameters from survreg. I have these parameters, the unit of time is days:
             Value   Std. Error  z       p
(Intercept)   7.79   0.2288      34.051  0.000
Group 2      -0.139  0.2335      -0.596  0.551
Log(scale)    0.415  0.0279      14.88   0.000
Scale= 1.51
Weibull distribution
Loglik(model)= -8356.7 Loglik(intercept only)= -8356.9
Chisq = 0.37 on 1 degrees of freedom, p= 0.55
Number of Newton-Raphson Iterations: 4
n=1682 (3 observations deleted due to missing values)
I have read in the help files that the coefficients from R are from the "extreme value distribution" but I'm unsure what this really means and how I get 'back to' the standard scale parameter used directly in the formulae. Using b=7.79 and a=1.51 gives nonsensical answers. I really want to be able to generate a time for both the base group and 'Group 2'. I also should note that I did not perform the analysis myself and cannot interrogate the data further.
This is explained in the manual page, ?survreg (in the "examples" section).
library(survival)
y <- rweibull(1000, shape=2, scale=5)
r <- survreg(Surv(y)~1, dist="weibull")
a <- 1/r$scale # Approximately 2
b <- exp( coef(r) ) # Approximately 5
y2 <- b * ( -log( 1-runif(1000) ) ) ^(1/a) # inverse-CDF sampling; log() is the natural log in R
y3 <- rweibull(1000, shape=a, scale=b)
# Check graphically that the distributions are the same
plot(sort(y), sort(y2))
abline(0,1)
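The key is that the shape parameter used by rweibull is the reciprocal of the scale parameter reported by survreg, and the scale parameter used by rweibull is the exponential of the survreg intercept (plus any covariate coefficients that apply to the group).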
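Applied to the output in the question, the conversion can be sketched as follows. This is a minimal sketch that assumes Group 2 enters the model as a 0/1 indicator, so that its coefficient is simply added to the intercept before exponentiating. The names time_to_failure, b_base and b_group2 are made up for illustration.

```r
# Estimates copied from the survreg summary in the question
intercept   <- 7.79    # (Intercept)
beta_group2 <- -0.139  # Group 2 coefficient (assumed to be a 0/1 indicator)
sr_scale    <- 1.51    # the "Scale=" line, i.e. exp(Log(scale))

# Convert to the usual Weibull parameterisation
a        <- 1 / sr_scale                  # shape
b_base   <- exp(intercept)                # scale, base group
b_group2 <- exp(intercept + beta_group2)  # scale, Group 2

# Inverse CDF: time to failure for a probability p in (0, 1)
time_to_failure <- function(p, shape, scale) {
  scale * (-log(1 - p))^(1 / shape)
}

set.seed(1)
p <- runif(5)
time_to_failure(p, a, b_base)    # simulated days to failure, base group
time_to_failure(p, a, b_group2)  # simulated days to failure, Group 2

# Cross-check against R's built-in Weibull quantile function
all.equal(time_to_failure(p, a, b_base),
          qweibull(p, shape = a, scale = b_base))
```

In Excel the same formula would be =b*(-LN(1-RAND()))^(1/a), using the converted a and the b of the group being simulated.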
I would like to test the symmetry of an observer's response to contrast stimuli of different polarity, positive (white) and negative (black). I took the reaction time (RT) as the dependent variable, across four different contrasts. It is known that the response time follows a Pieron curve whose asymptotes are located (1) at the observer's threshold (Inf) and (2) at a base RT somewhere between 250 and 450 msec.
This knowledge allows us to linearize the relationship by transforming the independent variable (effective contrast EC) as 1/EC^2 (tEC), so the equation linking RT to EC becomes:
RT = m * tEC + RT0
To test the symmetry I established the criterion: the same slope and the same intercept in the two polarities implies symmetry.
To obtain the coefficients I fitted a linear model with interaction (coding Polarity through a dummy variable: Positive or Negative). The output of lm is clear to me, but some colleagues prefer something more similar to an ANOVA output, so I decided to use emmeans to make the contrasts. The slopes are fine, but the problem starts when computing the intercepts. The intercepts computed by lm are very different from the output of emmeans, and the conclusions are also different. In what follows I reproduce the example.
The question is twofold: Is it possible to use emmeans to solve my problem? If not, is it possible to make the contrasts through other packages (which one)?
Data
RT1000      EC   tEC       Polarity
596.3564    -25  0.001600  Negative
648.2471    -20  0.002500  Negative
770.7602    -17  0.003460  Negative
831.2971    -15  0.004444  Negative
1311.3331    15  0.004444  Positive
1173.8942    17  0.003460  Positive
1113.7240    20  0.002500  Positive
869.3635     25  0.001600  Positive
Code
# Model
model <- lm(RT1000 ~ tEC * Polarity, data = Data)
# emmeans
library(emmeans)
# Slopes
m.slopes <- lstrends(model, "Polarity", var="tEC")
# Intercepts
m.intercept <- lsmeans(model, "Polarity")
# Contrasts
pairs(m.slopes)
pairs(m.intercept)
Outputs
Model
term                  estimate   std.error  statistic  p.value
(Intercept)           449.948    66.829     6.733      0.003
tEC                   87205.179  20992.976  4.154      0.014
PolarityPositive      230.946    94.511     2.444      0.071
tEC:PolarityPositive  58133.172  29688.551  1.958      0.122
Slopes (these are fine)
Polarity  tEC.trend  SE        df  lower.CL  upper.CL
Negative  87205.18   20992.98  4   28919.33  145491.0
Positive  145338.35  20992.98  4   87052.51  203624.2

contrast             estimate   SE        df  t.ratio    p.value
Negative - Positive  -58133.17  29688.55  4   -1.958101  0.12182
Intercepts (problem)
Polarity  lsmean     SE       df  lower.CL   upper.CL
Negative  711.6652   22.2867  4   649.7874   773.543
Positive  1117.0787  22.2867  4   1055.2009  1178.957

contrast             estimate   SE        df  t.ratio    p.value
Negative - Positive  -405.4135  31.51816  4   -12.86285  0.000211
The intercepts computed through emmeans differ from the ones computed by lm. I think the problem is that the model is not defined for EC = 0, but I'm not sure.
What you are calling the intercepts are not; they are the model predictions at the mean value of tEC. If you want the intercepts, use instead:
m.intercept <- lsmeans(model, "Polarity", at = list(tEC = 0))
You can tell what reference levels are being used via
ref_grid(model) # or str(m.intercept)
Please note that the model fitted here consists of two lines with different slopes; hence the difference between the predictions changes depending on the value of tEC. Thus, I would strongly recommend against testing the comparison of the intercepts; those are predictions at a tEC value that, as you say, can't even occur. Instead, try to be less of a mathematician and do the comparisons at a few representative values of tEC, e.g.,
LSMs <- lsmeans(model, "Polarity", at = list(tEC = c(0.001, 0.003, 0.005)))
pairs(LSMs, by = "tEC")
You can also easily visualize the fitted lines:
emmip(model, Polarity ~ tEC, cov.reduce = range)
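The point about the reference level can be checked with base R alone. The sketch below rebuilds the data from the table in the question and compares predictions at the mean of tEC (the default reference value) with predictions at tEC = 0. The helper name predict_at is made up for illustration, and predict() is used here as a stand-in for the emmeans calls.

```r
# Data copied from the table in the question
Data <- data.frame(
  RT1000   = c(596.3564, 648.2471, 770.7602, 831.2971,
               1311.3331, 1173.8942, 1113.7240, 869.3635),
  tEC      = c(0.001600, 0.002500, 0.003460, 0.004444,
               0.004444, 0.003460, 0.002500, 0.001600),
  Polarity = factor(rep(c("Negative", "Positive"), each = 4))
)

model <- lm(RT1000 ~ tEC * Polarity, data = Data)

# Predictions for both polarities at a chosen value of tEC
predict_at <- function(tEC_value) {
  newdata <- data.frame(tEC = tEC_value,
                        Polarity = factor(c("Negative", "Positive")))
  setNames(predict(model, newdata), c("Negative", "Positive"))
}

pred_mean <- predict_at(mean(Data$tEC))  # predictions at the mean of tEC
pred_zero <- predict_at(0)               # the intercepts of the two fitted lines

pred_mean
pred_zero

# The Negative intercept is lm's (Intercept); the Positive one adds PolarityPositive
cf <- coef(model)
c(Negative = unname(cf["(Intercept)"]),
  Positive = unname(cf["(Intercept)"] + cf["PolarityPositive"]))
```

The predictions at the mean of tEC should line up with the lsmean column reported in the question, while the predictions at 0 should line up with the lm coefficients.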
I have been trying to perform k-fold cross-validation in R on a data set that I have created. The link to this data is as follows:
https://drive.google.com/open?id=0B6vqHScIRbB-S0ZYZW1Ga0VMMjA
I used the following code:
library(DAAG)
six = read.csv("six.csv") #opening file
fit <- lm(Height ~ GLCM.135 + Blue + NIR, data=six) #applying a regression model
summary(fit) # show results
CVlm(data =six, m=10, form.lm = formula(Height ~ GLCM.135 + Blue + NIR )) # 10 fold cross validation
This produces the following output (Summarized version)
Sum of squares = 7.37 Mean square = 1.47 n = 5
Overall (Sum over all 5 folds)
ms
3.75
Warning message:
In CVlm(data = six, m = 10, form.lm = formula(Height ~ GLCM.135 + :
As there is >1 explanatory variable, cross-validation
predicted values for a fold are not a linear function
of corresponding overall predicted values. Lines that
are shown for the different folds are approximate
I do not understand what the ms value refers to, as I have seen different interpretations on the internet. It is my understanding that k-fold cross-validation produces an overall RMSE value for a specified model (which is what I am trying to obtain for my research).
I also don't understand why the output reads "Overall (Sum over all 5 folds)" when I have specified a 10-fold cross-validation in the code.
If anyone can help it would be much appreciated.
When I ran this same thing, I saw that it did do 10 folds, but the final output printed was the same as yours ("Sum over all 5 folds"). The "ms" is the mean squared prediction error. The value of 3.75 is not exactly a simple average across all 10 folds either (got 3.67):
msaverage <- (1.19+6.04+1.26+2.37+3.57+5.24+8.92+2.03+4.62+1.47)/10
msaverage
Notice that the average, as well as half of the individual fold values, is higher than the square of the "Residual standard error" (1.814^2 ≈ 3.29), which is the comparable in-sample quantity because ms is on the squared scale. This is what we would expect, as the CV error reflects model performance on "test" data (data not used to train the model). For instance, in fold 10 the residuals are calculated on the 5 predicted observations that were not used to train that fold's model:
fold 10
Observations in test set: 5
                12     14      26     54     56
Predicted    20.24  21.18  22.961  18.63  17.81
cvpred       20.15  21.14  22.964  18.66  17.86
Height       21.98  22.32  22.870  17.12  17.37
CV residual   1.83   1.18  -0.094  -1.54  -0.49
Sum of squares = 7.37 Mean square = 1.47 n = 5
It appears this warning we received may be common too -- also saw it in this article: http://www.rpubs.com/jmcimula/xCL1aXpM3bZ
One thing I can suggest that may be useful to you is that in the case of linear regression, there is a closed form solution for leave-one-out-cross-validation (loocv) without actually fitting multiple models.
predictedresiduals <- residuals(fit)/(1 - lm.influence(fit)$hat)
PRESS <- sum(predictedresiduals^2)
PRESS #Predicted Residual Sum of Squares Error
fitanova <- anova(fit) #Anova to get total sum of squares
tss <- sum(fitanova$"Sum Sq") #Total sum of squares
predrsquared <- 1 - PRESS/(tss)
predrsquared
Notice this value is 0.574 vs. the original Rsquared of 0.6422
To better convey the concept of RMSE, it is useful to see the distribution of the predicted residuals:
hist(predictedresiduals)
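RMSE can then be calculated as the square root of the mean squared predicted residual (sd() would centre the residuals and divide by n - 1, so it is only an approximation):
sqrt(mean(predictedresiduals^2))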
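If an overall k-fold RMSE is what is needed, it can also be computed by hand in a few lines. The following is a minimal sketch, not a reimplementation of CVlm: it uses the built-in mtcars data and an arbitrary formula as stand-ins for six.csv, and pools the squared CV residuals over all observations before taking the mean. Pooling over observations and averaging the per-fold mean squares give slightly different numbers when the folds differ in size.

```r
# Stand-in data: built-in mtcars instead of the question's six.csv
fit <- lm(mpg ~ wt + hp, data = mtcars)

set.seed(42)
k <- 10
# Randomly assign each row to one of k folds of (nearly) equal size
fold <- sample(rep(seq_len(k), length.out = nrow(mtcars)))

cv_resid <- numeric(nrow(mtcars))
for (i in seq_len(k)) {
  test_rows <- fold == i
  # Refit on the other k - 1 folds, then predict the held-out fold
  fit_i <- lm(formula(fit), data = mtcars[!test_rows, ])
  cv_resid[test_rows] <- mtcars$mpg[test_rows] -
    predict(fit_i, newdata = mtcars[test_rows, ])
}

cv_mse  <- mean(cv_resid^2)  # mean squared prediction error, pooled over all rows
cv_rmse <- sqrt(cv_mse)      # overall cross-validated RMSE
cv_rmse

# Leave-one-out version via the closed form, with no refitting
loo_resid <- residuals(fit) / (1 - lm.influence(fit)$hat)
loo_rmse  <- sqrt(mean(loo_resid^2))
loo_rmse
```

Replacing mtcars and the formula with the real data frame and model gives the corresponding values for the model in the question.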
I am working with some log-normal data, and naturally I want to demonstrate log-normal distribution results in a better overlap than other possible distributions. Essentially, I want to replicate the following graph with my data:
where the fitted density curves are juxtaposed over log(time).
The text where the linked image is from describes the process as fitting each model and obtaining the following parameters:
For that purpose, I fitted four naive survival models with the above-mentioned distributions:
survreg(Surv(time,event)~1,dist="family")
and extracted the shape parameter (α) and the coefficient (β).
I have several questions regarding the process:
1) Is this the right way of going about it? I have looked into several R packages but couldn't locate one that plots density curves as a built-in function, so I feel like I must be overlooking something obvious.
2) Are the values for the log-normal distribution (μ and σ²) just the mean and the variance of the intercept?
3) How can I create a similar table in R? (Maybe this is more of a stack overflow question) I know I can just cbind them manually, but I am more interested in calling them from the fitted models. survreg objects store the coefficient estimates, but calling survreg.obj$coefficients results a named number vector (instead of just a number).
4) Most importantly, how can I plot a similar graph? I thought it would be fairly simple if I just extracted the parameters and plotted them over the histogram, but so far no luck. The author of the text says he estimated the density curves from the parameters, but I just get a point estimate. What am I missing? Should I calculate the density curves manually for each distribution before plotting?
I am not sure how to provide a mwe in this case, but honestly I just need a general solution for adding multiple density curves to survival data. On the other hand, if you think it will help, feel free to recommend a mwe solution and I will try to produce one.
Thanks for your input!
Edit: Based on eclark's post, I have made some progress. My parameters are:
Dist = data.frame(
Exponential = rweibull(n = 10000, shape = 1, scale = 6.636684),
Weibull = rweibull(n = 10000, shape = 6.068786, scale = 2.002165),
Gamma = rgamma(n = 10000, shape = 768.1476, scale = 1433.986),
LogNormal = rlnorm(n = 10000, meanlog = 4.986, sdlog = .877)
)
However, given the massive difference in scales, this is what I get:
Going back to question number 3, is this how I should get the parameters?
Currently this is how I do it (sorry for the mess):
summary(fit.exp)
Call:
survreg(formula = Surv(duration, confterm) ~ 1, data = data.na,
dist = "exponential")
Value Std. Error z p
(Intercept) 6.64 0.052 128 0
Scale fixed at 1
Exponential distribution
Loglik(model)= -2825.6 Loglik(intercept only)= -2825.6
Number of Newton-Raphson Iterations: 6
n= 397
summary(fit.wei)
Call:
survreg(formula = Surv(duration, confterm) ~ 1, data = data.na,
dist = "weibull")
Value Std. Error z p
(Intercept) 6.069 0.1075 56.5 0.00e+00
Log(scale) 0.694 0.0411 16.9 6.99e-64
Scale= 2
Weibull distribution
Loglik(model)= -2622.2 Loglik(intercept only)= -2622.2
Number of Newton-Raphson Iterations: 6
n= 397
summary(fit.gau)
Call:
survreg(formula = Surv(duration, confterm) ~ 1, data = data.na,
dist = "gaussian")
Value Std. Error z p
(Intercept) 768.15 72.6174 10.6 3.77e-26
Log(scale) 7.27 0.0372 195.4 0.00e+00
Scale= 1434
Gaussian distribution
Loglik(model)= -3243.7 Loglik(intercept only)= -3243.7
Number of Newton-Raphson Iterations: 4
n= 397
summary(fit.log)
Call:
survreg(formula = Surv(duration, confterm) ~ 1, data = data.na,
dist = "lognormal")
Value Std. Error z p
(Intercept) 4.986 0.1216 41.0 0.00e+00
Log(scale) 0.877 0.0373 23.5 1.71e-122
Scale= 2.4
Log Normal distribution
Loglik(model)= -2624 Loglik(intercept only)= -2624
Number of Newton-Raphson Iterations: 5
n= 397
I feel like I am particularly messing up the lognormal, given that it is not the standard shape-and-coefficient tandem but the mean and variance.
Try this. The idea is to generate random variables using the random distribution functions and then plot the density functions of the simulated data. Here is an example of what you need:
require(ggplot2)
require(dplyr)
require(tidyr)
SampleData <- data.frame(Duration=rlnorm(n = 184,meanlog = 2.859,sdlog = .246)) #Assume this is data we have sampled from a lognormal distribution
#Then we estimate the parameters of different types of distributions for that sample data and come up with these parameters
#We then generate a data frame from those distributions and parameters
Dist = data.frame(
Weibull = rweibull(10000,shape = 1.995,scale = 22.386),
Gamma = rgamma(n = 10000,shape = 4.203,scale = 4.699),
LogNormal = rlnorm(n = 10000,meanlog = 2.859,sdlog = .246)
)
#We use gather to prepare the distribution data in a manner better suited for group plotting in ggplot2
Dist <- Dist %>% gather(Distribution,Duration)
#Create the plot that shows the sample data as a histogram
G1 <- ggplot(SampleData,aes(x=Duration)) + geom_histogram(aes(y=..density..),binwidth=5, colour="black", fill="white")
#Add the density distributions of the different distributions with the estimated parameters
G2 <- G1 + geom_density(aes(x=Duration,color=Distribution),data=Dist)
plot(G2)
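As an alternative to simulating draws, the density curves can be evaluated directly once the survreg estimates are converted to the parameterisation of R's density functions. The sketch below uses the estimates printed in the question and assumes the standard survreg conventions: the intercept is on the log-time scale, Scale = exp(Log(scale)), the Weibull shape is 1/Scale, and for the log-normal the intercept is meanlog and Scale is sdlog. The function name plot_fitted_densities is made up for illustration, and the histogram ignores censoring.

```r
# Estimates copied from the survreg summaries in the question
exp_int <- 6.64                           # exponential: (Intercept)
wei_int <- 6.069; wei_logscale <- 0.694   # Weibull: (Intercept), Log(scale)
ln_int  <- 4.986; ln_logscale  <- 0.877   # log-normal: (Intercept), Log(scale)

# Convert to the parameterisation of R's d*/r* functions
exp_rate  <- 1 / exp(exp_int)        # dexp(rate)
wei_shape <- 1 / exp(wei_logscale)   # dweibull(shape) = 1 / survreg Scale
wei_scale <- exp(wei_int)            # dweibull(scale) = exp(intercept)
ln_mean   <- ln_int                  # dlnorm(meanlog) = intercept
ln_sd     <- exp(ln_logscale)        # dlnorm(sdlog)   = survreg Scale, not Log(scale)

# Draw the three fitted densities over a histogram of the observed durations
plot_fitted_densities <- function(duration) {
  hist(duration, freq = FALSE, breaks = 30, main = "", xlab = "Duration")
  curve(dexp(x, rate = exp_rate), add = TRUE, lty = 1)
  curve(dweibull(x, shape = wei_shape, scale = wei_scale), add = TRUE, lty = 2)
  curve(dlnorm(x, meanlog = ln_mean, sdlog = ln_sd), add = TRUE, lty = 3)
  legend("topright", lty = 1:3,
         legend = c("Exponential", "Weibull", "Log-normal"))
}
```

With the question's data this would be called as plot_fitted_densities(data.na$duration). Note that the "Gamma" line in the question's edit appears to reuse the estimates of the Gaussian fit; for that fit the intercept and Scale map directly to the mean and sd of dnorm.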
I have made a model that looks at a number of variables and the effect that has on pregnancy outcome. The outcome is a grouped binary. A mob of animals will have 34 pregnant and 3 empty, the next will have 20 pregnant and 4 empty and so on.
I have modelled this data using the glmer function where y is the pregnancy outcome (pregnant or empty).
mclus5 <- glmer(y~adg + breed + bw_start + year + (1|farm),
data=dat, family=binomial)
I get all the usual output with coefficients etc. but for interpretation I would like to transform this into odds ratios and confidence intervals for each of the coefficients.
In past logistic regression models I have used the following code
round(exp(cbind(OR=coef(mclus5),confint(mclus5))),3)
This would very nicely provide what I want, but it does not seem to work with the model I have run.
Does anyone know a way that I can get this output for my model through R?
The only real difference is that you have to use fixef() rather than coef() to extract the fixed-effect coefficients (coef() gives you the estimated coefficients for each group).
I'll illustrate with a built-in example from the lme4 package.
library("lme4")
gm1 <- glmer(cbind(incidence, size - incidence) ~ period + (1 | herd),
data = cbpp, family = binomial)
Fixed-effect coefficients and confidence intervals, log-odds scale:
cc <- confint(gm1,parm="beta_") ## slow (~ 11 seconds)
ctab <- cbind(est=fixef(gm1),cc)
(If you want faster-but-less-accurate Wald confidence intervals you can use confint(gm1,parm="beta_",method="Wald") instead; this will be equivalent to @Gorka's answer but marginally more convenient.)
Exponentiate to get odds ratios:
rtab <- exp(ctab)
print(rtab,digits=3)
## est 2.5 % 97.5 %
## (Intercept) 0.247 0.149 0.388
## period2 0.371 0.199 0.665
## period3 0.324 0.165 0.600
## period4 0.206 0.082 0.449
A marginally simpler/more general solution:
library(broom.mixed)
tidy(gm1,conf.int=TRUE,exponentiate=TRUE,effects="fixed")
for Wald intervals, or add conf.method="profile" for profile confidence intervals.
I believe there is another, much faster way (if you are OK with a less accurate result).
From: http://www.ats.ucla.edu/stat/r/dae/melogit.htm
First we get the confidence intervals for the Estimates
se <- sqrt(diag(vcov(mclus5)))
# table of estimates with 95% CI
tab <- cbind(Est = fixef(mclus5), LL = fixef(mclus5) - 1.96 * se, UL = fixef(mclus5) + 1.96 * se)
Then the odds ratios with 95% CI
print(exp(tab), digits=3)
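The Wald calculation above can be wrapped in a small helper that works on any vector of estimates and standard errors. This is a sketch only: the name wald_or_table is made up, and an ordinary glm on built-in data is used as a stand-in so the example runs without lme4. For the model in the question the call would be wald_or_table(fixef(mclus5), sqrt(diag(vcov(mclus5)))).

```r
# Turn estimates and standard errors into odds ratios with Wald intervals
wald_or_table <- function(est, se, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)   # about 1.96 for a 95% interval
  exp(cbind(OR = est, LL = est - z * se, UL = est + z * se))
}

# Stand-in model: an ordinary logistic regression on built-in data
fit_glm <- glm(am ~ wt, data = mtcars, family = binomial)
or_tab  <- wald_or_table(coef(fit_glm), sqrt(diag(vcov(fit_glm))))
print(or_tab, digits = 3)

# Same limits as exponentiating R's own Wald interval for a glm
all.equal(unname(or_tab[, c("LL", "UL")]),
          unname(exp(confint.default(fit_glm))))
```

Exponentiating the limits, rather than building an interval around the odds ratio itself, keeps the interval positive and matches what the code in this answer does.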
Another option, I believe, is to use the emmeans package (here fit is the fitted model and factor_name is one of its factors):
library(emmeans)
data.frame(confint(pairs(emmeans(fit, ~ factor_name,type="response"))))
I am testing differences on the number of pollen grains loading on plant stigmas in different habitats and stigma types.
My sample design comprises two habitats, with 10 sites each habitat.
In each site, I have up to 3 stigma types (wet, dry and semidry), and for each stigma type I have a different number of plant species, with a different number of individuals per plant species (code).
So, I ended up with nested design as follow: habitat/site/stigmatype/stigmaspecies/code
As it is a descriptive study, stigmatype, stigmaspecies and code vary between sites.
My response variable (n) is the number of pollen grains (log10+1) per stigma per plant, averaged because I collected 3 stigmas per plant.
The data do not fit a Poisson distribution because (i) the values are not integers, and (ii) the variance is much higher than the mean (ratio = 911.0756). So I fitted the model with negative.binomial.
After model selection, I have:
m4a <- glmer(n ~ habitat*stigmatype + (1|stigmaspecies/code),
family=negative.binomial(2))
> summary(m4a)
Generalized linear mixed model fit by maximum likelihood ['glmerMod']
Family: Negative Binomial(2) ( log )
Formula: n ~ habitat * stigmatype + (1 | stigmaspecies/code)
AIC BIC logLik deviance
993.9713 1030.6079 -487.9856 975.9713
Random effects:
Groups Name Variance Std.Dev.
code:stigmaspecies (Intercept) 1.034e-12 1.017e-06
stigmaspecies (Intercept) 4.144e-02 2.036e-01
Residual 2.515e-01 5.015e-01
Number of obs: 433, groups: code:stigmaspecies, 433; stigmaspecies, 41
Fixed effects:
Estimate Std. Error t value Pr(>|z|)
(Intercept) -0.31641 0.08896 -3.557 0.000375 ***
habitatnon-invaded -0.67714 0.10060 -6.731 1.68e-11 ***
stigmatypesemidry -0.24193 0.15975 -1.514 0.129905
stigmatypewet -0.07195 0.18665 -0.385 0.699885
habitatnon-invaded:stigmatypesemidry 0.60479 0.22310 2.711 0.006712 **
habitatnon-invaded:stigmatypewet 0.16653 0.34119 0.488 0.625491
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) hbttn- stgmtyps stgmtypw hbttnn-nvdd:stgmtyps
hbttnn-nvdd -0.335
stgmtypsmdr -0.557 0.186
stigmatypwt -0.477 0.160 0.265
hbttnn-nvdd:stgmtyps 0.151 -0.451 -0.458 -0.072
hbttnn-nvdd:stgmtypw 0.099 -0.295 -0.055 -0.403 0.133
Two questions:
How do I check for overdispersion from this output?
What is the best way to go through model validation here?
I have been using:
qqnorm(resid(m4a))
hist(resid(m4a))
plot(fitted(m4a),resid(m4a))
While qqnorm() and hist() seem OK, there is a tendency towards heteroscedasticity in the 3rd graph. And here is my final question:
Can I do model validation with this graph in glmer, or is there a better way to do it? If not, how much should I worry about the 3rd graph?
A simple way to check for overdispersion in glmer is:
library("blmeco")
dispersion_glmer(your_model) # it shouldn't be over 1.4
To solve overdispersion I usually add an observation-level random factor.
For model validation I usually start from these plots, but then it depends on your specific model:
par(mfrow=c(2,2))
qqnorm(resid(your_model), main="normal qq-plot, residuals")
qqline(resid(your_model))
qqnorm(ranef(your_model)$id[,1])
qqline(ranef(your_model)$id[,1])
plot(fitted(your_model), resid(your_model)) #residuals vs fitted
abline(h=0)
your_data$fitted <- fitted(your_model) # fitted vs observed, assuming total is the observed response
plot(your_data$fitted, jitter(your_data$total,0.1))
abline(0,1)
hope this helps a little....
cheers
Just an addition to Q1 for those who might find this by googling: the blmeco dispersion_glmer function appears to be outdated. It is better to use @Ben_Bolker's function for this purpose:
overdisp_fun <- function(model) {
rdf <- df.residual(model)
rp <- residuals(model,type="pearson")
Pearson.chisq <- sum(rp^2)
prat <- Pearson.chisq/rdf
pval <- pchisq(Pearson.chisq, df=rdf, lower.tail=FALSE)
c(chisq=Pearson.chisq,ratio=prat,rdf=rdf,p=pval)
}
Source: https://bbolker.github.io/mixedmodels-misc/glmmFAQ.html#overdispersion.
With the highlighted note:
Do PLEASE note the usual, and extra, caveats noted here: this is an APPROXIMATE estimate of an overdispersion parameter.
PS. Why outdated?
The lme4 package includes a residuals function these days, and Pearson residuals are supposedly more robust for this type of calculation than deviance residuals. blmeco::dispersion_glmer sums the squared deviance residuals together with the squared values in the model's u slot, divides by the number of observations and takes the square root (the function):
dispersion_glmer <- function (modelglmer)
{
n <- length(resid(modelglmer))
return(sqrt(sum(c(resid(modelglmer), modelglmer@u)^2)/n))
}
The blmeco solution gives considerably higher deviance/df ratios than Bolker's function. Since Ben is one of the authors of the lme4 package, I would trust his solution more although I am not qualified to rationalize the statistical reason.
x <- InsectSprays
x$id <- rownames(x)
mod <- lme4::glmer(count ~ spray + (1|id), data = x, family = poisson)
blmeco::dispersion_glmer(mod)
# [1] 1.012649
overdisp_fun(mod)
# chisq ratio rdf p
# 55.7160734 0.8571704 65.0000000 0.7873823
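To get a feel for how the Pearson-based ratio behaves, it can be tried on simulated counts where the answer is known. The sketch below is an illustration under stated assumptions: it uses plain glm fits instead of glmer so that it runs with base R only, pearson_ratio is a made-up name that repeats the ratio part of overdisp_fun, and the simulation settings are arbitrary.

```r
# Pearson-based dispersion ratio, the same calculation as the ratio in overdisp_fun
pearson_ratio <- function(model) {
  sum(residuals(model, type = "pearson")^2) / df.residual(model)
}

set.seed(123)
n  <- 500
x  <- rnorm(n)
mu <- exp(0.5 + 0.3 * x)

y_pois <- rpois(n, lambda = mu)            # equidispersed counts
y_nb   <- rnbinom(n, mu = mu, size = 0.5)  # overdispersed counts

ratio_pois <- pearson_ratio(glm(y_pois ~ x, family = poisson))
ratio_nb   <- pearson_ratio(glm(y_nb ~ x, family = poisson))

ratio_pois  # expected to be close to 1
ratio_nb    # expected to be well above 1
```

A ratio near 1 is what a well-specified Poisson model should produce, so values far above 1 on real data point towards overdispersion, subject to the caveats quoted above.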