R survival package - survreg reports singularities when using big intervals

I came across an issue when trying to do an interval regression using the survival package. The goal was merely to estimate the constant (intercept) from interval-censored data.
> library(survival)
> y = with(data, Surv(lb, ub, type = "interval2"))
> model_y = survreg(y ~ 1, dist = "gaussian")
> model_y
Call:
survreg(formula = y ~ 1, dist = "gaussian")
Coefficients: (1 not defined because of singularities)
(Intercept)
NA
Scale= 256178.4
Loglik(model)= -6186.5 Loglik(intercept only)= -6186.5
n=2556 (132 observations deleted due to missingness)
For the intercept NA is reported. Additionally, the remark "(1 not defined because of singularities)" is displayed.
However, once I scale down the bounds of the intervals by a constant factor, proper results are obtained (any integer divisor greater than 3 works):
> library(survival)
> y = with(data, Surv(lb/4, ub/4, type = "interval2"))
> model_y = survreg(y ~ 1, dist = "gaussian")
> model_y
Call:
survreg(formula = y ~ 1, dist = "gaussian")
Coefficients:
(Intercept)
567.184
Scale= 64000.89
Loglik(model)= -6186 Loglik(intercept only)= -6186
n=2556 (132 observations deleted due to missingness)
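If the rescaled fit is used as a workaround, the original-scale estimates can be recovered by undoing the division (a quick sketch, assuming the divisor of 4 used above):
# Back-transform the Gaussian location/scale estimates to the original scale
k <- 4
coef(model_y) * k    # intercept on the original scale (~ 2268.7)
model_y$scale * k    # scale on the original scale (~ 256003.6)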
Does anyone have any ideas on why that is?
Data
Kind regards

Related

Warning: glm.fit: algorithm did not converge [duplicate]

(Marked as a duplicate of: Why am I getting "algorithm did not converge" and "fitted prob numerically 0 or 1" warnings with glm?)
banking_data
library(MASS)
library(randomForest)
Bdata <- read.csv("banking_data.csv")
head(Bdata)
Bdata$y <- ifelse(Bdata$y == "y", 1, 0)
intercept_model<- glm(y~1,family = binomial("logit"),data=Bdata)
summary(intercept_model)
(exp(intercept_model$coefficients[1]))/(1+exp(intercept_model$coefficients[1]))
I tried to run this code, but it shows the warning:
glm.fit: algorithm did not converge
The output is:
Call:
glm(formula = y ~ 1, family = binomial("logit"), data = Bdata)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.409e-06 -2.409e-06 -2.409e-06 -2.409e-06 -2.409e-06
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -26.57 1754.75 -0.015 0.988
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 0.0000e+00 on 41187 degrees of freedom
Residual deviance: 2.3896e-07 on 41187 degrees of freedom
AIC: 2
Number of Fisher Scoring iterations: 25
Without more information on your data it will be hard to help you.
For general advice, see this existing post.
It could be because both classes are not present in your y data. For instance:
y <- c(rep(1, 1000))
df <- data.frame(y = y)
reg <- glm(y ~ 1, data = df, family = binomial("logit"))
reg
This will produce the same warning. Did you check the balance of y with table(df$y)?
Another option is to increase the maximum number of iterations:
reg <- glm(y ~ 1, data = df, family = binomial("logit"), maxit = 100)
reg
However, this will not help if your dataset contains only one class (e.g., only "no" responses).
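Applied to the asker's data, the check would be (a sketch; Bdata and the y recoding come from the question, and the remark about label values is only a guess):
# Inspect the recoded outcome; if only one value appears, the
# intercept-only logit cannot converge properly
table(Bdata$y)
# If the raw labels were e.g. "yes"/"no" rather than "y"/"n", then
# ifelse(Bdata$y == "y", 1, 0) would produce only zeros, which would be
# consistent with the near-zero deviance and huge standard error above.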

Calculate α and β in Probit Model in R

I am facing the following issue: I want to calculate α and β for the following probit model in R, which is defined as:
Probability = F(α + β sprd )
where sprd denotes the explanatory variable, α and β are constants, F is the cumulative normal distribution function.
I can calculate probabilities for the entire dataset, the coefficients (see code below), etc., but I do not know how to get the constants α and β.
The purpose is to determine, in Excel, the spread that corresponds to a certain probability, e.g. which spread corresponds to 50%.
Thank you in advance!
Probit model coefficients
probit<- glm(Y ~ X, family=binomial (link="probit"))
summary(probit)
Call:
glm(formula = Y ~ X, family = binomial(link = "probit"))
Deviance Residuals:
Min 1Q Median 3Q Max
-1.4614 -0.6470 -0.3915 -0.2168 2.5730
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.3566755 0.0883634 -4.036 5.43e-05 ***
X -0.0058377 0.0007064 -8.264 < 2e-16 ***
From the help("glm") page you can see that the object returns a value named coefficients.
An object of class "glm" is a list containing at least the following
components:
coefficients a named vector of coefficients
So after you call glm() that object will be a list, and you can access each element using $name_element.
Reproducible example (not a probit model, but the mechanism is the same):
counts <- c(18,17,15,20,10,20,25,13,12)
outcome <- gl(3,1,9)
treatment <- gl(3,3)
d.AD <- data.frame(treatment, outcome, counts)
# fit model
glm.D93 <- glm(counts ~ outcome + treatment, family = poisson())
Now glm.D93$coefficients will print the vector with all the coefficients:
glm.D93$coefficients
# (Intercept) outcome2 outcome3 treatment2 treatment3
#3.044522e+00 -4.542553e-01 -2.929871e-01 1.337909e-15 1.421085e-15
You can assign that and access each individually:
coef <- glm.D93$coefficients
coef[1] # your alpha
#(Intercept)
# 3.044522
coef[2] # your beta
# outcome2
#-0.4542553
I've seen in your deleted post that you are not convinced by @RLave's answer. Here are some simulations to convince you:
# (large) sample size
n <- 10000
# covariate
x <- (1:n)/n
# parameters
alpha <- -1
beta <- 1
# simulated data
set.seed(666)
y <- rbinom(n, 1, prob = pnorm(alpha + beta*x))
# fit the probit model
probit <- glm(y ~ x, family = binomial(link="probit"))
# get estimated parameters - very close to the true parameters -1 and 1
coef(probit)
# (Intercept) x
# -1.004236 1.029523
The estimated parameters are given by coef(probit), or probit$coefficients.
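To address the original goal, the fitted probit can be inverted to find the spread that corresponds to a target probability (a sketch using the simulated probit fit above; with the true parameters α = -1 and β = 1, the exact answer for p = 0.5 would be 1):
# Solve pnorm(alpha + beta * sprd) = p for sprd:
#   sprd = (qnorm(p) - alpha) / beta
p <- 0.5
a <- coef(probit)["(Intercept)"]
b <- coef(probit)["x"]
(qnorm(p) - a) / b    # roughly 0.975 with the estimates above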

Logistic Unit Fixed Effect Model in R

I'm trying to estimate a logistic unit fixed effects model for panel data using R. My dependent variable is binary and measured daily over two years for 13 locations.
The goal of this model is to predict the value of y for a particular day and location based on x.
zero <- seq(from=0, to=1, by=1)
ids = dplyr::data_frame(location=seq(from=1, to=13, by=1))
dates = dplyr::data_frame(date = seq(as.Date("2015-01-01"), as.Date("2016-12-31"), by="days"))
data = merge(dates, ids)
data$y <- sample(zero, size=9503, replace=TRUE)
data$x <- sample(zero, size=9503, replace=TRUE)
While surveying the available packages, I've come across a number of ways to (apparently) do this, but I'm not confident I've understood the differences between the packages and approaches.
From what I have read so far, glm(), survival::clogit() and pglm::pglm() can be used to do this, but I'm wondering if there are substantial differences between the packages and what those might be.
Here are the calls I've used:
fixed <- glm(y ~ x + factor(location), data=data)
fixed <- clogit(y ~ x + strata(location), data=data)
One of the reasons for this uncertainty is the error I get when using pglm (also see this question), namely that pglm can't use the "within" model:
fixed <- pglm(y ~ x, data=data, index=c("location", "date"), model="within", family=binomial("logit"))
What distinguishes the "within" model of pglm from the approaches in glm() and clogit() and which of the three would be the correct one to take here when trying to predict y for a given date and unit?
I don't see that you have defined a proper hypothesis to test within the context of what you are calling "panel data", but as far as getting glm to give estimates for logistic coefficients within strata, it can be accomplished by adding family="binomial" and stratifying by your grouping variable (called "unit" here, corresponding to "location" in your data):
> fixed <- glm(y ~ x + strata(unit), data=data, family="binomial")
> fixed
Call: glm(formula = y ~ x + strata(unit), family = "binomial", data = data)
Coefficients:
(Intercept) x strata(unit)unit=2 strata(unit)unit=3
0.10287 -0.05910 -0.08302 -0.03020
strata(unit)unit=4 strata(unit)unit=5 strata(unit)unit=6 strata(unit)unit=7
-0.06876 -0.05042 -0.10200 -0.09871
strata(unit)unit=8 strata(unit)unit=9 strata(unit)unit=10 strata(unit)unit=11
-0.09702 0.02742 -0.13246 -0.04816
strata(unit)unit=12 strata(unit)unit=13
-0.11449 -0.16986
Degrees of Freedom: 9502 Total (i.e. Null); 9489 Residual
Null Deviance: 13170
Residual Deviance: 13170 AIC: 13190
That will not take into account any date-ordering, which is what I would have expected to be the interest. But as I said above, there doesn't yet appear to be a hypothesis that is premised on any sequential ordering.
The following creates a fixed effects model that includes a spline relationship of date to the probability of a y-event. I chose to center the date rather than leaving it as a very large integer:
library(splines)
fixed <- glm(y ~ x + ns(scale(date),3) + factor(unit), data=data, family="binomial")
fixed
#----------------------
Call: glm(formula = y ~ x + ns(scale(date), 3) + factor(unit), family = "binomial",
data = data)
Coefficients:
(Intercept) x ns(scale(date), 3)1 ns(scale(date), 3)2
0.13389 -0.05904 0.04431 -0.10727
ns(scale(date), 3)3 factor(unit)2 factor(unit)3 factor(unit)4
-0.03224 -0.08302 -0.03020 -0.06877
factor(unit)5 factor(unit)6 factor(unit)7 factor(unit)8
-0.05042 -0.10201 -0.09872 -0.09702
factor(unit)9 factor(unit)10 factor(unit)11 factor(unit)12
0.02742 -0.13246 -0.04816 -0.11450
factor(unit)13
-0.16987
Degrees of Freedom: 9502 Total (i.e. Null); 9486 Residual
Null Deviance: 13170
Residual Deviance: 13160 AIC: 13200
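Coming back to the stated goal of predicting y for a particular day and location, a minimal sketch of a plain fixed-effects logit on the simulated data from the question (using factor(location) rather than strata(), so that predict() can be used directly on new observations):
# Fixed-effects logit with location dummies, fit on the simulated 'data' above
fe <- glm(y ~ x + factor(location), data = data, family = binomial("logit"))
# Predicted probability of y = 1 for a chosen location and value of x
newobs <- data.frame(x = 1, location = 7)
predict(fe, newdata = newobs, type = "response")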

test whether coefficients in quantile regression model differ from each other significantly

I have a quantile regression model, where I am interested in estimating effects for the .25, .5, and .875 quantiles. The coefficients in my model differ from each other in a way that is in line with the substantive theory underlying my model.
The next step is to test whether the coefficient of a particular explanatory variable for one quantile differs significantly from the estimated coefficient for another quantile. How do I test that?
Further, I also want to test whether the coefficient for that variable at a given quantile differs significantly from the estimate in an OLS model. How do I do that?
I am interested in any answer, although I would prefer an answer that involves R.
Here's some test code: (NOTE: this is not my actual model or data, but is an easy example as the data is available in the R installation)
data(airquality)
library(quantreg)
summary(rq(Ozone ~ Solar.R + Wind + Temp, tau = c(.25, .5, .75), data = airquality, method = "br"), se = "nid")
tau: [1] 0.25
Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) -69.92874 12.18362 -5.73957 0.00000
Solar.R 0.06220 0.00917 6.77995 0.00000
Wind -2.63528 0.59364 -4.43918 0.00002
Temp 1.43521 0.14363 9.99260 0.00000
Call: rq(formula = Ozone ~ Solar.R + Wind + Temp, tau = c(0.25, 0.5,
0.75), data = airquality, method = "br")
tau: [1] 0.5
Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) -75.60305 23.27658 -3.24803 0.00155
Solar.R 0.03354 0.02301 1.45806 0.14775
Wind -3.08913 0.68670 -4.49853 0.00002
Temp 1.78244 0.26067 6.83793 0.00000
Call: rq(formula = Ozone ~ Solar.R + Wind + Temp, tau = c(0.25, 0.5,
0.75), data = airquality, method = "br")
tau: [1] 0.75
Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) -91.56585 41.86552 -2.18714 0.03091
Solar.R 0.03945 0.04217 0.93556 0.35161
Wind -2.95452 1.17821 -2.50764 0.01366
Temp 2.11604 0.45693 4.63103 0.00001
and the OLS model:
summary(lm(Ozone ~ Solar.R + Wind + Temp, data = airquality))
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -64.34208 23.05472 -2.791 0.00623 **
Solar.R 0.05982 0.02319 2.580 0.01124 *
Wind -3.33359 0.65441 -5.094 1.52e-06 ***
Temp 1.65209 0.25353 6.516 2.42e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 21.18 on 107 degrees of freedom
(42 observations deleted due to missingness)
Multiple R-squared: 0.6059, Adjusted R-squared: 0.5948
F-statistic: 54.83 on 3 and 107 DF, p-value: < 2.2e-16
(don't worry about the actual model estimated above, this is just for illustrative purposes)
How do I now test whether, for example, the coefficient for Temp differs statistically significantly (at some given level) between the .25 and .75 quantiles, and whether the coefficient for Temp at the .25 quantile differs significantly from the OLS coefficient for Temp?
Answers in R are welcome, as are those that focus on the statistical approach.
For quantile regression, I generally prefer visual inspection.
data(airquality)
library(quantreg)
q <- rq(Ozone ~ Solar.R + Wind + Temp, tau = 1:9/10, data = airquality)
plot(summary(q, se = "nid"), level = 0.95)
The dotted red lines are the 95% confidence interval for linear regression, and the shaded grey area is the 95% confidence interval for each of the quantreg estimates.
In the resulting plot, we can see that the quantreg estimates lie within the bounds of the linear regression estimates, which suggests that there may not be a statistically significant difference.
You can further test whether the differences in the coefficients are statistically significant using anova. See ?anova.rq for more details.
q50 <- rq(Ozone ~ Solar.R + Wind + Temp, tau = 0.5, data = airquality)
q90 <- rq(Ozone ~ Solar.R + Wind + Temp, tau = 0.9, data = airquality)
anova(q50, q90)
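Applied to the comparison asked about in the question (the .25 versus .75 quantile for Temp), this might look as follows; note that the joint argument of anova.rq, when set to FALSE, should report coefficient-wise tests rather than a single joint test of all slopes (a sketch, check ?anova.rq for your quantreg version):
# Fit the two quantiles of interest separately
q25 <- rq(Ozone ~ Solar.R + Wind + Temp, tau = 0.25, data = airquality)
q75 <- rq(Ozone ~ Solar.R + Wind + Temp, tau = 0.75, data = airquality)
# Wald test that the slope coefficients are equal across the two quantiles
anova(q25, q75)
# Coefficient-by-coefficient tests (including Temp) instead of one joint test
anova(q25, q75, joint = FALSE)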

lmer linear contrasts: Kenward-Roger or Satterthwaite DF and SE

In R, I am searching for a way to estimate confidence intervals for linear contrasts in lmer models that use either Kenward-Roger or Satterthwaite degrees of freedom and SEs.
For example, I can compute a CI for a fixed effect parameter in a mixed model (as in SAS) using the t-value (with df from KR) and the SE:
mod<-lmerTest::lmer(y~time1+treatment+time1:treatment+(1|PersonID),data=data)
lmerTest::summary(mod,ddf = "Kenward-Roger")
This gives the following output:
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 49.0768 1.0435 56.4700 47.029 < 2e-16 ***
time1 5.8224 0.5963 48.0000 9.764 5.51e-13 ***
treatment 1.6819 1.4758 56.4700 1.140 0.2592
time1:treatment 2.0425 0.8433 48.0000 2.422 0.0193 *
This allows a CI for time1 to be computed like:
5.8224+abs(qt(0.05/2, 48))*0.5963 #7.021342
5.8224-abs(qt(0.05/2, 48))*0.5963 #4.623458
I would like to do the same thing for a linear contrast of the fixed coefficients. The call below gives the p-value, but there is no SE output:
pbkrtest::KRmodcomp(mod,matrix(c(0,0,1,0),nrow = 1))
stat ndf ddf F.scaling p.value
Ftest 1.2989 1.0000 56.4670 1 0.2592
Is there any way to get an SE or a CI for lmer linear contrasts that uses this type of df?
For this, you have at least two options: using the lsmeans package, or doing it manually (using the functions vcovAdj.lmerMod and pbkrtest::get_Lb_ddf). Personally, I go with the latter if the contrast to be tested is not very "simple", because I find the syntax in lsmeans a bit complicated.
To exemplify, take the following model:
library(pbkrtest)
library(lme4)
library(nlme) # for the 'Orthodont' data
# 'age' is a numeric variable, while 'Sex' and 'Subject' are factors
model <- lmer(distance ~ age : Sex + (1 | Subject), data = Orthodont)
Linear mixed model fit by REML ['lmerMod']
Formula: distance ~ age:Sex + (1 | Subject)
…
Fixed Effects:
(Intercept) age:SexMale age:SexFemale
16.7611 0.7555 0.5215
from which we would like to obtain stats on the difference between the coefficients for age in males and females (i.e., age:SexMale - age:SexFemale).
Using lsmeans:
library(lsmeans)
# Evaluate the contrast at a value of 'age' set to 1,
# so that the resulting value is equal to the regression coefficient
lsm = lsmeans(model, pairwise ~ age : Sex, at = list(age = 1))$contrast
produces:
contrast estimate SE df t.ratio p.value
1,Male - 1,Female 0.2340135 0.06113276 42.64 3.828 0.0004
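Since the question ultimately asks for a confidence interval, the lsmeans route can presumably be completed with confint() on the contrast object (a sketch, assuming the lsm object defined above; lsmeans uses Kenward-Roger df for lmer fits when pbkrtest is installed):
# 95% confidence interval for the age:SexMale - age:SexFemale contrast
confint(lsm, level = 0.95)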
Alternatively, doing the calculation manually:
# Specify the contrasts: age:SexMale - age:SexFemale
# Must have the same order as the fixed effects in the model
K = c("(Intercept)" = 0, "age:SexMale" = 1, "age:SexFemale" = -1)
# Retrieve the adjusted variance-covariance matrix, to calculate the SE
V = pbkrtest::vcovAdj.lmerMod(model, 0)
# Point estimate, SE and df
point_est = sum(K * fixef(model))
SE = sqrt(sum(K * (V %*% K)))
df = pbkrtest::get_Lb_ddf(model, K)
alpha = 0.05 # significance level
# Calculate confidence interval for the difference between the 'age' coefficients for males and females
Delta_age_CI = point_est + SE * qt(c(0.5 * alpha, 1 - 0.5 * alpha), df)
will result in a point estimate equal to 0.2340135, an SE of 0.06113276, df of 42.63844, and a confidence interval of [0.1106973, 0.3573297].
