I am trying to fit a broken-stick model to longitudinal data. I can't make the data fully reproducible here, but here are the first six observations:
pid arm pday pardens logpd
1 MOB2004_002 SP 0 40973 10.621
2 MOB2004_002 SP 1 91404 11.423
3 MOB2004_002 SP 2 14342 9.571
4 MOB2004_002 SP 3 0 0.000
5 MOB2004_002 SP 7 0 0.000
6 MOB2004_003 SP/ART 0 11428 9.344
and a plot (not shown) of the mean of 'logpd' at each day. The means are clearly non-linear, but I am specifically asked to use a broken stick, assuming the lines are linear.
The breakpoint is set at day 2. I have found two ways to code this. The first consists of creating new basis variables, as in segmented linear regression:
library(lmerTest)  # lmer() with the Satterthwaite df and p-values shown below

bp <- 2
b1 <- function(x, bp) ifelse(x < bp, bp - x, 0)  # distance below the breakpoint
b2 <- function(x, bp) ifelse(x < bp, 0, x - bp)  # distance above the breakpoint

# Mixed-effects model with breakpoint = 2
(mod <- lmer(logpd ~ arm * b1(pday, bp) * b2(pday, bp) +
               (b1(pday, bp) + b2(pday, bp) | pid), data = q1))
summary(mod)
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 2.0071 0.1782 291.5626 11.26 < 2e-16 ***
armSP/ART -2.0949 0.2520 289.3692 -8.31 3.7e-15 ***
b1(pday, bp) 3.3350 0.1103 305.7435 30.24 < 2e-16 ***
b2(pday, bp) -0.3713 0.0488 438.5265 -7.61 1.7e-13 ***
armSP/ART:b1(pday, bp) 0.3178 0.1562 303.0042 2.03 0.043 *
armSP/ART:b2(pday, bp) 0.3972 0.0691 437.3673 5.75 1.7e-08 ***
The second uses indicator variables:

mod.1 <- lmer(logpd ~ arm * I(pday * (pday <= 2)) * I(pday * (pday > 2)) +
                (1 | pid), data = q1)
summary(mod.1)
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 7.4796 0.2001 1051.4035 37.37 < 2e-16 ***
armSP/ART -1.4758 0.2845 1048.5079 -5.19 2.5e-07 ***
I(pday * (pday <= 2)) -2.4792 0.1469 1142.4166 -16.88 < 2e-16 ***
I(pday * (pday > 2)) -1.1930 0.0429 1151.6921 -27.78 < 2e-16 ***
armSP/ART:I(pday * (pday <= 2)) -0.5121 0.2076 1138.0203 -2.47 0.0138 *
armSP/ART:I(pday * (pday > 2)) 0.1609 0.0610 1148.1222 2.64 0.0084 **
The results are extremely different, and the second model seems correct based on the plot. Why would they be so different?
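For what it's worth, the two parameterizations encode time very differently; you can see this by tabulating the basis columns (a quick sketch over the observed days, with bp = 2):

pday <- c(0, 1, 2, 3, 7)
bp   <- 2
data.frame(pday,
           b1 = ifelse(pday < bp, bp - pday, 0),  # counts DOWN to the breakpoint
           b2 = ifelse(pday < bp, 0, pday - bp),  # days past the breakpoint
           i1 = pday * (pday <= bp),              # pday itself, zeroed after day 2
           i2 = pday * (pday > bp))               # pday itself, zeroed before day 3

b1/b2 form a hinge at day 2 (b1's coefficient is the negative of the pre-break slope), while i1/i2 are just pday switched on and off: i2 jumps from 0 straight to 3 at day 3, so its coefficient is not a slope measured from the breakpoint. The two models also differ in their random effects ((b1 + b2 | pid) vs. (1 | pid)), so the fixed effects are not comparable on that count either.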
My model has two IVs: t and story_type. The levels of t are t1, t2, t3, and the levels of story_type are (for simplicity) A, B, C, D. Previously everything worked as usual: I would see story_typeA, story_typeB, story_typeC in my model summary. For some reason, the summary (as of my last refresh) now shows numeric markers (story_type1, story_type2, story_type3) rather than the actual level labels:
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.2047159 0.0175622 239.419 < 2e-16 ***
t1 0.0001681 0.0464882 0.004 0.997
t2 -0.2313327 0.0392468 -5.894 3.76e-09 ***
story_type1 -0.0934701 0.0034883 -26.795 < 2e-16 ***
story_type2 -0.1252931 0.0035278 -35.516 < 2e-16 ***
story_type3 0.2304953 0.0031908 72.238 < 2e-16 ***
I tried converting story_type to a factor (it was originally a character vector), but this didn't help. I've also carefully run through all of the preceding code several times to check whether something had been changed accidentally, also to no avail.
Does anyone have any idea why this is happening, and how I can get my level names back? E.g., I'd like a summary that looks as follows:
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.2047159 0.0175622 239.419 < 2e-16 ***
t1 0.0001681 0.0464882 0.004 0.997
t2 -0.2313327 0.0392468 -5.894 3.76e-09 ***
story_typeA -0.0934701 0.0034883 -26.795 < 2e-16 ***
story_typeB -0.1252931 0.0035278 -35.516 < 2e-16 ***
story_typeC 0.2304953 0.0031908 72.238 < 2e-16 ***
EDIT:
So I spun up a toy dataset and am getting the same problem:
set.seed(16)
scores <- rnorm(n = 20, mean = 0, sd = 1)
type <- rep(LETTERS[1:4], each = 5)
df <- data.frame(scores, type)
model <- lm(scores ~ type, data = df)
summary(model)
----------------------------------------------
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.1105 0.2335 0.47 0.64
type1 -0.0731 0.4045 -0.18 0.86
type2 0.6081 0.4045 1.50 0.15
type3 -0.1164 0.4045 -0.29 0.78
> str(df)
'data.frame': 20 obs. of 2 variables:
$ scores: num -0.4684 -1.006 0.0636 1.025 0.5731 ...
$ type : chr "A" "A" "A" "A" ...
Converting it to a factor returns the same summary, still without the level names appearing in the output.
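One thing worth checking (an assumption on my part, since I can't see your session): numerically labelled dummies like type1, type2, type3 are exactly what sum-to-zero contrasts produce, and a global options(contrasts = ...) setting would also explain why converting to a factor didn't help. A minimal sketch with the toy data:

# Sum contrasts name the dummy columns numerically:
options(contrasts = c("contr.sum", "contr.poly"))
coef(lm(scores ~ type, data = df))   # (Intercept), type1, type2, type3

# R's default treatment contrasts restore the level labels:
options(contrasts = c("contr.treatment", "contr.poly"))
coef(lm(scores ~ type, data = df))   # (Intercept), typeB, typeC, typeD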
I am estimating a Heckman selection model (heckit) with the sampleSelection package. The model looks as follows:

library(sampleSelection)
Heckman <- heckit(AgencyTRACE ~ SizeCat + log(Amt_Issued) + log(daysfromissuance) +
                    log(daystomaturity) + EoW + dMon + EoM + VIX_95_Dummy + quarter,
                  Avg_Spread_Choi ~ SizeCat + log(Amt_Issued) + log(daysfromissuance) +
                    log(daystomaturity) + VIX_95_Dummy + TresholdHYIG_II,
                  data = heckmandata, method = "2step")

The summary shows a probit selection equation and an outcome equation - see below:
Tobit 2 model (sample selection model)
2-step Heckman / heckit estimation
2019085 observations (1915401 censored and 103684 observed)
26 free parameters (df = 2019060)
Probit selection equation:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.038164 0.043275 0.882 0.378
SizeCat2 0.201571 0.003378 59.672 < 2e-16 ***
SizeCat3 0.318331 0.008436 37.733 < 2e-16 ***
log(Amt_Issued) -0.099472 0.001825 -54.496 < 2e-16 ***
log(daysfromissuance) 0.079691 0.001606 49.613 < 2e-16 ***
log(daystomaturity) -0.036434 0.001514 -24.066 < 2e-16 ***
EoW 0.021169 0.003945 5.366 8.04e-08 ***
dMon -0.003409 0.003852 -0.885 0.376
EoM 0.008937 0.007000 1.277 0.202
VIX_95_Dummy1 0.088558 0.006521 13.580 < 2e-16 ***
quarter2019.2 -0.092681 0.005202 -17.817 < 2e-16 ***
quarter2019.3 -0.117021 0.005182 -22.581 < 2e-16 ***
quarter2019.4 -0.059833 0.005253 -11.389 < 2e-16 ***
quarter2020.1 -0.005230 0.004943 -1.058 0.290
quarter2020.2 0.073175 0.005080 14.406 < 2e-16 ***
Outcome equation:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 46.29436 6.26019 7.395 1.41e-13 ***
SizeCat2 -25.63433 0.79836 -32.109 < 2e-16 ***
SizeCat3 -34.25275 1.48030 -23.139 < 2e-16 ***
log(Amt_Issued) -0.38051 0.39506 -0.963 0.33547
log(daysfromissuance) 0.02452 0.34197 0.072 0.94283
log(daystomaturity) 7.92338 0.24498 32.343 < 2e-16 ***
VIX_95_Dummy1 -2.34875 0.89133 -2.635 0.00841 **
TresholdHYIG_II1 10.36993 1.07267 9.667 < 2e-16 ***
Multiple R-Squared: 0.0406, Adjusted R-Squared: 0.0405
Error terms:
Estimate Std. Error t value Pr(>|t|)
invMillsRatio -23.8204 3.6910 -6.454 1.09e-10 ***
sigma 68.5011 NA NA NA
rho -0.3477 NA NA NA
Now I'd like to compute a prediction from the outcome equation. Specifically, I'd like to predict Spread_Choi_All using the following data:
newdata <- data.frame(SizeCat = as.factor(1),
                      Amt_Issued = 50 * 1000000,
                      daysfromissuance = 5 * 365,
                      daystomaturity = 5 * 365,
                      VIX_95_Dummy = as.factor(0),
                      TresholdHYIG_II = as.factor(0))
SizeCat is a categorical/factor variable with the values 1, 2 and 3.
I have tried various ways, e.g.

predict(Heckman, part = "outcome", newdata = newdata)

I aim to predict a value (with the data from newdata) using the outcome equation (incl. the invMillsRatio). Is there a way to predict a value from the outcome equation?
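While you sort out predict(), here is a minimal by-hand sketch of the unconditional prediction X'beta from the printed coefficients. It assumes that coef(Heckman, part = "outcome") returns the outcome-equation coefficients named exactly as in the summary above. Two caveats: the conditional prediction E[y | observed] additionally needs the inverse-Mills term, which requires values for every selection-equation variable (EoW, dMon, EoM, quarter, ...) that newdata currently lacks; and as.factor(1) creates a one-level factor, so factor(1, levels = 1:3) is safer for matching the fitted levels.

# Hand-built design row for the outcome equation:
x <- c("(Intercept)"           = 1,
       "SizeCat2"              = 0,            # SizeCat = 1 is the baseline
       "SizeCat3"              = 0,
       "log(Amt_Issued)"       = log(50 * 1e6),
       "log(daysfromissuance)" = log(5 * 365),
       "log(daystomaturity)"   = log(5 * 365),
       "VIX_95_Dummy1"         = 0,
       "TresholdHYIG_II1"      = 0)
b <- coef(Heckman, part = "outcome")
sum(x * b[names(x)])   # E[y] without the inverse-Mills correction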
I am doing some count data analysis. The data is in this link:
https://www.dropbox.com/s/q7fwqicw3ebvwlg/stackquestion.csv?dl=0
Column A is the count data, and other columns are the independent variables. At first I used Poisson regression to analyze it:
m0 <- glm(A ~ ., data = d, family = "poisson")
summary(m0)
# The residual deviance is much greater than the degrees of freedom, so we have over-dispersion.
Call:
glm(formula = A ~ ., family = "poisson", data = d)
Deviance Residuals:
Min 1Q Median 3Q Max
-28.8979 -4.5110 0.0384 5.4327 20.3809
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 8.7054842 0.9100882 9.566 < 2e-16 ***
B -0.1173783 0.0172330 -6.811 9.68e-12 ***
C 0.0864118 0.0182549 4.734 2.21e-06 ***
D 0.1169891 0.0301960 3.874 0.000107 ***
E 0.0738377 0.0098131 7.524 5.30e-14 ***
F 0.3814588 0.0093793 40.670 < 2e-16 ***
G -0.3712263 0.0274347 -13.531 < 2e-16 ***
H -0.0694672 0.0022137 -31.380 < 2e-16 ***
I -0.0634488 0.0034316 -18.490 < 2e-16 ***
J -0.0098852 0.0064538 -1.532 0.125602
K -0.1105270 0.0128016 -8.634 < 2e-16 ***
L -0.3304606 0.0155454 -21.258 < 2e-16 ***
M 0.2274175 0.0259872 8.751 < 2e-16 ***
N 0.2922063 0.0174406 16.754 < 2e-16 ***
O 0.1179708 0.0119332 9.886 < 2e-16 ***
P 0.0618776 0.0260646 2.374 0.017596 *
Q -0.0303909 0.0060060 -5.060 4.19e-07 ***
R -0.0018939 0.0037642 -0.503 0.614864
S 0.0383040 0.0065841 5.818 5.97e-09 ***
T 0.0318111 0.0116611 2.728 0.006373 **
U 0.2421129 0.0145502 16.640 < 2e-16 ***
V 0.1782144 0.0090858 19.615 < 2e-16 ***
W -0.5105135 0.0258136 -19.777 < 2e-16 ***
X -0.0583590 0.0043641 -13.373 < 2e-16 ***
Y -0.1554609 0.0042604 -36.489 < 2e-16 ***
Z 0.0064478 0.0001184 54.459 < 2e-16 ***
AA 0.3880479 0.0164929 23.528 < 2e-16 ***
AB 0.1511362 0.0050471 29.945 < 2e-16 ***
AC 0.0557880 0.0181129 3.080 0.002070 **
AD -0.6569099 0.0368771 -17.813 < 2e-16 ***
AE -0.0040679 0.0003960 -10.273 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 97109.0 on 56 degrees of freedom
Residual deviance: 5649.7 on 26 degrees of freedom
AIC: 6117.1
Number of Fisher Scoring iterations: 6
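As a quick numeric check on the over-dispersion claim, the Pearson dispersion statistic can be computed directly (a standard diagnostic, nothing specific to this data set):

# Pearson chi-square over residual df; values far above 1 indicate over-dispersion.
sum(residuals(m0, type = "pearson")^2) / df.residual(m0)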
Then I think I should use negative binomial regression for the over-dispersed data. As you can see, I have many independent variables and I want to select the important ones, so I decided to use stepwise regression. First, I create a full model:
library(MASS)  # glm.nb()
full.model <- glm.nb(A ~ ., data = d, maxit = 1000)
summary(full.model)
Call:
glm.nb(formula = A ~ ., data = d, maxit = 1000, init.theta = 2.730327193,
link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.5816 -0.8893 -0.3177 0.4882 1.9073
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 11.8228596 8.3004322 1.424 0.15434
B -0.2592324 0.1732782 -1.496 0.13464
C 0.2890696 0.1928685 1.499 0.13393
D 0.3136262 0.3331182 0.941 0.34646
E 0.3764257 0.1313142 2.867 0.00415 **
F 0.3257785 0.1448082 2.250 0.02447 *
G -0.7585881 0.2343529 -3.237 0.00121 **
H -0.0714660 0.0343683 -2.079 0.03758 *
I -0.1050681 0.0357237 -2.941 0.00327 **
J 0.0810292 0.0566905 1.429 0.15291
K 0.2582978 0.1574582 1.640 0.10092
L -0.2009784 0.1543773 -1.302 0.19296
M -0.2359658 0.3216941 -0.734 0.46325
N -0.0689036 0.1910518 -0.361 0.71836
O 0.0514983 0.1383610 0.372 0.70974
P 0.1843138 0.3253483 0.567 0.57105
Q 0.0198326 0.0509651 0.389 0.69717
R 0.0892239 0.0459729 1.941 0.05228 .
S -0.0430981 0.0856391 -0.503 0.61479
T 0.2205653 0.1408009 1.567 0.11723
U 0.2450243 0.1838056 1.333 0.18251
V 0.1253683 0.0888411 1.411 0.15820
W -0.4636739 0.2348172 -1.975 0.04831 *
X -0.0623290 0.0508299 -1.226 0.22011
Y -0.0939878 0.0606831 -1.549 0.12142
Z 0.0019530 0.0015143 1.290 0.19716
AA -0.2888123 0.2449085 -1.179 0.23829
AB 0.1185890 0.0696343 1.703 0.08856 .
AC -0.3401963 0.2047698 -1.661 0.09664 .
AD -1.3409002 0.4858741 -2.760 0.00578 **
AE -0.0006299 0.0051338 -0.123 0.90234
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(2.7303) family taken to be 1)
Null deviance: 516.494 on 56 degrees of freedom
Residual deviance: 61.426 on 26 degrees of freedom
AIC: 790.8
Number of Fisher Scoring iterations: 1
Theta: 2.730
Std. Err.: 0.537
2 x log-likelihood: -726.803
When maxit is not set, or maxit = 100, it shows the warning messages "1: glm.fit: algorithm did not converge" and "2: In glm.nb(A ~ ., data = d, maxit = 100): alternation limit reached". With maxit = 1000 the warning messages disappear.
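As a sanity check that the negative binomial actually helped, the two fits can be compared directly (the AICs printed above are 6117.1 and 790.8):

# Side-by-side AIC of the Poisson and negative binomial models:
AIC(m0, full.model)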
Then I create an intercept-only starting model:

first.model <- glm.nb(A ~ 1, data = d)
Then I tried forward stepwise regression:

step.model <- step(first.model, direction = "forward", scope = formula(full.model))
It gives me this error message:
Error in glm.fit(X, y, wt, offset = offset, family = object$family, control = object$control) :
NA/NaN/Inf in 'x'
In addition: Warning message:
step size truncated due to divergence
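A first diagnostic for that error (assuming every column of d is numeric) is to scan the data for values glm.fit would reject:

# Count non-finite values per column; any hit would end up as NA/NaN/Inf in 'x'.
sapply(d, function(col) sum(!is.finite(col)))
# If everything is 0, the non-finite values arise inside a divergent candidate
# fit (cf. "step size truncated due to divergence") rather than from the data.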
I also tried the backward regression:

step.model2 <- step(full.model, direction = "backward")
# the final step
Step: AIC=770.45
A ~ B + C + E + F + G + H + I + K + L + R + T + V + W + Y + AA +
AB + AD
Df Deviance AIC
<none> 62.375 770.45
- AB 1 64.859 770.93
- H 1 65.227 771.30
- V 1 65.240 771.31
- L 1 65.291 771.36
- Y 1 65.831 771.90
- B 1 66.051 772.12
- C 1 67.941 774.01
- AA 1 69.877 775.95
- K 1 70.411 776.48
- W 1 71.526 777.60
- I 1 71.863 777.94
- E 1 72.338 778.41
- G 1 73.344 779.42
- F 1 73.510 779.58
- AD 1 79.620 785.69
- R 1 80.358 786.43
- T 1 95.725 801.80
Warning messages:
1: glm.fit: algorithm did not converge
2: glm.fit: algorithm did not converge
3: glm.fit: algorithm did not converge
4: glm.fit: algorithm did not converge
My questions are: Why are the results different between forward and backward stepwise regression? Why do I get the error message when performing forward selection? And what exactly do these warning messages mean, and how should I deal with them?
I am not a stats person but need to conduct statistical analysis for my research data, so I am struggling to learn how to do different regression analyses using real data. I searched online for similar questions but still couldn't understand the answers. Please also let me know if I did anything wrong in my regression analysis. I would really appreciate it if you could help me with these questions!
I have a question about Poisson GLM and formula representation:
Considering a data set:
p <- read.csv("https://raw.githubusercontent.com/Leprechault/PEN-533/master/bradysia-greenhouse.csv")
Without considering the interaction:
m1 <- glm(bradysia ~ area + mes, family = "quasipoisson", data = p)
summary(m1)
#(Intercept) 4.36395 0.12925 33.765 < 2e-16 ***
#areaCV -0.19696 0.12425 -1.585 0.113
#areaMJC -0.71543 0.08553 -8.364 3.11e-16 ***
#mes -0.08872 0.01970 -4.503 7.82e-06 ***
The final formula (with CV and MJC as 0/1 indicators and area CS as the baseline) is: bradysia = exp(4.36395 - 0.19696*CV - 0.71543*MJC - 0.08872*mes)
Considering the interaction:
m2 <- glm(bradysia ~ area * mes, family = "quasipoisson", data = p)
summary(m2)
#(Intercept) 4.05682 0.15468 26.227 < 2e-16 ***
#areaCV 0.15671 0.35219 0.445 0.6565
#areaMJC 0.54132 0.31215 1.734 0.0833 .
#mes -0.03943 0.02346 -1.680 0.0933 .
#areaCV:mes -0.05724 0.05579 -1.026 0.3052
#areaMJC:mes -0.22609 0.05576 -4.055 5.57e-05 **
What is the final formula, bradysia = exp(?????), in this case? Any help, please?
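Reading the coefficients the same way as before (CV and MJC are 0/1 indicators, area CS is the baseline), the interaction simply makes the slope on mes area-specific, so, if I am reading the output correctly: bradysia = exp(4.05682 + 0.15671*CV + 0.54132*MJC + (-0.03943 - 0.05724*CV - 0.22609*MJC)*mes). A quick self-check against R's own prediction (example values only; small discrepancies just reflect rounding of the printed coefficients):

nd  <- data.frame(area = "MJC", mes = 3)
eta <- 4.05682 + 0.54132 + (-0.03943 - 0.22609) * 3   # hand-built linear predictor
c(by_hand = exp(eta),
  by_R    = unname(predict(m2, newdata = nd, type = "response")))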
When I run a logistic regression, the predict() function and a manual calculation with the formula p = 1/(1 + e^-(b0 + b1*x1 + ...)) do not give the same answer. What could be the reason?
> test[1, ]
loan_status loan_Amount interest_rate period sector sex age grade
10000 0 608 41.72451 12 Online Shop Female 44 D3
sector and period were insignificant, so I removed them from the regression.
glm gives:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.1542256 0.7610472 -1.517 0.12936
interest_rate -0.0479765 0.0043415 -11.051 < 2e-16 ***
sexMale -0.8814945 0.0656296 -13.431 < 2e-16 ***
age -0.0139100 0.0035193 -3.953 7.73e-05 ***
gradeB 0.3209587 0.8238955 0.390 0.69686
gradeC1 -0.7113279 0.8728260 -0.815 0.41509
gradeC2 -0.4730014 0.8427544 -0.561 0.57462
gradeC3 0.0007541 0.7887911 0.001 0.99924
gradeD1 0.5637668 0.7597531 0.742 0.45806
gradeD2 1.3207785 0.7355950 1.796 0.07257 .
gradeD3 0.9201400 0.7303779 1.260 0.20774
gradeE1 1.7245351 0.7208260 2.392 0.01674 *
gradeE2 2.1547773 0.7242669 2.975 0.00293 **
gradeE3 3.1163245 0.7142881 4.363 1.28e-05 ***
>predictions_1st <- predict(Final_Model, newdata = test[1,], type = "response")
>predictions_1st
answer: 0.05478904
But when I calculate it like this:
>prob_1 <- 1/(1+e^-((-0.0479764603)*41.72451)-0.0139099563*44)
>prob_1
answer: 0.09081154
I also calculated it including the insignificant coefficients, but the answer is still not the same. What could be the reason?
You also have an (Intercept) of -1.1542256 and, because this observation's grade is D3, a gradeD3 coefficient of 0.9201400 (sex is Female, so sexMale contributes nothing):
1/(1+exp(-1*(-1.1542256 -0.0479764603*41.72451 -0.0139099563*44 + 0.9201400)))
#[1] 0.05478904
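More generally, you can let R assemble the full linear predictor (the intercept plus the dummies matching the observation's factor levels) and apply the inverse link yourself:

# type = "link" returns eta = b0 + b1*x1 + ...; plogis(eta) is 1/(1 + exp(-eta)).
eta <- predict(Final_Model, newdata = test[1, ], type = "link")
plogis(eta)   # 0.05478904, matching predict(..., type = "response")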