I have a data frame which contains some characteristics of clients and contracts, together with 0s and 1s indicating whether a fall happened during the period between 2008 and 2017. I am using a binomial model to regress the probability of a fall on these characteristics. I have 38,000 different contracts.
So I am using a binomial model like this (R code):
formule <- y ~ Niveau_gar_incapacite + Niv_indem_mens + Regrpt_franchise + Niveau_prime +
  Situation_familiale + Classe_age_chute + Grde_Region + Regrpt_strate + Taille_courtier +
  Commission + Retention + Anciennete + Regrpt_CSP + Regrpt_sinistres + Couplage   # scope for the selection
logit <- glm(Chute_commerciale ~ 1, data = train, family = binomial(link = "logit"))  # null (intercept-only) model
# forward stepwise selection by AIC (k = 2), from the null model up to formule
selection_asc_AIC <- step(logit, direction = "forward", trace = TRUE, k = 2,
                          scope = list(upper = formule))
After some tests for multicollinearity, I eliminated some variables and grouped some terms.
Here are my results (screenshots of the GLM output: "results from GLM" and "results from GLM 2").
These results do not look right: the null deviance and residual deviance seem wrong.
I suppose my exposure variable is the problem.
In fact, my contracts begin and end in different years.
So my exposure can be, for example, 5.32 or 1.36, and I have truncation and censoring.
How can I treat this exposure variable in a binomial logistic regression?
If I duplicate each row by the number of years of exposure, there is a problem with the independence of observations.
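For illustration only (this is not from the original post): one standard way to account for unequal time at risk with a binary outcome is a complementary log-log link with log(exposure) as an offset. A minimal sketch, assuming train has a numeric column exposure (years at risk) and using an arbitrary subset of the predictors; it addresses varying exposure but not, by itself, truncation:

# with a cloglog link and a log-exposure offset, P(fall) = 1 - exp(-exp(X'b) * exposure)
# `exposure` and the chosen predictors are assumptions for this sketch
fit_expo <- glm(Chute_commerciale ~ Niveau_prime + Anciennete + Regrpt_sinistres,
                offset = log(exposure),
                family = binomial(link = "cloglog"),
                data   = train)
summary(fit_expo)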
In the Orthodont dataset in nlme there are 27 subjects, and each subject is measured at 4 different ages. I wish to use these data to explore under what conditions the model becomes overdetermined. Here are the models:
library(nlme)
library(lme4)
m1 <- lmer( distance ~ age + (age|Subject), data = Orthodont )
m2 <- lmer( distance ~ age + I(age^2) + (age|Subject), data = Orthodont )
m3 <- lmer( distance ~ age + I(age^2) + I(age^3) + (age|Subject), data = Orthodont )
m1nlme <- lme(distance ~ age, random = ~ age|Subject, data = Orthodont)
m2nlme <- lme(distance ~ age + I(age^2), random = ~ age|Subject, data = Orthodont)
m3nlme <- lme(distance ~ age + I(age^2) + I(age^3), random = ~ age|Subject, data = Orthodont)
m4nlme <- lme(distance ~ age + I(age^2) + I(age^3), random = ~ age + I(age^2) + I(age^3)|Subject, data = Orthodont)
Of all of the above models, only m3 throws a warning message: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, ...) : Model failed to converge with max|grad| = 0.00762984 (tol = 0.002, component 1).
Questions:
What does the warning message suggest, and is it sensible to ignore it?
For m2, the model estimates a fixed intercept and fixed coefficients for age and I(age^2), together with the random-effect parameters sigma^2_intercept, sigma^2_age, and sigma^2_intercept:age. So a total of 1 + 2 + 3 = 6 parameters are estimated for each Subject. But there are only 4 observations per subject. Why does m2 not throw an error? Isn't m2 overdetermined? Am I counting the number of parameters incorrectly somewhere?
The warning message means that the model fit may be a bit numerically unstable. The check is done by numerically computing the scaled gradient, but as this depends in turn on the gradient and Hessian estimated by finite differences, it is itself subject to numerical error. As I've stated in many different venues, these warnings definitely tend to be over-sensitive/likely to be false positives: see e.g. ?lme4::convergence, ?lme4::troubleshooting. The gold standard is to use allFit() to refit the model with a variety of optimizers and make sure that the results from different optimizers are close enough to the same for your purposes.
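A minimal sketch of that allFit() check for m3 (how close is "close enough" is a judgment call):

library(lme4)
data("Orthodont", package = "nlme")
m3 <- lmer(distance ~ age + I(age^2) + I(age^3) + (age | Subject), data = Orthodont)
aa <- allFit(m3)   # refit the same model with every available optimizer
ss <- summary(aa)
ss$fixef           # fixed-effect estimates by optimizer: should agree closely
ss$llik            # log-likelihoods by optimizer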
There are two random effects values (BLUPs or conditional modes) per subject - the subject-level deviation of the intercept and slope wrt age. For values, we will be in trouble if the number of values is greater than or equal to the number of observations per group (or, for GLMMs without a scale parameter such as the Poisson, if the number of values is strictly greater than the number of observations per group). For parameters, there are up to four fixed-effect parameters (intercept, linear, quadratic, cubic terms wrt age) and three RE parameters (variance of the intercept, variance of the slope, covariance between intercept and slope), but these 7 parameters are estimated at the population level - the appropriate comparisons are with either the total number of observations or with the number of groups, not with the number of observations per group.
In general you should probably think about the number of observations when considering the number of fixed effect parameters and the number of groups when considering the number of random effect parameters; "10 [observations/groups] per parameter" is a not-unreasonable starting rule of thumb.
I am aware that there are similar questions on this site, however, none of them seem to answer my question sufficiently.
I am performing a multiple regression in order to predict real estate prices using the hedonic price method.
EXCERPT OF DATA USED
The dependent variable is AV_TOTAL, which is the price of the apartment units.
The distances from the closest park/highway are expressed in meters.
U_NUM_PARKS/U_FPLACE (presence of parking spaces and of a fireplace) are included as dummy variables.
1) Linear-Linear Model --> Results Model 1
lm(AV_TOTAL ~ LIVINGA_AREAM2 + NUM_FLOORS +
U_BASE_FLO + U_BDRMS + factor(U_NUM_PARK) + DIST_PARKS +
DIST_HIGHdiff + DIST_BIGDIG, data = data)
Residuals Model 1
2) Log-linear Model --> Results Model 2
lm(log(AV_TOTAL) ~ LIVINGA_AREAM2 + NUM_FLOORS +
U_BASE_FLO + U_BDRMS + factor(U_NUM_PARK) + DIST_PARKS + DIST_HIGHdiff + DIST_BIGDIG, data = data)
Residuals Model 2
3) Log-Log Model --> Results Model 3
lm(formula = log(AV_TOTAL) ~ log(LIVINGA_AREAM2) + NUM_FLOORS +
U_BASE_FLO + log(U_BDRMS) + factor(U_NUM_PARK) + log(DIST_PARKS) +
log(DIST_HIGHdiff) + log(DIST_BIGDIG), data = data)
Residuals Model 3
All the models have a fairly good R^2, while the residual plots show a more nearly normal distribution for Models 2 and 3.
I can't figure out what the difference is between Models 2 and 3, especially in interpreting the variable DIST_PARKS (distance from parks), nor which is the more appropriate model.
I am trying to predict "medv" (the median value of owner-occupied homes) from the Boston dataset. It is a numeric variable.
I have built a linear model on a training dataset and want to assess its accuracy on a test dataset. Below is the reproducible code:
library("MASS")
Boston<-Boston
set.seed(12396911) # set random seed
index <- sample(1:nrow(Boston), floor(0.8 * nrow(Boston)), replace = FALSE)
training <- Boston[index,]
testing<- Boston[-index,]
fin_model<-lm(medv ~ lstat + rm + ptratio + black + dis + nox + zn + chas + rad + tax + crim, data = training)
prediction<-predict(fin_model,testing)
tab <- data.frame(pred = prediction, true = testing$medv)
# mean squared error on the test set: average the squared errors over the test
# observations (dividing by length(tab) - 1 would use the number of columns minus one)
mse <- mean((tab$pred - tab$true)^2)
mse
I have an idea of how to calculate accuracy when predicting a categorical response (we compare the true value against the predicted one, so each prediction is either right or wrong, and the proportion of matches is our accuracy).
I was wondering whether correlation makes sense as a measure of accuracy: if it is 1, the predictions are 100% accurate, and if it is 0, they are useless. But I am not sure.
cor(tab$pred,tab$true)
# 0.8522107
postResample(pred, obs), available in the caret package, will give you RMSE, Rsquared, and MAE when pred and obs are numeric vectors (for factor vectors it returns Accuracy and Kappa instead).
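For example, with the prediction and testing objects from the question above:

library(caret)
# returns a named numeric vector with elements RMSE, Rsquared and MAE
postResample(pred = prediction, obs = testing$medv)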
First of all, I am relatively new in using R and haven't used lavaan (or growth models) before so please excuse my ignorance.
I am doing my thesis and analyzing the U.S. financial industry during the financial crisis of 2007. I therefore have individual banks and several variables for each bank across time (from 2007-2013), some are time-variant (such as ROA or capital adequacy) and some are time-invariant (such as size or age). Some variables are also time-variant but not multi-level since they apply to all firms (such as the average ROA of the U.S. financial industry).
First of all, can I use lavaan's growth curve model ("growth") in this instance? The example given in the tutorial is for either time-varying variables (c) that influence the outcome (DV) or time-invariant variables (x1 & x2) that influence the slope (s) and intercept (i). What about time-varying variables that influence the slope and intercept? I couldn't find an example of this syntax.
Also, how do I specify the "groups" (i.e. the different banks) in my analysis? Is it actually possible to do a multilevel growth curve model in lavaan (or in R for that matter)?
Last but not least, I couldn't find how to import a multilevel dataset into R. My dataset is basically a 3-dimensional matrix (different variables for different firms across time), so how do I input that via SPSS (or Notepad)?
Any help is much appreciated, I am basically lost on how to implement this model and sincerely need some assistance...
Thank you all in advance for your time!
Harry
edit: Here is the syntax that I have come up with so far. Do you think it makes sense?
ETHthesismodel <- '
# intercept and slope with fixed coefficients
i =~ 1*t1 + 1*t2 + 1*t3 + 1*t4
s =~ 0*t1 + 1*t2 + 2*t3 + 3*t4
#regressions (independent variables that influence the slope & intercept)
i ~ high_constr_2007 + high_constr_2008 + ... + low_constr_2007 + low_constr_2008 + ... + ... diff_2013
s ~ high_constr_2007 + high_constr_2008 + ... + low_constr_2007 + low_constr_2008 + ... + ... diff_2013
# time-varying covariates (control variables)
t1 ~ size_2007 + cap_adeq_2007 + brand_2007 +... + acquisitions_2007
t2 ~ size_2008 + cap_adeq_2008 + brand_2008 + ... + acquisitions_2008
...
t7 ~ size_2013 + cap_adeq_2013 + brand_2013 + ... + acquisitions_2013
'
fit <- growth(ETHthesismodel, data = inputdata,
group = "bank")
summary(fit)
I have built a Cox survival model which includes a covariate * time interaction (non-proportionality was detected).
I am now wondering how I could most easily get survival predictions from my model.
My model was specified:
coxph(formula = Surv(event_time_mod, event_indicator_mod) ~ Sex +
ageC + HHcat_alt + Main_Branch + Acute_seizure + TreatmentType_binary +
ICH + IVH_dummy + IVH_dummy:log(event_time_mod)
And now I was hoping to get a prediction using survfit, providing newdata for the combination of variables for which I am making the predictions:
survfit(cox, newdata = new)
Now, as I have event_time_mod on the right-hand side of my model, I need to include it in the new data frame passed to survfit. This event time would need to be set to the individual prediction times. Is there an easy way to tell survfit to use the correct time for event_time_mod?
Or are there any other options for achieving predictions from my model?
Of course I could create as many rows in the new data frame as there are distinct prediction times and set event_time_mod to the correct values, but that feels really cumbersome and I thought there must be a better way.
You have done what is referred to as
An obvious but incorrect approach ...
as stated in the Using Time Dependent Covariates and Time Dependent Coefficients in the Cox Model vignette in version 2.41-3 of the R survival package. Instead, you should use the time-transform functionality, i.e., the tt function described in the same vignette. The code would be something similar to the example in the vignette:
> library(survival)
> vfit3 <- coxph(Surv(time, status) ~ trt + prior + karno + tt(karno),
+ data=veteran,
+ tt = function(x, t, ...) x * log(t+20))
>
> vfit3
Call:
coxph(formula = Surv(time, status) ~ trt + prior + karno + tt(karno),
data = veteran, tt = function(x, t, ...) x * log(t + 20))
coef exp(coef) se(coef) z p
trt 0.01648 1.01661 0.19071 0.09 0.9311
prior -0.00932 0.99073 0.02030 -0.46 0.6462
karno -0.12466 0.88279 0.02879 -4.33 1.5e-05
tt(karno) 0.02131 1.02154 0.00661 3.23 0.0013
Likelihood ratio test=53.8 on 4 df, p=5.7e-11
n= 137, number of events= 128
survfit, though, does not work when you have a tt term:
> survfit(vfit3, veteran[1, ])
Error in survfit.coxph(vfit3, veteran[1, ]) :
The survfit function can not yet process coxph models with a tt term
However, you can easily get the terms, the linear predictor, or the mean response with predict. Further, you can construct the tt term over time using the answer here.
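A minimal sketch of that predict step with the vfit3 model above (assuming predict behaves for this tt model as described; check the output against what you actually need):

# linear predictor and per-term contributions for the data used in the fit
lp  <- predict(vfit3, type = "lp")
trm <- predict(vfit3, type = "terms")   # one column per term, including tt(karno)
head(trm)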