Coefficient value of covariate in Cox PH model is too big - r

I am trying to develop Cox PH model with time-varying covariates in R. I use coxph function from survival package. There was not any trouble during estimation process, though coefficient value of one covariates is too large, in particular, 2.5e+32.
I can't guess what is reason of this problem and how to tackle it. This variable is nonstationary and proportional assumption is violated. Does either of this facts may cause such a big value of coefficient?

More information could help framing your problem.
Anyway, I doubt non-proportionality is to blame. It would imply that you have some outliers heavily biasing your coefficient beyond reasonable expectations. You could give this a quick look by plotting the output of cox.zph.
Another possible explanation is that this rather depends on the unit of measure you used to define your covariate. Can the magnitude of the coefficient be meaningfully interpreted? If so, you could simply re-scale/standardise/log-transform that covariate to obtain a 'more manageable' coefficient (if this is theoretically appropriate).
This could also be due to the so called 'complete separation', which has been discussed here and here.

Related

GLMM in R versus SPSS (convergence and singularity problems vanish)

Unfortunately, I had convergence (and singularity) issues when calculating my GLMM analysis models in R. When I tried it in SPSS, I got no such warning message and the results are only slightly different. Does it mean I can interpret the results from SPSS without worries? Or do I have to test for singularity/convergence issues to be sure?
You have two questions. I will answer both.
First Question
Does it mean I can interpret the results from SPSS without worries?
You do not want to do this. The reason being is that mixed models have a very specific parameterization. Here is a screenshot of common lme4 syntax from the original article about lme4 from the author:
With this comes assumptions about what your model is saying. If for example you are running a model with random intercepts only, you are assuming that the slopes do not vary by any measure. If you include correlated random slopes and random intercepts, you are then assuming that there is a relationship between the slopes and intercepts that may either be positive or negative. If you present this data as-is without knowing why it produced this summary, you may fail to explain your data in an accurate way.
The reason as highlighted by one of the comments is that SPSS runs off defaults whereas R requires explicit parameters for the model. I'm not surprised that the model failed to converge in R but not SPSS given that SPSS assumes no correlation between random slopes and intercepts. This kind of model is more likely to converge compared to a correlated model because the constraints that allow data to fit a correlated model make it very difficult to converge. However, without knowing how you modeled your data, it is impossible to actually know what the differences are. Perhaps if you provide an edit to your question that can be answered more directly, but just know that SPSS and R do not calculate these models the same way.
Second Question
Or do I have to test for singularity/convergence issues to be sure?
SPSS and R both have singularity checks as a default (check this page as an example). If your model fails to converge, you should drop it and use an alternative model (usually something that has a simpler random effects structure or improved optimization).

How to resolve heteroskedasticity in Multiple Linear Regression in R

I'm modelling multiple linear regression. I used the bptest function to test for heteroscedasticity. The result was significant at less than 0.05.
How can I resolve the issue of heteroscedasticity?
Try using a different type of linear regression
Ordinary Least Squares (OLS) for homoscedasticity.
Weighted Least Squares (WLS) for heteroscedasticity without correlated errors.
Generalized Least Squares (GLS) for heteroscedasticity with correlated errors.
Welcome to SO, Arun.
Personally, I don't think heteroskedasticity is something you "solve". Rather, it's something you need to allow for in your model.
You haven't given us any of your data, so let's assume that the variance of your residuals increases with the magnitude of your predictor. Typically a simplistic approach to handling it is to transform the data so that the variance is constant. One way of doing this might be to log-transform your data. That might give you a more constant variance. But it also transforms your model. Your errors are no longer IID.
Alternatively, you might have two groups of observarions that you want to compare with a t-test, bit the variance in one group is larger than in the other. That's a different sot of heteroskedasticity. There are variants of the standard "pooled variance" t-test that might handle that.
I realise this isn't an answer to your question in the conventional sense. I would have made it a comment, but I knew before I started that I'd need more words than a comment would let me have.

R: Which variables to include in model?

I'm fairly new to R and am currently trying to find the best model to predict my dependent variable from a number of predictor variables. I have 20 precictor variables and I want to see which ones I should include in my model and which ones I should exclude.
I am currently just running models with different predictor variables in each and comparing them to see which one has the lowest AIC, but this is taking a really long time. Is there an easier way to do this?
Thank you in advance.
This is more of a theoretical question actually...
In principle, if all of the predictors are actually exogenous to the model, they can all be included together and assuming you have enough data (N >> 20) and they are not too similar (which could give rise to multi-collinearity), that should help prediction. In practice, you need to think about whether each of (or any of) your predictors are actually exogenous to the model (that is, independent of the error term in the model). If they are not, then they will impart a bias on the estimates. (Also, omitting explanatory variables that are actually necessary imparts a bias.)
If predictive accuracy (even spurious in-sample accuracy) is the goal, then techniques like LASSO (as mentioned in the comments) could also help.

PLM in R with time invariant variable

I am trying to analyze a panel data which includes observations for each US state collected across 45 years.
I have two predictor variables that vary across time (A,B) and one that does not vary (C). I am especially interested in knowing the effect of C on the dependent variable Y, while controlling for A and B, and for the differences across states and time.
This is the model that I have, using plm package in R.
random <- plm(Y~log1p(A)+B+C, index=c("state","year"),model="random",data=data)
My reasoning is that with a time invariant variable I should be using random rather than fixed effect model.
My question is: Is my model and thinking correct?
Thank you for your help in advance.
You base your answer about the decision between fixed and random effect soley on computational grounds. Please see the specific assumptions associated with the different models. The Hausman test is often used to discriminate between the fixed and the random effects model, but should not be taken as the definite answer (any good textbook will have further details).
Also pooled OLS could yield a good model, if it applies. Computationally, pooled OLS will also give you estimates for time-invariant variables.

Comparing nonlinear regression models

I want to compare the curve fits of three models by r-squared values. I ran models using the nls and drc packages. It appears, though, that neither of those packages calculate r-squared values; they give "residual std error" and "residual sum of squares" though.
Can these two be used to compare model fits?
This is really a statistics question, rather than a coding question: consider posting on stats.stackexchange.com; you're likely to get a better answer.
RSQ is not really meaningful for non-linear regression. This is why summary.nls(...) does not provide it. See this post for an explanation.
There is a common, and understandable, tendency to hope for a single statistic that allows one to assess which of a set of models better fits a dataset. Unfortunately, it doesn't work that way. Here are some things to consider.
Generally, the best model is the one that has a mechanistic underpinning. Do your models reflect some physical process, or are you just trying a bunch of mathematical equations and hoping for the best? The former approach almost always leads to better models.
You should consider how the models will be used. Will you be interpolating (e.g. estimating y|x within the range of your dataset), or will you be extrapolating (estimating y|x outside the range of your data)? Some models yield a fit that provides relatively accurate estimates slightly outside the dataset range, and others completely fall apart.
Sometimes the appropriate modeling technique is suggested by the type of data you have. For example, if you have data that counts something, then y is likely to be poisson distributed and a generalized linear model (glm) in the poisson family is indicated. If your data is binary (e.g. only two possible outcomes, success or failure), then a binomial glm is indicated (so-called logistic regression).
The key underlying assumption of least squares techniques is that the error in y is normally distributed with mean 0 and constant variance. We can test this after doing the fit by looking at a plot of standardized residuals vs. y, and by looking at a Normal Q-Q plot of the residuals. If the residuals plot shows scatter increasing or decreasing with y then the model in not a good one. If the Normal Q-Q plot is not close to a straight line, then the residuals are not normally distributed and probably a different model is indicated.
Sometimes certain data points have high leverage with a given model, meaning that the fit is unduly influenced by those points. If this is a problem you will see it in a leverage plot. This indicates a weak model.
For a given model, it may be the case that not all of the parameters are significantly different from 0 (e.g., p-value of the coefficient > 0.05). If this is the case, you need to explore the model without those parameters. With nls, this often implies a completely different model.
Assuming that your model passes the tests above, it is reasonable to look at the F-statistic for the fit. This is essentially the ratio of SSR/SSE corrected for the dof in the regression (R) and the residuals (E). A model with more parameters will generally have smaller residual SS, but that does not make it a better model. The F-statistic accounts for this in that models with more parameters will have larger regression dof and smaller residual dof, making the F-statistic smaller.
Finally, having considered the items above, you can consider the residual standard error. Generally, all other things being equal, smaller residual standard error is better. Trouble is, all other things are never equal. This is why I would recommend looking at RSE last.

Resources