I'm working with the software R and XLStat. I've conducted a one-way ANOVA (my categorical variable has 3 levels (1, 2, 3) and my response variable is quantitative, on a 1-10 scale).
I've run this ANOVA in both R and XLStat, and the outputs for Fisher's F, the p-value, the coefficient estimates, the t-values, the standard errors, etc. are exactly the same.
However, XLStat offers an extra output: the standardized coefficients (also called beta coefficients). At first I was surprised, because I didn't think beta coefficients could be calculated for a categorical variable, and according to the literature I have read, they don't make sense in that case.
Anyway, I tried to reproduce these coefficients in R with the only formula I found: beta = estimate * sd(x)/sd(y), where sd(x) is the standard deviation of the categorical variable (which R automatically converts to a numeric variable in order to compute sd(x), which seems logical) and sd(y) is the standard deviation of my response variable.
The first beta I obtained in R is the same as in XLStat, but not the second and the third. Given that the first one matches, I suppose XLStat also converts the categorical variable to a numeric one (which is senseless, but that is not the question).
Moreover, I ran the ANOVA in Statistica to check whether XLStat had made a mistake, but its beta coefficients are the same as XLStat's.
So my question is: what is the formula for obtaining the beta coefficients in a one-way ANOVA?
Secondly, I would like to ask about the relevance of these beta coefficients for a categorical variable. Based on my own reasoning and the publications I have read, they don't make sense.
PS: the contrasts in R and XLStat are sum-to-zero (sum(ai) = 0). For the beta coefficients, XLStat removes the intercept. I suspect this could matter, but I am not sure how.
The formula for obtaining beta coefficients from the raw (metric) coefficients in an ANOVA is the same as in linear regression. For categorical variables the standardized coefficients have no sensible interpretation on their own, but they are useful for comparing the relative effects of IVs measured on different scales.
In R, either use scale() to transform the data to z-scores before fitting the model, or fit the model with lm() and pass the fitted object to lm.beta() (e.g., from the lm.beta or QuantPsyc package).
It is not clear why you would obtain different beta coefficients with XLStat; if it is not an error, it could have something to do with degrees of freedom. This example compares R's lm.beta() with SAS and obtains the same coefficients.
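A minimal sketch of both approaches in base R, on made-up data (the data frame df, the 3-level factor group, and the sum-to-zero contrasts are assumptions chosen to mirror the question):

# Illustrative data: a 3-level factor with sum-to-zero contrasts
set.seed(1)
df <- data.frame(group = factor(rep(1:3, each = 10)), y = runif(30, 1, 10))
contrasts(df$group) <- contr.sum(3)

# Approach 1: standardize the design-matrix columns and the response, then refit
X <- model.matrix(~ group, df)[, -1]   # drop the intercept column
fit_std <- lm(scale(df$y) ~ scale(X))
coef(fit_std)                          # betas for both contrast columns

# Approach 2: beta_j = b_j * sd(x_j) / sd(y) applied to the unstandardized fit;
# this reproduces the same standardized coefficients
fit <- lm(y ~ group, df)
coef(fit)[-1] * apply(X, 2, sd) / sd(df$y)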
Can someone please help me with the equivalent of the mnrval function in R? I have not been able to find one that returns predicted probabilities based on its arguments, namely coefficient estimates and predictor values. I tried to rewrite the MATLAB function in R but was unable to, because one of the inner functions it used was private. I would highly appreciate your help on this.
The documentation page on mnrval() states:
MNRVAL Predict values for a nominal or ordinal multinomial regression model.
PHAT = MNRVAL(B,X) computes predicted probabilities for the nominal
multinomial logistic regression model with predictor values X. B is the
intercept and coefficient estimates as returned by the MNRFIT function. X
is an N-by-P design matrix with N observations on P predictor variables.
MNRVAL automatically includes intercept (constant) terms in the model; do
not enter a column of ones directly into X. PHAT is an N-by-K matrix of
predicted probabilities for each multinomial category.
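In R, the closest equivalent for the nominal case is to fit the model with nnet::multinom() and call predict(..., type = "probs"); for an ordinal model, MASS::polr() plays the role of MNRFIT instead. A hedged sketch on a built-in data set (the model and the new predictor values are made up for illustration):

library(nnet)

# Fit a nominal multinomial logistic regression (the analogue of MNRFIT)
fit <- multinom(Species ~ Sepal.Length + Sepal.Width, data = iris)

# Predicted probabilities for new predictor values: an N-by-K matrix with
# one column per category, like PHAT from MNRVAL
newX <- data.frame(Sepal.Length = c(5.0, 6.5), Sepal.Width = c(3.5, 3.0))
predict(fit, newdata = newX, type = "probs")

# The same probabilities computed by hand from the coefficient estimates,
# assuming the usual reference-category logit parameterization:
B <- coef(fit)                    # (K-1) x (P+1) coefficient matrix
X <- cbind(1, as.matrix(newX))    # prepend the intercept column
eta <- X %*% t(B)                 # linear predictors vs. the reference level
expEta <- cbind(1, exp(eta))      # the reference category contributes exp(0) = 1
expEta / rowSums(expEta)          # softmax: each row sums to 1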
When you run a multiple linear regression you get the multiple R-squared for the whole model in the summary output.
My question is whether I can get an R-squared for each independent variable, without having to run a separate regression for each of the predictor variables.
For example, is it possible to get an R-squared for each predictor variable, next to its p-value?
In regression models, individual variables do not have an R-squared. There is only ever an R-squared for the complete model. The variance explained by any single independent variable in a regression model depends on the other independent variables.
If you want the added value of an independent variable, that is, the variance this IV explains above all others, you can fit two regression models: one with this IV and one without. The difference in R-squared is the variance this IV explains after all the others have explained their share (see the sketch below). But if you do this for all variables, the differences won't add up to the total R-squared.
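A minimal sketch of that difference-in-R-squared computation, using the built-in mtcars data as a stand-in for your own data:

# Added value of wt: R-squared with wt minus R-squared without it
full    <- lm(mpg ~ wt + hp + qsec, data = mtcars)
reduced <- lm(mpg ~ hp + qsec,      data = mtcars)
summary(full)$r.squared - summary(reduced)$r.squared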
Alternatively, you may use the squared beta weights to roughly estimate the effect size of a variable within a model, but this value is not directly comparable to R-squared.
That said, this question would be better posted on Cross Validated than on Stack Overflow.
I am running an ordinal regression model. I have 8 explanatory variables: 4 of them categorical ('0' or '1'), 4 of them continuous. Beforehand, I want to be sure there is no multicollinearity, so I use the variance inflation factor (the vif function from the car package):
library(MASS)   # for polr()
library(car)    # for vif()
mod1 <- polr(Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8, Hess = TRUE, data = df)
vif(mod1)
but I get a VIF value of 125 for one of the variables, as well as the following warning:
Warning message: In vif.default(mod1) : No intercept: vifs may not be sensible.
However, when I convert my dependent variable to numeric (instead of a factor) and do the same thing with a linear model:
mod2 <- lm(Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8, data = df)
vif(mod2)
This time all the VIF values are below 3, suggesting that there's no multicollinearity.
I am confused about the vif function. How can it return VIFs above 100 for one model and low VIFs for another? Should I stick with the second result and still fit an ordinal model anyway?
The vif() function uses determinants of the correlation matrix of the parameters (and subsets thereof) to calculate the VIF. In a linear model, this includes just the regression coefficients, excluding the intercept. The vif() function wasn't intended to be used with ordered logit models, so when it extracts the variance-covariance matrix of the parameters it also includes the threshold parameters (i.e., the intercepts), which would normally be excluded by the function in a linear model. This is why you get the warning you get: it doesn't know to look for threshold parameters and remove them.
Since the VIF is really a function of the inter-correlations in the design matrix (which doesn't depend on the dependent variable or on the non-linear mapping from the linear predictor into the space of the response variable, i.e., the link function in a GLM), you should get the right answer with your second solution above, using lm() with a numeric version of your dependent variable.
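To see that the VIF depends only on the predictors, you can also compute it by hand from its definition, VIF_j = 1 / (1 - R_j²), where R_j² comes from regressing predictor j on the other predictors (a sketch using the variable names from the question, assuming the dummies are coded numerically):

# VIF for X1 by hand; this should match vif(mod2)["X1"]
r2_x1 <- summary(lm(X1 ~ X2 + X3 + X4 + X5 + X6 + X7 + X8, data = df))$r.squared
1 / (1 - r2_x1)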
My response variable is Yijk, the recovery time of patient i (i = 1,...,I) under treatment j (j = 1,...,J), measured at time k (k = 1,...,K).
I would like to fit the following model: Yijk = μ + αj + bik + uijk, where:
μ is a global fixed intercept
αj is a fixed effect for the treatment
bik is a random effect with the following covariance structure: denoting by bi the K-dimensional vector of effects for patient i, its variance-covariance matrix R has the AR(1) structure Cov(bik, bik′) = σb² ρ^|k−k′|.
uijk is the usual error term with variance σ²
Consider the following command:
lme(recovery ~ treatment, method = "REML", random = ~1 | patient, correlation = corAR1(form = ~time | patient), data = data)
Several questions:
What does this correlation argument correspond to? The structure of covariance of what? Is that the var-cov matrix which I defined as R?
Does the line actually do what I would like to?
If not, what does it do?
If not, is there a way to do what I would like to?
Thank you in advance!
First, you have a command lme; I will assume this is lme() from the nlme package, because a) that is the only package I know of that provides an lme function, and b) correlation isn't an argument in lme4.
Second, the nlme documentation describes the correlation argument as:
an optional corStruct object describing the within-group correlation
structure. See the documentation of corClasses for a description of
the available corStruct classes. Defaults to NULL, corresponding to no
within-group correlations.
and in corClasses it says
corAR1 autoregressive process of order 1.
So, the answers to your first two questions appear to be "Yes".
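Putting it together, a runnable sketch of the corrected call (assuming a data frame data with columns recovery, treatment, time, and patient, as in the question):

library(nlme)
fit <- lme(recovery ~ treatment,
           random = ~ 1 | patient,                         # random patient intercept
           correlation = corAR1(form = ~ time | patient),  # AR(1) within patient over time
           method = "REML", data = data)
summary(fit)   # the 'Correlation Structure' block reports the estimated AR(1) parameter Phi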
I ran the same probit regression in SAS and R and while my coefficient estimates are (essentially) equivalent, the reported test statistics are different. Specifically, SAS reports test statistics as t-statistics whereas R reports test statistics as z-statistics.
I checked my econometrics text and found (with little elaboration) that it reports probit results in terms of t statistics.
Which statistic is appropriate? And why does R differ from SAS?
Here's my SAS code:
proc qlim data=DavesData;
model y = x1 x2 x3/ discrete(d=probit);
run;
quit;
And here's my R code:
> model.1 <- glm(y ~ x1 + x2 + x3, family=binomial(link="probit"))
> summary(model.1)
Just to answer a little bit - it's seriously off topic, and the question should in fact be closed - but neither the t-statistic nor the z-statistic is meaningful by itself. They are related, though: Z is just the standard normal distribution, and T is an adapted, "close-to-normal" distribution that takes into account the fact that your sample is limited to n cases.
Now, both the z and the t statistic give the significance test for the null hypothesis that the respective coefficient is equal to zero. The standard error of the coefficients, used for that test, is based on the residual error. Via the link function, you effectively transform your response in such a way that the residuals become normal again, whereas in fact the residuals represent the difference between the observed and the estimated proportion. Because of this transformation, calculating degrees of freedom for the t-statistic is no longer useful, and hence R assumes the standard normal distribution for the test statistic.
Both results are completely equivalent; R will just give slightly sharper p-values (see the small illustration below). This is a matter of debate, but if you look at proportion-difference tests, they are also always done using the standard normal approximation (the Z-test).
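A quick illustration of the "sharper p-values" point, with a made-up test statistic of 2.0:

stat <- 2.0
2 * pnorm(-stat)          # ~0.046: two-sided p-value under the normal (z) reference
2 * pt(-stat, df = 30)    # ~0.055: the t reference with 30 df gives a slightly larger p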
Which brings me back to the point that neither of these values actually has much meaning on its own. If you want to know whether or not a variable makes a significant contribution, with a p-value that actually says something, use a chi-squared test such as the likelihood ratio (LR) test, the score test, or the Wald test. R just gives you the standard likelihood ratio; SAS also gives you the other two. All three tests are essentially equivalent; if they differ seriously, it's time to look at your data again.
E.g., in R:
anova(model.1, test = "Chisq")
For SAS: see the examples here for the use of contrasts and for getting the LR, score, or Wald test.