Replicate SAS PROC GLM in R

I want to fit a linear model in order to get means of a Y variable adjusted for a categorical variable Q and some numeric X variables.
Someone told me I could easily get them with SAS, and I used this piece of code:
proc glm data=TABLE_R;
class Q(ref="Q1");
model Y = Q X2 X3 X4 / solution;
lsmeans Q/ stderr pdiff cov out=adjmeans;
run;
But being much more comfortable with R, I wanted to replicate this procedure, and after some research I ended up with this code:
m = glm(Y ~ Q + X2 + X3 + X4, data=db) #using lm() didn't change anything
emmeans::emmeans(m, "Q")
The problem is that, although very close, the model coefficients are different. Here is an example with the intercept and 2 levels of Q:
#in R
(Intercept) Q2 Q3
-0.1790444126 0.0051160461 -0.0013756817
#in SAS
(Intercept) Q2 Q3
-0.1767853086 0.0016709301 -0.0031477746
Actually, in SAS, I get a note saying that the coefficients needed additional computation (which I unfortunately don't understand; does R's glm() lack this?):
Note: The X'X matrix has been found to be singular, and a generalized
inverse was used to solve the normal equations. Terms whose estimates
are followed by the letter 'B' are not uniquely estimable.
Which option should I add here or there so that I get the same results with both SAS and R?
If I cannot, how can I choose which method is best suited?
Useful posts: Proc GLM (SAS) using R, X'X matrix found to be singular
EDIT: This is very strange, but the numbers of observations used differ between SAS and R:
#SAS
Observations read: 81733
Observations used: 9000
#R
16357 Residual
(88017 observations deleted due to missingness)

You will get the same coefficients if you first do
options(contrasts=c("contr.SAS","contr.poly"))
before fitting the model. This will cause R to use the same parameterization that SAS uses.
However, even without this change, the fitted values from R will be identical to those from SAS, and the EMMs from R will match the lsmeans from SAS. That’s because we are not really changing the model, we are only changing how it is parameterized.
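The point about parameterization can be checked with a small sketch. It uses the built-in iris data as a stand-in (an assumption, since the asker's TABLE_R is not available) and sets the contrast per model through the contrasts argument of lm() rather than globally. The coefficients differ between the two fits, but the fitted values do not.

```r
# Stand-in data: built-in iris, not the asker's data.
# Default parameterization: contr.treatment, first level is the reference.
fit_default <- lm(Sepal.Length ~ Species + Sepal.Width, data = iris)

# SAS-style parameterization: contr.SAS, last level is the reference.
fit_sas <- lm(Sepal.Length ~ Species + Sepal.Width, data = iris,
              contrasts = list(Species = "contr.SAS"))

coef(fit_default)   # intercept refers to the first Species level
coef(fit_sas)       # intercept refers to the last Species level

# Same model, different parameterization: fitted values agree.
same_fit <- isTRUE(all.equal(fitted(fit_default), fitted(fit_sas)))
same_fit
```

Because the fitted values agree, quantities derived from them, such as adjusted means, should agree as well, provided both programs use the same observations.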

Related

How to perform logistic regression on not binary variable?

I was searching for this answer and I'm really surprised that I haven't found it. I just want to perform a three-level logistic regression in R.
Let's define some artificial data:
set.seed(42)
y <- sample(0:2, 100, replace = T)
x <- rnorm(100)
My variable y contains three values: 0, 1 and 2. So I thought that the simplest way would be just to use:
glm(y ~ x, family = binomial("logit"))
However, I got an error saying that y values should be in the interval [0,1]. Do you know how I can perform this regression?
Please note: I know that it's not so straightforward to perform multi-class logistic regression, and that there are several techniques for doing so, e.g. one-vs-all. But when I searched for an implementation, I wasn't able to find any.
Logistic regression as implemented by glm only works for 2 levels of output, not 3.
The message is a little vague because you can specify the y-variable in logistic regression either as 0s and 1s, or as a proportion (between 0 and 1) with a weights argument specifying the number of subjects the proportion is based on.
With 3 or more ordered levels in the response you need to use a generalization. One common generalization is proportional odds logistic regression (which also goes by other names). The polr function in the MASS package and the lrm function in the rms package (and probably other functions in other packages) fit these types of models, but glm does not.
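To illustrate the proportion-plus-weights form of the response mentioned above, here is a minimal sketch on simulated data. The variable names (n_trials, successes, prop) are made up for the example. It also fits the equivalent two-column cbind(successes, failures) form, which should give the same coefficients.

```r
set.seed(1)
n_groups  <- 20
n_trials  <- rep(10, n_groups)                 # hypothetical: 10 subjects per group
x         <- rnorm(n_groups)
successes <- rbinom(n_groups, size = n_trials, prob = plogis(0.5 * x))
prop      <- successes / n_trials              # response as a proportion in [0, 1]

# Proportion response; weights give the number of subjects behind each proportion.
fit_prop <- glm(prop ~ x, family = binomial("logit"), weights = n_trials)

# Equivalent specification: two-column matrix of successes and failures.
fit_cbind <- glm(cbind(successes, n_trials - successes) ~ x,
                 family = binomial("logit"))

same_coef <- isTRUE(all.equal(coef(fit_prop), coef(fit_cbind)))
same_coef
```

Neither form handles a three-category outcome; both are still two-outcome models.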
set.seed(42)
y <- sample(0:2, 100, replace = TRUE)
x <- rnorm(100)
multinomial regression
If you don't want to treat your responses as ordered (i.e., nominal or categorical values):
library(nnet) ## 'recommended' package, i.e. installed by default
multinom(y~x)
Results
# weights: 9 (4 variable)
initial value 109.861229
final value 104.977336
converged
Call:
multinom(formula = y ~ x)
Coefficients:
(Intercept) x
1 -0.001529465 0.29386524
2 -0.649236723 -0.01933747
Residual Deviance: 209.9547
AIC: 217.9547
Or, if your responses are ordered:
ordinal regression
MASS::polr() does proportional-odds logistic regression. (You may also be interested in the ordinal package, which has more features; it can also do multinomial models.)
library(MASS) ## also 'recommended'
polr(ordered(y)~x)
Results
Call:
polr(formula = ordered(y) ~ x)
Coefficients:
x
0.06411137
Intercepts:
0|1 1|2
-0.4102819 1.3218487
Residual Deviance: 212.165
AIC: 218.165
If you read the error message, it offers a hint that you might get success with:
y <- sample(seq(0,1,length=3), 100, replace = T)
And in fact, you do. Now your challenge might be to interpret that in the context of the actual real-world situation (which you have not described). You do get a warning, but R warnings are not errors.
You might also look up the topic of polychotomous logistic regression, which is implemented in several variants that might be useful in particular situations. Frank Harrell's book Regression Modeling Strategies has material on such techniques. You may also post further questions on CrossValidated.com if you need help choosing which route to go.

Model Syntax for Simple Moderation Model in Lavaan (with bootstrapping)

I am a social scientist currently running a simple moderation model in R, in the form of y ~ x + m + m * x. My moderator is a binary categorical variable (two separate groups).
I started out with lm(), bootstrapped the estimates with boot() and obtained BCa confidence intervals with boot.ci(). Since there is no automated way of doing this for all parameters (at my coding level at least), this is a bit tedious. However, I have now seen that the lavaan package offers bootstrapping as part of the regular sem() function, and also BCa CIs as part of parameterEstimates(). So, since I am using lavaan in other analyses, I was wondering whether I could just replace lm() with lavaan to keep my work more consistent.
Doing this, I was wondering about what the equivalent model for lavaan would be to test for moderation in the same way. I saw this post where Jeremy Miles proposes the code below, which I follow mostly.
mod.1 <- "
y ~ c(a, b) * x
y ~~ c(v1, v1) * y # This step needed for exact equivalence
y ~ c(int1, int2) * 1
modEff := a - b
mEff := int1 - int2"
But it would be great if you could help me figure out some final things.
1) What does the y ~~ c(v1, v1) * y part mean, and why is it needed for "exact equivalence" to the lm model? From the output it seems that this constrains the variance of the outcome to the same value in both groups?
2) From the post, am I right to understand that either including the interaction effect as calculated above OR constraining (only) the slope to be equal across groups and comparing model fit with anova() would be the same test for moderation?
3) The lavaan page says that adding test = "bootstrap" to the sem() function allows for bootstrap-adjusted p-values. However, I have read a lot about p-values conflicting with the BCa CIs at times, and this has happened to me. Searching around, I understand that this conflict arises because p-values rely on assumptions about the distribution of the data under H0, whereas CIs (which just give the range of most likely values) do not. I was therefore wondering what exactly it means that the p-values given here are "bootstrap-adjusted". Is it technically more correct to report these for my SEM models than the CIs?
Many questions, but I would be very grateful for any help you can provide.
Best,
Alex
I think I can answer at least questions 1 and 2, but it is probably easier not to use SEM and instead to write a function that conveniently gives you CIs for all coefficients of your model.
So first, to answer your questions:
1) What is proposed in the code you gave is called multigroup comparison. Essentially this means that you fit the same SEM to two different groups of cases in your dataset. It is equivalent to a moderated regression with a binary moderator because in both cases you get two slopes (often called "simple slopes") for the scalar predictor, one slope per group of the moderator.
Now, in your lavaan code you only see the scalar predictor x. The binary moderator is implied by group="m" when you fit the model with fit.1 <- sem(mod.1, data = df, group = "m") (I took this from the page you linked).
The two-element vectors (c( , )) in the lavaan code specify named parameters for the first and second group, respectively. With y ~~ c(v1, v1) * y, the residual variances of y are set equal in both groups because they have the same name. In contrast, the slopes c(a, b) and the intercepts c(int1, int2) are allowed to vary between groups.
2) Yes. If you use the SEM, you would fit the model a second time, adding the constraint a == b, and compare this model to the first version, where the slopes can differ. This is the same as comparing lm() models with and without the x:m interaction term in the formula.
3) Here I cannot provide a direct answer to your question. I suspect that if you want BCa CIs as you would get from applying boot.ci to an lm model fit, this might not be implemented. In the lavaan documentation BCa confidence intervals are only mentioned once: in the section about the parameterEstimates function, which can also perform bootstrapping (see p. 89). However, it does not produce actual BCa (bias-corrected and accelerated) CIs but only bias-corrected ones.
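The lm() side of the moderation test described above can be sketched as follows. The data are simulated and the names (df, fit_main, fit_int) are made up for illustration. With a binary moderator the interaction has one degree of freedom, so the F test from anova() and the t test on the x:m coefficient should give the same p-value.

```r
set.seed(123)
n  <- 200
df <- data.frame(x = rnorm(n),
                 m = factor(rep(c("g1", "g2"), each = n / 2)))
# Simulated outcome with a different slope of x in group g2.
df$y <- 1 + 0.5 * df$x + 0.3 * (df$m == "g2") +
  0.4 * df$x * (df$m == "g2") + rnorm(n)

fit_main <- lm(y ~ x + m, data = df)   # common slope in both groups
fit_int  <- lm(y ~ x * m, data = df)   # slopes may differ (adds x:m)

cmp   <- anova(fit_main, fit_int)      # model comparison = test of x:m
p_F   <- cmp[["Pr(>F)"]][2]
p_t   <- summary(fit_int)$coefficients["x:mg2", "Pr(>|t|)"]
c(p_F = p_F, p_t = p_t)
```

This is only the regression half of the comparison. The lavaan half (fitting with and without a == b) follows the same logic but is not shown here.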
As mentioned above, I guess the simplest solution would be to use lm() and either repeat the boot.ci procedure for each coefficient or write a wrapper function that does this for you. I suggest this also because a reviewer may be quite puzzled to see you do multigroup SEM instead of a simple moderated regression, which is much more common.
You probably did something like this already:
lm_fit <- function(dat, idx) coef( lm(y ~ x*m, data=dat[idx, ]) )
bs_out <- boot::boot(mydata, statistic=lm_fit, R=1000)
ci_out <- boot::boot.ci(bs_out, conf=.95, type="bca", index=1)
Now, either you repeat the last line for each coefficient, i.e., varying index from 1 to 4. Or you get fancy and let R do the repeating with a function like this:
all_ci <- function(bs) {
  est <- bs$t0                           # original-sample estimates
  lower <- vector("numeric", length(bs$t0))
  upper <- lower
  for (i in seq_along(bs$t0)) {
    # the last two entries of the $bca row are the lower and upper limits
    ci <- tail(boot::boot.ci(bs, type="bca", index=i)$bca[1,], 2)
    lower[i] <- ci[1]
    upper[i] <- ci[2]
  }
  cbind(est, lower, upper)
}
all_ci(bs_out)
I am sure this could be written more concisely but it should work fine for bootstraps of simple lm() models.

polyval: from Matlab to R

I would like to use in R the following expression given in Matlab:
y1=polyval(p,end_v);
where p in Matlab is:
p = polyfit(Nodes_2,CInt_interp,3);
Right now in R I have:
p <- lm(Spectra_BIR$y ~ poly(Spectra_BIR$x,3, raw=TRUE))
But I do not know which command in R corresponds to the polyval from Matlab.
Many thanks!
r:
library(polynom)
predict(polynomial(1:3), c(5,7,9))
[1] 86 162 262
matlab (official example):
p = [3 2 1];
polyval(p,[5 7 9])
ans = 86 162 262
There is no exact equivalent in R for polyfit and polyval, as these MATLAB routines are so primitive compared with R's statistical toolbox.
In MATLAB, polyfit mainly returns polynomial regression coefficients (the covariance can be obtained if required, though). polyval takes the regression coefficients p and a set of new x values, and evaluates the fitted polynomial at those values.
In R, the fashion is: use lm to obtain a regression model (much broader; not restricted to polynomial regression); use summary.lm for a model summary, like obtaining the covariance; use predict.lm for prediction.
So here is the way to go in R:
## don't use `$` in formula; use `data` argument
fit <- lm(y ~ poly(x,3, raw=TRUE), data = Spectra_BIR)
Note, fit not only contains coefficients, but also essential components for orthogonal computation. If you want to extract coefficients, do coef(fit), or unname(coef(fit)) if you don't want names of coefficients to be shown.
Now, to predict, we do:
x.new <- rnorm(5) ## some random new `x`
## note, `predict.lm` takes a "lm" model, not coefficients
predict.lm(fit, newdata = data.frame(x = x.new))
predict.lm is much more powerful than polyval. It can return confidence intervals. Have a read of ?predict.lm.
There are a few subtle issues with the use of predict.lm. There have been countless questions and answers about them; these are the root questions I often use as duplicate targets when closing such questions:
Getting Warning: “ 'newdata' had 1 row but variables found have 32 rows” on predict.lm in R
Predict() - Maybe I'm not understanding it
So make sure you get into the good habit of using lm and predict at an early stage of learning R.
Extra
It is also not difficult to construct something identical to polyval in R. The function g in my answer Function for polynomials of arbitrary order does this, and by setting nderiv it can also return derivatives of the polynomial.
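As a rough sketch of such a construction (not the function g from the linked answer), the hypothetical helper below evaluates a polynomial with Horner's rule, taking coefficients in MATLAB order (highest power first). Coefficients from lm(y ~ poly(x, 3, raw=TRUE)) come in increasing order of power, so they need rev() first. The data frame d is simulated for the example.

```r
# Hypothetical helper, not part of base R: MATLAB-style polyval.
# p: coefficients, highest power first; x: points at which to evaluate.
polyval <- function(p, x) {
  y <- rep(0, length(x))
  for (coef_k in p) y <- y * x + coef_k   # Horner's rule
  y
}

polyval(c(3, 2, 1), c(5, 7, 9))           # 3x^2 + 2x + 1 -> 86 162 262

# Check against predict() for a raw cubic fit on simulated data.
set.seed(1)
d   <- data.frame(x = runif(50, -2, 2))
d$y <- 1 - d$x + 0.5 * d$x^3 + rnorm(50, sd = 0.1)
fit <- lm(y ~ poly(x, 3, raw = TRUE), data = d)

p     <- rev(unname(coef(fit)))           # lm order is lowest power first
x_new <- c(-1, 0, 1.5)
by_polyval <- polyval(p, x_new)
by_predict <- unname(predict(fit, newdata = data.frame(x = x_new)))
all.equal(by_polyval, by_predict)
```

The rev() step only applies with raw = TRUE. With the default orthogonal polynomials the coefficients are not plain power-basis coefficients, and predict() is the safer route.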

Obtaining predicted (i.e. expected) values from the orm function (Ordinal Regression Model) from rms package in R

I've run a simple model using orm (i.e. reg <- orm(formula = y ~ x)) and I'm having trouble understanding how to get predicted values for Y. I've never worked with models that use multiple intercepts. I want to know for each and every value of Y in my dataset what the predicted value from the model would be. I tried predict(reg, type="mean") and this produced values that are close to the predicted values from an OLS regression, but I'm not sure if this is what I want. I really just want something analogous to OLS where you can obtain the E(Y) given a set of predictors. If possible, please provide code I can run to do this with a very short explanation.

beta coefficient in Anova with R and XLStat

I'm working with R and XLStat. I've conducted a one-way ANOVA (my categorical variable has 3 levels (1, 2, 3) and my response variable is quantitative on a scale of 1 to 10).
I've run this ANOVA in both R and XLStat, and the outputs for the F statistic, p-value, coefficient estimates, t-values, standard errors, etc. are exactly the same.
However, XLStat offers an extra output: the standardized coefficients (also called beta coefficients). At first I was surprised, because I didn't think beta coefficients could be calculated for a categorical variable, and according to the literature I have read, they make no sense.
Anyway, I tried to reproduce these coefficients in R using the only formula I found: beta = estimate * sd(x)/sd(y), where sd(x) is the standard deviation of the categorical variable (which R has to convert to a numeric variable in order to calculate sd(x), which seems logical) and sd(y) is the standard deviation of my response variable.
The first beta I obtained with R is the same as in XLStat, but the second and third are not. Given that the first one matches, I suppose that XLStat also converts the categorical variable to a numeric variable (which makes no sense, but that is not the question).
Moreover, I ran the ANOVA in Statistica to check whether XLStat had made a mistake, but its beta coefficients are the same as XLStat's.
So, my question is: what is the formula for the beta coefficients in a one-way ANOVA?
I would also like to ask about the relevance of these beta coefficients for a categorical variable. From my own reasoning and the publications I have read, they make no sense.
PS: the contrasts in R and XLStat are sum(ai)=0. For the beta coefficients, XLStat removes the intercept. I guess this could be important, but I don't know how.
The formula for obtaining beta coefficients from unstandardized coefficients in an ANOVA is the same as for a linear regression. The coefficients have no sensible interpretation for categorical variables, but standardized coefficients are useful for comparing the relative effects of IVs measured on different scales.
In R, either use scale() to transform the data to z-scores before fitting the model, or apply lm.beta() (from an add-on package such as QuantPsyc or lm.beta) to the fitted lm() model.
It is not clear why you would obtain different beta coefficients with XLStat, but, if it's not an error, it could have something to do with degrees of freedom. This example compares the lm.beta() function in R with SAS and obtains the same coefficients.
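For numeric predictors, the two routes mentioned above (the formula beta = estimate * sd(x)/sd(y) and refitting on z-scores via scale()) can be compared directly. The sketch below uses the built-in mtcars data as a stand-in and deliberately avoids a categorical predictor, since that is the case where the interpretation breaks down.

```r
# Stand-in data: built-in mtcars, numeric predictors only.
fit_raw <- lm(mpg ~ wt + hp, data = mtcars)

# Route 1: rescale the unstandardized slopes by sd(x) / sd(y).
b            <- coef(fit_raw)[c("wt", "hp")]
beta_formula <- b * c(sd(mtcars$wt), sd(mtcars$hp)) / sd(mtcars$mpg)

# Route 2: refit the model on z-scores.
fit_z       <- lm(scale(mpg) ~ scale(wt) + scale(hp), data = mtcars)
beta_scaled <- coef(fit_z)[-1]            # drop the (near-zero) intercept

same_beta <- isTRUE(all.equal(unname(beta_formula), unname(beta_scaled)))
same_beta
```

With a factor predictor, R expands the factor into contrast columns, so sd(x) would have to be taken over each contrast column rather than over the factor's numeric codes. That is one possible source of mismatches like the one described in the question, though the source does not confirm it.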
