How can this SAS code be written in R?
proc glm data=DataTX;
class DAG;
by HID;
model Bwt = DAG/ss3 solution;
ods output parameterestimates =TX_BW_corrFact;
run;
The equivalent to proc glm for most purposes in R is lm, which fits linear models. It looks like you want the estimated coefficients from the model(s), which can be obtained by coef(mod) where mod is the object returned by lm.
The most complicated bit is replicating the by statement, which fits separate models for each level of the by variable (HID in this case). Try something like this. I assume you've already got your dataset imported into R.
grps <- split(DataTX, DataTX$HID)
mods <- lapply(grps, function(x) lm(Bwt ~ DAG, data=x))
sapply(mods, coef)
This splits DataTX into separate groups based on HID. For each group, it then fits the model lm(Bwt ~ DAG). The last line then extracts the fitted coefficients for each model.
This can be combined into a single line, but leaving it as 3 separate statements probably makes it easier to follow.
Note that the coefficients won't be the same as those from SAS, because of differences in how the two systems parametrise the model. In particular, SAS by default treats the last level of a class/factor variable as the reference, while R uses the first.
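If you want the R parametrisation to follow the SAS convention, one option is to relevel the factor so that its last level becomes the reference. The following is a minimal sketch on made-up data (the values and the column name DAG_sas are hypothetical, not from the question):

```r
# Toy data standing in for a single HID group (hypothetical values)
set.seed(1)
d <- data.frame(
  DAG = factor(rep(c("a", "b", "c"), each = 5)),
  Bwt = rnorm(15)
)

# R's default: the first level ("a") is the reference
fit_first <- lm(Bwt ~ DAG, data = d)

# SAS-style: make the last level ("c") the reference
last_level <- tail(levels(d$DAG), 1)
d$DAG_sas <- relevel(d$DAG, ref = last_level)
fit_last <- lm(Bwt ~ DAG_sas, data = d)

coef(fit_first)  # intercept = mean of level "a"
coef(fit_last)   # intercept = mean of level "c"
```

Both fits describe the same model and give identical fitted values; only the reference level, and therefore the meaning of each coefficient, changes.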
Have a look at lmList() from the lme4 or nlme package:
library(lme4)
lmList(Reaction ~ Days | Subject, sleepstudy)
That is shorter than Hong's solution, which for the same data would be:
grps <- split(sleepstudy, sleepstudy$Subject)
mods <- lapply(grps, function(x) lm(Reaction ~ Days, data=x))
sapply(mods, coef)
Hello (first timer here),
I would like to estimate a "two-way" cluster-robust variance-covariance matrix in R. I am using a particular canned routine from the "multiwayvcov" library. My question relates solely to the set-up of the cluster.vcov function in R. I have panel data of various crime outcomes. My cross-sectional unit is the "precinct" (over 40 precincts) and I observe crime in those precincts over several "months" (i.e., 24 months). I am evaluating an intervention that 'turns on' (dummy coded) for only a few months throughout the year.
I include "precinct" and "month" fixed effects (i.e., a full set of precinct and month dummies enter the model). I have only one independent variable I am assessing. I want to cluster on "both" dimensions but I am unsure how to set it up.
Do I estimate all the fixed effects with lm first? Or, do I simply run a model regressing crime on the independent variable (excluding fixed effects), then use cluster.vcov i.e., ~ precinct + month_year.
This seems like it would provide the wrong standard error though. Right? I hope this was clear. Sorry for any confusion. See my set up below.
library(multiwayvcov)
library(lmtest)  # provides coeftest()
model <- lm(crime ~ as.factor(precinct) + as.factor(month_year) + policy, data = DATASET_full)
boot_both <- cluster.vcov(model, ~ precinct + month_year)
coeftest(model, boot_both)
### What the documentation offers as an example
### https://cran.r-project.org/web/packages/multiwayvcov/multiwayvcov.pdf
library(lmtest)
data(petersen)
m1 <- lm(y ~ x, data = petersen)
### Double cluster by firm and year using a formula
vcov_both_formula <- cluster.vcov(m1, ~ firmid + year)
coeftest(m1, vcov_both_formula)
Is it appropriate to first estimate a model that ignores the fixed effects?
First the answer: you should first estimate your lm model using fixed effects. This will give you your asymptotically correct parameter estimates. The standard errors are incorrect, however, because they are calculated from a vcov matrix that assumes iid errors.
To replace the iid covariance matrix with a cluster-robust vcov matrix, you can use cluster.vcov, passing the fitted model as the first argument, i.e. my_new_vcov_matrix <- cluster.vcov(model, ~ precinct + month_year).
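To make the idea concrete, here is a rough base-R sketch of what a two-way cluster-robust covariance matrix is: the variance clustered on the first dimension, plus the variance clustered on the second, minus the variance clustered on their intersection. The helper names (cluster_meat, twoway_vcov) and the simulated panel are my own assumptions. The sketch applies no finite-sample corrections, so its numbers will generally not match those from a packaged routine.

```r
# "Meat" of the sandwich: sum over clusters of (X_g' e_g)(X_g' e_g)'
cluster_meat <- function(X, e, g) {
  scores <- rowsum(X * e, group = g)  # one row of summed scores per cluster
  crossprod(scores)
}

# Two-way clustering: V(g1) + V(g2) - V(g1 intersect g2)
twoway_vcov <- function(fit, g1, g2) {
  X <- model.matrix(fit)
  e <- residuals(fit)
  bread <- solve(crossprod(X))
  meat <- cluster_meat(X, e, g1) + cluster_meat(X, e, g2) -
    cluster_meat(X, e, interaction(g1, g2, drop = TRUE))
  bread %*% meat %*% bread
}

# Simulated panel (hypothetical): 10 precincts observed over 12 months
set.seed(1)
panel <- expand.grid(precinct = 1:10, month_year = 1:12)
panel$policy <- rbinom(nrow(panel), 1, 0.3)
panel$crime <- 5 + 1.5 * panel$policy +
  rnorm(10)[panel$precinct] + rnorm(12)[panel$month_year] +
  rnorm(nrow(panel))

# Fit WITH the fixed effects first, then swap in the clustered vcov
fit <- lm(crime ~ factor(precinct) + factor(month_year) + policy, data = panel)
V <- twoway_vcov(fit, panel$precinct, panel$month_year)
V["policy", "policy"]  # two-way clustered variance of the policy coefficient
```

The point estimates come from the fixed-effects model; only the covariance matrix is replaced. Note that in small samples the two-way formula is not guaranteed to produce a positive variance.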
Then a recommendation: I warmly recommend the function felm from lfe for both multi-way fixed effects and cluster-robust standard errors.
The syntax is as follows:
library(multiwayvcov)
library(lfe)
data(petersen)
my_fe_model <- felm(y~x | firmid + year | 0 | firmid + year, data=petersen )
summary(my_fe_model)
I am trying to run regressions in R (multiple models - poisson, binomial and continuous) that include fixed effects of groups (e.g. schools) to adjust for general group-level differences (essentially demeaning by group) and that cluster standard errors to account for the nesting of participants in the groups. I am also running these over imputed data frames (created with mice). It seems that different disciplines use the phrase ‘fixed effects’ differently so I am having a hard time searching to troubleshoot.
I have fit random intercept models (with lme4) but they do not account for the school fixed effects (and the random effects are not of interest to my research questions). Putting the groups in as dummies slows the run down tremendously. I could also run a single level glm/lm with group dummies but I have not been able to find a strategy to cluster the standard errors with the imputed data (tried the clusterSE package). I could hand calculate the demeaning but there seems like there should be a more direct way to achieve this.
I have also looked at the lfe package but that does not seem to have glm options and the demeanlist function does not seem to be compatible with the imputed data frames.
In Stata, the command would be xtreg depvar indepvars, fe vce(cluster clustvar) (fe = fixed effects, vce = clustered standard errors, with mi added to run over imputed data frames). I could switch to Stata for the modeling but would definitely prefer to stay with R if possible!
Please let me know if this is better posted on Cross Validated - I was on the fence but went with this one since it seemed to be more of a coding question.
Thank you!
I would block bootstrap. The "block" handles the clustering and "bootstrap" handles the generated regressors.
There is probably a more elegant way to make this extensible to other estimators, but this should get you started.
# junk data
x <- rnorm(100)
y <- 1 + 2*x + rnorm(100)
dat1 <- data.frame(y, x, id=seq_along(y))
summary(lm(y ~ x, data=dat1))
# same point estimates, but lower SEs
dat2 <- dat1[rep(seq_along(y), each=10), ]
summary(lm(y ~ x, data=dat2))
# block bootstrap helper function
library(boot)
myStatistic <- function(ids, i) {
  # rebuild the data from whole blocks: all rows belonging to each resampled id
  myData <- do.call(rbind, lapply(i, function(k) dat2[dat2$id==ids[k], ]))
  myLm <- lm(y ~ x, data=myData)
  myLm$coefficients
}
# same point estimates from the helper function on the original data
myStatistic(unique(dat2$id), 1:100)
# block bootstrap recovers correct SEs
boot(unique(dat2$id), myStatistic, 500)
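For the linear models, the hand-demeaning mentioned in the question can be sketched in a few lines of base R. This is only an illustration on simulated data (the variable names school, y_dm and x_dm are hypothetical), and it does not carry over to the Poisson or binomial models:

```r
# Simulated data: 20 schools with 10 pupils each (hypothetical)
set.seed(42)
school <- rep(1:20, each = 10)
x <- rnorm(200) + school / 10
y <- 1 + 2 * x + rep(rnorm(20), each = 10) + rnorm(200)
d <- data.frame(y, x, school)

# Within transformation: subtract the school mean from each variable
d$y_dm <- d$y - ave(d$y, d$school)
d$x_dm <- d$x - ave(d$x, d$school)

# The slope from the dummy-variable model ...
b_dummy <- coef(lm(y ~ x + factor(school), data = d))[["x"]]
# ... equals the slope from the regression on demeaned data
b_within <- coef(lm(y_dm ~ x_dm - 1, data = d))[["x_dm"]]

c(dummy = b_dummy, within = b_within)
```

The point estimates agree, but the default standard errors of the demeaned regression use the wrong degrees of freedom, so they would still need to be replaced, for example by the block bootstrap above.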
I fit a model with biglm and with lm; the returned model summaries are the same (apart from formatting). However, when I use them to predict on the same dataset, they produce different results. The lm predictions are correct: they match what I get when I calculate them by hand from the model coefficients. The biglm predictions are incorrect.
Here are the models:
library(biglm)
m1 <- biglm(cost ~ d + v + zi, data = tl)
m2 <- lm(cost ~ d + v + zi, data = tl)
Here is a small piece of the model summaries:
m1:
d: coef 473.9196
m2:
d: coef 4.739e+02
The rest of the model coefficients match in the same way. However, when I use the models to predict, the results are different: m1 != m2
t1$m1 <- predict(m1, t1)
t1$m2 <- predict(m2, t1)
I tried to use predict.biglm() but got an error saying the function doesn't exist.
I also looked at this post (R: lm and biglm producing different answers) and am sure it is not the reason.
The dataset is too big so I don't know how to share it here. And it also might take a while for me to de-code some of the information first.
But here is a small sample of the results, which shows that the predictions are quite different.
m1        m2
1798.831  2365.868
1801.074  2368.112
1482.508  2351.042
After a long day, I finally figured out the issue.
I know the biglm method requires the training and testing datasets to have records for all factor levels. So when I was processing the dataset, I added one record for each missing factor level (similar to the adding-dummies method in the thread cited above).
However(!!), I didn't update the factor levels using the factor() function. In this case the biglm model runs fine and the syntax is OK, but the model's predictions are not!!
Anyway, after I updated the factor levels, it worked just fine.
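For anyone hitting the same problem, here is a small base-R illustration of why the factor levels matter. The data frames train and test are made up, and this shows one way of aligning the levels, not necessarily the exact steps I took:

```r
# Hypothetical training and test sets; level "b" never occurs in the test set
train <- data.frame(zi = factor(c("a", "b", "c", "a")))
test  <- data.frame(zi = factor(c("c", "a")))

# The design matrices have different columns, so coefficients can be
# lined up against the wrong dummy variables
colnames(model.matrix(~ zi, train))
colnames(model.matrix(~ zi, test))

# Re-create the factor with the training levels so both codings agree
test$zi <- factor(test$zi, levels = levels(train$zi))
colnames(model.matrix(~ zi, test))
```

After the levels are aligned, both design matrices have the same columns in the same order, which is what the prediction step relies on.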
Is there a way to get R to run all possible models (with all combinations of variables in a dataset) to produce the best/most accurate linear model and then output that model?
I feel like there is a way to do this, but I am having a hard time finding the information.
There are numerous ways this could be achieved, but for a simple way of doing this I would suggest that you have a look at the glmulti package, which is described in detail in this paper:
glmulti: An R Package for Easy Automated Model Selection with (Generalized) Linear Models
Alternatively, here is a very simple example of model selection, as available on the Quick-R website:
# Stepwise Regression
library(MASS)
fit <- lm(y~x1+x2+x3,data=mydata)
step <- stepAIC(fit, direction="both")
step$anova # display results
Or to simplify even more, you can do more manual model comparison:
fit1 <- lm(y ~ x1 + x2 + x3 + x4, data=mydata)
fit2 <- lm(y ~ x1 + x2, data=mydata)
anova(fit1, fit2)
This should get you started, although you should read my comment above. It fits a model for every combination of the variables in your dataset and then compares all of the models by AIC and BIC.
# create an empty list called model to collect the fitted models
model <- list()
# create a vector of the data frame column names used to build the formula
vars = names(data)
# remove the variable names you don't want to use (at least
# the response variable, here assumed to be in the first column)
vars = vars[-1]
# combn() generates every combination of i variables; a glm is then fitted to each
for(i in 1:length(vars)){
xx = combn(vars,i)
if(is.null(dim(xx))){
fla = paste("y ~", paste(xx, collapse="+"))
model[[length(model)+1]]=glm(as.formula(fla),data=data)
} else {
for(j in 1:dim(xx)[2]){
fla = paste("y ~", paste(xx[1:dim(xx)[1],j], collapse="+"))
model[[length(model)+1]]=glm(as.formula(fla),data=data)
}
}
}
# see how many models were built using the loop above
length(model)
# create a vector to extract AIC and BIC values from the model variable
AICs = NULL
BICs = NULL
for(i in 1:length(model)){
AICs[i] = AIC(model[[i]])
BICs[i] = BIC(model[[i]])
}
#see which models were chosen as best by both methods
which(AICs==min(AICs))
which(BICs==min(BICs))
I ended up running forwards, backwards, and stepwise procedures on data to select models and then comparing them based on AIC, BIC, and adj. R-sq. This method seemed most efficient. However, when I received the actual data to be used (the program I was writing was for business purposes), I was told to only model each explanatory variable against the response, so I was able to just call lm(response ~ explanatory) for each variable in question, since the analysis we ended up using it for wasn't worried about how they interacted with each other.
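In case it helps someone, fitting one model per explanatory variable can be sketched like this. The built-in mtcars data and the chosen variable names stand in for my actual data, which I can't share:

```r
# Placeholder response and explanatory variables from the built-in mtcars data
response <- "mpg"
explanatory <- c("wt", "hp", "disp")

# Fit lm(response ~ explanatory) separately for each variable
fits <- lapply(setNames(explanatory, explanatory), function(v) {
  lm(reformulate(v, response = response), data = mtcars)
})

# Collect the slope and R-squared of each single-variable model
slopes <- sapply(fits, function(f) unname(coef(f)[2]))
r2 <- sapply(fits, function(f) summary(f)$r.squared)
data.frame(slope = slopes, r.squared = r2)
```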
This is a very old question, but for those who are still encountering this discussion: the olsrr package, and specifically the function ols_step_all_possible, exhaustively fits an OLS model for every possible subset of variables based on an lm object (so by feeding it a full model you get all possible combinations) and returns a data frame with R-squared, adjusted R-squared, AIC, BIC, etc. for all the models. This is very helpful for finding the best predictors, but it is also very time-consuming.
See https://olsrr.rsquaredacademy.com/reference/ols_step_all_possible.html
I do not recommend just "cherry picking" the best performing model, rather I would actually look at the output and choose carefully for the most reasonable outcome. In case you would want to immediately get the best performing model (by some criteria, say number of predictors and R2) you may write a function that saves the dataframe, arranges it by number of predictors and orders it by descending R2 and spits out the top result.
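A rough base-R sketch of such a ranking function is given below. It does not use olsrr; the function name all_subsets and the mtcars example are my own assumptions, and it ranks by adjusted R-squared with the number of predictors as a tie-breaker:

```r
# Fit every subset of the predictors and rank the resulting models
all_subsets <- function(response, predictors, data) {
  # every combination of 1, 2, ..., all predictors
  combos <- unlist(
    lapply(seq_along(predictors),
           function(k) combn(predictors, k, simplify = FALSE)),
    recursive = FALSE
  )
  rows <- lapply(combos, function(vars) {
    fit <- lm(reformulate(vars, response = response), data = data)
    data.frame(
      predictors = paste(vars, collapse = " + "),
      n = length(vars),
      adj_r2 = summary(fit)$adj.r.squared,
      aic = AIC(fit)
    )
  })
  out <- do.call(rbind, rows)
  # best adjusted R-squared first; fewer predictors win ties
  out[order(-out$adj_r2, out$n), ]
}

res <- all_subsets("mpg", c("wt", "hp", "qsec"), mtcars)
res           # inspect the whole table rather than only the winner
head(res, 1)  # top-ranked model by this criterion
```

With p predictors this fits 2^p - 1 models, so it is only practical for a small number of variables.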
The dredge() function from the MuMIn package also accomplishes this.
I'm working with a Probit model, and would like to calculate the probabilities from my model for each observation in my DF. I know I can calculate this using the formula, however I am wondering if there is a quick way to output the probabilities and append them to my DF.
I am running the following model:
attach(non.part.2)
y <- cbind(E)
x1 <- cbind(tech.ems, med.com, tech.nonemerg)
probit <- glm(y ~ x1, family = binomial (link = "probit"))
summary(probit)
I am running several models, so it would be nice to be able to have R spit out the probabilities and allow me to name them in my DF(non.part.2) - something like p_x1 - so that I can run summary stats on the various models later.
Any help is much appreciated!
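The following should work. Note that the data frame is called non.part.2, and that type = "response" returns probabilities rather than values on the probit (linear predictor) scale.
non.part.2$p_x1 <- predict(probit, yourDataToPredictOn, type = "response")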
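As a self-contained illustration, here is a sketch on simulated data. The data frame non_part and its columns are made up to resemble the question. It passes data = to glm instead of using attach() and cbind(), which keeps the fitted model tied to the data frame and makes predict() straightforward:

```r
# Simulated stand-in for the question's data (hypothetical values)
set.seed(123)
non_part <- data.frame(tech_ems = rnorm(200), med_com = rnorm(200))
non_part$E <- rbinom(
  200, 1,
  pnorm(-0.2 + 0.8 * non_part$tech_ems - 0.5 * non_part$med_com)
)

# Fit the probit model directly from the data frame
probit_fit <- glm(E ~ tech_ems + med_com,
                  family = binomial(link = "probit"),
                  data = non_part)

# Append the fitted probabilities as a new column
non_part$p_x1 <- predict(probit_fit, type = "response")
summary(non_part$p_x1)
```

Leaving out newdata returns the probabilities for the observations used to fit the model. Repeating the last two steps with a different formula and column name gives one probability column per model.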