I'm trying to find out how well my mixed model with family effect fits the data. Is it possible to extract r squared values from lmekin functions? And if so, is it possible to extract partial r squared values for each of the covariables?
Example:
model= lmekin(formula = height ~ score + sex + age + (1 | IID), data = phenotype_df, varlist = kinship_matrix)
I have tried the MuMIn package, but it doesn't seem to work with lmekin models. Thanks.
I am able to use the r.squaredLR() function from MuMIn on an lmekin fit:
library(coxme)
library(MuMIn)
data(ergoStool, package="nlme") # use a data set from nlme
fit1 <- lmekin(effort ~ Type + (1|Subject), data=ergoStool)
r.squaredLR(fit1)
(I am fairly sure that works, but it would help a lot if you created a reproducible example so I can run your code and double-check. For example, I don't know what phenotype_df looks like, so I can't run your code as it is. A great resource for this is the reprex package.)
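For intuition, the likelihood-ratio R-squared can be computed by hand. The sketch below assumes the usual definition, R2_LR = 1 - exp(-2/n * (logLik(model) - logLik(null))), which is my understanding of what r.squaredLR() reports by default. It uses plain lm() fits on the built-in mtcars data rather than lmekin, so it runs without extra packages; the variable names are illustrative only.

```r
# Likelihood-ratio R^2 by hand, using base R only.
# Assumption: R2_LR = 1 - exp(-2/n * (logLik(full) - logLik(null))).
full <- lm(mpg ~ wt + hp, data = mtcars)   # model of interest
null <- lm(mpg ~ 1, data = mtcars)         # intercept-only model

n <- nobs(full)
r2_lr <- 1 - exp(-2 / n * (as.numeric(logLik(full)) - as.numeric(logLik(null))))

# For an ordinary Gaussian linear model this coincides with the usual R^2.
r2_ols <- summary(full)$r.squared
print(c(R2_LR = r2_lr, R2_OLS = r2_ols))
```

A rough analogue of a partial value for one covariate would be to replace the null model with the model that drops only that covariate, though whether that is a sensible measure for a mixed model is a judgment call.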
Is there a way to get R to run all possible models (with all combinations of variables in a dataset) to produce the best/most accurate linear model and then output that model?
I feel like there is a way to do this, but I am having a hard time finding the information.
There are numerous ways this could be achieved, but for a simple way of doing this I would suggest that you have a look at the glmulti package, which is described in detail in this paper:
glmulti: An R Package for Easy Automated Model Selection with (Generalized) Linear Models
Alternatively, here is a very simple example of model selection, as available on the Quick-R website:
# Stepwise Regression
library(MASS)
fit <- lm(y~x1+x2+x3,data=mydata)
step <- stepAIC(fit, direction="both")
step$anova # display results
Or to simplify even more, you can do more manual model comparison:
fit1 <- lm(y ~ x1 + x2 + x3 + x4, data=mydata)
fit2 <- lm(y ~ x1 + x2, data=mydata)
anova(fit1, fit2)
This should get you started, although you should read my comment above. It builds a model for every combination of the variables in your dataset and then compares all of the models by AIC and BIC.
# create an empty list called model to collect the fitted models
model = list()
# create a vector of the dataframe column names used to build the formula
vars = names(data)
# remove variable names you don't want to use (at least
# the response variable, if it is in the first column)
vars = vars[-1]
# combn() generates every combination of i variables; glm() is then fitted to each one
for(i in 1:length(vars)){
  xx = combn(vars,i)
  if(is.null(dim(xx))){
    fla = paste("y ~", paste(xx, collapse="+"))
    model[[length(model)+1]]=glm(as.formula(fla),data=data)
  } else {
    for(j in 1:dim(xx)[2]){
      fla = paste("y ~", paste(xx[1:dim(xx)[1],j], collapse="+"))
      model[[length(model)+1]]=glm(as.formula(fla),data=data)
    }
  }
}
# see how many models were built using the loop above
length(model)
# create vectors to hold the AIC and BIC value of each fitted model
AICs = NULL
BICs = NULL
for(i in 1:length(model)){
  AICs[i] = AIC(model[[i]])
  BICs[i] = BIC(model[[i]])
}
# see which models were chosen as best by each criterion
which(AICs==min(AICs))
which(BICs==min(BICs))
I ended up running forwards, backwards, and stepwise procedures on data to select models and then comparing them based on AIC, BIC, and adj. R-sq. This method seemed most efficient. However, when I received the actual data to be used (the program I was writing was for business purposes), I was told to only model each explanatory variable against the response, so I was able to just call lm(response ~ explanatory) for each variable in question, since the analysis we ended up using it for wasn't worried about how they interacted with each other.
This is a very old question, but for those who are still encountering this discussion: the olsrr package, and specifically the function ols_step_all_possible, exhaustively fits an OLS model for every possible subset of variables based on an lm object (so by feeding it a full model you get all possible combinations), and returns a data frame with R-squared, adjusted R-squared, AIC, BIC, etc. for all the models. This is very helpful for finding the best predictors, but it is also very time-consuming.
See https://olsrr.rsquaredacademy.com/reference/ols_step_all_possible.html
I do not recommend just "cherry picking" the best-performing model. Rather, I would look at the output and choose carefully for the most reasonable outcome. If you do want to get the best-performing model immediately (by some criteria, say number of predictors and R-squared), you can write a function that saves the data frame, sorts it by number of predictors and then by descending R-squared, and returns the top result.
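The ranking idea can be sketched in base R without olsrr. This is only an illustration under assumptions: the built-in mtcars data stands in for your data, mpg is taken as the response, and the four predictors are picked arbitrarily. It builds a comparable table by hand and sorts it.

```r
# Base-R sketch: fit every subset of predictors and rank the results.
# Assumption: mtcars stands in for your data; mpg is the response.
response   <- "mpg"
predictors <- c("wt", "hp", "qsec", "drat")

# Enumerate all non-empty subsets of the predictors.
subsets <- unlist(
  lapply(seq_along(predictors), function(k) combn(predictors, k, simplify = FALSE)),
  recursive = FALSE
)

# Fit one lm per subset and collect summary statistics in a data frame.
results <- do.call(rbind, lapply(subsets, function(vars) {
  fit <- lm(reformulate(vars, response = response), data = mtcars)
  data.frame(
    predictors = paste(vars, collapse = " + "),
    n          = length(vars),
    adj_r2     = summary(fit)$adj.r.squared,
    aic        = AIC(fit),
    bic        = BIC(fit)
  )
}))

# Rank: fewest predictors first, then highest adjusted R^2 within each size.
ranked <- results[order(results$n, -results$adj_r2), ]
head(ranked)

# Single best model by one criterion (here: lowest AIC).
results[which.min(results$aic), ]
```

With p predictors this fits 2^p - 1 models, which is why the exhaustive approach gets slow quickly.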
The dredge() function in the MuMIn package also accomplishes this.
I'm sure this is something that can be done, just not sure how!
I have a dataset of around 500 rows (CSV) that shows footballers' match stats (e.g. passes, shots on target, etc.). I have salaries for some of them (around 10), and I'm trying to predict the remaining salaries using a linear regression equation.
In the code below, if y is salary, is there a way in R to essentially auto-populate what the rest of the salaries might be, based on the ten salaries I do have?
lm(y ~ x1 + x2 +x3)
Any help would be much appreciated.
This is what the predict function does.
Note that you don't need to call predict.lm explicitly. Because the result of a call to lm is an object with class "lm", R "knows" to use predict.lm when you call predict on it.
Eg:
lm1 <- lm(y ~ x1 + x2 +x3)
y.fitted <- predict(lm1)
You should also be able to test the predictive accuracy of your model using cross-validation with the function cv.lm in the DAAG package. It repeatedly splits the data into training and test folds, fits the model on the training data, and evaluates it on the held-out test data.
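To fill in the unknown salaries specifically, predict can be given the rows to predict for through its newdata argument. A minimal sketch, assuming a made-up data frame called players with hypothetical columns passes, shots and salary in place of the real CSV:

```r
# Sketch: fill in unknown salaries from a model fitted on the known ones.
# Assumption: 'players' and its columns are made-up stand-ins for the real CSV.
set.seed(42)
players <- data.frame(
  passes = rpois(50, 40),
  shots  = rpois(50, 5)
)
players$salary <- 1000 + 20 * players$passes + 150 * players$shots + rnorm(50, sd = 100)
players$salary[11:50] <- NA          # only the first 10 salaries are known

# lm() drops rows with a missing response by default (na.action = na.omit).
fit <- lm(salary ~ passes + shots, data = players)

# Predict for the rows where salary is missing and fill them in.
unknown <- is.na(players$salary)
players$salary_filled <- players$salary
players$salary_filled[unknown] <- predict(fit, newdata = players[unknown, ])

head(players, 15)
```

With only about ten known salaries the estimates will be quite uncertain; predict(fit, newdata = ..., interval = "prediction") shows how wide the prediction intervals are.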
I have imputed missing values using the Amelia package. I am now analysing the data using regressions.
I have been using the following code
require(Zelig)
z.out <- zelig(catastrophic ~ age + PC1 + sex + hh_size + wealth_quin +
hh_exp_quin, model="logit", data = a.output$imputation)
summary(z.out)
Where a.output is an imputed data set. (I still need to code the combining of the multiple imputed data sets, but know how to do that, so that comes later).
I have found that my model has quite a lot of endogeneity between hh_exp_quin and the dependent variable (which is a binary variable, hence the model is "logit").
As such, I want to use another variable (not currently included in this model, call it "var1") as an instrumental variable for hh_exp_quin.
The Zelig package doesn't currently seem to support "ivreg", and I can't find anything online telling me how to deal with this.
Many thanks,
Timothy
Try the ivreg command in the package AER.
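As a rough illustration of what ivreg estimates, here is two-stage least squares done by hand in base R on simulated data. Everything in it is an assumption for the sake of the example: the variable names are invented, the outcome is continuous rather than binary, and the equivalent AER call is shown only as a comment so the sketch runs without the package.

```r
# Two-stage least squares by hand (base R), on simulated data.
# Assumptions: all variable names are illustrative; the outcome is continuous.
set.seed(1)
n <- 5000
z <- rnorm(n)                      # instrument: affects x, not y directly
u <- rnorm(n)                      # unobserved confounder
x <- z + u + rnorm(n)              # endogenous regressor
y <- 1 + 2 * x + 3 * u + rnorm(n)  # true effect of x on y is 2
d <- data.frame(y, x, z)

ols <- lm(y ~ x, data = d)         # biased, because x is correlated with u

# Stage 1: regress the endogenous variable on the instrument.
d$x_hat <- fitted(lm(x ~ z, data = d))
# Stage 2: regress the outcome on the stage-1 fitted values.
tsls <- lm(y ~ x_hat, data = d)

print(c(ols = unname(coef(ols)["x"]), tsls = unname(coef(tsls)["x_hat"])))

# With the AER package installed, the equivalent call would be:
#   AER::ivreg(y ~ x | z, data = d)
# which also reports correct standard errors (the manual stage-2 ones are not).
```

Note that ivreg fits a linear model, so with a binary outcome such as catastrophic it amounts to a linear probability model rather than a logit.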
How can this be written in R?
proc glm data=DataTX;
class DAG;
by HID;
model Bwt = DAG/ss3 solution;
ods output parameterestimates =TX_BW_corrFact;
run;
The equivalent to proc glm for most purposes in R is lm, which fits linear models. It looks like you want the estimated coefficients from the model(s), which can be obtained by coef(mod) where mod is the object returned by lm.
The most complicated bit is replicating the by statement, which fits separate models for each level of the by variable (HID in this case). Try something like this. I assume you've already got your dataset imported into R.
grps <- split(DataTX, DataTX$HID)
mods <- lapply(grps, function(x) lm(Bwt ~ DAG, data=x))
sapply(mods, coef)
This splits DataTX into separate groups based on HID. For each group, it then fits the model lm(Bwt ~ DAG). The last line then extracts the fitted coefficients for each model.
This can be concatenated into a single line, but leaving it as 3 separate statements probably makes it easier to follow.
Note that the coefficients won't be the same as those from SAS, because of differences in how the two systems parametrise the model. In particular, SAS by default treats the last level of a class/factor variable as the reference, while R uses the first.
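If you want R to use the last level as the reference, as SAS does, one option is the contr.SAS contrast function from base R. A small sketch, assuming a toy data frame that stands in for a single HID group of DataTX (the data are simulated, not from the question):

```r
# Sketch: make R use the last factor level as the reference, as SAS does.
# Assumption: this toy data frame stands in for one HID group of DataTX.
set.seed(1)
toy <- data.frame(
  DAG = factor(rep(c("a", "b", "c"), each = 10)),
  Bwt = rnorm(30, mean = 50, sd = 5)
)

fit_r   <- lm(Bwt ~ DAG, data = toy)                 # reference level "a"
fit_sas <- lm(Bwt ~ DAG, data = toy,
              contrasts = list(DAG = "contr.SAS"))   # reference level "c"

coef(fit_r)    # intercept = mean of level "a"
coef(fit_sas)  # intercept = mean of level "c"

# Both parametrisations describe the same fitted model.
all.equal(fitted(fit_r), fitted(fit_sas))
```

Using relevel() on the factor before fitting is another way to choose the reference level.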
Have a look at lmList() from the lme4 or nlme package
library(lme4)
lmList(Reaction ~ Days | Subject, sleepstudy)
That is shorter than Hong's solution, which for the same data would be:
grps <- split(sleepstudy, sleepstudy$Subject)
mods <- lapply(grps, function(x) lm(Reaction ~ Days, data=x))
sapply(mods, coef)
I am trying to learn R after using Stata and I must say that I love it. But now I am having some trouble. I am about to do some multiple regressions with Panel Data so I am using the plm package.
Now I want to get the same results from plm in R as I get from the lm function and from Stata when I run a heteroscedasticity-robust regression with entity fixed effects.
Let's say that I have a panel dataset with the variables Y, ENTITY, TIME, V1.
I get the same standard errors in R with this code
library(lmtest)    # coeftest()
library(sandwich)  # vcovHC() for lm objects
lm.model <- lm(Y ~ V1 + factor(ENTITY), data=data)
coeftest(lm.model, vcov.=vcovHC(lm.model, type="HC1"))
as when I perform this regression in Stata
xi: reg Y V1 i.ENTITY, robust
But when I perform this regression with the plm package I get other standard errors
library(plm)
plm.model <- plm(Y ~ V1, index=c("ENTITY","YEAR"), model="within", effect="individual", data=data)
coeftest(plm.model, vcov.=vcovHC(plm.model, type="HC1"))
Have I missed setting some options?
Does the plm model use some other kind of estimation and if so how?
Can I in some way get the same standard errors with plm as in Stata with , robust?
By default the plm package does not use the exact same small-sample correction for panel data as Stata. However in version 1.5 of plm (on CRAN) you have an option that will emulate what Stata is doing.
plm.model <- plm(Y ~ V1, index=c("ENTITY","YEAR"), model="within",
                 effect="individual", data=data)
coeftest(plm.model, vcov.=function(x) vcovHC(x, type="sss"))
This should yield the same clustered by group standard-errors as in Stata (but as mentioned in the comments, without a reproducible example and what results you expect it's harder to answer the question).
For more discussion on this and some benchmarks of R and Stata robust SEs see Fama-MacBeth and Cluster-Robust (by Firm and Time) Standard Errors in R.
See also:
Clustered standard errors in R using plm (with fixed effects)
Is it possible that your Stata code is different from what you are doing with plm?
plm's "within" option with "individual" effects means a model of the form:
yit = a + Xit*B + eit + ci
What plm does is demean the variables (not the coefficients), so that ci drops out of the equation:
yit_bar = Xit_bar*B + eit_bar
Here the "bar" suffix means that each variable has had its mean subtracted. The mean is calculated over time within each individual, which is why the effect is an individual one. You could also have a fixed time effect common to all individuals, in which case the effect would be through time as well (that is irrelevant in this case though).
I am not sure what the "xi" command does in Stata, but I think it expands an interaction, right? If so, it seems to me that you are using one dummy variable per ENTITY, as was highlighted by @richardh.
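The equivalence between demeaning and per-entity dummies can be checked directly in base R. This is a sketch on simulated data, assuming the variable names from the question (Y, V1, ENTITY, TIME); the columns Y_bar and V1_bar are hypothetical names for the demeaned variables.

```r
# Sketch: the within (fixed-effects) estimator equals OLS on entity-demeaned data.
# Assumption: simulated panel with the variable names from the question.
set.seed(123)
panel <- expand.grid(TIME = 1:6, ENTITY = 1:20)
ci <- rnorm(20)[panel$ENTITY]                 # entity-specific effect
panel$V1 <- rnorm(nrow(panel)) + ci           # regressor correlated with the effect
panel$Y  <- 1 + 0.5 * panel$V1 + 2 * ci + rnorm(nrow(panel))

# Subtract each entity's time mean from every variable.
panel$Y_bar  <- panel$Y  - ave(panel$Y,  panel$ENTITY)
panel$V1_bar <- panel$V1 - ave(panel$V1, panel$ENTITY)

within_fit <- lm(Y_bar ~ V1_bar - 1, data = panel)        # demeaned regression
dummy_fit  <- lm(Y ~ V1 + factor(ENTITY), data = panel)   # one dummy per entity

# The slope on V1 is identical in both formulations.
c(within = unname(coef(within_fit)["V1_bar"]), dummy = unname(coef(dummy_fit)["V1"]))
```

The point estimates agree, but the naive standard errors of the demeaned lm do not account for the degrees of freedom used up by the entity means, which is one reason small-sample corrections differ between implementations.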
For your Stata and plm codes to match you must be using the same model.
You have two options: (1) xtset your data in Stata and use xtreg with the fe option, or (2) use plm with the pooling option and one dummy per ENTITY.
Matching Stata to R:
xtset entity year
xtreg y v1, fe robust
Matching plm to Stata:
plm(Y ~ V1 + as.factor(ENTITY), index=c("ENTITY","YEAR"), model="pooling", effect="individual", data=data)
Then use vcovHC with one of the type options. Make sure to check this paper, which has a nice review of the mechanics behind the "HC" options and the way they affect the variance-covariance matrix.
Hope this helps.