Subset variables by significant P value - r

I'm trying to subset variables by significant P-values, and I attempted with the following code, but it only selects all variables instead of selecting by condition. Could anyone help me to correct the problem?
myvars <- names(summary(backward_lm)$coefficients[,4] < 0.05)
happiness_reduced <- happiness_nomis[myvars]
Thanks!

An alternative to Martin's great answer (in the comments section), using the broom package. Unfortunately, you haven't posted any data, so I'm using the mtcars dataset as a demo:
library(broom)
# build model
m = lm(disp ~ ., data = mtcars)
# create a dataframe from the model's output
tm = tidy(m)
# visualise dataframe of the model
# (using non scientific notation of numbers)
options(scipen = 999)
tm
# term estimate std.error statistic p.value
# 1 (Intercept) -5.8119829 228.0609389 -0.02548434 0.97990925639
# 2 mpg 1.9398052 2.5976340 0.74675849 0.46348865035
# 3 cyl 15.3889587 12.1518291 1.26639032 0.21924091701
# 4 hp 0.6649525 0.2259928 2.94236093 0.00777972543
# 5 drat 8.8116809 19.7390767 0.44640796 0.65987184728
# 6 wt 86.7111730 16.1127236 5.38153418 0.00002448671
# 7 qsec -12.9742622 8.6227190 -1.50466021 0.14730421493
# 8 vs -12.1152075 25.2579953 -0.47965832 0.63642812949
# 9 am -7.9135864 25.6183932 -0.30890253 0.76043942893
# 10 gear 5.1265224 18.0578153 0.28389494 0.77927112074
# 11 carb -30.1067073 7.5513212 -3.98694566 0.00067029676
# get variables with p value less than 0.05
tm$term[tm$p.value < 0.05]
# [1] "hp" "wt" "carb"
The main advantage is that by obtaining the model's output as a dataframe you can use variable names, and not variable positions and row names, to manipulate the data.
I'm using options(scipen = 999) to make it easier to check that filtering works (i.e. not using the scientific notation of numbers in the dataframe).
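For reference, the code in the question returns every variable because names() applied to a logical vector returns the names of all its elements, whether they are TRUE or FALSE. A minimal base R sketch of the fix, using the same mtcars model as above (the object names p, sig and reduced are illustrative only):

```r
# Same demo model as above
m <- lm(disp ~ ., data = mtcars)

# Named vector of p-values (4th column of the coefficient table)
p <- summary(m)$coefficients[, 4]

# Use the logical vector as an index, then take the names
sig <- names(p)[p < 0.05]

# The intercept is not a column of the data, so drop it if present
sig <- setdiff(sig, "(Intercept)")

# Keep only the significant predictors
reduced <- mtcars[sig]
head(reduced)
```

With factor predictors the coefficient names (e.g. a factor level appended to the variable name) will not match column names directly, so this only works as written for numeric predictors.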

Related

Is there a R function or package to calculate the p value of a slope in a GLMM with interactions

Suppose I have a linear mixed model with two continuous variables and use contrast coding for two factors, each with two categories (A, B). A random effect is optional.
contrasts(data$fac1) <- c(-.5,.5)
contrasts(data$fac2) <- c(-.5,.5)
model<-lme(Y~x1+x2+x1:fac1+x2:fac1+x1:fac2+x2:fac2+fac1+fac2+fac1:fac2, random=~1|group,data)
then the output will give me the main effects for x1 and x2 and the difference between slopes for fac1 and fac2.
But how can I calculate individual p-values for, say, the slope of x1 when fac1=="A" and fac2=="B"?
Is there an R package for this, or do I have to calculate them manually?
And if so, how? By calling vcov(), adding up the respective matrix entries, and then calling pt() (and with which df)?
Thanks!
You could try the marginaleffects package. (Disclaimer: I am the author.)
There are many vignettes on the website, including one with simple examples of mixed effects models with the lme4 package: https://vincentarelbundock.github.io/marginaleffects/articles/lme4.html
You can specify the values of covariates using the newdata argument and the datagrid function. The covariates you do not specify in datagrid will be held at their means or modes:
library(lme4)
library(marginaleffects)
mod <- glmer(am ~ mpg * hp + (1 | gear),
             data = mtcars,
             family = binomial)
# Note: marginaleffects 0.9.0 and later rename marginaleffects() to slopes()
marginaleffects(mod, newdata = datagrid(hp = c(100, 110), gear = 4))
#> rowid type term dydx std.error statistic p.value
#> 1 1 response mpg 0.077446700 0.33253683 0.2328966 0.8158417
#> 2 2 response mpg 0.337725702 0.90506056 0.3731526 0.7090349
#> 3 1 response hp 0.006199167 0.02647471 0.2341543 0.8148652
#> 4 2 response hp 0.025604198 0.06770870 0.3781522 0.7053175
#> conf.low conf.high mpg hp gear
#> 1 -0.57431351 0.72920691 20.09062 100 4
#> 2 -1.43616041 2.11161181 20.09062 110 4
#> 3 -0.04569032 0.05808865 20.09062 100 4
#> 4 -0.10710242 0.15831082 20.09062 110 4
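The manual route mentioned in the question (vcov(), summing matrix entries, then pt()) can be sketched as follows. This is only an illustration on an ordinary lm fit to mtcars, not the lme model from the question: a conditional slope is a linear combination of coefficients, its variance is the matching quadratic form in vcov(), and for lm the residual degrees of freedom are the natural choice. For a mixed model the choice of df is less clear-cut and is left open here.

```r
# Illustrative model: slope of wt is allowed to differ by am (0/1)
fit <- lm(mpg ~ wt * am, data = mtcars)

b <- coef(fit)   # coefficient vector
V <- vcov(fit)   # covariance matrix of the coefficients

# Weights picking out the slope of wt when am == 1:
# beta_wt + 1 * beta_wt:am
w <- setNames(numeric(length(b)), names(b))
w[c("wt", "wt:am")] <- 1

est  <- sum(w * b)                      # point estimate of the slope
se   <- sqrt(drop(t(w) %*% V %*% w))    # standard error via w' V w
tval <- est / se
pval <- 2 * pt(abs(tval), df = df.residual(fit), lower.tail = FALSE)

c(estimate = est, std.error = se, statistic = tval, p.value = pval)
```

The same weights-vector idea carries over to fixef() and vcov() of a mixed model; only the degrees of freedom passed to pt() need a separate decision.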

A loop to create multiple data frames from a population data frame

Suppose I have a data frame called pop, and I wish to split this data frame by a categorical variable called replicate. This variable consists of 110 categories. I wish to perform an analysis on each resulting data frame and then combine the outputs into a new data frame. In other words, for replicate i I wish to create data frame i, perform a logistic regression on it, and save beta 0 for i. All the beta 0 values will then be combined into a table with the beta 0 for replicates 1-110.
I know that's a mouthful, but thanks in advance.
Since you didn't give any sample data I will use mtcars. You can use split to split a data.frame on a categorical variable. Combining this with map from the purrr package and tidy from the broom package, you can create a data frame with all the betas in one go.
So what happens is 1: split data.frame, 2: run regression model 3: tidy data to get the coefficients out and create a data.frame of the data.
You will need to adjust this to your data.frame and replicate variable. Broom can handle logistic regression so everything should work out.
library(purrr)
library(broom)
my_lms <- mtcars %>%
  split(.$cyl) %>%
  map(~ lm(mpg ~ wt, data = .x)) %>%
  map_dfr(~ tidy(.))
my_lms
term estimate std.error statistic p.value
1 (Intercept) 39.571196 4.3465820 9.103980 7.771511e-06
2 wt -5.647025 1.8501185 -3.052251 1.374278e-02
3 (Intercept) 28.408845 4.1843688 6.789278 1.054844e-03
4 wt -2.780106 1.3349173 -2.082605 9.175766e-02
5 (Intercept) 23.868029 3.0054619 7.941551 4.052705e-06
6 wt -2.192438 0.7392393 -2.965803 1.179281e-02
EDIT
my_lms <- lapply(split(mtcars, mtcars$cyl), function(x) lm(mpg ~ wt, data = x))
my_coefs <- as.data.frame(sapply(my_lms, coef))
my_coefs
4 6 8
(Intercept) 39.571196 28.408845 23.868029
wt -5.647025 -2.780106 -2.192438
# Or transpose the coefficients if you want column results.
t(my_coefs)
(Intercept) wt
4 39.57120 -5.647025
6 28.40884 -2.780106
8 23.86803 -2.192438
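The examples above use lm on mtcars; the question asks for a logistic regression per replicate with only the intercept kept. A minimal base R sketch of that variant follows. The data frame pop, its columns replicate, x and y, and the simulated values are all made up for illustration, since no data was posted:

```r
# Hypothetical data: 5 replicates of 200 rows each with a binary outcome
set.seed(1)
pop <- data.frame(replicate = rep(1:5, each = 200),
                  x = rnorm(1000))
pop$y <- rbinom(1000, 1, plogis(-0.5 + pop$x))

# Split by replicate, fit a logistic regression to each piece,
# and keep only the intercept (beta 0)
beta0 <- sapply(split(pop, pop$replicate), function(d) {
  coef(glm(y ~ x, family = binomial, data = d))[["(Intercept)"]]
})

# One row per replicate
result <- data.frame(replicate = names(beta0), beta0 = unname(beta0))
result
```

With real data, replace the simulated pop and the formula y ~ x with the actual data frame and model.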

How can I make all-possible-regressions in R also include exponents and logs of the variables?

I primarily work in research and statistics and so am not as familiar with programming. I'm using the olsrr package for statistical analysis when trying to compare as many model specifications as possible using all-possible-regressions.
I use the code:
model <- lm(y ~ ., data = mydata)
k <- ols_all_subset(model)
k
So far, this gives me a table with R2, adjusted R2, AIC, SIC etc. for each combination of variables as is (linear). For example, if the variables are x1, x2 and x3, it gives me a table with R2, AIC, SIC etc. for every possible linear specification: with x1, x2 and x3 as regressors, with x1 and x2, with x1 and x3, with x2 and x3, and each of just x1, just x2, and just x3.
I want to also get all possibles for squares and logs of the variables to look at every possible major specification. So I don't get just those variables but also x1^2, log(x1), log(x3), and so on. How should I modify what I'm doing so I can get all-possibles that include possible exponential and possible logarithmic specifications as well in my output table?
I know I could individually create a new column and generate each x1^2, log(x1) etc individually as a new column, but sometimes I have dozens of variables and tons of data so doing every single variable individually each time for each new dataset is a pain.
Here is a suggestion on how to automate the process:
You don't provide any sample data, so I'm going to use mtcars:
df <- mtcars[, c(1,5:7)];
Here, the response is column 1, and I consider 3 predictor variables in columns 5-7.
The goal is to automatically build all relevant terms and use those to construct a formula, which we then use in lm.
Build all predictor terms:
# All but the first column are predictors
terms <- sapply(colnames(df)[-1], function(x)
c(x, sprintf("I(%s^2)", x), sprintf("log(%s)", x)));
terms;
# drat wt qsec
#[1,] "drat" "wt" "qsec"
#[2,] "I(drat^2)" "I(wt^2)" "I(qsec^2)"
#[3,] "log(drat)" "log(wt)" "log(qsec)"
Construct the formula expression as a string.
exprs <- sprintf("%s ~ %s", colnames(df)[1], paste(terms, collapse = "+"));
exprs;
# [1] "mpg ~ drat+I(drat^2)+log(drat)+wt+I(wt^2)+log(wt)+qsec+I(qsec^2)+log(qsec)"
Run the linear model.
model <- lm(as.formula(exprs), data = df);
Re-fit model using all combinations of predictor variables.
library(olsrr); # note: olsrr 0.5.0 and later rename ols_all_subset() to ols_step_all_possible()
k <- ols_all_subset(model);
k;
# # A tibble: 511 x 6
# Index N Predictors `R-Square` `Adj. R-Square` `Mallow's Cp`
# <int> <int> <chr> <chr> <chr> <chr>
# 1 1 1 log(wt) 0.81015 0.80382 9.73408
# 2 2 1 wt 0.75283 0.74459 21.12525
# 3 3 1 I(wt^2) 0.64232 0.63040 43.08976
# 4 4 1 I(drat^2) 0.46694 0.44917 77.94664
# 5 5 1 drat 0.46400 0.44613 78.53263
# 6 6 1 log(drat) 0.45406 0.43587 80.50653
# 7 7 1 log(qsec) 0.17774 0.15033 135.42760
# 8 8 1 qsec 0.17530 0.14781 135.91242
# 9 9 1 I(qsec^2) 0.17005 0.14239 136.95476
#10 10 2 log(wt) log(qsec) 0.87935 0.87103 -2.02118
## ... with 501 more rows
A few comments:
This very quickly gets very computationally intensive.
The y ~ . formula syntax is very concise, but I haven't found a way to include e.g. quadratic terms. On one hand, y ~ . + (.)^2 works for including all interaction terms; on the other hand, y ~ . + I(.^2) does not work for quadratic terms. That's why I think building terms manually is the way to go.
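To put the first comment in numbers: with p original predictors and three candidate terms each (raw, squared, log), there are 3p terms and therefore 2^(3p) - 1 non-empty subsets to fit. For the three predictors used above this gives the 511 rows shown in the tibble. A small sketch (n_models is a helper name introduced here for illustration):

```r
# Number of non-empty subsets of p * k candidate terms
n_models <- function(p, k = 3) 2^(p * k) - 1

n_models(3)   # 511, matching the table above
n_models(10)  # 1073741823 models for 10 predictors
```

So with dozens of variables an exhaustive search is not feasible, and some pre-selection of terms is needed before calling the all-subsets routine.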

R adding regression coefficients to data frame

I have a list of dataframes that contains many subsets of data (470ish). I am trying to run a regression on each of them and add the regression coefficients to a dataframe. The dataframe will contain the coefficients for all dependent variables on each subgroup. I tried iterating with a for loop but obviously that is not the right way. I think the solution has something to do with lapply?
for (i in ListOfTraining){
lm(JOB_VOLUME ~ FEB+MAR+APR+MAY+JUN+JUL+AUG+SEP+OCT+NOV+DEC data=ListOfTraining[[i]])
}
Thanks for any advice!
The function tidy from package broom handles this nicely.
library(dplyr) # bind_rows is more efficient than do.call(rbind, ...)
library(broom) # put statistics into data.frame
bind_rows(lapply(ListOfTraining, function(dat)
  tidy(lm(JOB_VOLUME ~ FEB+MAR+APR+MAY+JUN+JUL+AUG+SEP+OCT+NOV+DEC, data=dat))))
Example
dataList <- split(mtcars, mtcars$cyl) # list of data.frames by number of cylinders
lapply(dataList, function(dat) tidy(lm(mpg ~ disp + hp, data=dat))) %>% # fit models
bind_rows() %>% # combine into one data.frame
mutate(model=rep(1:length(dataList), each=3)) # add a model ID column
# term estimate std.error statistic p.value model
# 1 (Intercept) 43.040057552 4.235724713 10.16120274 7.531962e-06 1
# 2 disp -0.119536016 0.036945788 -3.23544366 1.195900e-02 1
# 3 hp -0.046091563 0.047423668 -0.97191054 3.595602e-01 1
# 4 (Intercept) 20.151209478 6.938235241 2.90437104 4.392508e-02 2
# 5 disp 0.001796527 0.020195109 0.08895852 9.333909e-01 2
# 6 hp -0.006032441 0.034597750 -0.17435935 8.700522e-01 2
# 7 (Intercept) 24.044775630 4.045729006 5.94324919 9.686231e-05 3
# 8 disp -0.018627566 0.009456903 -1.96973225 7.456584e-02 3
# 9 hp -0.011315585 0.012572498 -0.90002676 3.873854e-01 3
Alternatively, you could bind the data.frames beforehand, assuming they have the same columns, and then fit the models using lmList from the nlme package.
## Combine list of data.frames into one data.frame with a factor variable
lengths <- sapply(dataList, nrow) # in case data.frames have different num. rows
dat <- dataList %>% bind_rows() %>%
mutate(group=rep(1:length(dataList), times=lengths)) # group id column
library(nlme) # lmList()
models <- lmList(mpg ~ disp + hp | group, data=dat) # make models, grouped by group
coef(models) # coefficient table, one row per group
# (Intercept) disp hp
# 1 43.04006 -0.119536016 -0.046091563
# 2 20.15121 0.001796527 -0.006032441
# 3 24.04478 -0.018627566 -0.011315585
You can solve this using the for loop, if you prefer. Your problem is that the results aren't being saved to an object as the loop progresses. You can see the below for an example using the built-in mtcars dataframe.
(This first example is revised based on OP's request for an example of how to also extract the R squared value.)
ListOfTraining <- list(mtcars, mtcars)
results <- list()
for (i in seq_along(ListOfTraining)) {
  lm_obj <- lm(disp ~ qsec, data = ListOfTraining[[i]])
  tmp <- c(lm_obj$coefficients, summary(lm_obj)$r.squared)
  names(tmp)[length(tmp)] <- "r.squared"
  results[[i]] <- tmp
}
results <- do.call(rbind, results)
results
You can also rewrite the for loop using lapply as demoed below.
ListOfTraining <- list(mtcars, mtcars)
results <- lapply(ListOfTraining, function(x) {
  lm(disp ~ qsec, data = x)$coefficients
})
results <- do.call(rbind, results)
results
Finally, you can use the plyr package's ldply function which will convert the list applied outputs into a dataframe automatically (if possible).
ListOfTraining <- list(mtcars, mtcars)
results <- plyr::ldply(ListOfTraining, function(x) {
lm(disp ~ qsec, data = x)$coefficients
})
results
Your current code runs the regression, but does not do anything with the results (inside of a loop they are not even autoprinted), so they are just discarded. You need to have some structure to save the results into.
The following code will create a matrix of coefficients (assuming that all the regressions run without error and the number of final coefficients is the same):
my.coef <- sapply(ListOfTraining, function(dat) {
  coef(lm(JOB_VOLUME ~ FEB+MAR+APR+MAY+JUN+JUL+AUG+SEP+OCT+NOV+DEC,
          data = dat))
})
The matrix can then be converted to a data frame (you could also use lapply and convert to a data frame, but I think the sapply option is probably a little simpler).
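The conversion step can be sketched as follows. Because sapply returns one column per data frame, the matrix is transposed first so that each regression becomes a row. The example reuses the mtcars split by cyl from the earlier answer instead of ListOfTraining, and coef_df is an illustrative name:

```r
# Stand-in for ListOfTraining: mtcars split by number of cylinders
dataList <- split(mtcars, mtcars$cyl)

# Matrix of coefficients: rows are terms, columns are data frames
my.coef <- sapply(dataList, function(dat) {
  coef(lm(mpg ~ disp + hp, data = dat))
})

# Transpose so each regression is a row, then convert
coef_df <- as.data.frame(t(my.coef))
coef_df$group <- rownames(coef_df)  # keep the list names as a column
coef_df
```

If ListOfTraining has no names, the row names will be missing and a group column can be built from seq_along(ListOfTraining) instead.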

Using R to do a regression with multiple dependent and multiple independent variables

I am trying to do a regression with multiple dependent variables and multiple independent variables. Basically I have House Prices at a county level for the whole US, this is my IV. I then have several other variables at a county level (GDP, construction employment), these constitute my dependent variables. I would like to know if there is an efficient way to do all of these regressions at the same time. I am trying to get:
lm(IV1 ~ DV11 + DV21)
lm(IV2 ~ DV12 + DV22)
I would like to do this for each independent and each dependent variable.
EDIT: The OP added this information in response to my answer, now deleted, which misunderstood the question.
I don't think I explained this question very well, I apologize. Every dependent variable has 2 independent variables associated with it that are unique. So if I have 500 dependent variables, I have 500 unique independent variable 1, and 500 unique independent variable 2.
Ok, I will try once more, if I fail to explain myself again I may just give up (haha). I don't know what you mean by mtcars from R though [this is in reference to Metrics's answer], so let me try it this way. I'm going to have 3 vectors of data roughly 500 rows in each one. I'm trying to build a regression out of each row of data. Let's say vector 1 is my dependent variable (the one I'm trying to predict), and vectors 2 and 3 make up my independent variables. So the first regression would consist of the row 1 value for each vector, the 2nd would consist of the row 2 value for each one and so on. Thank you all again.
I am assuming your data frame is called mydata.
mydata<-mtcars #mtcars is the data in R
dep<-c("mpg~","cyl~","disp~") # list of unique dependent variables with ~
indep1<-c("hp","drat","wt") # list of first unique independent variables
indep2<-c("qsec","vs","am") # list of second unique independent variables
myvar<-cbind(dep,indep1,indep2) # matrix of variables
myvar
dep indep1 indep2
[1,] "mpg~" "hp" "qsec"
[2,] "cyl~" "drat" "vs"
[3,] "disp~" "wt" "am"
k <- list() # initialise a list to hold the fitted models
for (i in 1:dim(myvar)[1]){
  print(paste("This is", i, "regression", "with dependent var",gsub("~","",myvar[i,1])))
  k[[i]]<-lm(as.formula(paste(myvar[i,1],paste(myvar[i,2:3],collapse="+"))),mydata)
  print(k[[i]])
}
[1] "This is 1 regression with dependent var mpg"
Call:
lm(formula = as.formula(paste(myvar[i, 1], paste(myvar[i, 2:3],
collapse = "+"))), data = mydata)
Coefficients:
(Intercept) hp qsec
48.32371 -0.08459 -0.88658
[1] "This is 2 regression with dependent var cyl"
Call:
lm(formula = as.formula(paste(myvar[i, 1], paste(myvar[i, 2:3],
collapse = "+"))), data = mydata)
Coefficients:
(Intercept) drat vs
12.265 -1.421 -2.209
[1] "This is 3 regression with dependent var disp"
Call:
lm(formula = as.formula(paste(myvar[i, 1], paste(myvar[i, 2:3],
collapse = "+"))), data = mydata)
Coefficients:
(Intercept) wt am
-148.59 116.47 11.31
Note: You can use the same process for a large number of variables.
Alternative approach:
Motivated by Hadley's answer here, I use the Map function to solve the above problem:
dep<-list("mpg~","cyl~","disp~") # list of unique dependent variables with ~
indep1<-list("hp","drat","wt") # list of first unique independent variables
indep2<-list("qsec","vs","am") # list of second unique independent variables
Map(function(x,y,z) lm(as.formula(paste(x,paste(list(y,z),collapse="+"))),data=mtcars),dep,indep1,indep2)
[[1]]
Call:
lm(formula = as.formula(paste(x, paste(list(y, z), collapse = "+"))),
data = mtcars)
Coefficients:
(Intercept) hp qsec
48.32371 -0.08459 -0.88658
[[2]]
Call:
lm(formula = as.formula(paste(x, paste(list(y, z), collapse = "+"))),
data = mtcars)
Coefficients:
(Intercept) drat vs
12.265 -1.421 -2.209
[[3]]
Call:
lm(formula = as.formula(paste(x, paste(list(y, z), collapse = "+"))),
data = mtcars)
Coefficients:
(Intercept) wt am
-148.59 116.47 11.31
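A variation on the Map approach, sketched here as one possible tidy-up: stats::reformulate() builds the formula from character vectors, so the "~" does not have to be pasted into the variable names. The variable groupings are the same mtcars stand-ins used above, and fits and coefs are illustrative names:

```r
dep    <- c("mpg", "cyl", "disp")   # dependent variables
indep1 <- c("hp", "drat", "wt")     # first independent variable for each
indep2 <- c("qsec", "vs", "am")     # second independent variable for each

# One model per (dep, indep1, indep2) triple; the result is a list
# named after the dependent variables
fits <- Map(function(y, x1, x2) {
  lm(reformulate(c(x1, x2), response = y), data = mtcars)
}, dep, indep1, indep2)

# Coefficients of every model, kept as a list because the
# term names differ between models
coefs <- lapply(fits, coef)
coefs
```

Storing the fits in a named list makes it easy to pull out one model later, for example summary(fits$mpg).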

Resources