Coerce model coefficients to clean, 2-column dataframe - r

I am fitting an elastic net with cross-validation and I am looking at how big the coefficients are for each predictor:
lambda <- cv.glmnet(x = features_training, y = outcomes_training, alpha = 0)
elnet <- lambda$glmnet.fit
coefs <- coef(elnet, s = lambda$lambda.min, digits = 3)
The coefs variable contains a dgCMatrix:
1
(Intercept) -1.386936e-16
ret 4.652863e-02
ind30 -2.419878e-03
spyvol 1.570406e-02
Is there a quick way to turn this into a dataframe with 2 columns (one for the predictor name and the other for the coefficient value)? as.data.frame, as.matrix or chaining both did not work. I would notably like to sort the rows according to the second column.

broom::tidy has a nice method for coercing dgCMatrix objects to long-form data frames (a bit like as.data.frame.table), which works well here:
mod <- glmnet::cv.glmnet(model.matrix(~ ., mtcars[-1]), mtcars$mpg, alpha = 0)
broom::tidy(coef(mod$glmnet.fit, s = mod$lambda.min, digits = 3))
#> row column value
#> 1 (Intercept) 1 21.171285892
#> 2 cyl 1 -0.368057153
#> 3 disp 1 -0.005179902
#> 4 hp 1 -0.011713150
#> 5 drat 1 1.053216800
#> 6 wt 1 -1.264212476
#> 7 qsec 1 0.164975032
#> 8 vs 1 0.756163432
#> 9 am 1 1.655635460
#> 10 gear 1 0.546651086
#> 11 carb 1 -0.559817882

Another way, without any hacks through the attributes() function, is to extract the row names and the matrix values directly. attributes(class(coefs)) reports that dgCMatrix is a sparse matrix class defined in the Matrix package.
data.frame( predict_names = rownames(coefs),
coef_vals = matrix(coefs))
# predict_names coef_vals
# 1 (Intercept) 21.117339411
# 2 (Intercept) 0.000000000
# 3 cyl -0.371338786
# 4 disp -0.005254534
# 5 hp -0.011613216
# 6 drat 1.054768651
# 7 wt -1.234201216
# 8 qsec 0.162451314
# 9 vs 0.771959823
# 10 am 1.623812912
# 11 gear 0.544171362
# 12 carb -0.547415029
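Since the question also asks for sorting by the coefficient, here is a minimal base R sketch of that last step. It assumes the Matrix package is available (glmnet depends on it) and rebuilds the four coefficients printed in the question by hand instead of refitting the model; the names coef_df, predictor and coefficient are illustrative only.

```r
library(Matrix)

# Stand-in for coef(elnet, s = lambda$lambda.min): a 4 x 1 sparse matrix
# holding the values printed in the question.
coefs <- Matrix(
  c(-1.386936e-16, 4.652863e-02, -2.419878e-03, 1.570406e-02),
  nrow = 4, ncol = 1, sparse = TRUE,
  dimnames = list(c("(Intercept)", "ret", "ind30", "spyvol"), "1")
)

# Densify the single column and pair it with the row names.
coef_df <- data.frame(
  predictor   = rownames(coefs),
  coefficient = as.numeric(as.matrix(coefs))
)

# Sort rows by the second column, largest coefficient first.
coef_df <- coef_df[order(coef_df$coefficient, decreasing = TRUE), ]
rownames(coef_df) <- NULL
coef_df
```

Dropping decreasing = TRUE sorts in ascending order; ordering by abs(coef_df$coefficient) would rank predictors by magnitude instead.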

Related

fit_xy() usage for cross_validation in Tidy Models

I am new to tidymodels and like it so far, but I have a question about using a non-formula interface for resampling/cross-validation. As I understand it so far, in order to apply resampling/cross-validation, I should write a
recipe with a formula: outcome ~ predictors
rf_rec <-
recipe(y_graduated ~ .,
data = trainDat_predSet)
specify a model
# Setting Random Forest Model Specifications
rf_model <-
rand_forest() %>%
set_engine("ranger") %>%
set_mode("classification") %>%
set_args(mtry = 3,
trees = 50,
min_n = 5)
create folds
set.seed(1234)
trainDatFolds <-
rsample::vfold_cv(data = trainDat, v = 5)
put recipe and model specification in a workflow
rf_workflow <-
workflow() %>%
add_recipe(rf_rec) %>%
add_model(rf_model)
Then fit the resampling.
rf_workflow %>%
fit_resamples(resamples = trainDatFolds,
metrics = metric_set(roc_auc, pr_auc, accuracy),
control = control_resamples(save_pred = TRUE)
)
For my purposes, it would be far more convenient to use a non-formula interface instead of outcome ~ predictors.
Without the recipe step, and if I were not resampling, I could simply use
fit_xy() to specify the outcome (y) and the predictor set (x).
Is that an option when fitting with resampling?
Thanks a lot!
There is no x/y interface for resampling, but there is an easy way to get there without a formula: create the recipe from the data alone, then assign the roles yourself.
library(recipes)
rec <- recipe(mtcars)
summary(rec)
#> # A tibble: 11 x 4
#> variable type role source
#> <chr> <chr> <lgl> <chr>
#> 1 mpg numeric NA original
#> 2 cyl numeric NA original
#> 3 disp numeric NA original
#> 4 hp numeric NA original
#> 5 drat numeric NA original
#> 6 wt numeric NA original
#> 7 qsec numeric NA original
#> 8 vs numeric NA original
#> 9 am numeric NA original
#> 10 gear numeric NA original
#> 11 carb numeric NA original
# now add roles
rec <-
rec %>%
update_role(mpg, new_role = "outcome") %>%
update_role(-mpg, new_role = "predictor")
summary(rec)
#> # A tibble: 11 x 4
#> variable type role source
#> <chr> <chr> <chr> <chr>
#> 1 mpg numeric outcome original
#> 2 cyl numeric predictor original
#> 3 disp numeric predictor original
#> 4 hp numeric predictor original
#> 5 drat numeric predictor original
#> 6 wt numeric predictor original
#> 7 qsec numeric predictor original
#> 8 vs numeric predictor original
#> 9 am numeric predictor original
#> 10 gear numeric predictor original
#> 11 carb numeric predictor original
Created on 2020-11-06 by the reprex package (v0.3.0)
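
To connect this back to the resampling question, the sketch below drops the role-based recipe into a workflow and passes it to fit_resamples(). It is only an illustration under stated assumptions: it uses mtcars, and a plain linear regression stands in for the random forest from the question so that no extra engine package is needed; the object names lm_spec, wf, folds and res are made up for this example.

```r
library(tidymodels)

# Recipe built from the data alone; roles replace the formula.
rec <- recipe(mtcars) %>%
  update_role(mpg, new_role = "outcome") %>%
  update_role(-mpg, new_role = "predictor")

# A linear model stands in for the random forest in the question.
lm_spec <- linear_reg() %>%
  set_engine("lm")

wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(lm_spec)

set.seed(1234)
folds <- vfold_cv(mtcars, v = 5)

# Resampling works as usual because the recipe already knows the roles.
res <- fit_resamples(wf, resamples = folds)
collect_metrics(res)
```

Swapping lm_spec for the rf_model specification from the question should work the same way, provided the outcome is a factor and the ranger package is installed.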

How to stop duplicate correlations in table of Spearman Rho correlations in R?

I used the following code to make a table of Spearman Rho correlations for a CSV file with 2+ columns in R:
myDataset <- read.csv(file.choose())
attach(myDataset)
spearmanRhoTestData <- cor(myDataset, use = "complete.obs", method = "spearman")
However, in my table (spearmanRhoTestData), the correlation between any two variables will show up twice (like the following):
Var1 Var2 Var3 Var4
Var1 1 0.5 0.7 0.9
Var2 0.5 1 0.3 0.6
Var3 0.7 0.3 1 0.2
Var4 0.9 0.6 0.2 1
Is there any way to write code in R so that each correlation value (e.g. between Var1 and Var2) shows up only once in the table?
The simplest approach, assuming you want to keep the correlation matrix format, is to blank out one of the two triangles:
# set upper triangle values to NA
spearmanRhoTestData[upper.tri(spearmanRhoTestData)] = NA
# visualise updated matrix
spearmanRhoTestData
Here's an alternative approach, using the corrr package, which will give you a reshaped correlation dataframe, without duplicates:
library(corrr)
# get correlation matrix
tbl = correlate(mtcars)
# set upper triangle values to NA
tbl[upper.tri(tbl)] = NA
# reshape and omit NAs
stretch(tbl, na.rm = TRUE)
# # A tibble: 55 x 3
# x y r
# <chr> <chr> <dbl>
# 1 mpg cyl -0.852
# 2 mpg disp -0.848
# 3 mpg hp -0.776
# 4 mpg drat 0.681
# 5 mpg wt -0.868
# 6 mpg qsec 0.419
# 7 mpg vs 0.664
# 8 mpg am 0.600
# 9 mpg gear 0.480
# 10 mpg carb -0.551
# # ... with 45 more rows
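
If an extra package is not wanted, the same long, duplicate-free table can be sketched in base R by indexing the lower triangle. This is a minimal illustration on four mtcars columns rather than the asker's CSV; the names rho, keep and pairs_df are made up for the example.

```r
# Spearman correlation matrix on a small stand-in data set.
rho <- cor(mtcars[, c("mpg", "cyl", "disp", "hp")],
           method = "spearman", use = "complete.obs")

# Row/column positions of the cells strictly below the diagonal,
# so each pair of variables appears exactly once.
keep <- which(lower.tri(rho), arr.ind = TRUE)

pairs_df <- data.frame(
  var1 = rownames(rho)[keep[, "row"]],
  var2 = colnames(rho)[keep[, "col"]],
  rho  = rho[keep]
)
pairs_df
```

With k variables this yields k * (k - 1) / 2 rows, one per unordered pair, and the diagonal of 1s is dropped.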

lapply and for loop to run a function through a list of data.frames in R

I have a list of data.frame and I'd like to run cor.test through each data.frame.
The data.frame has 8 columns, I would like to run cor.test for each of the first 7 columns against the 8th column.
I first set up the lists for storing the data
estimates = list()
pvalues = list()
Then here's the loop combining with lapply
for (i in 1:7){
corr <- lapply(datalist, function(x) {cor.test(x[,i], x[,8], alternative="two-sided", method="spearman", exact=FALSE, continuity=TRUE)})
estimates= corr$estimate
pvalues= corr$p.value
}
It ran without any errors but the estimates shows NULL
Which part of this went wrong? I used to run for loop over cor.test or run is with lapply, never put them together. I wonder if there's a solution to this or an alternative. Thank you.
We can use sapply, shown here with an example on mtcars where cor.test is run for every other column against the 8th column (vs).
lst <- list(mtcars, mtcars)
lapply(lst, function(x) t(sapply(x[-8], function(y) {
val <- cor.test(y, x[[8]], alternative ="two.sided",
method="spearman", exact=FALSE, continuity=TRUE)
c(val$estimate, pval = val$p.value)
})))
#[[1]]
# rho pval
#mpg 0.7065968 6.176953e-06
#cyl -0.8137890 1.520674e-08
#disp -0.7236643 2.906504e-06
#hp -0.7515934 7.247490e-07
#drat 0.4474575 1.021422e-02
#wt -0.5870162 4.163577e-04
#qsec 0.7915715 6.843882e-08
#am 0.1683451 3.566025e-01
#gear 0.2826617 1.168159e-01
#carb -0.6336948 9.977275e-05
#[[2]]
# rho pval
#mpg 0.7065968 6.176953e-06
#cyl -0.8137890 1.520674e-08
#.....
This returns a list of two-column matrices, one per data frame, holding the estimate (rho) and the p-value for each column.
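As for why the loop in the question yields NULL: corr is a list of cor.test results (one per data frame), so corr$estimate looks for a list element named estimate, which does not exist. In addition, estimates and pvalues are overwritten on every iteration. A minimal corrected sketch of that loop follows; it uses the first eight columns of mtcars as stand-in data and spells the alternative as "two.sided", the value cor.test documents.

```r
# Stand-in for the asker's list of 8-column data frames.
datalist <- list(mtcars[1:8], mtcars[1:8])

estimates <- list()
pvalues <- list()

for (i in 1:7) {
  # One cor.test result per data frame: column i against column 8.
  corr <- lapply(datalist, function(x) {
    cor.test(x[, i], x[, 8], alternative = "two.sided",
             method = "spearman", exact = FALSE, continuity = TRUE)
  })
  # Pull the statistic out of each result and store it under index i.
  estimates[[i]] <- vapply(corr, function(ct) unname(ct$estimate), numeric(1))
  pvalues[[i]] <- vapply(corr, function(ct) ct$p.value, numeric(1))
}

estimates[[1]]
```

Each element of estimates and pvalues is then a numeric vector with one entry per data frame in datalist.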
Disclaimer: This answer uses the developer version of manymodelr that I also wrote.
EDIT: You can map it to your list of data frames with Map or lapply for instance:
lst <- list(mtcars, mtcars) #Line copied and pasted from #Ronak Shah's answer
Map(function(x) manymodelr::get_var_corr(x, "mpg",get_all = TRUE,
alternative="two.sided",
method="spearman",
continuity=TRUE,exact=FALSE),lst)
For a single data.frame object, we can use get_var_corr:
manymodelr::get_var_corr(mtcars, "mpg",get_all = TRUE,
alternative="two.sided",
method="spearman",
continuity=TRUE,exact=FALSE)
# Comparison_Var Other_Var p.value Correlation
# 1 mpg cyl 4.962301e-13 -0.9108013
# 2 mpg disp 6.731078e-13 -0.9088824
# 3 mpg hp 5.330559e-12 -0.8946646
# 4 mpg drat 5.369227e-05 0.6514555
# 5 mpg wt 1.553261e-11 -0.8864220
# 6 mpg qsec 7.042244e-03 0.4669358
# 7 mpg vs 6.176953e-06 0.7065968
# 8 mpg am 8.139885e-04 0.5620057
# 9 mpg gear 1.325942e-03 0.5427816
# 10 mpg carb 4.385340e-05 -0.6574976
purrr has some convenience functions that could make this operation a little simpler (although it's debatable whether this is actually simpler than the Map/lapply way). Using Ronak's example list lst:
library(purrr)
library(magrittr) # provides extract(), used below
lst <- list(mtcars, mtcars)
map2(map(lst, ~.[-8]), map(lst, 8), ~
map(.x, cor.test, y = .y,
alternative = "two.sided",
method = "spearman",
exact = FALSE,
continuity = TRUE) %>%
map_dfr(extract, c('estimate', 'p.value'), .id = 'var'))
# [[1]]
# # A tibble: 10 x 3
# var estimate p.value
# <chr> <dbl> <dbl>
# 1 mpg 0.707 0.00000618
# 2 cyl -0.814 0.0000000152
# 3 disp -0.724 0.00000291
# 4 hp -0.752 0.000000725
# 5 drat 0.447 0.0102
# 6 wt -0.587 0.000416
# 7 qsec 0.792 0.0000000684
# 8 am 0.168 0.357
# 9 gear 0.283 0.117
# 10 carb -0.634 0.0000998
#
# [[2]]
# # A tibble: 10 x 3
# var estimate p.value
# <chr> <dbl> <dbl>
# 1 mpg 0.707 0.00000618
# 2 cyl -0.814 0.0000000152
# 3 disp -0.724 0.00000291
# 4 hp -0.752 0.000000725
# 5 drat 0.447 0.0102
# 6 wt -0.587 0.000416
# 7 qsec 0.792 0.0000000684
# 8 am 0.168 0.357
# 9 gear 0.283 0.117
# 10 carb -0.634 0.0000998

Mean with condition for multiple columns in r

Let's use mtcars to explain the situation.
What I want to do is the same as below, but for multiple columns: get the mean of one column (qsec in the example) for the rows where another column has a specific value (4 and 6 in the example below). I'll compare the results later, so I would probably store them in a vector.
table(mtcars$cyl)
4 6 8
11 7 14
mean(mtcars$qsec[mtcars$cyl == 4], na.rm = T)
mean(mtcars$qsec[mtcars$gear == 4], na.rm = T)
I would like to check the means of qsec by cyl and, say, gear and carb, using the same pattern for the mean, i.e. the mean over observations equal to 4 and the mean over observations equal to 6. In the real dataset there are several columns that share the same set of values (2, 0 and 1), and I'll compare the means of one column (qsec in the example) for observations equal to 2 and 0.
I've tried to look at the functions like tapply, apply, sapply. But I'm stuck in having the condition in the mean applying for every column (at once).
Hope I made myself clear.
Thank you!
The function you are looking for is aggregate:
aggregate(. ~ cyl, FUN=mean, data=mtcars[,c("cyl", "qsec", "gear", "carb")],
subset=cyl %in% c(4, 6)
)
cyl qsec gear carb
1 4 19.13727 4.090909 1.545455
2 6 17.97714 3.857143 3.428571
In the function above data= is the data.frame. Here we only selected the wanted columns. And the subset= specifies which rows of the data to keep (in this case only cyl 4 and 6).
The formula . ~ cyl instructs to summarise all columns according to the cyl column.
A data.table solution (mtcars is a plain data.frame, so it has to be converted first):
library(data.table)
dt <- as.data.table(mtcars)
dt[cyl %in% c(4, 6), .(mn_qsec = mean(qsec),
mn_gear = mean(gear),
mn_carb = mean(carb)),
by = cyl]
What I understand you're looking for is the mean of qsec for each level of cyl, gear, and carb separately, not in combination. This code gets you that, but doesn't directly let you select specific levels of those factors. If you need to be able to do that second part, I think you should be able to tweak this to get there, but I'm not sure how...
apply(mtcars[,c("cyl","gear","carb")], 2, function(x) {
aggregate(mtcars[,"qsec"],list(x),mean)
})
Output:
$cyl
Group.1 x
1 4 19.13727
2 6 17.97714
3 8 16.77214
$gear
Group.1 x
1 3 17.692
2 4 18.965
3 5 15.640
$carb
Group.1 x
1 1 19.50714
2 2 18.18600
3 3 17.66667
4 4 16.96500
5 6 15.50000
6 8 14.60000
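
One way to also restrict the result to specific levels, as the question asks, is to loop over the grouping columns and the wanted values explicitly. This is only a sketch in base R; group_cols, levels_wanted and means are illustrative names, and the values 4 and 6 come from the question.

```r
group_cols <- c("cyl", "gear", "carb")
levels_wanted <- c(4, 6)

# For each grouping column, take the mean of qsec over the rows
# where that column equals each wanted value.
means <- sapply(group_cols, function(g) {
  sapply(levels_wanted, function(v) {
    mean(mtcars$qsec[mtcars[[g]] == v], na.rm = TRUE)
  })
})
rownames(means) <- levels_wanted
means
```

The result is a matrix with one row per value and one column per grouping variable. A combination with no matching rows (gear is never 6 in mtcars) comes back as NaN rather than raising an error.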
One option is to use dplyr::summarise_at, since the OP wants to apply the same function to multiple columns:
library(dplyr)
mtcars %>%
group_by(cyl) %>%
summarise_at(vars(qsec, gear, carb), mean, na.rm = TRUE) %>%
filter(cyl != 8)
# # A tibble: 2 x 4
# cyl qsec gear carb
# <dbl> <dbl> <dbl> <dbl>
# 1 4.00 19.1 4.09 1.55
# 2 6.00 18.0 3.86 3.43

loop or apply multiple regressions, extract coefficients and p-values into data frame

I have a data frame with 3 dependent (LHS) variables and 4 independent (RHS) variables. I'd like to run a linear regression of each LHS variable on each RHS variable and store the results of each regression as a row in a data frame with the columns: lhs, rhs, Estimate, Std. Error, t value, Pr(>|t|).
For example, using mtcars, I considered a nested loop:
lhs <- c('mpg', 'cyl', 'disp')
rhs <- c('hp', 'drat', 'wt', 'qsec')
reg_count <- 1
for (i in lhs){
for (j in rhs){
model <- lm(i ~ j, data = mtcars)
results[reg_count] <- coef(summary(model))
reg_count <- reg_count + 1
}
}
However, this fails for a number of reasons. Is there a simple way I can do this? Ideally using an apply() function rather than a loop?
Here's how I would do it. I shortened your example a little, but that won't matter:
lhs <- c('mpg', 'cyl', 'disp')
rhs <- c('hp', 'drat')
models = list()
for (i in lhs){
for (j in rhs){
models[[paste(i, "vs", j)]] <- lm(as.formula(paste(i, "~", j)), data = mtcars)
}
}
If you want to use apply, you'll need to start with a matrix. The difference in runtime will be negligible.
# with apply:
coefs_mat = expand.grid(lhs, rhs)
mods = apply(coefs_mat, 1, function(row) {
lm(as.formula(paste(row[1], "~", row[2])), data = mtcars)
})
names(mods) = with(coefs_mat, paste(Var1, "vs", Var2))
Both methods give the same results. Now we can pull the coefficients, etc. with broom::tidy
# get coefs
library(broom)
coefs = lapply(mods, tidy)
# combine
dplyr::bind_rows(coefs, .id = "mod")
# mod term estimate std.error statistic p.value
# 1 mpg vs hp (Intercept) 30.09886054 1.633921e+00 18.4212465 6.642736e-18
# 2 mpg vs hp hp -0.06822828 1.011930e-02 -6.7423885 1.787835e-07
# 3 cyl vs hp (Intercept) 3.00679525 4.254852e-01 7.0667442 7.405351e-08
# 4 cyl vs hp hp 0.02168354 2.635142e-03 8.2286042 3.477861e-09
# 5 disp vs hp (Intercept) 20.99248341 3.260662e+01 0.6438104 5.245902e-01
# 6 disp vs hp hp 1.42977003 2.019414e-01 7.0801224 7.142679e-08
# 7 mpg vs drat (Intercept) -7.52461844 5.476663e+00 -1.3739423 1.796391e-01
# 8 mpg vs drat drat 7.67823260 1.506705e+00 5.0960421 1.776240e-05
We can also pull out model summary stats:
# get summary stats
summ = lapply(mods, glance)
dplyr::bind_rows(summ, .id = "mod")
# mod r.squared adj.r.squared sigma statistic p.value df logLik
# 1 mpg vs hp 0.6024373 0.5891853 3.862962 45.45980 1.787835e-07 2 -87.61931
# 2 cyl vs hp 0.6929688 0.6827344 1.005944 67.70993 3.477861e-09 2 -44.56307
# 3 disp vs hp 0.6255997 0.6131197 77.089503 50.12813 7.142679e-08 2 -183.41236
# 4 mpg vs drat 0.4639952 0.4461283 4.485409 25.96964 1.776240e-05 2 -92.39996
# 5 cyl vs drat 0.4899134 0.4729105 1.296596 28.81354 8.244636e-06 2 -52.68517
# 6 disp vs drat 0.5044038 0.4878839 88.693360 30.53315 5.282022e-06 2 -187.89934
# AIC BIC deviance df.residual
# 1 181.23863 185.63584 447.67431 30
# 2 95.12614 99.52335 30.35771 30
# 3 372.82473 377.22194 178283.74604 30
# 4 190.79993 195.19714 603.56673 30
# 5 111.37033 115.76754 50.43482 30
# 6 381.79868 386.19588 235995.36410 30
You can start with expand.grid to give a nice dataframe of dependent/independent variable pairs. Then add the formulae and models to the data.
pairings <- expand.grid(
lhs = c('mpg', 'cyl', 'disp'),
rhs = c('hp', 'drat', 'wt', 'qsec')
)
pairings[["formula"]] <- lapply(
X = paste(pairings[["lhs"]], "~", pairings[["rhs"]]),
FUN = as.formula
)
pairings[["model"]] <- lapply(
X = pairings[["formula"]],
FUN = lm,
data = mtcars
)
The results:
str(pairings, max.level = 1)
# 'data.frame': 12 obs. of 4 variables:
# $ lhs : Factor w/ 3 levels "mpg","cyl","disp": 1 2 3 1 2 3 1 2 3 1 ...
# $ rhs : Factor w/ 4 levels "hp","drat","wt",..: 1 1 1 2 2 2 3 3 3 4 ...
# $ formula:List of 12
# $ model :List of 12
# - attr(*, "out.attrs")=List of 2
pairings[["model"]][[1]]
# Call:
# FUN(formula = X[[i]], data = ..1)
#
# Coefficients:
# (Intercept) hp
# 30.09886 -0.06823
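
To get from fitted models to the one-row-per-regression table described in the question, the slope row of each coefficient table can be collected with base R alone. This is a sketch rather than part of the answer above: it refits the models from the variable names, and results and the helper's argument k are illustrative names.

```r
pairings <- expand.grid(
  lhs = c("mpg", "cyl", "disp"),
  rhs = c("hp", "drat", "wt", "qsec"),
  stringsAsFactors = FALSE
)

rows <- lapply(seq_len(nrow(pairings)), function(k) {
  l <- pairings$lhs[k]
  r <- pairings$rhs[k]
  fit <- lm(reformulate(r, response = l), data = mtcars)
  # Row of the coefficient table for the predictor (not the intercept):
  # Estimate, Std. Error, t value, Pr(>|t|).
  slope <- coef(summary(fit))[r, ]
  data.frame(lhs = l, rhs = r, t(slope), check.names = FALSE)
})

results <- do.call(rbind, rows)
rownames(results) <- NULL
head(results, 3)
```

check.names = FALSE keeps the column names exactly as summary() prints them, including the space in "Std. Error".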
