I'm doing cross-validation (five-fold). Then I want to calculate the mean value for each group in the data set I used for that CV. Please note that I need to use the following functions:
data(mpg)
library(modelr)
cv <- crossv_kfold(mpg, k = 5)
models1 <- map(cv$train, ~lm(hwy ~ displ, data = .))
get_pred <- function(model, test_data){
  data <- as.data.frame(test_data)
  pred <- add_predictions(data, model)
  return(pred)
}
pred1 <- map2_df(models1, cv$test, get_pred, .id = "Run")
MSE1 <- pred1 %>% group_by(Run) %>%
  summarise(MSE = mean((hwy - pred)^2))
MSE1
My problem lies with the output of 'summarise'. The function should be applied to each group. The result should look something like this:
## # A tibble: 5 x 2
## Run MSE
## <chr> <dbl>
## 1 1 27.889532
## 2 2 8.673054
## 3 3 17.033056
## 4 4 12.552037
## 5 5 9.138741
Unfortunately, I get only one value:
MSE
1 14.77799
How can I get a tibble like that above?
When I run your code, I get the style of output you are expecting (though the numbers differ, since the seed wasn't set in your example); I do not see the summarise problem you describe:
library(ggplot2)
library(modelr)
library(purrr)
library(dplyr)
data(mpg)
cv <- crossv_kfold(mpg, k = 5)
models1 <- map(cv$train, ~lm(hwy ~ displ, data = .))
get_pred <- function(model, test_data){
  data <- as.data.frame(test_data)
  pred <- add_predictions(data, model)
  return(pred)
}
pred1 <- map2_df(models1, cv$test, get_pred, .id = "Run")
MSE1 <- pred1 %>% group_by(Run) %>%
  summarise(MSE = mean((hwy - pred)^2))
MSE1
# A tibble: 5 x 2
  Run     MSE
  <chr> <dbl>
1 1      7.80
2 2     12.5
3 3      9.82
4 4     27.3
5 5     17.5
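If you still see only one value with this exact code, one common culprit (just a guess, since I can't reproduce your problem here) is that plyr is attached after dplyr, so plyr::summarise masks dplyr::summarise and ignores the grouping. Calling the dplyr version explicitly rules that out:
# explicitly use dplyr's grouped summarise in case plyr is masking it
MSE1 <- pred1 %>%
  group_by(Run) %>%
  dplyr::summarise(MSE = mean((hwy - pred)^2))
MSE1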
I made an nls loop and get the calculated values in the console. Now I want to extract those values, identify which values belong to which group, and put everything in a data frame to continue working with.
My loop so far:
for (i in seq_along(trtlist2)) {
  loopmm.nls <- nls(rate ~ (Vmax * conc / (Km + conc)),
                    data = subset(M3, M3$trtlist == trtlist2[i]),
                    start = list(Km = 200, Vmax = 2), trace = TRUE)
  print(summary(loopmm.nls))
}
The output in the console (this is what I want to extract and put in a data frame; I have this same "Parameters" block about 20 times):
Parameters:
Estimate Std. Error t value Pr(>|t|)
Km 23.29820 9.72304 2.396 0.0228 *
Vmax 0.10785 0.01165 9.258 1.95e-10 ***
---
Different ways of extracting the data that work outside the loop, but not inside it (so far!):
#####extract data in diff ways from nls#####
## extract coefficients as matrix
Kinall <- summary(mm.nls)$parameters
## extract coefficients save as dataframe
Kin <- as.data.frame(Kinall)
colnames(Kin) <- c("values", "SE", "T", "P")
###create Km Vmax df
Kms <- Kin[1, ]
Vmaxs <- Kin[2, ]
#####extract coefficients each manually
Km <- unname(coef(summary(mm.nls))["Km", "Estimate"])
Vmax <- unname(coef(summary(mm.nls))["Vmax", "Estimate"])
KmSE <- unname(coef(summary(mm.nls))["Km", "Std. Error"])
VmaxSE <- unname(coef(summary(mm.nls))["Vmax", "Std. Error"])
KmP <- unname(coef(summary(mm.nls))["Km", "Pr(>|t|)"])
VmaxP <- unname(coef(summary(mm.nls))["Vmax", "Pr(>|t|)"])
KmT <- unname(coef(summary(mm.nls))["Km", "t value"])
VmaxT <- unname(coef(summary(mm.nls))["Vmax", "t value"])
One thing that does work is extracting the data with append inside the loop, but somehow that only works for the estimates, not the rest:
  ## these two lines sit inside the loop above
  Kms <- append(Kms, unname(coef(loopmm.nls)["Km"]))
  Vmaxs <- append(Vmaxs, unname(coef(loopmm.nls)["Vmax"]))
}
Kindf <- data.frame(trt = trtlist2, Vmax = Vmaxs, Km = Kms)
I would just keep everything in the data frame for ease. You can nest by the group, run the regression, and then pull the coefficients out. Just make sure you have tidyverse and broom installed.
library(tidyverse)
#example
mtcars |>
nest(data = -cyl) |>
mutate(model = map(data, ~nls(mpg~hp^b,
data = .x,
start = list(b = 1))),
clean_mod = map(model, broom::tidy)) |>
unnest(clean_mod) |>
select(-c(data, model))
#> # A tibble: 3 x 6
#> cyl term estimate std.error statistic p.value
#> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 6 b 0.618 0.0115 53.6 2.83e- 9
#> 2 4 b 0.731 0.0217 33.7 1.27e-11
#> 3 8 b 0.504 0.0119 42.5 2.46e-15
#what I expect will work for your data
All_M3_models <- M3 |>
nest(data = -trtlist) |>
mutate(model = map(data, ~nls(rate ~ (Vmax * conc /(Km + conc)),
data=.x,
start=list(Km=200, Vmax=2))),
clean_mod = map(model, broom::tidy))|>
unnest(clean_mod) |>
select(-c(data, model))
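If you then want one row per treatment with the Km and Vmax estimates side by side, a small follow-up sketch (column names taken from what broom::tidy returns, so adjust if yours differ):
# reshape the stacked tidy output: one row per trtlist group,
# with estimate/std.error columns for Km and Vmax
All_M3_wide <- All_M3_models |>
  select(trtlist, term, estimate, std.error) |>
  pivot_wider(names_from = term, values_from = c(estimate, std.error))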
I am trying to run a loop which takes each column of a dataset in turn as the dependent variable, uses the remaining variables as the independent variables, and runs the lm command.
Here's my code:
quant <- function(a){
  i = 1
  colnames1 <- colnames(a)
  lm_model <- linear_reg() %>%
    set_engine('lm') %>% # adds lm implementation of linear regression
    set_mode('regression')
  for (i in 1:ncol(a)) {
    lm_fit <- lm_model %>%
      fit(colnames1[i] ~ ., data = set1)
    comp_matrix[i] <- tidy(lm_fit)[1, 2]
    i <- i + 1
  }
}
When I provide it with a dataset, it shows this error:
> quant(set1)
Error in model.frame.default(formula = colnames1[i] ~ ., data = data, : variable lengths differ (found for 'Imp of Family')
I will be using comp_matrix for coefficient comparison among models later on. Is there a better way to do this fundamentally?
Sample Data in picture:
Packages used:
library(dplyr)
library(haven)
library(ggplot2)
library(tidyverse)
library(broom)
library(modelsummary)
library(parsnip)
We could change the fit line to
fit(as.formula(paste(colnames1[i], "~ .")), data = a)
Full function:
quant <- function(a){
  a <- janitor::clean_names(a)
  colnames1 <- colnames(a)
  lm_model <- linear_reg() %>%
    set_engine('lm') %>%
    set_mode('regression')
  out_lst <- vector('list', ncol(a))
  for (i in seq_along(a)) {
    lm_fit <- lm_model %>%
      fit(as.formula(paste(colnames1[i], "~ .")), data = a)
    out_lst[[i]] <- tidy(lm_fit)[1, 2]
  }
  out_lst
}
Testing:
> dat <- tibble(col1 = 1:5, col2 = 5:1)
> quant(dat)
[[1]]
# A tibble: 1 × 1
estimate
<dbl>
1 6
[[2]]
# A tibble: 1 × 1
estimate
<dbl>
1 6
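Since you mentioned wanting comp_matrix for comparing coefficients across models later, a variant of the function above (a sketch, under the assumption that you want the full coefficient table per response rather than a single estimate) could stack the tidy() output into one data frame:
# hypothetical helper: returns one long data frame labelled by response column
quant_df <- function(a){
  a <- janitor::clean_names(a)
  colnames1 <- colnames(a)
  lm_model <- linear_reg() %>%
    set_engine('lm') %>%
    set_mode('regression')
  out_lst <- vector('list', ncol(a))
  for (i in seq_along(a)) {
    lm_fit <- lm_model %>%
      fit(as.formula(paste(colnames1[i], "~ .")), data = a)
    out_lst[[i]] <- tidy(lm_fit)   # full coefficient table for this model
  }
  names(out_lst) <- colnames1
  dplyr::bind_rows(out_lst, .id = "response")
}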
I would like to look at the compound feature importance of the principal components with DALEX model_parts, but I am also interested in the extent to which the results are driven by variation in a specific variable within that principal component. I can look at individual feature influence very neatly with model_profile, but in that case I cannot investigate the feature importance of the PCA variables. Is it possible to get the best of both worlds and look at the compound feature importance of a principal component while using model_profile partial dependence plots of individual factors, as shown below?
Data:
library(tidymodels)
library(parsnip)
library(DALEXtra)
set.seed(1)
x1 <- rbinom(1000, 5, .1)
x2 <- rbinom(1000, 5, .4)
x3 <- rbinom(1000, 5, .9)
x4 <- rbinom(1000, 5, .6)
id <- c(1:1000)
y <- as.factor(rbinom(1000, 5, .5))
df <- tibble(y, x1, x2, x3, x4, id)
df[, c("x1", "x2", "x3", "x4", "id")] <- sapply(df[, c("x1", "x2", "x3", "x4", "id")], as.numeric)
Model
# create training and test set
set.seed(20)
split_dat <- initial_split(df, prop = 0.8)
train <- training(split_dat)
test <- testing(split_dat)
# use cross-validation
kfolds <- vfold_cv(df)
# recipe
rec_pca <- recipe(y ~ ., data = train) %>%
  update_role(id, new_role = "id variable") %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors()) %>%
  step_pca(x1, x2, x3, threshold = 0.9)
# parsnip engine
boost_model <- boost_tree() %>%
  set_mode("classification") %>%
  set_engine("xgboost")
# create wf
boosted_wf <-
  workflow() %>%
  add_model(boost_model) %>%
  add_recipe(rec_pca)
final_boosted <- generics::fit(boosted_wf, df)
# create an explanation object
explainer_xgb <- DALEXtra::explain_tidymodels(final_boosted,
                                              data = df[, -1],
                                              y = df$y)
# feature importance
model_parts(explainer_xgb) %>% plot()
This gives me the following plot, even though I have reduced x1, x2 and x3 to one component in step_pca above.
I know I could reduce the dimensions manually, bind the result to the df like so, and then look at the feature importance.
rec_pca_2 <- df %>%
  select(x1, x2, x3) %>%
  recipe() %>%
  step_pca(all_numeric(), num_comp = 1)
df <- bind_cols(df, prep(rec_pca_2) %>% juice())
df
> df
# A tibble: 1,000 × 6
y x1 x2 x3 x4 PC1
<fct> <int> <int> <int> <int> <dbl>
1 2 0 2 4 2 -4.45
2 3 0 3 3 3 -3.95
3 0 0 2 4 4 -4.45
4 2 1 4 5 3 -6.27
5 4 0 1 5 2 -4.94
6 2 1 0 5 1 -4.63
7 3 2 2 5 4 -5.56
8 3 1 2 5 3 -5.45
9 2 1 3 5 2 -5.86
10 2 0 2 5 1 -5.35
# … with 990 more rows
I could then estimate a model with PC1 as a covariate. Yet, in that case, it would be difficult to interpret what variation in PC1 substantively means when using model_profile, since everything would be collapsed into one component.
model_profile(explainer_xgb) %>% plot()
Thus, my key question is: how can I look at the feature importance of components without compromising on the interpretability of the partial dependence plot?
You may be interested in the discussion here on how to get explainability for the original predictors vs. features that have been created via feature engineering (like PCA components). We don't have a super fluent interface for this yet, so you have to do it a bit manually:
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#> method from
#> required_pkgs.model_spec parsnip
library(parsnip)
library(DALEX)
#> Welcome to DALEX (version: 2.4.0).
#> Find examples and detailed introduction at: http://ema.drwhy.ai/
#>
#> Attaching package: 'DALEX'
#> The following object is masked from 'package:dplyr':
#>
#> explain
set.seed(1)
x1 <- rbinom(1000, 5, .1)
x2 <- rbinom(1000, 5, .4)
x3 <- rbinom(1000, 5, .9)
x4 <- rbinom(1000, 5, .6)
y <- as.factor(sample(c("yes", "no"), size = 1000, replace = TRUE))
df <- tibble(y, x1, x2, x3, x4) %>% mutate(across(where(is.integer), as.numeric))
# create training and test set
set.seed(20)
split_dat <- initial_split(df, prop = 0.8)
train <- training(split_dat)
test <- testing(split_dat)
# use cross-validation
kfolds <- vfold_cv(df)
# recipe
rec_pca <- recipe(y ~ ., data = train) %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors()) %>%
  step_pca(x1, x2, x3, threshold = 0.9)
# parsnip engine
boost_model <- boost_tree() %>%
  set_mode("classification") %>%
  set_engine("xgboost")
# create wf
boosted_wf <-
  workflow() %>%
  add_model(boost_model) %>%
  add_recipe(rec_pca)
final_boosted <- generics::fit(boosted_wf, df)
#> [14:00:11] WARNING: amalgamation/../src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
Notice that next I use regular DALEX (not DALEXtra), and that I manually extract the xgboost model from inside the workflow and apply the feature engineering to the data myself:
# create an explanation object
explainer_xgb <-
  DALEX::explain(
    extract_fit_parsnip(final_boosted),
    data = rec_pca %>% prep() %>% bake(new_data = NULL, all_predictors()),
    y = as.integer(train$y)
  )
#> Preparation of a new explainer is initiated
#> -> model label : model_fit ( default )
#> -> data : 800 rows 4 cols
#> -> data : tibble converted into a data.frame
#> -> target variable : 800 values
#> -> predict function : yhat.model_fit will be used ( default )
#> -> predicted values : No value for predict function target column. ( default )
#> -> model_info : package parsnip , ver. 0.1.7 , task classification ( default )
#> -> predicted values : numerical, min = 0.1157353 , mean = 0.4626758 , max = 0.8343955
#> -> residual function : difference between y and yhat ( default )
#> -> residuals : numerical, min = 0.1860582 , mean = 0.9985742 , max = 1.884265
#> A new explainer has been created!
model_parts(explainer_xgb) %>% plot()
Created on 2022-03-11 by the reprex package (v2.0.1)
The only behavior supported right now in DALEXtra is based on using the original predictors, so if you want to look at the engineered features, you need to do it yourself. You may be interested in this chapter of our book.
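As a follow-up sketch (my own assumption, not part of the supported interface): because this second explainer was built on the baked data, model_profile on it should give partial dependence for the engineered features directly, e.g. the retained component (named PC1 by step_pca) next to the untouched x4:
# partial dependence on the engineered features of the manually built explainer
model_profile(explainer_xgb, variables = c("PC1", "x4")) %>% plot()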
I tried to perform an independent t-test for many columns of a data frame. For example, I created a data frame:
set.seed(333)
a <- rnorm(20, 10, 1)
b <- rnorm(20, 15, 2)
c <- rnorm(20, 20, 3)
grp <- rep(c('m', 'y'),10)
test_data <- data.frame(a, b, c, grp)
To run the test, I used with(df, t.test(y ~ group)):
with(test_data, t.test(a ~ grp))
with(test_data, t.test(b ~ grp))
with(test_data, t.test(c ~ grp))
I would like to have the output like this:
mean in group m   mean in group y   p-value
       9.747412          9.878820    0.6944
       15.12936          16.49533   0.07798
       20.39531          20.20168    0.9027
I wonder how I can achieve the results using:
1. a for loop
2. apply()
3. perhaps dplyr
This link, R: t-test over all columns, is related but is 6 years old. Perhaps there are better ways to do the same thing.
Use select_if to select only numeric columns, then use purrr::map_df to apply t.test against grp. Finally, use broom::tidy to get the results in a tidy format:
library(tidyverse)
res <- test_data %>%
  select_if(is.numeric) %>%
  map_df(~ broom::tidy(t.test(. ~ grp)), .id = 'var')
res
#> # A tibble: 3 x 11
#> var estimate estimate1 estimate2 statistic p.value parameter conf.low
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 a -0.259 9.78 10.0 -0.587 0.565 16.2 -1.19
#> 2 b 0.154 15.0 14.8 0.169 0.868 15.4 -1.78
#> 3 c -0.359 20.4 20.7 -0.287 0.778 16.5 -3.00
#> # ... with 3 more variables: conf.high <dbl>, method <chr>,
#> # alternative <chr>
Created on 2019-03-15 by the reprex package (v0.2.1.9000)
Simply extract the estimate and p-value results from the t.test call while iterating through all needed columns with sapply. Build the formulas from a character vector and transpose with t() for the output:
formulas <- paste(names(test_data)[1:(ncol(test_data)-1)], "~ grp")
output <- t(sapply(formulas, function(f) {
  res <- t.test(as.formula(f), data = test_data)
  c(res$estimate, p.value = res$p.value)
}))
Input data (seeded for reproducibility)
set.seed(333)
a <- rnorm(20, 10, 1)
b <- rnorm(20, 15, 2)
c <- rnorm(20, 20, 3)
grp <- rep(c('m', 'y'),10)
test_data <- data.frame(a, b, c, grp)
Output result
# mean in group m mean in group y p.value
# a ~ grp 9.775477 10.03419 0.5654353
# b ~ grp 14.972888 14.81895 0.8678149
# c ~ grp 20.383679 20.74238 0.7776188
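If you prefer the rows labelled by column name rather than by formula, a small optional tweak (a sketch):
out_df <- data.frame(output, check.names = FALSE)  # keep "mean in group m" etc. as column names
rownames(out_df) <- names(test_data)[1:(ncol(test_data) - 1)]
out_df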
As you asked for a for loop:
a <- rnorm(20, 10, 1)
b <- rnorm(20, 15, 2)
c <- rnorm(20, 20, 3)
grp <- rep(c('m', 'y'),10)
test_data <- data.frame(a, b, c, grp)
meanM <- NULL
meanY <- NULL
p.value <- NULL
for (i in 1:(ncol(test_data) - 1)) {
  tt <- t.test(test_data[, i] ~ grp)  # run the test once per column
  meanM <- as.data.frame(rbind(meanM, tt$estimate[1]))
  meanY <- as.data.frame(rbind(meanY, tt$estimate[2]))
  p.value <- as.data.frame(rbind(p.value, tt$p.value))
}
cbind(meanM, meanY, p.value)
It works, but I am a beginner in R, so maybe there is a more efficient solution.
Using lapply this is rather easy.
I have tested the code with set.seed(7060) before creating the dataset, in order to make the results reproducible.
tests_list <- lapply(letters[1:3], function(x) t.test(as.formula(paste0(x, "~ grp")), data = test_data))
result <- do.call(rbind, lapply(tests_list, `[[`, "estimate"))
pval <- sapply(tests_list, `[[`, "p.value")
result <- cbind(result, p.value = pval)
result
# mean in group m mean in group y p.value
#[1,] 9.909818 9.658813 0.6167742
#[2,] 14.578926 14.168816 0.6462151
#[3,] 20.682587 19.299133 0.2735725
Note that a real life application would use names(test_data)[1:3], not letters[1:3], in the first lapply instruction.
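For completeness, that real-life version might look like this (a sketch, assuming the numeric columns are the first three):
# same as above, but building the formulas from the column names
vars <- names(test_data)[1:3]
tests_list <- lapply(vars, function(x)
  t.test(as.formula(paste0(x, " ~ grp")), data = test_data))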
This should be a comment rather than an answer, but I'll make it an answer. The reason is that the accepted answer is awesome, but it has one caveat that may cost others hours, as it did for me.
The original data posted by the OP:
a <- rnorm(20, 10, 1)
b <- rnorm(20, 15, 2)
c <- rnorm(20, 20, 3)
grp <- rep(c('m', 'y'),10)
test_data <- data.frame(a, b, c, grp)
The answer provided by @Tung:
library(tidyverse)
res <- test_data %>%
  select_if(is.numeric) %>%
  map_df(~ broom::tidy(t.test(. ~ grp)), .id = 'var')
res
The problem, or more accurately the caveat, of this answer is that the grp variable has to exist as a separate vector outside of the data frame: select_if(is.numeric) drops grp, so t.test(. ~ grp) only finds it in the enclosing environment. Keeping the grouping variable outside of the data frame is not common practice as far as I know, so even though the answer is neat, it is worth pointing this out. I post this comment-like answer in the hope of saving some time for late comers.
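One possible workaround (my own sketch, not part of the original answer) is to keep grp inside the data frame and reference it explicitly in the formula, so nothing needs to live in the global environment:
library(tidyverse)
res <- test_data %>%
  select(where(is.numeric)) %>%
  map_df(~ broom::tidy(t.test(.x ~ test_data$grp)), .id = "var")
res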
I'm working with a huge data frame with a structure similar to the following. I use output_reg to store the slope and intercept for each treatment, but I need to add the r.squared for each lm(y ~ x) and store it alongside the other two. Any hint on that?
library(plyr)
field <- c('t1','t1','t1', 't2', 't2','t2', 't3', 't3','t3')
predictor <- c(4.2, 5.3, 5.4,6, 7,8.5,9, 10.1,11)
response <- c(5.1, 5.1, 2.4,6.1, 7.7,5.5,1.99, 5.42,2.5)
my_df <- data.frame(field, predictor, response, stringsAsFactors = F)
output_reg <- list()
B <- unique(my_df$field)
for (i in 1:length(B)) {
  index <- my_df[my_df$field == B[i], ]
  x <- index$predictor
  y <- index$response
  output_reg[[i]] <- lm(y ~ x)  # gets estimates for each field
}
Thanks
r.squared can be accessed via the summary of the model; try this:
m <- lm(y ~ x)
rs <- summary(m)$r.squared
The summary object of the linear regression result contains almost everything you need:
output_reg <- list()
B <- unique(my_df$field)
for (i in 1:length(B)) {
  index <- my_df[my_df$field == B[i], ]
  x <- index$predictor
  y <- index$response
  m <- lm(y ~ x)
  s <- summary(m)  # get the summary of the model
  # extract everything you need from the summary object
  output_reg[[i]] <- c(s$coefficients[, 'Estimate'], r.squared = s$r.squared)
}
output_reg
#[[1]]
#(Intercept) x r.squared
# 10.7537594 -1.3195489 0.3176692
#[[2]]
#(Intercept) x r.squared
# 8.8473684 -0.3368421 0.1389040
#[[3]]
#(Intercept) x r.squared
#-0.30500000 0.35963455 0.03788593
To bind the result together:
do.call(rbind, output_reg)
# (Intercept) x r.squared
# [1,] 10.753759 -1.3195489 0.31766917
# [2,] 8.847368 -0.3368421 0.13890396
# [3,] -0.305000 0.3596346 0.03788593
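If you want to keep track of which row belongs to which field, one small addition (a sketch using the B vector defined above):
res <- do.call(rbind, output_reg)
rownames(res) <- B  # B holds the unique field labels, in the same order as output_reg
res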
Check out the broom package and sprinkle in some dplyr (see this vignette):
library(broom)
library(dplyr)
my_df %>%
  group_by(field) %>%
  do(glance(lm(predictor ~ response, data = .)))  # also see do(tidy(...))
# field r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int>
# 1 t1 0.31766917 -0.3646617 0.7778175 0.46556474 0.6188153 2 -1.855107 9.710214 7.006051 0.605000 1
# 2 t2 0.13890396 -0.7221921 1.6513038 0.16131065 0.7568653 2 -4.113593 14.227185 11.523022 2.726804 1
# 3 t3 0.03788593 -0.9242281 1.3894755 0.03937779 0.8752903 2 -3.595676 13.191352 10.487189 1.930642 1
Alternatively, save the regressions first:
regressions <- my_df %>% group_by(field) %>% do(fit = lm(predictor ~ response, data = .))
regressions %>% tidy(fit)
regressions %>% glance(fit)
You can do the following using purrr (note that in recent purrr releases, slice_rows() and by_slice() have moved to the purrrlyr package):
require(purrr)
my_df %>%
  slice_rows("field") %>%
  by_slice(partial(lm, predictor ~ response), .labels = FALSE) %>%
  flatten %>%
  map(~ c(coef(.), r.squared = summary(.)$r.squared))
Which gives you:
[[1]]
(Intercept) response r.squared
5.9777778 -0.2407407 0.3176692
[[2]]
(Intercept) response r.squared
9.8195876 -0.4123711 0.1389040
[[3]]
(Intercept) response r.squared
9.68534163 0.10534562 0.03788593
If you want a data.frame back instead, use this as the last line:
map_df(~as.data.frame(t(c(coef(.), r.squared=summary(.)$r.squared))))
You can create a data frame with model stats like this:
model_stats <- data.frame(model$coefficients)                      # 'model' is a fitted lm object
model_stats <- rbind(model_stats, r.sq = summary(model)$r.squared) # add r.squared as an extra row
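Applied to the loop from the question, that might look like the following sketch (formula and data names taken from the question; note that r.sq ends up as an extra row in each small data frame):
output_reg <- list()
B <- unique(my_df$field)
for (i in seq_along(B)) {
  index <- my_df[my_df$field == B[i], ]
  model <- lm(response ~ predictor, data = index)
  model_stats <- data.frame(model$coefficients)
  model_stats <- rbind(model_stats, r.sq = summary(model)$r.squared)
  output_reg[[B[i]]] <- model_stats
}
output_reg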