I need some help analyzing survey data.
Here is my code.
Data prep
library(survey)
library(srvyr)
data(api)
dclus2 <- apiclus1 %>%
as_survey_design(dnum, weights = pw, fpc = fpc)
These two snippets give me the same result.
One using the package survey
#Code
survey::svymean(~awards, dclus2)
#Results
mean SE
awardsNo 0.28962 0.033
awardsYes 0.71038 0.033
One using the package srvyr
#Code
dclus2 %>%
group_by(awards)%>%
summarise(m=survey_mean())
#Results
awards m m_se
No 0.2896175 0.0330183
Yes 0.7103825 0.0330183
I would like to get the survey mean of the variable "awards" (levels No and Yes), broken down by the variable "stype".
In the survey package, interaction() is used, e.g.
svymean(~interaction(awards,stype), dclus2)
How do I get the same result using the srvyr package?
Thank you for your help.
How do I get the result below using the package srvyr?
#Code
svymean(~interaction(awards,stype), dclus2)
#Results
mean SE
interaction(awards, stype)No.E 0.180328 0.0250
interaction(awards, stype)Yes.E 0.606557 0.0428
interaction(awards, stype)No.H 0.043716 0.0179
interaction(awards, stype)Yes.H 0.032787 0.0168
interaction(awards, stype)No.M 0.065574 0.0230
interaction(awards, stype)Yes.M 0.071038 0.0203
You can simply imitate the recommended behavior for survey: create a new variable formed by concatenating distinct values of each of the component variables. That's all that the interaction() function is doing for svymean().
library(survey)
library(srvyr)
data(api)
# Set up design object
dclus2 <- apiclus1 %>%
as_survey_design(dnum, weights = pw, fpc = fpc)
# Create 'interaction' variable
dclus2 %>%
mutate(awards_stype = paste(awards, stype, sep = " - ")) %>%
group_by(awards_stype) %>%
summarize(
prop = survey_mean()
)
#> # A tibble: 6 x 3
#> awards_stype prop prop_se
#> <chr> <dbl> <dbl>
#> 1 No - E 0.180 0.0250
#> 2 No - H 0.0437 0.0179
#> 3 No - M 0.0656 0.0230
#> 4 Yes - E 0.607 0.0428
#> 5 Yes - H 0.0328 0.0168
#> 6 Yes - M 0.0710 0.0203
To get the various component variables split back into separate columns, you can use the separate() function from the tidyr package.
# Separate the columns afterwards
dclus2 %>%
mutate(awards_stype = paste(awards, stype, sep = " - ")) %>%
group_by(awards_stype) %>%
summarize(
prop = survey_mean()
) %>%
tidyr::separate(col = "awards_stype",
into = c("awards", "stype"),
sep = " - ")
#> # A tibble: 6 x 4
#> awards stype prop prop_se
#> <chr> <chr> <dbl> <dbl>
#> 1 No E 0.180 0.0250
#> 2 No H 0.0437 0.0179
#> 3 No M 0.0656 0.0230
#> 4 Yes E 0.607 0.0428
#> 5 Yes H 0.0328 0.0168
#> 6 Yes M 0.0710 0.0203
Created on 2021-03-30 by the reprex package (v1.0.0)
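To see why pasting the values works, here is a minimal base R sketch on made-up vectors (the toy `awards` and `stype` values below are assumptions for illustration, not the api data). It suggests that interaction() simply labels each row with the combined levels, which paste() can reproduce:

```r
# Toy factors standing in for the survey variables (hypothetical values)
awards <- factor(c("No", "Yes", "Yes", "No", "Yes"))
stype  <- factor(c("E", "E", "H", "M", "M"))

# interaction() builds one factor whose labels join the levels with "."
via_interaction <- interaction(awards, stype)

# paste() with the same separator gives the same per-row labels
via_paste <- paste(awards, stype, sep = ".")

same_labels <- identical(as.character(via_interaction), via_paste)
same_labels
```

One difference worth keeping in mind: by default interaction() keeps every level combination, including unobserved ones, while paste() only ever produces the combinations that occur in the data.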
I am running a mixed effects model on my dataset:
library(lme4)
data(cake)
Each dataset is a subset of a larger dataset:
subset(cake, recipe=="A")
subset(cake, recipe=="B")
subset(cake, recipe=="C")
I am using dlply to run my mixed effects model on each subset
MxM1 <- plyr::dlply(cake,
"recipe",
function(x)
lmer(angle ~ 1+ (1|replicate)+ temperature,
data=x))
This gives me a list of summaries based on each subset of data.
I know how to display the summaries one at a time using the gtsummary package:
MxM1$A %>%
tbl_regression() %>%
modify_column_hide(columns = ci) %>%
modify_column_unhide(columns = std.error)
MxM1$B %>%
tbl_regression() %>%
modify_column_hide(columns = ci) %>%
modify_column_unhide(columns = std.error)
MxM1$C %>%
tbl_regression() %>%
modify_column_hide(columns = ci) %>%
modify_column_unhide(columns = std.error)
I am not sure how to combine the results from all 3 objects (MxM1$A, MxM1$B, MxM1$C) to display them as one summary table.
Model: A Model: B Model: C
Temperature Beta SE Beta SE Beta SE
Temperature. L
Temperature. Q
Temperature. C
Temperature^4
Temperature^5
Any suggestions or help are much appreciated. Thanks.
You can also merge two or more gtsummary tables using the gtsummary::tbl_merge() function. Example below!
library(gtsummary)
#> #StandWithUkraine
library(lme4)
#> Loading required package: Matrix
data(cake)
MxM1 <-
plyr::dlply(
cake,
"recipe",
function(x) {
lmer(angle ~ 1+ (1|replicate)+ temperature, data=x) %>%
tbl_regression() %>%
modify_column_hide(columns = ci) %>%
modify_column_unhide(columns = std.error)
}
)
# Merge all model summaries together with `tbl_merge()`
tbl <-
MxM1 %>%
tbl_merge(
tab_spanner = c("**A**", "**B**", "**C**")
)
Created on 2022-12-17 with reprex v2.0.2
Update:
While the answer by @Daniel D. Sjoberg is perfect and the desired one, here is the answer to OP's question in the comments:
"How can i convert the final results from long format to wide, by each recipe?"
After filtering temperature we could use pivot_wider and some tweaking thereafter:
Note that we have to use the broom.mixed package to tidy our lmer models:
library(lme4)
library(tidyverse)
#library(broom)
library(broom.mixed)
cake %>%
mutate(recipe = as_factor(recipe)) %>%
group_by(recipe) %>%
group_split() %>%
map_dfr(.f = function(df){
lmer(angle ~ 1 + (1|replicate) + temperature,
data=df) %>%
tidy() %>%
add_column(recipe = unique(df$recipe), .before = 1)
}) %>%
filter(str_detect(term, "temperature")) %>%
select(recipe, term, Beta=estimate, SE = std.error) %>%
pivot_wider(names_from = recipe,
values_from = c(Beta, SE)) %>%
rename_with(~ str_replace(., "(.*)_(.*)", "\\2_\\1"), -1) %>%
select(term, sort(colnames(.)))
term A_Beta A_SE B_Beta B_SE C_Beta C_SE
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 temperature.L 6.43 1.22 6.88 1.16 6.52 1.12
2 temperature.Q -0.713 1.22 -0.946 1.16 0.502 1.12
3 temperature.C -2.33 1.22 0.368 1.16 0.313 1.12
4 temperature^4 -3.35 1.22 -0.328 1.16 -0.214 1.12
5 temperature^5 -0.151 1.22 -0.815 1.16 -1.78 1.12
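The same long-to-wide idea can be sketched in base R without any extra packages. The example below is only an illustration under stated assumptions: it uses ordinary lm() fits on the built-in iris data (not lmer on cake), and the formula and column naming are my own choices.

```r
# Fit one ordinary linear model per group (iris and this formula are
# stand-ins chosen for illustration, not the models from the question)
fits <- lapply(
  split(iris, iris$Species),
  function(d) lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = d)
)

# For each group, keep the estimate and standard error of the slopes
# (dropping the intercept row) and prefix the columns with the group name
wide <- do.call(cbind, lapply(names(fits), function(g) {
  est <- coef(summary(fits[[g]]))[-1, c("Estimate", "Std. Error"), drop = FALSE]
  colnames(est) <- paste(g, c("Beta", "SE"), sep = "_")
  est
}))

# One row per term, one Beta/SE column pair per group
wide
```

The result is a plain matrix, so it can be passed to as.data.frame() if a data frame is needed downstream.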
First answer:
Do you need something like this?
library(lme4)
data(cake)
library(tidyverse) # dplyr, plus purrr (map_dfr), tibble (add_column) and forcats (as_factor)
library(broom)
library(broom.mixed)
cake %>%
mutate(recipe = as_factor(recipe)) %>%
group_by(recipe) %>%
group_split() %>%
map_dfr(.f = function(df){
lmer(angle ~ 1 + (1|replicate) + temperature,
data=df) %>%
tidy() %>%
add_column(recipe = unique(df$recipe), .before = 1)
})
A tibble: 24 × 7
recipe effect group term estimate std.error statistic
<fct> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 A fixed NA (Intercept) 33.1 1.42 23.3
2 A fixed NA temperature.L 6.43 1.22 5.26
3 A fixed NA temperature.Q -0.713 1.22 -0.583
4 A fixed NA temperature.C -2.33 1.22 -1.90
5 A fixed NA temperature^4 -3.35 1.22 -2.74
6 A fixed NA temperature^5 -0.151 1.22 -0.124
7 A ran_pars replicate sd__(Intercept) 5.16 NA NA
8 A ran_pars Residual sd__Observation 4.73 NA NA
9 B fixed NA (Intercept) 31.6 1.81 17.5
10 B fixed NA temperature.L 6.88 1.16 5.93
# … with 14 more rows
# ℹ Use `print(n = ...)` to see more rows
I have three columns, one per group, with numeric values. I want to analyze them using an ANOVA test, but the examples I found all have the groups in one column and the corresponding values in a second column. Is it necessary to reshape the data like that, or is there a method that works with the columns I currently have? I have attached a screenshot:
Thanks!
You can convert a wide table having many columns into another table having only two columns for key (group) and value (response) by pivoting the data:
library(tidyverse)
# create example data
set.seed(1337)
data <- tibble(
VIH = runif(100),
VIH2 = runif(100),
VIH3 = runif(100)
)
data
#> # A tibble: 100 × 3
#> VIH VIH2 VIH3
#> <dbl> <dbl> <dbl>
#> 1 0.576 0.485 0.583
#> 2 0.565 0.495 0.108
#> 3 0.0740 0.868 0.350
#> 4 0.454 0.833 0.324
#> 5 0.373 0.242 0.915
#> 6 0.331 0.0694 0.0790
#> 7 0.948 0.130 0.563
#> 8 0.281 0.122 0.287
#> 9 0.245 0.270 0.419
#> 10 0.146 0.488 0.838
#> # … with 90 more rows
data %>%
pivot_longer(everything()) %>%
aov(value ~ name, data = .)
#> Call:
#> aov(formula = value ~ name, data = .)
#>
#> Terms:
#> name Residuals
#> Sum of Squares 0.124558 25.171730
#> Deg. of Freedom 2 297
#>
#> Residual standard error: 0.2911242
#> Estimated effects may be unbalanced
Created on 2022-05-10 by the reprex package (v2.0.0)
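If you would rather not load the tidyverse, the same reshaping can be sketched with base R's stack(), which turns each column into rows of a value column (`values`) and a group column (`ind`). The data below is simulated the same way as above; the object names `wide`, `long` and `fit` are just illustrative.

```r
# Simulated wide data: one numeric column per group
set.seed(1337)
wide <- data.frame(
  VIH  = runif(100),
  VIH2 = runif(100),
  VIH3 = runif(100)
)

# stack() returns a two-column data frame: `values` and `ind` (the group)
long <- stack(wide)

# One-way ANOVA of the values by group
fit <- aov(values ~ ind, data = long)
summary(fit)
```

With three groups of 100 observations, the model has 2 degrees of freedom for the group term and 297 for the residuals, matching the output above.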
I am doing some time series forecasting analysis with the fable and fabletools package and I am interested in comparing the accuracy of individual models and also a mixed model (consisting of the individual models I am using).
Here is some example code with a mock dataframe:-
library(fable)
library(fabletools)
library(distributional)
library(tidyverse)
library(imputeTS)
#creating mock dataframe
set.seed(1)
Date<-seq(as.Date("2018-01-01"), as.Date("2021-03-19"), by = "1 day")
Count<-rnorm(length(Date),mean = 2086, sd= 728)
Count<-round(Count)
df<-data.frame(Date,Count)
df
#===================redoing with new model================
df$Count<-abs(df$Count)#in case there is any negative values, force them to be absolute
count_data<-as_tsibble(df)
count_data<-imputeTS::na.mean(count_data)
testfrac<-count_data%>%arrange(Date)%>%sample_frac(0.8)
lastdate<-last(testfrac$Date)
#train data
train <- count_data %>%
#sample_frac(0.8)
filter(Date<=as.Date(lastdate))
set.seed(1)
fit <- train %>%
model(
ets = ETS(Count),
arima = ARIMA(Count),
snaive = SNAIVE(Count),
croston= CROSTON(Count),
ave=MEAN(Count),
naive=NAIVE(Count),
neural=NNETAR(Count),
lm=TSLM(Count ~ trend()+season())
) %>%
mutate(mixed = (ets + arima + snaive + croston + ave + naive + neural + lm) /8)# creates a combined model using the averages of all individual models
fc <- fit %>% forecast(h = 7)
accuracy(fc,count_data)
fc_accuracy <- accuracy(fc, count_data,
measures = list(
point_accuracy_measures,
interval_accuracy_measures,
distribution_accuracy_measures
)
)
fc_accuracy
# A tibble: 9 x 13
# .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1 winkler percentile CRPS
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 arima Test -191. 983. 744. -38.1 51.8 0.939 0.967 -0.308 5769. 567. 561.
#2 ave Test -191. 983. 744. -38.1 51.8 0.939 0.967 -0.308 5765. 566. 561.
#3 croston Test -191. 983. 745. -38.2 51.9 0.940 0.968 -0.308 29788. 745. 745.
#4 ets Test -189. 983. 743. -38.0 51.7 0.938 0.967 -0.308 5759. 566. 560.
#5 lm Test -154. 1017. 742. -36.5 51.1 0.937 1.00 -0.307 6417. 583. 577.
#6 mixed Test -173. 997. 747. -36.8 51.1 0.944 0.981 -0.328 29897. 747. 747.
#7 naive Test 99.9 970. 612. -19.0 38.7 0.772 0.954 -0.308 7856. 692. 685.
#8 neural Test -322. 1139. 934. -49.6 66.3 1.18 1.12 -0.404 26361. 852. 848.
#9 snaive Test -244 1192. 896. -37.1 55.5 1.13 1.17 -0.244 4663. 690. 683.
The code above demonstrates how to create a mixed model. However, some individual models can hamper the performance of the mixed model when they are added to it; in other words, the mixed model could potentially be improved by leaving out the individual models that skew its accuracy in a detrimental way.
Desired outcome
What I would like to achieve is to test all of the possible combinations of individual models and return the mixed model with the best performance on one of the accuracy metrics, for instance Mean Absolute Error (MAE). But I am not sure how to do this in an automated way, as there are many potential combinations.
Can someone suggest or share some code as to how I could do this?
A couple of things to consider:
While it's definitely desirable to quickly evaluate the performance of many combination models, it's pretty impractical. The best option would be to evaluate your models individually, and then create a simpler combination using, e.g., the 2 or 3 best ones.
As an example, consider that you can actually have weighted combinations, e.g. 0.75 * ets + 0.25 * arima. The possibilities are now literally endless, so you start to see the limitations of the brute-force method (N.B. I don't think fable actually supports these kinds of combinations yet, though).
That said, here's one approach you could use to generate all the possible combinations. Note that this might take a prohibitively long time to run, but it should give you what you're after.
# Get a table of models to get combinations from
fit <- train %>%
model(
ets = ETS(Count),
arima = ARIMA(Count),
snaive = SNAIVE(Count),
croston= CROSTON(Count),
ave=MEAN(Count),
naive=NAIVE(Count),
neural=NNETAR(Count),
lm=TSLM(Count ~ trend()+season())
)
# Start with a vector containing all the models we want to combine
models <- c("ets", "arima", "snaive", "croston", "ave", "naive", "neural", "lm")
# Generate a table of combinations - if a value is 1, that indicates that
# the model should be included in the combinations
combinations <- models %>%
purrr::set_names() %>%
map(~0:1) %>%
tidyr::crossing(!!!.)
combinations
#> # A tibble: 256 x 8
#> ets arima snaive croston ave naive neural lm
#> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1 0 0 0 0 0 0 0 0
#> 2 0 0 0 0 0 0 0 1
#> 3 0 0 0 0 0 0 1 0
#> 4 0 0 0 0 0 0 1 1
#> 5 0 0 0 0 0 1 0 0
#> 6 0 0 0 0 0 1 0 1
#> 7 0 0 0 0 0 1 1 0
#> 8 0 0 0 0 0 1 1 1
#> 9 0 0 0 0 1 0 0 0
#> 10 0 0 0 0 1 0 0 1
#> # ... with 246 more rows
# This just filters for combinations with at least 2 models
relevant_combinations <- combinations %>%
filter(rowSums(across(everything())) > 1)
# We can use this table to generate the code we would put in a call to `mutate()`
# to generate the combination. {fable} does something funny with code
# evaluation here, meaning that more elegant approaches are more trouble
# than they're worth
specs <- relevant_combinations %>%
mutate(id = row_number()) %>%
pivot_longer(-id, names_to = "model", values_to = "flag_present") %>%
filter(flag_present == 1) %>%
group_by(id) %>%
summarise(
desc = glue::glue_collapse(model, "_"),
model = glue::glue(
"({model_sums}) / {n_models}",
model_sums = glue::glue_collapse(model, " + "),
n_models = n()
)
) %>%
select(-id) %>%
pivot_wider(names_from = desc, values_from = model)
# This is what the `specs` table looks like:
specs
#> # A tibble: 1 x 247
#> neural_lm naive_lm naive_neural naive_neural_lm ave_lm ave_neural
#> <glue> <glue> <glue> <glue> <glue> <glue>
#> 1 (neural + lm) / 2 (naive +~ (naive + neu~ (naive + neural ~ (ave +~ (ave + ne~
#> # ... with 241 more variables: ave_neural_lm <glue>, ave_naive <glue>,
#> # ave_naive_lm <glue>, ave_naive_neural <glue>, ave_naive_neural_lm <glue>,
#> # croston_lm <glue>, croston_neural <glue>, croston_neural_lm <glue>,
#> # croston_naive <glue>, croston_naive_lm <glue>, croston_naive_neural <glue>,
#> # croston_naive_neural_lm <glue>, croston_ave <glue>, croston_ave_lm <glue>,
#> # croston_ave_neural <glue>, croston_ave_neural_lm <glue>,
#> # croston_ave_naive <glue>, croston_ave_naive_lm <glue>, ...
# We can combine our two tables and evaluate the generated code to produce
# combination models as follows:
combinations <- fit %>%
bind_cols(rename_with(specs, ~paste0("spec_", .))) %>%
mutate(across(starts_with("spec"), ~eval(parse(text = .))))
# Compute the accuracy for 2 random combinations to demonstrate:
combinations %>%
select(sample(seq_len(ncol(.)), 2)) %>%
forecast(h = 7) %>%
accuracy(count_data, measures = list(
point_accuracy_measures,
interval_accuracy_measures,
distribution_accuracy_measures
))
#> # A tibble: 2 x 13
#> .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1 winkler
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 spec_ets_arima~ Test -209. 1014. 771. -40.1 54.0 0.973 0.998 -0.327 30825.
#> 2 spec_ets_snaiv~ Test -145. 983. 726. -34.5 48.9 0.917 0.967 -0.316 29052.
#> # ... with 2 more variables: percentile <dbl>, CRPS <dbl>
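As a cross-check on the number of combinations, here is a small base R sketch that enumerates the same subsets with combn() and builds the same kind of expression strings. It only generates the specifications; it does not fit or evaluate any models, and the object names `subsets` and `specs` are illustrative.

```r
models <- c("ets", "arima", "snaive", "croston", "ave", "naive", "neural", "lm")

# All subsets containing at least 2 models, as a flat list of character vectors
subsets <- unlist(
  lapply(2:length(models), function(k) combn(models, k, simplify = FALSE)),
  recursive = FALSE
)

# Turn each subset into an equal-weight average, e.g. "(ets + arima) / 2"
specs <- vapply(
  subsets,
  function(s) sprintf("(%s) / %d", paste(s, collapse = " + "), length(s)),
  character(1)
)
names(specs) <- vapply(subsets, paste, character(1), collapse = "_")

length(specs)
head(specs, 3)
```

There are 2^8 = 256 subsets in total; dropping the empty set and the 8 single-model subsets leaves 247, which agrees with the 247 columns of the `specs` table above.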
When I run the code below, roc_curve() produces an incorrect ROC curve.
Prep
The code below should be runnable for anyone using RStudio. The dataframe contains characteristics of different employees: performance ratings, sales figures, and whether or not they were promoted.
I am attempting to create a decision tree model that uses all other variables to predict if an employee was promoted. The primary purpose of this question is to find out what I am doing incorrectly when trying to use the roc_curve() function.
library(tidyverse)
library(tidymodels)
library(peopleanalyticsdata)
url <- "http://peopleanalytics-regression-book.org/data/salespeople.csv"
salespeople <- read.csv(url)
salespeople <- salespeople %>% mutate(promoted = factor(ifelse(promoted == 1, "yes", "no")))
creating testing/training data
Using my own homemade train_test() function just for kicks!
train_test <- function(data, train.size = 0.7, na.rm = FALSE) {
  # Drop incomplete rows first, so the sampled indices refer to the same
  # data frame that is split below
  if (na.rm) {
    data <- na.omit(data)
  }
  # Randomly pick the row indices that go into the training set
  dt <- sample(x = nrow(data), size = floor(nrow(data) * train.size))
  # Return both parts as a named list
  list(train = data[dt, ], test = data[-dt, ])
}
tt_list <- train_test(salespeople)
sales_train <- tt_list$train
sales_test <- tt_list$test
Creating decision tree model structure/final model/prediction dataframe
tree <- decision_tree() %>%
set_engine("rpart") %>%
set_mode("classification")
model <- tree %>% fit(promoted ~ ., data = sales_train)
predictions <- predict(model,
sales_test,
type = "prob") %>%
bind_cols(sales_test)
Calculate & Plot the ROC curve
When I use the .pred_yes column as the estimate column, it calculates an ROC curve that is the inverse of what I want. It seems that it has identified .pred_no as the "real" estimate column.
roc <- roc_curve(predictions,
estimate = .pred_yes,
truth = promoted)
autoplot(roc)
Thoughts
Seems like the issue goes away when I supply pred_no as the estimate column to roc_curve()
FYI: this is my first stack overflow post, if you have any suggestions to make this post more clear/better formatted please let me know!
In factor(c("yes", "no")), "no" is the first level, because levels are sorted alphabetically, and yardstick treats the first level as the event of interest by default. In tidymodels, you can change the level of interest via the event_level argument:
library(tidyverse)
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#> method from
#> required_pkgs.model_spec parsnip
url <- "http://peopleanalytics-regression-book.org/data/salespeople.csv"
salespeople <- read_csv(url) %>%
mutate(promoted = factor(ifelse(promoted == 1, "yes", "no")))
#> Rows: 351 Columns: 4
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> dbl (4): promoted, sales, customer_rate, performance
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sales_split <- initial_split(salespeople)
sales_train <- training(sales_split)
sales_test <- testing(sales_split)
tree <- decision_tree() %>%
set_engine("rpart") %>%
set_mode("classification")
tree_fit <- tree %>% fit(promoted ~ ., data = sales_train)
sales_preds <- augment(tree_fit, sales_test)
sales_preds
#> # A tibble: 88 × 7
#> promoted sales customer_rate performance .pred_class .pred_no .pred_yes
#> <fct> <dbl> <dbl> <dbl> <fct> <dbl> <dbl>
#> 1 no 364 4.89 1 no 0.973 0.0267
#> 2 no 342 3.74 3 no 0.973 0.0267
#> 3 yes 716 3.16 3 yes 0 1
#> 4 no 450 3.21 3 no 0.973 0.0267
#> 5 no 372 3.87 3 no 0.973 0.0267
#> 6 no 535 4.47 2 no 0.973 0.0267
#> 7 yes 736 3.94 4 yes 0 1
#> 8 no 330 2.54 2 no 0.973 0.0267
#> 9 no 478 3.48 2 no 0.973 0.0267
#> 10 yes 728 2.66 3 yes 0 1
#> # … with 78 more rows
sales_preds %>%
roc_curve(promoted, .pred_yes, event_level = "second") %>%
autoplot()
Created on 2021-09-08 by the reprex package (v2.0.1)
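A possible alternative to event_level is to reorder the factor so that "yes" becomes the first level before modeling. The base R sketch below uses a made-up `promoted` vector (an assumption for illustration) to show the default ordering and one way to change it:

```r
# Levels are sorted alphabetically by default, so "no" comes first
promoted <- factor(c("yes", "no", "no", "yes"))
levels(promoted)

# Setting the levels explicitly puts "yes" first without changing the data
promoted_yes_first <- factor(promoted, levels = c("yes", "no"))
levels(promoted_yes_first)

# relevel() achieves the same by moving the reference level to the front
promoted_releveled <- relevel(promoted, ref = "yes")
levels(promoted_releveled)
```

Whether to reorder the levels or pass event_level = "second" is mostly a matter of taste; reordering affects every downstream function that looks at level order, while event_level only affects the metric call it is passed to.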
The answer to this question clearly explains how to retrieve tidy regression results by group when running a regression through a dplyr pipe, but the solution is no longer reproducible.
How can one use dplyr and broom in combination to run a regression by group and retrieve tidy results using R 4.0.2, dplyr 1.0.0, and broom 0.7.0?
Specifically, the example answer from the question linked above,
library(dplyr)
library(broom)
df.h = data.frame(
hour = factor(rep(1:24, each = 21)),
price = runif(504, min = -10, max = 125),
wind = runif(504, min = 0, max = 2500),
temp = runif(504, min = - 10, max = 25)
)
dfHour = df.h %>% group_by(hour) %>%
do(fitHour = lm(price ~ wind + temp, data = .))
# get the coefficients by group in a tidy data_frame
dfHourCoef = tidy(dfHour, fitHour)
returns the following error (and three warnings) when I run it on my system:
Error in var(if (is.vector(x) || is.factor(x)) x else as.double(x), na.rm = na.rm) :
Calling var(x) on a factor x is defunct.
Use something like 'all(duplicated(x)[-1L])' to test for a constant vector.
In addition: Warning messages:
1: Data frame tidiers are deprecated and will be removed in an upcoming release of broom.
2: In mean.default(X[[i]], ...) :
argument is not numeric or logical: returning NA
3: In mean.default(X[[i]], ...) :
argument is not numeric or logical: returning NA
If I reformat df.h$hour as a character rather than factor,
df.h <- df.h %>%
mutate(
hour = as.character(hour)
)
re-run the regression by group, and again attempt to retrieve the results using broom::tidy,
dfHour = df.h %>% group_by(hour) %>%
do(fitHour = lm(price ~ wind + temp, data = .))
# get the coefficients by group in a tidy data_frame
dfHourCoef = tidy(dfHour, fitHour)
I get this error:
Error in var(if (is.vector(x) || is.factor(x)) x else as.double(x), na.rm = na.rm) :
is.atomic(x) is not TRUE
I assume that the problem has to do with the fact that the group-level regression results are stored as lists in dfHour$fitHour, but I am unsure how to correct the error and once again tidily and quickly compile the regression results, as used to work in the originally posted code/answer.
****** Updated with more succinct code pulled from the dplyr 1.0.0 release notes ******
Thank you. I was struggling with a similar question with the update to dplyr 1.0.0 related to using the examples in the provided link. This was both a helpful question and answer.
One note as an FYI: do() has been superseded as of dplyr 1.0.0, so you may want to consider using the updated language (now very efficient with my update):
dfHour = df.h %>%
# replace group_by() with nest_by()
# to convert your model data to a vector of lists
nest_by(hour) %>%
# change do() to mutate(), then add list() before your model
# make sure to change data = . to data = data
mutate(fitHour = list(lm(price ~ wind + temp, data = data))) %>%
summarise(tidy(fitHour))
Done!
This gives a very efficient df with select output stats. The last line replaces the following code (from my original response), which does the same thing, but less easily:
ungroup() %>%
# then leverage the feedback from @akrun
transmute(hour, HourCoef = map(fitHour, tidy)) %>%
unnest(HourCoef)
dfHour
Which gives the output:
# A tibble: 72 x 6
hour term estimate std.error statistic p.value
<fct> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 (Intercept) 68.6 21.0 3.27 0.00428
2 1 wind 0.000558 0.0124 0.0450 0.965
3 1 temp -0.866 0.907 -0.954 0.353
4 2 (Intercept) 31.9 17.4 1.83 0.0832
5 2 wind 0.00950 0.0113 0.838 0.413
6 2 temp 1.69 0.802 2.11 0.0490
7 3 (Intercept) 85.5 22.3 3.83 0.00122
8 3 wind -0.0210 0.0165 -1.27 0.220
9 3 temp 0.276 1.14 0.243 0.811
10 4 (Intercept) 73.3 15.1 4.86 0.000126
# ... with 62 more rows
Thanks for the patience, I am working through this myself!
The issue is that the result of the do() call still carries a rowwise grouping attribute, and the column 'fitHour' is a list. We can ungroup, loop over the list with map, and tidy each model into a list column:
library(dplyr)
library(purrr)
library(tidyr) # for unnest(), used further down
library(broom)
df.h %>%
group_by(hour) %>%
do(fitHour = lm(price ~ wind + temp, data = .)) %>%
ungroup %>%
mutate(HourCoef = map(fitHour, tidy))
Or use unnest after the mutate (here transmute, to keep only the needed columns):
df.h %>%
group_by(hour) %>%
do(fitHour = lm(price ~ wind + temp, data = .)) %>%
ungroup %>%
transmute(hour, HourCoef = map(fitHour, tidy)) %>%
unnest(HourCoef)
# A tibble: 72 x 6
# hour term estimate std.error statistic p.value
# <fct> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 1 (Intercept) 89.8 20.2 4.45 0.000308
# 2 1 wind 0.00493 0.0151 0.326 0.748
# 3 1 temp -1.84 1.08 -1.71 0.105
# 4 2 (Intercept) 75.6 23.7 3.20 0.00500
# 5 2 wind -0.00910 0.0146 -0.622 0.542
# 6 2 temp 0.192 0.853 0.225 0.824
# 7 3 (Intercept) 44.0 23.9 1.84 0.0822
# 8 3 wind -0.00158 0.0166 -0.0953 0.925
# 9 3 temp 0.622 1.19 0.520 0.609
#10 4 (Intercept) 57.8 18.9 3.06 0.00676
# … with 62 more rows
If we wanted a single dataset, we could pull the 'fitHour' column, loop over the list with map, and condense the results to a single dataset by row binding (the _dfr suffix):
df.h %>%
group_by(hour) %>%
do(fitHour = lm(price ~ wind + temp, data = .)) %>%
ungroup %>%
pull(fitHour) %>%
map_dfr(tidy, .id = 'grp')
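For comparison, the split/fit/combine pattern behind these pipelines can be sketched in base R alone. This is only an illustration of the idea, not a replacement for broom::tidy(): the seed is an arbitrary choice, and the coefficient columns keep the names that summary.lm() produces.

```r
# Same simulated data as in the question (the seed is an arbitrary choice)
set.seed(42)
df.h <- data.frame(
  hour  = factor(rep(1:24, each = 21)),
  price = runif(504, min = -10, max = 125),
  wind  = runif(504, min = 0, max = 2500),
  temp  = runif(504, min = -10, max = 25)
)

# Split by hour and fit one lm per group; the result is a named list of models
fits <- lapply(
  split(df.h, df.h$hour),
  function(d) lm(price ~ wind + temp, data = d)
)

# Extract each coefficient table and row-bind them into one data frame
coefs <- do.call(rbind, lapply(names(fits), function(h) {
  est <- coef(summary(fits[[h]]))
  data.frame(hour = h, term = rownames(est), est,
             row.names = NULL, check.names = FALSE)
}))

head(coefs)
```

Each of the 24 models contributes three rows (intercept, wind, temp), so the combined table has 72 rows, the same shape as the tidy output above.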
NOTE: The OP's error message could be replicated with R 4.0.2, dplyr 1.0.0 and broom 0.7.0:
tidy(dfHour,fitHour)
Error in var(if (is.vector(x) || is.factor(x)) x else as.double(x),
na.rm = na.rm) :
Calling var(x) on a factor x is defunct.
Use something like 'all(duplicated(x)[-1L])' to test for a constant vector.
In addition: Warning messages:
1: Data frame tidiers are deprecated and will be removed in an upcoming release of broom.
2: In mean.default(X[[i]], ...) :
Your code actually works for me. Maybe checking your package versions or restarting in a new R session could help:
library(dplyr)
library(broom)
df.h = data.frame(
hour = factor(rep(1:24, each = 21)),
price = runif(504, min = -10, max = 125),
wind = runif(504, min = 0, max = 2500),
temp = runif(504, min = - 10, max = 25)
)
dfHour = df.h %>% group_by(hour) %>%
do(fitHour = lm(price ~ wind + temp, data = .))
tidy(dfHour,fitHour)
# A tibble: 72 x 6
# Groups: hour [24]
hour term estimate std.error statistic p.value
<fct> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 (Intercept) 66.4 14.8 4.48 0.000288
2 1 wind 0.000474 0.00984 0.0482 0.962
3 1 temp 0.0691 0.945 0.0731 0.943
4 2 (Intercept) 66.5 20.4 3.26 0.00432
5 2 wind -0.00540 0.0127 -0.426 0.675
6 2 temp -0.306 0.944 -0.324 0.750
7 3 (Intercept) 86.5 17.3 5.00 0.0000936
8 3 wind -0.0119 0.00960 -1.24 0.232
9 3 temp -1.18 0.928 -1.27 0.221
10 4 (Intercept) 59.8 17.5 3.42 0.00304
# ... with 62 more rows