I came across this example
library(mtcars)
set.seed(17)
cv.error.10 = rep(0,10)
for (i in 1:10){
glm.fit = glm(mpg∼poly(horsepower ,i),data=Auto)
cv.error.10[i] = cv.glm(Auto,glm.fit,K=10)$delta[1]
}
cv.error.10
[1] 24.21 19.19 19.31 19.34 18.88 19.02 18.90 19.71 18.95 19.50
I have been trying to pick up purrr and modelr. This seemed like a good example to try to replicate as it includes both a loop and cross validation. How would I convert this code to something more tidy verse like?
Update
With the below suggestions, this is where the code is at
data(mtcars)
cv_mtcars = mtcars %>%
crossv_kfold(k = 5)
cv_models = cv_mtcars %>%
mutate(model = map(train, ~lm(mpg ~ hp, data = .)),
rmse_all_models = map2_dbl(model, test, ~rmse(.x, .y)))
print(cv_models)
What I would like to do is repeat this for increasing polynomials of hp such as hp^2, hp^3 etc. I am guessing there is a purr way to do this.
Update 2
Here is an example of the un-iterated code
data(mtcars)
cv_mtcars = mtcars %>%
crossv_kfold(k = 5)
cv_models = cv_mtcars %>%
mutate(model1 = map(train, ~lm(mpg ~ hp, data = .)),
model2 = map(train, ~lm(mpg ~I(hp^2), data = .)),
model3 = map(train, ~lm(mpg ~I(hp^3), data = .)),
model4 = map(train, ~lm(mpg ~I(hp^4), data = .)),
model5 = map(train, ~lm(mpg ~I(hp^5), data = .)),
model6 = map(train, ~lm(mpg ~I(hp^6), data = .)),
rmse_all_models1 = map2_dbl(model1, test, ~rmse(.x, .y)),
rmse_all_models2 = map2_dbl(model2, test, ~rmse(.x, .y)),
rmse_all_models3 = map2_dbl(model3, test, ~rmse(.x, .y)),
rmse_all_models4 = map2_dbl(model4, test, ~rmse(.x, .y)),
rmse_all_models5 = map2_dbl(model5, test, ~rmse(.x, .y)),
rmse_all_models6 = map2_dbl(model6, test, ~rmse(.x, .y)))
print(cv_models)
I don't know the mtcars library but if you need access to the mtcars data, you can use the following:
data(mtcars)
library(tidyverse)
library(modelr)
Then you can create a list of resamples with cross_mc()
cv_mtcars = mtcars %>%
crossv_mc(n = 50)
print(cv_mtcars)
Now you can train your model on the resamples. train is the column holding the data frames for training. I use mutate() to a column called model were I mapped the lm() function (or any other model) to the data.
cv_models = cv_mtcars %>%
mutate(model = map(train, ~lm(mpg ~ horsepower, data = .)))
print(cv_models)
You can add the root mean square error with the rmse() function from modelr:
rmse_cv = cv_models %>%
mutate(rmse_all_models = map2_dbl(model, test, ~rmse(.x, .y))) %>%
pull(rmse_all_models)
print(rmse_cv)
You can then compute any statistic of the rmse() you need. If you're not familiar with the concept of list columns, this code can bea bit overwhelming. You can read more about list columns here: https://campus.datacamp.com/courses/exploratory-data-analysis-in-r-case-study/tidy-modeling-with-broom?ex=10&_escaped_fragment_=#skiponboarding
I'm on a public computer so I could not try the code, but it should work.
Update
So I misunderstood the question a little, here is some more pointers:
powers = seq(1:6)
create_form = function(power){
rhs = substitute(I(hp^pow), list(pow=power))
rlang::new_formula(quote(mpg), rhs)
}
This function creates formulae, and then you can map a sequence of powers to this function:
list_forms = map(seq(1,6), create_form)
And then map the resulting list to lm:
map(list_forms, lm, data=mtcars)
To integrate this in a pipe workflow you need to create a new function:
train_model = function(cv_data, form){
cv_data %>%
mutate(model = map(train, ~lm(form, data = .)))
}
Test it on one model:
test = train_model(cv_mtcars, list_forms[[1]])
And now run it on everything:
all_models = map(list_forms, train_model, cv_data=cv_mtcars)
Hope this helps.
Related
I am analyzing the data using ANCOVA method and also trying to run several multiple linear regression from the purrr package.
The model should be:
m1 <- lm(PD~SR*Type, data = df (with interaction term)
m2 <- lm(PD~SR+Type, data = df (without interaction term)
My example code
library(broom)
library(purrr)
df <- data.frame(PD=c(10,20,30,40,50,10,20,33,12,52,21,43),
SR=c(5,10,20,24,6,21,59,22,1,11,12,3),
n=c("tree1", "tree1", "tree1",
"tree2", "tree2","tree2",
"tree3", "tree3", "tree3",
"tree4", "tree4","tree4"),
Type=c("a", "b",'c',
"a", "b",'c',
"a", "b",'c',
"a", "b",'c'))
x <- df %>%
nest(data = -n) %>%
mutate(fit = map(data, ~lm(PD~SR * Type, data= .x)),
tidied = map(fit, tidy),
glanced = map(fit, glance),
augmented = map(fit, augment))
tidy <- x %>% unnest(data) %>%
select(n, Type, PD, SR, tidied) %>%
unnest(tidied) %>%
select(n, Type, PD, SR, term, estimate) %>%
rename(, species_richness = SR)%>%
spread(term, estimate)
My problem is that I am expecting the intercept/slope is different to each Type.
However, the intercept/slope given by the code is the same for each Type.
You can check from tidy dataframe.
Any suggestions for this point? I am learning how to use the purrr package since it is quite powerfull.
or can u have other methods to do that?
I have a nested GBM, and am looking to extract the partial depndence, tryingto use the following query:
library(rsample) # data splitting
library(gbm) # basic implementation
library(xgboost) # a faster implementation of gbm
library(caret) # an aggregator package for performing many machine learning models
library(h2o) # a java-based platform
library(pdp) # model visualization
basic_gbm <- function(data) {
mymodel <- gbm(formula = mpg ~ . ,
distribution = "gaussian",
data = data ,
n.minobsinnode = 1,
bag.fraction = 1
)
return(mymodel)
}
blah_model <- mtcars %>%
group_by() %>%
nest() %>%
mutate(model = map(data, basic_gbm))
blah_summary <- mtcars %>%
group_by() %>%
nest() %>%
mutate(model = map(data, basic_gbm)) %>%
mutate(summary = map(model, summary)) %>%
mutate(all_data = pmap(list(data, summary), .f =left_join, by = character())) %>%
select(cols=c(all_data)) %>%
unnest(cols = c(cols)) %>%
ungroup()
blah_model %>%
left_join(blah_summary, by = character()) %>%
mutate(pred = map(model, partial, pred.var = var, n.trees = model$n.trees, train = data)) -- this does not work
This does work and is what I would want as a nested df for each var:
coeffs <- blah_model$model[[1]] %>%
partial(pred.var = 'disp', n.trees = blah_model$model[[1]]$n.trees, train = blah_model$data[[1]])
However, it is saying it is not finding the variables in the training data - the data I am passing through is the training data. The var in the map is from the summary functions - these are prediction variables.
I gave a better example
I am trying to implement parallel processing to make the STEPAIC function running on several models more efficient.
I've came into some errors, like vectors differing in length.
This is my code:
library(MASS)
library(furrr)
library(tidyverse)
library(prediction)
nb_thesis_inter <- function (df,mdl){
target_col <- "mpg"
interactions<- predictors(mdl)
target_formula <- as.formula(sprintf("%s ~ (%s)^2",
target_col,
paste(interactions, collapse = " + ")))
model <- MASS::glm.nb(target_formula, data = df)
model <- (MASS::stepAIC(model))
return (model)}
m<- mtcars %>% group_by(cyl) %>% nest()
m<- m %>% mutate(model= map(.x = data, .f = ~lm(mpg ~ .)))
m<- m %>%
mutate(model= future_map2(.x= data, .y=model, .f = nb_thesis_inter))
I am performing simple linear regressions on multiple groups from my data set. However, I want extract the summaries from each of these regressions and put them into a master table by organized by group. I can run it like this (and it works):
fit_basic <- rs2_anova %>% #Run multiple simple linear regressions
group_by(quant_method) %>%
nest() %>%
mutate(model = map(data, ~lm(recoveries ~ treatment, data = .)))
fit_basic_A <- fit_basic[[1,"model"]] #Remove the model from fit_basic
fit_basic_B <- fit_basic[[1,"model"]] #Remove the model from fit_basic
fit_basic_table_A <- get_regression_table(fit_basic_A) %>%
select("term", "estimate") %>%
pivot_wider(names_from = "term", values_from = "estimate") %>%
mutate(quant_method = "A")
fit_basic_table_B <- get_regression_table(fit_basic_A) %>%
select("term", "estimate") %>%
pivot_wider(names_from = "term", values_from = "estimate") %>%
mutate(quant_method = "B")
fit_basic_table <- rbind(fit_basic_table_A, fit_basic_table_B)
To save myself some lines of code (because I have many more groups than presented here) I thought I could use the map function, but I keep getting stuck at mapping the summary table, which throws an error:
fit_basic <- rs2_anova %>%
group_by(quant_method) %>%
nest() %>%
mutate(model = map(data, ~lm(recoveries ~ treatment, data = .))) %>%
mutate(summaries = map(data, get_regression_table(.$model)))
Error in input_checks(model, digits, print) :
Only simple linear regression models are supported. Try again using only `lm()` models as appropriate.
I also tried something along this line:
fit_basic_table <- map(fit_basic$model,
function(x) {
p <- get_regression_table(x)
cbind(par=rownames(p), p)
})
But I get a list of dataframes that I can't breakdown into a single dataframe and I have lost my group designations. I have tried:
fit_basic_table <- map(fit_basic$model,
function(x) {
p <- get_regression_table(x)
cbind(par=rownames(p), p)
}) %>%
map_df(as_tibble, .id = "id")
and
fit_basic_table <- map(fit_basic$model,
function(x) {
p <- get_regression_table(x)
cbind(par=rownames(p), p)
}) %>%
unnest(cols = "id")
Any thoughts on how to automate this?
*Random test dataframe:
quant_method <- c("A", "A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B","B","B")
treatment <- c("x","x","x","x","x","y","y","y","y","y","x","x","x","x","x","y","y","y","y","y")
recoveries <-c("88","86","87","82","85","76","65","55","72","71","98","96","97","92","99","66",
"55","55","62","61")
rs2_anova <- data.frame(quant_method, treatment, recoveries)
Here is one solution using tidyverse and broom packages. It is slightly different from the purrr method you were attempting, but I think the result shows the terms that you were interested in extracting from the lm object (i.e., term and estimate).
library(tidyverse)
library(broom)
#Added the stringsAsFactors argument = F to avoid an error in the lm model
rs2_anova <- data.frame(quant_method,
treatment,
recoveries,
stringsAsFactors = F)
fit_basic <- rs2_anova %>%
#Group by quant_method column
group_by(quant_method) %>%
#do the linear models by grouping var
do(model = lm(recoveries ~ treatment, data = .)) %>%
#tidy lm object and order it as tibble
tidy(model)
I found the answer here: unnest a list column after modeling after group_by in r
and modified it to:
fit_cec <- rs2_anova %>%
group_by(quant_method) %>%
nest %>%
mutate(data = map(data, ~ .x %>%
summarise(model = list(broom::tidy(lm(recoveries ~ treatment)))))) %>%
unnest(data) %>%
unnest(model)
or to get all estimates, predictions and summaries (https://drsimonj.svbtle.com/running-a-model-on-separate-groups). This also works nicely:
fit_cec <- rs2_anova %>%
group_by(quant_method) %>%
nest %>%
mutate(fit = map(data, ~ lm(loss_abs_BC_2 ~ cec, data = .)),
parameters = map(fit, tidy), #provides estimate table for slope and y-intercept with std.error and estimate p-values
summary = map(fit, broom::glance), #provides R2, adj.R2, sigma, and model p-value
predictions = map(fit, augment)) %>% #provides fitted values with residuals and errors
unnest(parameters) %>%
pivot_wider(names_from = term, values_from = c(estimate, std.error, statistic, p.value)) %>%
unnest(summary) %>%
unnest(predictions)
Consider the following example longitudinal/repeated measures dataset
library(tidyverse)
library(broom)
library(nlme)
data <- read.csv("https://stats.idre.ucla.edu/stat/data/study2.csv")
data <- data %>% mutate(dbp =rnorm(120, 30:150), sbp = rnorm(120, 50:200),bmi
= rnorm(120,15:40), chol = rnorm(120,50:350), insulin = rnorm(120,2:40), educ = rnorm(120,5:10))
I can use group_by %>% do(tidy(*)) to run several unadjusted and adjusted single-level regression models (looping through list of outcomes & exposures) and extract model results to the data frame as follows
out <-c("pulse","insulin","chol")
exp <- c("factor(exertype)","sbp","dbp")
conf <- c("bmi","factor(diet)")
#Unadjusted models - single level regression (lm)
#################################################
Unadjusted <- expand.grid(out, exp) %>%
group_by(Var1) %>% rowwise() %>%
summarise(frm = paste0(Var1, "~", Var2)) %>%
group_by(model_id = row_number(),frm) %>%
do(tidy(lm(.$frm, data = data))) %>%
mutate(lci = estimate-(1.96*std.error)) %>%
mutate(uci = estimate+(1.96*std.error))
#Adjusted models - single level regression (lm)
###############################################
Adjusted <- expand.grid(out, exp, conf) %>%
group_by(Var1, Var2) %>%
summarise(Var3 = paste0(Var3, collapse = "+")) %>%
rowwise() %>%
summarise(frm = paste0(Var1, "~", Var2, "+", Var3)) %>%
group_by(model_id = row_number(), frm) %>%
do(tidy(lm(.$frm, data = data))) %>%
mutate(lci = estimate-(1.96*std.error)) %>%
mutate(uci = estimate+(1.96*std.error))
I would like to use the same process to fit multilevel models to account for the repeated data. using the example code:
lme(sbp ~ pulse+factor(diet)+time, data=data, random= ~time|id, method ="ML")
HOWEVER, when I try to do this, e.g. using:
#Unadjusted models - multi-level regression (lme)
#################################################
Unadjusted <- expand.grid(out, exp) %>%
group_by(Var1) %>% rowwise() %>%
summarise(frm = paste0(Var1, "~", Var2)) %>%
group_by(model_id = row_number(),frm) %>%
do(tidy(lme(.$frm, data = data, random= ~time|id, method = "ML"))) %>%
mutate(lci = estimate-(1.96*std.error)) %>%
mutate(uci = estimate+(1.96*std.error))
I get the following error message:
Error in UseMethod("lme") : no applicable method for 'lme' applied to an object of class "character"
Any ideas on how to get this working for lme type models?
Unlike lm, lme will not accept a formula as a character value. You need to explicitly cast it to a formula. Try just adding as.formula()
do(tidy(lme(as.formula(.$frm), ...))) %>%