I'm building dozens of predictive models in an effort to identify a champion model. I'm working with gigabytes of data, so tracking run time is important.
I'd like to build all my models in a list-type format, so I don't have to manage all the different model names within the Global Environment. However, it seems that the only way to get timings per model is to have separate named objects.
Here's a basic method that approaches what I'm looking for:
library(tidyverse)
# Basic Approach
Time_1 <- system.time(
Model_1 <- lm(am ~ disp, mtcars)
)
Time_2 <- system.time(
Model_2 <- lm(am ~ disp + cyl, mtcars)
)
# etc. for dozens more
Time_List <-
mget(ls(pattern = "Time")) %>%
bind_rows()
However, as you can see, I have to manually name each model and time record. What I'm looking for is something similar to the table produced with the following code, where "xxx" is an actual record of run time.
# Tribble Output
tribble(
~Model_Name, ~Model_Function, ~Run_Time,
"Model_1", lm(am ~ disp, mtcars), "xxx",
"Model_2", lm(am ~ disp + cyl, mtcars), "xxx"
)
# A tibble: 2 × 3
Model_Name Model_Function Run_Time
<chr> <list> <chr>
1 Model_1 <S3: lm> xxx
2 Model_2 <S3: lm> xxx
I'd appreciate any input provided, regardless of packages used.
If you assign within system.time, you can save both the time and what's computed. If you assign the results to a list column, you can unpack it:
library(tidyverse)
tibble(formula = c(mpg ~ wt, mpg ~ wt + hp)) %>%
  mutate(model_time = map(formula, ~{
           # assigning inside system.time() keeps both the fit and its timing
           time <- system.time(model <- lm(.x, mtcars))
           lst(model, time)
         }),
         model = map(model_time, 'model'),
         time = map(model_time, 'time')) %>%
  select(-model_time)
#> # A tibble: 2 × 3
#> formula model time
#> <list> <list> <list>
#> 1 <S3: formula> <S3: lm> <S3: proc_time>
#> 2 <S3: formula> <S3: lm> <S3: proc_time>
Because the columns are all still lists it doesn't look like much, but all of the data is now there and can be further extracted.
An equivalent alternative:
tibble(formula = c(mpg ~ wt, mpg ~ wt + hp)) %>%
  mutate(model_time = map(formula, ~{
    time <- system.time(model <- lm(.x, mtcars))
    # return a one-row tibble so unnest() can spread it into columns
    tibble(model = list(model),
           time = list(time))
  })) %>%
  unnest(model_time)
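If you want the run time as a plain number rather than a proc_time object, one option is to pull out the "elapsed" entry. This is only a sketch, not part of the answer above: the column names follow the table in the question, and the named list formulas is my own assumption about how you might store the model specifications.

```r
library(dplyr)
library(purrr)
library(tibble)

# Hypothetical named list of formulas; the names become the model labels.
formulas <- list(Model_1 = am ~ disp, Model_2 = am ~ disp + cyl)

timed <- tibble(Model_Name = names(formulas), formula = formulas) %>%
  mutate(
    # Fit and time each model in one step, keeping both results.
    fit = map(formula, function(f) {
      time <- system.time(model <- lm(f, data = mtcars))
      list(model = model, time = time)
    }),
    Model_Function = map(fit, "model"),
    # proc_time is a named numeric vector; "elapsed" is wall-clock seconds.
    Run_Time = map_dbl(fit, function(x) x$time[["elapsed"]])
  ) %>%
  select(Model_Name, Model_Function, Run_Time)

timed
```

Run_Time is then an ordinary double column, so you can sort or filter on it directly.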
I need to run a lot of replicates on the same model but cycle different data into it on each iteration.
e.g.
db1 <- mtcars
db2 <- mtcars
db3 <- mtcars
for(i in 1:db) {
# keep model structure but alternate the data
lm(mpg ~ wt, data = db[i])
}
I need to create a for-loop or a function that can run the model on db1, then swap in db2 and run the same model. I also need them to be stored as separate objects in my R environment e.g. lm1 (for db1) and lm2 (for db2)
Can someone please help me automate this?
thanks
The method I would use for something like this is to map a function over a list of dataframes. My preferred method would be to use a nested dataframe with a column for the dataframe name, a column holding the dataframe itself, and an added column for the linear model.
I have coded a version of this below using the map function which takes our vector of dataframes and applies lm to each entry.
library(tidyverse)
db1 <- mtcars
db2 <- mtcars
db3 <- mtcars
# Place dataframes in a list (note: do not use c(), which would flatten the dataframes into one list of columns)
a <- list(db1, db2 , db3)
# Construct our dataframe
df <- tibble(entry = 1:3, dataframes = a)
df %>%
# Map the lm function to all of the dataframes
mutate(lm = map(dataframes, ~lm(mpg~wt, data = .x)))
#> # A tibble: 3 x 3
#> entry dataframes lm
#> <int> <list> <list>
#> 1 1 <df[,11] [32 x 11]> <lm>
#> 2 2 <df[,11] [32 x 11]> <lm>
#> 3 3 <df[,11] [32 x 11]> <lm>
Created on 2021-04-06 by the reprex package (v2.0.0)
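A slightly more intuitive method using lists only could be as follows.
(Note that b[i] <- ... stores only the first component of each fit, the coefficients, which is why the warning below appears and why the rest of the model, including the call to lm, is lost. Creating the list with vector("list", 3) and assigning with b[[i]] would keep the whole lm object.)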
library(tidyverse)
db1 <- mtcars
db2 <- mtcars
db3 <- mtcars
a <- list(db1, db2 , db3)
b <- rep(list(), 3)
for(i in 1:3) {
b[i] <- lm(mpg~wt, data = a[[i]])
}
#> Warning in b[i] <- lm(mpg ~ wt, data = a[[i]]): number of items to replace is
#> not a multiple of replacement length
b
#> [[1]]
#> (Intercept) wt
#> 37.285126 -5.344472
#>
#> [[2]]
#> (Intercept) wt
#> 37.285126 -5.344472
#>
#> [[3]]
#> (Intercept) wt
#> 37.285126 -5.344472
Created on 2021-04-06 by the reprex package (v2.0.0)
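For completeness, here is a sketch of the same loop that keeps the full lm objects. The names fits, dbs and the lm1/lm2/lm3 labels are my own choices for illustration, picked to match the object names asked for in the question.

```r
# Three copies of mtcars stand in for the real datasets.
dbs <- list(db1 = mtcars, db2 = mtcars, db3 = mtcars)

# Pre-allocate a list with one slot per dataset.
fits <- vector("list", length(dbs))
names(fits) <- paste0("lm", seq_along(dbs))

for (i in seq_along(dbs)) {
  # Double brackets store the whole model object in slot i.
  fits[[i]] <- lm(mpg ~ wt, data = dbs[[i]])
}

# Each model is available by name, e.g. fits$lm1 or fits[["lm2"]].
summary(fits$lm1)$r.squared
```

Keeping the models in one named list gives you lm1, lm2, ... style access without creating separate objects in the global environment.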
Create a list of data frames rather than keeping each data frame as a separate object: looping over db1, db2, db3 by name is harder than looping over the elements of a list. Below, dfs is simply a list of data frames on which you fit your models. I have created random datasets from mtcars; in your case you probably already have datasets saved as db1, db2, db3, so you can do either of the following:
a) dfs <- list(db1, db2, db3), then use dfs with lapply like this: mymodels <- lapply(dfs, function(x) lm(mpg ~ wt, data = x))
b) dfs <- mget(ls(pattern = '^db\\d+'), envir = globalenv()), where pattern describes the names of your datasets (here, names that start with db followed by a number), then use the same lapply call as above: mymodels <- lapply(dfs, function(x) lm(mpg ~ wt, data = x))
Here is an example using randomly selected rows of mtcars to show one way of doing it.
# Create a list of data frames from random row samples
# Draw n = 3 samples of 80% of the rows, with seed 1 for reproducibility
set.seed(1)
n <- 3
prop <- 0.8
dfs <- lapply(data.frame(replicate(n, sample(1:nrow(mtcars), prop * nrow(mtcars)))), function(x) mtcars[x, ])
## replicate() runs the sample() call n times and returns a matrix of row indices, one column per sample drawn from mtcars
mymodels <- lapply(dfs, function(x) lm(mpg ~ wt, data = x)) # mymodels is your output
Output:
$X1
Call:
lm(formula = mpg ~ wt, data = x)
Coefficients:
(Intercept) wt
38.912167 -5.874795
$X2
Call:
lm(formula = mpg ~ wt, data = x)
Coefficients:
(Intercept) wt
37.740419 -5.519547
$X3
Call:
lm(formula = mpg ~ wt, data = x)
Coefficients:
(Intercept) wt
39.463332 -6.051852
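Once the models sit in a list like this, you can summarise them all in one go. A small sketch, assuming a named list of data frames; the subsets of mtcars and the name coef_table are just placeholders I made up for the example.

```r
# Hypothetical datasets: three different slices of mtcars.
dfs <- list(db1 = mtcars, db2 = mtcars[1:20, ], db3 = mtcars[10:32, ])

# Same model, different data on each iteration.
mymodels <- lapply(dfs, function(x) lm(mpg ~ wt, data = x))

# sapply() returns one column per model; transpose for one row per model.
coef_table <- t(sapply(mymodels, coef))
coef_table
```

Because the list is named, the row names of coef_table tell you which dataset each set of coefficients came from.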
I would like to apply a function across columns of a nested grouped tibble as in the example below.
library(tidyverse)
df <- swiss %>%
group_by(Catholic > 20) %>%
nest()
Which results in a tibble that looks like:
> df
# A tibble: 2 x 2
# Groups: Catholic > 20 [2]
`Catholic > 20` data
<lgl> <list>
1 FALSE <tibble [26 × 6]>
2 TRUE <tibble [21 × 6]>
Now I make some function to build a model
fit <- function(df, modL = NA){
if (modL == 1) {fit <- lm(Fertility ~ Education, data = df)}
if (modL == 2) {fit <- lm(Fertility ~ Education + Examination, data = df)}
fit
}
Now I map that model to columns of the grouped data and make two new variables to store the model fits.
df <- df %>%
mutate(model1 = map(data, fit, modL = 1)) %>%
mutate(model2 = map(data, fit, modL = 2))
Which produces a tibble with two new columns that contain the model fits
> df
# A tibble: 2 x 4
# Groups: Catholic > 20 [2]
`Catholic > 20` data model1 model2
<lgl> <list> <list> <list>
1 FALSE <tibble [26 × 6]> <lm> <lm>
2 TRUE <tibble [21 × 6]> <lm> <lm>
What I want to achieve is a purrr-type map function that does the same thing as the following code.
anova(df$model1[[1]], df$model2[[1]])
anova(df$model1[[2]], df$model2[[2]])
I thought the following code would work, but it does not.
map(df[,3:4], anova)
Gurus, how do I map a function across columns of a nested and grouped dataset to give one result per row using the columns of that row as input?
Brant
df %>%
mutate(anova = map2(model1, model2, ~ anova(.x,.y)))%>%
mutate(pvalue = map_dbl(anova, ~.$`Pr(>F)`[2]))
I think this is what you want; can you clarify, please? The second mutate pulls out the p-value of the anova for each pairwise comparison.
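To spell out why map(df[,3:4], anova) does not work: map over a data frame iterates over its columns, so anova is called once on all of model1 and once on all of model2, whereas map2 walks the two columns in parallel, one row at a time. Here is a self-contained sketch of the whole pipeline; the column names high_catholic, comparison and p_value are my own, and I ungroup after nesting to keep the mutate simple.

```r
library(dplyr)
library(tidyr)
library(purrr)

nested <- swiss %>%
  mutate(high_catholic = Catholic > 20) %>%
  group_by(high_catholic) %>%
  nest() %>%
  ungroup() %>%
  mutate(
    model1 = map(data, function(d) lm(Fertility ~ Education, data = d)),
    model2 = map(data, function(d) lm(Fertility ~ Education + Examination, data = d)),
    # map2() pairs model1[[i]] with model2[[i]] for each row i.
    comparison = map2(model1, model2, anova),
    # Row 2 of the anova table holds the test of model2 against model1.
    p_value = map_dbl(comparison, function(a) a[["Pr(>F)"]][2])
  )

select(nested, high_catholic, p_value)
```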
I am trying to write a function that iterates (or uses purrr::map()) through every level of a factor, and fits an lm() model for the subset of the data where the factor is equal to that level.
To make a simple reproducible example with mtcars, just say that I'd like a different lm model for each value of mtcars$gear. I'll start by making it a factor, because my real problem involves iteration through a factor:
library(tidyverse)
mtcars <- mtcars %>%
mutate(factor_gear = factor(gear))
I'd like the function to fit every level of factor_gear. The levels are given by:
levels(mtcars$factor_gear)
i.e.
[1] "3" "4" "5"
So the output I would be looking for would be:
fit1 <- lm(mpg ~ cyl, data = mtcars %>% filter(factor_gear=="3"))
fit2 <- lm(mpg ~ cyl, data = mtcars %>% filter(factor_gear=="4"))
fit3 <- lm(mpg ~ cyl, data = mtcars %>% filter(factor_gear=="5"))
fits <- list(fit1, fit2, fit3)
I've made a start on the function, but wasn't able to get it to work.
I thought that a function should:
1. get every level of the factor into a vector
2. run an lm model for each level.
fit_each_level <- function(factor_variable) {
# trying to: 1. get every level of of the factor into a vector
factor_levels <- levels(df_cars$factor_variable)
# trying to: 2. run an lm model for each level.
for i in factor_levels {
fit <- mtcars %>% filter(factor_variable==i [# every value of segment_levels]) %>%
lm(mpg ~ cyl, data = . )
}
}
fit_each_level(factor_gear)
If the function worked well, I'd ultimately be able to do it on another factor, e.g.:
mtcars <- mtcars %>%
mutate(factor_carb = factor(carb))
fit_each_level(factor_carb)
You can nest the dataframe and use map to apply lm for each factor_gear.
library(dplyr)
library(purrr) # for map()
mtcars %>%
group_by(factor_gear) %>%
tidyr::nest() %>%
mutate(model = map(data, ~lm(mpg ~ cyl, data = .x)))
# factor_gear data model
# <fct> <list> <list>
#1 4 <tibble [12 × 11]> <lm>
#2 3 <tibble [15 × 11]> <lm>
#3 5 <tibble [5 × 11]> <lm>
In dplyr 1.0.0 and later you can use cur_data() to refer to the current group's data, which avoids the need for nest and map. (Note that cur_data() was deprecated in dplyr 1.1.0 in favour of pick().)
mtcars %>%
group_by(factor_gear) %>%
summarise(model = list(lm(mpg ~ cyl, data = cur_data())))
Make sure you have dplyr 1.0.0 or later. Then you can use:
model_coefs <- function(formula, data) {
coefs <- lm(formula, data)$coefficients
data.frame(coef = names(coefs), value = coefs)
}
mtcars %>%
dplyr::mutate(factor_gear = factor(gear)) %>%
dplyr::nest_by(factor_gear) %>%
dplyr::summarise(model_coefs(mpg ~ cyl, data)) %>%
tidyr::pivot_wider(names_from = coef, values_from = value)
# A tibble: 3 x 3
# Groups: factor_gear [3]
factor_gear `(Intercept)` cyl
<fct> <dbl> <dbl>
1 3 29.8 -1.83
2 4 41.3 -3.59
3 5 40.6 -3.2
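Since the question asked for a reusable function that takes the factor as an argument, here is one possible sketch in base R. It assumes the factor column is passed by name as a string and that the model is always mpg ~ cyl; the data frame cars is just a copy of mtcars I created for the example.

```r
# Fit mpg ~ cyl separately for every level of the column named factor_name.
fit_each_level <- function(data, factor_name) {
  # split() returns one data frame per level; drop = TRUE skips empty levels.
  pieces <- split(data, data[[factor_name]], drop = TRUE)
  lapply(pieces, function(d) lm(mpg ~ cyl, data = d))
}

cars <- transform(mtcars,
                  factor_gear = factor(gear),
                  factor_carb = factor(carb))

fits <- fit_each_level(cars, "factor_gear")
names(fits)
coef(fits[["4"]])
```

The returned list is named by factor level, so fit_each_level(cars, "factor_carb") works the same way for another factor.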
In the spirit of purrr, broom, modelr, I am trying to create a "meta" data.frame in which each row denotes the dataset (d) and the model parameters (yvar, xvars, FEvars). For instance:
iris2 <- iris %>% mutate(Sepal.Length=Sepal.Length^2)
meta <- data.frame(n=1:4,
yvar = c('Sepal.Length','Sepal.Length','Sepal.Length','Sepal.Length'),
xvars= I(list(c('Sepal.Width'),
c('Sepal.Width','Petal.Length'),
c('Sepal.Width'),
c('Sepal.Width','Petal.Length'))),
data= I(list(iris,iris,iris2,iris2)) )
Now, I would like to run a model for each row of "meta", and then add a list column "model" with the model output object. To run the model I use an auxiliary function that takes a dataset, a y variable and a vector of x variables:
OLS_help <- function(d,y,xvars){
paste(y, paste(xvars, collapse=" + "), sep=" ~ ") %>% as.formula %>%
lm(d)
}
y <- 'Sepal.Length'
xvars <- c('Sepal.Width','Petal.Length')
OLS_help(iris,y,xvars)
How can I execute OLS_help for all the rows of meta and add the output of OLS_help as a list column in meta? I tried the following code, but it did not work:
meta %>% mutate(model = map2(d,yvar,xvars,OLS_help) )
Error: Can't convert a `AsIs` object to function
Call `rlang::last_error()` to see a backtrace
Note: the solution for the case with only the "data" (nested) list column (covered in Hadley's book here) is:
by_country <- gapminder %>% group_by(country, continent) %>% nest()
country_model <- function(df) { lm(lifeExp ~ year, data = df) }
by_country <- by_country %>% mutate(model = map(data, country_model))
We can use pmap in the following way
df <- meta %>%
as_tibble() %>%
mutate_if(is.factor, as.character) %>%
mutate(fit = pmap(
list(yvar, xvars, data),
function(y, x, df) lm(reformulate(x, response = y), data = df)))
## A tibble: 4 x 5
# n yvar xvars data fit
# <int> <chr> <I<list>> <I<list>> <list>
#1 1 Sepal.Length <chr [1]> <df[,5] [150 × 5]> <lm>
#2 2 Sepal.Length <chr [2]> <df[,5] [150 × 5]> <lm>
#3 3 Sepal.Length <chr [1]> <df[,5] [150 × 5]> <lm>
#4 4 Sepal.Length <chr [2]> <df[,5] [150 × 5]> <lm>
Explanation: pmap iterates over multiple arguments simultaneously (similar to base R's Map); here we simultaneously loop through the entries in columns yvar, xvars and data, then use reformulate to construct the formula to be used within lm. We store the lm fit object in column fit.
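To illustrate the comparison with base R's Map mentioned above, here is a minimal sketch of the same idea without purrr. The list specs is a made-up, cut-down stand-in for the meta data frame from the question.

```r
# Hypothetical model specifications: response, predictors and data per model.
specs <- list(
  yvar  = c("Sepal.Length", "Sepal.Length"),
  xvars = list("Sepal.Width", c("Sepal.Width", "Petal.Length")),
  data  = list(iris, iris)
)

# Map() calls the function with the i-th element of each argument in turn.
fits <- Map(
  function(y, x, d) lm(reformulate(x, response = y), data = d),
  specs$yvar, specs$xvars, specs$data
)

# reformulate() builds e.g. Sepal.Length ~ Sepal.Width + Petal.Length.
formula(fits[[2]])
```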
I am trying to add a set of extrapolated "observations" to a matrix in R. I know how to do this using normal programming techniques (read: a bunch of nested loops and functions) but I feel this must be possible in a much cleaner way by using built-in R functionality.
The code below illustrates the point, and where it breaks down
Many thanks in advance for your help!
With kind regards
Sylvain
library(dplyr)
# The idea is that i have a table of observations for e.g. x=5, 6, 7, 8, 9 and 10. The observations (in this example 2)
# conform fairly decently to sets of 2nd order polynomials.
# Now, I want to add an extrapolated value to this table (e.g. x=4). I know how to do this programmically
# but I feel there must be a cleaner solution to do this.
#generate dummy data table
x <- 5:10
myData <- tibble(x, a = x^2 * 2 + x * 3 + 4 + rnorm(1,0,0.01), b = x^2 * 3 + x * 4 + 5 + rnorm(1,0,0.01) )
#Gather (put in Data-Key format)
myDataKeyFormat <- gather(myData,key = "someLabel", value = "myObservation", -x)
fitted_models <- myDataKeyFormat %>% group_by(someLabel) %>% do(model = lm(myObservation ~ poly(x,2), data = .))
myExtrapolatedDataPointx <- tibble(x = 4)
#Add the x=4 field
fitted_points <- fitted_models %>% group_by(someLabel) %>% do(predict(.$model,myExtrapolatedDataPointx)) #R really doesnt like this bit
#append the fitted_points to the myDataKeyFormat
myDataKeyFormatWithExtrapolation <- union(myDataKeyFormat,fitted_points)
#use spread to
myDataWithExtrapolation <- myDataKeyFormatWithExtrapolation %>% spread(someLabel,myObservation)
Here is a solution in the tidyverse, using purrr to create the different models. The idea is to nest the data (using tidyr::nest) and then use purrr::map to fit a model per group. I then add new values and compute the predictions using modelr::add_predictions. This keeps everything in one place, organised by your variable someLabel: training data, models, new data and predictions. I also give you a way to visualise the data.
You can check R for Data Science by Hadley Wickham & Garrett Grolemund, and especially the part about models for more information.
library(dplyr)
library(tibble)
library(tidyr)
library(purrr)
library(modelr)
library(ggplot2)
set.seed(1) # For reproducibility
x <- 5:10
myData <- tibble(x,
a = x^2 * 2 + x * 3 + 4 + rnorm(1,0,0.01),
b = x^2 * 3 + x * 4 + 5 + rnorm(1,0,0.01))
#Gather (put in Data-Key format)
myDataKeyFormat <- gather(myData,key = "someLabel", value = "myObservation", -x)
myModels <- myDataKeyFormat %>%
nest(data = -someLabel) %>% # tidyr >= 1.0.0 expects the nested column to be named
mutate(model = map(data, ~lm(myObservation ~ poly(x,2), data = .x)))
Here is the result at this point : you have a model for each value of someLabel.
# A tibble: 2 × 3
someLabel data model
<chr> <list> <list>
1 a <tibble [6 × 2]> <S3: lm>
2 b <tibble [6 × 2]> <S3: lm>
I'll add some data points in a new column (map is used here to create one tibble of new x values for each row of the data frame).
# New data
new_data <- myModels %>%
mutate(new = map(data, ~tibble(x = c(3, 4, 11, 12))))
I add the predictions: add_predictions takes a data frame and a model as arguments, so I use map2 to map over the new data and the models.
fitted_models <- new_data %>%
mutate(new = map2(new, model, ~add_predictions(.x, .y)))
fitted_models
# A tibble: 2 × 4
someLabel data model new
<chr> <list> <list> <list>
1 a <tibble [6 × 2]> <S3: lm> <tibble [4 × 2]>
2 b <tibble [6 × 2]> <S3: lm> <tibble [4 × 2]>
There you go: you have for each label the data and model trained on this data, and the new data with predictions.
In order to plot it, I use unnest to take the data back to the data frame, and I bind the rows to have the "old" data and the new values together.
my_points <- bind_rows(unnest(fitted_models, data),
unnest(fitted_models, new))
ggplot(my_points)+
geom_point(aes(x = x, y = myObservation), color = "black") +
geom_point(aes(x = x, y = pred), color = "red")+
facet_wrap(~someLabel)
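If you just want to see what the prediction step does for a single model, the following base R sketch may help. It uses noise-free data so the result is easy to check by hand, and the names obs and new_points are only for illustration; it is not the code from the answer above.

```r
# One noise-free series following 2 * x^2 + 3 * x + 4.
x <- 5:10
obs <- data.frame(x = x, a = 2 * x^2 + 3 * x + 4)

# Second-order polynomial fit, as in the question.
model <- lm(a ~ poly(x, 2), data = obs)

# predict() evaluates the fitted polynomial at x values outside the data.
new_points <- data.frame(x = c(4, 11))
new_points$pred <- predict(model, newdata = new_points)
new_points
```

With exact quadratic data the predictions match the formula (48 at x = 4 and 279 at x = 11, up to floating-point error); with noisy observations they will only be approximate.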