automate repeating models with different data forloop in R - r

I need to run a lot of replicates on the same model but cycle different data into it on each iteration.
e.g.
db1 <- mtcars
db2 <- mtcars
db3 <- mtcars
for(i in 1:db) {
# keep model structure but alternate the data
lm(mpg ~ wt, data = db[i])
}
I need to create a for-loop or a function that can run the model on db1, then swap in db2 and run the same model. I also need them to be stored as separate objects in my R environment e.g. lm1 (for db1) and lm2 (for db2)
Cn someone please help me automate this.
thanks

The method I would use to do something like this would be to use a map function over a list of dataframes. My preferred method would to use a nested dataframe where we have a column for dataframe name, the dataframe and we add a linear model column.
I have coded a version of this below using the map function which takes our vector of dataframes and applies lm to each entry.
library(tidyverse)
db1 <- mtcars
db2 <- mtcars
db3 <- mtcars
# Place dataframes in a liset (note do not use c() to put dfs into an array)
a <- list(db1, db2 , db3)
# Construct our dataframe
df <- tibble(entry = 1:3, dataframes = a)
df %>%
# Map the lm function to all of the dataframes
mutate(lm = map(dataframes, ~lm(mpg~wt, data = .x)))
#> # A tibble: 3 x 3
#> entry dataframes lm
#> <int> <list> <list>
#> 1 1 <df[,11] [32 x 11]> <lm>
#> 2 2 <df[,11] [32 x 11]> <lm>
#> 3 3 <df[,11] [32 x 11]> <lm>
Created on 2021-04-06 by the reprex package (v2.0.0)
A slighlty more intuitive method with lists only could be as follows:
(Note that some information i.e. the call to lm is lost)
library(tidyverse)
db1 <- mtcars
db2 <- mtcars
db3 <- mtcars
a <- list(db1, db2 , db3)
b <- rep(list(), 3)
for(i in 1:3) {
b[i] <- lm(mpg~wt, data = a[[i]])
}
#> Warning in b[i] <- lm(mpg ~ wt, data = a[[i]]): number of items to replace is
#> not a multiple of replacement length
b
#> [[1]]
#> (Intercept) wt
#> 37.285126 -5.344472
#>
#> [[2]]
#> (Intercept) wt
#> 37.285126 -5.344472
#>
#> [[3]]
#> (Intercept) wt
#> 37.285126 -5.344472
Created on 2021-04-06 by the reprex package (v2.0.0)

Create a list of data frames rather than individual data-frames as objects, as it is harder to loop db1, db2,db3 rather create data frames which are easier to loop inside lists. Here dfs created is basically list of dataframes on which you can create your models. Now here I have created random dataset with mtcars, In your case you might be having dataset already saved as db1, db2 or db3, so you can do either of these things:
a) dfs = list(db1, db2, db3) Use this dfs with lapply like this: mymodels <- lapply(dfs, function(x)lm(mpg ~ wt, data=x))
b) dfs <- mget(ls(pattern='^db\\d+'), envir = globalenv()) , here inside pattern you put your pattern of data , In this case it starts with db word and ending with a number, now use the similar lapply like above: mymodels <- lapply(dfs, function(x)lm(mpg ~ wt, data=x))
I have given one example from mtcars data using randomly selected rows to propose a way of doing it.
# Creating a list of data-frames randomly
# Using replicate function n(3) times here and picking 80% of data randomly, using seed value 1 for reproducibility
set.seed(1)
n <- 3
prop = .8
dfs <- lapply(data.frame(replicate(n, sample(1:nrow(mtcars), prop*nrow(mtcars)))), function(x)mtcars[x,])
## replicate function here replicates sample command n number of times and create a matrix of indexs of rows taken as different data points from mtcars dataset
mymodels <- lapply(dfs, function(x)lm(mpg ~ wt, data=x)) #mymodels is your output
Output:
$X1
Call:
lm(formula = mpg ~ wt, data = x)
Coefficients:
(Intercept) wt
38.912167 -5.874795
$X2
Call:
lm(formula = mpg ~ wt, data = x)
Coefficients:
(Intercept) wt
37.740419 -5.519547
$X3
Call:
lm(formula = mpg ~ wt, data = x)
Coefficients:
(Intercept) wt
39.463332 -6.051852

Related

How can I make a constrained linear model by group in df?

I need to make a constrained model by group in R. I tried the group_by and do() functions to estimate the unconstrained lm, but when I try the same for a constrained model with ConsReg it doesn´t work.
This worked for the unconstrained lm:
df_grouped <- df %>%
group_by(type, Region)
grouped_lm <- df_grouped %>%
do(tidy(lm(y ~ x, data =.)))
For the constrained model I tried this:
grouped_lm_constrained <- df_grouped %>%
do(ConsReg(formula = y ~ x, family = 'gaussian', optimizer = 'mcmc', LOWER = 0, UPPER = 1, data =.))
but gives me this error:
"Error in `do()`:
! Results 1, 2, 3, 4, 5, ... must be data frames, not ConsReg."
Does anyone know what's happening?
The problem you are facing stems from the broom::tidy function, which has no implementation for ConsReg models/objects. What you could do is write your custom function for extraction of the desired content from a ConsReg model/object. To know what the model object has in its belly you can i.e. generate just one model (one group) and call str(model) on it as well as str(summary(model)) to see what base R can do for you in terms of structuring the data. In the example below I extracted a not selection of what could be importante model content. You might have to adapt this according to your usecase and needs.
I really like the aproach of nested lists in tibbles and running models on those. Anyhow you can run the do() approach or even split the data.frame into a list where each item is a group and work mapping functions for example on those.
library(ConsReg)
library(dplyr)
library(purrr)
library(tidyr)
# Dummy data
df <- data.frame(g = sort(rep(c("A", "B") , "10")),
x = rep(1:10, 2),
y = c(1:10, seq(from = 1, to = 100, by = 10)))
# custom function which takes a model as input and parses the formula, coefficients plus aditional data and MAPE as a data.frame
myfun <- function(x){
cbind(fromula = x$formula %>% deparse,
as.data.frame(summary(x)$coefficients) %>% tibble::rownames_to_column() %>% dplyr::rename(Term = 1),
MAPE = x$metrics$MAPE)
}
# group the df for nesting in the next step
dplyr::group_by(df, g) %>%
# nest the columns of interest into a list where each item (aka group) contains the mentioned variables
tidyr::nest(data = c("x", "y")) %>%
# run run map functions on data to generate model and the custom extraction function
dplyr::mutate(crmod = purrr::map(data, ~ ConsReg(y ~ x, family = 'gaussian', optimizer = 'mcmc', LOWER = 0, UPPER = 1, data = .x)),
stats = purrr::map(crmod, ~ myfun(.x))) %>%
# unnest the stats column from list items do df row(s)
tidyr::unnest(stats)
# Groups: g [2]
g data crmod fromula Term Estimate StdErr t.value p.value MAPE
<chr> <list> <list> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A <tibble [10 x 2]> <ConsReg> y ~ x (Intercept) 1.12e-15 7.79e-16 1.44 1.87e- 1 1.87e-16
2 A <tibble [10 x 2]> <ConsReg> y ~ x x 1 e+ 0 1.03e- 1 9.67 1.09e- 5 1.87e-16
3 B <tibble [10 x 2]> <ConsReg> y ~ x (Intercept) 9.84e- 1 3.03e- 2 32.5 8.79e-10 8.58e- 1
4 B <tibble [10 x 2]> <ConsReg> y ~ x x 9.98e- 1 7.83e- 3 128. 1.59e-14 8.58e- 1

Keep identifying features when using map() to "spread_predictions" in R?

I am trying to build a linear model, and then make predictions with new data based on that linear model. The following chunk of code takes a given set of data (data1), and produces 20 models based on the fact that when I group by ID and plot, there are 20 groups:
modelobject <- data_1 %>%
group_by(ID, plot) %>%
do(model = lm(air_temp ~ water_temp, data = .)) %>%
ungroup()
Now that the model is designed, I want to use the map() function to make predictions across a new set of data (data_2) for each of those models:
modelled_values <- map(modelobject$model, ~ spread_predictions(data = data_2, models = .x))
This works great, except for the fact that the subsequent object modelled_values doesn't have the identifying features of the original models (i.e. their given ID and plot) as can be seen in the following output for the Value column (it produces 11 columns, none of which are identifying features):
Value
List of length 20
A data.frame with 52606 rows and 11 columns
....
I have ended up having to assume that they are just in the order I produced them in and manually label each model object with the following style of code:
modelled_values[[1]]$ID <- "ID1"
modelled_values[[2]]$ID <- "ID1"
modelled_values[[3]]$ID <- "ID2"
modelled_values[[4]]$ID <- "ID2"
...
Is there any way I can carry the identifying features of the original models over to these predicted data?
What about something like this:
modelobject <- mtcars %>%
group_by(vs, am) %>%
do(model = lm(mpg ~ hp, data = .))
preds <- modelobject %>%
group_by(vs, am) %>%
rowwise %>%
summarise(preds = list(predict(model, newdata=mtcars)))
preds
# # A tibble: 4 x 3
# # Groups: vs, am [4]
# vs am preds
# <dbl> <dbl> <list>
# 1 0 0 <dbl [32]>
# 2 0 1 <dbl [32]>
# 3 1 0 <dbl [32]>
# 4 1 1 <dbl [32]>
In the code above, preds is now a tibble with a column called preds where each element is as vector of predictions from the model for the relevant vs and am values in the row.

fit an `lm` model for every level of a factor

I am trying to write a function that iterates (or uses purrr::map()) through every level of a factor, and fits an lm() model for the subset of the data where the factor is equal to that level.
To make a simple reproducable example with mtcars, just say that I'd like a different lm model for each value of mtcars$gear. I'll start by making it a factor, because my real problem involves iteration through a factor:
library(tidyverse)
mtcars <- mtcars %>%
mutate(factor_gear = factor(gear))
I'd like the function to fit every level of factor_gear. The levels are given by:
levels(mtcars$factor_gear)
i.e.
[1] "3" "4" "5"
So the output I would be looking for would be:
fit1 <- lm(mpg ~ cyl, data = mtcars %>% filter(factor_gear=="3"))
fit2 <- lm(mpg ~ cyl, data = mtcars %>% filter(factor_gear=="4"))
fit3 <- lm(mpg ~ cyl, data = mtcars %>% filter(factor_gear=="5"))
fits <- list(fit1, fit2, fit3)
I've made a start on the function, but wasn't able to get it to work.
I thought that a function should:
get every level of of the factor into a vector
run an lm model for each level.
fit_each_level <- function(factor_variable) {
# trying to: 1. get every level of of the factor into a vector
factor_levels <- levels(df_cars$factor_variable)
# trying to: 2. run an lm model for each level.
for i in factor_levels {
fit <- mtcars %>% filter(factor_variable==i [# every value of segment_levels]) %>%
lm(mpg ~ cyl, data = . )
}
}
fit_each_level(factor_gear)
If the function worked well, I'd ultimately be able to do do it on another factor, eg:
mtcars <- mtcars %>%
mutate(factor_carb = factor(carb))
fit_each_level(factor_carb)
You can nest the dataframe and use map to apply lm for each factor_gear.
library(dplyr)
mtcars %>%
group_by(factor_gear) %>%
tidyr::nest() %>%
mutate(model = map(data, ~lm(mpg ~ cyl, data = .x)))
# factor_gear data model
# <fct> <list> <list>
#1 4 <tibble [12 × 11]> <lm>
#2 3 <tibble [15 × 11]> <lm>
#3 5 <tibble [5 × 11]> <lm>
In the new dplyr you can use cur_data to refer to current data in group which avoids the need of nest and map.
mtcars %>%
group_by(factor_gear) %>%
summarise(model = list(lm(mpg ~ cyl, data = cur_data())))
Make sure you have the latest version of dplyr (1.0.0). Then you can use:
model_coefs <- function(formula, data) {
coefs <- lm(formula, data)$coefficients
data.frame(coef = names(coefs), value = coefs)
}
mtcars %>%
dplyr::mutate(factor_gear = factor(gear)) %>%
dplyr::nest_by(factor_gear) %>%
dplyr::summarise(model_coefs(mpg ~ cyl, data)) %>%
tidyr::pivot_wider(names_from = coef, values_from = value)
# A tibble: 3 x 3
# Groups: factor_gear [3]
factor_gear `(Intercept)` cyl
<fct> <dbl> <dbl>
1 3 29.8 -1.83
2 4 41.3 -3.59
3 5 40.6 -3.2

Using dplyr() to retrieve model object created via group_by() and do()

I'm trying to use dplyr and the pipe operator (%>%) to retrieve model objects stored in a dataframe.
With example data
library(dplyr)
set.seed(256)
dat <-
data.frame(x = rnorm(100),
y = rnorm(100, 10),
spec = sample(c("1", "2"), 100, TRUE)) %>%
group_by(spec) %>%
do(lm = lm(y ~ x, data = .))
I can subset and retrieve an actual model object
> dat$lm[dat$spec == "1"][[1]]
Call:
lm(formula = y ~ x, data = .)
Coefficients:
(Intercept) x
9.8171 -0.2292
> dat$lm[dat$spec == "1"][[1]] %>% class()
[1] "lm
But I think this is an inelegant way of retrieving the lm() model object contained therein, especially given that the rest of my code is structured the "dplyr way". I'd like to use dplyr but I can't figure out how. For example, using
dat %>% filter(spec == "1") %>% select(lm)
doesn't work as it returns
Source: local data frame [1 x 1]
Groups: <by row>
# A tibble: 1 x 1
lm
<list>
1 <S3: lm>
and
dat %>% filter(spec == "1") %>% .$lm
only gets me to the first object in list, e.g.,
> dat %>% filter(spec == "1") %>% .$lm
[[1]]
Call:
lm(formula = y ~ x, data = .)
Coefficients:
(Intercept) x
10.01495 -0.07438
I can't figure out a way to get to the actual model object in the dat with dplyr. Certainly, I could use broom and tidy() to condense everything
library(broom)
tidy(dat, lm)
but this still doesn't return the actual model object:
> tidy(dat, lm)
# A tibble: 4 x 6
# Groups: spec [2]
spec term estimate std.error statistic p.value
<fct> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 (Intercept) 10.0 0.120 83.3 1.91e-54
2 1 x - 0.0744 0.111 - 0.671 5.05e- 1
3 2 (Intercept) 9.86 0.131 75.0 1.42e-50
4 2 x - 0.0793 0.148 - 0.535 5.95e- 1
I can even use dplyr to summarise() the output from a do() call and retrieve the coefficients from the models, but this still doesn't give me the model object itself:
dat %>%
select(spec) %>%
bind_cols(dat %>%
summarize(lm_i = coefficients(lm)[[1]],
lm_s = coefficients(lm)[[2]]))
Is there a dplyr way to retrieve the actual model object from models created with do()?
do returns a list column, so to extract its individual elements, you need to use list subsetting. There are various ways to do that, but in the tidyverse, purrr::pluck is a nice option to extract a single [possibly deeply nested] element:
library(tidyverse)
dat %>% pluck('lm', 1)
#>
#> Call:
#> lm(formula = y ~ x, data = .)
#>
#> Coefficients:
#> (Intercept) x
#> 10.01495 -0.07438
It's mostly equivalent to [[ subsetting, i.e.
dat[['lm']][[1]]
To get what you have to work, you need to keep subsetting, as .$lm returns the list column, which in this case is a list of a model. .[[1]] (akin to the 1 above) extracts the model from the list:
dat %>% filter(spec == "1") %>% .$lm %>% .[[1]]
or a hybrid approach, if you like:
dat %>% filter(spec == "1") %>% pluck('lm', 1)
or use pull to extract the column with NSE semantics:
dat %>% filter(spec == "1") %>% pull(lm) %>% pluck(1)
All return the same thing.

Managing multiple models and run times in R

I'm building dozens of predictive models in an effort to identify a champion model. I'm working with gigabytes of data, so tracking run time is important.
I'd like to build all my models in a list-type format, so I don't have to manage all the different model names within the Global Environment. However, it seems that the only way to get timings per model is to have separate named objects.
Here's a basic method that approaches what I'm looking for:
library(tidyverse)
# Basic Approach
Time_1 <- system.time(
Model_1 <- lm(am ~ disp, mtcars)
)
Time_2 <- system.time(
Model_2 <- lm(am ~ disp + cyl, mtcars)
)
# etc. for dozens more
Time_List <-
mget(ls(pattern = "Time")) %>%
bind_rows()
However, as you can see, I have to manually name each model and time record. What I'm looking for is something similar to the table produced with the following code, where "xxx" is an actual record of run time.
# Tribble Output
tribble(
~Model_Name, ~Model_Function, ~Run_Time,
"Model_1", lm(am ~ disp, mtcars), "xxx",
"Model_2", lm(am ~ disp + cyl, mtcars), "xxx"
)
# A tibble: 2 × 3
Model_Name Model_Function Run_Time
<chr> <list> <chr>
1 Model_1 <S3: lm> xxx
2 Model_2 <S3: lm> xxx
I'd appreciate any input provided, regardless of packages used.
If you assign within system.time, you can save both the time and what's computed. If you assign the results to a list column, you can unpack it:
library(tidyverse)
data_frame(formula = c(mpg ~ wt, mpg ~ wt + hp)) %>%
mutate(model_time = map(formula, ~{
time <- system.time(model <- lm(.x, mtcars));
lst(model, time)
}),
model = map(model_time, 'model'),
time = map(model_time, 'time')) %>%
select(-model_time)
#> # A tibble: 2 × 3
#> formula model time
#> <list> <list> <list>
#> 1 <S3: formula> <S3: lm> <S3: proc_time>
#> 2 <S3: formula> <S3: lm> <S3: proc_time>
Because the columns are all still lists it doesn't look like much, but all of the data is now there and can be further extracted.
An equivalent alternative:
data_frame(formula = c(mpg ~ wt, mpg ~ wt + hp)) %>%
mutate(model_time = map(formula, ~{
time <- system.time(model <- lm(.x, mtcars));
data_frame(model = list(model),
time = list(time))
})) %>%
unnest(model_time)

Resources