How can I make a constrained linear model by group in df? - r

I need to fit a constrained model by group in R. I used the group_by and do() functions to estimate the unconstrained lm, but when I try the same for a constrained model with ConsReg it doesn't work.
This worked for the unconstrained lm:
df_grouped <- df %>%
group_by(type, Region)
grouped_lm <- df_grouped %>%
do(tidy(lm(y ~ x, data =.)))
For the constrained model I tried this:
grouped_lm_constrained <- df_grouped %>%
do(ConsReg(formula = y ~ x, family = 'gaussian', optimizer = 'mcmc', LOWER = 0, UPPER = 1, data =.))
but it gives me this error:
"Error in `do()`:
! Results 1, 2, 3, 4, 5, ... must be data frames, not ConsReg."
Does anyone know what's happening?

The error comes from do(), which needs each group's result to be a data frame. In your lm call broom::tidy did that conversion, but it has no implementation for ConsReg models/objects. What you can do is write a custom function that extracts the desired content from a ConsReg object. To see what the model object contains, you can, e.g., fit just one model (one group) and call str(model) as well as str(summary(model)) on it to see what base R can do for you in terms of structuring the data. In the example below I extract a selection of what could be important model content. You might have to adapt this to your use case and needs.
I really like the approach of nested lists in tibbles and running models on those. Alternatively, you can keep the do() approach, or split the data.frame into a list where each item is a group and apply mapping functions to it.
library(ConsReg)
library(dplyr)
library(purrr)
library(tidyr)
# Dummy data
df <- data.frame(g = rep(c("A", "B"), each = 10),
                 x = rep(1:10, 2),
                 y = c(1:10, seq(from = 1, to = 100, by = 10)))
# custom function which takes a model as input and returns the formula, the coefficient table and the MAPE as a data.frame
myfun <- function(x){
  cbind(formula = x$formula %>% deparse,
        as.data.frame(summary(x)$coefficients) %>% tibble::rownames_to_column() %>% dplyr::rename(Term = 1),
        MAPE = x$metrics$MAPE)
}
# group the df for nesting in the next step
dplyr::group_by(df, g) %>%
  # nest the columns of interest into a list column where each item (aka group) contains the mentioned variables
  tidyr::nest(data = c("x", "y")) %>%
  # run map functions on data to fit the model and apply the custom extraction function
  dplyr::mutate(crmod = purrr::map(data, ~ ConsReg(y ~ x, family = 'gaussian', optimizer = 'mcmc', LOWER = 0, UPPER = 1, data = .x)),
                stats = purrr::map(crmod, ~ myfun(.x))) %>%
  # unnest the stats column from list items to data frame rows
  tidyr::unnest(stats)
# Groups: g [2]
g data crmod formula Term Estimate StdErr t.value p.value MAPE
<chr> <list> <list> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A <tibble [10 x 2]> <ConsReg> y ~ x (Intercept) 1.12e-15 7.79e-16 1.44 1.87e- 1 1.87e-16
2 A <tibble [10 x 2]> <ConsReg> y ~ x x 1 e+ 0 1.03e- 1 9.67 1.09e- 5 1.87e-16
3 B <tibble [10 x 2]> <ConsReg> y ~ x (Intercept) 9.84e- 1 3.03e- 2 32.5 8.79e-10 8.58e- 1
4 B <tibble [10 x 2]> <ConsReg> y ~ x x 9.98e- 1 7.83e- 3 128. 1.59e-14 8.58e- 1
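The split-and-map route mentioned above can be sketched in base R. This is only a minimal sketch under stated assumptions, not the ConsReg workflow itself: lm() stands in for ConsReg() so the code runs without extra packages, the noisy data are made up, and extract_coefs is a hypothetical helper playing the role of myfun.

```r
set.seed(42)
# Made-up data: two groups with different slopes plus noise
df <- data.frame(g = rep(c("A", "B"), each = 10),
                 x = rep(1:10, 2))
df$y <- ifelse(df$g == "A", 1, 10) * df$x + rnorm(20)

# Hypothetical helper: one fitted model in, one data frame of coefficients out
extract_coefs <- function(model) {
  coefs <- as.data.frame(summary(model)$coefficients)
  terms <- rownames(coefs)
  rownames(coefs) <- NULL
  data.frame(Term = terms, coefs)
}

# One list item per group, one model per item (lm() is a stand-in for ConsReg())
fits <- lapply(split(df, df$g), function(d) lm(y ~ x, data = d))

# Extract per model, add the group label, and stack into a single data frame
result <- do.call(rbind, lapply(names(fits), function(g) {
  cbind(g = g, extract_coefs(fits[[g]]))
}))
result
```

Because the extractor returns a data frame, the same function could in principle also be used inside do(), which is what the original error message asks for.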

Related

Running linear models for groups within dataframe and storing outputs in dataframe in R

I am trying to run multiple linear models for a very large dataset and store the outputs in a dataframe. I have managed to get estimates and p-values into dataframe (see below) but I also want to store the AIC for each model.
#example dataframe
dt = data.frame(x = rnorm(40, 5, 5),
y = rnorm(40, 3, 4),
group = rep(c("a","b"), 20))
library(dplyr)
library(broom)
# code that runs lm for each group in row z and stores output
dt_lm <- dt %>%
group_by(group) %>%
do(tidy(lm(y~x, data=.)))
Use glance instead of tidy:
dt_lm <- dt %>%
group_by(group) %>%
do(glance(lm(y~x, data=.))) %>%
select(AIC)
which gives:
Adding missing grouping variables: `group`
# A tibble: 2 x 2
# Groups: group [2]
group AIC
<chr> <dbl>
1 a 119.
2 b 114.
If you want to store other metrics besides the AIC, just skip the select part.
In newer versions of dplyr (>= 1.0), we can also use nest_by:
library(dplyr)
library(tidyr)
library(broom)
dt %>%
nest_by(group) %>%
transmute(out = list(glance(lm(y ~ x, data = data)))) %>%
unnest(c(out)) %>%
select(AIC)
# A tibble: 2 x 2
# Groups: group [2]
# group AIC
# <chr> <dbl>
#1 a 115.
#2 b 100.
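As a quick cross-check that does not depend on broom, the per-group AIC can also be computed with base R alone. The sketch below is an illustration rather than part of the answers above; the seed and the names aic_by_group and aic_df are assumptions, so the numbers will not match the outputs shown.

```r
set.seed(1)
dt <- data.frame(x = rnorm(40, 5, 5),
                 y = rnorm(40, 3, 4),
                 group = rep(c("a", "b"), 20))

# Fit one lm per group and collect the AIC of each fit in a named vector
aic_by_group <- sapply(split(dt, dt$group), function(d) AIC(lm(y ~ x, data = d)))

# Reshape into a two-column data frame, similar to the tibbles above
aic_df <- data.frame(group = names(aic_by_group), AIC = unname(aic_by_group))
aic_df
```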

Running multiple Cox-PH models with tidyr

I have a regular Surv object from the survival package;
s <- Surv(sample(100:150, 5), sample(c(T, F), 5, replace = T))
And a matrix of multiple variables;
df <- data.frame(var1 = rnorm(5),
var2 = rnorm(5),
var3 = rnorm(5))
I need to fit a Cox-PH model for each variable separately. My code currently uses a loop as follows:
for (v in colnames(df)) {
coxph(s ~ df[[v]])
}
Of course, in reality there are thousands of variables and this process takes a bit. I wanted to follow the answer given here to try and do it all with tidyr but I'm kinda stumped because the predictand isn't a factor, it's a survival object, so I don't quite know how to handle it as part of a tibble.
Assuming your response is s for the survival model, you can use a nested dataframe similar to the answer you link to, then map the model to the different variables:
library(survival)
library(tidyverse)
df_nested <- df %>% pivot_longer(cols = var1:var3) %>% group_by(name) %>% nest()
surv_model <- function(df) {
  coxph(s ~ df$value)
}
df_nested <- df_nested %>% mutate(model = map(data, surv_model))
df_nested
# A tibble: 3 x 3
# Groups: name [3]
name data model
<chr> <list> <list>
1 var1 <tibble [5 x 1]> <coxph>
2 var2 <tibble [5 x 1]> <coxph>
3 var3 <tibble [5 x 1]> <coxph>
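If the nesting step is not needed, a shorter route is to treat the data frame as a list of columns. This is a minimal sketch under assumptions: the sample size n = 50 is made up (larger than in the question so that the fits are stable), and the names fits and log_hr are illustrative only.

```r
library(survival)

set.seed(1)
n <- 50  # made-up sample size
s <- Surv(sample(100:150, n, replace = TRUE),
          sample(c(TRUE, FALSE), n, replace = TRUE))
df <- data.frame(var1 = rnorm(n), var2 = rnorm(n), var3 = rnorm(n))

# A data frame is a list of columns, so lapply() fits one model per variable
fits <- lapply(df, function(v) coxph(s ~ v))

# Pull one number per model, here the estimated log hazard ratio
log_hr <- vapply(fits, function(m) unname(coef(m)), numeric(1))
log_hr
```

Unlike the for loop in the question, this keeps every fitted model, named by variable, so results can be extracted afterwards.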

R Use map2 to iterate over columns within a list of data frames to fit statistical models

I'm trying to figure out a purrr approach to iteratively map over columns within a list of data frames to fit univariate GLMs. Using map2, the first element, .x, would be the three pred columns, and the second element, .y, would be the list of data frames (or vice-versa). map2 seems to be able to do this, but I recognize that I need to cross the .x and .y elements first, so I use tidyr::crossing first to do this. From here, I am unsure how to properly reference the columns to select within the data frames. Example code is below:
#Sample data
set.seed(100)
test_df <- tibble(pred1 = sample(40:80, size = 1000, replace = TRUE),
pred2 = sample(40:80, size = 1000, replace = TRUE),
pred3 = sample(40:80, size = 1000, replace = TRUE),
resp = sample(100:200, size = 1000, replace = TRUE),
group = sample(c('a','b','c'), size = 1000, replace = TRUE))
#Split into list
test_ls <- test_df %>%
group_by(group) %>%
{df_groups <<- .} %>%
group_split()
#Obtain keys and name list elements
group_keys <- df_groups %>%
group_keys() %>%
pull()
test_ls <- test_ls %>% setNames(nm = group_keys)
#Cross all combinations of pred columns and list element names
preds <- c('pred1','pred2','pred3')
map_keys <- crossing(preds, group_keys)
#.y = list of data frames; iterate over data frames
#.x = three pred columns; iterate over columns
#Use purrr to fit glm of each .x columns within each of .y dfs
#Example structure - does not work
map2(.x, .y, .f = ~glm(resp ~ .x, data = .y))
#Workaround that does work
lapply(test_ls, function(x) {
x %>%
select(pred1, pred2, pred3) %>%
map(.f = ~glm(resp ~ .x, data = x))
})
There's something I'm missing, and I can't seem to figure it out. I've gotten a variety of errors with a few approaches, but I think it's coming down to not properly referencing the .x columns within the .y data frames. My approaches don't seem to recognize that .x is a column within .y. The workaround does the trick, but I'd prefer to avoid using both lapply and map.
My suggestion would be to NOT split the data before fitting models, since you are considering all possible combinations of variables that are already available directly in your original dataset. Instead, consider converting the original data frame to the "long" format, and then grouping by the necessary variables:
test_df %>% gather( pred, value, pred1:pred3 ) %>%
nest( -c(group, pred) ) %>%
mutate( models = map(data, ~glm(resp ~ value, data=.x)) )
# # A tibble: 9 x 4
# group pred data models
# <chr> <chr> <list> <list>
# 1 b pred1 <tibble [340 x 2]> <glm>
# 2 a pred1 <tibble [317 x 2]> <glm>
# 3 c pred1 <tibble [343 x 2]> <glm>
# 4 b pred2 <tibble [340 x 2]> <glm>
# 5 a pred2 <tibble [317 x 2]> <glm>
# 6 c pred2 <tibble [343 x 2]> <glm>
# 7 b pred3 <tibble [340 x 2]> <glm>
# 8 a pred3 <tibble [317 x 2]> <glm>
# 9 c pred3 <tibble [343 x 2]> <glm>
This substantially simplifies your code, and you can now split the result, if you still need those models in a list.
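In tidyr 1.0.0 and later, gather() is superseded by pivot_longer() and nest() expects the nested columns as a named argument. Assuming a recent tidyverse, the same idea could be sketched as follows; models_df is an illustrative name.

```r
library(dplyr)
library(tidyr)
library(purrr)

set.seed(100)
test_df <- tibble(pred1 = sample(40:80, size = 1000, replace = TRUE),
                  pred2 = sample(40:80, size = 1000, replace = TRUE),
                  pred3 = sample(40:80, size = 1000, replace = TRUE),
                  resp = sample(100:200, size = 1000, replace = TRUE),
                  group = sample(c('a', 'b', 'c'), size = 1000, replace = TRUE))

models_df <- test_df %>%
  # long format: one row per observation and predictor
  pivot_longer(cols = pred1:pred3, names_to = "pred", values_to = "value") %>%
  # nest the modelling columns; group and pred remain as keys
  nest(data = c(resp, value)) %>%
  # one glm per group/predictor combination
  mutate(models = map(data, ~ glm(resp ~ value, data = .x)))
models_df
```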

Using dplyr() to retrieve model object created via group_by() and do()

I'm trying to use dplyr and the pipe operator (%>%) to retrieve model objects stored in a dataframe.
With example data
library(dplyr)
set.seed(256)
dat <-
data.frame(x = rnorm(100),
y = rnorm(100, 10),
spec = sample(c("1", "2"), 100, TRUE)) %>%
group_by(spec) %>%
do(lm = lm(y ~ x, data = .))
I can subset and retrieve an actual model object
> dat$lm[dat$spec == "1"][[1]]
Call:
lm(formula = y ~ x, data = .)
Coefficients:
(Intercept) x
9.8171 -0.2292
> dat$lm[dat$spec == "1"][[1]] %>% class()
[1] "lm"
But I think this is an inelegant way of retrieving the lm() model object contained therein, especially given that the rest of my code is structured the "dplyr way". I'd like to use dplyr but I can't figure out how. For example, using
dat %>% filter(spec == "1") %>% select(lm)
doesn't work as it returns
Source: local data frame [1 x 1]
Groups: <by row>
# A tibble: 1 x 1
lm
<list>
1 <S3: lm>
and
dat %>% filter(spec == "1") %>% .$lm
only gets me to the first object in list, e.g.,
> dat %>% filter(spec == "1") %>% .$lm
[[1]]
Call:
lm(formula = y ~ x, data = .)
Coefficients:
(Intercept) x
10.01495 -0.07438
I can't figure out a way to get to the actual model object in the dat with dplyr. Certainly, I could use broom and tidy() to condense everything
library(broom)
tidy(dat, lm)
but this still doesn't return the actual model object:
> tidy(dat, lm)
# A tibble: 4 x 6
# Groups: spec [2]
spec term estimate std.error statistic p.value
<fct> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 (Intercept) 10.0 0.120 83.3 1.91e-54
2 1 x -0.0744 0.111 -0.671 5.05e- 1
3 2 (Intercept) 9.86 0.131 75.0 1.42e-50
4 2 x -0.0793 0.148 -0.535 5.95e- 1
I can even use dplyr to summarise() the output from a do() call and retrieve the coefficients from the models, but this still doesn't give me the model object itself:
dat %>%
select(spec) %>%
bind_cols(dat %>%
summarize(lm_i = coefficients(lm)[[1]],
lm_s = coefficients(lm)[[2]]))
Is there a dplyr way to retrieve the actual model object from models created with do()?
do returns a list column, so to extract its individual elements, you need to use list subsetting. There are various ways to do that, but in the tidyverse, purrr::pluck is a nice option to extract a single [possibly deeply nested] element:
library(tidyverse)
dat %>% pluck('lm', 1)
#>
#> Call:
#> lm(formula = y ~ x, data = .)
#>
#> Coefficients:
#> (Intercept) x
#> 10.01495 -0.07438
It's mostly equivalent to [[ subsetting, i.e.
dat[['lm']][[1]]
To get what you have to work, you need to keep subsetting, as .$lm returns the list column, which in this case is a list of a model. .[[1]] (akin to the 1 above) extracts the model from the list:
dat %>% filter(spec == "1") %>% .$lm %>% .[[1]]
or a hybrid approach, if you like:
dat %>% filter(spec == "1") %>% pluck('lm', 1)
or use pull to extract the column with NSE semantics:
dat %>% filter(spec == "1") %>% pull(lm) %>% pluck(1)
All return the same thing.
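The distinction all of these rely on, single versus double bracket subsetting of a list, can be seen without dplyr. In this small sketch a plain named list (models, a made-up stand-in for the list column dat$lm) holds two fits:

```r
set.seed(256)
d <- data.frame(x = rnorm(100), y = rnorm(100, 10))

# A plain named list of models, standing in for the list column dat$lm
models <- list("1" = lm(y ~ x, data = d[1:50, ]),
               "2" = lm(y ~ x, data = d[51:100, ]))

class(models["1"])    # single bracket keeps the list wrapper: "list"
class(models[["1"]])  # double bracket returns the element itself: "lm"
```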

R - Adding an extrapolated (lm) value to a matrix of observations

I am trying to add a set of extrapolated "observations" to a matrix in R. I know how to do this using normal programming techniques (read: a bunch of nested loops and functions), but I feel this must be possible in a much cleaner way using built-in R functionality.
The code below illustrates the point, and where it breaks down.
Many thanks in advance for your help!
With kind regards
Sylvain
library(dplyr)
library(tidyr)
# The idea is that I have a table of observations for e.g. x=5, 6, 7, 8, 9 and 10. The observations (in this example 2)
# conform fairly decently to sets of 2nd order polynomials.
# Now, I want to add an extrapolated value to this table (e.g. x=4). I know how to do this programmatically
# but I feel there must be a cleaner solution to do this.
#generate dummy data table
x <- 5:10
myData <- tibble(x, a = x^2 * 2 + x * 3 + 4 + rnorm(1,0,0.01), b = x^2 * 3 + x * 4 + 5 + rnorm(1,0,0.01) )
#Gather (put in Data-Key format)
myDataKeyFormat <- gather(myData,key = "someLabel", value = "myObservation", -x)
fitted_models <- myDataKeyFormat %>% group_by(someLabel) %>% do(model = lm(myObservation ~ poly(x,2), data = .))
myExtrapolatedDataPointx <- tibble(x = 4)
#Add the x=4 field
fitted_points <- fitted_models %>% group_by(someLabel) %>% do(predict(.$model,myExtrapolatedDataPointx)) #R really doesnt like this bit
#append the fitted_points to the myDataKeyFormat
myDataKeyFormatWithExtrapolation <- union(myDataKeyFormat,fitted_points)
#use spread to
myDataWithExtrapolation <- myDataKeyFormatWithExtrapolation %>% spread(someLabel,myObservation)
Here is a tidyverse solution that uses purrr to fit the different models. The idea is to nest the data (using tidyr::nest) and then use purrr::map to fit one model per group. I then add new x values and compute the predictions with modelr::add_predictions. This keeps everything in one place, organised by your variable someLabel: training data, models, new data and predictions. I also give you a way to visualise the data.
You can check R for Data Science by Hadley Wickham & Garrett Grolemund, and especially the part about models for more information.
library(dplyr)
library(tibble)
library(tidyr)
library(purrr)
library(modelr)
library(ggplot2)
set.seed(1) # For reproducibility
x <- 5:10
myData <- tibble(x,
a = x^2 * 2 + x * 3 + 4 + rnorm(1,0,0.01),
b = x^2 * 3 + x * 4 + 5 + rnorm(1,0,0.01))
#Gather (put in Data-Key format)
myDataKeyFormat <- gather(myData,key = "someLabel", value = "myObservation", -x)
myModels <- myDataKeyFormat %>%
nest(-someLabel) %>%
mutate(model = map(data, ~lm(myObservation ~ poly(x,2), data = .x)))
Here is the result at this point: you have a model for each value of someLabel.
# A tibble: 2 × 3
someLabel data model
<chr> <list> <list>
1 a <tibble [6 × 2]> <S3: lm>
2 b <tibble [6 × 2]> <S3: lm>
I'll add some data points in a new column (map creates one tibble of new x values for each row of the data frame).
# New data
new_data <- myModels %>%
mutate(new = map(data, ~tibble(x = c(3, 4, 11, 12))))
I add the predictions: add_predictions takes a data frame and a model as arguments, so I use map2 to map over the new data and the models.
fitted_models <- new_data %>%
mutate(new = map2(new, model, ~add_predictions(.x, .y)))
fitted_models
# A tibble: 2 × 4
someLabel data model new
<chr> <list> <list> <list>
1 a <tibble [6 × 2]> <S3: lm> <tibble [4 × 2]>
2 b <tibble [6 × 2]> <S3: lm> <tibble [4 × 2]>
There you go: you have for each label the data and model trained on this data, and the new data with predictions.
In order to plot it, I use unnest to take the data back to the data frame, and I bind the rows to have the "old" data and the new values together.
my_points <- bind_rows(unnest(fitted_models, data),
unnest(fitted_models, new))
ggplot(my_points)+
geom_point(aes(x = x, y = myObservation), color = "black") +
geom_point(aes(x = x, y = pred), color = "red")+
facet_wrap(~someLabel)
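For comparison, the extrapolation step alone can be sketched with base R's predict(). This is an illustration under assumptions, not the approach above: the noise terms are dropped so the result is easy to verify by hand, and the names long, fits and extrapolated are made up.

```r
x <- 5:10
# Noise-free version of the question's data, already in long format
long <- data.frame(x = rep(x, 2),
                   someLabel = rep(c("a", "b"), each = length(x)),
                   myObservation = c(x^2 * 2 + x * 3 + 4,
                                     x^2 * 3 + x * 4 + 5))

# One quadratic fit per label
fits <- lapply(split(long, long$someLabel),
               function(d) lm(myObservation ~ poly(x, 2), data = d))

# Predict at the new x value for each fitted model
new_x <- data.frame(x = 4)
extrapolated <- vapply(fits, function(m) unname(predict(m, newdata = new_x)), numeric(1))
extrapolated  # the underlying quadratics give 48 for a and 69 for b at x = 4
```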
