Writing a function to enclose another function with regression models [duplicate] - r

This question already has answers here:
Novice needs to loop lm in R
(1 answer)
R Loop for Variable Names to run linear regression model
(2 answers)
Closed 4 years ago.
Goal: Run three regression models with three different outcome variables, as seen below, but ideally in a more efficient way than seen in the model1, model2, model3 version seen in the last three lines.
Specific question: How can I write a function that iterates over the set of dv's and creates model + # indicator as an object (e.g. model1, model2, etc.) AND switches the dv (e.g. dv1, dv2, etc...)? I assume there is a forloop and function solution to this but I am not getting it...
mydf <- data.frame(dv1 = rnorm(100),
dv2 = rnorm(100),
dv3 = rnorm(100),
iv1 = rnorm(100),
iv2 = rnorm(100),
iv3 = rnorm(100))
mymodel <- function(dv, df) {
lm(dv ~ iv1 + iv2 + iv3, data = df)
}
model1 <- mymodel(dv = mydf$dv1, df = mydf)
model2 <- mymodel(dv = mydf$dv2, df = mydf)
model3 <- mymodel(dv = mydf$dv3, df = mydf)

Here's another approach using the tidyverse packages, since dplyr has more or less supplanted plyr.
library(tidyverse)
mydf <- data.frame(dv1 = rnorm(100),
dv2 = rnorm(100),
dv3 = rnorm(100),
iv1 = rnorm(100),
iv2 = rnorm(100),
iv3 = rnorm(100))
mymodel <- function(df) {
lm(value ~ iv1 + iv2 + iv3, data = df)
}
mydf %>%
gather("variable","value", contains("dv")) %>%
split(.$variable) %>%
map(mymodel)
#> $dv1
#>
#> Call:
#> lm(formula = value ~ iv1 + iv2 + iv3, data = df)
#>
#> Coefficients:
#> (Intercept) iv1 iv2 iv3
#> -0.04516 -0.04657 0.08045 0.02518
#>
#>
#> $dv2
#>
#> Call:
#> lm(formula = value ~ iv1 + iv2 + iv3, data = df)
#>
#> Coefficients:
#> (Intercept) iv1 iv2 iv3
#> -0.03906 0.16730 0.10324 0.02500
#>
#>
#> $dv3
#>
#> Call:
#> lm(formula = value ~ iv1 + iv2 + iv3, data = df)
#>
#> Coefficients:
#> (Intercept) iv1 iv2 iv3
#> 0.018492 -0.162563 0.002738 0.179366
Created on 2018-11-26 by the reprex package (v0.2.1)

You could convert your data.frame to long form, with all the dv values in one column and then use plyr's dlply to create the lms. This splits the data.frame on the specified column ("dvN") and applys the function to each and returns a list of lms. I have changed the function slightly to make it just take a data.frame, not the column separately.
Hope this gives what you need.
library(plyr)
library(tidyr)
mydf_l <- gather(mydf, dvN, Value, 1:3)
mymodel2 <- function(df) {
lm(Value ~ iv1 + iv2 + iv3, data = df)
}
allmodels <- dlply(mydf_l, .(dvN), mymodel2)

Related

How to use the dplyr package to do group-separated linear regressions in R?

I have a dataset of x and y separated by categories (a and b). I want to do 2 linear regressions, one for category a data and one for category b data. For this purpose, I used the dplyr package following this answer. I'm a little confused because my code is simpler, but I'm not able to do the regressions. Any tips?
library(dplyr)
Factor <- c("a", "b")
x <- seq(0,3,1)
df <- expand.grid(x = x, Factor = Factor)
df$y <- rnorm(8)
df %>%
group_by(Factor) %>%
do(lm(formula = y ~ x,
data = .))
Error: Results 1, 2 must be data frames, not lm
This creates a list column whose components are lm objects
df2 <- df %>%
group_by(Factor) %>%
summarize(lm = list(lm(formula = y ~ x, data = cur_data())), .groups = "drop")
giving:
> df2
# A tibble: 2 x 2
Factor lm
<fct> <list>
1 a <lm>
2 b <lm>
> with(df2, setNames(lm, Factor))
$a
Call:
lm(formula = y ~ x, data = cur_data())
Coefficients:
(Intercept) x
-0.3906 0.2947
$b
Call:
lm(formula = y ~ x, data = cur_data())
Coefficients:
(Intercept) x
0.2684 -0.3403
Here is my approach:
df %>%
split(~ Factor) %>%
purrr::map(\(x) lm(formula = y ~ x, data = x))

Multiplying results regression R

I just did a regression in R. I would like to multiply the results of each coefficient with some variables. How can I do it?
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.210e+00 7.715e-01 1.568 0.13108
SDCHRO_I -1.846e-01 2.112e-01 -0.874 0.39157
functional_cognitive_level3 4.941e-02 7.599e-02 0.650 0.52224
rev_per_members -4.955e-06 5.827e-06 -0.850 0.40432
And I want something like this:
1.210e+00 + -1.846e-01 * var1 + 4.941e-02 * var2 + 4.941e-02 * var3
Is there a way to do it?
You can use model.matrix.
data(mtcars)
lm1 <- lm(mpg~cyl+disp+hp, data=mtcars)
res <- coef(lm1) %*% t(model.matrix(lm1))
all(res==predict(lm1))
#[1] TRUE
You can access the coefficients with model$coefficients.
For example, if you want to multiply all coefficients with 10, you can do
df = data.frame(x = runif(100), y = runif(100), z = runif(100))
mod = lm(formula = y ~ x*z, data = df)
mod$coefficients
#> (Intercept) x z x:z
#> 0.6449097 -0.1989884 -0.3962655 0.4621273
mod$coefficients*10
#> (Intercept) x z x:z
#> 6.449097 -1.989884 -3.962655 4.621273
Created on 2020-07-10 by the reprex package (v0.3.0)
However, if you want to do like in your example, you need to access the invididual coefficients with model$coefficients[i], e.g.
df = data.frame(x = runif(100), y = runif(100), z = runif(100))
mod = lm(formula = y ~ x*z, data = df)
mod$coefficients[1]*10
#> (Intercept)
#> 5.994662
mod$coefficients[2]*10
#> x
#> -1.687928
Created on 2020-07-10 by the reprex package (v0.3.0)
You can even do this dynamically by looping over the length of the coefficients object.

How to use a variable in lm() function in R?

Let us say I have a dataframe (df) with two columns called "height" and "weight".
Let's say I define:
x = "height"
How do I use x within my lm() function? Neither df[x] nor just using x works.
Two ways :
Create a formula with paste
x = "height"
lm(paste0(x, '~', 'weight'), df)
Or use reformulate
lm(reformulate("weight", x), df)
Using reproducible example with mtcars dataset :
x = "Cyl"
lm(paste0(x, '~', 'mpg'), data = mtcars)
#Call:
#lm(formula = paste0(x, "~", "mpg"), data = mtcars)
#Coefficients:
#(Intercept) mpg
# 11.2607 -0.2525
and same with
lm(reformulate("mpg", x), mtcars)
We can use glue to create the formula
x <- "height"
lm(glue::glue('{x} ~ weight'), data = df)
Using a reproducible example with mtcars
x <- 'cyl'
lm(glue::glue('{x} ~ mpg'), data = mtcars)
#Call:
#lm(formula = glue::glue("{x} ~ mpg"), data = mtcars)
#Coefficients:
#(Intercept) mpg
# 11.2607 -0.2525
When you run x = "height" your are assigning a string of characters to the variable x.
Consider this data frame:
df <- data.frame(
height = c(176, 188, 165),
weight = c(75, 80, 66)
)
If you want a regression using height and weight you can either do this:
lm(height ~ weight, data = df)
# Call:
# lm(formula = height ~ weight, data = df)
#
# Coefficients:
# (Intercept) weight
# 59.003 1.593
or this:
lm(df$height ~ df$weight)
# Call:
# lm(formula = df$height ~ df$weight)
#
# Coefficients:
# (Intercept) df$weight
# 59.003 1.593
If you really want to use x instead of height, you must have a variable called x (in your df or in your environment). You can do that by creating a new variable:
x <- df$height
y <- df$weight
lm(x ~ y)
# Call:
# lm(formula = x ~ y)
#
# Coefficients:
# (Intercept) y
# 59.003 1.593
Or by changing the names of existing variables:
names(df) <- c("x", "y")
lm(x ~ y, data = df)
# Call:
# lm(formula = x ~ y, data = df)
#
# Coefficients:
# (Intercept) y
# 59.003 1.593

Using dataframe name as a column in a model table

I'm confused as to why the following doesn't work. I'm trying to use the name of a data frame/tibble as a column in a multiple models data frame, but keep running up against the following error. Here's an example:
library(tidyverse)
library(rlang)
set.seed(666)
df1 <- tibble(
x = 1:10 + rnorm(10),
y = seq(20, 38, by=2) + rnorm(10),
z = 2*x + 3*y
)
df2 <- tibble(
x = 1:10 + rnorm(10),
y = seq(20, 38, by=2) + rnorm(10),
z = 4*x + 5*y
)
results <- tibble(dataset = c('df1','df2'))
Notice that the following all work:
lm(z ~ x + y, data=df1)
lm(z ~ x + y, data=df2)
lm(z ~ x + y, data=eval(sym('df1')))
But when I try the following:
results <- results %>% mutate(model = lm(z ~ x + y, data = eval(sym(dataset))))
I get the error
Error in mutate_impl(.data, dots) :
Evaluation error: Only strings can be converted to symbols.
Can someone figure out how to make this work?
We can use the map function and specify the lm function as the following.
library(tidyverse)
library(rlang)
results2 <- results %>%
mutate(model = map(dataset, ~lm(z ~ x + y, data = eval(sym(.)))))
results2
# # A tibble: 2 x 2
# dataset model
# <chr> <list>
# 1 df1 <S3: lm>
# 2 df2 <S3: lm>
results2$model[[1]]
# Call:
# lm(formula = z ~ x + y, data = eval(sym(.)))
#
# Coefficients:
# (Intercept) x y
# 6.741e-14 2.000e+00 3.000e+00
results2$model[[2]]
# Call:
# lm(formula = z ~ x + y, data = eval(sym(.)))
#
# Coefficients:
# (Intercept) x y
# 9.662e-14 4.000e+00 5.000e+00
I'd recommend a slightly different route where you bind all the data and skip the eval and sym calls. This follows the "Many Models" chapter of R for Data Science.
purrr::lst creates a list of the data frames with the names of those variables as the list's names, and the .id argument to bind_rows uses those names to create a column marking data as coming from df1 or df2. Nesting creates a column data which is a list-column of data frames. Then you can build the models of each dataset. I used the tilde shortcut notation to build the anonymous function.
The result: you have a column model that is a list of models.
library(tidyverse)
library(rlang)
results <- lst(df1, df2) %>%
bind_rows(.id = "dataset") %>%
group_by(dataset) %>%
nest() %>%
mutate(model = map(data, ~lm(z ~ x + y, data = .)))
results$model[[1]]
#>
#> Call:
#> lm(formula = z ~ x + y, data = .)
#>
#> Coefficients:
#> (Intercept) x y
#> 6.741e-14 2.000e+00 3.000e+00
You also still have a column of that nested data. If you don't want it, you can drop it:
select(results, -data)
#> # A tibble: 2 x 2
#> dataset model
#> <chr> <list>
#> 1 df1 <lm>
#> 2 df2 <lm>

Using lapply on a list of models

I have generated a list of models, and would like to create a summary table.
As and example, here are two models:
x <- seq(1:10)
y <- sin(x)^2
model1 <- lm(y ~ x)
model2 <- lm(y ~ x + I(x^2) + I(x^3))
and two formulas, the first generating the equation from components of formula
get.model.equation <- function(x) {
x <- as.character((x$call)$formula)
x <- paste(x[2],x[1],x[3])
}
and the second generating the name of model as a string
get.model.name <- function(x) {
x <- deparse(substitute(x))
}
With these, I create a summary table
model.list <- list(model1, model2)
AIC.data <- lapply(X = model.list, FUN = AIC)
AIC.data <- as.numeric(AIC.data)
model.models <- lapply(X = model.list, FUN = get.model)
model.summary <- cbind(model.models, AIC.data)
model.summary <- as.data.frame(model.summary)
names(model.summary) <- c("Model", "AIC")
model.summary$AIC <- unlist(model.summary$AIC)
rm(AIC.data)
model.summary[order(model.summary$AIC),]
Which all works fine.
I'd like to add the model name to the table using get.model.name
x <- get.model.name(model1)
Which gives me "model1" as I want.
So now I apply the function to the list of models
model.names <- lapply(X = model.list, FUN = get.model.name)
but now instead of model1 I get X[[1L]]
How do I get model1 rather than X[[1L]]?
I'm after a table that looks like this:
Model Formula AIC
model1 y ~ x 11.89136
model2 y ~ x + I(x^2) + I(x^3) 15.03888
Do you want something like this?
model.list <- list(model1 = lm(y ~ x),
model2 = lm(y ~ x + I(x^2) + I(x^3)))
sapply(X = model.list, FUN = AIC)
I'd do something like this:
model.list <- list(model1 = lm(y ~ x),
model2 = lm(y ~ x + I(x^2) + I(x^3)))
# changed Reduce('rbind', ...) to do.call(rbind, ...) (Hadley's comment)
do.call(rbind,
lapply(names(model.list), function(x)
data.frame(model = x,
formula = get.model.equation(model.list[[x]]),
AIC = AIC(model.list[[x]])
)
)
)
# model formula AIC
# 1 model1 y ~ x 11.89136
# 2 model2 y ~ x + I(x^2) + I(x^3) 15.03888
Another option, with ldply, but see hadley's comment below for a more efficient use of ldply:
# prepare data
x <- seq(1:10)
y <- sin(x)^2
dat <- data.frame(x,y)
# create list of named models obviously these are not suited to the data here, just to make the workflow work...
models <- list(model1=lm(y~x, data = dat),
model2=lm(y~I(1/x), data=dat),
model3=lm(y ~ log(x), data = dat),
model4=nls(y ~ I(1/x*a) + b*x, data = dat, start = list(a = 1, b = 1)),
model5=nls(y ~ (a + b*log(x)), data=dat, start = setNames(coef(lm(y ~ log(x), data=dat)), c("a", "b"))),
model6=nls(y ~ I(exp(1)^(a + b * x)), data=dat, start = list(a=0,b=0)),
model7=nls(y ~ I(1/x*a)+b, data=dat, start = list(a=1,b=1))
)
library(plyr)
library(AICcmodavg) # for small sample sizes
# build table with model names, function, AIC and AICc
data.frame(cbind(ldply(models, function(x) cbind(AICc = AICc(x), AIC = AIC(x))),
model = sapply(1:length(models), function(x) deparse(formula(models[[x]])))
))
.id AICc AIC model
1 model1 15.89136 11.89136 y ~ x
2 model2 15.78480 11.78480 y ~ I(1/x)
3 model3 15.80406 11.80406 y ~ log(x)
4 model4 16.62157 12.62157 y ~ I(1/x * a) + b * x
5 model5 15.80406 11.80406 y ~ (a + b * log(x))
6 model6 15.88937 11.88937 y ~ I(exp(1)^(a + b * x))
7 model7 15.78480 11.78480 y ~ I(1/x * a) + b
It's not immediately obvious to me how to replace the .id with a column name in the ldply function, any tips?

Resources