This is an example of what I'm trying to do.
Step 1:
Create a list of formulas, each pairing a dependent variable with an independent variable
a <- list(paste("Sepal.Length ~ Sepal.Width" ) ,
paste("Sepal.Width ~ Sepal.Length" )
)
Step 2:
Use the lapply function to run glm for each element of the list from step 1, and wrap it in a for loop to test two different family parameters
param <- c("gaussian" , "Gamma" )
for(i in 1:2) {
print(lapply(a , FUN = function(X) glm(X , data = iris ,family = param[i] )))}
Is there a better way to achieve this without using for loop in the second step? This is what I have tried but it's not working.
a <-
list(
paste("Sepal.Length ~ Sepal.Width , data = iris , family = "Gaussian" " ) ,
paste("Sepal.Length ~ Sepal.Width , data = iris , family = "Gamma" " ) ,
paste("Sepal.Width ~ Sepal.Length , data = iris , family = "Gaussian" " ) ,
paste("Sepal.Width ~ Sepal.Length , data = iris , family = "Gamma" " )
)
lapply(a , FUN = function(X) glm(X))
Your paste does nothing here. Leave it out. Furthermore, the use of strings is also unnecessary here. Leave them out. Same goes for your parameter families: these are functions, no need to quote them.
This already vastly simplifies the code, both in length and conceptually. Now we have this:
models = list(Sepal.Length ~ Sepal.Width, Sepal.Width ~ Sepal.Length)
families = c(gaussian, Gamma)
And we can apply it:
lapply(models,
function (model) lapply(families,
function (family) glm(model, family, iris)))
… which is a nested application. The indentation hints at what belongs together. Since this is a bit odd, we can also use the cartesian product of the different parameters:
params = as.data.frame(t(expand.grid(models, families)))
lapply(params, function (p) glm(formula = p[[1]], data = iris, family = p[[2]]))
The first line is a bit obscure here. expand.grid allows us to create a data frame of all parameter combinations. Here’s an example:
> expand.grid(1 : 3, c('a', 'b'))
  Var1 Var2
1    1    a
2    2    a
3    3    a
4    1    b
5    2    b
6    3    b
Unfortunately, this data frame is in the wrong orientation to be used by lapply, because that applies over columns. So we transpose it (and convert it to a data.frame again, since t always returns a matrix).
This piece of code is incredibly useful because it makes writing nested loops via lapply much more readable; unfortunately, it is itself quite unreadable, so we stick it into a function:
combine_parameters = function (...)
as.data.frame(t(expand.grid(...)))
This allows us to write elegant, readable code:
models = list(Sepal.Length ~ Sepal.Width, Sepal.Width ~ Sepal.Length)
families = c(gaussian, Gamma)
params = combine_parameters(models, families)
lapply(params, function (p) glm(formula = p[[1]], family = p[[2]], data = iris))
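A variation on the same idea, offered only as a sketch: rather than transposing the grid, build the grid over list indices and walk its rows with Map. The names grid and fits are illustrative and do not come from the answer above.

```r
models <- list(Sepal.Length ~ Sepal.Width, Sepal.Width ~ Sepal.Length)
families <- list(gaussian, Gamma)

# One row per (model, family) combination; the first column varies fastest
grid <- expand.grid(model = seq_along(models), family = seq_along(families))

# Map walks both index columns in parallel, so no transposition is needed
fits <- Map(
  function(m, f) glm(models[[m]], family = families[[f]], data = iris),
  grid$model, grid$family
)
```

Indexing into the original lists keeps the formulas and family functions out of the data frame, which avoids relying on how expand.grid stores non-atomic elements.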
Using lapply:
lapply(c("gaussian", "Gamma"), function(myFamily){
lapply(c("Sepal.Length ~ Sepal.Width" ,
"Sepal.Width ~ Sepal.Length"), function(myFormula){
glm(formula = myFormula, family = myFamily, data = iris)
})
})
EDIT:
As mentioned in @KonradRudolph's answer, we can pass the families as a list of family objects, which also lets us set a link = argument, e.g.:
lapply(list(gaussian(link = "identity"), Gamma), function(myFamily){
lapply(c("Sepal.Length ~ Sepal.Width" ,
"Sepal.Width ~ Sepal.Length"), function(myFormula){
glm(formula = myFormula, family = myFamily, data = iris)
})
})
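One possible refinement, sketched under the assumption that you want to look fits up by name afterwards: name both the family list and the formula vector, since lapply carries names through to its result. The names fits, fam and f are illustrative.

```r
families <- list(gaussian = gaussian(link = "identity"), Gamma = Gamma())
formulas <- c("Sepal.Length ~ Sepal.Width", "Sepal.Width ~ Sepal.Length")

fits <- lapply(families, function(fam) {
  # setNames() labels each formula with its own text, so the inner list is named too
  lapply(setNames(formulas, formulas), function(f) {
    glm(as.formula(f), family = fam, data = iris)
  })
})

# Pull out a single fit by family and formula
gamma_fit <- fits$Gamma[["Sepal.Width ~ Sepal.Length"]]
```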
I have seen an example of list apply (lapply) that works nicely to take a list of data objects,
and return a list of regression output, which we can pass to Stargazer for nicely formatted output.
Using stargazer with a list of lm objects created by lapply-ing over a split data.frame
library(MASS)
library(stargazer)
data(Boston)
by.river <- split(Boston, Boston$chas)
class(by.river)
fit <- lapply(by.river, function(dd)lm(crim ~ indus,data=dd))
stargazer(fit, type = "text")
What I would like to do is, instead of passing a list of datasets to do the same regression on each data set (as above),
pass a list of independent variables to do different regressions on the same data set. In long hand it would look like this:
fit2 <- vector(mode = "list", length = 2)
fit2[[1]] <- lm(nox ~ indus, data = Boston)
fit2[[2]] <- lm(crim ~ indus, data = Boston)
stargazer(fit2, type = "text")
With lapply, I tried this and it doesn't work. Where did I go wrong?
myvarc <- c("nox","crim")
class(myvarc)
myvars <- as.list(myvarc)
class(myvars)
fit <- lapply(myvars, function(dvar)lm(dvar ~ indus,data=Boston))
stargazer(fit, type = "text")
Consider creating dynamic formulas from string:
fit <- lapply(myvars, function(dvar)
lm(as.formula(paste0(dvar, " ~ indus")),data=Boston))
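To see why the original attempt fails, here is a small sketch. It uses the built-in mtcars data instead of Boston so that it runs without extra packages; the variable names mpg, hp and wt are therefore stand-ins, not the ones from the question.

```r
dvar <- "mpg"

# A formula is not evaluated: the left-hand side is the symbol `dvar`,
# not the column whose name is stored in it
literal_vars <- all.vars(dvar ~ wt)

# Pasting the string in and converting it gives the intended formula
built_vars <- all.vars(as.formula(paste0(dvar, " ~ wt")))

# The literal version fails inside lm(); capture the error instead of stopping
failed <- tryCatch(lm(dvar ~ wt, data = mtcars), error = function(e) e)

# The converted version works for every response in the list
fits <- lapply(c("mpg", "hp"), function(dvar)
  lm(as.formula(paste0(dvar, " ~ wt")), data = mtcars))
```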
This should work:
fit <- lapply(myvars, function(dvar) lm(eval(paste0(dvar,' ~ indus')), data = Boston))
You can also use a dplyr & purrr approach: keep everything in a tibble and pull out what you want when you need it. There is no difference in functionality from the lapply methods.
library(dplyr)
library(purrr)
library(MASS)
library(stargazer)
var_tibble <- tibble(vars = c("nox","crim"), data = list(Boston))
analysis <- var_tibble %>%
mutate(models = map2(data, vars, ~lm(as.formula(paste0(.y, " ~ indus")), data = .x))) %>%
mutate(tables = map2(models, vars, ~stargazer(.x, type = "text", dep.var.labels.include = FALSE, column.labels = .y)))
You can also use get():
# make a list of independent variables
list_x <- list("nox","crim")
# create regression function
my_reg <- function(x) { lm(indus ~ get(x), data = Boston) }
# run regression
results <- lapply(list_x, my_reg)
The logic is similar to the content-based recommender,
content  undesirable  desirable  user_1  ...  user_10
1        3.00         2.77       0.11    ...  NA
...
5000     2.50         2.11       NA      ...  0.12
I need to run the model with undesirable and desirable as the independent variables and each user as the dependent variable, so I need to fit the model 10 times and predict each user's NA values.
This is my hard-coded version. How can I do this with a for loop? I have tried several methods I found, but none of them work for me.
The data frame is called 'test'.
Hard-coded version:
#fit model
fit_1 = lm(user_1 ~ undesirable + desirable, data = test)
...
fit_10 = lm(user_10 ~ undesirable + desirable, data = test)
#prediction
u_1_na = test[is.na(test$user_1), c('user_1', 'undesirable', 'desirable')]
result1 = predict(fit_1, newdata = u_1_na)
which(result1 == max(result1))
max(result1)
...
u_10_na = test[is.na(test$user_10), c('user_10', 'undesirable', 'desirable')]
result10 = predict(fit_10, newdata = u_10_na)
which(result10 == max(result10))
max(result10)
#write to csv file
Write each user's maximum predicted value to a CSV file.
This is what I have tried so far (for loop):
mod_summaries <- list()
for(i in 1:10) {
predictors_i <- colnames(data)[1:10]
mod_summaries[[i - 1]] <- summary(
lm(predictors_i ~ ., test[ , c("undesirable", 'desirable')]))
}
An apply method:
mod_summaries_lapply <-
lapply(
colnames(mtcars),
FUN = function(x)
summary(lm(reformulate(".", response = x), data = mtcars))
)
A for loop method to make linear models for each column. The key is the reformulate() function, which creates the formula from strings. In the question, a character vector is placed directly on the left-hand side of the formula, which results in the error invalid term in model formula. The string needs to be converted into a formula first, for example with reformulate() or as.formula(). This example uses the mtcars dataset.
mod_summaries <- list()
for(i in 1:11) {
predictors_i <- colnames(mtcars)[i]
mod_summaries[[i]] <- summary(lm(reformulate(".", response = predictors_i), data=mtcars))
#summary(lm(reformulate(". -1", response = predictors_i), data=mtcars)) # -1 to exclude intercept
#summary(lm(as.formula(paste(predictors_i, "~ .")), data=mtcars)) # a "paste as formula" method
}
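For reference, a short sketch of what reformulate() builds, again on mtcars. The object names f_all, f_two and f_noint are illustrative.

```r
# Response as a string, "." meaning "all other columns"
f_all <- reformulate(".", response = "mpg")

# Several explicit predictors are joined with "+"
f_two <- reformulate(c("wt", "hp"), response = "mpg")

# intercept = FALSE appends "- 1" to drop the intercept
f_noint <- reformulate(c("wt", "hp"), response = "mpg", intercept = FALSE)

fit <- lm(f_two, data = mtcars)
coef_names <- names(coef(fit))
```

Because the result is an ordinary formula object, it can be passed to lm() or any other modelling function that accepts a formula.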
You could use the function as.formula together with the paste function to create your formula. Following is an example
formula_lm <- as.formula(
paste(response_var,
paste(expl_var, collapse = " + "),
sep = " ~ "))
This implies that you have more than one explanatory variable (separated in the paste with +). If you only have one, omit the second paste.
With the created formula, you can use the lm function like this:
lm(formula_lm, data)
Edit: the vector expl_var would in your case include the undesirable and desirable variable.
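Putting this together for the question, here is a minimal sketch. The data frame below is simulated purely for illustration (two users, twenty rows) because the real test data is not available; only the column naming pattern follows the question.

```r
set.seed(42)
# Hypothetical stand-in for the `test` data frame from the question
test <- data.frame(undesirable = runif(20, 1, 5), desirable = runif(20, 1, 5))
test$user_1 <- 0.1 * test$desirable + rnorm(20, sd = 0.01)
test$user_2 <- 0.2 * test$undesirable + rnorm(20, sd = 0.01)
test$user_1[1:5] <- NA
test$user_2[6:10] <- NA

expl_var <- c("undesirable", "desirable")
user_cols <- grep("^user_", names(test), value = TRUE)

best <- lapply(user_cols, function(u) {
  f <- as.formula(paste(u, paste(expl_var, collapse = " + "), sep = " ~ "))
  fit <- lm(f, data = test)                  # rows with an NA response are dropped
  to_predict <- test[is.na(test[[u]]), expl_var]
  pred <- predict(fit, newdata = to_predict) # named by the row names of to_predict
  data.frame(user = u,
             row = names(pred)[which.max(pred)],
             prediction = max(pred),
             stringsAsFactors = FALSE)
})
best <- do.call(rbind, best)
```

The resulting data frame has one row per user and could be passed to write.csv() to produce the CSV file.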
Avoid the loop. Make your data tidy. Something like:
library(tidyverse)
test %>%
select(-content) %>%
pivot_longer(
starts_with("user"),
names_to="user",
values_to="value"
) %>%
group_by(user) %>%
group_map(
function(.x, .y) {
summary(lm(value ~ ., data=.x))
}
)
Untested code since your example is not reproducible.
I have an R dataframe with 9 input variables and 1 output variable. I want to find the accuracy of randomForest using each individual input, and add them to a list. To do this, I need to loop over a list of formulas, as in the code below:
library(randomForest)
library(caret)
formulas = c(target ~ age, target ~ sex, target ~ cp,
target ~ trestbps, target ~ chol, target ~ fbs,
target ~ restecg, target ~ ca, target ~ thal)
test_idx = sample(dim(df)[1], 60)
test_data = df[test_idx, ]
train_data = df[-test_idx, ]
accuracies = rep(NA, 9)
for (i in 1:length(formulas)){
rf_model = randomForest(formulas[i], data=train_data)
prediction = predict(rf_model, newdata=test_data, type="response")
acc = confusionMatrix(test_data$target, prediction)$overall[1]
accuracies[i] = acc
}
I run into an error,
Error in if (n==0) stop("data (x) has 0 rows") : argument is of
length zero calls: ... eval -> eval -> randomForest -> randomForest.default
Execution halted
The error is related to the formulas[i] argument passed to randomForest, when I type the formula name as the argument (for example, rf_model = randomForest(target ~ age, data=train_data), there is no error.
Is there any other way to iterate over randomForest?
Thank you!
As you have not provided any data, I am using the iris dataset. You have to make 2 changes in your code to make it run. First, use list to store the formulas, and second, formulas[[i]] within for loop. You can use the following code
library(randomForest)
library(caret)
df <- iris
formulas = list(Species ~ Sepal.Length, Species ~ Petal.Length, Species ~ Petal.Width,
Species ~ Sepal.Width)
test_idx = sample(dim(df)[1], 60)
test_data = df[test_idx, ]
train_data = df[-test_idx, ]
accuracies = rep(NA, 4)
for (i in 1:length(formulas)){
rf_model = randomForest(formulas[[i]], data=train_data)
prediction = predict(rf_model, newdata=test_data, type="response")
acc = confusionMatrix(test_data$Species, prediction)$overall[1]
accuracies[i] = acc
}
#> 0.7000000 0.9166667 0.9166667 0.5000000
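The difference between [i] and [[i]] can be checked without randomForest. This sketch uses lm on mtcars as a stand-in model function; the point is only what each kind of indexing returns.

```r
formulas <- list(mpg ~ wt, mpg ~ hp)

single_bracket <- formulas[1]    # still a list, of length 1
double_bracket <- formulas[[1]]  # the formula itself

# lapply hands each element to the model function the way [[i]] would
fits <- lapply(formulas, lm, data = mtcars)
```

A model function expecting a formula receives a list when given formulas[i], which is why the loop in the question fails and the [[i]] version works.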
I am trying to use lme function from nlme package inside a lapply loop. This works for lmer function from lme4 package, but produces an error message for lme. How can I loop lme functions similarly to the lmer function in the example below?
library("nlme")
library("lme4")
set.seed(1)
dt <- data.frame(Resp1 = rnorm(100, 50, 23), Resp2 = rnorm(100, 80, 15), Pred = rnorm(100,10,2), group = factor(rep(LETTERS[1:10], each = 10)))
## Syntax:
lmer(Resp1 ~ Pred + (1 |group), data = dt)
lme(Resp1 ~ Pred, random = ~1 | group, data = dt)
## Works for lme4
lapply(c("Resp1", "Resp2"), function(k) {
lmer(substitute(j ~ Pred + (1 | group), list(j = as.name(k))), data = dt)})
## Does not work for nlme
lapply(c("Resp1", "Resp2"), function(k) {
lme(substitute(j ~ Pred, list(j = as.name(k))), random = ~1 | group, data = dt)})
# Error in UseMethod("lme") :
# no applicable method for 'lme' applied to an object of class "call"
PS. I am aware that this solution exists, but I would like to use a method substituting response variable directly in the model function instead of subsetting data using an additional function.
Instead of fiddling around with substitute and eval you also could do the following:
lapply(c("Resp1", "Resp2"), function(r) {
f <- formula(paste(r, "Pred", sep = "~"))
m <- lme(fixed = f, random = ~ 1 | group, data = dt)
m$call$fixed <- f
m})
You could use the same trick if you want to provide different data sets to a modelling function:
makeModel <- function(dat) {
l <- lme(Resp1 ~ Pred, random = ~ 1 | group, data = dat)
l$call$data <- as.symbol(deparse(substitute(dat)))
l
}
I use this snippet quite a bit, when I want to generate a model from within a function and want to update it afterwards.
As @CarlWitthoft suggested, adding eval into the function will solve the issue:
lapply(c("Resp1", "Resp2"), function(k) {
lme(eval(substitute(j ~ Pred, list(j = as.name(k)))), random = ~1 | group, data = dt)})
Also see @thothal's alternative.
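The reason eval helps can be seen in base R alone. As the error message indicates, lme dispatches on the class of its first argument, and substitute returns an unevaluated call rather than a formula. A small sketch, using mtcars column names as placeholders:

```r
k <- "mpg"

# substitute() swaps the symbol in, but the result is still an unevaluated call
unevaluated <- substitute(j ~ wt, list(j = as.name(k)))
class_before <- class(unevaluated)

# Evaluating the call runs `~`, which produces a formula object
f <- eval(unevaluated)
class_after <- class(f)
```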