I am attempting to fit a bunch of different models to a single dataset. Each of the models uses a different combination of outcome variable and data subset. To fit all of these models, I created a dataframe with one column for the outcome variable and one column specifying the data subset (as a string). (Note that the subsets are overlapping so there doesn't appear to be an obvious way to do this using nest().) I then created a new function which takes one row of this dataframe and calls "lm" using these options. Lastly, I use pmap to map this function to the dataframe.
After a bunch of experimentation, I found an approach that works but is rather inelegant (see below for a simplified version of what I did). It seems like there should be a way to pass the subset condition to the subset argument in lm rather than using eval(parse(text = condition)) to first create a logical vector. I read the Advanced R section on metaprogramming in the hope that it would provide some insight, but I was unable to find anything that works.
Any suggestions would be helpful.
library(tidyverse)
outcomes <- c("mpg", "disp")
sub_conditions <- c("mtcars$cyl >=6", "mtcars$wt > 2")
models <- expand.grid(y = outcomes, condition = sub_conditions) %>% mutate_all(as.character)
fit <- function(y, condition) {
# Create the formula to use in all models
rx <- paste(y, "~ hp + am")
log_vec <- eval(parse(text = condition))
lm(rx, data = mtcars[log_vec,])
}
t <- pmap(models, fit)
Are you sure you want to pass the conditions as strings?
If so, there are not many options. You can use rlang::parse_expr as an alternative to parse(text = ...).
fit <- function(y, condition) {
rx <- paste(y, "~ hp + am")
lm(rx, data = mtcars[eval(rlang::parse_expr(condition)),])
}
and call it via
purrr::pmap(models, fit)
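If the conditions do not have to be strings, one option is to store them as unevaluated expressions and evaluate them inside the data frame. This is only a sketch in base R: the names conditions, grid and fit_one are made up for illustration, and the conditions are written without the mtcars$ prefix because they are evaluated with mtcars as the environment.

```r
outcomes <- c("mpg", "disp")

# Conditions kept as quoted expressions instead of strings (illustrative names).
conditions <- list(six_plus = quote(cyl >= 6), heavy = quote(wt > 2))

# One row per outcome/condition combination.
grid <- expand.grid(y = outcomes, cond = names(conditions),
                    stringsAsFactors = FALSE)

fit_one <- function(y, cond) {
  # Evaluate the stored expression using the columns of mtcars.
  keep <- eval(conditions[[cond]], envir = mtcars)
  lm(reformulate(c("hp", "am"), response = y), data = mtcars[keep, ])
}

fits <- Map(fit_one, grid$y, grid$cond)
names(fits) <- paste(grid$y, grid$cond, sep = "_")
```

Each element of fits is an ordinary lm object, so for example coef(fits[["mpg_six_plus"]]) gives the coefficients for mpg on the cyl >= 6 subset.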
I would like to run multiple one-way ANOVAs over multiple columns of a data frame. My approach was a for loop. The first column of my data frame contains the groups. To provide a reproducible example, I use the iris data set here. I want to use rstatix::anova_test() instead of, for example, aov(), because rstatix::anova_test() is pipe-friendly, seems to be a better option for unbalanced data (which is what I have), and also lets me choose the type of sums of squares for the ANOVA.
When I write a for loop with aov(), it works. Unfortunately, I have so far failed to do the same with rstatix::anova_test(). Can anybody help me?
library(dplyr)    # %>% and relocate()
library(rstatix)  # anova_test()
data <- iris %>% relocate(Species, .before = Sepal.Length)
# Define object which will receive the results
results <- NULL
results <- as.data.frame(results)
With aov() it works:
for(i in 2:ncol(data)){
# Put the name of the variable in first column of results object
results[i-1,1] <- names(data)[i]
# ANOVA test iterating through each column of the data frame and save output in a temporary object.
temp_anova_results <- broom::tidy(aov(data[,i] ~ Species, data = data))
# write ANOVA p value in second column of results object
results[i-1,2] <- temp_anova_results$p.value[1]
rm(temp_anova_results)
}
For several reasons I would like to work with rstatix::anova_test(), but I failed to get a correct for loop. An example I tried:
for(i in 2:ncol(data)){
# Put the name of the variable in first column of results object
results[i-1,1] <- names(data)[i]
# ANOVA test iterating through each column of the data frame and save output in a temporary object.
temp_anova_results <- data %>% anova_test(data[,i] ~ Species, type = 3)
# write ANOVA p value in second column of results object
results[i-1,2] <- temp_anova_results$p[1]
rm(temp_anova_results)
}
data %>% anova_test(data[,i] ~ Species) seems to be the problem, but it works outside the for loop when a number is inserted for i, for example data %>% anova_test(data[,2] ~ Species).
Maybe somebody else has a better answer, but the only way I can get this to work is to build the formula from the column names, that is, replace your anova_test line with:
temp_anova_results <- data %>% anova_test(formula(paste0(names(data)[i],"~","Species")))
I don't know why your method didn't work. Even outside of the loop using i instead of a numeric constant broke the anova_test function call.
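I can't say for certain what anova_test does internally, but one plausible explanation is that a formula like data[,i] ~ Species never contains the column name: the only symbols in it are data, i and Species, so any function that looks up variables by name has nothing to work with. A small base R sketch (using aov rather than rstatix, purely to show the formula side) makes this visible and builds the formula with reformulate instead:

```r
# Same layout as in the question: grouping column first.
data <- iris[, c("Species", setdiff(names(iris), "Species"))]
i <- 2

# Symbols contained in the indexed formula: no column name among them.
vars_indexed <- all.vars(data[, i] ~ Species)

# Formula built from the column name instead.
f <- reformulate("Species", response = names(data)[i])
vars_named <- all.vars(f)

# One ANOVA per outcome column, collecting the p value for Species.
p_values <- sapply(names(data)[-1], function(v) {
  fit <- aov(reformulate("Species", response = v), data = data)
  summary(fit)[[1]][["Pr(>F)"]][1]
})
```

The same reformulate call should be usable wherever a formula built from names(data)[i] is needed.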
I'm doing a naive-bayes algorithm in R. The main goal is to predict a variable's value. But in this specific task, I'm trying to see which column is better at predicting it.
This is an example of what works (but in the real dataset doing it manually isn't an option):
library(naivebayes)
data("mtcars")
mtcars$vsLog <- as.logical(as.integer(mtcars$vs))
mtcars_train <- mtcars[1:20,]
mtcars_test <- mtcars[20:32,]
car_model <- naive_bayes( data=mtcars_train, vsLog ~ mpg )
predictions <- predict(car_model,mtcars_test)
What I'm having trouble with is writing a for loop in which the model takes one column at a time and saves how well each model did at predicting the values.
I've looked at different ways to input the columns as something I can iterate over, but couldn't make it work.
My minimum reproducible example of my problem is:
library(naivebayes)
data("mtcars")
mtcars$vsLog <- as.logical(as.integer(mtcars$vs))
mtcars_train <- mtcars[1:20,]
mtcars_test <- mtcars[20:32,]
for (j in 1:ncol(mtcars)) {
car_model <- naive_bayes( data=mtcars_train, vsLog ~ mtcars_train[,j] )
predictions[j] <- predict(car_model,mtcars_test)
}
The problem is how to replace mpg in the first example with something I can loop over. Things I've tried: mtcars_train$mpg , unlist( mtcars_train[,j] ) , colnames .
I really tried googling this, I hope it's not too silly of a question.
Thanks for reading
This might be helpful. If you want to use a for loop, you can use seq_along with the names of the columns you want to loop through in your dataset. You can use reformulate to create a formula, which would use vsLog in your example as the response, as well as the jth item in your column names. In this example, you can store your predict results in a list. Perhaps this might translate to your real dataset.
pred_lst <- list()
mtcars_names <- setdiff(names(mtcars_train), "vsLog")  # predictors only; skip the response itself
for (j in seq_along(mtcars_names)) {
car_model <- naive_bayes(reformulate(mtcars_names[j], "vsLog"), data=mtcars_train)
pred_lst[[j]] <- predict(car_model, mtcars_test)
}
I am new to R, and am trying to loop regressions by group. For my data I have 13 groups, and would like to create 13 objects--a regression result for each group, so I can put all the regression results in a table.
Here is what I have tried:
for (i in 1:13) {groupi = lm(Yvariable ~ Xvariables,
data = dataset,
subset = dataset$group== i )}
So that I would have 13 group'i' objects that are each a regression result to put into a table.
THANKS!
If I get your problem right there is a specialised command for this: lmList from the nlme package.
Try this:
library(nlme)
your.result.list <- lmList(Yvariable ~ Xvariables | group, data = dataset)
your.result.list
The object your.result.list is of class lmList so it is a list with the 13 elements that you wanted to have as single objects. It has a generic print option which prints you a table of the coefficients into the console. So maybe this is already what you want?
Consider by, an object-oriented wrapper for tapply, designed to subset data frames by factor(s) and run operations on the subsets. Often it can replace split + lapply for a more streamlined call:
reg_list <- by(dataset, dataset$group, function(sub)
summary(lm(Yvariable ~ Xvariables,
data = sub)
)
)
Please note that the above only produces a named list of regression result summaries. Further work is needed to extract each model's estimates, by extending the function.
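As a rough sketch of that extraction step, the models can be kept as lm objects (rather than summaries) and their coefficients stacked into one table. The question's dataset is not available, so this uses mtcars as a stand-in, with cyl playing the role of group and mpg ~ wt the role of Yvariable ~ Xvariables:

```r
# One lm per group; split() + lapply() keeps the fitted models themselves.
models <- lapply(split(mtcars, mtcars$cyl), function(sub) {
  lm(mpg ~ wt, data = sub)
})

# One row per group, one column per coefficient.
coef_table <- t(sapply(models, coef))
coef_table
```

Individual models stay accessible by group name, for example summary(models[["4"]]).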
I am running the following imputation task in R as a for loop:
myData <- essuk[c(2,3,4,5,6,12)]
myDataImp <- matrix(0,dim(myData)[1],dim(myData)[2])
lower <- c(0)
upper <- c(Inf)
for (k in c(1:5))
{
gmm.fit1 <- gmm.tmvnorm(matrix(myData[,k],length(myData[,k]),1), lower=lower, upper=upper)
useMu <- matrix(gmm.fit1$coefficients[1],1,1)
useSigma <- matrix(gmm.fit1$coefficients[2],1,1)
replaceThese <- myData[,k]<=0
myDataImp[,k] <- myData[,k]
myDataImp[replaceThese,k] <- rtmvnorm(n=sum(replaceThese), c(useMu), c(useSigma), c(-Inf), c(0))
}
The steps are pretty straightforward:
1. Define the data set and an empty imputation data set.
2. For columns 1-5, fit a model.
3. Extract model estimates to be used for imputation.
4. Run a model using the model estimates and replace values <= 0 with the new values in the imputation data set.
However, I want to do this separately for multiple groups, rather than for the full sample. Column 12 in the data set contains information on group membership (integers ranging from 1-72).
I have tried several options, including splitting the data frame with data_list <- split(myData, myData$V12) and using the lapply() function. However, this does not work due to how the model estimates are formatted:
Error in as.data.frame.default(data) :
cannot coerce class "gmm" to a data.frame
I have also thought about the possibility of doing a nested for loop, although I am not sure how that could be accomplished. Any suggestions are much appreciated.
What about using subset()?
myData$V12 <- as.factor(myData$V12)
listofresults <- list()
for (i in levels(myData$V12)) {
  data <- subset(myData, V12 == i)
  # your analysis here, run on `data`: result saved in myDataImp
  listofresults[[i]] <- myDataImp  # one list element per group
}
Not the most elegant, but it should work.
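The same idea can also be written with split(), lapply() and unsplit(), wrapping the per-group work in a function that returns a data frame. The sketch below is only meant to show the grouping pattern: impute_group and the toy data are made up, and the replacement by the mean of the positive values is a placeholder for the gmm.tmvnorm()/rtmvnorm() step from the question, not a substitute for it.

```r
# Hypothetical per-group function; returns the group's data with values <= 0 replaced.
impute_group <- function(df, value_cols) {
  for (k in value_cols) {
    bad <- df[[k]] <= 0
    # Placeholder for the model-based draw (gmm.tmvnorm + rtmvnorm in the question).
    df[[k]][bad] <- mean(df[[k]][!bad])
  }
  df
}

# Toy data standing in for myData; grp plays the role of V12.
toy <- data.frame(v1  = c(1, -1, 3, 4, 0, 6),
                  v2  = c(2, 2, -5, 8, 8, -1),
                  grp = c(1, 1, 1, 2, 2, 2))

# Apply the function group by group, then restore the original row order.
pieces  <- lapply(split(toy, toy$grp), impute_group, value_cols = c("v1", "v2"))
imputed <- unsplit(pieces, toy$grp)
```

Because the function returns a plain data frame, nothing tries to coerce the fitted model object itself.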
This is something which data analysts do all the time (especially when working with survey data which features missing responses). It's common to first multiply impute a set of complete data matrices, fit models to each of these matrices, and then combine the results. At the moment I'm doing things by hand and looking for a more elegant solution.
Imagine there's 5 *.csv files in the working directory, named dat1.csv, dat2.csv, ... dat5.csv. I want to estimate the same linear model using each data set.
Given this answer, a first step is to gather a list of the files, which I do with the following
csvdat <- list.files(pattern="dat.*csv")
Now I want to do something like
for(x in csvdat) {
lm.which(csvdat == "x") <- lm(y ~ x1 + x2, data = x)
}
The "which" statement is my silly way of trying to number each model in turn, using the location in the csvdat list the loop is currently up to. That is, I'd like this loop to return a set of 5 lm objects with the names lm.1, lm.2, etc.
Is there some simple way to create these objects, and name them so that I can easily indicate which data set they correspond to?
Thanks for your help!
Another approach is to use the plyr package to do the looping. Using the example constructed by @chl, here is how you would do it:
require(plyr)
# read csv files into list of data frames
data_frames = llply(csvdat, read.csv)
# run regression models on each data frame
regressions = llply(data_frames, lm, formula = y ~ .)
names(regressions) = csvdat
Use a list to store the results of your regression models as well, e.g.
foo <- function(n) return(transform(X <- as.data.frame(replicate(2, rnorm(n))),
y = V1+V2+rnorm(n)))
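# row.names = FALSE keeps the row index out of the file, so it is not picked up as a predictor by y ~ .
write.csv(foo(10), file="dat1.csv", row.names=FALSE)
write.csv(foo(10), file="dat2.csv", row.names=FALSE)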
csvdat <- list.files(pattern="dat.*csv")
lm.res <- list()
for (i in seq(along=csvdat))
lm.res[[i]] <- lm(y ~ ., data=read.csv(csvdat[i]))
names(lm.res) <- csvdat
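Once the models live in a named list, it is easy to work on all of them at once. Here is a self-contained sketch of the same idea that writes a few simulated files to a temporary directory (so it does not touch the working directory) and then collects the coefficients in one table; the directory, the simulated data and the name coef_table are all made up for illustration:

```r
set.seed(1)

# Simulated stand-ins for dat1.csv, dat2.csv, ... in a temporary directory.
tmp <- tempfile("csvdemo")
dir.create(tmp)
for (i in 1:3) {
  d <- data.frame(x1 = rnorm(20), x2 = rnorm(20))
  d$y <- d$x1 + d$x2 + rnorm(20)
  write.csv(d, file.path(tmp, sprintf("dat%d.csv", i)), row.names = FALSE)
}

csvdat <- list.files(tmp, pattern = "^dat.*\\.csv$", full.names = TRUE)

# Fit the same model to each file and name the results after the files.
lm.res <- lapply(csvdat, function(f) lm(y ~ x1 + x2, data = read.csv(f)))
names(lm.res) <- basename(csvdat)

# One row per data set, one column per coefficient.
coef_table <- t(sapply(lm.res, coef))
```

A single model is then available as lm.res[["dat1.csv"]], which ties each fit to its data set by name.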
What you want is a combination of the functions seq_along() and assign().
seq_along creates a vector from 1 to 5 if there are five objects in csvdat (to get the appropriate numbers and not only the variable names). Then assign (using paste to create the appropriate strings from the numbers) lets you create the variable.
Note that you will also need to load the data file first (was missing in your example):
for (x in seq_along(csvdat)) {
data.in <- read.csv(csvdat[x]) #be sure to change this to read.table if necessary
assign(paste("lm.", x, sep = ""), lm(y ~ x1 + x2, data = data.in))
}
seq_along is not totally necessary, there could be other ways to solve the numeration problem.
The critical function is assign. With assign you can create variables with a name based on a string. See ?assign for further info.
Following chl's comments (see his post) everything in one line:
for (x in seq_along(csvdat)) assign(paste("lm", x, sep = "."), lm(y ~ x1 + x2, data = read.csv(csvdat[x])))