Loop over data table columns and apply glm using for loop - r

I am trying to loop over my data table columns and apply glm to each column using a for loop.
for(n in 1:ncol(dt)){
model = glm(y ~ dt[, n], family=binomial(link="logit"))
}
Why doesn't this work? I am getting this error:
Error in `[.data.table`(dt, , n) :
j (the 2nd argument inside [...]) is a single symbol but column name 'n' is not found. Perhaps you intended DT[, ..n]. This difference to data.frame is deliberate and explained in FAQ 1.1.
I nearly managed to make it work using something like dt[[n]], but I think it gets rid of the column name.

Use lapply to iterate over the column names and reformulate to construct each formula:
model_list <- lapply(names(dt), function(x)
glm(reformulate(x, 'y'), dt, family=binomial(link="logit")))

We can also build the formula with paste0 and pass it to glm:
model <- vector('list', ncol(dt))
for(n in 1:ncol(dt)){
model[[n]] = glm(as.formula(paste0('y ~ ', names(dt)[n])),
data = dt, family=binomial(link="logit"))
}
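Either approach can be checked on toy data. The sketch below is only an illustration, not the asker's data: the columns `a` and `b` and the response `y` are made up, and a plain data.frame is used so that it runs without extra packages (passing a data.table as `data` should behave the same way, since the formula refers to columns by name). Naming the result list keeps track of which column each model belongs to.

```r
# Toy data (hypothetical): two predictors, binary response kept outside dt
set.seed(42)
dt <- data.frame(a = rnorm(50), b = rnorm(50))
y <- rbinom(50, 1, 0.5)

# One logistic regression per column; formulas are built from the column names
model_list <- lapply(names(dt), function(x)
  glm(reformulate(x, response = "y"), data = dt,
      family = binomial(link = "logit")))

# Label each model with the column it was fitted on
names(model_list) <- names(dt)

# The coefficient names carry the column name rather than "dt[, n]"
names(coef(model_list[["a"]]))
```

Because the formula is built from the name, the column label survives in the model output, which was the concern with dt[[n]].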

Related

Changing a variable name within the formula in a loop in R

I want to change a variable (the name of a data frame column) that is part of a formula, so that I can loop over the formula. Importantly, I want to insert one variable at a time, because I want to work with that variable first and only then switch to another one (so I guess lapply with a list of variables wouldn't be a solution?).
svychisq(~var1 + strata, svy_design)
I need var1 (a column name) to change in a loop or function.
Get all the variables in a vector and create a formula object using sprintf/paste0.
library(survey)
cols <- c('var1', 'var2')
#If you want all the variables that have 'var' in it.
#cols <- grep('var', names(df), value = TRUE)
result <- lapply(cols, function(x) {
svychisq(as.formula(sprintf('~%s + strata', x)), svy_design)
})
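The formula-building step can be tried on its own, without the survey package. This is a minimal sketch assuming the column names `var1`, `var2` and `strata` from the question; `make_formula` is a hypothetical helper, and base R's `reformulate()` is swapped in as an alternative to `sprintf`. A plain for loop handles one variable at a time, which seems to be what the question asks for.

```r
# Hypothetical helper: one-sided formula for a single variable plus strata
make_formula <- function(x) reformulate(c(x, "strata"))

cols <- c("var1", "var2")

# Work with one variable at a time; each formula could be passed on to
# svychisq(f, svy_design) inside the loop before moving to the next one
for (x in cols) {
  f <- make_formula(x)
  print(f)
}
```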

Passing arguments to subset within a function

I am attempting to fit a bunch of different models to a single dataset. Each of the models uses a different combination of outcome variable and data subset. To fit all of these models, I created a dataframe with one column for the outcome variable and one column specifying the data subset (as a string). (Note that the subsets are overlapping so there doesn't appear to be an obvious way to do this using nest().) I then created a new function which takes one row of this dataframe and calls "lm" using these options. Lastly, I use pmap to map this function to the dataframe.
After a bunch of experimentation, I found an approach that works but is rather inelegant (see below for a simplified version of what I did). It seems like there should be a way to pass the subset condition to the subset argument in lm rather than using eval(parse(text = condition)) to first create a logical vector. I read the Advanced R section on metaprogramming in the hope that it would provide some insight, but I was unable to find anything that works.
Any suggestions would be helpful.
library(tidyverse)
outcomes <- c("mpg", "disp")
sub_conditions <- c("mtcars$cyl >=6", "mtcars$wt > 2")
models <- expand.grid(y = outcomes, condition = sub_conditions) %>% mutate_all(as.character)
fit <- function(y, condition) {
# Create the formula to use in all models
rx <- paste(y, "~ hp + am")
log_vec <- eval(parse(text = condition))
lm(rx, data = mtcars[log_vec,])
}
t <- pmap(models, fit)
Are you sure you want to pass the conditions as strings?
If so, there are not many options. You can use rlang::parse_expr as an alternative to parse(text = ...):
fit <- function(y, condition) {
rx <- paste(y, "~ hp + am")
lm(rx, data = mtcars[eval(rlang::parse_expr(condition)),])
}
and call it via
purrr::pmap(models, fit)
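If the conditions do not have to be strings, one alternative is to store them as unevaluated expressions with quote() and evaluate them inside the data frame. The sketch below is an assumption about what would fit the use case, not part of the answer above; it uses base R's Map() instead of pmap, and the conditions refer to columns directly (`cyl >= 6` rather than `mtcars$cyl >= 6`).

```r
outcomes <- c("mpg", "disp")

# Conditions kept as unevaluated calls instead of strings (assumption:
# the conditions can be written in code up front)
sub_conditions <- list(quote(cyl >= 6), quote(wt > 2))

fit <- function(y, condition) {
  # Evaluate the condition with the mtcars columns in scope
  rows <- eval(condition, mtcars)
  lm(reformulate(c("hp", "am"), response = y), data = mtcars[rows, ])
}

# All outcome/condition combinations, indexed by position in sub_conditions
grid <- expand.grid(y = outcomes, i = seq_along(sub_conditions),
                    stringsAsFactors = FALSE)
fits <- Map(function(y, i) fit(y, sub_conditions[[i]]), grid$y, grid$i)
```

Each element of `fits` is an ordinary lm object fitted on the rows selected by its condition.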

Correlation Test loop in R

I am trying to create a data frame of p-values and estimates that compares one gene to many different expression markers. My cor.test call works when I use it on a single expression marker, but when I try to loop over them it breaks with the error "'x' and 'y' must have the same length".
I am wondering how to get this loop to work and build the data frame.
Below is what I am running through my loop and the code for the loop.
M3 <- ads$mean
Expression <- c("Exp1","Exp2","Exp3")
for (i in seq_along(Expression))
{
corr<-cor.test(M3, Expression[i], method = "pearson")
cor_df<-data.frame(Expression = Expression[i],pvalue = corr$p.value,
cor = corr$estimate)
}
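The loop passes the string Expression[i] itself, a character vector of length 1, to cor.test rather than the column it names, which is why the lengths differ. Based on your comment, if Exp1, Exp2, and Exp3 are columns in a data frame (df), then you can use something like this inside the loop: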
corr <- cor.test(M3, df[ ,Expression[i]], method = "pearson")
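Note that the loop in the question also overwrites cor_df on every iteration, so only the last marker would survive. A minimal sketch of collecting one row per marker follows; the data here are simulated stand-ins (the real `ads$mean` and expression columns are not available), so only the structure is meant to carry over.

```r
# Simulated stand-ins for the real data (hypothetical values)
set.seed(1)
df <- data.frame(Exp1 = rnorm(20), Exp2 = rnorm(20), Exp3 = rnorm(20))
M3 <- rnorm(20)
Expression <- c("Exp1", "Exp2", "Exp3")

# One single-row data frame per marker
rows <- lapply(Expression, function(e) {
  corr <- cor.test(M3, df[[e]], method = "pearson")
  data.frame(Expression = e,
             pvalue = corr$p.value,
             cor = unname(corr$estimate))
})

# Stack the rows into the final result
cor_df <- do.call(rbind, rows)
```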

Naming calculated columns in data.table

Is there a way to name columns created from the output of a function in data.table?
I want to get regression estimates from some data, so my code looks like this:
library(data.table)
DT <- as.data.table(expand.grid(x = 1:10, t = letters[1:4]))
DT[, y := rnorm(10), by = t]
# Function to get estimates and standard error for
# each regressor.
ExtractLm <- function(model) {
a <- summary(model)
return(list(rownames(a$coefficients),
a$coefficients[, 1],
a$coefficients[, 2]))
}
DT.regression <- DT[, ExtractLm(lm(x ~ y)), by = t]
The problem here is that DT.regression has ugly names (V1, V2, V3). One solution would be to modify the function ExtractLm() so that it returns a named list, but I think this is not very flexible since it wouldn't allow me to use different names in another situation. Of course, I can always rename columns after the fact, but that's not very elegant.
I tried to use DT[, c("regressor", "estimate", "se") = ExtractLm(lm(x ~ y)), by = t] and many variants and it doesn't work. I've been re-reading the docs but I've only found that kind of syntax when creating new columns with :=, which is not applicable to this case. Is there a way to use a similar syntax to do this?
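One option that may help is to wrap the call in setNames(), so the names are chosen at the call site rather than inside ExtractLm(). The sketch below shows the wrapper on a single base-R model (the built-in cars data is used only for illustration). The data.table line in the final comment is the assumed equivalent for the question's DT and is not run here.

```r
# Same extractor as in the question: returns an unnamed list
ExtractLm <- function(model) {
  a <- summary(model)
  list(rownames(a$coefficients),
       a$coefficients[, 1],
       a$coefficients[, 2])
}

# Illustration on a built-in data set
fit <- lm(dist ~ speed, data = cars)

# Name the list elements where the function is called
res <- setNames(ExtractLm(fit), c("regressor", "estimate", "se"))
as.data.frame(res)

# Assumed data.table equivalent (not run here):
# DT[, setNames(ExtractLm(lm(x ~ y)), c("regressor", "estimate", "se")), by = t]
```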

With R, loop over data frames, and assign appropriate names to objects created in the loop

This is something data analysts do all the time (especially when working with survey data that features missing responses). It's common to first multiply impute a set of complete data matrices, fit models to each of these matrices, and then combine the results. At the moment I'm doing things by hand and looking for a more elegant solution.
Imagine there are 5 *.csv files in the working directory, named dat1.csv, dat2.csv, ... dat5.csv. I want to estimate the same linear model using each data set.
Given this answer, a first step is to gather a list of the files, which I do with the following
csvdat <- list.files(pattern="dat.*csv")
Now I want to do something like
for(x in csvdat) {
lm.which(csvdat == "x") <- lm(y ~ x1 + x2, data = x)
}
The "which" statement is my silly way of trying to number each model in turn, using the position in the csvdat list that the loop is currently up to. That is, I'd like this loop to return a set of 5 lm objects with the names lm.1, lm.2, etc.
Is there some simple way to create these objects, and name them so that I can easily indicate which data set they correspond to?
Thanks for your help!
Another approach is to use the plyr package to do the looping. Using the example constructed by @chl, here is how you would do it:
require(plyr)
# read csv files into list of data frames
data_frames = llply(csvdat, read.csv)
# run regression models on each data frame
regressions = llply(data_frames, lm, formula = y ~ .)
names(regressions) = csvdat
Use a list to store the results of your regression models as well, e.g.
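# Helper that simulates a small data set with predictors V1, V2 and response y
foo <- function(n) return(transform(X <- as.data.frame(replicate(2, rnorm(n))),
                                    y = V1+V2+rnorm(n)))
# row.names = FALSE keeps the row index from being read back as a predictor by y ~ .
write.csv(foo(10), file="dat1.csv", row.names=FALSE)
write.csv(foo(10), file="dat2.csv", row.names=FALSE)
csvdat <- list.files(pattern="dat.*csv")
lm.res <- list()
for (i in seq_along(csvdat))
  lm.res[[i]] <- lm(y ~ ., data=read.csv(csvdat[i]))
names(lm.res) <- csvdat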
What you want is a combination of the functions seq_along() and assign().
seq_along creates a vector from 1 to 5 if there are five objects in csvdat (so you get the numbers and not only the file names). assign (using paste to build the appropriate strings from the numbers) then lets you create the variables.
Note that you will also need to load the data file first (this was missing in your example):
for (x in seq_along(csvdat)) {
data.in <- read.csv(csvdat[x]) #be sure to change this to read.table if necessary
assign(paste("lm.", x, sep = ""), lm(y ~ x1 + x2, data = data.in))
}
seq_along is not strictly necessary; there are other ways to solve the numbering problem.
The critical function is assign. With assign you can create variables whose names are built from a string. See ?assign for further info.
Following chl's comments (see his post) everything in one line:
for (x in seq_along(csvdat)) assign(paste("lm", x, sep = "."), lm(y ~ x1 + x2, data = read.csv(csvdat[x])))
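The list-based and assign-based answers can be combined: keep the models in one list but give the elements the lm.1, lm.2, ... names asked for in the question. The following is a self-contained sketch under stated assumptions: the example files are simulated and written to a temporary directory, and the column names `x1`, `x2` and `y` are taken from the question's formula.

```r
# Write two small example files to a temporary directory (made-up data)
tmp <- tempfile("csvdemo")
dir.create(tmp)
set.seed(1)
for (i in 1:2) {
  d <- data.frame(x1 = rnorm(10), x2 = rnorm(10))
  d$y <- d$x1 + d$x2 + rnorm(10)
  write.csv(d, file.path(tmp, paste0("dat", i, ".csv")), row.names = FALSE)
}

csvdat <- list.files(tmp, pattern = "^dat.*\\.csv$", full.names = TRUE)

# Fit the same model to every file and keep the results in one list
lm.res <- lapply(csvdat, function(f) lm(y ~ x1 + x2, data = read.csv(f)))

# Names in the lm.1, lm.2, ... style asked for in the question
names(lm.res) <- paste0("lm.", seq_along(lm.res))

coef(lm.res[["lm.1"]])
```

A named list keeps the models together, so later steps such as pooling estimates across imputed data sets can use lapply or sapply instead of looking up separate global variables.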
