Creating a function out of a dataframe with a function - r

I have several variables in my dataframe (e.g.: a, b, c, d) and I'm obtaining linear model parameters (Intercept, Slope and rSquared) by season through this code (example for variable a):
lm_results_season_a<- ddply(dataframe1, "Season", function(x) {
model <- summary(lm(y ~ a, data = x))
Intercept<- model$coefficients[1,1]
Slope<- model$coefficients[2,1]
rSquared <- model$r.squared
data.frame(Intercept, Slope, rSquared)
})
My problem is that I have too many variables, and repeating this code for each variable takes a lot of space.
For example, I would have to write the same code for variable b:
lm_results_season_b<- ddply(dataframe1, "Season", function(x) {
model <- summary(lm(y ~ b, data = x))
Intercept<- model$coefficients[1,1]
Slope<- model$coefficients[2,1]
rSquared <- model$r.squared
data.frame(Intercept, Slope, rSquared)
})
and keep repeating the same code for the rest of the variables. So I tried to create a function in which I don't have to repeat all this code again, but just to call a function that can make all the calculations and give me the dataframe I am looking for.
I tried this code in which I define the variables before, and then just add them to the function:
variable1 <- dataframe1$y
variable2 <- dataframe1$a
LM_coef <- function(data, variable1, variable2){
lm_results_season<- ddply(data, "Season", function(x) {
model <- summary(lm(variable1 ~ variable2, data = x))
Intercept<- model$coefficients[1,1]
Slope<- model$coefficients[2,1]
rSquared <- model$r.squared
data.frame(Intercept,Slope, rSquared)
})
return(lm_results_season)
}
But this is not working as I wanted. Instead of giving me the linear regression parameters by Season for the variable "a", it gives me the parameters for the variable "a" as a whole, not by season.
Any idea on what's happening in the function or how to modify this function?

Are you bound to the plyr package? If not, you can use the more up-to-date purrr package from the tidyverse.
Here we can create a function that takes the dataframe data, the two variables for the linear model (variable1 and variable2, given as strings), and the splitting column split_var (in your case "Season").
LM_coef <- function(data, variable1, variable2, split_var){
require(purrr)
data %>%
split(.[[split_var]]) %>%
map(~summary(lm(eval(as.name(variable1)) ~ eval(as.name(variable2)), data = .x))) %>%
map_dfr(~cbind(as.data.frame(t(as.matrix(coef(.)[1:2,1]))), .$r.squared), .id = split_var) %>%
setNames(c(split_var, "Intercept", "Slope", "rSquared"))
}
Example
Using the mtcars dataset, we can do
LM_coef(mtcars, "hp", "mpg", "cyl")
in order to obtain
# cyl Intercept Slope rSquared
# 1 4 147.4315 -2.430092 0.27405583
# 2 6 164.1564 -2.120802 0.01614624
# 3 8 294.4974 -5.647887 0.08044919
which is equal to what you would obtain from your initial function lm_results_season_a.
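As for why the original LM_coef ignored the seasons: the formula variable1 ~ variable2 names objects that are not columns of the per-season piece x, so lm falls back to the full-length vectors passed into the function and fits the same whole-data model for every group. Below is a minimal base R sketch of one way around that, building the formula from column names with reformulate. The helper name lm_coef_by_group and its argument names are made up for illustration.

```r
# Hypothetical helper (not from the original post): per-group lm summaries
# using only base R. Column names are passed as strings.
lm_coef_by_group <- function(data, response, predictor, group) {
  # Build e.g. hp ~ mpg from the two strings
  form <- reformulate(predictor, response = response)
  # One data frame per level of the grouping column
  pieces <- split(data, data[[group]])
  rows <- lapply(pieces, function(piece) {
    s <- summary(lm(form, data = piece))  # variables are looked up in 'piece'
    data.frame(Intercept = s$coefficients[1, 1],
               Slope     = s$coefficients[2, 1],
               rSquared  = s$r.squared)
  })
  out <- do.call(rbind, rows)
  # Put the group label in a column named after the grouping variable
  out <- cbind(setNames(data.frame(names(pieces)), group), out)
  rownames(out) <- NULL
  out
}

res <- lm_coef_by_group(mtcars, "hp", "mpg", "cyl")
res
```

With your data the call would presumably look like lm_coef_by_group(dataframe1, "y", "a", "Season"), and you could lapply over c("a", "b", "c", "d") to get one table per variable.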

Related

How to loop over columns to evaluate different fixed effects in consecutive lme4 mixed models and extract the coefficients and P values?

I am new to R and am trying to loop a mixed model across 90 columns in a dataset.
My dataset looks like the following one but has 90 predictors instead of 7 that I need to evaluate as fixed effects in consecutive models.
I then need to store the model output (coefficients and P values) to finally construct a figure summarizing the effect sizes of each predictor. I am aware of the debate about P value estimates from lme4 mixed models.
For example:
library(dplyr) # provides tibble(), %>% and arrange()
set.seed(101)
mydata <- tibble(id = rep(1:32, times=25),
time = sample(1:800),
experiment = rep(1:4, times=200),
Y = sample(1:800),
predictor_1 = runif(800),
predictor_2 = rnorm(800),
predictor_3 = sample(1:800),
predictor_4 = sample(1:800),
predictor_5 = seq(1:800),
predictor_6 = sample(1:800),
predictor_7 = runif(800)) %>% arrange (id, time)
The model to iterate across the N predictors is:
library(lme4)
library(lmerTest) # To obtain p-values
mixed.model <- lmer(Y ~ predictor_1 + time + (1|id) + (1|experiment), data = mydata)
summary(mixed.model)
My coding skills are far from being able to set a loop to repeat the model across the N predictors in my dataset and store the coefficients and P values in a dataframe.
I have been able to iterate across all the predictors fitting linear models instead of mixed models using lapply. But I have failed to apply this strategy with mixed models.
varlist <- names(mydata)[5:11]
lm_models <- lapply(varlist, function(x) {
lm(substitute(Y ~ i, list(i = as.name(x))), data = mydata)
})
One option is to update the formula of a restricted model (w/o predictor) in an lapply loop over the predictors. Then summarize the resulting list and subset the coefficient matrix using a Vectorized function.
library(lmerTest)
mixed.model <- lmer(Y ~ time + (1|id) + (1|experiment), data = mydata)
preds <- grep('pred', names(mydata), value=TRUE)
fits <- lapply(preds, \(x) update(mixed.model, paste('. ~ . + ', x)))
extract_coef_p <- Vectorize(\(x) x |> summary() |> coef() |> {\(.) .[3, c(1, 5)]}())
res <- `rownames<-`(t(extract_coef_p(fits)), preds)
res
# Estimate Pr(>|t|)
# predictor_1 -7.177579138 0.8002737
# predictor_2 -5.010342111 0.5377551
# predictor_3 -0.013030513 0.7126500
# predictor_4 -0.041702039 0.2383835
# predictor_5 -0.001437124 0.9676346
# predictor_6 0.005259293 0.8818644
# predictor_7 31.304496255 0.2511275
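The same update() pattern can be tried out without lme4 installed. The sketch below swaps in plain lm() on made-up data (so there are no random effects here, and the toy data frame and its columns are assumptions for illustration only), but the loop and the extraction step have the same shape as above.

```r
# Toy data, invented for this sketch
set.seed(101)
toy <- data.frame(Y = rnorm(100),
                  time = runif(100),
                  predictor_1 = rnorm(100),
                  predictor_2 = runif(100),
                  predictor_3 = rnorm(100))

# Restricted model without any of the predictors of interest
base_fit <- lm(Y ~ time, data = toy)

# Add one predictor at a time to the restricted model
preds <- grep("^predictor_", names(toy), value = TRUE)
fits <- lapply(preds, function(p) update(base_fit, paste(". ~ . +", p)))

# Row 3 of each coefficient matrix is the added predictor
# (rows 1 and 2 are the intercept and time)
res <- t(sapply(fits, function(f) coef(summary(f))[3, c("Estimate", "Pr(>|t|)")]))
rownames(res) <- preds
res
```

Selecting columns by name rather than by position avoids having to remember that the lmerTest summary has an extra df column.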

How to regress a list of covariates with a desired predictor and dependent variable and return a table of coefficients and p-values using lme and lmer

I have a dataset with a rather large amount of variables. In the dataset I have a predictor and an outcome variable I want to investigate. I want to find covariates with either a significant effect on the outcome variable, or a significant interaction effect between the predictor and the covariate on the outcome variable.
It would therefore be convenient to be able to regress all the covariates in turn with the desired predictor on the dependent variable and create a table over the effects and interaction effects of the covariates with their respective p-values.
I want to do something like this:
library(dplyr)
# Generating sample data
set.seed(5)
df <- data.frame(matrix(round(abs(2*rnorm(100*100)), digits = 0), ncol=100))
# Selecting covariates
covar <- names(df)[! names(df) %in% c("X1", "X2")]
# Running the lm function over the list of covariates. I should get the covariate coefficients from each regression, but I get an error when I try to run this step.
coeff <- lapply(covar, function(x){
# Retrieve coefficient matrix
summary(lm(X1 ~ X2 + x + X2*x, df))$coefficients %>%
# Coerce into dataframe and filter for covariates and interaction effects
as.data.frame(.) %>%
filter(row.names(.) %in% grep(x, rownames(.), value =
TRUE))}) %>%
# Finally I want to join all data frames into one
bind_rows(.)
I could use some help with the syntax. I get the following error when I try to run the function:
Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'summary': variable lengths differ (found for 'x')
Inside the function passed to lapply, x is a character string holding a column name, not the column itself, so lm cannot use it directly in the formula. It is better to build the model formula with paste0:
lapply(covar, function(x){
modd <- paste0("X1 ~ X2 +", x, "+ X2 *", x)
summary(lm(modd, df))$coefficients %>%
as.data.frame(.) %>%
filter(row.names(.) %in% grep(x, rownames(.), value =
TRUE))}) %>%
bind_rows(.)
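If you would rather not depend on dplyr, roughly the same table can be put together in base R. This is only a sketch: it uses a smaller toy data set than the one in the question, builds the formula with reformulate instead of paste0, and adds a term column (my own addition) so the row labels survive the final rbind.

```r
# Smaller version of the sample data: columns X1..X6
set.seed(5)
toy <- data.frame(matrix(round(abs(2 * rnorm(100 * 6))), ncol = 6))
covar <- setdiff(names(toy), c("X1", "X2"))

coef_rows <- lapply(covar, function(x) {
  # X1 ~ X2 + x + X2:x, built from strings
  form <- reformulate(c("X2", x, paste0("X2:", x)), response = "X1")
  cm <- summary(lm(form, data = toy))$coefficients
  # Keep the covariate's main effect and its interaction with X2
  keep <- grepl(x, rownames(cm), fixed = TRUE)
  data.frame(term = rownames(cm)[keep], cm[keep, , drop = FALSE],
             row.names = NULL, check.names = FALSE)
})

# One data frame with two rows per covariate
coef_table <- do.call(rbind, coef_rows)
coef_table
```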

Predict() + assign() loop across variables

I have estimated several models (a, b) and I want to calculate predicted probabilities for each model using a single data frame (df) and store the predicted probabilities of each model as new variables in that data frame. For example:
a <- lm(y ~ z, df) # estimate model a
b <- glm(w ~ x, df) # estimate model b
models <- c("a","b") # create vector of model objects
for (i in models) {
assign(
paste("df$", i, sep = ""),
predict(i, df)
)}
I have tried the above but receive the error "no applicable method for 'predict' applied to an object of class "character"" with the last word changing as I change class of the predicted object, e.g. predict(as.numeric(i),df).
Any ideas? Ideally I could vectorize this as well.
You should rarely have to use assign() and $ should not be used with variable names. The [[]] operator is better for dynamic subsetting than $. And it would be easier if you just made a list of the models rather than just their names. Here's an example
df<-data.frame(x=runif(30), y=runif(30), w=runif(30), z=runif(30))
a <- lm(y ~ z, df) # estimate model a
b <- lm(w ~ x, df) # estimate model b
models <- list(a=a,b=b) # create named list of model objects
# 1) for loop
for (m in names(models)) {
df[[m]] <- predict(models[[m]], df)
}
Or rather than a for loop, you could generate all the values with Map and then append with cbind afterward
# 2) Map/cbind
df <- cbind(df, Map(function(m) predict(m,df), models))
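A third variant, sketched here under the same toy setup, assigns all prediction columns in one step with lapply. The pred_ prefix is just an illustrative choice so the new columns cannot be confused with the model objects.

```r
set.seed(1)
df <- data.frame(x = runif(30), y = runif(30), w = runif(30), z = runif(30))

# Named list of fitted models
models <- list(a = lm(y ~ z, df), b = lm(w ~ x, df))

# One new column per model, named pred_a, pred_b, ...
df[paste0("pred_", names(models))] <- lapply(models, predict, newdata = df)
head(df)
```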

How to Loop/Repeat a Linear Regression in R

I have figured out how to make a table in R with 4 variables, which I am using for multiple linear regressions. The dependent variable (Lung) for each regression is taken from one column of a csv table of 22,000 columns. One of the independent variables (Blood) is taken from a corresponding column of a similar table.
Each column represents the levels of a particular gene, which is why there are so many of them. There are also two additional variables (Age and Gender of each patient). When I enter in the linear regression equation, I use lm(Lung[,1] ~ Blood[,1] + Age + Gender), which works for one gene.
I am looking for a way to input this equation and have R calculate all of the remaining columns for Lung and Blood, and hopefully output the coefficients into a table.
Any help would be appreciated!
You want to run 22,000 linear regressions and extract the coefficients? That's simple to do from a coding standpoint.
set.seed(1)
# number of columns in the Lung and Blood data.frames. 22,000 for you?
n <- 5
# dummy data
obs <- 50 # observations
Lung <- data.frame(matrix(rnorm(obs*n), ncol=n))
Blood <- data.frame(matrix(rnorm(obs*n), ncol=n))
Age <- sample(20:80, obs)
Gender <- factor(rbinom(obs, 1, .5))
# run n regressions
my_lms <- lapply(1:n, function(x) lm(Lung[,x] ~ Blood[,x] + Age + Gender))
# extract just coefficients
sapply(my_lms, coef)
# if you need more info, get full summary call. now you can get whatever, like:
summaries <- lapply(my_lms, summary)
# ...coefficients with p values:
lapply(summaries, function(x) x$coefficients[, c(1,4)])
# ...or r-squared values
sapply(summaries, function(x) c(r_sq = x$r.squared,
adj_r_sq = x$adj.r.squared))
The models are stored in a list, where model 3 (with DV Lung[, 3] and IVs Blood[,3] + Age + Gender) is in my_lms[[3]] and so on. You can use apply functions on the list to perform summaries, from which you can extract the numbers you want.
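Since the question asks for the coefficients in a table, here is a small sketch that collects them into a matrix with one row per gene. It reuses the dummy data from above; the column labels are my own, chosen because the default names would be things like Blood[, j].

```r
set.seed(1)
n <- 5      # number of gene columns (22,000 in the real data)
obs <- 50   # observations
Lung <- data.frame(matrix(rnorm(obs * n), ncol = n))
Blood <- data.frame(matrix(rnorm(obs * n), ncol = n))
Age <- sample(20:80, obs)
Gender <- factor(rbinom(obs, 1, .5))

# One regression per column; sapply returns coefficients in columns, so transpose
coef_table <- t(sapply(seq_len(n), function(j) {
  coef(lm(Lung[, j] ~ Blood[, j] + Age + Gender))
}))

# Readable labels: one row per gene, one column per term
rownames(coef_table) <- names(Lung)
colnames(coef_table) <- c("Intercept", "Blood", "Age", "Gender1")
coef_table
```

If you do not need the full model objects, extracting coef() inside the loop like this avoids keeping 22,000 lm fits in memory.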
The question seems to be about how to call regression functions with formulas which are modified inside a loop.
Here is how you can do it in R (using the diamonds dataset):
attach(ggplot2::diamonds)
strCols = names(ggplot2::diamonds)
formula <- list(); model <- list()
for (i in 1:1) {
formula[[i]] = paste0(strCols[7], " ~ ", strCols[7+i])
model[[i]] = glm(formula[[i]])
#then you can plot or do anything else with the result ...
png(filename = sprintf("diamonds_price=glm(%s).png", strCols[7+i]))
par(mfrow = c(2, 2))
plot(model[[i]])
dev.off()
}
Sensible or not, to make the loop at least somehow work you need:
y<- c(1,5,6,2,5,10) # response
x1<- c(2,12,8,1,16,17) # predictor
x2<- c(2,14,5,1,17,17)
df <- data.frame(y, x1, x2) # data frame referenced by data=df in the loop
predictorlist<- list("x1","x2")
for (i in predictorlist){
model <- lm(paste("y ~", i[[1]]), data=df)
print(summary(model))
}
The paste function will solve the problem.
A tidyverse addition - with map()
Another way - using map2() from the purrr package:
library(purrr)
xs <- anscombe[,1:3] # Select variables of interest
ys <- anscombe[,5:7]
map2_df(ys, xs,
function(i,j){
m <- lm(i ~j + x4 , data = anscombe)
coef(m)
})
The output is a dataframe (tibble) of all coefficients:
`(Intercept)` j x4
1 4.33 0.451 -0.0987
2 6.42 0.373 -0.253
3 2.30 0.526 0.0518
If more variables are changing, this can be done using the pmap() functions.
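For three changing inputs, a base R analogue of that idea is Map, which walks several vectors in parallel. The sketch below uses mtcars rather than anscombe, and the particular response/predictor/covariate triples are arbitrary picks for illustration.

```r
# Three parallel vectors of column names: model k uses ys[k] ~ xs[k] + zs[k]
ys <- c("mpg", "hp", "qsec")
xs <- c("wt", "disp", "drat")
zs <- c("cyl", "gear", "carb")

fits_coef <- Map(function(y, x, z) {
  form <- reformulate(c(x, z), response = y)  # e.g. mpg ~ wt + cyl
  coef(lm(form, data = mtcars))
}, ys, xs, zs)

# A list with one named coefficient vector per response
fits_coef
```

Because the predictor names differ between models, the result is kept as a list rather than forced into a single data frame.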

plyr with nested groups?

Is there an elegant way to use ddply() to obtain output for not only the most granular groups defined, but also the groups of those sub-groups?
In other words, when one of the classifiers is "any" or "either" or "doesn't matter". In the simple case of two grouping variables, this can be accomplished by a separate call to ddply; however, when there are three or more classifiers that can all be set to "any" this gets messy due to having to run ddply over and over again for every new combination of "any"+others.
Reproducible example:
require(plyr)
## create a data frame with three classification variables
## and two numeric variables:
df1=data.frame(classifier1 = LETTERS[sample(2,200,replace=T)],
classifier2 = letters[sample(3,200,replace=T)],
classifier3 = rep(c("foo","bar"),100),
VAR1 = runif(200,50,250),
VAR2 = rnorm(200,85,20))
## apply an arbitrary function to subsets of df1; that is, all unique
## combinations of the three classifiers.
dlply(df1, .(classifier1,classifier2,classifier3),
function(df) lm(VAR1 ~ VAR2, data=df))
$A.a.bar
Call:
lm(formula = VAR1 ~ VAR2, data = df)
Coefficients:
(Intercept) VAR2
230.5555 -0.8591
$A.a.foo
Call:
lm(formula = VAR1 ~ VAR2, data = df)
Coefficients:
(Intercept) VAR2
128.3078 0.3631
...
Now, what if I want to get the same output for a few more groups when any/all classifiers are not included. For example, if I wanted to include when classifier1="any", I would only include classifier2 and classifier3 in the dlply statement, like this:
dlply(df1, .(classifier2,classifier3), function(df) lm(VAR1 ~ VAR2, data=df))
If I then wanted to get output for when classifier2 and classifier3="any", I would again delete from the dlply call and only include classifier1:
dlply(df1, .(classifier1), function(df) lm(VAR1 ~ VAR2, data=df))
However, this gets unwieldy when I have many more classifiers than three, and each classifier can be taken out (i.e. = "any") -- the number of combinations increases substantially. Is there an elegant/fast way to obtain output for all the "groups of groups" of my data?
One approach would be to create a list of the combinations and then use Map to create a list of the results of each dlply call.
You can use combn in combination with lapply and do.call('c',...) to create a list of all the combinations of 1,2, ...,n variables
xx <- do.call('c',lapply(1:3, function(m) {
combn(x=names(df1)[1:3],m, simplify = FALSE)}))
You can then use this in a call to Map (which is a wrapper for mapply(..., SIMPLIFY = FALSE))
results <- Map(f = function(x){dlply(df1,.var=x, .fun = lm, formula = VAR1 ~ VAR2)},xx)
Or you could just pass a function to combn -- which will do the same thing
results <- do.call('c',lapply(1:3, function(m) {
combn(x=names(df1)[1:3],m, simplify = FALSE,
function(vv) {dlply(df1,.var=vv, .fun = lm, formula = VAR1~VAR2)})
}))
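For readers without plyr, the same "groups of groups" idea can be sketched in base R with combn and split. The data below is regenerated with a fixed seed for the example, and the names combos and results are just illustrative.

```r
set.seed(42)
df1 <- data.frame(classifier1 = LETTERS[sample(2, 200, replace = TRUE)],
                  classifier2 = letters[sample(3, 200, replace = TRUE)],
                  classifier3 = rep(c("foo", "bar"), 100),
                  VAR1 = runif(200, 50, 250),
                  VAR2 = rnorm(200, 85, 20))

class_vars <- names(df1)[1:3]

# All combinations of 1, 2 and 3 classifiers (7 in total)
combos <- unlist(lapply(seq_along(class_vars), function(m) {
  combn(class_vars, m, simplify = FALSE)
}), recursive = FALSE)
names(combos) <- vapply(combos, paste, character(1), collapse = "+")

# For each combination, split on those classifiers and fit one model per group
results <- lapply(combos, function(vars) {
  pieces <- split(df1, df1[vars], drop = TRUE)
  lapply(pieces, function(piece) coef(lm(VAR1 ~ VAR2, data = piece)))
})

names(results)
```

Each element of results is itself a list keyed by group label, e.g. results[["classifier1+classifier3"]] holds one coefficient vector per classifier1/classifier3 pair.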
