plyr with nested groups? - r

Is there an elegant way to use ddply() to obtain output not only for the most granular groups defined, but also for the groups of those sub-groups?
In other words, for the cases where one of the classifiers is "any" or "either" or "doesn't matter". In the simple case of two grouping variables, this can be accomplished by a separate call to ddply; however, when there are three or more classifiers that can all be set to "any", this gets messy due to having to run ddply over and over again for every new combination of "any" plus the others.
Reproducible example:
require(plyr)
## create a data frame with three classification variables
## and two numeric variables:
df1 <- data.frame(classifier1 = LETTERS[sample(2, 200, replace = TRUE)],
                  classifier2 = letters[sample(3, 200, replace = TRUE)],
                  classifier3 = rep(c("foo", "bar"), 100),
                  VAR1 = runif(200, 50, 250),
                  VAR2 = rnorm(200, 85, 20))
## apply an arbitrary function to subsets of df1; that is, all unique
## combinations of the three classifiers.
dlply(df1, .(classifier1,classifier2,classifier3),
function(df) lm(VAR1 ~ VAR2, data=df))
$A.a.bar
Call:
lm(formula = VAR1 ~ VAR2, data = df)
Coefficients:
(Intercept) VAR2
230.5555 -0.8591
$A.a.foo
Call:
lm(formula = VAR1 ~ VAR2, data = df)
Coefficients:
(Intercept) VAR2
128.3078 0.3631
...
Now, what if I want to get the same output for a few more groups, where some of the classifiers are left out? For example, if I wanted to include the case classifier1="any", I would only include classifier2 and classifier3 in the dlply statement, like this:
dlply(df1, .(classifier2,classifier3), function(df) lm(VAR1 ~ VAR2, data=df))
If I then wanted to get output for when classifier2 and classifier3="any", I would again delete them from the dlply call and only include classifier1:
dlply(df1, .(classifier1), function(df) lm(VAR1 ~ VAR2, data=df))
However, this gets unwieldy when I have many more classifiers than three, and each classifier can be taken out (i.e. = "any"): the number of combinations increases substantially. Is there an elegant/fast way to obtain output for all the "groups of groups" of my data?

One approach would be to create a list of the combinations and then use Map to create a list of the results of each dlply call.
You can use combn in combination with lapply and do.call('c',...) to create a list of all the combinations of 1, 2, ..., n variables:
xx <- do.call('c',lapply(1:3, function(m) {
combn(x=names(df1)[1:3],m, simplify = FALSE)}))
You can then use this in a call to Map (which is a wrapper for mapply(..., SIMPLIFY = FALSE)):
results <- Map(f = function(x){dlply(df1,.var=x, .fun = lm, formula = VAR1 ~ VAR2)},xx)
Or you could just pass a function to combn, which will do the same thing:
results <- do.call('c',lapply(1:3, function(m) {
combn(x=names(df1)[1:3],m, simplify = FALSE,
function(vv) {dlply(df1,.var=vv, .fun = lm, formula = VAR1~VAR2)})
}))
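If plyr is not available, the same enumeration can be sketched in base R. This is only an illustration under a few assumptions: the seed, the names classifiers, combos and fits, and the use of split() in place of dlply() are mine and not part of the answer above.

```r
# Sketch: fit lm(VAR1 ~ VAR2) for every non-empty subset of the classifiers.
set.seed(1)  # assumed seed, only so the example is repeatable
df1 <- data.frame(
  classifier1 = LETTERS[sample(2, 200, replace = TRUE)],
  classifier2 = letters[sample(3, 200, replace = TRUE)],
  classifier3 = rep(c("foo", "bar"), 100),
  VAR1 = runif(200, 50, 250),
  VAR2 = rnorm(200, 85, 20)
)
classifiers <- names(df1)[1:3]

# All combinations of 1, 2 and 3 classifiers, as a flat list of character vectors
combos <- unlist(
  lapply(seq_along(classifiers),
         function(m) combn(classifiers, m, simplify = FALSE)),
  recursive = FALSE
)
names(combos) <- vapply(combos, paste, character(1), collapse = "+")

# For each combination, split the data by those columns and fit one model per piece
fits <- lapply(combos, function(vars) {
  pieces <- split(df1, df1[vars], drop = TRUE)
  lapply(pieces, function(d) lm(VAR1 ~ VAR2, data = d))
})

names(fits)  # one entry per combination of classifiers
```

Each element of fits is itself a list of lm objects, named after the group levels, so fits[["classifier1"]] corresponds to the case where classifier2 and classifier3 are "any".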

Related

Creating a function out of a dataframe with a function

I have several variables in my dataframe (e.g.: a, b, c, d) and I'm obtaining by season linear model parameters (Intercept, Slope and rSquared) through this code (Example for variable a):
lm_results_season_a<- ddply(dataframe1, "Season", function(x) {
model <- summary(lm(y ~ a, data = x))
Intercept<- model$coefficients[1,1]
Slope<- model$coefficients[2,1]
rSquared <- model$r.squared
data.frame(Intercept, Slope, rSquared)
})
My problem is that I have too many variables, and repeating this code for each variable takes a lot of space.
For example, I would have to write the same code for variable b:
lm_results_season_b<- ddply(dataframe1, "Season", function(x) {
model <- summary(lm(y ~ b, data = x))
Intercept<- model$coefficients[1,1]
Slope<- model$coefficients[2,1]
rSquared <- model$r.squared
data.frame(Intercept, Slope, rSquared)
})
and keep repeating the same code for the rest of the variables. So I tried to create a function in which I don't have to repeat all this code again, but just to call a function that can make all the calculations and give me the dataframe I am looking for.
I tried this code in which I define the variables before, and then just add them to the function:
variable1 <- dataframe1$y
variable2 <- dataframe1$a
LM_coef <- function(data, variable1, variable2){
lm_results_season<- ddply(data, "Season", function(x) {
model <- summary(lm(variable1 ~ variable2, data = x))
Intercept<- model$coefficients[1,1]
Slope<- model$coefficients[2,1]
rSquared <- model$r.squared
data.frame(Intercept,Slope, rSquared)
})
return(lm_results_season)
}
But this is not working as I wanted. Instead of giving me the linear regression parameters by Season for the variable "a", it gives me the parameters for the variable "a" over the whole data set, repeated for every season.
Any idea on what's happening in the function or how to modify this function?
Are you bound to the plyr package? If not, you can use the more recent purrr package from the tidyverse.
Here we can create a function where we insert the dataframe data, the two variables for the linear model variable1 and variable2, and the splitting column split_var (in your case "Season").
LM_coef <- function(data, variable1, variable2, split_var){
require(purrr)
data %>%
split(.[[split_var]]) %>%
map(~summary(lm(eval(as.name(variable1)) ~ eval(as.name(variable2)), data = .x))) %>%
map_dfr(~cbind(as.data.frame(t(as.matrix(coef(.)[1:2,1]))), .$r.squared), .id = split_var) %>%
setNames(c(split_var, "Intercept", "Slope", "rSquared"))
}
Example
Using the mtcars dataset, we can do
LM_coef(mtcars, "hp", "mpg", "cyl")
in order to obtain
# cyl Intercept Slope rSquared
# 1 4 147.4315 -2.430092 0.27405583
# 2 6 164.1564 -2.120802 0.01614624
# 3 8 294.4974 -5.647887 0.08044919
which is the same kind of table your initial code for lm_results_season_a produces: one row per group, with Intercept, Slope and rSquared.
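As for why the original LM_coef returned the same fit for every season: variable1 and variable2 are not columns of the piece x, so lm() finds them as the full-length vectors passed to the function and the data = x argument has no effect. A minimal base R sketch that avoids this by building the formula from column names is below. The function name lm_by_group and its arguments are hypothetical, and mtcars stands in for the real data.

```r
# Sketch: per-group simple regression, with the formula built from column names.
lm_by_group <- function(data, response, predictor, split_var) {
  # reformulate("mpg", response = "hp") gives the formula hp ~ mpg
  form <- reformulate(predictor, response = response)
  pieces <- split(data, data[[split_var]])
  rows <- lapply(pieces, function(d) {
    s <- summary(lm(form, data = d))  # variables are looked up in the piece d
    data.frame(Intercept = s$coefficients[1, 1],
               Slope     = s$coefficients[2, 1],
               rSquared  = s$r.squared)
  })
  out <- do.call(rbind, rows)          # row names are the group labels
  out <- data.frame(group = rownames(out), out, row.names = NULL)
  names(out)[1] <- split_var
  out
}

res <- lm_by_group(mtcars, "hp", "mpg", "cyl")
res
```

Because the formula refers to real column names, each call to lm() only sees the rows of its own group.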

Grouping by a user defined list within a custom function in R

I am trying to create a custom function in R that lets the user perform linear regressions on a data set. I would like the user to be able to input variables for the data to be grouped by, so that multiple regressions are performed on the data set. The problem I am having is getting a user-defined list of variables into the custom function. Below I have tried using "...", but this does not work. If anyone has any idea how I should be approaching this, that would be great. For reference: lr.1 = the dataset, ddate = the x variable, value = the y variable, ... = the variables that the data should be grouped by.
grouped.lr = function(lr.1,ddate, value, ...){
  test = lr.1 %>%
    group_by(...) %>%
    nest() %>%
    mutate(mod = map(data, fitmodel.test),
           pars = map(mod, tidy),
           pred = map(mod, augment))}
It seems like the use of a formula might be fitting here, as it allows the user to specify the predictor-response relations.
The formula object is also accepted as a format for various models and can thus be directly passed down to the lm() function.
# function training a linear model and a random forest
library(randomForest)  # provides randomForest()
build_my_models <- function(formula, data) {
  lm.fit <- lm(formula, data)
  rf.fit <- randomForest(formula, data)
  return(list(lm.fit, rf.fit))
}
# data frame with three continuous variables
a <- rnorm(100)
b <- rnorm(100, mean = 2, sd = 4)
c <- 2*a + b
my_data <- data.frame(a = a, b = b, c = c)
# build the models
my_models <- build_my_models(a ~ ., my_data)
# here the formula 'a ~ .' defines the relation between response and predictors
# (the dot indicates that 'a' depends on all other variables in the data frame)
If you want to implement a model yourself, it's never a bad idea to stick to R's syntax and conventions. You can check the documentation on how to parse the formula for your specific needs.
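To connect the formula idea back to the grouping part of the question, here is a minimal base R sketch that takes the grouping columns as a character vector instead of "...". The helper name grouped_lm and its arguments are hypothetical, and mtcars is used only as stand-in data.

```r
# Hypothetical helper: fit one lm per combination of the grouping columns.
grouped_lm <- function(data, formula, group_vars) {
  # split() on a list of columns groups by their interaction;
  # drop = TRUE discards combinations that do not occur in the data
  pieces <- split(data, data[group_vars], drop = TRUE)
  lapply(pieces, function(d) lm(formula, data = d))
}

fits <- grouped_lm(mtcars, mpg ~ wt, c("cyl", "am"))
names(fits)            # group labels such as "4.1" (cyl = 4, am = 1)
t(sapply(fits, coef))  # one row of coefficients per group
```

Passing the grouping columns as strings sidesteps the question of how to forward "..." and keeps the model specification in a single formula argument.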

Loop multiple 'multiple linear regressions' in R

I have a database where I want to do several multiple regressions. They all look like this:
fit <- lm(Variable1 ~ Age + Speed + Gender + Mass, data=Data)
The only variable changing is Variable1. Now I want to use a loop, or something from the apply family, to put several variables in the place of Variable1. These variables are columns in my data file. Can someone help me to solve this problem? Many thanks!
what I tried so far:
When I extract one of the column names with the names() function, I do get the name of the column:
varname = as.name(names(Data[14]))
But when I fill this in (and I used the attach() function):
fit <- lm(Varname ~ Age + Speed + Gender + Mass, data=Data)
I get the following error:
Error in model.frame.default(formula = Varname ~ Age + Speed + Gender
+ : object is not a matrix
I suppose that the lm() function does not recognize Varname as Variable1.
You can use lapply to loop over your variables.
fit <- lapply(Data[,c(...)], function(x) lm(x ~ Age + Speed + Gender + Mass, data = Data))
This gives you a list of your results.
The c(...) should contain your variable names as strings. Alternatively, you can choose the variables by their position in Data, like Data[,1:5].
The problem in your case is that the formula in the lm function takes names literally: it looks for a column (or object) with exactly that name and feeds whatever it finds into the regression. Therefore, to use a column name stored in a variable such as varnames, you need to build the formula from the value of that variable together with the other terms.
# generate some data
set.seed(123)
Data <- data.frame(x = rnorm(30), y = rnorm(30),
Age = sample(0:90, 30), Speed = rnorm(30, 60, 10),
Gender = sample(c("W", "M"), 30, rep=T), Mass = rnorm(30))
varnames <- names(Data)[1:2]
# fit regressions for multiple dependent variables
fit <- lapply(varnames,
FUN=function(x) lm(formula(paste(x, "~Age+Speed+Gender+Mass")), data=Data))
names(fit) <- varnames
fit
$x
Call:
lm(formula = formula(paste(x, "~Age+Speed+Gender+Mass")), data = Data)
Coefficients:
(Intercept) Age Speed GenderW Mass
0.135423 0.010013 -0.010413 0.023480 0.006939
$y
Call:
lm(formula = formula(paste(x, "~Age+Speed+Gender+Mass")), data = Data)
Coefficients:
(Intercept) Age Speed GenderW Mass
2.232269 -0.008035 -0.027147 -0.044456 -0.023895

how to use loop to do linear regression in R

I wonder if I can use such as for loop or apply function to do the linear regression in R. I have a data frame containing variables such as crim, rm, ad, wd. I want to do simple linear regression of crim on each of other variable.
Thank you!
If you really want to do this, it's pretty trivial with lapply(), where we use it to "loop" over the other columns of df. A custom function takes each variable in turn as x and fits a model for that covariate.
df <- data.frame(crim = rnorm(20), rm = rnorm(20), ad = rnorm(20), wd = rnorm(20))
mods <- lapply(df[, -1], function(x, dat) lm(crim ~ x, data = dat), dat = df)
mods is now a list of lm objects. The names of mods contain the names of the covariates used to fit each model. The main negative of this is that all the models are fitted using a variable x. More effort could probably solve this, but I doubt that effort is worth the time.
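For what it's worth, one way to keep the real covariate name in each model is to build the formula from the column name with reformulate(). This is a sketch under assumptions: the seed and the name covariates are mine, and the dummy data mirrors the example above.

```r
# Sketch: one simple regression of crim on each other column,
# with the covariate's own name in the formula instead of a generic x.
set.seed(1)  # assumed seed, only so the example is repeatable
df <- data.frame(crim = rnorm(20), rm = rnorm(20), ad = rnorm(20), wd = rnorm(20))

covariates <- setdiff(names(df), "crim")
mods <- lapply(setNames(covariates, covariates), function(v) {
  # reformulate("rm", response = "crim") gives the formula crim ~ rm
  lm(reformulate(v, response = "crim"), data = df)
})

names(coef(mods$rm))  # the slope is labelled with the covariate name
```

The list is still named by covariate, and each coefficient table now reports the actual variable.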
If you are just selecting models, which may be dubious, there are other ways to achieve this. For example via the leaps package and its regsubsets function:
library("leaps")
a <- regsubsets(crim ~ ., data = df, nvmax = 1, nbest = ncol(df) - 1)
summa <- summary(a)
Then plot(a) will show which of the models is "best", for example.
Original
If I understand what you want (crim is a covariate and the other variables are the responses you want to predict/model using crim), then you don't need a loop. You can do this using a matrix response in a standard lm().
Using some dummy data:
df <- data.frame(crim = rnorm(20), rm = rnorm(20), ad = rnorm(20), wd = rnorm(20))
we create a matrix or multivariate response via cbind(), passing it the three response variables we're interested in. The remaining parts of the call to lm are entirely the same as for a univariate response:
mods <- lm(cbind(rm, ad, wd) ~ crim, data = df)
mods
Call:
lm(formula = cbind(rm, ad, wd) ~ crim, data = df)
Coefficients:
rm ad wd
(Intercept) -0.12026 -0.47653 -0.26419
crim -0.26548 0.07145 0.68426
The summary() method produces a standard summary.lm output for each of the responses.
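As a quick check that the matrix-response fit is equivalent to fitting each response separately, the coefficients can be compared column by column. This is a small sketch on dummy data; the seed and the name single are assumptions of mine.

```r
# Sketch: a multivariate-response lm gives the same coefficients
# as separate univariate fits, one column per response.
set.seed(1)  # assumed seed, only so the example is repeatable
df <- data.frame(crim = rnorm(20), rm = rnorm(20), ad = rnorm(20), wd = rnorm(20))

mods   <- lm(cbind(rm, ad, wd) ~ crim, data = df)
single <- lm(rm ~ crim, data = df)

coef(mods)                                    # matrix: one column per response
all.equal(coef(mods)[, "rm"], coef(single))   # compare the rm column with the single fit
length(summary(mods))                         # one summary per response
```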
Suppose the response variable is fixed as the first column of your data frame, and you want to run a separate simple linear regression of it on each of the other variables.
h=iris[,-5]
for (j in 2:ncol(h)){
assign(paste("a", j, sep = ""),lm(h[,1]~h[,j]))
}
The code above fits one regression per remaining column and stores the results in a2, a3, ....

Regression Summaries in R

I've been using the glm function to do regression analysis, and it's treating me quite well. I'm wondering though, some of the things I want to regress involve a large amount of regression factors. I have two main questions:
Is it possible to give a text vector for the regressors?
Can the p-value portion of summary(glm) be sorted at all? Preferably by the p-values of each regressor.
Ex.
A # sample data frame
names(A)
[1] Dog Cat Human Limbs Tail Height Weight Teeth.Count
a = names(A)[4:7]
glm( Dog ~ a, data = A, family = "binomial")
For your first question, see as.formula. Basically you want to do the following:
x <- names(A)[4:7]
regressors <- paste(x,collapse=" + ")
form <- as.formula(paste("Dog ~", regressors))
glm(form, data = A, family = "binomial")
If you want interaction terms in your model, you need to make the structure somewhat more complex by using different collapse= arguments. That argument specifies which symbols are placed between the elements of your vector. For instance, if you specify "*" in the code above, you will have a saturated model with all possible interactions. If you just need some interactions, but not all, you will want to create the part of the formula containing all interactions first (using "*" as collapse argument), and then add the remaining terms in the separate paste function (using "+" as collapse argument). All in all, you want to create a character string that is identical to your formula, and then convert it to the formula class.
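A minimal sketch of the mixed case described above, where some regressors interact and others enter additively. The split of the columns into interacting and main_terms is hypothetical and chosen only for illustration.

```r
# Sketch: build a formula string with some interacting and some additive terms.
interacting <- c("Limbs", "Tail")     # assumed: terms that should fully interact
main_terms  <- c("Height", "Weight")  # assumed: terms that enter additively

# First collapse the interacting terms with "*", then join everything with "+"
rhs  <- paste(c(paste(interacting, collapse = " * "), main_terms), collapse = " + ")
form <- as.formula(paste("Dog ~", rhs))

form
attr(terms(form), "term.labels")  # expanded terms, including Limbs:Tail
```

The resulting object can be passed to glm() in the same way as the additive formula.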
For your second question, you need to convert the output of summary to a data structure that can be sorted. For instance, a data frame. Let's say that the name of your glm model is model:
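library(plyr)
coef <- summary(model)$coefficients  # select by name; the position in the list can vary
coef.sort <- as.data.frame(coef)
names(coef.sort) <- c("Estimate","SE","Tval","Pval")
coef.sort$Term <- rownames(coef)     # arrange() drops row names, so keep them in a column
arrange(coef.sort,Pval)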
Assign the result of arrange() to a variable, and continue with it as you like.
An example data frame:
set.seed(42)
A <- data.frame(Dog = sample(0:1, 100, TRUE), b = rnorm(100), c = rnorm(100))
a <- names(A)[2:3]
Firstly, you can use the character vector a to create a model formula with reformulate:
# glm(Dog ~ a, data = A, family = "binomial") fails: a is a character vector, not a column of A
form <- reformulate(a, "Dog")
# Dog ~ b + c
model <- glm(form, data = A, family = "binomial")
Secondly, this is a way to sort the model summary by the p-values:
modcoef <- summary(model)[["coefficients"]]
modcoef[order(modcoef[ , 4]), ]
# Estimate Std. Error z value Pr(>|z|)
# b 0.23902684 0.2212345 1.0804232 0.2799538
# (Intercept) 0.20855908 0.2025642 1.0295951 0.3032001
# c -0.09287769 0.2191231 -0.4238608 0.6716673
