Regression of a Data Frame with multiple factor groupings - r

I am working on a regression script.
I have a data.frame with roughly 130 columns, and I need to run a regression of one column (let's call it the X column) against each of the other ~100 numeric columns.
Before the regression is calculated, I need to group the data by 4 factors: myDat$Recipe, myDat$Step, myDat$Stage, and myDat$Prod while still keeping the other ~100 columns and row data attached for the regression. Then I need to do a regression of each column ~ X column and print out the R^2 value with the column name. This is what I've tried so far but it is getting overly complicated and I know there's got to be a better way.
rm(list=ls())
myDat <- read.csv(file="C:/Users/Documents/myDat.csv", header=TRUE, sep=",")
for(j in myDat$Recipe)
{
myDatj <- subset(myDat, myDat$Recipe == j)
for(k in myDatj$Step)
{
myDatk <- subset(myDatj, myDatj$Step == k)
for(i in myDatk$Stage)
{
myDati <- subset(myDatk, myDatk$Stage == i)
for(m in myDati$Prod)
{
myDatm <- subset(myDati, myDati$Prod == m)
if(is.numeric(myDatm[3,i]))
{
fit <- lm(myDatk[,i] ~ X, data=myDatm)
rsq <- summary(fit)$r.squared
{
writeLines(paste(rsq,i,"\n"))
}
}
}
}
}
}

You can do this by combining dplyr, tidyr and my broom package (you can install them with install.packages). First you need to gather all the numeric columns into a single column:
library(dplyr)
library(tidyr)
tidied <- myDat %>%
gather(column, value, -X, -Recipe, -Step, -Stage, -Prod)
To understand what this does, you can read up on tidyr's gather operation. (This assumes that all columns besides X, Recipe, Step, Stage, and Prod are numeric and therefore should be predicted in your regression. If that's not the case, you need to remove them beforehand. You'll need to produce a reproducible example of the problem if you need a more customized solution).
Then perform each regression, while grouping by the column and the four grouping variables.
library(broom)
regressions <- tidied %>%
group_by(column, Recipe, Step, Stage, Prod) %>%
do(mod = lm(value ~ X, data = .))
glances <- regressions %>% glance(mod)
The resulting glances data frame will have one row for each combination of column, Recipe, Step, Stage, and Prod, along with an r.squared column containing the R-squared from each model. (It will also contain adj.r.squared, along with other columns such as F-test p-value: see here for more). Running coefs <- regressions %>% tidy(mod) will probably also be useful for you, as it will get the coefficient estimates and p-values from each regression.
A similar use case is described in the "broom and dplyr" vignette, and in Section 3.1 of the broom manuscript.
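For readers without those packages, the same idea (fit one model per group and per column, then collect the R-squared values) can be sketched in base R. This is only an illustration on the built-in mtcars data, not the answer's method: `wt` stands in for the X column, `am` and `vs` stand in for the four grouping factors, and `y_cols` stands in for the ~100 numeric columns.

```r
dat <- mtcars
x_col <- "wt"                      # stand-in for the X column
group_cols <- c("am", "vs")        # stand-ins for Recipe, Step, Stage, Prod
y_cols <- c("mpg", "disp", "hp")   # stand-ins for the other numeric columns

# One data frame per combination of the grouping columns
groups <- split(dat, dat[group_cols], drop = TRUE)

# For each group, regress every y column on x_col and keep the R-squared
rsq <- do.call(rbind, lapply(names(groups), function(g) {
  d <- groups[[g]]
  data.frame(
    group = g,
    column = y_cols,
    r.squared = sapply(y_cols, function(y) {
      summary(lm(reformulate(x_col, response = y), data = d))$r.squared
    }),
    row.names = NULL
  )
}))

print(rsq)
```

The result has one row per group and column, which is the same shape of output the glances data frame is described as having.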

Related

how to use R package `caret` to run `pls::plsr( )` with multiple responses

caret::train() does not seem to accept y if y is a matrix with multiple columns.
Thanks for any help!
That's correct. Perhaps you want the tidymodels package? Kuhn has said there would be support for multivariate response in it. Here's evidence in favor of my suggestion: https://www.tidymodels.org/learn/models/pls/
Do a search of that document for plsr:
library(tidymodels)
library(pls)
get_var_explained <- function(recipe, ...) {
# Extract the predictors and outcomes into their own matrices
y_mat <- bake(recipe, new_data = NULL, composition = "matrix", all_outcomes())
x_mat <- bake(recipe, new_data = NULL, composition = "matrix", all_predictors())
# The pls package prefers the data in a data frame where the outcome
# and predictors are in _matrices_. To make sure this is formatted
# properly, use the `I()` function to inhibit `data.frame()` from making
# all the individual columns. `pls_format` should have two columns.
pls_format <- data.frame(
endpoints = I(y_mat),
measurements = I(x_mat)
)
# Fit the model
mod <- plsr(endpoints ~ measurements, data = pls_format)
# Get the proportion of the predictor variance that is explained
# by the model for different number of components.
xve <- explvar(mod)/100
# To do the same for the outcome, it is more complex. This code
# was extracted from pls:::summary.mvr.
explained <-
drop(pls::R2(mod, estimate = "train", intercept = FALSE)$val) %>%
# transpose so that components are in rows
t() %>%
as_tibble() %>%
# Add the predictor proportions
mutate(predictors = cumsum(xve) %>% as.vector(),
components = seq_along(xve)) %>%
# Put into a tidy format that is tall
pivot_longer(
cols = c(-components),
names_to = "source",
values_to = "proportion"
)
}
# We compute this data frame for each resample and save the results in the different columns.
folds <-
folds %>%
mutate(var = map(recipes, get_var_explained),
var = unname(var))
# To extract and aggregate these data, simple row binding can be used to stack the data vertically.
# Most of the action happens in the first 15 components, so let's filter the data and compute the average proportion.
variance_data <-
bind_rows(folds[["var"]]) %>%
filter(components <= 15) %>%
group_by(components, source) %>%
summarize(proportion = mean(proportion))
This code block is not reproducible on its own: it relies on a folds object with a recipes column that must already exist, and it may need additional data or packages.

How do I avoid using a for-loop to get elements of a nested list to the top level?

I am trying to extract the coefficients of a set of linear models into a data frame. How do I extract these values without using a for-loop?
The data in my example is dummy data for clarity. The actual project makes models for air temperature for each day of the year, and then tries to model the parameters of these models. Currently I can only accumulate each coefficient in a separate variable, and then apply it individually to my data set:
require(tidyverse)
# making different mpg models from displacement, distinguished by cylinder count
models <- mtcars %>%
nest(-cyl, .key = "cardata") %>%
mutate(mod = map(cardata, ~lm(mpg ~ disp, data = .))) %>%
mutate(coefficients = map(mod, coefficients)) #this only extracts a list of coefficients
# currently using one for-loop to extract each coefficient, looking for a more elegant way...
coef.intercept <- c()
for (i in models$coefficients) {
coef.intercept <- c(coef.intercept,i[1])
}
coef.disp <- c()
for (i in models$coefficients) {
coef.disp <- c(coef.disp,i[2])
}
# putting together the final data frame
models <- models %>%
mutate(coef.intercept) %>%
mutate(coef.disp) %>%
select(cyl, coef.intercept, coef.disp) %>%
as.data.frame()
Using 'map' I can extract a list of coefficients, but I cannot use the '[' operator in order to get specific elements of the individual lists. Something like
mutate(models, coef.intercept = map(models, coefficients[1]))
does not work, I get "Error: Index 1 must have length 1, not 2".
I'm not able at the moment to replicate your example, but I think you can start from here and adapt this solution to your needs.
A <- list(a = list(1,"j"), b = list(2, "k") , d = list(3, "m" ) )
sapply(A, `[[`, 1)
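Applied to the question's setting, the same `[[` trick might look like the sketch below. It uses base R only and rebuilds the per-cylinder models with split() and lapply() instead of the nested tibble, so the intermediate objects are assumptions made for illustration, not the asker's exact pipeline.

```r
# One lm per cylinder count, as in the question
models <- lapply(split(mtcars, mtcars$cyl), function(d) lm(mpg ~ disp, data = d))

# A list of coefficient vectors, one per model
coefs <- lapply(models, coef)

# Pull element 1 (intercept) and element 2 (slope) out of every list entry
result <- data.frame(
  cyl = names(models),
  coef.intercept = vapply(coefs, `[[`, numeric(1), 1),
  coef.disp = vapply(coefs, `[[`, numeric(1), 2),
  row.names = NULL
)

print(result)
```

vapply() is used here rather than sapply() so that the result is guaranteed to be a numeric vector. Inside a tidyverse pipeline, purrr's map_dbl(coefficients, 1) extracts by position in the same way.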

Generic time-series backtesting/cross-validation with R

I want to make some time-series evaluation in R. The process is usually to define a time lag and the evaluation frequency/periods, and for each evaluation period, train a model with the defined time lag and compute metrics for that period.
For example, we have:
Evaluation period size and interval n
Evaluation start at b
Time lag l
We train a model with points 1:b-l and evaluate it on b:b+n. After that we train a model with points 1:b+n-l and evaluate it on b+n:b+2n, and so on, for k periods. It could vary a bit but that's the general spirit. So this is basically a sliding window for the evaluation data, but an increasing window for training data.
This is illustrated in the answer to this question (the expanding window solution).
How could I do this, preferably without loops and using the tidyverse and/or packages specific for time-series analysis?
This is how I'm doing it at the moment, but I'm really not satisfied with it: there is too much custom code and it is not very modular.
time_series_cv <- function(dates_lim, df) {
eval_data <-
df %>%
filter(
date >= dates_lim[['date_beg']],
date < dates_lim[['date_end']]
)
eval_data$prediction <-
predict(
lm(
log(y) ~ .,
df %>% filter(date < dates_lim[['date_beg']]) %>% select(-c(date))
),
eval_data
)
eval_data %>%
select(date, y, prediction)
}
predictions <-
lapply(dates, time_series_cv, df = df) %>%
bind_rows()
dates is a list of named lists with the start and end of the evaluation period. Lag is 1 sample here.
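One way to make this more modular is to separate the generation of the windows from the model fitting. The sketch below is a base-R illustration under stated assumptions: make_splits() is a hypothetical helper, the evaluation windows are taken as non-overlapping (b to b+n-1, then b+n to b+2n-1, and so on), and the data frame is toy data indexed by position rather than by date.

```r
# Hypothetical helper: expanding training window, sliding evaluation window.
# b = first evaluation index, n = evaluation window size,
# l = lag between the end of training and the start of evaluation,
# k = number of evaluation periods.
make_splits <- function(b, n, l, k) {
  lapply(seq_len(k), function(i) {
    eval_start <- b + (i - 1) * n
    list(
      train = seq_len(eval_start - l),
      test  = seq(eval_start, eval_start + n - 1)
    )
  })
}

splits <- make_splits(b = 11, n = 5, l = 1, k = 3)

# Toy data: a noisy linear trend
set.seed(1)
df <- data.frame(t = 1:25)
df$y <- 2 * df$t + rnorm(25)

# Fit on each training window, predict on the matching evaluation window
predictions <- do.call(rbind, lapply(splits, function(s) {
  fit <- lm(y ~ t, data = df[s$train, ])
  data.frame(
    t = df$t[s$test],
    y = df$y[s$test],
    prediction = predict(fit, newdata = df[s$test, ])
  )
}))
```

With the splits kept as plain index vectors, the model-fitting function can be swapped out without touching the windowing logic.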

apply series of commands to split data frame

I'm having some difficulties figuring out how to approach this problem. I have a data frame that I am splitting into distinct sites (link5). Once split, I basically want to run a linear regression model on the subsets. Here is the code I'm working with, but it's definitely not correct. Also, it would be great if I could output the model results to a new data frame such that each site would have one row with the model parameter estimates; that is just a wish and not a necessity right now. Thank you for any help!
les_events <- split(les, les$link5)
result <- lapply(les_events) {
lm1 <-lm(cpe~K,data=les_events)
coef <- coef(lm1)
q.hat <- -coef(lm1)[2]
les_events$N0.hat <- coef(lm1[1]/q.hat)
}
You have a number of issues:
You haven't passed a function (the FUN argument) to lapply.
Your closure (the bit inside {}) is almost, but not quite, the body you want for your function.
Something like the following will return the coefficients from your models:
result <- lapply(les_events, function(DD){
lm1 <-lm(cpe~K,data=DD)
coef <- coef(lm1)
data.frame(as.list(coef))
})
This will return a list of data.frames containing columns for each coefficient.
lapply(les_events, lm, formula = 'cpe~K')
will return a list of linear model objects, which may be more useful.
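To get the one-row-per-site data frame mentioned in the question, the list of one-row data frames can be stacked with rbind. The sketch below shows this on the built-in mtcars data, which is only a stand-in: `cyl` plays the role of link5, and `mpg ~ disp` the role of `cpe ~ K`.

```r
# Split into groups (stand-in for split(les, les$link5))
groups <- split(mtcars, mtcars$cyl)

# One-row data frame of coefficients per group
per_group <- lapply(groups, function(DD) {
  fit <- lm(mpg ~ disp, data = DD)
  data.frame(as.list(coef(fit)))
})

# Stack the rows and record which group each row came from
result <- do.call(rbind, per_group)
result$cyl <- names(groups)

print(result)
```

Note that data.frame() renames the "(Intercept)" coefficient to a syntactically valid column name, so the intercept column appears as X.Intercept.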
For more general split / apply / combine approaches, use plyr or data.table.
data.table
library(data.table)
DT <- data.table(les)
result <- DT[, {lm1 <- lm(cpe ~ K, data = .SD)
as.list(coef(lm1))}, by = link5]
plyr
library(plyr)
result <- ddply(les, .(link5), function(DD){
lm1 <-lm(cpe~K,data=DD)
coef <- coef(lm1)
data.frame(as.list(coef))
})
# or to return a list of linear model objects
dlply(les, .(link5), function(DD){ lm(cpe ~ K, data = DD)})

Creating a matrix of summary output

How can I insert summary outputs from multiple regression analyses in a matrix type variable in R statistics package?
Here is my script, which runs the regressions and collects the intercepts and coefficients in a variable:
for (i in 2:(ncol(data.base))) {
Test <- lm(data.base[,i] ~ log(database$var.1))
results <- rbind(results, c(Test$coefficients))
}
What I would like to do is store summary(Test) for each regression in a matrix-type variable. I assume a matrix-type variable is what I need.
I appreciate your help.
Yuck! Some nasty variable naming there, in my opinion.
I see data.base has outcomes, and you don't want the first column but each is a separate outcome. You also have database which is a data.frame with a variable var.1. Run each regression, store them in a matrix format.
This is a start:
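# lapply over the columns keeps one lm object per outcome
fits <- lapply(data.base[, -1], function(y) lm(y ~ log(database$var.1)))
summ <- lapply(fits, summary)
# coefficient table (estimate, std. error, t value, p-value) from each summary
coefs <- lapply(summ, coef)
Reduce(cbind, coefs)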
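If the goal is a single matrix with one row per outcome, a variation is to pick the wanted numbers out of each fit and let sapply() assemble them. The sketch below is an illustration on the built-in mtcars data, not the asker's data.base: the three outcome columns and the `wt` predictor are arbitrary stand-ins.

```r
# Stand-ins for the outcome columns and the single predictor
outcomes <- mtcars[, c("mpg", "hp", "qsec")]
predictor <- mtcars$wt

# One regression per outcome column
fits <- lapply(outcomes, function(y) lm(y ~ log(predictor)))

# Collect selected summary values; sapply returns one column per fit,
# so transpose to get one row per outcome
stats <- t(sapply(fits, function(f) {
  s <- summary(f)
  c(intercept = unname(coef(f)[1]),
    slope     = unname(coef(f)[2]),
    r.squared = s$r.squared)
}))

print(stats)
```

Any other scalar from the summary object can be added to the vector in the same way.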
