Loop through function and stack the output into a dataset in R - r

I wrote a function that runs a linear model and outputs a data frame. I would like to run the function several times and stack the output. Here is a hypothetical dataset and function:
data = data.frame(grade_level = rep(1:4, each = 3),
x = rnorm(12, mean = 21, sd = 7.5),
y = rnorm(12, mean = 20, sd = 7))
func = function(grade){
model = lm(y ~ x, data=data[data$grade_level == grade,])
fitted.values = model$fitted.values
final = data.frame(grade_level = data$grade_level[data$grade_level == grade],
predicted_values = fitted.values)
final
}
Currently, I run the function over each grade in the dataset:
grade1 = func(1)
grade2 = func(2)
grade3 = func(3)
grade4 = func(4)
pred.values = rbind(grade1, grade2, grade3, grade4)
How can I use a loop (or something else) to more efficiently run this function multiple times?

The purrr package has a handy function for this. map() works like the apply family of functions in base R (which behave much like a for loop). The _dfr suffix specifies that the results should be row-bound into a single data frame (via dplyr::bind_rows(), so dplyr must be installed) before returning.
This call says: "loop through c(1, 2, 3, 4), call func() on each value, then row-bind the results and give the data frame back to me."
purrr::map_dfr(1:4, func)

A solution using lapply()
do.call("rbind", lapply(1:4, func))
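The lapply() approach can also take the grade levels from the data instead of hard-coding 1:4. The following is a minimal sketch, assuming a hypothetical variant of func() called fit_grade() that receives the data frame as an argument; the seed is only there to make the example reproducible.

```r
set.seed(42)  # assumption: fixed seed so the sketch is reproducible
data <- data.frame(grade_level = rep(1:4, each = 3),
                   x = rnorm(12, mean = 21, sd = 7.5),
                   y = rnorm(12, mean = 20, sd = 7))

# fit_grade() is a hypothetical variant of func() that takes the data explicitly
fit_grade <- function(grade, data) {
  sub <- data[data$grade_level == grade, ]
  model <- lm(y ~ x, data = sub)
  data.frame(grade_level = sub$grade_level,
             predicted_values = unname(fitted(model)))
}

# derive the grades from the data, fit one model per grade, stack the results
grades <- sort(unique(data$grade_level))
pred.values <- do.call(rbind, lapply(grades, fit_grade, data = data))
```

Passing the data as an argument keeps the function independent of a global object, so the same call works for any data frame with grade_level, x and y columns.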

Below is a solution using a for loop and rbind().
data = data.frame(grade_level = rep(1:4, each = 3),
x = rnorm(12, mean = 21, sd = 7.5),
y = rnorm(12, mean = 20, sd = 7))
func = function(grade){
model = lm(y ~ x, data=data[data$grade_level == grade,])
fitted.values = model$fitted.values
final = data.frame(grade_level = data$grade_level[data$grade_level == grade],
predicted_values = fitted.values)
return(final)
}
grades <- c(1:4)
pred.values <- data.frame()
for (i in grades) {
temp <- func(grade = i)
pred.values <- rbind(pred.values, temp)
}
This gives (the exact values depend on the random draws):
> pred.values
grade_level predicted_values
1 1 30.78802
2 1 22.79665
3 1 29.56155
4 2 14.60050
5 2 14.56934
6 2 14.71737
7 3 16.97698
8 3 17.71697
9 3 18.95596
10 4 15.18937
11 4 16.56399
12 4 22.49093
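Growing a data frame with rbind() inside the loop copies the accumulated result on every iteration. A common alternative is to store each piece in a pre-allocated list and bind once at the end. The sketch below assumes the same data and func() as above, with an added seed for reproducibility.

```r
set.seed(42)  # assumption: fixed seed so the sketch is reproducible
data <- data.frame(grade_level = rep(1:4, each = 3),
                   x = rnorm(12, mean = 21, sd = 7.5),
                   y = rnorm(12, mean = 20, sd = 7))

func <- function(grade) {
  model <- lm(y ~ x, data = data[data$grade_level == grade, ])
  data.frame(grade_level = data$grade_level[data$grade_level == grade],
             predicted_values = model$fitted.values)
}

grades <- sort(unique(data$grade_level))
results <- vector("list", length(grades))  # one empty slot per grade

for (k in seq_along(grades)) {
  results[[k]] <- func(grades[k])          # fill the slot, no rbind yet
}

pred.values <- do.call(rbind, results)     # bind all pieces in one step
```

For four groups the difference is negligible; it starts to matter when the loop runs over hundreds or thousands of groups.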

Related

How to call multiple distribution functions from different vectors into a function in R

Let me talk you through my workflow:
General idea
Based on data in a dataframe, select the appropriate distribution functions, combine them in all possible ways to get the mean of the combined distributions.
Starting position
I have a large data frame df. In there I have different variables var1, var2 and var3 in this example which contains data to select the appropriate distribution function.
I have several distribution functions per variable:
var1_distr1 <- pdqr::as_d(function(x)dnorm(x, mean = 3, sd = 1))
var1_distr2 <- pdqr::as_d(function(x)dnorm(x, mean = 6, sd = 1))
var1_distr3 <- pdqr::as_d(function(x)dnorm(x, mean = 2, sd = 2))
var2_distr1 <- pdqr::as_d(function(x)dnorm(x, mean = 5, sd = 3))
var2_distr2 <- pdqr::as_d(function(x)dnorm(x, mean = 3, sd = 1))
var2_distr3 <- pdqr::as_d(function(x)dnorm(x, mean = 4, sd = 2))
var3_distr1 <- pdqr::as_d(function(x)dnorm(x, mean = 4, sd = 1))
var3_distr2 <- pdqr::as_d(function(x)dnorm(x, mean = 5, sd = 1))
var3_distr3 <- pdqr::as_d(function(x)dnorm(x, mean = 7, sd = 2))
Select the right distribution
Using an if_else on each of the vars I generate the appropriate distribution per case in a new vector. The if_else looks like this for var1 and has the same appearance for all vars:
df$distr_var1 <- if_else(df$info < 0, "var1_distr1",
if_else(df$info > 0 & df$info < 100, "var1_distr2", "var1_distr3"))
This results in the following df:
df <- data.frame(distr_var1 = c("var1_distr1", "var1_distr3", "var1_distr1", "var1_distr2", "var1_distr2", "var1_distr1", "var1_distr3"),
distr_var2 = c("var2_distr2", "var2_distr1", "var2_distr2", "var2_distr1", "var2_distr3", "var2_distr3", "var2_distr1"),
distr_var3 = c("var3_distr2", "var3_distr3", "var3_distr1", "var3_distr1", "var3_distr2", "var3_distr3", "var3_distr1"))
Combine distribution functions
To combine distribution functions in a new proportional distribution function I have created this function based on this question:
foo <- function(...){
#set x values
x <- seq(1, 10, by = 1)
#create y values
y <- 1L
for (fun in list(...)) y <- y * fun(x)
#create new PDF
p <- data.frame(x,y)
pdqr::new_d(p, type = "continuous")
}
And I have stored the PDFs in a list:
PDFS <- list(var1_distr1 = var1_distr1, var1_distr2 = var1_distr2, var1_distr3 = var1_distr3,
var2_distr1 = var2_distr1, var2_distr2 = var2_distr2, var2_distr3 = var2_distr3,
var3_distr1 = var3_distr1, var3_distr2 = var3_distr2, var3_distr3 = var3_distr3)
I would like to use the function foo in the df to generate proportional distributions for all combinations of distributions given in the df. So, for each case, the following combinations: var1_var2, var1_var3, var2_var3, var1_var2_var3.
Calculate mean over distributions
If I want to calculate a mean over the distributions individually, I can do this:
means <- sapply(PDFS, pdqr::summ_mean)
df$mean_var1 <- means[df$distr_var1]
Or:
df$mean_var2 <- sapply(mget(df$distr_var2), pdqr::summ_mean)
Both approaches work fine. For the combinations var1_var2, var1_var3, var2_var3 and var1_var2_var3, however, I have not found a suitable approach. I tried this:
df$var1_var2_mean <- sapply(foo(mget(mapply(PDFS, sapply, df$distr_var1, df$distr_var2))), pdqr::summ_mean)
I tried to overcome not calling functions by using a list, but things seem to get too complicated / nested to work nicely...
Question
How can I select the appropriate distributions given in distr_var1, distr_var2 and distr_var3, combine them using foo and calculate the mean using pdqr::summ_mean?
I'm happy with all comments, also on the workflow in general
A foreach loop works for me:
library(foreach)
df$var1_var2_mean <- foreach(i = 1:nrow(df), .combine = c) %do% {
# fetch the two distribution functions named in row i
A <- get(df$distr_var1[i])
B <- get(df$distr_var2[i])
pdqr::summ_mean(foo(A, B))
}
I need to repeat this for each combination. At least I got it working...
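The row-wise lookup can also be written without foreach by indexing a named list and using mapply(). The sketch below is not the pdqr workflow from the question: as a stand-in it uses plain density closures and computes the mean of the normalised product on a fine grid. The names pdfs and combined_mean() and the grid limits are assumptions made for illustration only.

```r
# stand-ins for the pdqr distribution functions: plain density closures
pdfs <- list(
  var1_distr1 = function(x) dnorm(x, mean = 3, sd = 1),
  var1_distr2 = function(x) dnorm(x, mean = 6, sd = 1),
  var2_distr1 = function(x) dnorm(x, mean = 5, sd = 3),
  var2_distr2 = function(x) dnorm(x, mean = 3, sd = 1)
)

df <- data.frame(distr_var1 = c("var1_distr1", "var1_distr2"),
                 distr_var2 = c("var2_distr2", "var2_distr1"),
                 stringsAsFactors = FALSE)

# combined_mean() is a hypothetical helper: it looks the functions up by name,
# multiplies their densities on a grid and returns the mean of the product
combined_mean <- function(...) {
  funs <- pdfs[c(...)]
  x <- seq(-20, 30, by = 0.01)           # assumption: grid covers the mass
  y <- Reduce(`*`, lapply(funs, function(f) f(x)))
  sum(x * y) / sum(y)
}

# one call per row; further name columns can be added for three-way combinations
df$var1_var2_mean <- mapply(combined_mean, df$distr_var1, df$distr_var2,
                            USE.NAMES = FALSE)
```

Because combined_mean() accepts any number of names, the same helper covers the var1_var3, var2_var3 and var1_var2_var3 combinations by passing the corresponding columns.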

Use of tail() in out-of-sample prediction

Below you see an out of sample rolling window estimation I found here: (https://www.r-bloggers.com/2017/11/formal-ways-to-compare-forecasting-models-rolling-windows/)
Here is my question: I know the tail() function returns the last n rows of a dataset. But I don't understand its purpose when it's used in the random walk in line 13 or when calculating the errors in lines 17 and 18. Any help clarifying this would be much appreciated.
# = Number of windows and window size
w_size = 300
n_windows = nrow(X) - 300
# = Rolling Window Loop = #
forecasts = foreach(i=1:n_windows, .combine = rbind) %do%{
# = Select data for the window (in and out-of-sample) = #
X_in = X[i:(w_size + i - 1), ] # = change to X[1:(w_size + i - 1), ] for expanding window
X_out = X[w_size + i, ]
# = Regression Model = #
m1 = lm(infl0 ~ . - prodl0, data = X_in)
f1 = predict(m1, X_out)
# = Random Walk = #
f2 = tail(X_in$infl0, 1)
return(c(f1, f2))
}
# = Calculate and plot errors = #
e1 = tail(X[ ,"infl0"], nrow(forecasts)) - forecasts[ ,1]
e2 = tail(X[ ,"infl0"], nrow(forecasts)) - forecasts[ ,2]
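Here tail() is applied to a vector, because only the "infl0" column is selected. With n = 1, as in the random walk, tail() returns the last element of that vector, i.e. the most recent in-sample observation. When computing the errors, tail(X[ ,"infl0"], nrow(forecasts)) returns the last nrow(forecasts) values of the column, which are the observed values that line up with the forecasts.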
df <- data.frame(A = c(1,2), B = c(3,4))
df[,"A"] # will return c(1,2)
tail(df[,"A"], 1) # will return 2
tail(df$B, 1) # will return 4
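To see both uses of tail() together, here is a small sketch on a made-up series (the values in y and the window size are assumptions for illustration). The random-walk forecast for each window is the last in-sample value, and the errors compare the forecasts with the last n_windows observations.

```r
y <- c(2.0, 2.5, 3.0, 2.8, 3.1, 3.4, 3.3, 3.6)  # made-up series
w_size <- 5
n_windows <- length(y) - w_size                 # 3 out-of-sample points

# random walk: forecast the next value with the last value in the window
rw <- sapply(seq_len(n_windows), function(i) {
  y_in <- y[i:(w_size + i - 1)]
  tail(y_in, 1)
})

# the observations being forecast are the last n_windows values of y
actual <- tail(y, n_windows)
errors <- actual - rw
```

Here rw holds y[5], y[6] and y[7], while actual holds y[6], y[7] and y[8], so each forecast is compared with the observation that follows its window.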

R Loop: Perform a Function for Every 3 Rows

I have 2000 wheat plants, growing over the course of 40 days.
I'd like to perform the coeff function on each plant to find the coefficients of the quadratic equation the 3 time points make. (a, b, and c)
(1) The coef(lm(y~poly(x,2,raw=TRUE))) function works exactly the way I want it to.
(2) However, the way my data is presented, requires me to manually set x and y.
(3) Thus, I melted my data, and ordered it.
(4) I'd like to make a loop that will take the first three in column "Day" and set that as x. Then I'd like it to take the first three in column "Height" and set that as y.
Then I'd like to perform the coeff function.
Last I'd like it to present the coefficient outputs I need, preferably in a new data table.
Then repeat for every three rows, which represent each wheat ID, for all wheat plants.
1) This function works, giving me coefficients: a, b, c
x<-c(1,2,3)
y<-c(1,10,4)
coef(lm(y~poly(x,2,raw=TRUE)))
2) This is what my data originally looked like
A = matrix(c(5, 4, 2, 10, 10, 4, 5, 15, 6),nrow=3, ncol=3)
colnames(A)<-c("10", "25", "40")
rownames(A)<-c("Wheat 1", "Wheat 2", "Wheat 3")
A
3) This is my melted format
library(reshape2) # provides melt() for matrices
A.melted<-as.data.frame(melt(A, id.vars="ID"))
A.melted<-A.melted[with(A.melted,order(Var1)),]
colnames(A.melted) <- c("WheatID", "Day", "Height")
A.melted$Day<-as.numeric(as.character(A.melted$Day))
A.melted
#
4) This is what I am trying to do with my loop....
for every 3 rows,
x<-A.melted[,2]
y<-A.melted[,3]
coef(lm(y~poly(x,2,raw=TRUE)))
something to compile the coefficients: a, b, c
I am just not familiar with the syntax of loops, and I'd love any tips and suggestions. Perusing Google tells me that one should not do loops unless it is absolutely required since I may run into more problems- thus I am open to non loop techniques as well.
If you want to do it in a loop try this. The crucial part is to use seq together with a by = argument to let the index take the steps you need.
library(tibble)
df <- tibble(
WheatID = rep(NA_character_, nrow(A)),
Intercept = rep(NA_real_, nrow(A)),
poly1 = rep(NA_real_, nrow(A)),
poly2 = rep(NA_real_, nrow(A))
)
cnt <- 1
for (i in seq(1, nrow(A.melted), by = 3)) {
x <- A.melted$Day[i + 0:2]
y <- A.melted$Height[i + 0:2]
df$WheatID[cnt] <- as.character(A.melted$WheatID[i])
df[cnt, 2:4] <- coef(lm(y~poly(x,2,raw=TRUE)))
cnt <- cnt + 1
}
df
Note: I am not a data.table guy. Therefore, I present you with a tibble.
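We can do this with the help of data.table, see ?data.table. Note that A.melted has to be converted to a data.table first:
library(data.table)
setDT(A.melted) # convert in place so that := and by work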
A.models = A.melted[, model := list(.(lm(Height ~ poly(Day, 2),
data = list(.(.SD[WheatID == .BY[[1]]]))))),
by = WheatID]
A.models[, coefs := list(.(coefficients(model[[1]]))),
by = WheatID]
You can access each model like this:
A.models[WheatID == "Wheat 1", model[[1]]]
and even
A.models[WheatID == "Wheat 1", summary(model[[1]])]
The magic here happens because data.table takes in J expressions, not only functions.
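Another option is to split the data into chunks of three rows and use the data.table package to collect the coefficients:
library(data.table)
data.list <- split(A.melted, f = (1:nrow(A.melted) - 1) %/% 3)
coefs <- lapply(data.list, function(x) {
# fit Height as a quadratic in Day for one plant (three rows)
fit <- coef(lm(Height ~ poly(Day, 2, raw = TRUE), data = x))
data.table(
WheatID = as.character(x$WheatID[1]),
intercept = fit[1],
day = fit[2],
day.squared = fit[3]
)
})
coefs <- rbindlist(coefs)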
Or you could perform apply() directly on the original matrix:
x <- as.numeric(colnames(A))
apply(A, 1, function(y) coef(lm(y~poly(x,2,raw=TRUE))))
Wheat 1 Wheat 2 Wheat 3
(Intercept) -3.88888889 -0.555555556 6.666667e-01
poly(x, 2, raw = TRUE)1 1.11111111 0.477777778 1.333333e-01
poly(x, 2, raw = TRUE)2 -0.02222222 -0.002222222 -2.417315e-18
Or you could transpose the data and use the coef(...) call directly:
x <- as.numeric(colnames(A))
coef(lm(t(A) ~ poly(x, 2, raw = TRUE)))
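If the data are already in long format, grouping by plant ID avoids relying on every plant occupying exactly three consecutive rows. The following is a minimal base R sketch, assuming the long table is rebuilt from the example matrix with as.table(); the name long and the row labels of the result are chosen here for illustration.

```r
A <- matrix(c(5, 4, 2, 10, 10, 4, 5, 15, 6), nrow = 3, ncol = 3,
            dimnames = list(c("Wheat 1", "Wheat 2", "Wheat 3"),
                            c("10", "25", "40")))

# long format: one row per plant and day
long <- as.data.frame(as.table(A), stringsAsFactors = FALSE)
names(long) <- c("WheatID", "Day", "Height")
long$Day <- as.numeric(long$Day)

# fit one quadratic per plant and collect the coefficients column-wise
coefs <- sapply(split(long, long$WheatID), function(d) {
  coef(lm(Height ~ poly(Day, 2, raw = TRUE), data = d))
})
rownames(coefs) <- c("intercept", "day", "day_sq")

t(coefs)  # one row per plant
```

Because split() groups on WheatID rather than on row position, the result does not depend on the sort order of the long table.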

Using lapply and the lm function together in R

I have a df as follows:
t r
1 0 100.00000
2 1 135.86780
3 2 149.97868
4 3 133.77316
5 4 97.08129
6 5 62.15988
7 6 50.19177
and so on...
I want to apply a rolling regression using lm(r~t).
However, I want to estimate one model for each iteration, where the iterations occur over a set time window t+k. Essentially, the first model should be estimated with t=0,t=1,...t=5, if k = 5, and the second model estimated with t=1, t=2,...,t=6, and so on.
In other words, it iterates from a starting point with a set window t+k where k is some pre-specified window length and applies the lm function over that particular window length iteratively.
I have tried using lapply like this:
mdls = lapply(df, function(x) lm(r[x,]~t))
However, I got the following error:
Error in r[x, ] : incorrect number of dimensions
If I remove the [x,], each iteration gives me the same model, in other words using all the observations.
If I use rollapply:
coefs = rollapply(df, 3, FUN = function(x) coef(lm(r~t, data =
as.data.frame(x))), by.column = FALSE, align = "right")
res = rollapply(df, 3, FUN = function(z) residuals(lm(r~t, data =
as.data.frame(z))), by.column = FALSE, align = "right")
Where:
t = seq(0,15,1)
r = (100+50*sin(0.8*t))
df = as.data.frame(t,r)
I get 15 models, but they are all estimated over the entire dataset, providing the same intercepts and coefficients. This is strange as I managed to make rollapply work just before testing it in a new script. For some reason it does not work again, so I am perplexed as to whether R is playing tricks on me, or whether there is something wrong with my code.
How can I adjust these methods to make sure they iterate according to my wishes?
Here is a possible solution. The idea is to pass the row indices 1:nrow(df) to rollapply, so that each window receives the indices of the rows to select.
df = data.frame(t = 0:6, r = c(100.00000, 135.86780, 149.97868, 133.77316, 97.08129, 62.15988, 50.19177))
N = nrow(df)
require(zoo)
# Coefficients
coefs <- rollapply(data = 1:N, width = 3, FUN = function(x){
r = df$r[x]
t = df$t[x]
out <- coef(lm(r~t))
return(out)
})
# Residuals
res <- rollapply(data = 1:N, width = 3, FUN = function(x){
r = df$r[x]
t = df$t[x]
out <- residuals(lm(r~t))
return(out)
})
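The same index-window idea can be written in base R without zoo, which also keeps the fitted models around for later inspection. This is a sketch under the assumption that a window of k = 3 consecutive rows is wanted; the names k, starts and fits are illustrative.

```r
df <- data.frame(t = 0:6,
                 r = c(100.00000, 135.86780, 149.97868, 133.77316,
                       97.08129, 62.15988, 50.19177))

k <- 3                                   # window length (assumption)
starts <- seq_len(nrow(df) - k + 1)      # first row of each window

# one lm per window, each fitted on rows s, s + 1, ..., s + k - 1
fits <- lapply(starts, function(s) lm(r ~ t, data = df[s:(s + k - 1), ]))

# one row of coefficients per window
coefs <- t(sapply(fits, coef))
```

Each element of fits is a full lm object, so residuals(fits[[1]]) or summary(fits[[2]]) can be called on individual windows.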

How to build a function with arguments listed as a string vector in R?

How might I write a function that takes a string vector and returns a function with arguments having the names of the string vector? In addition, I'd like to use my string of arguments to make a list or data.frame named with the input columns inside that function. My desired application is to be able to pass this to a predict method for estimating some points. If this function already exists, let me know. Otherwise, I'm curious how I might write it. Below I include R pseudo-code illustrating what I am trying to do and hopefully showing where I am getting stuck conceptually.
make_fitting_function <- function(mod) {
x <- xterms(mod)
function(!! x) {
predict(mod, newdata = list(!! x))
}
}
Calling this function returns a function that can be called on separate vector arguments. For example:
f <- make_fitting_function(lm(mpg ~ wt, data = mtcars))
f(wt = c(1, 2, 3, 4, 5))
The result would be:
1 2 3 4 5
31.94 26.60 21.25 15.91 10.56
However, this would also work for many more x variables (e.g.):
f <- make_fitting_function(lm(mpg ~ wt + am + carb, data = mtcars))
f(wt = c(1, 2, 3, 4, 5), am = rep(1, 5), carb = seq(2, 10, by = 2))
Ideally, this function should be able to be used by integrate in the case of a single variable.
You actually don't need to name the parameters; instead just make the data frame to pass in to predict using all of them.
make_fitting_function <- function(model) {
function(...) {
predict(model, newdata=data.frame(...))
}
}
f <- make_fitting_function(lm(mpg ~ wt + am + carb, data = mtcars))
f(wt = c(1, 2, 3, 4, 5), am = rep(1, 5), carb = seq(2, 10, by = 2))
## 1 2 3 4 5
## 30.589985 24.637415 18.684846 12.732277 6.779708
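For the single-variable case, the returned function can be handed to integrate() through a small wrapper that maps the integration variable to the named argument. A minimal sketch, assuming the integration limits 2 and 4 purely for illustration:

```r
make_fitting_function <- function(model) {
  function(...) {
    predict(model, newdata = data.frame(...))
  }
}

f <- make_fitting_function(lm(mpg ~ wt, data = mtcars))

# integrate() passes a numeric vector; forward it as wt and drop the names
area <- integrate(function(w) unname(f(wt = w)), lower = 2, upper = 4)
area$value
```

The wrapper is needed because integrate() calls its function with an unnamed first argument, whereas f() relies on argument names to build the newdata data frame.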
