How to work with formula objects in R

I am trying to learn how to write my own functions that take formula objects. I am mostly confused about how to parse them.
Let's say I have the following
gigl <- function(formula, data, family = gaussian())
Using the R dataset BOD
> BOD
  Time demand
1    1    8.3
2    2   10.3
3    3   19.0
4    4   16.0
5    5   15.6
6    7   19.8
It is easy to fit a linear model with lm
> lm(Time~demand, data=BOD)
Call:
lm(formula = Time ~ demand)
Coefficients:
(Intercept)       demand
    -1.8905       0.3746
How can I make my own function by parsing a formula?
For example if I had
>gigl(Time~demand, data=BOD)
How can I parse the components? I don't really care what the function gigl does. I just want to know how to work with the formula.
Edit
Since people asked for a concrete example, let's try the following:
Say that I want to use the inputs from a formula to build a cor() matrix. So from the above I would see the result of cor(Time,demand), and if more variables were added I would see the complete cor() matrix of all inputs.

Here's a function that takes a formula and transforms it into a call to the cor() function, then evaluates that call in an environment consisting of the data ...
f <- function(form,data) {
form[[1]] <- quote(cor)
eval(form,data)
}
f(demand~Time,BOD)
## [1] 0.8030693
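To see why assigning to form[[1]] works, it may help to look at a formula as the call object it is. This is only a minimal base R sketch; the names form, cl and result are illustrative and not part of the answer above.

```r
form <- demand ~ Time

# A two-sided formula is a call to `~` with three elements:
# the operator, the left-hand side and the right-hand side.
class(form)    # "formula"
length(form)   # 3
form[[1]]      # `~`
form[[2]]      # demand
form[[3]]      # Time

# Replacing the operator turns `demand ~ Time` into `cor(demand, Time)`,
# which is then evaluated with the columns of BOD as variables.
cl <- form
cl[[1]] <- quote(cor)
result <- eval(cl, BOD)
result
```

Note that this trick only fits formulas whose two sides are single variables, since each side becomes one argument of cor().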

The rlang package can make it easier to work with formulas in the tidyeval paradigm. For example you can do
library(rlang)
mycor <- function(form, data) {
v1 <- f_lhs(form)
v2 <- f_rhs(form)
d <- enquo(data)
qq <- expr(with(!!d, cor(!!v1, !!v2)))
eval_tidy(qq)
}
mycor(disp~drat, mtcars)
# [1] -0.7102139
with(mtcars, cor(disp, drat))
# [1] -0.7102139
The f_lhs/f_rhs functions extract the left-hand side and right-hand side of the formula respectively. enquo() captures the data argument, and expr() together with the !! operator re-assembles those pieces into a new function call. Finally we evaluate that new call with eval_tidy.

Not sure what you're trying to do, but you could take a look at the terms of a formula:
fm <- formula(Time ~ demand);
tms <- terms(fm);
tms;
#Time ~ demand
#attr(,"variables")
#list(Time, demand)
#attr(,"factors")
#       demand
#Time        0
#demand      1
#attr(,"term.labels")
#[1] "demand"
#attr(,"order")
#[1] 1
#attr(,"intercept")
#[1] 1
#attr(,"response")
#[1] 1
#attr(,".Environment")
#<environment: R_GlobalEnv>
From tms you could extract relevant entries and attributes. For example,
attr(tms, "variables");
#list(Time, demand)
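As a rough sketch of pulling individual pieces out of a terms object, here is a formula with an interaction. The variable names y, a and b are made up for illustration, and no data is needed because terms() only parses the formula.

```r
# A formula with a response, two main effects and an interaction.
fm <- y ~ a + b + a:b
tms <- terms(fm)

term_labels   <- attr(tms, "term.labels")  # labels of the right-hand-side terms
has_intercept <- attr(tms, "intercept")    # 1 if an intercept is included
resp_index    <- attr(tms, "response")     # position of the response in "variables"

# "variables" is an unevaluated call to list(); drop the leading `list`
# symbol and deparse the remaining elements to get the names as strings.
var_names <- vapply(as.list(attr(tms, "variables"))[-1], deparse, character(1))
response_name <- var_names[resp_index]

term_labels
response_name
```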

This assumes that the formula contains only plain variable names, two in the question's example (expressions are not allowed). The variables can appear on the left-hand side, the right-hand side, or both. Two functions are useful here: all.vars, which returns the variable names, and get_all_vars, which returns the corresponding columns:
gig1 <- function(formula, data) cor(data[all.vars(formula)])
gig1(demand ~ Time, BOD)
giving:
          demand      Time
demand 1.0000000 0.8030693
Time   0.8030693 1.0000000
or
gig2 <- function(formula, data) cor(get_all_vars(formula, data))
gig2(demand ~ Time, BOD)
giving:
          demand      Time
demand 1.0000000 0.8030693
Time   0.8030693 1.0000000
You might want to look at the source of lm and the Formula package for more ideas.
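If the formula does contain expressions, the difference between the raw variables and the evaluated terms becomes visible. The following is a small sketch using BOD; the formula itself is invented for illustration.

```r
fm <- log(demand) ~ Time + I(Time^2)

# Names of the underlying variables, ignoring the functions wrapped around them
all.vars(fm)

# The raw, untransformed columns those names refer to
vars <- get_all_vars(fm, BOD)
names(vars)

# The evaluated terms, one column per term in the formula
mf <- model.frame(fm, BOD)
names(mf)

# The evaluated response, here log(demand)
resp <- model.response(mf)
```

model.frame() is what lm and friends call internally, so it is probably the closest starting point for a function that should accept the same formulas lm does.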


Calculate trend in array using linear regression

I have an array of dimensions c(54,71,360) which contains climatological data. The first two dimensions describe the grid of the region, while the third one serves as the time dimension. So in this case, there are 360 time steps (months).
Here is code to produce a sample array:
set.seed(5)
my_array <- array(sample(rnorm(100), 600, replace=T), dim= c(54,71,360))
Now I would like to calculate the trend of each grid cell. The trend is equal to the slope of the linear regression equation, so a linear regression of every grid cell against time needs to be performed. And this is exactly what I am struggling with.
To clearly show what I wish to do, here is an example with one grid cell, which is taken from the array as a vector of the length 360:
grid_cell <- my_array[1,1,]
The linear regression of this vector with the time needs to be calculated. For that purpose, we create a simple time vector:
time_vec <- 1:360
Since I am only interested in the slope coefficient, it can be done this way:
trend <- lm(grid_cell ~ time_vec)$coefficients[2]
This leads to a value of 1.347029e-05 in this case.
I would like to do this for every grid cell of the array, so that the output is a matrix of the dimensions c(54,71), meaning one trend value for each grid cell.
I tried the following, which did not work:
trend_mat <- apply(my_array, 1:2, lm(my_array ~ time_vec)$coefficients[2])
I receive the error message: Error in model.frame.default: variable lengths differ.
This is kind of surprising, since the third dimension of the array and time_vec both have length 360.
Anybody with an idea how to achieve this?
Of course I am also open to other solutions which may work totally differently, as long as they lead to the same result.
The problems with the code in the question are:
the third argument of apply should be a function, but the question's code provides an expression instead.
it applies lm many times. We show how to do it applying lm only once, and in the second alternative we don't use lm at all. These give speedups of roughly one and two orders of magnitude respectively, as shown in the Performance section below.
It is easier to illustrate with smaller data, as shown in the Note at the end. To use it on your example, replace the dims line with the commented-out line in the Note.
1) First we reshape the array into a matrix, perform lm and then reshape it back. This invokes lm once rather than invoking it prod(dims[1:2]) times.
m <- t(matrix(a,,dim(a)[3]))
array(coef(lm(m ~ timevec))[2, ], dim(a)[1:2])
##            [,1]      [,2]      [,3]
## [1,]  0.2636792 0.5682025 -0.255538
## [2,] -0.4453307 0.2338086  0.254682
# check
coef(lm(a[1,1,] ~ timevec))[[2]]
## [1] 0.2636792
coef(lm(a[2,1,] ~ timevec))[[2]]
## [1] -0.4453307
coef(lm(a[1,2,] ~ timevec))[[2]]
## [1] 0.5682025
coef(lm(a[2,2,] ~ timevec))[[2]]
## [1] 0.2338086
coef(lm(a[1,3,] ~ timevec))[[2]]
## [1] -0.255538
coef(lm(a[2,3,] ~ timevec))[[2]]
## [1] 0.254682
2) Alternatively, we can remove lm entirely by using the formula for the slope coefficient like this:
m <- t(matrix(a,,dim(a)[3]))
array(cov(m, timevec) / var(timevec), dims[1:2])
##            [,1]      [,2]      [,3]
## [1,]  0.2636792 0.5682025 -0.255538
## [2,] -0.4453307 0.2338086  0.254682
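The closed form used in 2) is the usual simple-regression identity slope = cov(y, x) / var(x). A quick sanity check on a single simulated series, with made-up names x and y, might look like this:

```r
set.seed(1)
x <- 1:20                      # stand-in for the time vector
y <- 0.5 * x + rnorm(20)       # one simulated "grid cell" series

slope_lm  <- unname(coef(lm(y ~ x))[2])   # slope as reported by lm
slope_cov <- cov(y, x) / var(x)           # slope from the closed form

all.equal(slope_lm, slope_cov)
```

The same identity holds column by column, which is why cov(m, timevec) can process all grid cells in one call.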
Performance
Comparing medians, we see that the single lm runs about 47x faster than apply, and eliminating lm runs about 230x faster than apply. Because the apply is brutally slow on my laptop I only used 3 replications, but if you have a faster machine or more patience you can increase it. The main conclusions are unlikely to change much though.
library(microbenchmark)
set.seed(5)
dims <- c(54,71,360)
a <- array(rnorm(prod(dims)), dims)
timevec <- seq_len(dim(a)[3])
microbenchmark(times = 3L,
  apply = apply(a, 1:2, function(x) coef(lm(x ~ timevec))[2]),
  lm = { m <- t(matrix(a,,dim(a)[3]))
         array(coef(lm(m ~ timevec))[2, ], dim(a)[1:2])
  },
  cov = { m <- t(matrix(a,,dim(a)[3]))
          array(cov(m, timevec) / var(timevec), dims[1:2])
  })
giving:
Unit: milliseconds
  expr        min         lq        mean     median         uq        max neval cld
 apply 13446.7953 13523.6016 13605.25037 13600.4079 13684.4779 13768.5479     3   b
    lm   264.5883   275.7611   476.82077   286.9338   582.9370   878.9402     3  a
   cov    56.9120    57.8830    58.71573    58.8540    59.6176    60.3812     3  a
Note
Test data.
set.seed(5)
# dims <- c(54,71,360)
dims <- 2:4
a <- array(rnorm(prod(dims)), dims)
timevec <- seq_len(dim(a)[3])
There is an anonymous function missing in the question's regression code. Here I will use the lambda shorthand \(x), introduced in R 4.1.0.
I also use the recommended extractor coef.
set.seed(5)
my_array <- array(sample(rnorm(100), 600, replace=T), dim= c(54,71,360))
time_vec <- 1:360
trend_mat <- apply(my_array, 1:2, \(x) coef(lm(x ~ time_vec))[2])

R / Rolling Regression with extended Data Frame

Hello, I'm currently working on a regression analysis with the following code:
r2.out <- numeric(ncol(Ret1))  # preallocate the result vector
for (i in 1:ncol(Ret1)) {
  r2.out[i] <- summary(lm(Ret1[,1] ~ Ret1[,i]))$r.squared
}
r2.out
This code runs a simple OLS regression of the first column of the data frame on each column in turn and reports the R^2 of these regressions. At the moment the regression uses all data points of a column. What I need now is for the code to use a rolling window of data points instead, so that it calculates the R^2 for a rolling window of 30 days over the entire time frame. The output should be a matrix with the R^2 per rolling window for each (1,i) pair.
The following code does the rolling regression part, but does not run the regression for each (1,i) pair.
dolm <- function(x) summary(lm(Ret1[,1]~Ret1[,i]))$r.squared
rollapplyr(Ret1, 30, dolm, by.column = FALSE)
I really appreciate any help you can provide.
Using the built-in anscombe data frame we regress the y1 column against x1 and then x2, etc. We use a width of 3 here for purposes of illustration.
xnames should be set to the names of the x variables. In the anscombe data set the column names that begin with x are the x variables. As another example, if all the columns are x variables except the first then xnames <- names(DF)[-1] could be used.
We define an R squared function, rsq, which takes the indexes to use, ix, and the x variable name, xname. We then sapply over the xnames and, for each one, rollapply over the indices 1:n.
library(zoo)
xnames <- grep("x", names(anscombe), value = TRUE)
n <- nrow(anscombe)
w <- 3
rsq <- function(ix, xname) summary(lm(y1 ~., anscombe[c("y1", xname)], subset = ix))$r.sq
sapply(xnames, function(xname) rollapply(1:n, w, rsq, xname = xname ))
giving the following result of dimensions n - w + 1 by length(xnames):
                x1           x2           x3        x4
 [1,] 2.285384e-01 2.285384e-01 2.285384e-01 0.0000000
 [2,] 3.591782e-05 3.591782e-05 3.591782e-05 0.0000000
 [3,] 9.841920e-01 9.841920e-01 9.841920e-01 0.0000000
 [4,] 5.857410e-01 5.857410e-01 5.857410e-01 0.0000000
 [5,] 9.351609e-01 9.351609e-01 9.351609e-01 0.0000000
 [6,] 8.760332e-01 8.760332e-01 8.760332e-01 0.7724447
 [7,] 9.494869e-01 9.494869e-01 9.494869e-01 0.7015512
 [8,] 9.107256e-01 9.107256e-01 9.107256e-01 0.3192194
 [9,] 8.385510e-01 8.385510e-01 8.385510e-01 0.0000000
Variations
1) It would also be possible to reverse the order of the rollapply and sapply replacing the last line of code with:
rollapply(1:n, 3, function(ix) sapply(xnames, rsq, ix = ix))
2) Another variation is to replace the definition of rsq and the sapply/rollapply line with the following single statement. It may be a bit harder to read so you may prefer the first solution but it does entail one simplification -- namely, xname need no longer be an explicit argument of the inner anonymous function (which takes the place of rsq above):
sapply(xnames, function(xname) rollapply(1:n, 3, function(ix)
summary(lm(y1 ~., anscombe[c("y1", xname)], subset = ix))$r.sq))
Update: Have fixed line which is now n <- nrow(anscombe)
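For comparison, here is a sketch of the same rolling idea in base R only, without zoo. It is not the answer's method: it slides the window with sapply over start positions, and the second version swaps lm for the squared correlation, which equals R squared in a simple regression with one predictor and an intercept. The names starts, r2_lm and r2_cor are illustrative.

```r
w <- 3                          # window width, as in the answer above
n <- nrow(anscombe)
starts <- seq_len(n - w + 1)    # first row of each window

# R squared of y1 ~ x1 from lm, one value per window
r2_lm <- sapply(starts, function(s) {
  ix <- s:(s + w - 1)
  summary(lm(y1 ~ x1, data = anscombe[ix, ]))$r.squared
})

# The same quantity via the squared correlation, avoiding lm entirely
r2_cor <- sapply(starts, function(s) {
  ix <- s:(s + w - 1)
  cor(anscombe$y1[ix], anscombe$x1[ix])^2
})

all.equal(r2_lm, r2_cor)
```

The correlation shortcut breaks down when the predictor is constant within a window (as happens for x4 in anscombe), so it is only a drop-in replacement when that cannot occur.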

trouble with cbind in manova call in r

I'm trying to do a multivariate ANOVA with the manova function in R. My problem is that I'm trying to find a way to pass the list of dependent variables without typing them all in manually, as there are many and they have horrible names. My data are in a data frame where "unit" is the independent variable (a factor), and the rest of the columns are various numeric response variables, e.g.
  unit C_pct  Cln C_N_mol Cnmolln C_P_mol N_P_mol
1    C 48.22 3.88   53.92    3.99 3104.75   68.42
2    C 49.91 3.91   56.32    4.03 3454.53   62.04
3    C 50.75 3.93   56.96    4.04 3922.01   69.16
4   SH 50.72 3.93   46.58    3.84 2590.16   57.12
5   SH 51.06 3.93   43.27    3.77 2326.04   53.97
6   SH 48.62 3.88   40.97    3.71 2357.16   59.67
If I write the manova call as
fit <- manova(cbind(C_pct, Cln) ~ unit, data = plots)
it works fine, but I'd like to be able to pass a long list of columns without naming them one by one, something like
fit <- manova(cbind(colnames(plots[5:32])) ~ unit, data = plots)
or
fit <- manova(cbind(plots[,5:32]) ~ unit, data = plots)
I get the error
Error in model.frame.default(formula = as.matrix(cbind(colnames(plots[5:32]))) ~ :
variable lengths differ (found for 'unit')
I'm sure it's because I'm using cbind wrong, but can't figure it out. Any help is appreciated! Sorry if the formatting is rough, this is my first question posted.
EDIT: Both ways (all 3, actually) work. thanks all!
manova, like most R modelling functions, builds its formula out of the names of the variables found in the dataset. However, when you pass it the colnames, you're technically passing the strings that represent the names of those variables. Hence the function doesn't know what to do with them, and chokes.
You can actually get around this. The LHS of the formula only has to resolve to a matrix; the use of cbind(C_pct, Cln, ...) is a way of obtaining a matrix by evaluating the names of its arguments C_pct, Cln, etc in the environment of your data frame. But if you provide a matrix to start with, then no evaluation is necessary.
fit <- manova(as.matrix(plots[, 5:32]) ~ unit, data=plots)
Some notes. First, the as.matrix is necessary because extracting columns from a data frame like this returns a data frame. manova won't like this, so we coerce the data frame to a matrix. Second, this works assuming you don't have an actual variable called plots inside your data frame plots. This is because, if R doesn't find a name inside your data frame, it then looks in the environment of the formula, in this case the global environment.
You can also create the matrix before fitting the model, with
plots$response <- as.matrix(plots[, 5:32])
fit <- manova(response ~ unit, data=plots)
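Since the plots data is not available here, the matrix-response idea can be tried on the built-in iris data instead. This is only a sketch under that substitution; resp, fit_mat and fit_cbind are illustrative names.

```r
# Response matrix built from column positions rather than typed names
resp <- as.matrix(iris[, 1:4])
fit_mat <- manova(resp ~ Species, data = iris)

# The same model with the responses spelled out via cbind()
fit_cbind <- manova(
  cbind(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) ~ Species,
  data = iris
)

# Both fits should have the same coefficient matrix
all.equal(unname(fit_mat$coefficients), unname(fit_cbind$coefficients))
```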
You can build your formula as a string and convert it to a formula:
responses <- paste( colnames( plots )[2:6], collapse=",")
myformula <- as.formula( paste0( "cbind(", responses , ")~ unit" ) )
manova( myformula, data = plots )
Call:
   manova(myformula, data = plots)
Terms:
                     unit Residuals
resp 1                0.4       6.8
resp 2                  0         0
resp 3              220.6      21.0
resp 4                0.1       0.0
resp 5          1715135.8  377938.1
Deg. of Freedom         1         4
Residual standard error: 1.305176 0.02708013 2.29364 0.04966555 307.3834
Estimated effects may be unbalanced
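The string-building approach can be checked step by step on the built-in iris data. This is a sketch with iris standing in for plots; resp_names, lhs and fm are illustrative names.

```r
# Pick the response columns by position, then assemble "cbind(a, b, ...)"
resp_names <- names(iris)[1:4]
lhs <- paste0("cbind(", paste(resp_names, collapse = ", "), ")")

# Convert the assembled string into a formula object
fm <- as.formula(paste(lhs, "~ Species"))
fm

# The formula refers to exactly the columns we selected plus the factor
all.vars(fm)

fit <- manova(fm, data = iris)
```

If the column names are not syntactically valid R names, they would have to be wrapped in backticks before pasting, for example with sprintf("`%s`", resp_names).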

Local prediction modelling approach in R

I am trying to develop a local model (PLSR) that predicts a query sample from a model built on its 10 most similar samples, using the code below (not the full model yet, just part of it). I got stuck when trying to predict the query sample (second to last line). The model does predict something ("prd"), but not the query sample!
Here is my code:
require("pls")
set.seed(10000) # generate some sample data
mat <- replicate(100, rnorm(100))
y <- as.matrix(mat[,1], drop=F)
x <- mat[,2:100]
eD <- dist(x, method="euclidean") # create a distance matrix
eDm <- as.matrix(eD)
Looping over all 100 samples and extracting their 10 most similar samples for subsequent model building and prediction of query sample:
for (i in 1:nrow(eDm)) {
  kni <- head(order(eDm[,i]),11)[-1] # add 10 most similar samples to kni
  pls1 <- plsr(y[kni,] ~ x[kni,], ncomp=5, validation="CV") # run plsr on sel. samples
  # predict query sample ==> I suspect there is something wrong with this expression: newdata=x[[i]]
  prd <- predict(pls1, ncomp=5, newdata=x[[i]])
}
I can't figure out how to address the query sample properly. Many thanks in advance for any help!
Best regards,
Chega
You are going to run into all sorts of pain building models with formulae like that. Also the x[[i]] isn't doing what you think it is - you need to supply a data frame usually to these modelling functions. In this case a matrix seems fine too.
I get all your code working OK if I use:
prd <- predict(pls1, ncomp=5, newdata=x[i, ,drop = FALSE])
giving
> predict(pls1, ncomp=5, newdata=x[i,,drop = FALSE])
, , 5 comps
y[kni, ]
[1,] 0.6409897
What you were seeing with your code are the fitted values for the training data.
> fitted(pls1)[, , 5, drop = FALSE]
, , 5 comps
y[kni, ]
1 0.1443274
2 0.2706769
3 1.1407780
4 -0.2345429
5 -1.0468221
6 2.1353091
7 0.8267103
8 3.3242296
9 -0.5016016
10 0.6781804
This is the convention in R when you either don't supply newdata or the object you supply makes no sense and doesn't contain the covariates required to generate predictions.
I would have fitted the model as follows:
pls1 <- plsr(y ~ x, ncomp=5, validation="CV", subset = kni)
where I use the subset argument for its intended purpose: to select the rows of the input data to fit the model with. You get nicer output from the models (the labels use y instead of y[kni, ], etc.), and this general convention will serve you well in other modelling tools, where R will expect newdata to be a data frame with names exactly the same as those mentioned in the model formula. In your case, with your code, that would mean creating a data frame with names like x[kni, ], which is not easy to do, for good reason!
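The subset/newdata convention is not specific to pls. As a sketch, here is the same local-model pattern with plain lm on the built-in mtcars data, which is a substitution made only so the example runs without extra packages; the choice of columns and the names X, D, kni and pred are illustrative.

```r
# Distances between samples, based on two standardised predictors
X <- scale(mtcars[, c("wt", "hp")])
D <- as.matrix(dist(X))

i <- 1                                 # index of the query sample
kni <- order(D[, i])[2:11]             # its 10 nearest neighbours, skipping itself

# Fit on the neighbours only, using subset rather than indexing in the formula
fit <- lm(mpg ~ wt + hp, data = mtcars, subset = kni)

# Predict the query sample: newdata is a one-row data frame whose
# column names match the variables in the formula
pred <- predict(fit, newdata = mtcars[i, , drop = FALSE])

length(pred)           # a single prediction for the query sample
length(predict(fit))   # without newdata: fitted values for the 10 training rows
```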

R extract regression coefficients from multiple regression via lapply command

I have a large dataset with several variables, one of which is a state variable, coded 1-50 for each state. I'd like to run a regression of 28 variables on the remaining 27 variables of the dataset (there are 55 variables total), and specific for each state.
In other words, run a regression of variable1 on covariate1, covariate2, ..., covariate27 for observations where state==1. I'd then like to repeat this for variable1 for states 2-50, and then repeat the whole process for variable2, variable3,..., variable28.
I think I've written the correct R code to do this, but the next thing I'd like to do is extract the coefficients, ideally into a coefficient matrix. Could someone please help me with this? Here's the code I've written so far:
for (num in 1:50) {
  #PUF is the data set I'm using
  #Subset the data by states
  PUFnum <- subset(PUF, state==num)
  #Attach data set with state specific data
  attach(PUFnum)
  #Run our prediction regression
  #the variables class1 through e19700 are the 27 covariates I want to use
  regression <- lapply(PUFnum, function(z) lm(z ~ class1+class2+class3+class4+class5+class6+class7+
    xtot+e00200+e00300+e00600+e00900+e01000+p04470+e04800+
    e09600+e07180+e07220+e07260+e06500+e10300+
    e59720+e11900+e18425+e18450+e18500+e19700))
  Beta <- lapply(regression, function(d) d<- coef(regression$d))
  detach(PUFnum)
}
This is another example of the classic Split-Apply-Combine problem, which can be addressed using the plyr package by @hadley. In your problem, you want to
1) Split the data frame by state
2) Apply regressions to each subset
3) Combine the coefficients into a data frame.
I will illustrate it with the Cars93 dataset available in the MASS package. We are interested in the relationship between horsepower and engine size, by country of origin.
# LOAD LIBRARIES
require(MASS); require(plyr)
# SPLIT-APPLY-COMBINE
regressions <- dlply(Cars93, .(Origin), lm, formula = Horsepower ~ EngineSize)
coefs <- ldply(regressions, coef)
   Origin (Intercept) EngineSize
1     USA    33.13666   37.29919
2 non-USA    15.68747   55.39211
EDIT. For your example, substitute PUF for Cars93, state for Origin and fm for the formula
I've cleaned up your code slightly:
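# variable1 is a placeholder for one of the 28 response columns.
# A response written as z would not be found: inside lm(fm, data=z),
# z is the per-state data frame, not a column.
fm <- variable1 ~ class1+class2+class3+class4+class5+class6+class7+
xtot+e00200+e00300+e00600+e00900+e01000+p04470+e04800+
e09600+e07180+e07220+e07260+e06500+e10300+
e59720+e11900+e18425+e18450+e18500+e19700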
PUFsplit <- split(PUF, PUF$state)
mod <- lapply(PUFsplit, function(z) lm(fm, data=z))
Beta <- sapply(mod, coef)
If you wanted, you could even put this all in one line:
Beta <- sapply(lapply(split(PUF, PUF$state), function(z) lm(fm, data=z)), coef)
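To cover several response variables as well as several groups, one option is to loop over the response names and build each formula with reformulate(). The sketch below uses mtcars as a stand-in for PUF, cyl as a stand-in for state, and made-up choices of responses and covariates; none of these names come from the question's data.

```r
responses  <- c("mpg", "qsec")           # stand-ins for variable1, variable2, ...
covariates <- c("wt", "hp")              # stand-ins for the 27 covariates
by_group   <- split(mtcars, mtcars$cyl)  # stand-in for split(PUF, PUF$state)

# For each response, fit one model per group and collect the coefficients
# into a matrix with one column per group.
coef_list <- lapply(responses, function(resp) {
  fm <- reformulate(covariates, response = resp)
  sapply(by_group, function(d) coef(lm(fm, data = d)))
})
names(coef_list) <- responses

# Coefficient matrix for the first response: rows are terms, columns are groups
coef_list$mpg
```

Each element of coef_list is a coefficient matrix like Beta above, so the full result is one matrix per response variable.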
