Right way to use lm in R

I do not have a very clear idea of how to use functions like lm() that ask for a formula and a data.frame.
On the web I have read about different approaches, but sometimes R gives warnings and other messages.
Suppose, for example, a linear model where the output vector y is explained by the matrix X.
I read that the best way is to use a data.frame (especially if we are going to use the predict function later).
In a situation where X is a matrix, is this the best way to use lm?
n=100
p=20
n_new=50
X=matrix(rnorm(n*p),n,p)
Y=rnorm(n)
data=list("x"=X,"y"=Y)
l=lm(y~x,data)
X_new=matrix(rnorm(n_new*p),n_new,p)
pred=predict(l,as.data.frame(X_new))

How about:
l <- lm(y~.,data=data.frame(X,y=Y))
pred <- predict(l,data.frame(X_new))
In this case R constructs the column names (X1 ... X20) automatically, but when you use the y~. syntax you don't need to know them.
Alternatively, if you are always going to fit linear regressions based on a matrix, you can use lm.fit() and compute the predictions yourself using matrix multiplication: you have to use cbind(1,.) to add an intercept column.
fit <- lm.fit(cbind(1,X),Y)
all(coef(l)==fit$coefficients) ## TRUE
pred <- cbind(1,X_new) %*% fit$coefficients
(You could also use cbind(1,X_new) %*% coef(l).) This is efficient, but it skips a lot of the error-checking steps, so use it with caution ...
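As a rough check of the two routes above, the sketch below fits the same simulated data with lm() and lm.fit() and compares the results with all.equal(), which allows for floating-point tolerance, rather than ==. The seed and the variable names pred_lm, pred_manual, coef_match and pred_match are my own choices for illustration.

```r
set.seed(1)
n <- 100; p <- 20; n_new <- 50
X <- matrix(rnorm(n * p), n, p)
Y <- rnorm(n)
X_new <- matrix(rnorm(n_new * p), n_new, p)

# Formula interface: data.frame() names the matrix columns X1 ... X20
l <- lm(y ~ ., data = data.frame(X, y = Y))

# Matrix interface: the intercept column has to be added by hand
fit <- lm.fit(cbind(1, X), Y)

pred_lm <- predict(l, data.frame(X_new))
pred_manual <- drop(cbind(1, X_new) %*% fit$coefficients)

# Compare values only; the two objects carry different names
coef_match <- isTRUE(all.equal(unname(coef(l)), unname(fit$coefficients)))
pred_match <- isTRUE(all.equal(unname(pred_lm), unname(pred_manual)))
```

If both flags are TRUE, the manual matrix product reproduces predict() on this data.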

In a situation like the one you describe, you have no reason not to turn your matrix into a data frame. Try:
myData <- as.data.frame(cbind(Y, X))
l <- lm(Y~., data=myData)
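One thing to watch if you go on to predict(): the new data must carry the same column names as the training data. With an unnamed matrix, as.data.frame(cbind(Y, X)) should label the predictors V2 ... V21, which would not line up with as.data.frame(X_new) (V1 ... V20). A minimal sketch that sidesteps this by naming the columns explicitly; the names x1 ... x20 are hypothetical, any consistent names would do.

```r
set.seed(1)
n <- 100; p <- 20; n_new <- 50
X <- matrix(rnorm(n * p), n, p)
Y <- rnorm(n)
X_new <- matrix(rnorm(n_new * p), n_new, p)

# Hypothetical column names, shared by the training and the new matrix
colnames(X) <- paste0("x", seq_len(p))
colnames(X_new) <- colnames(X)

myData <- data.frame(Y = Y, X)
l <- lm(Y ~ ., data = myData)

# predict() looks the predictors up by name in newdata
pred <- predict(l, newdata = as.data.frame(X_new))
```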

Related

R: Sliding one-ahead forecasts from equation estimated on a fixed period

The toy model below stands in for one with a bunch more variables, transforms, lags, etc. Assume I got that stuff right.
My data is ordered in time, but is not formatted as an R time series, because I need to exclude certain periods, etc. I'd rather not make it a time series for this reason, because I think it would be easy to muck up, but if I need to, or it greatly simplifies the estimating process, I'd like to just use an integer sequence, such as index. below, to represent time if that is allowed.
My problem is a simple one (I hope). I would like to use the first part of my data to estimate the coefficients of the model. Then I want to use those estimates, and not estimates from a sliding window, to do one-ahead forecasts for each of the remaining values of that data. The idea is that the formula is applied with a sliding window even though it is not estimated with one. Obviously I could retype the model with coefficients included and then get what I want in multiple ways, with base R sapply, with tidyverse dplyr::mutate or purrr::map_dbl, etc. But I am morally certain there is some standard way of pulling the formula out of the lm object and then wielding it as one desires, that I just haven't been able to find. Example:
set.seed(1)
x1 <- 1:20
y1 <- 2 + x1 + lag(x1) + rnorm(20)
index. <- x1
data. <- tibble(index., x1, y1)
mod_eq <- y1 ~ x1 + lag(x1)
lm_obj <- lm(mod_eq, data.[1:15,])
and I want something along the lines of:
my_forecast_values <- apply_eq_to_data(eq = get_estimated_equation(lm_obj), my_data = data.[16:20,])
and the lag shouldn't give me an error.
Also, this is not part of my question per se, but I could use a pointer to a nice tutorial on using R formulas and the standard estimation output objects produced by lm, glm, nls and the like. Not the statistics, just the programming.
For what it is worth, the common way to use the coefficients is to call predict(), coefficients(), or summary() on the model object. See the ?predict.lm documentation for details on how the fitted formula is applied to new data.
A simple example:
data.$lagx <- dplyr::lag(data.$x1, 1) #create lag variable
lm_obj1 <- lm(data=data.[2:15,], y1 ~ x1 + lagx) #create model object
pred1 <- predict(lm_obj1, newdata=data.[16:20,]) #predict rows 16 to 20; newdata needs the same column names as the fitting data
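To make the "estimate once, forecast the rest" idea concrete, here is a minimal base-R sketch. It departs from the toy data in one respect, which is my assumption: x1 is drawn at random, because with x1 <- 1:20 the lag is exactly x1 - 1 and the two regressors are collinear. The names dat, holdout, mm and by_hand are also made up for illustration. The second half shows one way to pull the pieces out of the lm object (terms() and coef()) and apply them by hand.

```r
set.seed(1)
n <- 20
x1 <- rnorm(n)                    # assumption: random regressor, not 1:20
lagx <- c(NA, x1[-n])             # lag computed on the full series, before subsetting
y1 <- 2 + x1 + lagx + rnorm(n)
dat <- data.frame(index = seq_len(n), x1, lagx, y1)

# Estimate the coefficients once, on the first part of the data
fit <- lm(y1 ~ x1 + lagx, data = dat[2:15, ])

# Apply the fixed coefficients to each remaining row
holdout <- dat[16:20, ]
holdout$forecast <- predict(fit, newdata = holdout)

# The same thing by hand: rebuild the design matrix from the stored terms
mm <- model.matrix(delete.response(terms(fit)), holdout)
by_hand <- drop(mm %*% coef(fit))
```

Because the lag is a plain column created before the split, row 16 still sees the value from row 15, so no lag error or NA appears in the forecasts.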

Bootstrap-t Method for Comparing Trimmed Means in R

I am confused by the different robust methods for comparing independent means. I found good explanations in statistical textbooks, for example yuen() in the case of equal sample sizes. My samples are rather unequal, so I would like to try a bootstrap-t method (from Wilcox's book Introduction to Robust Estimation and Hypothesis Testing, p. 163). It says yuenbt() would be a possible solution.
But all textbooks say I can use vectors here:
yuenbt(x,y,tr=0.2,alpha=0.05,nboot=599,side=F)
If I check the local description it says:
yuenbt(formula, data, tr = 0.2, nboot = 599)
What's wrong with my trial:
x <- c(1,2,3)
y <- c(5,6,12,30,2,2,3,65)
yuenbt(x,y)
Why can't I use the yuenbt function with my two vectors? Thank you very much
Looking at the help (for those wondering, yuenbt is from the package WRS2...) for yuenbt, it takes a formula and a dataframe as arguments. My impression is that it expects data in long format.
With your example data, we can achieve that like so:
library(WRS2)
x <- c(1,2,3)
y <- c(5,6,12,30,2,2,3,65)
dat <- data.frame(value=c(x,y),group=rep(c("x","y"), c(length(x),length(y))))
We can then use the function:
yuenbt(value~group, data=dat)
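If you prefer not to build the long data frame by hand, base R's stack() does the same reshaping. This is only a sketch of the reshaping step; the resulting column names values and ind are what stack() produces, and the yuenbt() call is left as a comment because it assumes WRS2 is installed.

```r
x <- c(1, 2, 3)
y <- c(5, 6, 12, 30, 2, 2, 3, 65)

# stack() turns a named list of vectors into long format:
# one column with the values, one factor column with the group labels
dat <- stack(list(x = x, y = y))

# With WRS2 loaded, the call would then be:
# yuenbt(values ~ ind, data = dat)
```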

Calculate many AUCs in R

I am fairly new to R. I am using the ROCR package in R to calculate AUC, which I can do for one predictor just fine. What I am looking to do is perform many AUC calculations for 100 different variables.
What I have done so far is the following:
varlist <- names(mydata)[2:101]
formlist <- lapply(varlist, function(x) paste0("prediction(",x,"mydata$V1))
However then the formulas are in text format, and the as.formula is giving me an error. Any help appreciated! Thanks in advance!
The function inside your lapply looks like it is just outputting a statement like prediction(varmydata$V1). I am guessing you actually want to run that command. If so, you probably want something like
lapply(varlist,function(x) prediction(mydata[[x]], mydata$V1))
but it is hard to tell without a reproducible situation. Also, it looks like your code has a missing quote.
If I understand you correctly, you want to use the first column of mydata as predictions, and all other variables as labels, one after the other.
Is this the correct way to treat mydata? This way is rather uncommon. It is more common to have the same true labels for several different predictions (e.g. iterated cross validation, comparison of different classifiers).
However, to answer your original question:
predictions and labels need to have the same shape for ROCR::prediction, e.g.
either as matrix
prediction (matrix (rep (mydata$V1, ncol (mydata) - 1), nrow (mydata)), mydata [, -1])
or as lists:
prediction (mydata [rep (1, ncol (mydata) - 1)], mydata [-1])
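For the reading in which V1 holds the 0/1 labels and every other column is a candidate predictor, a base-R sketch is below. Note that it does not use ROCR: it computes the AUC from the rank (Mann-Whitney) formula instead, so it runs without extra packages. The helper auc_rank and the simulated mydata are hypothetical stand-ins. With ROCR the per-variable value would come from performance(prediction(score, label), "auc")@y.values[[1]].

```r
# AUC via the rank-sum identity; ties get average ranks
auc_rank <- function(score, label) {
  pos <- label == 1
  n_pos <- sum(pos)
  n_neg <- sum(!pos)
  (sum(rank(score)[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Simulated stand-in for mydata: labels in V1, five predictors X1 ... X5
set.seed(1)
mydata <- data.frame(V1 = rbinom(200, 1, 0.5), matrix(rnorm(200 * 5), 200, 5))

varlist <- names(mydata)[-1]
aucs <- vapply(varlist, function(v) auc_rank(mydata[[v]], mydata$V1), numeric(1))
```

aucs is a named numeric vector with one AUC per variable, which is usually easier to work with than a list of prediction objects.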

Reusing the model from R's forecast package

I have been told that, while using R's forecast package, one can reuse a model. That is, after the code x <- c(1,2,3,4); mod <- ets(x); f <- forecast(mod,h=1) one could have append(x, 5) and predict the next value without recalculating the model. How does one do that? (As I understand, using simple exponential smoothing one would only need to know alpha, right?)
Is it like forecast(x, model=mod)? If that is the case, I have to say that I am using Java and calling the forecast code programmatically (for many time series), so I don't think I could keep the model object in the R environment all the time. Would there be an easy way to keep the model object in Java and load it into the R environment when needed?
You have two questions here:
A) Can the forecast package "grow" its datasets? I can't speak in great detail to this package and you will have to look at its documentation. However, R models in general obey a structure of
fit <- someModel(formula, data)
estfit <- predict(fit, newdata=someDataFrame)
i.e. you supply updated data given a fit object.
B) Can I serialize a model back and forth to Java? Yes, you can. Rserve is one option; you can also try basic serialize() to a raw vector, or even just save(fit, file="someFile.RData").
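A minimal sketch of the round trip, using a plain lm() on the built-in cars data as a stand-in for your model (an assumption on my part; I have not tried this with an ets object, though the same functions should apply to any R object). saveRDS()/readRDS() go through a file, serialize()/unserialize() through a raw vector that a Java client could store as bytes.

```r
# Stand-in model; replace with your own fitted object
fit <- lm(dist ~ speed, data = cars)

# Route 1: write the object to a file and read it back later
path <- tempfile(fileext = ".rds")
saveRDS(fit, path)
restored <- readRDS(path)

# Route 2: in-memory raw vector, e.g. to hand over to the calling program
raw_bytes <- serialize(fit, connection = NULL)
restored2 <- unserialize(raw_bytes)

# The restored objects predict exactly like the original
new_obs <- data.frame(speed = 10)
p_orig <- predict(fit, newdata = new_obs)
p_file <- predict(restored, newdata = new_obs)
p_raw <- predict(restored2, newdata = new_obs)
```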
Regarding your first question:
library(forecast)
x <- 1:4
mod <- ets(x)
f1 <- forecast(mod, h=1)
x <- append(x, 5)
mod <- ets(x, model=mod) # Reuses old mod without re-estimating parameters.
f2 <- forecast(mod, h=1)
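On the remark about alpha: for plain simple exponential smoothing the one-step forecast is the current level, and the level is updated as level_new = alpha * y_new + (1 - alpha) * level_old. So you would need the last level as well as alpha. A hand-rolled sketch, not the forecast package's code; ses_update and the numbers are hypothetical:

```r
# One SES step: blend the new observation with the previous level
ses_update <- function(level, y_new, alpha) {
  alpha * y_new + (1 - alpha) * level
}

alpha <- 0.3        # hypothetical smoothing parameter
level <- 2.5        # hypothetical level after the first four observations

# A new observation arrives; the updated level is also the next forecast
level <- ses_update(level, y_new = 5, alpha = alpha)
next_forecast <- level
```

Storing just these two numbers per series on the Java side would be enough for this particular model class; models with trend or seasonality carry more state.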

What is the best way to run a loop of regressions in R?

Assume that I have sources of data X and Y that are indexable, say matrices. And I want to run a set of independent regressions and store the result. My initial approach would be
results <- matrix(nrow=nrow(X), ncol=2)
for(i in 1:nrow(X)) {
  results[i,] <- coefficients(lm(Y[i,] ~ X[i,]))
}
But, loops are bad, so I could do it with lapply as
out <- lapply(1:nrow(X), function(i) { coefficients(lm(Y[i,] ~ X[i,])) } )
Is there a better way to do this?
You are certainly overoptimizing here. The overhead of a loop is negligible compared to the procedure of model fitting and therefore the simple answer is - use whatever way you find to be the most understandable. I'd go for the for-loop, but lapply is fine too.
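If you do want the apply-style version to hand back a matrix directly, vapply() is one option, since it checks that every fit returns exactly two coefficients. A sketch on simulated data; the dimensions and the column labels are made up for illustration.

```r
set.seed(1)
X <- matrix(rnorm(5 * 30), nrow = 5)   # 5 independent regressions, 30 points each
Y <- matrix(rnorm(5 * 30), nrow = 5)

# One row of X against the matching row of Y; vapply returns a 2 x 5 matrix
coefs <- vapply(
  seq_len(nrow(X)),
  function(i) coefficients(lm(Y[i, ] ~ X[i, ])),
  numeric(2)
)

results <- t(coefs)                    # one row per regression
colnames(results) <- c("intercept", "slope")
```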
I do this type of thing with plyr, but I agree that it's not a processing efficiency issue as much as what you are comfortable reading and writing.
If you just want to perform straightforward multiple linear regression, then I would recommend not using lm(). There is lsfit(), but I'm not sure it would offer that much of a speed-up (I have never performed a formal comparison). Instead I would recommend computing (X'X)^{-1}X'y using qr() and qr.coef(). This will allow you to perform multivariate multiple linear regression; that is, treating the response variable as a matrix instead of a vector and applying the same regression to each vector of observations. Note that qr.coef() expects one response vector per column, so if your observations are stored in rows, pass t(Y).
## Z: design matrix (one row per observation, including a column of 1s for the intercept)
## Y: matrix of observations (each column is a vector of observations)
## Estimation via multivariate multiple linear regression
beta <- qr.coef(qr(Z), Y)
## Fitted values
Yhat <- Z %*% beta
## Residuals
u <- Y - Yhat
In your example, is there a different design matrix per vector of observations? If so, you may be able to modify Z in order to still accommodate this.
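For the case where the design matrix is shared, a quick sanity check that the QR route agrees with fitting each response separately with lm(). This is a sketch on simulated data with one response per column; the sizes and names are my own.

```r
set.seed(1)
n <- 30                                # observations per regression
m <- 4                                 # number of response vectors
x <- rnorm(n)
Z <- cbind(1, x)                       # shared design matrix with intercept
Y <- matrix(rnorm(n * m), n, m)        # one response vector per column

# All m regressions in a single call: beta is 2 x m
beta <- qr.coef(qr(Z), Y)

# Reference: one lm() fit per column
beta_loop <- sapply(seq_len(m), function(j) coefficients(lm(Y[, j] ~ x)))

same <- isTRUE(all.equal(unname(beta), unname(beta_loop)))
```

If the design matrix differs for every response, as in the X[i,] example from the question, this shortcut does not apply directly and the loop or lapply version is the simpler choice.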
