Is there an equivalent to update for the data part of an lm call object?
For example, say I have the following model:
dd = data.frame(y=rnorm(100),x1=rnorm(100))
Model_all <- lm(formula = y ~ x1, data = dd)
Is there a way of operating on the lm object to have the equivalent effect of:
Model_1t50 <- lm(formula = y ~ x1, data = dd[1:50,])
I am trying to construct some pseudo out-of-sample forecast tests, and it would be very convenient to have a single lm object and simply roll the data.
I'm fairly certain that update actually does what you want!
example(lm)
dat1 <- data.frame(group,weight)
lm1 <- lm(weight ~ group, data=dat1)
dat2 <- data.frame(group,weight=2*weight)
lm2 <- update(lm1,data=dat2)
coef(lm1)
##(Intercept) groupTrt
## 5.032 -0.371
coef(lm2)
## (Intercept) groupTrt
## 10.064 -0.742
If you're hoping for an efficiency gain from this, you'll be disappointed -- R just substitutes the new arguments and re-evaluates the call (see the code of update.default). But it does make the code a lot cleaner ...
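For the rolling out-of-sample use case in the question, a minimal sketch might look like the following. It assumes an expanding window that starts at 50 rows and one-step-ahead forecasts; the names window_ends, forecasts and errors are made up for illustration, and set.seed is only there to make the run reproducible.

```r
set.seed(1)
dd <- data.frame(y = rnorm(100), x1 = rnorm(100))
model_all <- lm(y ~ x1, data = dd)

# update() only rewrites the stored call; evaluate = FALSE shows the new call
new_call <- update(model_all, data = dd[1:50, ], evaluate = FALSE)

# Expanding window (an assumption): refit on rows 1..n, then predict row n + 1
window_ends <- 50:99
forecasts <- sapply(window_ends, function(n) {
  fit_n <- update(model_all, data = dd[seq_len(n), ])
  predict(fit_n, newdata = dd[n + 1, , drop = FALSE])
})

# One-step-ahead forecast errors
errors <- dd$y[window_ends + 1] - forecasts
```

Each pass through the loop is a full refit, so this is tidier than writing out the lm calls by hand but no faster.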
biglm objects can be updated to include more data, but not less. So you could do this in the opposite order, starting with less data and adding more. See http://cran.r-project.org/web/packages/biglm/biglm.pdf
However, I suspect you're interested in parameters estimated for subpopulations (i.e., rows 1:50 correspond to level "a" of a factor variable factrvar). In this case, you should use an interaction in your formula (~factrvar*x1) rather than subsetting to data[1:50,]. An interaction of this type gives a different effect estimate for each level of factrvar. This is more efficient than estimating each parameter separately, and it constrains any additional parameters (e.g., x2 in ~factrvar*x1 + x2) to be the same across values of factrvar. If you fitted the same model separately to different subsets, x2 would receive a separate parameter estimate each time.
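To make the interaction idea concrete, here is a small sketch on simulated data. The data frame d, the two levels "a" and "b", and the true slopes are all invented for illustration. With no extra covariates, the per-level slopes from the interaction model match those from separate subset fits.

```r
set.seed(42)
# Simulated data (an assumption): two groups with different slopes on x1
d <- data.frame(factrvar = factor(rep(c("a", "b"), each = 50)),
                x1 = rnorm(100))
d$y <- ifelse(d$factrvar == "a", 1 + 2 * d$x1, -1 + 0.5 * d$x1) +
  rnorm(100, sd = 0.1)

# One model with an interaction instead of one model per subset
fit_int <- lm(y ~ factrvar * x1, data = d)

# Slope for the reference level "a" is the x1 coefficient;
# slope for "b" adds the interaction term
slope_a <- coef(fit_int)[["x1"]]
slope_b <- slope_a + coef(fit_int)[["factrvarb:x1"]]

# Subset fit for comparison
fit_a <- lm(y ~ x1, data = d[d$factrvar == "a", ])
```

Adding a shared covariate such as x2 to the interaction model is where the two approaches start to differ, as described above.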
I've been asked to provide standardized coefficients for a glmer model, but am not sure how to obtain them. Unfortunately, the beta function does not work on glmer models:
Error in UseMethod("beta") :
no applicable method for 'beta' applied to an object of class "c('glmerMod', 'merMod')"
Are there other functions I could use, or would I have to write one myself?
Another problem is that the model contains several continuous predictors (which operate on similar scales) and 2 categorical predictors (one with 4 levels, one with six levels). The purpose of using the standardized coefficients would be to compare the impact of the categorical predictors to those of the continuous ones, and I'm not sure that standardized coefficients are the appropriate way to do so. Are standardized coefficients an acceptable approach?
The model is as follows:
model=glmer(cbind(nr_corr,maximum-nr_corr) ~ (condition|SUBJECT) + categorical_1 + categorical_2 + continuous_1 + continuous_2 + continuous_3 + continuous_4 + categorical_1:categorical_2 + categorical_1:continuous_3, data, control=glmerControl(optimizer="bobyqa", optCtrl=list(maxfun=100000)), family = binomial)
reghelper::beta simply standardizes the numeric variables in our dataset. So, assuming your categorical variables are factors rather than numeric dummy variables or other contrast encodings, we can fairly simply standardize the numeric variables ourselves:
# value = TRUE makes grep return the variable names rather than their positions
vars <- grep('^continuous(.*)?', all.vars(formula(model)), value = TRUE)
f <- function(var, data)
  scale(data[[var]])
data[, vars] <- lapply(vars, f, data = data)
update(model, data = data)
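To see what this standardization does to a coefficient, here is a sketch using plain lm and the built-in mtcars data rather than the glmer model from the question (that swap is my assumption, made only so the example runs without lme4). Scaling a numeric predictor multiplies its slope by that predictor's standard deviation, while the factor term is left alone.

```r
# Fit on the raw data: one numeric predictor and one categorical predictor
fit_raw <- lm(mpg ~ wt + factor(cyl), data = mtcars)

# Standardize only the numeric predictor, then refit via update()
mt_std <- mtcars
mt_std$wt <- as.numeric(scale(mt_std$wt))
fit_std <- update(fit_raw, data = mt_std)

# Slope per standard deviation of wt vs. raw slope times sd(wt)
slope_std <- coef(fit_std)[["wt"]]
slope_raw_times_sd <- coef(fit_raw)[["wt"]] * sd(mtcars$wt)
```

The same idea carries over to the fixed effects of a mixed model, although there the refit goes through an optimizer, so the match is only up to its tolerance.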
Now, for the more general case, we can just as easily create our own beta.merMod function. However, we need to take into account whether it makes sense to standardize y. For example, in a Poisson model only non-negative integer values make sense. A further question is whether to scale the random slope effects, and whether it even makes sense to ask that question in the first place. In the function below I assume that categorical variables are encoded as character or factor and not as numeric or integer.
beta.merMod <- function(model,
                        x = TRUE,
                        # family(model) is a family object; compare its name
                        y = !family(model)$family %in% c('binomial', 'poisson'),
                        ran_eff = FALSE,
                        skip = NULL,
                        ...){
  # Extract all names from the model formula
  vars <- all.vars(form <- formula(model))
  lhs <- all.vars(form[[2]])
  # Get the names of the random-effect grouping factors from the model
  ranef <- names(ranef(model))
  # Remove ranef and lhs from vars
  rhs <- vars[!vars %in% c(lhs, ranef)]
  # Extract the data used for the model
  env <- environment(form)
  call <- getCall(model)
  data <- get(dname <- as.character(call$data), envir = env)
  # Standardize the dataset
  vars <- character()
  if(isTRUE(x))
    vars <- c(vars, rhs)
  if(isTRUE(y))
    vars <- c(vars, lhs)
  if(isTRUE(ran_eff))
    vars <- c(vars, ranef)
  data[, vars] <- lapply(vars, function(var){
    # Only numeric columns are scaled; factors and characters pass through
    if(is.numeric(data[[var]]))
      data[[var]] <- scale(data[[var]])
    data[[var]]
  })
  # Update the model and change the data into the new data.
  update(model, data = data)
}
The function works for both linear and generalized linear mixed effect models (not tested for nonlinear models), and is used just like the other beta functions from reghelper:
library(reghelper)
library(lme4)
# Linear mixed effect model
fm1 <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)
# y = FALSE keeps Reaction on its original scale; only Days is standardized
fm2 <- beta(fm1, y = FALSE)
fixef(fm1) - fixef(fm2)
(Intercept) Days
-47.10279 -19.68157
# Generalized mixed effect model
data(cbpp)
# create numeric variable correlated with period
cbpp$nv <-
rnorm(nrow(cbpp), mean = as.numeric(levels(cbpp$period))[as.numeric(cbpp$period)])
gm1 <- glmer(cbind(incidence, size - incidence) ~ nv + (1 | herd),
family = binomial, data = cbpp)
gm2 <- beta(gm1)
fixef(gm1) - fixef(gm2)
(Intercept) nv
0.5946322 0.1401114
Note, however, that unlike reghelper's beta, this function returns the updated model, not a summary of the model.
Another problem is that the model contains several continuous predictors (which operate on similar scales) and 2 categorical predictors (one with 4 levels, one with six levels). The purpose of using the standardized coefficients would be to compare the impact of the categorical predictors to those of the continuous ones, and I'm not sure that standardized coefficients are the appropriate way to do so. Are standardized coefficients an acceptable approach?
Now that is a great question and one better suited for stats.stackexchange, and not one I'm certain of the answer to.
Again, thank you so much, Oliver! For anybody who is interested in the answer regarding the last part of my question,
Another problem is that the model contains several continuous
predictors (which operate on similar scales) and 2 categorical
predictors (one with 4 levels, one with six levels). The purpose of
using the standardized coefficients would be to compare the impact of
the categorical predictors to those of the continuous ones, and I'm
not sure that standardized coefficients are the appropriate way to do
so. Are standardized coefficients an acceptable approach?
you can find the answer here. The tl;dr is that using standardized regression coefficients is not the best approach for mixed models anyways, let alone one such as mine...
I am currently learning R and am playing around with a dataset that has four nominal variables (Hour.Of.Arrival, Mode, Unit, Weekday), and a continuous dependent variable (Overall). This is all imported from a .csv in a data frame named basic. What I am trying to do is run an ANOVA just using this data frame, without creating separate vectors (e.g. Mode<-basic$Mode). "Fit" holds the results of the ANOVA. Here is the code that I wrote:
Fit<-aov(basic["Overall"],basic["Unit"],data=basic)
However, I keep getting the error
"Error in terms.default(formula, "Error", data = data) : no terms
component nor attribute
I hope this question isn't too basic!!
Thanks :)
I think you want something more like Fit<-aov(Overall ~ Unit,data=basic). The formula Overall ~ Unit tells R to treat Overall as an outcome predicted by Unit, and data=basic already tells it which data frame to look in for these variables.
Here's an example to show you how it works:
> y <- rnorm(100)
> x <- factor(rep(c('A', 'B', 'C', 'D'), each = 25))
> dat <- data.frame(x, y)
> aov(y ~ x, data = dat)
Call:
aov(formula = y ~ x, data = dat)
Terms:
                        x Residuals
Sum of Squares    2.72218 114.54631
Deg. of Freedom         3        96
Residual standard error: 1.092333
Estimated effects may be unbalanced
Note that you don't need to use the data argument; you could also use aov(dat$y ~ dat$x). Either way, the first argument to the function should be a formula.
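Applied to the question's setup, a sketch could look like this. The data frame basic here is simulated (the real one comes from the asker's .csv), and only the column names Overall, Unit and Mode are taken from the question; the level labels are invented.

```r
set.seed(1)
# Simulated stand-in for the asker's data frame (an assumption)
basic <- data.frame(
  Overall = rnorm(60),
  Unit    = factor(rep(c("U1", "U2", "U3"), each = 20)),
  Mode    = factor(rep(c("Phone", "Web"), times = 30))
)

# The formula names columns of `basic` directly; no separate vectors needed
Fit <- aov(Overall ~ Unit + Mode, data = basic)

# ANOVA table with one row per term plus the residuals
anova_table <- summary(Fit)[[1]]
print(anova_table)
```

Further nominal variables such as Weekday or Hour.Of.Arrival would go on the right-hand side of the formula in the same way.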
lm sets model = TRUE by default, meaning the entire dataset used for learning is copied and returned with the fitted object. This is used by predict but creates memory overhead (example below).
I am wondering, is the copied dataset used for any reason other than predict?
Not essential to answer, but I'd also like to know of models that store data for reasons other than predict.
Example
object.size(lm(mpg ~ ., mtcars))
#> 45768 bytes
object.size(lm(mpg ~ ., mtcars, model = FALSE))
#> 28152 bytes
Bigger dataset = bigger overhead.
Motivation
To share my motivation, the twidlr package forces users to provide data when using predict. If this makes copying the dataset when learning unnecessary, it seems reasonable to save memory by defaulting to model = FALSE. I've opened a relevant issue here.
A secondary motivation - you can easily fit many models like lm with pipelearner, but copying data each time creates massive overhead. So finding ways to cut down memory needs would be very handy!
I think model frame is returned as a protection against non-standard evaluation.
Let's look at a small example.
dat <- data.frame(x = runif(10), y = rnorm(10))
FIT <- lm(y ~ x, data = dat)
fit <- FIT; fit$model <- NULL
What is the difference between the following two calls?
model.frame(FIT)
model.frame(fit)
Checking methods(model.frame) and stats:::model.frame.lm shows that in the first case the model frame is efficiently extracted from FIT$model, while in the second case it is reconstructed from fit$call and model.frame.default. This difference in turn produces a difference between
# depends on `model.frame`
model.matrix(FIT)
model.matrix(fit)
as model matrix is built from a model frame. If we dig further, we will see that these are different, too,
# depends on `model.matrix`
predict(FIT)
predict(fit)
# depends on `predict.lm`
plot(FIT)
plot(fit)
Note that this is where the problem can arise. If we deliberately remove dat, the model frame cannot be reconstructed, and all of these will fail:
rm(dat)
model.frame(fit)
model.matrix(fit)
predict(fit)
plot(fit)
while using FIT will work.
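The failure can be checked without stopping the session by wrapping the calls in try(). This is just a sketch of the experiment above; dat_demo, ok_with_frame and ok_without_frame are made-up names, chosen so that nothing in your workspace called dat gets removed.

```r
dat_demo <- data.frame(x = runif(10), y = rnorm(10))
FIT <- lm(y ~ x, data = dat_demo)   # keeps the model frame in FIT$model
fit <- FIT
fit$model <- NULL                   # drop the stored model frame

rm(dat_demo)                        # the original data is now gone

# TRUE if the call succeeds, FALSE if it throws an error
ok_with_frame    <- !inherits(try(model.frame(FIT), silent = TRUE), "try-error")
ok_without_frame <- !inherits(try(model.frame(fit), silent = TRUE), "try-error")
```

Here ok_with_frame comes out TRUE and ok_without_frame FALSE, because the second call has to look up dat_demo again and cannot find it.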
And that is not even the worst case. The following example, under non-standard evaluation, is really bad!
fitting <- function (myformula, mydata, keep.mf = FALSE) {
b <- lm(formula = myformula, data = mydata, model = keep.mf)
par(mfrow = c(2,2))
plot(b)
predict(b)
}
Now let's create a data frame again (we have removed it earlier)
dat <- data.frame(x = runif(10), y = rnorm(10))
Can you see that
fitting(y ~ x, dat, keep.mf = TRUE)
works but
fitting(y ~ x, dat, keep.mf = FALSE)
fails?
Here is a question I answered and investigated a year ago: "R - model.frame() and non-standard evaluation". It was asked about the survival package. That example is really extreme: even if we provide newdata, we still get an error. Retaining the model frame is the only way to proceed!
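Finally, on your observation about memory costs: $model is in fact not the main reason an lm object can be large. $qr is, as it has the same dimensions as the model matrix. Consider a model with lots of factors: the model frame stores each factor as a single column, while the model matrix expands it into one dummy column per level, so the model frame is much smaller than the model matrix. (Terms like bs, ns or poly are stored in the model frame already expanded into their basis columns, so for those the two are of similar size.) So omitting the model frame often does little to reduce the size of an lm object. This is actually one motivation behind the development of biglm.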
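A quick way to check which components dominate is to measure them. The sketch below uses simulated data with a single 50-level factor; the sizes of n and the factor, and the names d and sizes, are my own choices for illustration.

```r
set.seed(1)
n <- 1000
# Simulated data (an assumption): one factor with 50 levels and a response
d <- data.frame(
  g = factor(sample(sprintf("lev%02d", 1:50), n, replace = TRUE)),
  y = rnorm(n)
)
fit <- lm(y ~ g, data = d)

# Size in bytes of the stored model frame and of the QR decomposition
sizes <- sapply(fit[c("model", "qr")], object.size)
print(sizes)

# The QR component holds a dense n-by-p matrix, like the model matrix
print(dim(fit$qr$qr))
```

In this setup the $qr component is far larger than $model, so model = FALSE alone would save comparatively little.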
Since I have inevitably mentioned biglm, I would emphasize that this method only helps reduce the size of the final model object, not the RAM used during model fitting.
I am trying to use R to fit a linear model and make predictions. My model includes some constant side parameters that are not in the data frame. Here's a simplified version of what I'm doing:
dat <- data.frame(x=1:5,y=3*(1:5))
b <- 1
mdl <- lm(y~I(b*x),data=dat)
Unfortunately the model object now suffers from a dangerous scoping issue: lm() does not save b as part of mdl, so when predict() is called, it has to reach back into the environment where b was defined. Thus, if subsequent code changes the value of b, the predict value will change too:
y1 <- predict(mdl,newdata=data.frame(x=3)) # y1 == 9
b <- 5
y2 <- predict(mdl,newdata=data.frame(x=3)) # y2 == 45
How can I force predict() to use the original b value instead of the changed one? Alternatively, is there some way to control where predict() looks for the variable, so I can ensure it gets the desired value? In practice I cannot include b as part of the newdata data frame, because in my application, b is a vector of parameters that does not have the same size as the data frame of new observations.
Please note that I have greatly simplified this relative to my actual use case, so I need a robust general solution and not just ad-hoc hacking.
We can use eval(substitute(...)) to substitute the value of b into the quoted expression before it is evaluated:
mdl <- eval(substitute(lm(y~I(b*x),data=dat), list(b=b)))
mdl
# Call:
# lm(formula = y ~ I(1 * x), data = dat)
# ...
We could also use bquote
mdl <- eval(bquote(lm(y~I(.(b)*x), data=dat)))
mdl
#Call:
#lm(formula = y ~ I(1 * x), data = dat)
#Coefficients:
#(Intercept) I(1 * x)
# 9.533e-15 3.000e+00
According to the description in ?bquote:
‘bquote’ quotes its
argument except that terms wrapped in ‘.()’ are evaluated in the
specified ‘where’ environment.
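If b is a longer parameter vector and pasting its values into the call gets unwieldy, another option is to fit the model inside a function, so that the formula's environment keeps its own copy of b. This is only a sketch; fit_with_b is a made-up helper name, and a scalar b is used to mirror the question's simplified example.

```r
fit_with_b <- function(b, data) {
  # The formula is created here, so its environment is this function call;
  # predict() later looks up b in that environment, not in the workspace
  lm(y ~ I(b * x), data = data)
}

dat <- data.frame(x = 1:5, y = 3 * (1:5))
b <- 1
mdl <- fit_with_b(b, dat)

y1 <- predict(mdl, newdata = data.frame(x = 3))
b <- 5   # changing the global b no longer affects the fitted model
y2 <- predict(mdl, newdata = data.frame(x = 3))
```

Both y1 and y2 come out as 9 here. The trade-off compared with bquote is that the printed call still shows b rather than its value.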
I'm modelling a lot of data for different companies, and for each company I need to identify quickly those model parameters that are most significant. What I would like to see is xtable() output for a fitted model that sorts all coefficients in increasing order of p-value (ie, most significant parameters first).
x <- data.frame(a=rnorm(100), b=runif(100), c=rnorm(100), e=rnorm(100))
fit <- glm(a ~ ., data=x)
xtable(fit)
I'm guessing that I may be able to accomplish something like this by messing with the structure of the fit object. But I'm not familiar with the structure enough to be able to confidently change anything.
Suggestions?
Not necessarily the most elegant solution, but this should do the job:
data(birthwt, package="MASS")
glm.res <- glm(low ~ ., data=birthwt[,-10])
idx <- order(coef(summary(glm.res))[,4]) # sort out the p-values
out <- coef(summary(glm.res))[idx,] # reorder coef, SE, etc. by increasing p
library(xtable)
xtable(out)
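Since this has to be repeated for many companies, the sorting step could be wrapped in a small helper. The sketch below is one way to do it; sort_coefs_by_p is a made-up name, and it finds the p-value column by its "Pr(" prefix instead of by position. The data are the simulated ones from the question, with a seed added.

```r
sort_coefs_by_p <- function(fit) {
  tab <- coef(summary(fit))
  # The p-value column is named "Pr(>|t|)" or "Pr(>|z|)" depending on family
  p_col <- grep("^Pr\\(", colnames(tab))
  tab[order(tab[, p_col]), , drop = FALSE]
}

set.seed(1)
x <- data.frame(a = rnorm(100), b = runif(100), c = rnorm(100), e = rnorm(100))
fit <- glm(a ~ ., data = x)

out <- sort_coefs_by_p(fit)   # most significant coefficients first
print(out)
```

The resulting matrix can be passed to xtable(out) exactly as in the code above.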