The rms package contains a wealth of useful statistical functions. However, I cannot find a proper way to extract certain fit statistics from the fitted object. Consider an example:
library(pacman)
p_load(rms, stringr, readr)
#fit
> (fit = rms::ols(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species, data = iris))
Linear Regression Model
rms::ols(formula = Sepal.Length ~ Sepal.Width + Petal.Length +
Petal.Width + Species, data = iris)
Model Likelihood Discrimination
Ratio Test Indexes
Obs 150 LR chi2 302.96 R2 0.867
sigma 0.3068 d.f. 5 R2 adj 0.863
d.f. 144 Pr(> chi2) 0.0000 g 0.882
Residuals
Min 1Q Median 3Q Max
-0.794236 -0.218743 0.008987 0.202546 0.731034
Coef S.E. t Pr(>|t|)
Intercept 2.1713 0.2798 7.76 <0.0001
Sepal.Width 0.4959 0.0861 5.76 <0.0001
Petal.Length 0.8292 0.0685 12.10 <0.0001
Petal.Width -0.3152 0.1512 -2.08 0.0389
Species=versicolor -0.7236 0.2402 -3.01 0.0031
Species=virginica -1.0235 0.3337 -3.07 0.0026
So, the print function for the fit prints a lot of useful stuff including standard errors and adjusted R2. Unfortunately, if we inspect the model fit object, the values don't seem to be present anywhere.
> str(fit)
List of 19
$ coefficients : Named num [1:6] 2.171 0.496 0.829 -0.315 -0.724 ...
..- attr(*, "names")= chr [1:6] "Intercept" "Sepal.Width" "Petal.Length" "Petal.Width" ...
$ residuals : Named num [1:150] 0.0952 0.1432 -0.0731 -0.2894 -0.0544 ...
..- attr(*, "names")= chr [1:150] "1" "2" "3" "4" ...
$ effects : Named num [1:150] -71.5659 -1.1884 9.1884 -1.3724 -0.0587 ...
..- attr(*, "names")= chr [1:150] "Intercept" "Sepal.Width" "Petal.Length" "Petal.Width" ...
$ rank : int 6
$ fitted.values : Named num [1:150] 5 4.76 4.77 4.89 5.05 ...
..- attr(*, "names")= chr [1:150] "1" "2" "3" "4" ...
$ assign :List of 4
..$ Sepal.Width : int 2
..$ Petal.Length: int 3
..$ Petal.Width : int 4
..$ Species : int [1:2] 5 6
$ qr :List of 5
..$ qr : num [1:150, 1:6] -12.2474 0.0816 0.0816 0.0816 0.0816 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : chr [1:150] "1" "2" "3" "4" ...
.. .. ..$ : chr [1:6] "Intercept" "Sepal.Width" "Petal.Length" "Petal.Width" ...
..$ qraux: num [1:6] 1.08 1.02 1.11 1.02 1.02 ...
..$ pivot: int [1:6] 1 2 3 4 5 6
..$ tol : num 1e-07
..$ rank : int 6
..- attr(*, "class")= chr "qr"
$ df.residual : int 144
$ var : num [1:6, 1:6] 0.07828 -0.02258 -0.00198 0.01589 -0.02837 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:6] "Intercept" "Sepal.Width" "Petal.Length" "Petal.Width" ...
.. ..$ : chr [1:6] "Intercept" "Sepal.Width" "Petal.Length" "Petal.Width" ...
$ stats : Named num [1:6] 150 302.964 5 0.867 0.882 ...
..- attr(*, "names")= chr [1:6] "n" "Model L.R." "d.f." "R2" ...
$ linear.predictors: Named num [1:150] 5 4.76 4.77 4.89 5.05 ...
..- attr(*, "names")= chr [1:150] "1" "2" "3" "4" ...
$ call : language rms::ols(formula = Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species, data = iris)
$ terms :Classes 'terms', 'formula' language Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species
.. ..- attr(*, "variables")= language list(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species)
.. ..- attr(*, "factors")= int [1:5, 1:4] 0 1 0 0 0 0 0 1 0 0 ...
.. .. ..- attr(*, "dimnames")=List of 2
.. .. .. ..$ : chr [1:5] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" ...
.. .. .. ..$ : chr [1:4] "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
.. ..- attr(*, "term.labels")= chr [1:4] "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
.. ..- attr(*, "order")= int [1:4] 1 1 1 1
.. ..- attr(*, "intercept")= num 1
.. ..- attr(*, "response")= int 1
.. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
.. ..- attr(*, "predvars")= language list(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species)
.. ..- attr(*, "dataClasses")= Named chr [1:5] "numeric" "numeric" "numeric" "numeric" ...
.. .. ..- attr(*, "names")= chr [1:5] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" ...
.. ..- attr(*, "formula")=Class 'formula' language Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species
.. .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
$ Design :List of 12
..$ name : chr [1:4] "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
..$ label : chr [1:4] "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
..$ units : Named chr [1:4] "" "" "" ""
.. ..- attr(*, "names")= chr [1:4] "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
..$ colnames : chr [1:5] "Sepal.Width" "Petal.Length" "Petal.Width" "Species=versicolor" ...
..$ mmcolnames : chr [1:5] "Sepal.Width" "Petal.Length" "Petal.Width" "Speciesversicolor" ...
..$ assume : chr [1:4] "asis" "asis" "asis" "category"
..$ assume.code : int [1:4] 1 1 1 5
..$ parms :List of 1
.. ..$ Species: chr [1:3] "setosa" "versicolor" "virginica"
..$ limits : list()
..$ values : list()
..$ nonlinear :List of 4
.. ..$ Sepal.Width : logi FALSE
.. ..$ Petal.Length: logi FALSE
.. ..$ Petal.Width : logi FALSE
.. ..$ Species : logi [1:2] FALSE FALSE
..$ interactions: NULL
$ non.slopes : num 1
$ na.action : NULL
$ scale.pred : chr "Sepal.Length"
$ fail : logi FALSE
$ sformula :Class 'formula' language Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species
.. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
- attr(*, "class")= chr [1:3] "ols" "rms" "lm"
There is a 7-year-old question on R-help in which the package creator explains how to get these:
On Wed, 11 Aug 2010, david dav wrote:
Hi,
I would like to extract the coefficients of a logistic regression
(estimates and standard error as well) in lrm as in glm with
summary(fit.glm)$coef
Thanks
David
coef(fit)
sqrt(diag(vcov(fit)))
But these will not be very helpful except in the trivial case where
everything is linear, nothing interacts, and factors have two levels.
Frank
According to the author, however, this solution is not optimal. This leaves one wondering how the displayed values are calculated. Tracing them down means hunting through the undocumented package code (the package code is on GitHub). We begin with print.ols():
> rms:::print.ols
function (x, digits = 4, long = FALSE, coefs = TRUE, title = "Linear Regression Model",
...)
{
latex <- prType() == "latex"
k <- 0
z <- list()
if (length(zz <- x$na.action)) {
k <- k + 1
z[[k]] <- list(type = paste("naprint", class(zz)[1],
sep = "."), list(zz))
}
stats <- x$stats
...
Reading further we do find that e.g. R2 adj. is calculated in the print function:
rsqa <- 1 - (1 - r2) * (n - 1) / rdf
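As a quick sanity check, the formula can be evaluated by hand. This is only a sketch: the three inputs are typed in from the fit's stats and df.residual components rather than read from a live object.

```r
# Inputs copied from the fitted model shown in this question
r2  <- 0.8673123  # fit$stats["R2"]
n   <- 150        # fit$stats["n"]
rdf <- 144        # fit$df.residual

# Same expression as in print.ols()
rsqa <- 1 - (1 - r2) * (n - 1) / rdf
round(rsqa, 6)
# 0.862705
```

Rounded to three digits this gives the 0.863 that the print method reports as "R2 adj".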
We also find some standard error calculations, though no p values.
se <- sqrt(diag(x$var))
z[[k]] <- list(type='coefmatrix',
list(coef = x$coefficients,
se = se,
errordf = rdf))
All the results are passed down further to prModFit(). We can look it up and find the p value calculation etc. Unfortunately, the print command returns NULL so these values are not available anywhere for programmatic reuse:
> x = print((fit = rms::ols(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species, data = iris)))
#printed output...
> x
NULL
How does one get all the statistics?
Here is a hack solution where we capture the output of the print command:
#parser
get_model_stats = function(x, precision=60) {
# remember old number formatting function
# (which would round and transform p-values to formats like "<0.01")
old_format_np = rms::formatNP
# substitute it with a function which will print out as many digits as we want
assignInNamespace("formatNP", function(x, ...) formatC(x, format="f", digits=precision), "rms")
# remember old width setting
old_width = options('width')$width
# substitute it with a setting making sure the table will not wrap
options(width=old_width + 4 * precision)
# actually print the data and capture it; the finally clause restores the
# original settings even if print() fails
cap = tryCatch(
capture.output(print(x)),
finally = {
options(width=old_width)
assignInNamespace("formatNP", old_format_np, "rms")
}
)
#model stats
stats = c()
stats$R2.adj = str_match(cap, "R2 adj\\s+ (\\d\\.\\d+)") %>% na.omit() %>% .[, 2] %>% as.numeric()
#coef stats lines
coef_lines = cap[which(str_detect(cap, "Coef\\s+S\\.E\\.")):(length(cap) - 1)]
#parse
coef_lines_table = suppressWarnings(readr::read_table(coef_lines %>% stringr::str_c(collapse = "\n")))
colnames(coef_lines_table)[1] = "Predictor"
list(
stats = stats,
coefs = coef_lines_table
)
}
Example:
> get_model_stats(fit)
$stats
$stats$R2.adj
[1] 0.86
$coefs
# A tibble: 6 x 5
Predictor Coef S.E. t `Pr(>|t|)`
<chr> <dbl> <dbl> <dbl> <chr>
1 Intercept 2.17 0.280 7.8 <0.0001
2 Sepal.Width 0.50 0.086 5.8 <0.0001
3 Petal.Length 0.83 0.069 12.1 <0.0001
4 Petal.Width -0.32 0.151 -2.1 0.0389
5 Species=versicolor -0.72 0.240 -3.0 0.0031
6 Species=virginica -1.02 0.334 -3.1 0.0026
This approach still has issues: in the example output above, p values are not returned as numerics and have only 4 digits, which can cause problems in some situations. The updated code should extract digits up to arbitrary precision.
Be extra careful when using this with long variable names as those could wrap the table into multiple rows and introduce missing values (NA) in output even though the stats are in there!
Package broom is a great way to extract model info.
library(pacman)
library(rms)
library(broom)
fit = ols(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species,
data = iris)
tidy(summary.lm(fit))
# term estimate std.error statistic p.value
# 1 Intercept 2.1712663 0.27979415 7.760227 1.429502e-12
# 2 Sepal.Width 0.4958889 0.08606992 5.761466 4.867516e-08
# 3 Petal.Length 0.8292439 0.06852765 12.100867 1.073592e-23
# 4 Petal.Width -0.3151552 0.15119575 -2.084418 3.888826e-02
# 5 Species=versicolor -0.7235620 0.24016894 -3.012721 3.059634e-03
# 6 Species=virginica -1.0234978 0.33372630 -3.066878 2.584344e-03
glance(fit)
# r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC df.residual
# 1 0.8673123 0.862705 0.3068261 188.251 2.666942e-61 6 -32.55801 79.11602 100.1905 144
The object fit also contains some easily accessible info that you can get and store in a dataframe:
fit$coefficients
# Intercept Sepal.Width Petal.Length Petal.Width Species=versicolor Species=virginica
# 2.1712663 0.4958889 0.8292439 -0.3151552 -0.7235620 -1.0234978
fit$stats
# n Model L.R. d.f. R2 g Sigma
# 150.0000000 302.9635115 5.0000000 0.8673123 0.8820479 0.3068261
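If you would rather not depend on broom, the coefficient table can also be rebuilt from three stored pieces: the coefficients, their covariance matrix and the residual degrees of freedom. Below is a minimal sketch. coef_table is a hypothetical helper (not part of rms or broom), and it is demonstrated on a plain lm fit so that it runs with base R only. For the ols fit above you would presumably pass fit$coefficients, fit$var and fit$df.residual, since print.ols itself takes the standard errors from sqrt(diag(x$var)).

```r
# Hypothetical helper: build a coefficient table from raw ingredients
coef_table <- function(beta, V, rdf) {
  se    <- sqrt(diag(V))                        # standard errors
  t_val <- beta / se                            # t statistics
  p_val <- 2 * pt(abs(t_val), df = rdf, lower.tail = FALSE)  # two-sided p values
  data.frame(term      = names(beta),
             estimate  = unname(beta),
             std.error = unname(se),
             statistic = unname(t_val),
             p.value   = unname(p_val))
}

# Demonstration with base R's lm (same formula as in the question)
m <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species,
        data = iris)
tab <- coef_table(coef(m), vcov(m), df.residual(m))
tab
```

The p values come back as full-precision numerics, so nothing has to be parsed from printed output.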
res.pca = prcomp(y, scale = TRUE)
summ=summary(res.pca)
summ
This gives me the desired output.
I want to turn this summary into a data frame.
I've tried do.call(cbind, lapply(res.pca, summary)), but it gives me the Min/Max summary rather than the one I want.
Please note that I don't want to extract values from column names; I am looking for a general solution that I can reuse.
What you are looking for is the importance element of summary(res.pca).
The example is taken from "Principal Components Analysis - how to get the contribution (%) of each parameter to a Prin.Comp.?":
a <- rnorm(10, 50, 20)
b <- seq(10, 100, 10)
c <- seq(88, 10, -8)
d <- rep(seq(3, 16, 3), 2)
e <- rnorm(10, 61, 27)
my_table <- data.frame(a, b, c, d, e)
res.pca <- prcomp(my_table, scale = TRUE)
summary(res.pca)$importance
# PC1 PC2 PC3 PC4 PC5
#Standard deviation 1.7882 0.9038 0.8417 0.52622 9.037e-17
#Proportion of Variance 0.6395 0.1634 0.1417 0.05538 0.000e+00
#Cumulative Proportion 0.6395 0.8029 0.9446 1.00000 1.000e+00
class(summary(res.pca)$importance)
#[1] "matrix"
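Since the question asks for a data frame, the importance matrix can be transposed and converted. This is a small sketch using the built-in USArrests data instead of the random table above; the name imp_df and the layout (one row per component) are just one possible choice.

```r
# PCA on a built-in dataset so the example is reproducible
res.pca <- prcomp(USArrests, scale. = TRUE)
imp <- summary(res.pca)$importance   # 3 x 4 matrix

# One row per principal component, one column per statistic
imp_df <- as.data.frame(t(imp))
imp_df <- cbind(component = rownames(imp_df), imp_df)
rownames(imp_df) <- NULL
imp_df
```

The columns keep the row labels of the matrix ("Standard deviation" and so on), so they need backticks or [[ ]] when accessed by name.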
N.B.:
When you want to "study" an object, it can be convenient to use str on it. Here, you can run str(summary(res.pca)) to see where the information is stored and hence where you can get what you want:
str(summary(res.pca))
List of 6
$ sdev : num [1:5] 1.79 9.04e-01 8.42e-01 5.26e-01 9.04e-17
$ rotation : num [1:5, 1:5] 0.278 0.512 -0.512 0.414 -0.476 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:5] "a" "b" "c" "d" ...
.. ..$ : chr [1:5] "PC1" "PC2" "PC3" "PC4" ...
$ center : Named num [1:5] 34.9 55 52 9 77.8
..- attr(*, "names")= chr [1:5] "a" "b" "c" "d" ...
$ scale : Named num [1:5] 22.4 30.28 24.22 4.47 26.11
..- attr(*, "names")= chr [1:5] "a" "b" "c" "d" ...
$ x : num [1:10, 1:5] -2.962 -1.403 -1.653 -0.537 1.186 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr [1:5] "PC1" "PC2" "PC3" "PC4" ...
$ importance: num [1:3, 1:5] 1.788 0.64 0.64 0.904 0.163 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:3] "Standard deviation" "Proportion of Variance" "Cumulative Proportion"
.. ..$ : chr [1:5] "PC1" "PC2" "PC3" "PC4" ...
- attr(*, "class")= chr "summary.prcomp"
I'm using BoxCoxTrans function from the caret package:
library(caret)
library(purrr)
model1 <- apply(X = my.df, 2, BoxCoxTrans)
model2 <- purrr::map2(model1, my.df, function(x,y) predict(x,y))
trans.df <- as.data.frame(do.call(cbind, model2))
library(rcompanion)
plotNormalHistogram(trans.df)
print(trans.df)
It is working correctly and transforming the data, but I have no way of knowing which lambda value is used for the transformation.
You can find these values in model1. I'll show you how to get them using the iris data.
library(caret)
fudge <- 0.2
out <- lapply(iris[1:2], BoxCoxTrans, fudge = fudge) # instead of apply(..., MARGIN = 2, ...)
Now look at the structure of out
str(out, 2)
#List of 2
# $ Sepal.Length:List of 6
# ..$ lambda : num -0.1
# ..$ fudge : num 0.2
# ..$ n : int 150
# ..$ summary :Classes 'summaryDefault', 'table' Named num [1:6] 4.3 5.1 5.8 5.84 6.4 ...
# .. .. ..- attr(*, "names")= chr [1:6] "Min." "1st Qu." "Median" "Mean" ...
# ..$ ratio : num 1.84
# ..$ skewness: num 0.309
# ..- attr(*, "class")= chr "BoxCoxTrans"
# $ Sepal.Width :List of 6
# ..$ lambda : num 0.3
# ..$ fudge : num 0.2
# ..$ n : int 150
# ..$ summary :Classes 'summaryDefault', 'table' Named num [1:6] 2 2.8 3 3.06 3.3 ...
# .. .. ..- attr(*, "names")= chr [1:6] "Min." "1st Qu." "Median" "Mean" ...
# ..$ ratio : num 2.2
# ..$ skewness: num 0.313
# ..- attr(*, "class")= chr "BoxCoxTrans"
Using base R you can use sapply and `[[` now as follows
sapply(out, `[[`, "lambda")
#Sepal.Length Sepal.Width
# -0.1 0.3
Since you use purrr, you might consider map and pluck
map_dbl(out, pluck, "lambda")
#Sepal.Length Sepal.Width
# -0.1 0.3
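Thanks to @missuse's mindful comments, we can get the lambda actually used for the transformation as
library(dplyr)
# estimated lambdas, extracted as shown above
lambda <- sapply(out, `[[`, "lambda")
real_lambda <- case_when(between(lambda, -fudge, fudge) ~ 0,
between(lambda, 1 - fudge, 1 + fudge) ~ 1,
TRUE ~ lambda)
# case_when() drops the names, so put them back
real_lambda <- setNames(real_lambda, names(lambda))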
real_lambda
#Sepal.Length Sepal.Width
# 0.0 0.3
This is necessary because the function BoxCoxTrans has the argument fudge, which is a tolerance value: lambda values within +/-fudge will be coerced to 0 and within 1+/-fudge will be coerced to 1.
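The same coercion rule can be written in base R without dplyr. This is only a sketch of the rule as described above: effective_lambda is a hypothetical helper, not a caret function, and treating the interval boundaries as inclusive (like between() does) is an assumption.

```r
# Hypothetical helper: apply the fudge tolerance to estimated lambdas
effective_lambda <- function(lambda, fudge = 0.2) {
  out <- lambda
  out[abs(lambda) <= fudge] <- 0       # close enough to 0: log transform
  out[abs(lambda - 1) <= fudge] <- 1   # close enough to 1: no transform
  out
}

effective_lambda(c(Sepal.Length = -0.1, Sepal.Width = 0.3), fudge = 0.2)
# Sepal.Length  Sepal.Width
#          0.0          0.3
```

Names are preserved because the vector is modified in place rather than rebuilt.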
I have been following an online example for R Kohonen self-organising maps (SOM) which suggested that the data should be centred and scaled before computing the SOM.
However, I've noticed that the object created seems to have attributes for centre and scale. In that case, am I applying a redundant step by centring and scaling first? An example script is below.
# Load package
require(kohonen)
# Set data
data(iris)
# Scale and centre
dt <- scale(iris[, 1:4],center=TRUE)
# Prepare SOM
set.seed(590507)
som1 <- som(dt,
somgrid(6,6, "hexagonal"),
rlen=500,
keep.data=TRUE)
str(som1)
The output from the last line of the script is:
List of 13
$ data :List of 1
..$ : num [1:150, 1:4] -0.898 -1.139 -1.381 -1.501 -1.018 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : NULL
.. .. ..$ : chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length"
"Petal.Width"
.. ..- attr(*, "scaled:center")= Named num [1:4] 5.84 3.06 3.76 1.2
.. .. ..- attr(*, "names")= chr [1:4] "Sepal.Length" "Sepal.Width"
"Petal.Length" "Petal.Width"
.. ..- attr(*, "scaled:scale")= Named num [1:4] 0.828 0.436 1.765 0.762
.. .. ..- attr(*, "names")= chr [1:4] "Sepal.Length" "Sepal.Width"
"Petal.Length" "Petal.Width"
$ unit.classif : num [1:150] 3 5 5 5 4 2 4 4 6 5 ...
$ distances : num [1:150] 0.0426 0.0663 0.0768 0.0744 0.1346 ...
$ grid :List of 6
..$ pts : num [1:36, 1:2] 1.5 2.5 3.5 4.5 5.5 6.5 1 2 3 4 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : NULL
.. .. ..$ : chr [1:2] "x" "y"
..$ xdim : num 6
..$ ydim : num 6
..$ topo : chr "hexagonal"
..$ neighbourhood.fct: Factor w/ 2 levels "bubble","gaussian": 1
..$ toroidal : logi FALSE
..- attr(*, "class")= chr "somgrid"
$ codes :List of 1
..$ : num [1:36, 1:4] -0.376 -0.683 -0.734 -1.158 -1.231 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : chr [1:36] "V1" "V2" "V3" "V4" ...
.. .. ..$ : chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length"
"Petal.Width"
$ changes : num [1:500, 1] 0.0445 0.0413 0.0347 0.0373 0.0337 ...
$ alpha : num [1:2] 0.05 0.01
$ radius : Named num [1:2] 3.61 0
..- attr(*, "names")= chr [1:2] "66.66667%" ""
$ user.weights : num 1
$ distance.weights: num 1
$ whatmap : int 1
$ maxNA.fraction : int 0
$ dist.fcts : chr "sumofsquares"
- attr(*, "class")= chr "kohonen"
Notice that lines 7 and 10 of the output refer to centre and scale. I would appreciate an explanation of the process here.
Your scaling step is not redundant, because the source code does no scaling. The attributes you see in lines 7 and 10 are attributes of the training dataset itself.
To check this, run this chunk of code and compare the results:
# Load package
require(kohonen)
# Set data
data(iris)
# Scale and centre
dt <- scale(iris[, 1:4],center=TRUE)
#compare train datasets
str(dt)
str(as.matrix(iris[, 1:4]))
# Prepare SOM
set.seed(590507)
som1 <- kohonen::som(dt,
kohonen::somgrid(6,6, "hexagonal"),
rlen=500,
keep.data=TRUE)
#without scaling
som2 <- kohonen::som(as.matrix(iris[, 1:4]),
kohonen::somgrid(6,6, "hexagonal"),
rlen=500,
keep.data=TRUE)
#compare results of som function
str(som1)
str(som2)
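The origin of those attributes can also be checked in base R alone, without kohonen. The sketch below shows that scale() attaches "scaled:center" and "scaled:scale" to its result, that an unscaled matrix has neither, and that the attributes are enough to undo the scaling.

```r
dt  <- scale(iris[, 1:4], center = TRUE)
raw <- as.matrix(iris[, 1:4])

# scale() stores what it subtracted and divided by
attr(dt, "scaled:center")
attr(dt, "scaled:scale")

# the unscaled matrix carries no such attributes
is.null(attr(raw, "scaled:center"))

# the attributes allow back-transforming to the original units
restored <- sweep(dt, 2, attr(dt, "scaled:scale"), "*")
restored <- sweep(restored, 2, attr(dt, "scaled:center"), "+")
all.equal(restored, raw, check.attributes = FALSE)
```

So the attributes seen inside the SOM object only record what scale() did before the data were handed over.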
In R, str() is handy for showing the structure of an object, such as the list of lists returned by lm() and other modelling functions, but it gives way too much output. I'm looking for some tool to create a simple tree diagram showing only the names of the list elements and their structure.
e.g., for this example,
data(Prestige, package="car")
out <- lm(prestige ~ income+education+women, data=Prestige)
str(out, max.level=2)
#> List of 12
#> $ coefficients : Named num [1:4] -6.79433 0.00131 4.18664 -0.00891
#> ..- attr(*, "names")= chr [1:4] "(Intercept)" "income" "education" "women"
#> $ residuals : Named num [1:102] 4.58 -9.39 4.69 4.22 8.15 ...
#> ..- attr(*, "names")= chr [1:102] "gov.administrators" "general.managers" "accountants" "purchasing.officers" ...
#> $ effects : Named num [1:102] -472.99 -123.61 -92.61 -2.3 6.83 ...
#> ..- attr(*, "names")= chr [1:102] "(Intercept)" "income" "education" "women" ...
#> $ rank : int 4
#> $ fitted.values: Named num [1:102] 64.2 78.5 58.7 52.6 65.3 ...
#> ..- attr(*, "names")= chr [1:102] "gov.administrators" "general.managers" "accountants" "purchasing.officers" ...
#> $ assign : int [1:4] 0 1 2 3
#> $ qr :List of 5
#> ..$ qr : num [1:102, 1:4] -10.1 0.099 0.099 0.099 0.099 ...
#> .. ..- attr(*, "dimnames")=List of 2
#> .. ..- attr(*, "assign")= int [1:4] 0 1 2 3
#> ..$ qraux: num [1:4] 1.1 1.44 1.06 1.06
#> ..$ pivot: int [1:4] 1 2 3 4
#> ..$ tol : num 1e-07
#> ..$ rank : int 4
#> ..- attr(*, "class")= chr "qr"
#> $ df.residual : int 98
...
I would like to get something like this:
This is similar to what I get from tree for file folders in my file system:
C:\Dropbox\Documents\images>tree
Folder PATH listing
Volume serial number is 2250-8E6F
C:.
+---cartoons
+---chevaliers
+---icons
+---milestones
+---minard
+---minard-besancon
The result could be either in graphic characters, as in tree or an actual graphic as shown above. Is anything like this available?
A simple approach to getting this from the str output would be something like...
a <- capture.output(str(out, max.level=2))
a <- trimws(gsub("\\:.*", "", a[grepl("\\$", a)]))
cat(a, sep="\n")
$ coefficients
$ residuals
$ effects
$ rank
$ fitted.values
$ assign
$ qr
..$ qr
..$ qraux
..$ pivot
..$ tol
..$ rank
$ df.residual
$ xlevels
$ call
$ terms
$ model
..$ prestige
..$ income
..$ education
..$ women
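If parsing the printed str() output feels fragile, the names can also be collected by walking the list recursively. This is a rough sketch: name_tree is a hypothetical helper, and the "+---" / "|" prefixes merely imitate the look of the tree command.

```r
# Hypothetical helper: collect element names of a nested list as tree lines
name_tree <- function(x, max_level = 2, indent = "") {
  if (!is.list(x) || max_level < 1) return(character(0))
  labels <- names(x)
  if (is.null(labels)) labels <- paste0("[[", seq_along(x), "]]")
  lines <- character(0)
  for (i in seq_along(x)) {
    lines <- c(lines, paste0(indent, "+---", labels[i]))
    # descend one level, indenting the children
    lines <- c(lines, name_tree(x[[i]], max_level - 1, paste0(indent, "|   ")))
  }
  lines
}

out <- lm(Sepal.Length ~ Sepal.Width + Species, data = iris)
cat(name_tree(out, max_level = 2), sep = "\n")
```

Only list-like elements are expanded, so atomic vectors, the call and the terms object show up as leaves.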
The standard way of doing a linear regression is something like this:
l <- lm(Sepal.Width ~ Petal.Length + Petal.Width, data=iris)
and then use predict(l, new_data) to make predictions, where new_data is a dataframe with columns matching the formula. But lm() returns an lm object, which is a list that contains crap-loads of stuff that is mostly irrelevant in most situations. This includes a copy of the original data, and a bunch of named vectors and arrays the length/size of the data:
R> str(l)
List of 12
$ coefficients : Named num [1:3] 3.587 -0.257 0.364
..- attr(*, "names")= chr [1:3] "(Intercept)" "Petal.Length" "Petal.Width"
$ residuals : Named num [1:150] 0.2 -0.3 -0.126 -0.174 0.3 ...
..- attr(*, "names")= chr [1:150] "1" "2" "3" "4" ...
$ effects : Named num [1:150] -37.445 -2.279 -0.914 -0.164 0.313 ...
..- attr(*, "names")= chr [1:150] "(Intercept)" "Petal.Length" "Petal.Width" "" ...
$ rank : int 3
$ fitted.values: Named num [1:150] 3.3 3.3 3.33 3.27 3.3 ...
..- attr(*, "names")= chr [1:150] "1" "2" "3" "4" ...
$ assign : int [1:3] 0 1 2
$ qr :List of 5
..$ qr : num [1:150, 1:3] -12.2474 0.0816 0.0816 0.0816 0.0816 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : chr [1:150] "1" "2" "3" "4" ...
.. .. ..$ : chr [1:3] "(Intercept)" "Petal.Length" "Petal.Width"
.. ..- attr(*, "assign")= int [1:3] 0 1 2
..$ qraux: num [1:3] 1.08 1.1 1.01
..$ pivot: int [1:3] 1 2 3
..$ tol : num 1e-07
..$ rank : int 3
..- attr(*, "class")= chr "qr"
$ df.residual : int 147
$ xlevels : Named list()
$ call : language lm(formula = Sepal.Width ~ Petal.Length + Petal.Width, data = iris)
$ terms :Classes 'terms', 'formula' length 3 Sepal.Width ~ Petal.Length + Petal.Width
.. ..- attr(*, "variables")= language list(Sepal.Width, Petal.Length, Petal.Width)
.. ..- attr(*, "factors")= int [1:3, 1:2] 0 1 0 0 0 1
.. .. ..- attr(*, "dimnames")=List of 2
.. .. .. ..$ : chr [1:3] "Sepal.Width" "Petal.Length" "Petal.Width"
.. .. .. ..$ : chr [1:2] "Petal.Length" "Petal.Width"
.. ..- attr(*, "term.labels")= chr [1:2] "Petal.Length" "Petal.Width"
.. ..- attr(*, "order")= int [1:2] 1 1
.. ..- attr(*, "intercept")= int 1
.. ..- attr(*, "response")= int 1
.. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
.. ..- attr(*, "predvars")= language list(Sepal.Width, Petal.Length, Petal.Width)
.. ..- attr(*, "dataClasses")= Named chr [1:3] "numeric" "numeric" "numeric"
.. .. ..- attr(*, "names")= chr [1:3] "Sepal.Width" "Petal.Length" "Petal.Width"
$ model :'data.frame': 150 obs. of 3 variables:
..$ Sepal.Width : num [1:150] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
..$ Petal.Length: num [1:150] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
..$ Petal.Width : num [1:150] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
..- attr(*, "terms")=Classes 'terms', 'formula' length 3 Sepal.Width ~ Petal.Length + Petal.Width
.. .. ..- attr(*, "variables")= language list(Sepal.Width, Petal.Length, Petal.Width)
.. .. ..- attr(*, "factors")= int [1:3, 1:2] 0 1 0 0 0 1
.. .. .. ..- attr(*, "dimnames")=List of 2
.. .. .. .. ..$ : chr [1:3] "Sepal.Width" "Petal.Length" "Petal.Width"
.. .. .. .. ..$ : chr [1:2] "Petal.Length" "Petal.Width"
.. .. ..- attr(*, "term.labels")= chr [1:2] "Petal.Length" "Petal.Width"
.. .. ..- attr(*, "order")= int [1:2] 1 1
.. .. ..- attr(*, "intercept")= int 1
.. .. ..- attr(*, "response")= int 1
.. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
.. .. ..- attr(*, "predvars")= language list(Sepal.Width, Petal.Length, Petal.Width)
.. .. ..- attr(*, "dataClasses")= Named chr [1:3] "numeric" "numeric" "numeric"
.. .. .. ..- attr(*, "names")= chr [1:3] "Sepal.Width" "Petal.Length" "Petal.Width"
- attr(*, "class")= chr "lm"
That stuff takes up a lot of space, and the lm object ends up being almost an order of magnitude larger than the original dataset:
R> object.size(iris)
7088 bytes
R> object.size(l)
52704 bytes
This isn't a problem with a dataset as small as that, but it can be really problematic with a 170 MB dataset that produces a 450 MB lm object. Even with all the return options set to FALSE, the lm object is still more than 4 times the size of the original dataset:
R> ls <- lm(Sepal.Width ~ Petal.Length + Petal.Width, data=iris, model=FALSE, x=FALSE, y=FALSE, qr=FALSE)
R> object.size(ls)
30568 bytes
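To see where the space goes, the size of each component can be listed separately. A small sketch in base R follows; the exact byte counts depend on the R version and platform, so none are shown here.

```r
l <- lm(Sepal.Width ~ Petal.Length + Petal.Width, data = iris)

# size in bytes of every component of the lm object, largest first
sizes <- sort(sapply(l, function(comp) as.numeric(object.size(comp))),
              decreasing = TRUE)
sizes
```

The per-component sizes need not add up exactly to object.size(l), since the list itself has some overhead.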
Is there any way of fitting a model in R, and then being able to predict output values on new input data, without storing crap tonnes of extra unnecessary data? In other words, is there a way to just store the model coefficients, but still be able to use those coefficients to predict on new data?
Edit: I guess, as well as not storing all that excess data, I'm also really interested in a way of using lm so that it doesn't even calculate that data - it's just wasted CPU time...
You can use biglm:
library(biglm)
m <- biglm(Sepal.Length ~ Petal.Length + Petal.Width, iris)
Since biglm does not store the data in the output object you need to provide your data when making predictions:
p <- predict(m, newdata=iris)
The amount of data biglm uses is proportional to the number of parameters:
> object.size(m)
6720 bytes
> d <- rbind(iris, iris)
> m <- biglm(Sepal.Width ~ Petal.Length + Petal.Width, data=d)
> object.size(m)
6720 bytes
biglm also allows you to update the model with a new chunk of data using the update method. Using this you can also estimate models when the complete dataset does not fit in memory.
The only components of the lm object that you need to calculate predicted values are terms and coefficients. However, you'll need to roll your own prediction function as predict.lm complains if you delete the qr component (which is needed to compute term-by-term effects and standard errors). Something like this should do.
m <- lm(Sepal.Length ~ Petal.Length + Petal.Width, iris)
m$effects <- m$fitted.values <- m$residuals <- m$model <- m$qr <-
m$rank <- m$assign <- NULL
predict0 <- function(object, newdata)
{
# drop the response from the terms so that newdata only needs the predictors
mm <- model.matrix(delete.response(terms(object)), newdata)
mm %*% object$coefficients
}
predict0(m, iris[1:10,])
I think there are two approaches to deal with this:
1. Use lm and trim the fat afterwards. For quite nice and instructive discussions, see e.g. here and here. This will not solve the "computation time" issue.
2. Do not use lm.
If you go for the second option, you could easily write up the matrix operations yourself so that you only get the predicted values. If you prefer to use a canned routine, you could try other packages that implement least squares, e.g. fastLm from the RcppArmadillo-package (or the Eigen version of it, or as others pointed out biglm), which stores much less information. Using this approach has some benefits, e.g. providing a formula-interface and such things. fastLm is also quite fast, if computation time is a concern for you.
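Writing the matrix operations yourself could look roughly like the sketch below. Both function names are hypothetical, the fit uses base R's qr.solve for the least-squares solution, and only the coefficients, the response-free terms and the factor levels are kept. Missing values, weights and offsets are not handled.

```r
# Hypothetical helper: fit by least squares, keep only what prediction needs
fit_coef_only <- function(formula, data) {
  mf <- model.frame(formula, data)
  tt <- attr(mf, "terms")
  X  <- model.matrix(tt, mf)
  y  <- model.response(mf)
  list(coefficients = qr.solve(X, y),      # least-squares solution via QR
       terms        = delete.response(tt), # predictors only
       xlevels      = .getXlevels(tt, mf)) # factor levels seen in training
}

# Hypothetical helper: predictions from the stripped-down object
predict_coef_only <- function(object, newdata) {
  mf <- model.frame(object$terms, newdata, xlev = object$xlevels)
  X  <- model.matrix(object$terms, mf)
  drop(X %*% object$coefficients)
}

small <- fit_coef_only(Sepal.Length ~ Petal.Length + Species, iris)
predict_coef_only(small, iris[c(1, 51, 101), c("Petal.Length", "Species")])
```

Nothing of length n is stored, so the object stays small regardless of the number of rows, and newdata does not need to contain the response.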
For comparison, here is a small benchmark of the resulting object sizes:
l <- lm(Sepal.Width ~ Petal.Length + Petal.Width, data=iris)
library(biglm)
m <- biglm(Sepal.Length ~ Petal.Length + Petal.Width, iris)
library(RcppArmadillo)
a <- fastLm(Sepal.Length ~ Petal.Length + Petal.Width, iris)
object.size(l)
# 52704 bytes
object.size(m)
# 6664 bytes
object.size(a)
# 6344 bytes