I'm using BoxCoxTrans function from the caret package:
library(caret)
library(purrr)
model1 <- apply(X = my.df, 2, BoxCoxTrans)
model2 <- purrr::map2(model1, my.df, function(x,y) predict(x,y))
trans.df <- as.data.frame(do.call(cbind, model2))
library(rcompanion)
plotNormalHistogram(trans.df)
print(trans.df)
It is working correctly and transforming the data, but I have no way of knowing which lambda value is used for the transformation.
You can find these values in model1. I'll show you how to get them using the iris data.
library(caret)
fudge <- 0.2
out <- lapply(iris[1:2], BoxCoxTrans, fudge = fudge) # instead of apply(..., margin = 2, ...)
Now look at the structure of out
str(out, 2)
#List of 2
# $ Sepal.Length:List of 6
# ..$ lambda : num -0.1
# ..$ fudge : num 0.2
# ..$ n : int 150
# ..$ summary :Classes 'summaryDefault', 'table' Named num [1:6] 4.3 5.1 5.8 5.84 6.4 ...
# .. .. ..- attr(*, "names")= chr [1:6] "Min." "1st Qu." "Median" "Mean" ...
# ..$ ratio : num 1.84
# ..$ skewness: num 0.309
# ..- attr(*, "class")= chr "BoxCoxTrans"
# $ Sepal.Width :List of 6
# ..$ lambda : num 0.3
# ..$ fudge : num 0.2
# ..$ n : int 150
# ..$ summary :Classes 'summaryDefault', 'table' Named num [1:6] 2 2.8 3 3.06 3.3 ...
# .. .. ..- attr(*, "names")= chr [1:6] "Min." "1st Qu." "Median" "Mean" ...
# ..$ ratio : num 2.2
# ..$ skewness: num 0.313
# ..- attr(*, "class")= chr "BoxCoxTrans"
Using base R you can use sapply and `[[` now as follows
sapply(out, `[[`, "lambda")
#Sepal.Length Sepal.Width
# -0.1 0.3
Since you use purrr, you might consider map and pluck
map_dbl(out, pluck, "lambda")
#Sepal.Length Sepal.Width
# -0.1 0.3
Thanks to #missuse's mindful comments we can get the lambda used for transformation as
library(dplyr)
real_lambda <- case_when(between(lambda, -fudge, fudge) ~ 0,
between(lambda, 1 - fudge, 1 + fudge) ~ 1,
TRUE ~ lambda)
real_lambda <- setNames(real_lambda, names(lambda))
real_lambda
#Sepal.Length Sepal.Width
# 0.0 0.3
This is necessary because the function BoxCoxTrans has the argument fudge which is
a tolerance value: lambda values within +/-fudge will be coerced to 0 and within 1+/-fudge will be coerced to 1.
Related
I need a faster way of doing linear regression than the lm() method. I found that lm.fit() is quite a bit faster but I'm wondering how to use the results. For example using this code:
x = 1:5
y = 5:1
regr = lm.fit(as.matrix(x), y)
str(regr)
Outputs:
List of 8
$ coefficients : Named num 0.636
..- attr(*, "names")= chr "x1"
$ residuals : num [1:5] 4.364 2.727 1.091 -0.545 -2.182
$ effects : Named num [1:5] -4.719 1.69 -0.465 -2.619 -4.774
..- attr(*, "names")= chr [1:5] "x1" "" "" "" ...
$ rank : int 1
$ fitted.values: num [1:5] 0.636 1.273 1.909 2.545 3.182
$ assign : NULL
$ qr :List of 5
..$ qr : num [1:5, 1] -7.416 0.27 0.405 0.539 0.674
..$ qraux: num 1.13
..$ pivot: int 1
..$ tol : num 1e-07
..$ rank : int 1
..- attr(*, "class")= chr "qr"
$ df.residual : int 4
I'm expecting intercept = 6 and slope = -1 but the result above doesn't contain anyhing near that. Also, does lm.fit() output r squared?
lm.fit allows to do things much more manually, so, as #MrFlick commented, we must include the intercept manually as well using cbind(1, x) as the design matrix. The R^2 is not provided but we may easily compute it:
x <- 1:5
y <- 5:1 + rnorm(5)
regr <- lm.fit(cbind(1, x), y)
regr$coef
# x
# 5.2044349 -0.5535963
1 - var(regr$residuals) / var(y) # R^2
# [1] 0.3557227
1 - var(regr$residuals) / var(y) * (length(y) - 1) / regr$df.residual # Adj. R^2
# [1] 0.1409636
The rms package contains a wealth of useful statistical functions. However, I cannot find a proper way to extract certain fit statistics from the fitted object. Consider an example:
library(pacman)
p_load(rms, stringr, readr)
#fit
> (fit = rms::ols(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species, data = iris))
Linear Regression Model
rms::ols(formula = Sepal.Length ~ Sepal.Width + Petal.Length +
Petal.Width + Species, data = iris)
Model Likelihood Discrimination
Ratio Test Indexes
Obs 150 LR chi2 302.96 R2 0.867
sigma0.3068 d.f. 5 R2 adj 0.863
d.f. 144 Pr(> chi2) 0.0000 g 0.882
Residuals
Min 1Q Median 3Q Max
-0.794236 -0.218743 0.008987 0.202546 0.731034
Coef S.E. t Pr(>|t|)
Intercept 2.1713 0.2798 7.76 <0.0001
Sepal.Width 0.4959 0.0861 5.76 <0.0001
Petal.Length 0.8292 0.0685 12.10 <0.0001
Petal.Width -0.3152 0.1512 -2.08 0.0389
Species=versicolor -0.7236 0.2402 -3.01 0.0031
Species=virginica -1.0235 0.3337 -3.07 0.0026
So, the print function for the fit prints a lot of useful stuff including standard errors and adjusted R2. Unfortunately, if we inspect the model fit object, the values don't seem to be present anywhere.
> str(fit)
List of 19
$ coefficients : Named num [1:6] 2.171 0.496 0.829 -0.315 -0.724 ...
..- attr(*, "names")= chr [1:6] "Intercept" "Sepal.Width" "Petal.Length" "Petal.Width" ...
$ residuals : Named num [1:150] 0.0952 0.1432 -0.0731 -0.2894 -0.0544 ...
..- attr(*, "names")= chr [1:150] "1" "2" "3" "4" ...
$ effects : Named num [1:150] -71.5659 -1.1884 9.1884 -1.3724 -0.0587 ...
..- attr(*, "names")= chr [1:150] "Intercept" "Sepal.Width" "Petal.Length" "Petal.Width" ...
$ rank : int 6
$ fitted.values : Named num [1:150] 5 4.76 4.77 4.89 5.05 ...
..- attr(*, "names")= chr [1:150] "1" "2" "3" "4" ...
$ assign :List of 4
..$ Sepal.Width : int 2
..$ Petal.Length: int 3
..$ Petal.Width : int 4
..$ Species : int [1:2] 5 6
$ qr :List of 5
..$ qr : num [1:150, 1:6] -12.2474 0.0816 0.0816 0.0816 0.0816 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : chr [1:150] "1" "2" "3" "4" ...
.. .. ..$ : chr [1:6] "Intercept" "Sepal.Width" "Petal.Length" "Petal.Width" ...
..$ qraux: num [1:6] 1.08 1.02 1.11 1.02 1.02 ...
..$ pivot: int [1:6] 1 2 3 4 5 6
..$ tol : num 1e-07
..$ rank : int 6
..- attr(*, "class")= chr "qr"
$ df.residual : int 144
$ var : num [1:6, 1:6] 0.07828 -0.02258 -0.00198 0.01589 -0.02837 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:6] "Intercept" "Sepal.Width" "Petal.Length" "Petal.Width" ...
.. ..$ : chr [1:6] "Intercept" "Sepal.Width" "Petal.Length" "Petal.Width" ...
$ stats : Named num [1:6] 150 302.964 5 0.867 0.882 ...
..- attr(*, "names")= chr [1:6] "n" "Model L.R." "d.f." "R2" ...
$ linear.predictors: Named num [1:150] 5 4.76 4.77 4.89 5.05 ...
..- attr(*, "names")= chr [1:150] "1" "2" "3" "4" ...
$ call : language rms::ols(formula = Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species, data = iris)
$ terms :Classes 'terms', 'formula' language Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species
.. ..- attr(*, "variables")= language list(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species)
.. ..- attr(*, "factors")= int [1:5, 1:4] 0 1 0 0 0 0 0 1 0 0 ...
.. .. ..- attr(*, "dimnames")=List of 2
.. .. .. ..$ : chr [1:5] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" ...
.. .. .. ..$ : chr [1:4] "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
.. ..- attr(*, "term.labels")= chr [1:4] "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
.. ..- attr(*, "order")= int [1:4] 1 1 1 1
.. ..- attr(*, "intercept")= num 1
.. ..- attr(*, "response")= int 1
.. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
.. ..- attr(*, "predvars")= language list(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species)
.. ..- attr(*, "dataClasses")= Named chr [1:5] "numeric" "numeric" "numeric" "numeric" ...
.. .. ..- attr(*, "names")= chr [1:5] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" ...
.. ..- attr(*, "formula")=Class 'formula' language Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species
.. .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
$ Design :List of 12
..$ name : chr [1:4] "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
..$ label : chr [1:4] "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
..$ units : Named chr [1:4] "" "" "" ""
.. ..- attr(*, "names")= chr [1:4] "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
..$ colnames : chr [1:5] "Sepal.Width" "Petal.Length" "Petal.Width" "Species=versicolor" ...
..$ mmcolnames : chr [1:5] "Sepal.Width" "Petal.Length" "Petal.Width" "Speciesversicolor" ...
..$ assume : chr [1:4] "asis" "asis" "asis" "category"
..$ assume.code : int [1:4] 1 1 1 5
..$ parms :List of 1
.. ..$ Species: chr [1:3] "setosa" "versicolor" "virginica"
..$ limits : list()
..$ values : list()
..$ nonlinear :List of 4
.. ..$ Sepal.Width : logi FALSE
.. ..$ Petal.Length: logi FALSE
.. ..$ Petal.Width : logi FALSE
.. ..$ Species : logi [1:2] FALSE FALSE
..$ interactions: NULL
$ non.slopes : num 1
$ na.action : NULL
$ scale.pred : chr "Sepal.Length"
$ fail : logi FALSE
$ sformula :Class 'formula' language Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species
.. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
- attr(*, "class")= chr [1:3] "ols" "rms" "lm"
There is a 7 year old question on R help where the package creator explains a solution to getting these:
On Wed, 11 Aug 2010, david dav wrote:
Hi,
I would like to extract the coefficients of a logistic regression
(estimates and standard error as well) in lrm as in glm with
summary(fit.glm)$coef
Thanks
David
coef(fit) sqrt(diag(vcov(fit)))
But these will not be very helpful except in the trivial case where
everything is linear, nothing interacts, and factors have two levels.
Frank
And the solution is according to the author not optimal. This leaves one wondering how the displayed values are calculated. Tracing down the code results in a hunt through the undocumented package code (the package code is on Github). I.e. we begin with print.ols():
> rms:::print.ols
function (x, digits = 4, long = FALSE, coefs = TRUE, title = "Linear Regression Model",
...)
{
latex <- prType() == "latex"
k <- 0
z <- list()
if (length(zz <- x$na.action)) {
k <- k + 1
z[[k]] <- list(type = paste("naprint", class(zz)[1],
sep = "."), list(zz))
}
stats <- x$stats
...
Reading further we do find that e.g. R2 adj. is calculated in the print function:
rsqa <- 1 - (1 - r2) * (n - 1) / rdf
We also find some standard error calculations, though no p values.
se <- sqrt(diag(x$var))
z[[k]] <- list(type='coefmatrix',
list(coef = x$coefficients,
se = se,
errordf = rdf))
All the results are passed down further to prModFit(). We can look it up and find the p value calculation etc. Unfortunately, the print command returns NULL so these values are not available anywhere for programmatic reuse:
> x = print((fit = rms::ols(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species, data = iris)))
#printed output...
> x
NULL
How does one get all the statistics?
Here is a hack solution where we capture the output of the print command:
#parser
get_model_stats = function(x, precision=60) {
# remember old number formatting function
# (which would round and transforms p-values to formats like "<0.01")
old_format_np = rms::formatNP
# substitute it with a function which will print out as many digits as we want
assignInNamespace("formatNP", function(x, ...) formatC(x, format="f", digits=precision), "rms")
# remember old width setting
old_width = options('width')$width
# substitute it with a setting making sure the table will not wrap
options(width=old_width + 4 * precision)
# actually print the data and capture it
cap = capture.output(print(x))
# restore original settings
options(width=old_width)
assignInNamespace("formatNP", old_format_np, "rms")
#model stats
stats = c()
stats$R2.adj = str_match(cap, "R2 adj\\s+ (\\d\\.\\d+)") %>% na.omit() %>% .[, 2] %>% as.numeric()
#coef stats lines
coef_lines = cap[which(str_detect(cap, "Coef\\s+S\\.E\\.")):(length(cap) - 1)]
#parse
coef_lines_table = suppressWarnings(readr::read_table(coef_lines %>% stringr::str_c(collapse = "\n")))
colnames(coef_lines_table)[1] = "Predictor"
list(
stats = stats,
coefs = coef_lines_table
)
}
Example:
> get_model_stats(fit)
$stats
$stats$R2.adj
[1] 0.86
$coefs
# A tibble: 6 x 5
Predictor Coef S.E. t `Pr(>|t|)`
<chr> <dbl> <dbl> <dbl> <chr>
1 Intercept 2.17 0.280 7.8 <0.0001
2 Sepal.Width 0.50 0.086 5.8 <0.0001
3 Petal.Length 0.83 0.069 12.1 <0.0001
4 Petal.Width -0.32 0.151 -2.1 0.0389
5 Species=versicolor -0.72 0.240 -3.0 0.0031
6 Species=virginica -1.02 0.334 -3.1 0.0026
This still has issues, e.g. p values are not returned as numerics and only has 4 digits, which can cause issues in some situations. The updated code should extract digits up to arbitrary precision.
Be extra careful when using this with long variable names as those could wrap the table into multiple rows and introduce missing values (NA) in output even though the stats are in there!
Package broom is a great way to extract model info.
library(pacman)
library(rms)
library(broom)
fit = ols(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species,
data = iris)
tidy(summary.lm(fit))
# term estimate std.error statistic p.value
# 1 Intercept 2.1712663 0.27979415 7.760227 1.429502e-12
# 2 Sepal.Width 0.4958889 0.08606992 5.761466 4.867516e-08
# 3 Petal.Length 0.8292439 0.06852765 12.100867 1.073592e-23
# 4 Petal.Width -0.3151552 0.15119575 -2.084418 3.888826e-02
# 5 Species=versicolor -0.7235620 0.24016894 -3.012721 3.059634e-03
# 6 Species=virginica -1.0234978 0.33372630 -3.066878 2.584344e-03
glance(fit)
# r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC df.residual
# 1 0.8673123 0.862705 0.3068261 188.251 2.666942e-61 6 -32.55801 79.11602 100.1905 144
The object fit also contains some easily accessible info that you can get and store in a dataframe:
fit$coefficients
# Intercept Sepal.Width Petal.Length Petal.Width Species=versicolor Species=virginica
# 2.1712663 0.4958889 0.8292439 -0.3151552 -0.7235620 -1.0234978
fit$stats
# n Model L.R. d.f. R2 g Sigma
# 150.0000000 302.9635115 5.0000000 0.8673123 0.8820479 0.3068261
I want to create a data frame using describe() function. Dataset under consideration is iris. The data frame should look like this:
Variable n missing unique Info Mean 0.05 0.1 0.25 0.5 0.75 0.9 0.95
Sepal.Length 150 0 35 1 5.843 4.6 4.8 5.1 5.8 6.4 6.9 7.255
Sepal.Width 150 0 23 0.99 3.057 2.345 2.5 2.8 3 3.3 3.61 3.8
Petal.Length 150 0 43 1 3.758 1.3 1.4 1.6 4.35 5.1 5.8 6.1
Petal.Width 150 0 22 0.99 1.199 0.2 0.2 0.3 1.3 1.8 2.2 2.3
Species 150 0 3
Is there a way out to coerce the output of describe() to data.frame type? When I try to coerce, I get an error as shown below:
library(Hmisc)
statistics <- describe(iris)
statistics[1]
first_vec <- statistics[1]$Sepal.Length
as.data.frame(first_vec)
#Error in as.data.frame.default(first_vec) : cannot coerce class ""describe"" to a data.frame
Thanks
The way to figure this out is to examine the objects with str():
data(iris)
library(Hmisc)
di <- describe(iris)
di
# iris
#
# 5 Variables 150 Observations
# -------------------------------------------------------------
# Sepal.Length
# n missing unique Info Mean .05 .10 .25 .50 .75 .90 .95
# 150 0 35 1 5.843 4.600 4.800 5.100 5.800 6.400 6.900 7.255
#
# lowest : 4.3 4.4 4.5 4.6 4.7, highest: 7.3 7.4 7.6 7.7 7.9
# -------------------------------------------------------------
# ...
# -------------------------------------------------------------
# Species
# n missing unique
# 150 0 3
#
# setosa (50, 33%), versicolor (50, 33%)
# virginica (50, 33%)
# -------------------------------------------------------------
str(di)
# List of 5
# $ Sepal.Length:List of 6
# ..$ descript : chr "Sepal.Length"
# ..$ units : NULL
# ..$ format : NULL
# ..$ counts : Named chr [1:12] "150" "0" "35" "1" ...
# .. ..- attr(*, "names")= chr [1:12] "n" "missing" "unique" "Info" ...
# ..$ intervalFreq:List of 2
# .. ..$ range: atomic [1:2] 4.3 7.9
# .. .. ..- attr(*, "Csingle")= logi TRUE
# .. ..$ count: int [1:100] 1 0 3 0 0 1 0 0 4 0 ...
# ..$ values : Named chr [1:10] "4.3" "4.4" "4.5" "4.6" ...
# .. ..- attr(*, "names")= chr [1:10] "L1" "L2" "L3" "L4" ...
# ..- attr(*, "class")= chr "describe"
# $ Sepal.Width :List of 6
# ...
# $ Species :List of 5
# ..$ descript: chr "Species"
# ..$ units : NULL
# ..$ format : NULL
# ..$ counts : Named num [1:3] 150 0 3
# .. ..- attr(*, "names")= chr [1:3] "n" "missing" "unique"
# ..$ values : num [1:2, 1:3] 50 33 50 33 50 33
# .. ..- attr(*, "dimnames")=List of 2
# .. .. ..$ : chr [1:2] "Frequency" "%"
# .. .. ..$ : chr [1:3] "setosa" "versicolor" "virginica"
# ..- attr(*, "class")= chr "describe"
# - attr(*, "descript")= chr "iris"
# - attr(*, "dimensions")= int [1:2] 150 5
# - attr(*, "class")= chr "describe"
We see that di is a list of lists. We can take it apart by looking at just the first sublist. You can convert that into a vector:
unlist(di[[1]])
# descript counts.n
# "Sepal.Length" "150"
# counts.missing counts.unique
# "0" "35"
# counts.Info counts.Mean
# "1" "5.843"
# counts..05 counts..10
# "4.600" "4.800"
# counts..25 counts..50
# "5.100" "5.800"
# counts..75 counts..90
# "6.400" "6.900"
# counts..95 intervalFreq.range1
# "7.255" "4.3"
# intervalFreq.range2 intervalFreq.count1
# "7.9" "1"
# ...
# values.H3 values.H2
# "7.6" "7.7"
# values.H1
# "7.9"
str(unlist(di[[1]]))
# Named chr [1:125] "Sepal.Length" "150" "0" "35" ...
# - attr(*, "names")= chr [1:125] "descript" "counts.n" "counts.missing" "counts.unique" ...
It is very, very long (125). The elements have been coerced to all be of the same (and most inclusive) type, namely, character. It seems you want the 2nd through 12th elements:
unlist(di[[1]])[2:12]
# counts.n counts.missing counts.unique counts.Info
# "150" "0" "35" "1"
# counts.Mean counts..05 counts..10 counts..25
# "5.843" "4.600" "4.800" "5.100"
# counts..50 counts..75 counts..90
# "5.800" "6.400" "6.900"
Now you have something you can start to work with. But notice that this only seems to be the case for numerical variables; the factor variable species is different:
unlist(di[[5]])
# descript counts.n counts.missing counts.unique
# "Species" "150" "0" "3"
# values1 values2 values3 values4
# "50" "33" "50" "33"
# values5 values6
# "50" "33"
In that case, it seems you only want elements two through four.
Using this process of discovery and problem solving, you can see how you'd take the output of describe apart and put the information you want into a data frame. However, this will take a lot of work. You'll presumably need to use loops and lots of if(){ ... } else{ ... } blocks. You might just want to code your own dataset description function from scratch.
You can do this by using the stat.desc function from the pastecs package:
library(pastecs)
summary_df <- stat.desc(mydata)
The summary_df is the dataframe you wanted. See more info here.
In R, you just have to use the summary(iris) function instead of describe(iris) function in Python.
The standard way of doing a linear regression is something like this:
l <- lm(Sepal.Width ~ Petal.Length + Petal.Width, data=iris)
and then use predict(l, new_data) to make predictions, where new_data is a dataframe with columns matching the formula. But lm() returns an lm object, which is a list that contains crap-loads of stuff that is mostly irrelevant in most situations. This includes a copy of the original data, and a bunch of named vectors and arrays the length/size of the data:
R> str(l)
List of 12
$ coefficients : Named num [1:3] 3.587 -0.257 0.364
..- attr(*, "names")= chr [1:3] "(Intercept)" "Petal.Length" "Petal.Width"
$ residuals : Named num [1:150] 0.2 -0.3 -0.126 -0.174 0.3 ...
..- attr(*, "names")= chr [1:150] "1" "2" "3" "4" ...
$ effects : Named num [1:150] -37.445 -2.279 -0.914 -0.164 0.313 ...
..- attr(*, "names")= chr [1:150] "(Intercept)" "Petal.Length" "Petal.Width" "" ...
$ rank : int 3
$ fitted.values: Named num [1:150] 3.3 3.3 3.33 3.27 3.3 ...
..- attr(*, "names")= chr [1:150] "1" "2" "3" "4" ...
$ assign : int [1:3] 0 1 2
$ qr :List of 5
..$ qr : num [1:150, 1:3] -12.2474 0.0816 0.0816 0.0816 0.0816 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : chr [1:150] "1" "2" "3" "4" ...
.. .. ..$ : chr [1:3] "(Intercept)" "Petal.Length" "Petal.Width"
.. ..- attr(*, "assign")= int [1:3] 0 1 2
..$ qraux: num [1:3] 1.08 1.1 1.01
..$ pivot: int [1:3] 1 2 3
..$ tol : num 1e-07
..$ rank : int 3
..- attr(*, "class")= chr "qr"
$ df.residual : int 147
$ xlevels : Named list()
$ call : language lm(formula = Sepal.Width ~ Petal.Length + Petal.Width, data = iris)
$ terms :Classes 'terms', 'formula' length 3 Sepal.Width ~ Petal.Length + Petal.Width
.. ..- attr(*, "variables")= language list(Sepal.Width, Petal.Length, Petal.Width)
.. ..- attr(*, "factors")= int [1:3, 1:2] 0 1 0 0 0 1
.. .. ..- attr(*, "dimnames")=List of 2
.. .. .. ..$ : chr [1:3] "Sepal.Width" "Petal.Length" "Petal.Width"
.. .. .. ..$ : chr [1:2] "Petal.Length" "Petal.Width"
.. ..- attr(*, "term.labels")= chr [1:2] "Petal.Length" "Petal.Width"
.. ..- attr(*, "order")= int [1:2] 1 1
.. ..- attr(*, "intercept")= int 1
.. ..- attr(*, "response")= int 1
.. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
.. ..- attr(*, "predvars")= language list(Sepal.Width, Petal.Length, Petal.Width)
.. ..- attr(*, "dataClasses")= Named chr [1:3] "numeric" "numeric" "numeric"
.. .. ..- attr(*, "names")= chr [1:3] "Sepal.Width" "Petal.Length" "Petal.Width"
$ model :'data.frame': 150 obs. of 3 variables:
..$ Sepal.Width : num [1:150] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
..$ Petal.Length: num [1:150] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
..$ Petal.Width : num [1:150] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
..- attr(*, "terms")=Classes 'terms', 'formula' length 3 Sepal.Width ~ Petal.Length + Petal.Width
.. .. ..- attr(*, "variables")= language list(Sepal.Width, Petal.Length, Petal.Width)
.. .. ..- attr(*, "factors")= int [1:3, 1:2] 0 1 0 0 0 1
.. .. .. ..- attr(*, "dimnames")=List of 2
.. .. .. .. ..$ : chr [1:3] "Sepal.Width" "Petal.Length" "Petal.Width"
.. .. .. .. ..$ : chr [1:2] "Petal.Length" "Petal.Width"
.. .. ..- attr(*, "term.labels")= chr [1:2] "Petal.Length" "Petal.Width"
.. .. ..- attr(*, "order")= int [1:2] 1 1
.. .. ..- attr(*, "intercept")= int 1
.. .. ..- attr(*, "response")= int 1
.. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
.. .. ..- attr(*, "predvars")= language list(Sepal.Width, Petal.Length, Petal.Width)
.. .. ..- attr(*, "dataClasses")= Named chr [1:3] "numeric" "numeric" "numeric"
.. .. .. ..- attr(*, "names")= chr [1:3] "Sepal.Width" "Petal.Length" "Petal.Width"
- attr(*, "class")= chr "lm"
That stuff takes up a lot of space, and the lm object ends up being almost an order of magnitude larger than the original dataset:
R> object.size(iris)
7088 bytes
R> object.size(l)
52704 bytes
This isn't a problem with a dataset as small as that, but it can be really problematic with a 170Mb dataset that produces a 450mb lm object. Even with all the return options set to false, the lm object is still 5 times the original dataset:
R> ls <- lm(Sepal.Width ~ Petal.Length + Petal.Width, data=iris, model=FALSE, x=FALSE, y=FALSE, qr=FALSE)
R> object.size(ls)
30568 bytes
Is there any way of fitting a model in R, and then being able to predict output values on new input data, without storing crap tonnes of extra unnecessary data? In other words, is there a way to just store the model coefficients, but still be able to use those coefficients to predict on new data?
Edit: I guess, as well as not storing all that excess data, I'm also really interested in a way of using lm so that it doesn't even calculate that data - it's just wasted CPU time...
You can use biglm:
m <- biglm(Sepal.Length ~ Petal.Length + Petal.Width, iris)
Since biglm does not store the data in the output object you need to provide your data when making predictions:
p <- predict(m, newdata=iris)
The amount of data biglm uses is proportional to the number of parameters:
> object.size(m)
6720 bytes
> d <- rbind(iris, iris)
> m <- biglm(Sepal.Width ~ Petal.Length + Petal.Width, data=d)
> object.size(m)
6720 bytes
biglm also allows you to update the model with a new chunk of data using the update method. Using this you can also estimate models when the complete dataset does not fit in memory.
The only components of the lm object that you need to calculate predicted values are terms and coefficients. However, you'll need to roll your own prediction function as predict.lm complains if you delete the qr component (which is needed to compute term-by-term effects and standard errors). Something like this should do.
m <- lm(Sepal.Length ~ Petal.Length + Petal.Width, iris)
m$effects <- m$fitted.values <- m$residuals <- m$model <- m$qr <-
m$rank <- m$assign <- NULL
predict0 <- function(object, newdata)
{
mm <- model.matrix(terms(object), newdata)
mm %*% object$coefficients
}
predict0(m, iris[1:10,])
I think there are two approaches to deal with this:
Use lm and trim the fat afterwards. For quite nice and instructive discussions, see e.g. here and here. This will not solve the "computation time" issue.
Do not use lm.
If you go for the second option, you could easily write up the matrix operations yourself so that you only get the predicted values. If you prefer to use a canned routine, you could try other packages that implement least squares, e.g. fastLm from the RcppArmadillo-package (or the Eigen version of it, or as others pointed out biglm), which stores much less information. Using this approach has some benefits, e.g. providing a formula-interface and such things. fastLm is also quite fast, if computation time is a concern for you.
For comparison, here a small benchmark:
l <- lm(Sepal.Width ~ Petal.Length + Petal.Width, data=iris)
library(biglm)
m <- biglm(Sepal.Length ~ Petal.Length + Petal.Width, iris)
library(RcppArmadillo)
a <- fastLm(Sepal.Length ~ Petal.Length + Petal.Width, iris)
object.size(l)
# 52704 bytes
object.size(m)
# 6664 bytes
object.size(a)
# 6344 bytes
Consider this use of ggplot(...) inside a function.
x <- seq(1,10,by=0.1)
df <- data.frame(x,y1=x, y2=cos(2*x)/(1+x))
library(ggplot2)
gg.fun <- function(){
i=2
plot(ggplot(df,aes(x=x,y=df[,i]))+geom_line())
}
if(exists("i")) remove(i)
gg.fun()
# Error in `[.data.frame`(df, , i) : object 'i' not found
i=3
gg.fun() # plots df[,3] vs. x
It looks like ggplot does not recognize the variable i defined inside the function, but does recognize i if it is defined in the global environment. Why is that?
Note that this gives the expected result.
gg.new <- function(){
i=2
plot(ggplot(data.frame(x=df$x,y=df[,i]),aes(x,y)) + geom_line())
}
if(exists("i")) remove(i)
gg.new() # plots df[,2] vs. x
i=3
gg.new() # also plots df[,2] vs. x
Let's return a non-rendered ggplot object to see what's going on:
gg.str <- function() {
i=2
str(ggplot(df,aes(x=x,y=df[,i]))+geom_line())
}
gg.str()
List of 9
$ data :'data.frame': 91 obs. of 3 variables:
..$ x : num [1:91] 1 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 ...
..$ y1: num [1:91] 1 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 ...
..$ y2: num [1:91] -0.208 -0.28 -0.335 -0.373 -0.393 ...
$ layers :List of 1
..$ :Classes 'proto', 'environment' <environment: 0x0000000009886ca0>
$ scales :Reference class 'Scales' [package "ggplot2"] with 1 fields
..$ scales: list()
..and 21 methods, of which 9 are possibly relevant:
.. add, clone, find, get_scales, has_scale, initialize, input, n, non_position_scales
$ mapping :List of 2
..$ x: symbol x
..$ y: language df[, i]
$ theme : list()
$ coordinates:List of 1
..$ limits:List of 2
.. ..$ x: NULL
.. ..$ y: NULL
..- attr(*, "class")= chr [1:2] "cartesian" "coord"
$ facet :List of 1
..$ shrink: logi TRUE
..- attr(*, "class")= chr [1:2] "null" "facet"
$ plot_env :<environment: R_GlobalEnv>
$ labels :List of 2
..$ x: chr "x"
..$ y: chr "df[, i]"
- attr(*, "class")= chr [1:2] "gg" "ggplot"
As we can see, mapping for y is simply an unevaluated expression. Now, when we ask to do the actual plotting, the expression is evaluated within plot_env, which is global. I do not know why it is done so; I believe there are reasons for that.
Here's a demo that can override this behaviour:
gg.envir <- function(envir=environment()) {
i=2
p <- ggplot(df,aes(x=x,y=df[,i]))+geom_line()
p$plot_env <- envir
plot(p)
}
# evaluation in local environment; ok
gg.envir()
# evaluation in global environment (same as default); fails if no i
gg.envir(environment())