Modify an R function to add extra output? - r

I would like the linear model regression command lm() to also report confidence intervals.
Which file should I modify to get this?
At worst I would need to recompile something, but I hope I would only have to compile a single file.
What should I do?
Another option would be to create a script that is launched at startup and overrides the regular behaviour of lm. How would I do that?

What you can use is something called a function operator. A function operator takes a function as input, adds a bit of functionality and returns a function.
For example, to create a version of lm that always reports the summary:
tweak_lm = function(modify_function) {
  function(...) {
    result = lm(...)
    print(modify_function(result))
    result
  }
}
summarized_lm = tweak_lm(summary)
lm_res = summarized_lm(mpg ~ wt, mtcars)
Call:
lm(formula = ..1, data = ..2)
Residuals:
Min 1Q Median 3Q Max
-4.5432 -2.3647 -0.1252 1.4096 6.8727
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
wt -5.3445 0.5591 -9.559 1.29e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
> lm_res
Call:
lm(formula = ..1, data = ..2)
Coefficients:
(Intercept) wt
37.285 -5.344
>
Using this approach enables you to create other variants of this:
coef_lm = tweak_lm(coef)
lm_res = coef_lm(mpg ~ wt, mtcars)
(Intercept) wt
37.285126 -5.344472
It is not completely clear what you need, but you can use this approach.
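Since the question asks specifically about confidence intervals, the same idea can be sketched with confint() from the stats package. The wrapper name lm_with_ci and its level argument are made up for this illustration; nothing in R itself is modified or recompiled.

```r
# Hypothetical wrapper (the name lm_with_ci is an assumption, not part of R):
# fit the model with lm(), print the confidence intervals, return the fit.
lm_with_ci <- function(..., level = 0.95) {
  fit <- lm(...)
  # confint() computes intervals for the coefficients from the fitted object
  print(confint(fit, level = level))
  # return the fit invisibly so it is not printed a second time
  invisible(fit)
}

fit <- lm_with_ci(mpg ~ wt, data = mtcars)
```

To have this available in every session, the definition could be placed in a startup file such as ~/.Rprofile. Giving the wrapper its own name is probably safer than masking lm itself, because other code may rely on the standard behaviour.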

Related

Polynomial Regression with for loop

For the Boston dataset I want to perform polynomial regression with degrees 5, 4, 3 and 2. I want to use a loop, but I get this error:
Error in [.data.frame(data, 0, cols, drop = FALSE) :
undefined columns selected
library(caret)
train_control <- trainControl(method = "cv", number=10)
#set.seed(5)
cv <-rep(NA,4)
n=c(5,4,3,2)
for (i in n) {
cv[i]=train(nox ~ poly(dis,degree=i ), data = Boston, trncontrol = train_control, method = "lm")
}
Outside the loop, train(nox ~ poly(dis,degree=i ), data = Boston, trncontrol = train_control, method = "lm") works well.
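Since you are using poly(..., raw = FALSE), you are getting orthogonal polynomials. Hence there is no need for a for loop: fit the maximum degree, since the coefficient estimates of the lower-order terms do not change when higher-order terms are added. (The standard errors shift slightly, because the residual variance is re-estimated for each model.)
Check the quick example below, using lm and the iris dataset: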
summary(lm(Sepal.Length~poly(Sepal.Width, 2), iris))
Call:
lm(formula = Sepal.Length ~ poly(Sepal.Width, 2), data = iris)
Residuals:
Min 1Q Median 3Q Max
-1.63153 -0.62177 -0.08282 0.50531 2.33336
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.84333 0.06692 87.316 <2e-16 ***
poly(Sepal.Width, 2)1 -1.18838 0.81962 -1.450 0.1492
poly(Sepal.Width, 2)2 -1.41578 0.81962 -1.727 0.0862 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.8196 on 147 degrees of freedom
Multiple R-squared: 0.03344, Adjusted R-squared: 0.02029
F-statistic: 2.543 on 2 and 147 DF, p-value: 0.08209
> summary(lm(Sepal.Length~poly(Sepal.Width, 3), iris))
Call:
lm(formula = Sepal.Length ~ poly(Sepal.Width, 3), data = iris)
Residuals:
Min 1Q Median 3Q Max
-1.6876 -0.5001 -0.0876 0.5493 2.4600
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.84333 0.06588 88.696 <2e-16 ***
poly(Sepal.Width, 3)1 -1.18838 0.80687 -1.473 0.1430
poly(Sepal.Width, 3)2 -1.41578 0.80687 -1.755 0.0814 .
poly(Sepal.Width, 3)3 1.92349 0.80687 2.384 0.0184 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.8069 on 146 degrees of freedom
Multiple R-squared: 0.06965, Adjusted R-squared: 0.05054
F-statistic: 3.644 on 3 and 146 DF, p-value: 0.01425
Take a look at the two coefficient tables. The estimates for the intercept and the first two terms are identical; only poly(Sepal.Width, 3)3 was added when a degree of 3 was used. The standard errors differ slightly (0.81962 versus 0.80687) because the residual standard error differs between the two fits. So from the degree 3 fit we can easily tell what the degree 2 estimates look like. Hence no need for a for loop.
Note that you could use several variables in poly, e.g. poly(cbind(Sepal.Width, Petal.Length, Petal.Width), 4), and still be able to easily recover poly(Sepal.Width, 2).
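The claim about nested estimates can be checked directly in base R. The sketch below also shows one way to store a value per degree if a loop is still wanted: the question's cv[i] uses the degree itself (5, 4, 3, 2) as the position in a length-4 vector, whereas seq_along() indexes by position. The residual standard error is used here only as a stand-in for whatever per-degree quantity is of interest; it is not a cross-validation error.

```r
# Orthogonal polynomial fits of degree 2 and 3 on the built-in iris data
fit2 <- lm(Sepal.Length ~ poly(Sepal.Width, 2), data = iris)
fit3 <- lm(Sepal.Length ~ poly(Sepal.Width, 3), data = iris)

# The first three estimates of the degree 3 fit match the degree 2 fit
same_estimates <- all.equal(unname(coef(fit2)), unname(coef(fit3))[1:3])

# Storing one value per degree: index by position, not by degree
degrees <- c(5, 4, 3, 2)
rse <- rep(NA_real_, length(degrees))
for (k in seq_along(degrees)) {
  fit <- lm(Sepal.Length ~ poly(Sepal.Width, degrees[k]), data = iris)
  rse[k] <- sigma(fit)  # residual standard error of this fit
}
```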

error when calling a function with data set and response variable in R

I'm trying to call a function that takes a data set and a response variable as arguments, but I'm getting an error.
call <- function(data,var){
mod_3 <- lm(var ~ . , data = data)
summary(mod_3)
}
call(iris,"Sepal.Length")
Error
Error in model.frame.default(formula = var ~ ., data = data, drop.unused.levels = TRUE) :
variable lengths differ (found for 'Sepal.Length')
Can someone help me to solve this issue?
We can create the formula with paste or reformulate:
call <- function(data,var){
  mod_3 <- lm(as.formula(paste(var, ' ~ .')) , data = data)
  summary(mod_3)
}
Testing:
call(iris,"Sepal.Length")
Call:
lm(formula = as.formula(paste(var, " ~ .")), data = data)
Residuals:
Min 1Q Median 3Q Max
-0.79424 -0.21874 0.00899 0.20255 0.73103
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.17127 0.27979 7.760 1.43e-12 ***
Sepal.Width 0.49589 0.08607 5.761 4.87e-08 ***
Petal.Length 0.82924 0.06853 12.101 < 2e-16 ***
Petal.Width -0.31516 0.15120 -2.084 0.03889 *
Speciesversicolor -0.72356 0.24017 -3.013 0.00306 **
Speciesvirginica -1.02350 0.33373 -3.067 0.00258 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3068 on 144 degrees of freedom
Multiple R-squared: 0.8673, Adjusted R-squared: 0.8627
F-statistic: 188.3 on 5 and 144 DF, p-value: < 2.2e-16
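The answer mentions reformulate but only shows the paste version. A minimal sketch of the reformulate variant follows; the function name fit_on_all is made up here, partly to avoid masking base R's own call() function.

```r
# Hypothetical helper: regress the column named by `var` on all other columns.
fit_on_all <- function(data, var) {
  # reformulate() builds the formula  <var> ~ .  from character input
  fml <- reformulate(".", response = var)
  summary(lm(fml, data = data))
}

res <- fit_on_all(iris, "Sepal.Length")
```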

How do I pass a string inside an R function

I have a data frame and I would like to write a function that takes the names of column headers and returns linear models.
I'm trying this:
a <- function(j,k){
reg1 <- lm(data$j ~ data$k)
summary(reg1)
}
a(j="hour",k="score")
It's NULL for 'data$j'
You cannot use $ when the column name is passed in a variable. Here are a couple of ways in which you can do this.
Use reformulate to create a formula object:
a <- function(data, j,k){
reg1 <- lm(reformulate(k, j), data = data)
summary(reg1)
}
lm also accepts the formula as a string, so you don't necessarily need to convert it into a formula object.
a <- function(data, j,k){
reg1 <- lm(sprintf('%s~%s', j, k), data = data)
summary(reg1)
}
You can call this as :
a(mtcars, 'mpg', 'cyl')
#Call:
#lm(formula = sprintf("%s~%s", j, k), data = data)
#Residuals:
# Min 1Q Median 3Q Max
#-4.9814 -2.1185 0.2217 1.0717 7.5186
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 37.8846 2.0738 18.27 < 2e-16 ***
#cyl -2.8758 0.3224 -8.92 6.11e-10 ***
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Residual standard error: 3.206 on 30 degrees of freedom
#Multiple R-squared: 0.7262, Adjusted R-squared: 0.7171
#F-statistic: 79.56 on 1 and 30 DF, p-value: 6.113e-10
Note that I added data as an additional argument to the function. It is generally better practice to pass the data object into the function rather than relying on it being found in the global environment.
Try this. If you are going to use strings that name variables in a data frame, it is better to access them inside a function using [[ ]]. Here is the code of your function with slight changes:
a <- function(j,k){
reg1 <- lm(data[[j]] ~ data[[k]])
summary(reg1)
}
a(j="hour",k="score")
And a small example using iris dataset:
#Example
data=iris
#Code
a(j="Sepal.Length",k="Petal.Length")
You can further tune your function a if needed.
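The difference between the two accessors can be seen on a toy data frame. The data values below are invented purely for illustration.

```r
# Toy data frame with made-up values
df <- data.frame(hour = c(1, 2, 3, 4), score = c(52, 60, 71, 80))
j <- "hour"

# $ looks for a column literally named "j"; there is none, so this is NULL
dollar_result <- df$j

# [[ ]] looks up the column whose name is the value stored in j
bracket_result <- df[[j]]
```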

Avoid losing formulas when applying the lm function over a list of formulas in R

I'm trying to take all pairs of variables in the mtcars data set and make a linear model using the lm function. But my approach is causing me to lose the formulas when I go to summarize or plot the models. Here is the code that I am using.
library(tidyverse)
my_vars <- names(mtcars)
pairs <- t(combn(my_vars, 2)) # Get all possible pairs of variables
# Create formulas for the lm model
fmls <-
as.tibble(pairs) %>%
mutate(fml = paste(V1, V2, sep = "~")) %>%
select(fml) %>%
.[[1]] %>%
sapply(as.formula)
# Create a linear model for each pair of variables
mods <- lapply(fmls, function(v) lm(data = mtcars, formula = v))
# print the summary of all variables
for (i in 1:length(mods)) {
print(summary(mods[[i]]))
}
(I took the idea of using strings to make formulas from the question "Pass a vector of variables into lm() formula".)
Here is the output of the summary for the first model (summary(mods[[1]])):
Call:
lm(formula = v, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-4.9814 -2.1185 0.2217 1.0717 7.5186
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.8846 2.0738 18.27 < 2e-16 ***
cyl -2.8758 0.3224 -8.92 6.11e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.206 on 30 degrees of freedom
Multiple R-squared: 0.7262, Adjusted R-squared: 0.7171
F-statistic: 79.56 on 1 and 30 DF, p-value: 6.113e-10
I'm searching for a (perhaps metaprogramming) technique so that the call line looks something like lm(formula = var1 ~ var2, data = mtcars) as opposed to formula = v.
I made pairs into a data frame, to make life easier:
library(tidyverse)
my_vars <- names(mtcars)
pairs <- t(combn(my_vars, 2)) %>%
  as.data.frame() # Get all possible pairs of variables
You can do this using eval() which evaluates an expression.
listOfRegs <- apply(pairs, 1, function(pair) {
  # Names of the response and the predictor for this pair
  V1 <- as.character(pair[[1]])
  V2 <- as.character(pair[[2]])
  # Build the lm() call as text, then parse and evaluate it
  fit <- eval(parse(text = paste0("lm(", V1, " ~ ", V2, ", data = mtcars)")))
  return(fit)
})
lapply(listOfRegs, summary)
Then:
> lapply(listOfRegs, summary)
[[1]]
Call:
lm(formula = mpg ~ cyl, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-4.9814 -2.1185 0.2217 1.0717 7.5186
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.8846 2.0738 18.27 < 2e-16 ***
cyl -2.8758 0.3224 -8.92 6.11e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.206 on 30 degrees of freedom
Multiple R-squared: 0.7262, Adjusted R-squared: 0.7171
F-statistic: 79.56 on 1 and 30 DF, p-value: 6.113e-10
... etc
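As an alternative to building code from text, a sketch using do.call() is shown below. Passing the formula object itself and the data as a quoted name means the stored call should show the actual formula and data = mtcars rather than a placeholder variable. Only the first three columns of mtcars are used here to keep the example small; that restriction is an assumption of this sketch, not part of the question.

```r
vars <- names(mtcars)[1:3]   # small subset, for illustration only
pairs <- t(combn(vars, 2))   # one row per pair of variable names

fits <- lapply(seq_len(nrow(pairs)), function(i) {
  # Build the formula from the two names in row i
  fml <- as.formula(paste(pairs[i, 1], "~", pairs[i, 2]))
  # do.call() places the formula object into the call;
  # quote(mtcars) keeps the data argument as a name instead of the full data
  do.call("lm", list(formula = fml, data = quote(mtcars)))
})

first_call <- fits[[1]]$call
```

Printing first_call, or summary(fits[[1]]), shows what ended up in the Call: line.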

R's capture.output() behaves differently when file=NULL vs. file=[file name]

When using capture.output(..., file = NULL) followed by a specification of what line you want captured, then only that line is captured:
capture.output(summary(lm(speed ~ dist, data = cars)), file = NULL)[5]
[1] "Residuals:"
But when a file name is specified, it will capture the entire object:
capture.output(summary(lm(speed ~ dist, data = cars)), file = "Results.txt")[5]
NULL
The content of Results.txt:
Call:
lm(formula = speed ~ dist, data = cars)
Residuals:
Min 1Q Median 3Q Max
-7.5293 -2.1550 0.3615 2.4377 6.4179
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.28391 0.87438 9.474 1.44e-12 ***
dist 0.16557 0.01749 9.464 1.49e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.156 on 48 degrees of freedom
Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
How can I make R and/or capture.output only write the line I want to a file (in this toy example line no. 5)?
I'm afraid you can't do so within capture.output(), but you can simply write the part of capture.output()'s output that you want to a file using, for example, cat():
cat(capture.output(summary(lm(speed ~ dist, data = cars)))[5],file="Results.txt")
When there is a file argument, the side effect of writing the file happens before the extraction with "[" takes place, and capture.output() then returns NULL invisibly, so there is nothing left to subset. You therefore need to capture with file = NULL and write the selected value after it has been returned:
cat(capture.output(summary(lm(speed ~ dist, data = cars)), file = NULL)[5],
    file = "test.txt")
It would be pretty easy to wrap this into a function if you will be needing it repeatedly.
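One possible shape for such a wrapper is sketched below. The function name write_captured_lines and its arguments are invented for this example, and a temporary file is used instead of Results.txt so that nothing is written to the working directory.

```r
# Hypothetical helper: capture printed output, write only selected lines.
write_captured_lines <- function(expr, lines, file) {
  # Capture everything the expression prints, as a character vector
  out <- capture.output(expr)
  # Write just the requested lines to the file
  writeLines(out[lines], con = file)
  # Return the written lines invisibly for further use
  invisible(out[lines])
}

tmp <- tempfile(fileext = ".txt")
write_captured_lines(summary(lm(speed ~ dist, data = cars)), 5, tmp)
```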
