I have a data frame and would like to write a function that takes column names (the header names) as arguments and returns a linear model.
I'm trying this:
a <- function(j,k){
reg1 <- lm(data$j ~ data$k)
summary(reg1)
}
a(j="hour",k="score")
This fails because data$j is NULL.
You cannot use $ when the column name is stored in a variable: data$j looks for a column literally named j. Here are a couple of ways to do this.
Use reformulate to create a formula object:
a <- function(data, j,k){
reg1 <- lm(reformulate(k, j), data = data)
summary(reg1)
}
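As a small sketch of what reformulate builds (using the built-in mtcars data; the names f, f2 and fit are only for illustration), the first argument holds the right-hand-side terms and the response goes on the left:

```r
# reformulate(termlabels, response) assembles `response ~ termlabels`
f <- reformulate("cyl", response = "mpg")
print(f)

# Several predictors can be passed as a character vector
f2 <- reformulate(c("cyl", "wt"), response = "mpg")
fit <- lm(f2, data = mtcars)
names(coef(fit))
```

This is why the function above calls reformulate(k, j): k is the predictor and j the response.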
lm also accepts the formula as a string, so you don't necessarily need to convert it into a formula object:
a <- function(data, j,k){
reg1 <- lm(sprintf('%s~%s', j, k), data = data)
summary(reg1)
}
You can call it like this:
a(mtcars, 'mpg', 'cyl')
#Call:
#lm(formula = sprintf("%s~%s", j, k), data = data)
#Residuals:
# Min 1Q Median 3Q Max
#-4.9814 -2.1185 0.2217 1.0717 7.5186
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 37.8846 2.0738 18.27 < 2e-16 ***
#cyl -2.8758 0.3224 -8.92 6.11e-10 ***
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Residual standard error: 3.206 on 30 degrees of freedom
#Multiple R-squared: 0.7262, Adjusted R-squared: 0.7171
#F-statistic: 79.56 on 1 and 30 DF, p-value: 6.113e-10
Note that I added data as an additional argument to the function. It is generally better practice to pass the data object into the function than to rely on it being found in the global environment.
Try this. If you are going to use strings that name variables in a data frame, it is better to access them inside a function with [[ ]]. Here is your function with slight changes:
a <- function(j,k){
reg1 <- lm(data[[j]] ~ data[[k]])
summary(reg1)
}
a(j="hour",k="score")
And a small example using iris dataset:
#Example
data=iris
#Code
a(j="Sepal.Length",k="Petal.Length")
You can further tune your function a if needed.
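To see why the original data$j came back NULL while data[[j]] works, here is a minimal sketch on iris (the variable names j, k and fit are just for illustration). One thing to be aware of with this approach is that the coefficient is labelled with the literal expression rather than the column name:

```r
data <- iris
j <- "Sepal.Length"
k <- "Petal.Length"

# `$` does not evaluate j; it looks for a column literally named "j"
is.null(data$j)

# `[[` evaluates j first, so it returns the Sepal.Length column
identical(data[[j]], data$Sepal.Length)

# The model fits, but the slope is labelled "data[[k]]"
fit <- lm(data[[j]] ~ data[[k]])
names(coef(fit))
```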
I am a beginner in R so I'm sorry if my question is basic and has been answered somewhere else but unfortunately I could not find the answer.
One of my predictor variables, nationality, has 8 levels.
I want to create a user-defined function that loops through each level of my variable nationality, taking one level per regression. I created a list of the levels of the variable nationality as such:
mylist <- list("bangladeshian", "british", "filipino", "indian",
"indonesian", "nigerian", "pakistani", "spanish")
then created a user defined function:
f1 <- function(x) {
l <- summary(glm(smoke ~ I(nationality == mylist[x]),
data=df.subpop, family=binomial(link="probit")))
print(l)
}
f1(2)
f1(2) gives this output:
Call:
glm(formula = smoke ~ I(nationality == mylist[x]),
family = binomial(link = "probit"), data = df.subpop)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.629 -0.629 -0.629 -0.629 1.853
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.9173 0.1659 -5.530 3.21e-08 ***
I(nationality == mylist[x])TRUE -4.2935 376.7536 -0.011 0.991
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 73.809 on 78 degrees of freedom
Residual deviance: 73.416 on 77 degrees of freedom
AIC: 77.416
Number of Fisher Scoring iterations: 14
As you can see, the coefficient for nationality is "I(nationality == mylist[x])TRUE", which is not very informative and requires the user to refer back to the line of code f1(2) and also to mylist to understand which level that coefficient represents. I believe there should be a cleaner and more straightforward way to do this and accurately run a regression for each level without having to call f1() 8 times.
Consider dynamically building the formula with as.formula or reformulate, so that the level name itself appears in the coefficient label:
nationality_levels <- levels(df.subpop$nationality)
f1 <- function(x) {
# BUILD FORMULA (EQUIVALENT CALLS)
f <- as.formula(paste0("smoke ~ I(nationality == '", x, "')"))
f <- reformulate(paste0("I(nationality == '", x, "')"), "smoke")
l <- summary(
glm(f, data=df.subpop, family=binomial(link="probit"))
)
}
reg_list <- lapply(nationality_levels, f1)
reg_list
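Since df.subpop is not available here, the following is only a sketch of the same pattern on mtcars, with am standing in for smoke and a factor version of cyl standing in for nationality (dat, cyl_f, lvls and fit_level are made-up names). Naming the list by level makes each result easy to look up:

```r
# Stand-in data: binary outcome `am`, factor predictor `cyl_f` (assumed for illustration)
dat <- transform(mtcars, cyl_f = factor(cyl))
lvls <- levels(dat$cyl_f)

fit_level <- function(x) {
  # The level name is pasted into the formula, so it shows up in the output
  f <- reformulate(paste0("I(cyl_f == '", x, "')"), response = "am")
  summary(glm(f, data = dat, family = binomial(link = "probit")))
}

# One regression per level, stored under the level's name
reg_list <- setNames(lapply(lvls, fit_level), lvls)
names(reg_list)
coef(reg_list[["4"]])
```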
I'm trying to take all pairs of variables in the mtcars data set and make a linear model using the lm function. But my approach is causing me to lose the formulas when I go to summarize or plot the models. Here is the code that I am using.
library(tidyverse)
my_vars <- names(mtcars)
pairs <- t(combn(my_vars, 2)) # Get all possible pairs of variables
# Create formulas for the lm model
fmls <-
as.tibble(pairs) %>%
mutate(fml = paste(V1, V2, sep = "~")) %>%
select(fml) %>%
.[[1]] %>%
sapply(as.formula)
# Create a linear model for each pair of variables
mods <- lapply(fmls, function(v) lm(data = mtcars, formula = v))
# print the summary of all variables
for (i in 1:length(mods)) {
print(summary(mods[[i]]))
}
(I snagged the idea of using strings to make formulas from the question "Pass a vector of variables into lm() formula".) Here is the output of the summary for the first model (summary(mods[[1]])):
Call:
lm(formula = v, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-4.9814 -2.1185 0.2217 1.0717 7.5186
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.8846 2.0738 18.27 < 2e-16 ***
cyl -2.8758 0.3224 -8.92 6.11e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.206 on 30 degrees of freedom
Multiple R-squared: 0.7262, Adjusted R-squared: 0.7171
F-statistic: 79.56 on 1 and 30 DF, p-value: 6.113e-10
I'm searching for a (perhaps metaprogramming) technique so that the call line looks something like lm(formula = var1 ~ var2, data = mtcars) as opposed to formula = v.
I made pairs into a data frame, to make life easier:
library(tidyverse)
my_vars <- names(mtcars)
pairs <- t(combn(my_vars, 2)) %>%
as.data.frame() # Get all possible pairs of variables
You can do this using eval() which evaluates an expression.
listOfRegs <- apply(pairs, 1, function(pair) {
V1 <- as.character(pair[[1]])
V2 <- as.character(pair[[2]])
# Build the whole lm() call as text so the literal formula is recorded in the model
fit <- eval(parse(text = paste0("lm(", V1, " ~ ", V2, ", data = mtcars)")))
return(fit)
})
lapply(listOfRegs, summary)
Then:
> lapply(listOfRegs, summary)
[[1]]
Call:
lm(formula = mpg ~ cyl, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-4.9814 -2.1185 0.2217 1.0717 7.5186
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.8846 2.0738 18.27 < 2e-16 ***
cyl -2.8758 0.3224 -8.92 6.11e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.206 on 30 degrees of freedom
Multiple R-squared: 0.7262, Adjusted R-squared: 0.7171
F-statistic: 79.56 on 1 and 30 DF, p-value: 6.113e-10
... etc
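If you would rather not paste code together as text, a possible alternative is do.call with a quoted data argument, which also records the expanded formula in the call. This is only a sketch, restricted to the first four columns of mtcars to keep it short (vars, var_pairs and mods are illustrative names):

```r
vars <- names(mtcars)[1:4]
var_pairs <- combn(vars, 2, simplify = FALSE)  # list of character pairs

mods <- lapply(var_pairs, function(p) {
  f <- reformulate(p[2], response = p[1])
  # do.call inserts the formula itself into the call; quote() keeps the
  # data argument as the name `mtcars` instead of the whole data frame
  do.call("lm", list(formula = f, data = quote(mtcars)))
})

mods[[1]]$call
```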
Being aware of the danger of using dynamic variable names, I am trying to loop over various regression models where different variable specifications are chosen. Usually !!rlang::sym() solves this kind of problem for me just fine, but it somehow fails in regressions. A minimal example would be the following:
y= runif(1000)
x1 = runif(1000)
x2 = runif(1000)
df2= data.frame(y,x1,x2)
summary(lm(y ~ x1+x2, data=df2)) ## works
var = "x1"
summary(lm(y ~ !!rlang::sym(var) + x2, data=df2)) # gives an error
My understanding was that !!rlang::sym(var) takes the value of var (namely x1) and puts it into the code in a way that makes R treat it as a variable (not a character string). But I seem to be wrong. Can anyone enlighten me?
Personally, I like to do this with some computing on the language. For me, a combination of bquote with eval is easiest (to remember).
var <- as.symbol(var)
eval(bquote(summary(lm(y ~ .(var) + x2, data = df2))))
#Call:
#lm(formula = y ~ x1 + x2, data = df2)
#
#Residuals:
# Min 1Q Median 3Q Max
#-0.49298 -0.26248 -0.00046 0.24111 0.51988
#
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.50244 0.02480 20.258 <2e-16 ***
#x1 -0.01468 0.03161 -0.464 0.643
#x2 -0.01635 0.03227 -0.507 0.612
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#Residual standard error: 0.2878 on 997 degrees of freedom
#Multiple R-squared: 0.0004708, Adjusted R-squared: -0.001534
#F-statistic: 0.2348 on 2 and 997 DF, p-value: 0.7908
I find this superior to any approach that doesn't show the same call as summary(lm(y ~ x1+x2, data=df2)).
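Since the question is about looping over several specifications, here is a rough sketch of the same bquote idea inside lapply. The data are simulated, and the extra column x3 and the names candidates and fits are assumptions added only for illustration:

```r
set.seed(1)
df2 <- data.frame(y = runif(100), x1 = runif(100),
                  x2 = runif(100), x3 = runif(100))

candidates <- c("x1", "x3")  # variable names to swap into the model

fits <- lapply(candidates, function(v) {
  # .() splices the symbol into the call before it is evaluated
  cl <- bquote(lm(y ~ .(as.symbol(v)) + x2, data = df2))
  eval(cl)
})
names(fits) <- candidates

fits$x3$call
```

Each fitted model then carries the formula with the substituted variable in its call.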
The bang-bang operator !! only works with "tidy" functions. It's not a part of the core R language. A base R function like lm() has no idea how to expand such operators. Instead, you need to wrap those in functions that can do the expansion. rlang::expr is one such example
rlang::expr(summary(lm(y ~ !!rlang::sym(var) + x2, data=df2)))
# summary(lm(y ~ x1 + x2, data = df2))
Then you need to use rlang::eval_tidy to actually evaluate it
rlang::eval_tidy(rlang::expr(summary(lm(y ~ !!rlang::sym(var) + x2, data=df2))))
# Call:
# lm(formula = y ~ x1 + x2, data = df2)
#
# Residuals:
# Min 1Q Median 3Q Max
# -0.49178 -0.25482 0.00027 0.24566 0.50730
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.4953683 0.0242949 20.390 <2e-16 ***
# x1 -0.0006298 0.0314389 -0.020 0.984
# x2 -0.0052848 0.0318073 -0.166 0.868
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.2882 on 997 degrees of freedom
# Multiple R-squared: 2.796e-05, Adjusted R-squared: -0.001978
# F-statistic: 0.01394 on 2 and 997 DF, p-value: 0.9862
You can see this version preserves the expanded formula in the model object.
1) Just use lm(df2): given a data frame as its first argument, lm regresses the first column on the remaining ones. If df2 has additional columns beyond what is shown in the question but we only want to regress on x1 and x2, then:
df3 <- df2[c("y", var, "x2")]
lm(df3)
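As a quick check of that behaviour, the sketch below compares lm(df3) with the explicit formula on simulated data. The column named extra is hypothetical and only there to show that it is dropped:

```r
set.seed(42)
df2 <- data.frame(y = runif(50), x1 = runif(50),
                  x2 = runif(50), extra = runif(50))
var <- "x1"

# Keep only the response (first column) and the wanted predictors
df3 <- df2[c("y", var, "x2")]

fm_df <- lm(df3)                            # first column is the response
fm_formula <- lm(y ~ x1 + x2, data = df2)   # explicit formula

all.equal(coef(fm_df), coef(fm_formula))
```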
The following are optional and only apply if it is important that the formula appear in the output as if it had been explicitly given.
Compute the formula fo using the first line below and then run lm as in the second line:
fo <- formula(model.frame(df3))
fm <- do.call("lm", list(fo, quote(df3)))
or just run lm as in the first line below and then write the formula into it as in the second line:
fm <- lm(df3)
fm$call <- formula(model.frame(df3))
Either one gives this:
> fm
Call:
lm(formula = y ~ x1 + x2, data = df3)
Coefficients:
(Intercept) x1 x2
0.44752 0.04278 0.05011
2) Character string. lm accepts a character string for the formula, so this also works. The fn$ prefix causes substitution to occur in the character arguments.
library(gsubfn)
fn$lm("y ~ $var + x2", quote(df2))
or at the expense of more involved code, without gsubfn:
do.call("lm", list(sprintf("y ~ %s + x2", var), quote(df2)))
or if you don't care that the formula displays without var substituted then just:
lm(sprintf("y ~ %s + x2", var), df2)
I have some data that Excel fits pretty nicely with a logarithmic trend. I want to pass the same data into R and have it tell me the coefficient and intercept. What form should the data be in, and what function should I call to have it figure out the coefficients? Ultimately, I want to do this thousands of times so that I can project into the future.
Passing Excel these values produces this trendline function: y = -0.099ln(x) + 0.7521
Data:
y <- c(0.7521, 0.683478429, 0.643337383, 0.614856858, 0.592765647, 0.574715813,
0.559454895, 0.546235287, 0.534574767, 0.524144076, 0.514708368)
For context, the data points represent % of our user base that are retained on a given day.
The question omitted the value of x but working backwards it seems you were using 1, 2, 3, ... so try the following:
x <- 1:11
y <- c(0.7521, 0.683478429, 0.643337383, 0.614856858, 0.592765647,
0.574715813, 0.559454895, 0.546235287, 0.534574767, 0.524144076,
0.514708368)
fm <- lm(y ~ log(x))
giving:
> coef(fm)
(Intercept) log(x)
0.7521 -0.0990
and
plot(y ~ x, log = "x")
lines(fitted(fm) ~ x, col = "red")
You can get the same results by:
y <- c(0.7521, 0.683478429, 0.643337383, 0.614856858, 0.592765647, 0.574715813, 0.559454895, 0.546235287, 0.534574767, 0.524144076, 0.514708368)
t <- seq(along=y)
> summary(lm(y~log(t)))
Call:
lm(formula = y ~ log(t))
Residuals:
Min 1Q Median 3Q Max
-3.894e-10 -2.288e-10 -2.891e-11 1.620e-10 4.609e-10
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.521e-01 2.198e-10 3421942411 <2e-16 ***
log(t) -9.900e-02 1.261e-10 -784892428 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.972e-10 on 9 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 6.161e+17 on 1 and 9 DF, p-value: < 2.2e-16
For large projects I recommend putting the data into a data frame, like
df <- data.frame(y, t)
lm(formula = y ~ log(t), data=df)
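For the "project into the future" part, a minimal sketch with predict is shown below. The future days 12 to 14 are an arbitrary choice for illustration, and the column is called x here rather than t:

```r
y <- c(0.7521, 0.683478429, 0.643337383, 0.614856858, 0.592765647,
       0.574715813, 0.559454895, 0.546235287, 0.534574767, 0.524144076,
       0.514708368)
df <- data.frame(x = seq_along(y), y = y)

fm <- lm(y ~ log(x), data = df)
coef(fm)

# newdata must contain a column with the same name as the predictor
future <- data.frame(x = 12:14)
projected <- predict(fm, newdata = future)
projected
```

Keeping the data in a data frame is what makes the newdata argument work, because predict matches the predictor by column name.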
I would like the linear model regression command "lm()" to also report information about the confidence intervals.
What file should I modify to get that?
At worst I would need to recompile something, but I hope I could compile only a single file.
What should I do?
Another option would be to create a script that is launched at startup and overrides the regular behaviour of lm. How?
What you can use is something called a function operator. A function operator takes a function as input, adds a bit of functionality and returns a function.
For example, to create a version of lm that always reports the summary:
tweak_lm = function(modify_function) {
function(...) {
result = lm(...)
print(modify_function(result))
result
}
}
summarized_lm = tweak_lm(summary)
lm_res = summarized_lm(mpg ~ wt, mtcars)
Call:
lm(formula = ..1, data = ..2)
Residuals:
Min 1Q Median 3Q Max
-4.5432 -2.3647 -0.1252 1.4096 6.8727
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
wt -5.3445 0.5591 -9.559 1.29e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
> lm_res
Call:
lm(formula = ..1, data = ..2)
Coefficients:
(Intercept) wt
37.285 -5.344
>
Using this approach enables you to create other variants of this:
coef_lm = tweak_lm(coef)
lm_res = coef_lm(mpg ~ wt, mtcars)
(Intercept) wt
37.285126 -5.344472
It is not completely clear what you need, but you can use this approach.
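Since the question asks about confidence intervals, the same operator can be combined with confint from the stats package. This is a sketch under the assumption that printing the intervals at fit time is what is wanted; confint_lm is a made-up name:

```r
tweak_lm <- function(modify_function) {
  function(...) {
    result <- lm(...)
    print(modify_function(result))  # report the extra information
    result                          # still return the fitted model
  }
}

# Variant that prints 95% confidence intervals for the coefficients
confint_lm <- tweak_lm(confint)
lm_res <- confint_lm(mpg ~ wt, mtcars)

# Other levels can be requested from the returned model afterwards
confint(lm_res, level = 0.90)
```

No recompiling of R is needed for this; the wrapper is an ordinary function defined in your own script.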