I tried to run an ANOVA on several different sets of data and didn't quite know how to do it. I googled and found this page useful: https://stats.idre.ucla.edu/r/codefragments/looping_strings/
hsb2 <- read.csv("https://stats.idre.ucla.edu/stat/data/hsb2.csv")
names(hsb2)
varlist <- names(hsb2)[8:11]
models <- lapply(varlist, function(x) {
lm(substitute(read ~ i, list(i = as.name(x))), data = hsb2)
})
My understanding of what the above code does is that it applies a function wrapping lm() to each variable in varlist, running a linear regression of read on each of them.
So I thought using aov instead of lm would work for me, like this:
aov(substitute(read ~ i, list(i = as.name(x))), data = hsb2)
However, I got this error:
Error in terms.default(formula, "Error", data = data) :
no terms component nor attribute
I have no idea of where the error comes from. Please help!
The problem is that substitute() returns an expression, not a formula. I think @thelatemail's suggestion of
lm(as.formula(paste("read ~",x)), data = hsb2)
is a good work around. Alternatively you could evaluate the expression to get the formula with
models <- lapply(varlist, function(x) {
aov(eval(substitute(read ~ i, list(i = as.name(x)))), data = hsb2)
})
I guess it depends on what you want to do with the list of models afterward. Doing
models <- lapply(varlist, function(x) {
eval(bquote(aov(read ~ .(as.name(x)), data = hsb2)))
})
gives a "cleaner" call property for each of the result.
This should do it. The varlist vector is passed item by item to the function, which subsets hsb2 down to that column plus "read". The lm function therefore only sees a two-column data frame, with "read" as the dependent variable each time. No need for fancy substitution:
models <- sapply(varlist, function(x) {
lm(read ~ ., data = hsb2[, c("read", x) ])
}, simplify=FALSE)
> summary(models[[1]]) # The first model. Note the use of "[["
Call:
lm(formula = read ~ ., data = hsb2[, c("read", x)])
Residuals:
Min 1Q Median 3Q Max
-19.8565 -5.8976 -0.8565 5.5801 24.2703
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 18.16215 3.30716 5.492 1.21e-07 ***
write 0.64553 0.06168 10.465 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 8.248 on 198 degrees of freedom
Multiple R-squared: 0.3561, Adjusted R-squared: 0.3529
F-statistic: 109.5 on 1 and 198 DF, p-value: < 2.2e-16
For all the models:
lapply(models, summary)
akrun borrowed my answer the other night; now I'm (partially) borrowing his.
do.call puts the variables into the call output so it reads properly. Here's a general function for simple regression.
doModel <- function(col1, col2, data = hsb2, FUNC = "lm")
{
form <- as.formula(paste(col1, "~", col2))
do.call(FUNC, list(form, substitute(data)))
}
lapply(varlist, doModel, col1 = "read")
# [[1]]
#
# Call:
# lm(formula = read ~ write, data = hsb2)
#
# Coefficients:
# (Intercept) write
# 18.1622 0.6455
#
#
# [[2]]
#
# Call:
# lm(formula = read ~ math, data = hsb2)
#
# Coefficients:
# (Intercept) math
# 14.0725 0.7248
#
# ...
# ...
# ...
Note: As thelatemail mentions in his comment
sapply(varlist, doModel, col1 = "read", simplify = FALSE)
will keep the names in the list and also allow list$name subsetting.
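For example, with the hsb2 data above (varlist is write, math, science, socst):
models <- sapply(varlist, doModel, col1 = "read", simplify = FALSE)
names(models)
# [1] "write"   "math"    "science" "socst"
models$write    # same model as models[["write"]]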
Related
I would like to analyse many x variables (400 variables) against one y variable (1 variable). However, I do not want to write a new model for each and every x variable. Is it possible to write one piece of code that then checks all x variables against y in RStudio?
Here is an approach where we use a function that regresses all variables in a data frame on a dependent variable from the same data frame that is passed as an argument to the function.
We use lapply() to drive lm() because it will return the resulting model objects as a list, and we are able to easily name the resulting list so we can extract models by independent variable name.
regList <- function(dataframe,depVar) {
indepVars <- names(dataframe)[!(names(dataframe) %in% depVar)]
modelList <- lapply(indepVars,function(x){
lm(dataframe[[depVar]] ~ dataframe[[x]],data=dataframe)
})
# name list elements based on independent variable names
names(modelList) <- indepVars
modelList
}
We demonstrate the function with the mtcars data frame, assigning the mpg column as the dependent variable.
modelList <- regList(mtcars,"mpg")
At this point the modelList object contains 10 models, one for each variable in the mtcars data frame other than mpg. We can access the individual models by independent variable name, or by index.
# print the model where cyl is independent variable
summary(modelList[["cyl"]])
...and the output:
> summary(modelList[["cyl"]])
Call:
lm(formula = dataframe[[depVar]] ~ dataframe[[x]], data = dataframe)
Residuals:
Min 1Q Median 3Q Max
-4.9814 -2.1185 0.2217 1.0717 7.5186
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.8846 2.0738 18.27 < 2e-16 ***
dataframe[[x]] -2.8758 0.3224 -8.92 6.11e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.206 on 30 degrees of freedom
Multiple R-squared: 0.7262, Adjusted R-squared: 0.7171
F-statistic: 79.56 on 1 and 30 DF, p-value: 6.113e-10
Extracting the content
Saving the output in a list() enables us to do things like find the model with the highest R^2 without having to visually grep through all the printed summaries.
First, we extract the r.squared value from each model summary and save the results to a vector.
r.squareds <- unlist(lapply(modelList,function(x) summary(x)$r.squared))
Because we used names() to name elements in the original list, R automatically saves the variable names to the element names of the vector. This comes in handy when we sort the vector by descending order of R^2 and print the first element of the resulting vector.
r.squareds[order(r.squareds,decreasing=TRUE)][1]
...and the winner (not surprisingly) is wt.
> r.squareds[order(r.squareds,decreasing=TRUE)][1]
wt
0.7528328
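Equivalently, which.max() picks out the same element without sorting the whole vector:
r.squareds[which.max(r.squareds)]
#        wt
# 0.7528328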
If your data frame is DF,
regs <- list()
for (v in setdiff(names(DF), "y")) {
fm <- eval(parse(text = sprintf("y ~ %s", v)))
regs[[v]] <- lm(fm, data=DF)
}
Now you have all simple regression results in the regs list.
Example:
## Generate data
n <- 1000
set.seed(1)
DF <- data.frame(y = rnorm(n))
for (j in seq(400)) DF[[paste0('x',j)]] <- rnorm(n)
## Now data ready
dim(DF)
# [1] 1000 401
head(names(DF))
# [1] "y" "x1" "x2" "x3" "x4" "x5"
tail(names(DF))
# [1] "x395" "x396" "x397" "x398" "x399" "x400"
regs <- list()
for (v in setdiff(names(DF), "y")) {
fm <- eval(parse(text = sprintf("y ~ %s", v)))
regs[[v]] <- lm(fm, data=DF)
}
head(names(regs))
# [1] "x1" "x2" "x3" "x4" "x5" "x6"
r2s <- sapply(regs, function(x) summary(x)$r.squared)
head(r2s, 3)
# x1 x2 x3
# 0.0000409755 0.0024376111 0.0005509134
If you want to fit them in separate models, you can loop over the x variable names and build the formula from each name on each iteration. For example:
x_variables <- c("x_var1", "x_var2", "x_var3", "x_var4", ...)
for(x in x_variables){
  # reformulate() turns the name into the formula y_variable ~ x_var1, etc.
  model <- lm(reformulate(x, response = "y_variable"), data = df)
  print(summary(model))   # summary() must be printed explicitly inside a loop
}
You can fill in the ellipses in the code above with all your other x variables. I hope for your sake that there is some kind of naming convention you can exploit to select the variables using a dplyr verb like starts_with or contains; see the sketch below!
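As a sketch of that idea (hypothetical names: the predictors share an x_var prefix and the data frame is called df):
library(dplyr)
# pull the predictor names by their shared prefix instead of typing them all out
x_variables <- df %>% select(starts_with("x_var")) %>% names()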
If you hope to include all the x variables in the same model, you just add them in as you normally would. For example (assuming you want to use an OLS, but the same premise would work for other types):
model <- lm(y_variable ~
            x_var1 + x_var2 + x_var3 + x_var4 + ..., data = df)
summary(model)
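If the y variable and the x variables are the only columns in df, the dot shorthand saves listing them at all (same hypothetical names as above):
model <- lm(y_variable ~ ., data = df)   # "." means every column of df except the response
summary(model)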
I have a data frame and I would like to write a function that takes column names from its header and returns linear models.
I'm trying this:
a <- function(j,k){
reg1 <- lm(data$j ~ data$k)
summary(reg1)
}
a(j="hour",k="score")
It fails because 'data$j' is NULL.
You cannot use $ when the column name is passed as a variable. Here are a couple of ways in which you can do this.
Use reformulate to create a formula object
a <- function(data, j,k){
reg1 <- lm(reformulate(k, j), data = data)
summary(reg1)
}
lm also accepts the formula as a string, so you don't necessarily need to convert it into a formula object.
a <- function(data, j,k){
reg1 <- lm(sprintf('%s~%s', j, k), data = data)
summary(reg1)
}
You can call this as:
a(mtcars, 'mpg', 'cyl')
#Call:
#lm(formula = sprintf("%s~%s", j, k), data = data)
#Residuals:
# Min 1Q Median 3Q Max
#-4.9814 -2.1185 0.2217 1.0717 7.5186
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 37.8846 2.0738 18.27 < 2e-16 ***
#cyl -2.8758 0.3224 -8.92 6.11e-10 ***
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Residual standard error: 3.206 on 30 degrees of freedom
#Multiple R-squared: 0.7262, Adjusted R-squared: 0.7171
#F-statistic: 79.56 on 1 and 30 DF, p-value: 6.113e-10
Note that I added data as an additional argument to the function. It is generally better practice to pass the data object into the function rather than relying on it being found in the global environment.
Try this. If you are going to use strings that name variables in a data frame, it is better to access them inside a function using [[ ]]. Here is the code of your function with slight changes:
a <- function(j,k){
reg1 <- lm(data[[j]] ~ data[[k]])
summary(reg1)
}
a(j="hour",k="score")
And a small example using the iris dataset:
#Example
data=iris
#Code
a(j="Sepal.Length",k="Petal.Length")
You can further tune your function a if needed.
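One possible further tuning (a hypothetical variant, combining the [[ ]] idea with the earlier advice to pass the data explicitly) is to take the data frame as an argument and return the fitted model as well as printing its summary:
a <- function(j, k, data){
  reg1 <- lm(data[[j]] ~ data[[k]])
  print(summary(reg1))
  invisible(reg1)   # return the model so it can be reused, e.g. with predict() or plot()
}
fit <- a(j = "Sepal.Length", k = "Petal.Length", data = iris)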
Being aware of the danger of using dynamic variable names, I am trying to loop over various regression models where different variable specifications are chosen. Usually !!rlang::sym() solves this kind of problem for me just fine, but it somehow fails in regressions. A minimal example would be the following:
y= runif(1000)
x1 = runif(1000)
x2 = runif(1000)
df2= data.frame(y,x1,x2)
summary(lm(y ~ x1+x2, data=df2)) ## works
var = "x1"
summary(lm(y ~ !!rlang::sym(var) + x2, data=df2)) # gives an error
My understanding was that !!rlang::sym(var) takes the value of var (namely x1) and puts it into the code in a way that R treats it as a variable (not a character string). But I seem to be wrong. Can anyone enlighten me?
Personally, I like to do this with some computing on the language. For me, a combination of bquote with eval is easiest (to remember).
var <- as.symbol(var)
eval(bquote(summary(lm(y ~ .(var) + x2, data = df2))))
#Call:
#lm(formula = y ~ x1 + x2, data = df2)
#
#Residuals:
# Min 1Q Median 3Q Max
#-0.49298 -0.26248 -0.00046 0.24111 0.51988
#
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.50244 0.02480 20.258 <2e-16 ***
#x1 -0.01468 0.03161 -0.464 0.643
#x2 -0.01635 0.03227 -0.507 0.612
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#Residual standard error: 0.2878 on 997 degrees of freedom
#Multiple R-squared: 0.0004708, Adjusted R-squared: -0.001534
#F-statistic: 0.2348 on 2 and 997 DF, p-value: 0.7908
I find this superior to any approach that doesn't show the same call as summary(lm(y ~ x1+x2, data=df2)).
The bang-bang operator !! only works with "tidy" functions. It's not a part of the core R language. A base R function like lm() has no idea how to expand such operators. Instead, you need to wrap those in functions that can do the expansion. rlang::expr is one such example:
rlang::expr(summary(lm(y ~ !!rlang::sym(var) + x2, data=df2)))
# summary(lm(y ~ x1 + x2, data = df2))
Then you need to use rlang::eval_tidy to actually evaluate it
rlang::eval_tidy(rlang::expr(summary(lm(y ~ !!rlang::sym(var) + x2, data=df2))))
# Call:
# lm(formula = y ~ x1 + x2, data = df2)
#
# Residuals:
# Min 1Q Median 3Q Max
# -0.49178 -0.25482 0.00027 0.24566 0.50730
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.4953683 0.0242949 20.390 <2e-16 ***
# x1 -0.0006298 0.0314389 -0.020 0.984
# x2 -0.0052848 0.0318073 -0.166 0.868
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.2882 on 997 degrees of freedom
# Multiple R-squared: 2.796e-05, Adjusted R-squared: -0.001978
# F-statistic: 0.01394 on 2 and 997 DF, p-value: 0.9862
You can see this version preserves the expanded formula in the model object.
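As an aside (this assumes a reasonably recent rlang, which provides inject()), the expr()/eval_tidy() pair can be collapsed into a single call that enables !! injection directly:
rlang::inject(summary(lm(y ~ !!rlang::sym(var) + x2, data = df2)))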
1) Just use lm(df2), or, if df2 has additional columns beyond those shown in the question but we just want to regress on x1 and x2, then
df3 <- df2[c("y", var, "x2")]
lm(df3)
The following are optional and only apply if it is important that the formula appear in the output as if it had been explicitly given.
Compute the formula fo using the first line below and then run lm as in the second line:
fo <- formula(model.frame(df3))
fm <- do.call("lm", list(fo, quote(df3)))
or just run lm as in the first line below and then write the desired call into the result as in the second line:
fm <- lm(df3)
fm$call <- call("lm", formula = formula(model.frame(df3)), data = quote(df3))
Either one gives this:
> fm
Call:
lm(formula = y ~ x1 + x2, data = df3)
Coefficients:
(Intercept) x1 x2
0.44752 0.04278 0.05011
2) Character string: lm accepts a character string for the formula, so this also works. The fn$ prefix causes substitution to occur in the character arguments.
library(gsubfn)
fn$lm("y ~ $var + x2", quote(df2))
or at the expense of more involved code, without gsubfn:
do.call("lm", list(sprintf("y ~ %s + x2", var), quote(df2)))
or if you don't care that the formula displays without var substituted then just:
lm(sprintf("y ~ %s + x2", var), df2)
I have an outcome variable, say Y, and a list of 20 variables that could affect Y (say X1...X20). I would like to test which variables are NOT independent of Y. To do this I want to run a univariable glm for each variable and Y (i.e. Y~X1, ..., Y~X20) and then do a likelihood ratio test for each model. Finally I would like to create a table that has the resulting p-value from the likelihood ratio test for each model.
From what I have seen the lapply function and split function could be useful for this but I don't really understand how they work in the examples I've seen.
This is what I tried at first:
> VarNames<-c(names(data[30:47]))
> glms<-glm(intBT~VarNames,family=binomial(logit))
Error in model.frame.default(formula = intBT ~ VarNames, drop.unused.levels = TRUE) :
variable lengths differ (found for 'VarNames')
I'm not sure if that was a good approach though.
It is easier to answer your questions if you provide a minimal example.
One way to go - but certainly not the most beautiful - is to use paste to create the formulas as a vector of strings and then use lapply on them. The code could look like this:
example.data <- data.frame(intBT=1:10, bli=1:10, bla=1:10, blub=1:10)
var.names <- c('bli', 'bla', 'blub')
formulas <- paste('intBT ~', var.names)
fitted.models <- lapply(formulas, glm, data=example.data)
This gives a list of fitted models. You can then use the apply functions on fitted.models to run further tests.
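For instance, since you mention likelihood ratio tests, here is a sketch of that step (for a binary outcome you would add family = binomial when fitting; anova() on a glm with test = "Chisq" compares each one-variable model against its null):
lrt.pvalues <- sapply(fitted.models, function(m) anova(m, test = "Chisq")[2, "Pr(>Chi)"])
names(lrt.pvalues) <- var.names
lrt.pvalues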
Like Paul said, it really helps if you provide a minimal example, but I think this does what you want.
set.seed(123)
N <- 100
num_vars <- 5
df <- data.frame(replicate(num_vars, rnorm(N)))   # num_vars columns of N random draws each
names(df) <- paste0("X", 1:num_vars)
e <- rnorm(N)
y <- as.numeric((df$X1 + df$X2 + e) > 0.5)
singlevar <- function(var, y, df){
  model <- as.formula(paste0("y ~ ", var))
  # p-value of the single predictor: row `var`, column 4 ("Pr(>|z|)") of the coefficient table
  coef(summary(glm(model, family = "binomial", data = df)))[var, 4]
}
sapply(colnames(df), singlevar, y, df)
X1 X2 X3 X4 X5
1.477199e-04 4.193461e-05 8.885365e-01 9.064953e-01 9.702645e-01
For comparison:
Call:
glm(formula = y ~ X2, family = "binomial", data = df)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.0674 -0.8211 -0.5296 0.9218 2.5463
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.5591 0.2375 -2.354 0.0186 *
X2 1.2871 0.3142 4.097 4.19e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 130.68 on 99 degrees of freedom
Residual deviance: 106.24 on 98 degrees of freedom
AIC: 110.24
Number of Fisher Scoring iterations: 4
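If you want the result as the table you describe rather than a named vector, a small reshape of the sapply output does it:
pvals <- sapply(colnames(df), singlevar, y, df)
results <- data.frame(variable = names(pvals), p.value = unname(pvals), row.names = NULL)
results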
I'm trying to write a function that regresses multiple items, then tries to predict data based on the model:
"tnt" <- function(train_dep, train_indep, test_dep, test_indep)
{
y <- train_dep
x <- train_indep
mod <- lm (y ~ x)
estimate <- predict(mod, data.frame(x=test_indep))
rmse <- sqrt(sum((test_dep-estimate)^2)/length(test_dep))
print(summary(mod))
print(paste("RMSE: ", rmse))
}
If I pass it the following, it fails:
train_dep = vector1
train_indep <- cbind(vector2, vector3)
test_dep = vector4
test_indep <- cbind(vector5, vector6)
tnt(train_dep, train_indep, test_dep, test_indep)
Changing the above to something like the following works, but I want this done dynamically so I can pass it a matrix of any number of columns:
x1 = x[,1]
x2 = x[,2]
mod <- lm(y ~ x1+x2)
estimate <- predict(mod, data.frame(x1=test_indep[,1], x2=test_indep[,2]))
Looks like this could help, but I'm still confused on the rest of the process: http://finzi.psych.upenn.edu/R/Rhelp02a/archive/70843.html
Try this instead:
tnt <- function(train_dep, train_indep, test_dep, test_indep)
{ dat<- as.data.frame(cbind(y=train_dep, train_indep))
mod <- lm (y ~ . , data=dat )
newdat <- as.data.frame(test_indep)
names(newdat) <- names(dat)[2:length(dat)]
estimate <- predict(mod, newdata=newdat )
rmse <- sqrt(sum((test_dep-estimate)^2)/length(test_dep))
print(summary(mod))
print(paste("RMSE: ", rmse))
}
Call:
lm(formula = y ~ ., data = dat)
Residuals:
1 2 3
0 0 0
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0 0 NA NA
V2 1 0 Inf <2e-16 ***
V3 NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0 on 1 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: Inf on 1 and 1 DF, p-value: < 2.2e-16
[1] "RMSE: 0"
Warning message:
In predict.lm(mod, newdata = newdat) :
prediction from a rank-deficient fit may be misleading
>
The warning arises because the toy data give an exact, rank-deficient fit.
Modified using the as.formula suggestion in the comments. Roman's comment above about passing everything as one data.frame and using the . notation in formulas is probably the best solution, but I implemented it with paste because you should know how to use paste and as.formula :-).
tnt <- function(train_dep, train_indep, test_dep, test_indep) {
  # build the formula from the predictor column names, e.g. "train_dep ~ a + b"
  form <- as.formula(paste("train_dep ~", paste(colnames(train_indep), collapse = " + ")))
  train <- data.frame(train_dep = train_dep, as.data.frame(train_indep))
  mod <- lm(form, data = train)
  # newdata must carry the same column names the formula refers to
  newdat <- as.data.frame(test_indep)
  names(newdat) <- colnames(train_indep)
  estimate <- predict(mod, newdata = newdat)
  rmse <- sqrt(sum((test_dep - estimate)^2)/length(test_dep))
  print(summary(mod))
  print(paste("RMSE: ", rmse))
}
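A hypothetical usage sketch (toy data invented here, not from the question), just to show the calling pattern end to end:
set.seed(1)
vector2 <- rnorm(20); vector3 <- rnorm(20)
vector1 <- 2 * vector2 + rnorm(20)
vector5 <- rnorm(10); vector6 <- rnorm(10)
vector4 <- 2 * vector5 + rnorm(10)
tnt(train_dep = vector1, train_indep = cbind(vector2, vector3),
    test_dep = vector4, test_indep = cbind(vector5, vector6))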