I'm having trouble running a robust linear regression model (using rlm from the MASS package) over a list of data frames.
Reproducible example:
var1 <- c(1:100)
var2 <- var1*var1
df1 <- data.frame(var1, var2)
var1 <- var1 + 50
var2 <- var2*2
df2 <- data.frame(var1, var2)
lst1 <- list(df1, df2)
Linear model (works):
lin_mod <- lapply(lst1, lm, formula = var1 ~ var2)
summary(lin_mod[[1]])
My code for the robust model:
rob_mod <- lapply(lst1, MASS::rlm, formula = var1 ~ var2)
gives the following error:
Error in rlm.default(X[[i]], ...) :
argument "y" is missing, with no default
How could I solve this?
The error in my actual data is:
Error in qr.default(x) : NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning message:
In storage.mode(x) <- "double" : NAs introduced by coercion
You can also try a purrr:map solution:
library(tidyverse)
map(lst1, ~ MASS::rlm(var1 ~ var2, data = .))
or as joran commented
map(lst1, MASS:::rlm.formula, formula = var1 ~ var2)
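As you can see here, ?lm provides only a formula interface. In contrast, rlm is a generic with both a formula method and a default (x, y) method, and it dispatches on its first argument. lapply passes each data frame as that first argument, so rlm.default is selected and asks for y. To get the formula method, pass the formula first and supply the data frame as data=.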
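To see the dispatch issue in isolation, here is a minimal sketch with a made-up generic called fit (hypothetical, not part of MASS) that has the same two kinds of methods as rlm. It only illustrates which method gets picked; it does not fit anything.

```r
# Hypothetical generic mimicking rlm's layout: a formula method and a default (x, y) method.
fit <- function(x, ...) UseMethod("fit")
fit.formula <- function(formula, data, ...) "formula method"
fit.default <- function(x, y, ...) "default (x, y) method"

d <- data.frame(a = 1:3, b = 4:6)

# lapply passes the data frame as the first argument, so dispatch sees a
# data.frame, finds no data.frame method, and falls back to fit.default.
via_lapply <- lapply(list(d), fit, formula = a ~ b)

# Wrapping the call puts the formula first, so fit.formula is chosen.
via_wrapper <- lapply(list(d), function(df) fit(a ~ b, data = df))

via_lapply[[1]]
via_wrapper[[1]]
```

lm is not a generic, so in lapply(lst1, lm, formula = var1 ~ var2) the named formula is matched first and the data frame simply falls into the next free argument, data. That is why the same pattern works for lm but not for rlm.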
Your call is missing the data argument. lapply will call FUN with each member of the list as the first argument of FUN but data is the second argument to rlm.
The solution is to define an anonymous function.
rob_mod <- lapply(lst1, function(DF) MASS::rlm(formula = var1 ~ var2, data = DF))
summary(rob_mod[[1]])
#
#Call: rlm(formula = var1 ~ var2, data = DF)
#Residuals:
# Min 1Q Median 3Q Max
#-18.707 -5.381 1.768 6.067 7.511
#
#Coefficients:
# Value Std. Error t value
#(Intercept) 19.6977 1.0872 18.1179
#var2 0.0092 0.0002 38.2665
#
#Residual standard error: 8.827 on 98 degrees of freedom
In order to use a for loop, I'm trying to replace the arguments in this call with variables:
lm(mpg~cylinders, data=Auto)
So I did this:
var1='cylinders'
lm((paste('mpg ~',var1)), data = Auto)
It worked fine.
Now, I wonder how we can replace the terms cylinders+acceleration with var1 and var2.
So I tried the same method. I tried to replace this:
lm(mpg~cylinders+acceleration, data=Auto)
by
var1='cylinders'
var2 = 'acceleration'
lm((paste('mpg ~',var1+var2)), data = Auto)
But I got an error message:
Error in var1 + var2 : non-numeric argument to binary operator
So I want to learn how to build the formula from var1 and var2, so that I can use a for loop afterwards.
Use reformulate to generate the formula.
var1 <- 'cyl'
var2 <- 'disp'
fo <- reformulate(c(var1, var2), "mpg")
lm(fo, mtcars)
Alternatively, you could write it as below. This gives the same fit; the only difference is that the code above shows fo literally in the Call: line of the output, whereas the code below expands fo there.
do.call("lm", list(fo, quote(mtcars)))
giving:
Call:
lm(formula = mpg ~ cyl + disp, data = mtcars)
Coefficients:
(Intercept) cyl disp
34.66099 -1.58728 -0.02058
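Since the goal was a for loop, here is a rough sketch of how reformulate could be used inside one. The predictor names and the choice of looping over pairs are assumptions made for illustration; mtcars stands in for Auto, as in the answer above. The paste line shows the string-based equivalent: the original attempt failed because + was applied to two character strings instead of pasting a "+" between them.

```r
# Assumed set of candidate predictors from mtcars (illustrative choice).
predictors <- c("cyl", "disp", "hp")

# All pairs of predictors, each as a character vector of length 2.
pairs <- combn(predictors, 2, simplify = FALSE)

fits <- list()
for (p in pairs) {
  fo <- reformulate(p, response = "mpg")          # e.g. mpg ~ cyl + disp
  fits[[paste(p, collapse = " + ")]] <- lm(fo, data = mtcars)
}
names(fits)

# String-based equivalent: paste the "+" into the text, then convert.
fo_paste <- as.formula(paste("mpg ~", paste(c("cyl", "disp"), collapse = " + ")))
```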
I am trying to test a whole bunch of different models easily and compare AIC / R-sq values to select the right one. I am having some trouble saving things how I want to between lists and data frames.
data frame I am going to model:
set.seed(1)
df <- data.frame(response=runif(50,min=50,max=100),
var1 = sample(1:20,50,replace=T),
var2 = sample(40:60,50,replace = T))
list of formulas to test:
formulas <- list( response ~ NULL,
response ~ var1,
response ~ var2,
response ~ var1 + var2,
response ~ var1 * var2)
So, what I want to do is create a loop that fits all of these formulas, extracts the formula, AIC, and R-sq values into a table, and lets me sort it to find the best one. The problem I'm having is that I can't extract the formula as the single string "response ~ var1"; if I try to extract it as a character object, it keeps coming out as "response" "~" "var1". Or, if I extract as a list (like below), then it comes out like this:
[[1]]
response ~ NULL
[[2]]
[1] 415.89
[[3]]
[1] 0
And I can't easily plug those list elements into a data frame. Here is what I tried:
selection <- matrix(ncol=3)
colnames(selection) <- c("formula","AIC","R2") # create a df to store results in
for ( i in 1:length(formulas)){
mod <- lm( formula = formulas[[i]], data= df)
mod_vals <- c(extract(formulas[[i]]),
round(AIC(mod),2),
round(summary(mod)$adj.r.squared,2)
)
selection[i,] <- mod_vals[]
}
Any ideas? I don't have to keep it as a for loop either, I just want a way to test a long list of models together.
Thanks.
You could use lapply to loop over each formula, extract the relevant statistics from each model, and bind the resulting one-row data frames together.
do.call(rbind, lapply(formulas, function(x) {
mod <- lm(x, data= df)
data.frame(formula = format(x),
AIC = round(AIC(mod),2),
r_square = round(summary(mod)$adj.r.squared,2))
}))
# formula AIC r_square
#1 response ~ NULL 405.98 0.00
#2 response ~ var1 407.54 -0.01
#3 response ~ var2 407.90 -0.02
#4 response ~ var1 + var2 409.50 -0.03
#5 response ~ var1 * var2 410.36 -0.03
Or with purrr
purrr::map_df(formulas, ~{
mod <- lm(.x, data= df)
data.frame(formula = format(.x),
AIC = round(AIC(mod),2),
r_square = round(summary(mod)$adj.r.squared,2))
})
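The question also asked about sorting the table to find the best model. A minimal sketch of that last step, assuming a lower AIC is the selection criterion; it uses deparse() in place of format() to turn each formula into a single string, response ~ 1 as the intercept-only model, and a shortened formula list:

```r
set.seed(1)
df <- data.frame(response = runif(50, min = 50, max = 100),
                 var1 = sample(1:20, 50, replace = TRUE),
                 var2 = sample(40:60, 50, replace = TRUE))

# Shortened, illustrative list of candidate models.
formulas <- list(response ~ 1,
                 response ~ var1,
                 response ~ var1 + var2)

# One row per model: formula as a single string, AIC, adjusted R-squared.
selection <- do.call(rbind, lapply(formulas, function(f) {
  mod <- lm(f, data = df)
  data.frame(formula = deparse(f),
             AIC = AIC(mod),
             adj_r2 = summary(mod)$adj.r.squared,
             stringsAsFactors = FALSE)
}))

# Sort so the model with the lowest AIC comes first.
selection <- selection[order(selection$AIC), ]
selection
```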
Is there a way to make each variable the target variable in turn and check the regression results against the other, independent variables? For example:
df
Date Var1 Var2 Var3
27/9/2019 12 45 59
28/9/2019 34 43 54
29/9/2019 45 23 40
Usually, if I want to see the relationship between Var1 and Var2, I use the code below:
lm(Var1 ~ Var2, data = myData)
Now I want to see the results for all variables (Var1, Var2 and Var3): in the first instance, Var1 is the dependent variable and the rest (Var2 and Var3) are independent; in the second instance, Var2 is the dependent variable and the rest (Var1 and Var3) are independent; and so on. Is there a way to do this?
You could use something like this to get the formulas you need:
vars <- names(df)[-1] # we can eliminate the dates
forms <- lapply(1:length(vars),
function(i) formula(paste(vars[i], "~", paste(vars[-i], collapse = "+")))
)
Output:
[[1]]
Var1 ~ Var2 + Var3
<environment: 0x7fdaaa63abd0>
[[2]]
Var2 ~ Var1 + Var3
<environment: 0x7fdaaa63c508>
[[3]]
Var3 ~ Var1 + Var2
<environment: 0x7fdaaec0d2a8>
Then you just need to pass each formula into lm in lapply:
mods <- lapply(forms, lm, data = df)
Output:
[[1]]
Call:
FUN(formula = X[[i]], data = ..1)
Coefficients:
(Intercept) Var2 Var3
196.403 3.514 -5.806
[[2]]
Call:
FUN(formula = X[[i]], data = ..1)
Coefficients:
(Intercept) Var1 Var3
-55.8933 0.2846 1.6522
[[3]]
Call:
FUN(formula = X[[i]], data = ..1)
Coefficients:
(Intercept) Var1 Var2
33.8301 -0.1722 0.6053
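The Call: lines above all read FUN(formula = X[[i]], data = ..1), which makes it hard to tell the models apart. One possible variation, sketched here with made-up random data (the data frame dat and its values are assumptions, not the asker's data), is to build the formulas with reformulate and name the list by response so each fit can be looked up directly:

```r
# Made-up stand-in data with the same column names as in the question.
set.seed(1)
dat <- data.frame(Var1 = rnorm(20), Var2 = rnorm(20), Var3 = rnorm(20))

vars <- names(dat)

# One formula per variable: that variable as response, the rest as predictors.
forms <- lapply(vars, function(v) reformulate(setdiff(vars, v), response = v))
names(forms) <- vars

# Fit each model; the list keeps the response names.
mods <- lapply(forms, function(f) lm(f, data = dat))

# Recover the formula actually used by each fit as a string.
sapply(mods, function(m) deparse(formula(m)))
```

After this, mods$Var2 is the fit with Var2 as the response, and formula(mods$Var2) shows which predictors it used.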
If you want to regress Var1 on all other variables, you can do the following (note that . includes every other column of myData, so drop Date first if it should not be a predictor):
lm(Var1 ~. , data = myData)
If you just want to select more than one variable, then you can also use:
lm(Var1 ~ Var2 + Var3, data = myData)
The following is based on the answers to these questions: 1, 2 and 3. See the explanations therein.
The main difference is that it loops (lapply) through the columns of the input data set and constructs full models with each of those column-vectors as response and all others as predictors. It then runs dredge on each full model fit.
library(MuMIn)
model_list <- lapply(names(df1), function(resp){
fmla <- as.formula(paste(resp, "~ ."))
print(fmla)
full <- lm(fmla, data = df1, na.action = na.fail)
dredge(full)
})
model_list
Test data creation code.
set.seed(1234)
df1 <- replicate(3, sample(10:99, 100, TRUE))
df1 <- as.data.frame(df1)
names(df1) <- paste0("Var", 1:3)
I am writing a custom script to bootstrap standard errors in a GLM in R and receive the following error:
Error in eval(predvars, data, env) : numeric 'envir' arg not of length one
Can someone explain what I am doing wrong? My code:
#Number of simulations
sims<-numbersimsdesired
#Set up place to store data
saved.se<-matrix(NA,sims,numberofcolumnsdesired)
y<-matrix(NA,realdata.rownumber)
x1<-matrix(NA,realdata.rownumber)
x2<-matrix(NA,realdata.rownumber)
#Resample entire dataset with replacement
for (sim in 1:sims) {
fake.data<-sample(1:nrow(data5),nrow(data5),replace=TRUE)
#Define variables for GLM using fake data
y<-realdata$y[fake.data]
x1<-realdata$x1[fake.data]
x2<-realdata$x2[fake.data]
#Run GLM on fake data, extract SEs, save SE into matrix
glm.output<-glm(y ~ x1 + x2, family = "poisson", data = fake.data)
saved.se[sim,]<-summary(glm.output)$coefficients[0,2]
}
An example: if we suppose sims = 1000 and we want 10 columns (suppose instead of x1 and x2, we have x1...x10) the goal is a dataset with 1,000 rows and 10 columns containing each explanatory variable's SEs.
There isn't a reason to reinvent the wheel. Here is an example of bootstrapping the standard error of the intercept with the boot package:
set.seed(42)
counts <- c(18,17,15,20,10,20,25,13,12)
x1 <- 1:9
x2 <- sample(9)
DF <- data.frame(counts, x1, x2)
glm1 <- glm(counts ~ x1 + x2, family = poisson(), data=DF)
summary(glm1)$coef
# Estimate Std. Error z value Pr(>|z|)
#(Intercept) 2.08416378 0.42561333 4.896848 9.738611e-07
#x1 0.04838210 0.04370521 1.107010 2.682897e-01
#x2 0.09418791 0.04446747 2.118131 3.416400e-02
library(boot)
intercept.se <- function(d, i) {
glm1.b <- glm(counts ~ x1 + x2, family = poisson(), data=d[i,])
summary(glm1.b)$coef[1,2]
}
set.seed(42)
boot.intercept.se <- boot(DF, intercept.se, R=999)
#ORDINARY NONPARAMETRIC BOOTSTRAP
#
#
#Call:
#boot(data = DF, statistic = intercept.se, R = 999)
#
#
#Bootstrap Statistics :
# original bias std. error
#t1* 0.4256133 0.103114 0.2994377
Edit:
If you prefer doing it without a package:
n <- 999
set.seed(42)
ind <- matrix(sample(nrow(DF), nrow(DF)*n, replace=TRUE), nrow=n)
boot.values <- apply(ind, 1, function(...) {
i <- c(...)
intercept.se(DF, i)
})
sd(boot.values)
#[1] 0.2994377
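On what went wrong in the question's code: fake.data is a vector of row indices, not a data frame, so passing it as data = produces the 'envir' error, and coefficients[0,2] selects row 0, which returns nothing. Below is a rough base-R sketch of the sims-by-coefficients matrix the question was after, reusing the toy data from this answer. The helper name all_se and the choice of 200 resamples are assumptions for illustration.

```r
set.seed(42)
counts <- c(18, 17, 15, 20, 10, 20, 25, 13, 12)
DF <- data.frame(counts, x1 = 1:9, x2 = sample(9))

# Hypothetical helper: refit on the resampled rows and return every SE.
all_se <- function(d, i) {
  fit <- glm(counts ~ x1 + x2, family = poisson(), data = d[i, ])
  est <- summary(fit)$coefficients
  # Pre-fill with NA so a rank-deficient resample still returns length 3.
  se <- setNames(rep(NA_real_, 3), c("(Intercept)", "x1", "x2"))
  se[rownames(est)] <- est[, "Std. Error"]
  se
}

sims <- 200  # illustrative number of resamples

# Each replicate draws row indices with replacement; rows = sims, cols = coefficients.
saved_se <- t(replicate(sims, all_se(DF, sample(nrow(DF), replace = TRUE))))

dim(saved_se)
colMeans(saved_se, na.rm = TRUE)
```

The key point is subsetting the data frame with the sampled indices (d[i, ]) and passing that data frame to glm, rather than passing the index vector itself.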
Please consider the following example code (from the lm doc):
ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
group <- gl(2,10,20, labels=c("Ctl","Trt"))
weight <- c(ctl, trt)
lm(weight ~ group)
form1 <- weight ~ group
lm(form1)
do.call("lm",list(formula=form1))
With the following (abbreviated) output:
> lm(weight ~ group)
Call:
lm(formula = weight ~ group)
> lm(form1)
Call:
lm(formula = form1)
> do.call("lm",list(formula=form1))
Call:
lm(formula = weight ~ group)
As you can see, the second call to lm does not display the formula, but the variable containing the formula. With some experimenting, I came up with the third solution, but I find that one a bit tedious.
The question hence is: is there another way to make lm display the formula instead of the variable (i.e. without using the do.call method I used above).
(The reason I want this is that I am working on a Sweave document, so I cannot see the calls, only the outputs, and having lm tell you which formula it used is then very handy.)
EVEN BETTER SOLUTION -
Thanks to @Aaron
lm <- function(...) {
mf <- match.call()
mf[[1]] <- quote(stats::lm)
env <- parent.frame()
mf$formula <- eval(mf$formula, env)
eval(mf, env)
}
SOLUTION:
Based on G. Grothendieck's answer, I came up with the following function:
lm <- function(...) {
mf <- match.call()
mf[[1]] <- quote(stats::lm)
env <- parent.frame()
fm <- eval(mf, env)
fm$call$formula <- formula(fm)
fm
}
Put this near the top of your Sweave file:
<<preliminaries,echo=FALSE,results=hide>>=
lm <- function(fo, ...) { fm <- stats::lm(fo, ...); fm$call <- fo; fm }
@
and then call lm normally in the rest of the file.
Using the example in the question and assuming we have defined lm as above:
> lm(form1)
Call:
weight ~ group
Coefficients:
(Intercept) groupTrt
5.032 -0.371
This might be just as tedious, but at least it contains fewer characters ;)
R> eval(call("lm",form1))
Call:
lm(formula = weight ~ group)
Coefficients:
(Intercept) groupTrt
5.032 -0.371
You can modify the print.lm:
body(print.lm)[[2]] <-
as.call(expression({
ca<-x$call
ca$formula <- formula(x)
cat("\nCall:\n", paste(deparse(ca), sep = "\n", collapse = "\n"), "\n\n", sep = "")
})[[1]])
then,
> lm(form1)
Call:
lm(formula = weight ~ group)
Coefficients:
(Intercept) groupTrt
5.032 -0.371
A variant based on the OP's final solution (as of Mar 30).
lm <- function(formula, ...) {
mf <- match.call()
mf[[1]] <- quote(stats::lm)
env <- parent.frame()
mf$formula <- eval(mf$formula, env)
eval(mf, env)
}
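To try the match.call approach without masking lm, the same idea can be sketched under a different name. The name lm_show is hypothetical, chosen here only so that stats::lm stays untouched; the data are the ones from the question.

```r
# Hypothetical wrapper: evaluates the formula argument before handing the
# call to stats::lm, so the stored call contains the formula itself.
lm_show <- function(formula, ...) {
  mf <- match.call()
  mf[[1]] <- quote(stats::lm)           # redirect the call to the real lm
  env <- parent.frame()
  mf$formula <- eval(mf$formula, env)   # replace the variable by its value
  eval(mf, env)
}

ctl <- c(4.17, 5.58, 5.18, 6.11, 4.50, 4.61, 5.17, 4.53, 5.33, 5.14)
trt <- c(4.81, 4.17, 4.41, 3.59, 5.87, 3.83, 6.03, 4.89, 4.32, 4.69)
group <- gl(2, 10, 20, labels = c("Ctl", "Trt"))
weight <- c(ctl, trt)

form1 <- weight ~ group
fit <- lm_show(form1)

# The stored call now carries the expanded formula rather than the name form1.
fit$call
```

Naming the first argument formula matters here: with function(...) alone, mf$formula is only set when the caller writes formula = explicitly.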