How can I use the ddply function to fit a linear model?
x1 <- c(1:10, 1:10)
x2 <- c(1:5, 1:5, 1:5, 1:5)
x3 <- c(rep(1,5), rep(2,5), rep(1,5), rep(2,5))
set.seed(123)
y <- rnorm(20, 10, 3)
mydf <- data.frame(x1, x2, x3, y)
require(plyr)
ddply(mydf, mydf$x3, .fun = lm(mydf$y ~ mydf$X1 + mydf$x2))
This generates this error:
Error in model.frame.default(formula = mydf$y ~ mydf$X1 + mydf$x2,
drop.unused.levels = TRUE) :
invalid type (NULL) for variable 'mydf$X1'
Appreciate your help.
Here is what you need to do.
mods = dlply(mydf, .(x3), lm, formula = y ~ x1 + x2)
mods is a list of two objects containing the regression results. You can extract what you need from mods. For example, if you want to extract the coefficients, you could write
coefs = ldply(mods, coef)
This gives you
x3 (Intercept) x1 x2
1 1 11.71015 -0.3193146 NA
2 2 21.83969 -1.4677690 NA
EDIT. If you want ANOVA, then you can just do
ldply(mods, anova)
x3 Df Sum Sq Mean Sq F value Pr(>F)
1 1 1 2.039237 2.039237 0.4450663 0.52345980
2 1 8 36.654982 4.581873 NA NA
3 2 1 43.086916 43.086916 4.4273907 0.06849533
4 2 8 77.855187 9.731898 NA NA
What Ramnath explained is exactly right. But I'll elaborate a bit.
ddply expects a data frame in and returns a data frame out. The lm() function takes a data frame as input but returns a linear model object. You can see that by looking at the docs for lm via ?lm:
Value
lm returns an object of class "lm" or for multiple responses of class
c("mlm", "lm").
So you can't just shove the lm objects into a data frame. Your choices are either to coerce the output of lm into a data frame or you can shove the lm objects into a list instead of a data frame.
So to illustrate both options:
Here's how to shove the lm objects into a list (very much like what Ramnath illustrated):
outlist <- dlply(mydf, "x3", function(df) lm(y ~ x1 + x2, data=df))
On the flip side, if you want to extract only the coefficients you can create a function that runs the regression and then returns only the coefficients in the form of a data frame like this:
myLm <- function( formula, df ){
lmList <- lm(formula, data=df)
lmOut <- data.frame(t(lmList$coefficients))
names(lmOut) <- c("intercept","x1coef","x2coef")
return(lmOut)
}
outDf <- ddply(mydf, "x3", function(df) myLm(y ~ x1 + x2, df))
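For anyone without plyr, the two options above can also be sketched in base R. This is only an illustration, assuming split() plus lapply() is an acceptable stand-in for dlply() and that row-binding the coefficient vectors is an acceptable stand-in for ddply(). The names fits and coef_table are made up here.

```r
# Same toy data as in the question
x1 <- c(1:10, 1:10)
x2 <- c(1:5, 1:5, 1:5, 1:5)
x3 <- c(rep(1, 5), rep(2, 5), rep(1, 5), rep(2, 5))
set.seed(123)
y <- rnorm(20, 10, 3)
mydf <- data.frame(x1, x2, x3, y)

# List route: split by x3 and keep one lm object per group
fits <- lapply(split(mydf, mydf$x3), function(d) lm(y ~ x1 + x2, data = d))

# Data route: reduce each model to its coefficients and stack them
coef_table <- do.call(rbind, lapply(fits, coef))
coef_table
```

In this data x2 is an exact linear function of x1 within each group, so its coefficient comes back as NA in both rows, as in the plyr output.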
Use this
mods <- dlply(mydf, .(x3), lm, formula = y ~ x1 + x2)
coefs <- llply(mods, coef)
$`1`
(Intercept) x1 x2
11.7101519 -0.3193146 NA
$`2`
(Intercept) x1 x2
21.839687 -1.467769 NA
anovas <- llply(mods, anova)
$`1`
Analysis of Variance Table
Response: y
Df Sum Sq Mean Sq F value Pr(>F)
x1 1 2.039 2.0392 0.4451 0.5235
Residuals 8 36.655 4.5819
$`2`
Analysis of Variance Table
Response: y
Df Sum Sq Mean Sq F value Pr(>F)
x1 1 43.087 43.087 4.4274 0.0685 .
Residuals 8 77.855 9.732
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Related
I am trying to create a function that passes a parameter in as the dependent variable with the independent variables staying the same.
I have tried to use {{}}, but what I really need is something like the code below, if select(contains()) were allowed inside a formula.
test_func <- function(dataframe, dependent){
model <- tidy(lm({{ dependent }} ~ . - select(contains("x")), data = dataframe))
return(model)
}
test_func(datasets::anscombe, x1)
The function should keep the signature function(dataframe, dependent) and return a single model.
Use reformulate().
f <- function(d, y) lm(reformulate(names(d)[grep("x", names(d))], response=y), data=d)
f(datasets::anscombe, "y1")
# Call:
# lm(formula = reformulate(names(d)[grep("x", names(d))], response = y),
# data = d)
#
# Coefficients:
# (Intercept) x1 x2 x3 x4
# 4.33291 0.45073 NA NA -0.09873
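One cosmetic drawback of the call above is that the printed model shows the reformulate() expression rather than the formula itself. A possible variation, sketched here under the assumption that the predictors are exactly the columns whose names start with "x", builds the formula first and splices it into the call with bquote(). The name f2 is hypothetical.

```r
f2 <- function(d, y) {
  # Columns whose names start with "x" are taken to be the predictors
  predictors <- grep("^x", names(d), value = TRUE)
  fml <- reformulate(predictors, response = y)
  # Splice the finished formula into the call so it is what gets printed
  eval(bquote(lm(.(fml), data = d)))
}

fit <- f2(datasets::anscombe, "y1")
formula(fit)
```

The response is still passed as a string, which is what reformulate() expects.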
I have a list of formulas I want to fit to data. Rather than running a loop, I'd like to do this all at once for performance's sake. The estimations should still be separate; I'm not trying to estimate a SUR or anything.
The following code does what I want
x <- matrix(rnorm(300),ncol=3)
y <- x %*% c(1,2,3)+rnorm(100)
formulae <-list(y~x[,1],
y~x[,2],
y~x[,1] + x[,2])
lapply(formulae,lm)
Unfortunately this gets somewhat slow as the length of formulae increases. Is there a way to truly vectorize this?
If it is any help, the only results of lm I care about are coefficients, and some standard errors.
As I said in my comment, what you really need is a fitting routine that is more efficient than lm() yet still stable. Here I provide a well-tested one I wrote myself, called lm.chol(). It takes a formula and data, and returns:
a coefficient summary table, as you normally see in summary(lm(...))$coef;
Pearson estimate of residual standard error, as you get from summary(lm(...))$sigma;
adjusted-R.squared, as you get from summary(lm(...))$adj.r.squared.
## linear model estimation based on pivoted Cholesky factorization with Jacobi preconditioner
lm.chol <- function(formula, data) {
## stage0: get response vector and model matrix
## we did not follow the normal route: match.call, model.frame, model.response, model matrix, etc
y <- data[[as.character(formula[[2]])]]
X <- model.matrix(formula, data)
n <- nrow(X); p <- ncol(X)
## stage 1: XtX and Jacobi diagonal preconditioner
XtX <- crossprod(X)
D <- 1 / sqrt(diag(XtX))
## stage 2: pivoted Cholesky factorization
R <- suppressWarnings(chol(t(D * t(D * XtX)), pivot = TRUE))
piv <- attr(R, "pivot")
r <- attr(R, "rank")
if (r < p) {
warning("Model is rank-deficient!")
piv <- piv[1:r]
R <- R[1:r, 1:r]
}
## stage 3: solve linear system for coefficients
D <- D[piv]
b <- D * crossprod(X, y)[piv]
z <- forwardsolve(t(R), b)
RSS <- sum(y * y) - sum(z * z)
sigma <- sqrt(RSS / (n - r))
para <- D * backsolve(R, z)
beta.hat <- rep(NA, p)
beta.hat[piv] <- para
## stage 4: get standard error
Rinv <- backsolve(R, diag(r))
se <- rep(NA, p)
se[piv] <- D * sqrt(rowSums(Rinv * Rinv)) * sigma
## stage 5: t-statistic and p-value
t.statistic <- beta.hat / se
p.value <- 2 * pt(-abs(t.statistic), df = n - r)
## stage 6: construct coefficient summary matrix
coefficients <- matrix(c(beta.hat, se, t.statistic, p.value), ncol = 4L)
colnames(coefficients) <- c("Estimate", "Std. Error", "t value", "Pr(>|t|)")
rownames(coefficients) <- colnames(X)
## stage 7: compute adjusted R.squared
adj.R2 <- 1 - sigma * sigma / var(y)
## return model fitting results
attr(coefficients, "sigma") <- sigma
attr(coefficients, "adj.R2") <- adj.R2
coefficients
}
Here I would offer three examples.
Example 1: full rank linear model
We take R's built-in dataset trees as an example.
# using `lm()`
summary(lm(Height ~ Girth + Volume, trees))
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 83.2958 9.0866 9.167 6.33e-10 ***
#Girth -1.8615 1.1567 -1.609 0.1188
#Volume 0.5756 0.2208 2.607 0.0145 *
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Residual standard error: 5.056 on 28 degrees of freedom
#Multiple R-squared: 0.4123, Adjusted R-squared: 0.3703
#F-statistic: 9.82 on 2 and 28 DF, p-value: 0.0005868
## using `lm.chol()`
lm.chol(Height ~ Girth + Volume, trees)
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 83.2957705 9.0865753 9.166905 6.333488e-10
#Girth -1.8615109 1.1566879 -1.609346 1.187591e-01
#Volume 0.5755946 0.2208225 2.606594 1.449097e-02
#attr(,"sigma")
#[1] 5.056318
#attr(,"adj.R2")
#[1] 0.3702869
The results are exactly the same!
Example 2: rank-deficient linear model
## toy data
set.seed(0)
dat <- data.frame(y = rnorm(100), x1 = runif(100), x2 = rbeta(100,3,5))
dat$x3 <- with(dat, (x1 + x2) / 2)
## using `lm()`
summary(lm(y ~ x1 + x2 + x3, dat))
#Coefficients: (1 not defined because of singularities)
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.2164 0.2530 0.856 0.394
#x1 -0.1526 0.3252 -0.469 0.640
#x2 -0.3534 0.5707 -0.619 0.537
#x3 NA NA NA NA
#Residual standard error: 0.8886 on 97 degrees of freedom
#Multiple R-squared: 0.0069, Adjusted R-squared: -0.01358
#F-statistic: 0.337 on 2 and 97 DF, p-value: 0.7147
## using `lm.chol()`
lm.chol(y ~ x1 + x2 + x3, dat)
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.2164455 0.2529576 0.8556595 0.3942949
#x1 NA NA NA NA
#x2 -0.2007894 0.6866871 -0.2924030 0.7706030
#x3 -0.3051760 0.6504256 -0.4691944 0.6399836
#attr(,"sigma")
#[1] 0.8886214
#attr(,"adj.R2")
#[1] -0.01357594
#Warning message:
#In lm.chol(y ~ x1 + x2 + x3, dat) : Model is rank-deficient!
Here, lm.chol(), based on Cholesky factorization with complete pivoting, and lm(), based on QR factorization with partial pivoting, have set different coefficients to NA. But the two estimations are equivalent, with the same fitted values and residuals.
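The equivalence can be checked without lm.chol() at all. As a rough sketch, assuming it is enough to compare the two parametrisations that survive the pivoting (x1 + x2 as chosen by lm(), x2 + x3 as chosen by lm.chol()), one can fit both with plain lm() and compare the fits. The names fit_a and fit_b are made up here.

```r
## same toy data as Example 2
set.seed(0)
dat <- data.frame(y = rnorm(100), x1 = runif(100), x2 = rbeta(100, 3, 5))
dat$x3 <- with(dat, (x1 + x2) / 2)

## x3 is a linear combination of x1 and x2, so both models
## span the same column space and only the coefficients differ
fit_a <- lm(y ~ x1 + x2, dat)   # the columns lm() keeps
fit_b <- lm(y ~ x2 + x3, dat)   # the columns lm.chol() keeps

all.equal(fitted(fit_a), fitted(fit_b))
all.equal(residuals(fit_a), residuals(fit_b))
```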
Example 3: performance for large linear models
n <- 10000; p <- 300
set.seed(0)
dat <- as.data.frame(setNames(replicate(p, rnorm(n), simplify = FALSE), paste0("x",1:p)))
dat$y <- rnorm(n)
## using `lm()`
system.time(lm(y ~ ., dat))
# user system elapsed
# 3.212 0.096 3.315
## using `lm.chol()`
system.time(lm.chol(y ~ ., dat))
# user system elapsed
# 1.024 0.028 1.056
lm.chol() is about 3 times faster than lm() here (1.056 s versus 3.315 s elapsed). If you want to know the reason, read my other answer on this.
Remark
I have focused on improving the performance of the computational kernel. You can take this one step further by using Ben Bolker's parallelism suggestion. If my approach gives a 3-fold speedup and parallel computing gives another 3-fold speedup on 4 cores, you end up with a 9-fold speedup!
There's not really an easy way to vectorize this, but the pdredge function from the MuMIn package gives you a pretty easy way to parallelize it. This assumes you have multiple cores on your machine or that you can set up a local cluster in one of the ways supported by the parallel package.
library(parallel)
clust <- makeCluster(2,"PSOCK")
library(MuMIn)
Construct data:
set.seed(101)
x <- matrix(rnorm(300),ncol=3)
y <- x %*% c(1,2,3)+rnorm(100)
It will be easier to do this with a named data frame rather than an anonymous matrix:
df <- setNames(data.frame(y,x),c("y",paste0("x",1:3)))
The cluster nodes all need access to the data set:
clusterExport(clust,"df")
Fit the full model (you could use y~. to fit all variables)
full <- lm(y~x1+x2,data=df,na.action=na.fail)
Now fit all submodels (see ?MuMIn::dredge for many more options to control which submodels are fitted)
p <- pdredge(full,cluster=clust)
coef(p)
## (Intercept) x1 x2
## 3 -0.003805107 0.7488708 2.590204
## 2 -0.028502039 NA 2.665305
## 1 -0.101434662 1.0490816 NA
## 0 -0.140451160 NA NA
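Since the question only needs coefficients and standard errors, another option is to skip the formula machinery and call lm.fit() on design matrices built by hand. The following is a minimal sketch, not a drop-in replacement: it assumes every model is full rank, and it uses column index sets (col_sets, a made-up name) in place of the formulas y~x[,1], y~x[,2] and y~x[,1]+x[,2].

```r
set.seed(101)
x <- matrix(rnorm(300), ncol = 3)
y <- drop(x %*% c(1, 2, 3) + rnorm(100))

## column subsets standing in for the list of formulas
col_sets <- list(1, 2, c(1, 2))

fit_one <- function(cols) {
  X <- cbind(1, x[, cols, drop = FALSE])      # design matrix with intercept
  fit <- lm.fit(X, y)
  p <- fit$rank
  rss <- sum(fit$residuals^2)
  ## (X'X)^{-1} from the R factor of the QR decomposition (full rank assumed)
  XtX_inv <- chol2inv(fit$qr$qr[seq_len(p), seq_len(p), drop = FALSE])
  se <- sqrt(diag(XtX_inv) * rss / fit$df.residual)
  cbind(estimate = unname(fit$coefficients), std.error = se)
}

results <- lapply(col_sets, fit_one)
results[[3]]
```

Each call still runs separately, so this removes per-model overhead rather than vectorizing anything. The same lapply() could be handed to parLapply() from the parallel package.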
So far my code looks like this:
library(rgdal)   # readOGR
library(raster)  # raster, extract
Points = readOGR(dsn = "./Data/filename.shp",layer = "layername",stringsAsFactors = FALSE)
LDI = raster("./Data/filename2.tif")
Points$LDI = extract(LDI, Points)
PointsDF = Points@data
for(i in PointsDF) {
Mod1 = lm(LDI ~ i, data = PointsDF)
Mod2 = lm(LDI ~ 1, data = PointsDF)
anova(Mod1, Mod2)
}
This last part is where I know I'm doing everything wrong. I want to run the anova on every numerical field in the data frame.
You're close. A natural way is to loop over the field names. Although there are many ways to do this, lapply is perhaps the most idiomatic because (a) it uses the field names (rather than field indexes, which can be dangerous) and (b) does not require pre-allocating any structures for the output. The trick is to convert field names into formulas. Again, there are many ways to do this, but a direct way is to assemble the formula as a string.
Here is working code as an example. It produces a list of anova objects.
#
# Create some random data.
#
n <- 20
set.seed(17)
X <- data.frame(Y=rnorm(n), X1=runif(n), X2=1:n, X3=rexp(n))
#
# Loop over the regressors.
# (The base model can be precomputed.)
#
mod.0 <- lm(Y ~ 1, X)
models <- lapply(setdiff(names(X), "Y"), function(s) {
mod.1 <- lm(as.formula(paste("Y ~", s)), X)
anova(mod.0, mod.1)
})
print(models)
Here's the output, displaying this list of three anova results.
[[1]]
Analysis of Variance Table
Model 1: Y ~ 1
Model 2: Y ~ X1
Res.Df RSS Df Sum of Sq F Pr(>F)
1 19 10.1157
2 18 9.6719 1 0.44385 0.826 0.3754
[[2]]
Analysis of Variance Table
Model 1: Y ~ 1
Model 2: Y ~ X2
Res.Df RSS Df Sum of Sq F Pr(>F)
1 19 10.1157
2 18 8.1768 1 1.939 4.2684 0.05353 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
[[3]]
Analysis of Variance Table
Model 1: Y ~ 1
Model 2: Y ~ X3
Res.Df RSS Df Sum of Sq F Pr(>F)
1 19 10.116
2 18 10.081 1 0.034925 0.0624 0.8056
As another example of working with what you have produced, here is sapply being used to print out their p-values:
sapply(models, function(m) m[["Pr(>F)"]][2])
[1] 0.37542968 0.05352883 0.80562894
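The question asked for every numerical field, so a small extension is to pick the regressors by type instead of listing them. This is only a sketch: the extra label column is a hypothetical non-numeric field added to show the filter at work, and reformulate() is used here in place of pasting the formula string.

```r
n <- 20
set.seed(17)
X <- data.frame(Y = rnorm(n), X1 = runif(n), X2 = 1:n, X3 = rexp(n),
                label = sample(letters, n))  # hypothetical non-numeric field

# Keep only the numeric columns, then drop the response itself
is.num <- vapply(X, is.numeric, logical(1))
regressors <- setdiff(names(X)[is.num], "Y")

mod.0 <- lm(Y ~ 1, X)
p.values <- vapply(regressors, function(s) {
  mod.1 <- lm(reformulate(s, response = "Y"), X)
  anova(mod.0, mod.1)[["Pr(>F)"]][2]
}, numeric(1))

data.frame(regressor = regressors, p.value = p.values, row.names = NULL)
```

Using vapply() with numeric(1) makes the loop fail loudly if any model does not return exactly one p-value.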
The issue is that you are not telling the loop what it is iterating over, not defining a formula object for the model call, and not creating an object to store the results.
In this example, "y" holds the name of the variable on the left-hand side of the model, the list "anova.results" stores the anova result for each model, and the counter "ij" indexes into that list. The loop iterates over the column names, using "which" to find the column named in "y" and drop it from the iterator. I am using R's "iris" dataset for the example.
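data(iris)
iris <- iris[,-5]  # drop the non-numeric Species column
y = "Sepal.Length"
# intercept-only model that each single-predictor model is compared against
Mod0 = lm(stats::as.formula(paste(y, "1", sep = "~")), data = iris)
anova.results <- list()
ij=0
for(i in names(iris)[-which(names(iris) == y)]) {
  ij = ij+1
  Mod = lm(stats::as.formula(paste(y, i, sep = "~")), data = iris)
  anova.results[[ij]] <- anova(Mod0, Mod)
}
anova.results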
I have an outcome variable, say Y, and a list of 20 variables that could affect Y (say X1...X20). I would like to test which variables are NOT independent of Y. To do this I want to run a univariable glm for each variable and Y (i.e. Y~X1,...,Y~X20) and then do a likelihood ratio test for each model. Finally I would like to create a table that has the resulting p-value from the likelihood ratio test for each model.
From what I have seen the lapply function and split function could be useful for this but I don't really understand how they work in the examples I've seen.
This is what I tried at first:
> VarNames<-c(names(data[30:47]))
> glms<-glm(intBT~VarNames,family=binomial(logit))
Error in model.frame.default(formula = intBT ~ VarNames, drop.unused.levels = TRUE) :
variable lengths differ (found for 'VarNames')
I'm not sure if that was a good approach though.
It is easier to answer your questions if you provide a minimal example.
One way to go, though certainly not the most beautiful, is to use paste to create the formulas as a vector of strings and then use lapply on them. The code for this could look like this:
example.data <- data.frame(intBT=1:10, bli=1:10, bla=1:10, blub=1:10)
var.names <- c('bli', 'bla', 'blub')
formulas <- paste('intBT ~', var.names)
fitted.models <- lapply(formulas, glm, data=example.data)
This gives a list of fitted models. You can then use the apply functions on fitted.models to execute further tests.
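For the likelihood ratio test asked about in the question, one way to fill in the "further tests" step is to compare each univariable model against the intercept-only model with anova(..., test = "Chisq"). This is a sketch only: the binary outcome below is simulated purely for illustration, and the names null.model and lrt.p are made up.

```r
set.seed(1)
n <- 200
example.data <- data.frame(bli = rnorm(n), bla = rnorm(n), blub = rnorm(n))
# Simulated binary outcome (an assumption for this example): depends on bli only
example.data$intBT <- rbinom(n, 1, plogis(example.data$bli))

var.names <- c("bli", "bla", "blub")
null.model <- glm(intBT ~ 1, family = binomial, data = example.data)

# One likelihood ratio test per variable: null model versus single-predictor model
lrt.p <- vapply(var.names, function(v) {
  fit <- glm(reformulate(v, response = "intBT"),
             family = binomial, data = example.data)
  anova(null.model, fit, test = "Chisq")[["Pr(>Chi)"]][2]
}, numeric(1))

data.frame(variable = var.names, p.value = lrt.p, row.names = NULL)
```

The result is the requested table with one p-value per variable.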
Like Paul said it really helps if you provide a minimal example, but I think this does what you want.
set.seed(123)
N <- 100
num_vars <- 5
df <- data.frame(lapply(1:num_vars, function(i) i = rnorm(N)))
names(df) <- c(paste0(rep("X",5), 1:num_vars ))
e <- rnorm(N)
y <- as.numeric((df$X1 + df$X2 + e) > 0.5)
singlevar <- function(var, y, df){
  model <- as.formula(paste0("y ~ ", var))
  # return the Wald p-value (column 4 of the coefficient table) for this variable
  coef(summary(glm(model, family = "binomial", data = df)))[var, 4]
}
sapply(colnames(df), singlevar, y, df)
X1 X2 X3 X4 X5
1.477199e-04 4.193461e-05 8.885365e-01 9.064953e-01 9.702645e-01
For comparison:
Call:
glm(formula = y ~ X2, family = "binomial", data = df)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.0674 -0.8211 -0.5296 0.9218 2.5463
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.5591 0.2375 -2.354 0.0186 *
X2 1.2871 0.3142 4.097 4.19e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 130.68 on 99 degrees of freedom
Residual deviance: 106.24 on 98 degrees of freedom
AIC: 110.24
Number of Fisher Scoring iterations: 4
I'm looking for suggestions on how to deal with NA's in linear regressions when all occurrences of an independent/explanatory variable are NA (i.e. x3 below).
I know the obvious solution would be to exclude the independent/explanatory variable in question from the model, but I am looping through multiple regions and would prefer not to have a different functional form for each region.
Below is some sample data:
set.seed(23409)
n <- 100
time <- seq(1,n, 1)
x1 <- cumsum(runif(n))
y <- .8*x1 + rnorm(n, mean=0, sd=2)
x2 <- seq(1,n, 1)
x3 <- rep(NA, n)
df <- data.frame(y=y, time=time, x1=x1, x2=x2, x3=x3)
# Quick plot of data
library(ggplot2)
library(reshape2)
df.melt <-melt(df, id=c("time"))
p <- ggplot(df.melt, aes(x=time, y=value)) +
geom_line() + facet_grid(variable ~ .)
p
I have read the documentation for lm and tried various na.action settings without success:
lm(y~x1+x2+x3, data=df, singular.ok=TRUE)
lm(y~x1+x2+x3, data=df, na.action=na.omit)
lm(y~x1+x2+x3, data=df, na.action=na.exclude)
lm(y~x1+x2+x3, data=df, singular.ok=TRUE, na.exclude=na.omit)
lm(y~x1+x2+x3, data=df, singular.ok=TRUE, na.exclude=na.exclude)
Is there a way to get lm to run without error and simply return a coefficient for the variable in question that reflects its lack of explanatory power (i.e. either zero or NA)?
Here's one idea:
set.seed(23409)
n <- 100
time <- seq(1,n, 1)
x1 <- cumsum(runif(n))
y <- .8*x1 + rnorm(n, mean=0, sd=2)
x2 <- seq(1,n, 1)
x3 <- rep(NA, n)
df <- data.frame(y=y, time=time, x1=x1, x2=x2, x3=x3)
replaceNA<-function(x){
if(all(is.na(x))){
rep(0,length(x))
} else x
}
lm(y~x1+x2+x3, data= data.frame(lapply(df,replaceNA)))
Call:
lm(formula = y ~ x1 + x2 + x3, data = data.frame(lapply(df, replaceNA)))
Coefficients:
(Intercept) x1 x2 x3
0.05467 1.01133 -0.10613 NA
lm(y~x1+x2, data=df)
Call:
lm(formula = y ~ x1 + x2, data = df)
Coefficients:
(Intercept) x1 x2
0.05467 1.01133 -0.10613
So you replace the variables which contain only NAs with variables which contain only 0s. You get the coefficient value NA, but all the relevant parts of the model fit are the same (except the qr decomposition, but if information about that is needed, it can be easily modified). Note that the component summary(fit)$alias (see ?alias) might be useful.
This seems to relate to your other question: Replace lm coefficients in [r]
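To see which terms were dropped without inspecting the coefficients by hand, the aliased component of the model summary can be read directly (summary(fit)$alias reaches it through partial matching of $). A small sketch on a trimmed copy of the data above, with the time column left out for brevity:

```r
set.seed(23409)
n <- 100
x1 <- cumsum(runif(n))
y <- .8*x1 + rnorm(n, mean=0, sd=2)
df <- data.frame(y=y, x1=x1, x2=seq(1,n, 1), x3=rep(NA, n))

replaceNA <- function(x){
  if(all(is.na(x))) rep(0, length(x)) else x
}

fit <- lm(y~x1+x2+x3, data=data.frame(lapply(df, replaceNA)))

# Named logical vector: TRUE marks coefficients that could not be estimated
summary(fit)$aliased
```

This gives a per-region flag that can be stored next to the coefficients when looping.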
You won't be able to include a column with all NA values. It does strange things to model.matrix:
x1 <- 1:5
x2 <- rep(NA,5)
model.matrix(~x1+x2)
(Intercept) x1 x2TRUE
attr(,"assign")
[1] 0 1 2
attr(,"contrasts")
attr(,"contrasts")$x2
[1] "contr.treatment"
So your alternative is to programmatically create the model formula based on the data.
Something like...
make_formula <- function(variables, data, response = 'y'){
if(missing(data)){stop('data not specified')}
using <- Filter(variables,f= function(i) !all(is.na(data[[i]])))
deparse(reformulate(using, response))
}
variables <- c('x1','x2','x3')
make_formula(variables, data =df)
[1] "y ~ x1 + x2"
I've used deparse to return a character string so that there are no environment issues from creating the formula within the function. lm can happily take a character string which is a valid formula.
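To keep one table layout across regions, the fitted coefficients can be padded back out to the full variable list, with NA for anything that was dropped. A minimal sketch, assuming the regions arrive as a named list of data frames; the data are simulated, and regions and fit_region are made-up names.

```r
set.seed(1)
# Two hypothetical regions; x3 is entirely missing in region B
regions <- list(
  A = data.frame(y = rnorm(30), x1 = rnorm(30), x2 = rnorm(30), x3 = rnorm(30)),
  B = data.frame(y = rnorm(30), x1 = rnorm(30), x2 = rnorm(30), x3 = NA_real_)
)
variables <- c('x1','x2','x3')

fit_region <- function(d) {
  # Same filtering idea as make_formula: drop all-NA variables
  using <- Filter(function(v) !all(is.na(d[[v]])), variables)
  fit <- lm(reformulate(using, response = 'y'), data = d)
  # Pad to the full coefficient set so every region has the same columns
  out <- setNames(rep(NA_real_, length(variables) + 1),
                  c('(Intercept)', variables))
  out[names(coef(fit))] <- coef(fit)
  out
}

coef_table <- t(sapply(regions, fit_region))
coef_table
```

Here the formula is passed to lm as a formula object rather than a deparsed string, which works because the data frame is supplied explicitly.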