So far my code looks like this:
Points = readOGR(dsn = "./Data/filename.shp", layer = "layername", stringsAsFactors = FALSE)
LDI = raster("./Data/filename2.tif")
Points$LDI = extract(LDI, Points)
PointsDF = Points@data
for(i in PointsDF) {
  Mod1 = lm(LDI ~ i, data = PointsDF)
  Mod2 = lm(LDI ~ 1, data = PointsDF)
  anova(Mod1, Mod2)
}
This last part is where I know I'm doing everything wrong. I want to run the anova on every numerical field in the data frame.
You're close. A natural way is to loop over the field names. Although there are many ways to do this, lapply is perhaps the most idiomatic because (a) it uses the field names (rather than field indexes, which can be dangerous) and (b) does not require pre-allocating any structures for the output. The trick is to convert field names into formulas. Again, there are many ways to do this, but a direct way is to assemble the formula as a string.
Here is working code as an example. It produces a list of anova objects.
#
# Create some random data.
#
n <- 20
set.seed(17)
X <- data.frame(Y=rnorm(n), X1=runif(n), X2=1:n, X3=rexp(n))
#
# Loop over the regressors.
# (The base model can be precomputed.)
#
mod.0 <- lm(Y ~ 1, X)
models <- lapply(setdiff(names(X), "Y"), function(s) {
  mod.1 <- lm(as.formula(paste("Y ~", s)), X)
  anova(mod.0, mod.1)
})
print(models)
Here's the output, displaying this list of three anova results.
[[1]]
Analysis of Variance Table
Model 1: Y ~ 1
Model 2: Y ~ X1
Res.Df RSS Df Sum of Sq F Pr(>F)
1 19 10.1157
2 18 9.6719 1 0.44385 0.826 0.3754
[[2]]
Analysis of Variance Table
Model 1: Y ~ 1
Model 2: Y ~ X2
Res.Df RSS Df Sum of Sq F Pr(>F)
1 19 10.1157
2 18 8.1768 1 1.939 4.2684 0.05353 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
[[3]]
Analysis of Variance Table
Model 1: Y ~ 1
Model 2: Y ~ X3
Res.Df RSS Df Sum of Sq F Pr(>F)
1 19 10.116
2 18 10.081 1 0.034925 0.0624 0.8056
As another example of working with what you have produced, here is sapply being used to print out their p-values:
sapply(models, function(m) m[["Pr(>F)"]][2])
[1] 0.37542968 0.05352883 0.80562894
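Incidentally, pasting strings is only one of the "many ways" mentioned above: base R's reformulate() builds the same formulas without string assembly. A minimal equivalent of the lapply loop, reusing X and mod.0 from the example:
models.2 <- lapply(setdiff(names(X), "Y"), function(s) {
  # reformulate(termlabels, response) constructs the formula Y ~ s
  anova(mod.0, lm(reformulate(s, response = "Y"), X))
})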
The issue is that you are not telling the loop what to iterate over, not defining a formula object for the model call, and not creating an object to store the results.
In this example the "ij" variable is assigning to the list object and storing the anova models, "y" defined as a variable indicating the left-hand side of the model. The list object "anova.results" is storing each model. The index in the loop definition is using "which" to assign which column contains "y" and as such, drops it from the iterator. I am using the R "iris" dataset for the example.
data(iris)
iris <- iris[, -5]   # drop the Species factor, keeping only numeric fields
y = "Sepal.Length"
Mod0 = lm(stats::as.formula(paste(y, "~ 1")), data = iris)   # intercept-only base model
anova.results <- list()
ij = 0
for(i in names(iris)[-which(names(iris) == y)]) {
  ij = ij + 1
  Mod = lm(stats::as.formula(paste(y, i, sep = " ~ ")), data = iris)
  anova.results[[ij]] <- anova(Mod0, Mod)   # compare each model against the base model
}
anova.results
Related
I have a question about how to compare coefficients in a multivariate regression in R.
I conducted a survey in which I measured three different attitudes (scale variables). My goal is to estimate whether some characteristics of the respondents (age, gender, education and ideological position) can explain their (positive/negative) attitudes.
I was advised to conduct a multivariate multiple regression instead of three univariate multiple regression. The code of my multivariate model is:
MMR <- lm(cbind(Attitude_1, Attitude_2, Attitude_3) ~
            Age + Gender + Education + Ideological_position,
          data = survey)
summary(MMR)
What I am trying to do next is to estimate whether the coefficients of let's say 'Gender' are statistically significant across the three individual models.
I found a very clear instruction on how to do this in Stata (https://stats.idre.ucla.edu/stata/dae/multivariate-regression-analysis/), but I don't have a license, so I have to find an alternative in R. I know a similar question has been asked here before (R - Testing equivalence of coefficients in multivariate multiple regression), but the answer was that there does not exist a package (or function) in R which can be used for this purpose. Because this answer was provided a few years back, I was wondering whether in the meantime some new packages or functions have been implemented.
More precisely, I was wondering whether I can use the linearHypothesis() function (https://www.rdocumentation.org/packages/car/versions/3.0-11/topics/linearHypothesis)? I already know that this function allows me to test, for instance, whether the coefficient of Gender equals the coefficient of Education:
linearHypothesis(MMR, "GenderFemale = EducationHigh-educated")
Can I also use this function to test whether the coefficient of Gender in the equation modelling Attitude_1 equals the coefficient of Gender in the equation modelling Attitude_2 or Attitude_3?
Any help would be greatly appreciated!
Since the model presented in the question is not reproducible (the input is missing), let us use this model instead.
fm0 <- lm(cbind(cyl, mpg) ~ wt + hp, mtcars)
We will discuss two approaches, using as our linear hypothesis that the intercepts of the cyl and mpg groups are the same, that the wt slopes are the same, and that the hp slopes are the same.
1) Mean/Variance
In this approach we base the entire comparison only on the coefficients and their variance covariance matrix.
library(car)
v <- vcov(fm0)
co <- setNames(c(coef(fm0)), rownames(v))
h1 <- c("cyl:(Intercept) = mpg:(Intercept)", "cyl:wt = mpg:wt", "cyl:hp = mpg:hp")
linearHypothesis(NULL, h1, coef. = co, vcov. = v)
giving:
Linear hypothesis test
Hypothesis:
cyl:(Intercept) - mpg:(Intercept) = 0
cyl:wt - mpg:wt = 0
cyl:hp - mpg:hp = 0
Model 1: restricted model
Model 2: structure(list(), class = "formula", .Environment = <environment>)
Note: Coefficient covariance matrix supplied.
Df Chisq Pr(>Chisq)
1
2 3 878.53 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
To explain what linearHypothesis is doing, note that in this case the hypothesis matrix is L <- t(c(1, -1)) %x% diag(3). Given v, as a large-sample approximation, L %*% co is distributed N(0, L %*% v %*% t(L)) under the null hypothesis; hence t(L %*% co) %*% solve(L %*% v %*% t(L)) %*% L %*% co is distributed as chi-squared with nrow(L) degrees of freedom.
L <- t(c(1, -1)) %x% diag(3)   # Kronecker product, not a pipe
nrow(L) # degrees of freedom
SSH <- t(L %*% co) %*% solve(L %*% v %*% t(L)) %*% L %*% co # chisq
p <- pchisq(SSH, nrow(L), lower.tail = FALSE) # p value
2) Long form model
With this approach (which is not equivalent to the first one shown above) we convert mtcars from wide to long form, mt2. We show how to do that using reshape or pivot_longer at the end, but for now we will just form it explicitly. Define lhs as the 32x2 matrix on the left-hand side of the fm0 formula, i.e. cbind(cyl, mpg); note that its column names are c("cyl", "mpg"). Stringing out lhs column by column into a vector of length 64, the cyl column followed by the mpg column, gives our new dependent variable y. We also form a grouping variable g., the same length as y, which indicates which column of lhs the corresponding element of y came from.
With mt2 defined we can form fm1. In forming fm1 we use a weight vector w, based on the fm0 sigma values, to reflect the fact that the two groups, cyl and mpg, have different values of sigma, given by the vector sigma(fm0).
We show below that the fm0 and fm1 models have the same coefficients and then run linearHypothesis.
library(car)
lhs <- fm0$model[[1]]
g. <- colnames(lhs)[col(lhs)]
y <- c(lhs)
mt2 <- with(mtcars, data.frame(wt, hp, g., y))
w <- 1 / sigma(fm0)[g.]^2
fm1 <- lm(y ~ g./(wt + hp) + 0, mt2, weights = w)
# note coefficient names
variable.names(fm1)
## [1] "g.cyl" "g.mpg" "g.cyl:wt" "g.mpg:wt" "g.cyl:hp" "g.mpg:hp"
# check that fm0 and fm1 have same coefs
all.equal(c(t(coef(fm0))), coef(fm1), check.attributes = FALSE)
## [1] TRUE
h2 <- c("g.mpg = g.cyl", "g.mpg:wt = g.cyl:wt", "g.mpg:hp = g.cyl:hp")
linearHypothesis(fm1, h2)
giving:
Linear hypothesis test
Hypothesis:
- g.cyl + g.mpg = 0
- g.cyl:wt + g.mpg:wt = 0
- g.cyl:hp + g.mpg:hp = 0
Model 1: restricted model
Model 2: y ~ g./(wt + hp) + 0
Res.Df RSS Df Sum of Sq F Pr(>F)
1 61 1095.8
2 58 58.0 3 1037.8 345.95 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
If L is the hypothesis matrix (the same as L in (1) except that the columns are reordered), q is its number of rows, and n is the number of rows of mt2, then SSH/q is distributed F(q, n-q-1), so we have:
n <- nrow(mt2)
L <- diag(3) %x% t(c(1, -1)) # note difference from (1)
q <- nrow(L)
SSH <- t(L %*% coef(fm1)) %*% solve(L %*% vcov(fm1) %*% t(L)) %*% L %*% coef(fm1)
SSH/q # F value
pf(SSH/q, q, n-q-1, lower.tail = FALSE) # p value
anova
An alternative to linearHypothesis is to define the reduced model and then compare the two models using anova. mt2 and w are from above. No packages are used.
fm2 <- lm(y ~ hp + wt, mt2, weights = w)
anova(fm2, fm1)
giving:
Analysis of Variance Table
Model 1: y ~ hp + wt
Model 2: y ~ g./(wt + hp) + 0
Res.Df RSS Df Sum of Sq F Pr(>F)
1 61 1095.8
2 58 58.0 3 1037.8 345.95 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Alternate wide to long calculation
An alternate way to form mt2 is by reshaping mtcars from wide form to long form using reshape.
mt2a <- mtcars |>
  reshape(dir = "long", varying = list(colnames(lhs)), v.names = "y",
          timevar = "g.", times = colnames(lhs)) |>
  subset(select = c("wt", "hp", "g.", "y"))
or using tidyverse (which has rows in a different order, but that should not matter as long as mt2b is used consistently in forming fm1 and w):
library(dplyr)
library(tidyr)
mt2b <- mtcars %>%
  select(mpg, cyl, wt, hp) %>%
  pivot_longer(all_of(colnames(lhs)), names_to = "g.", values_to = "y")
I would like to analyse many x variables (400 variables) against one y variable. However, I do not want to write a new model for each and every x variable. Is it possible to write one model that checks all x variables against y in RStudio?
Here is an approach using a function that regresses each variable in a data frame on a dependent variable from the same data frame, both passed as arguments to the function.
We use lapply() to drive lm() because it will return the resulting model objects as a list, and we are able to easily name the resulting list so we can extract models by independent variable name.
regList <- function(dataframe, depVar) {
  indepVars <- names(dataframe)[!(names(dataframe) %in% depVar)]
  modelList <- lapply(indepVars, function(x) {
    lm(dataframe[[depVar]] ~ dataframe[[x]], data = dataframe)
  })
  # name list elements based on independent variable names
  names(modelList) <- indepVars
  modelList
}
We demonstrate the function with the mtcars data frame, assigning the mpg column as the dependent variable.
modelList <- regList(mtcars,"mpg")
At this point the modelList object contains 10 models, one for each variable in the mtcars data frame other than mpg. We can access the individual models by independent variable name, or by index.
# print the model where cyl is independent variable
summary(modelList[["cyl"]])
...and the output:
> summary(modelList[["cyl"]])
Call:
lm(formula = dataframe[[depVar]] ~ dataframe[[x]], data = dataframe)
Residuals:
Min 1Q Median 3Q Max
-4.9814 -2.1185 0.2217 1.0717 7.5186
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.8846 2.0738 18.27 < 2e-16 ***
dataframe[[x]] -2.8758 0.3224 -8.92 6.11e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.206 on 30 degrees of freedom
Multiple R-squared: 0.7262, Adjusted R-squared: 0.7171
F-statistic: 79.56 on 1 and 30 DF, p-value: 6.113e-10
Extracting the content
Saving the output in a list() enables us to do things like find the model with the highest R^2 without having to inspect each summary visually.
First, we extract the r.squared value from each model summary and save the results to a vector.
r.squareds <- unlist(lapply(modelList,function(x) summary(x)$r.squared))
Because we used names() to name elements in the original list, R automatically saves the variable names to the element names of the vector. This comes in handy when we sort the vector by descending order of R^2 and print the first element of the resulting vector.
r.squareds[order(r.squareds,decreasing=TRUE)][1]
...and the winner (not surprisingly) is wt.
> r.squareds[order(r.squareds,decreasing=TRUE)][1]
wt
0.7528328
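The same pattern extracts any other summary statistic. For instance, a sketch pulling the slope's p-value out of each model (row 2 of the coefficient matrix is the single predictor in each of these regressions):
pvals <- unlist(lapply(modelList, function(x) summary(x)$coefficients[2, "Pr(>|t|)"]))
pvals[order(pvals)][1:3]   # the three strongest predictors by p-value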
If your data frame is DF,
regs <- list()
for (v in setdiff(names(DF), "y")) {
  fm <- eval(parse(text = sprintf("y ~ %s", v)))
  regs[[v]] <- lm(fm, data = DF)
}
Now you have all simple regression results in the regs list.
Example:
## Generate data
n <- 1000
set.seed(1)
DF <- data.frame(y = rnorm(n))
for (j in seq(400)) DF[[paste0('x',j)]] <- rnorm(n)
## Now data ready
dim(DF)
# [1] 1000 401
head(names(DF))
# [1] "y" "x1" "x2" "x3" "x4" "x5"
tail(names(DF))
# [1] "x395" "x396" "x397" "x398" "x399" "x400"
regs <- list()
for (v in setdiff(names(DF), "y")) {
  fm <- eval(parse(text = sprintf("y ~ %s", v)))
  regs[[v]] <- lm(fm, data = DF)
}
head(names(regs))
# [1] "x1" "x2" "x3" "x4" "x5" "x6"
r2s <- sapply(regs, function(x) summary(x)$r.squared)
head(r2s, 3)
# x1 x2 x3
# 0.0000409755 0.0024376111 0.0005509134
If you want to include them in the models separately, you can just loop over the x variables and add them into the model on each iteration. For example:
x_variables = c("x_var1", "x_var2", "x_var3", "x_var4", ...)
for(x in x_variables){
  # build the formula from the variable name; y_variable ~ x would
  # regress on the literal string x, not the column it names
  model <- lm(as.formula(paste("y_variable ~", x)), data = df)
  print(summary(model))   # summary() alone does not print inside a loop
}
You can fill in the ellipses in the code above with all your other x variables. I hope for your sake that there is some kind of naming convention you can exploit to select the variables with a helper like starts_with or contains inside dplyr's select!
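For instance, if the predictors really do share a prefix (a hypothetical convention for illustration; "x_var" is borrowed from the example above), select() with starts_with() pulls them out in one step:
library(dplyr)
# hypothetical: keep only the columns whose names begin with "x_var"
x_variables <- names(select(df, starts_with("x_var")))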
If you hope to include all the x variables in the same model, you just add them in as you normally would. For example (assuming you want to use an OLS, but the same premise would work for other types):
model <- lm(y_variable ~
            x_var1 + x_var2 + x_var3 + x_var4 + ..., data = df)
summary(model)
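As an aside (a standard formula shorthand, not something from the question): with 400 predictors, if every column of df other than y_variable should enter the model, the dot notation saves writing them out at all:
# "." on the right-hand side expands to every column of df except the response
model <- lm(y_variable ~ ., data = df)
summary(model)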
In R, I am trying to change the variables inside my linear model dynamically. I have saved a character vector of variables that I want to use in my lm as moderating variables. This works well for numeric variables; however, it is not a good solution for factor variables, as R does not know they are factors with levels.
My problem is outlined below with a simple example. Say I have some data here...
yVar <- c(1,2,3,4,5)
xVar <- c(2,1,2,1,2)
numVar1 <- c(1,2,2,3,4)
numVar2 <- c(1,1,2,2,3)
facVar1 <-c(1,2,3,4,5)
facVar2 <-c(1,2,1,2,1)
xVar <- factor(xVar,levels=c(1:2),labels=c("Condition1","Condition2"))
facVar1 <-factor(facVar1,levels=c(1:5),labels=c("red","blue","green","black","yellow"))
facVar2 <-factor(facVar2, levels=c(1:2), labels=c("dog","cat"))
studyData <- data.frame(yVar,xVar,numVar1,numVar2,facVar1,facVar2)
The standard model would look like:
standardModel <- lm(data=studyData, yVar ~ xVar)
summary.aov(standardModel)
I would like to dynamically include a list of moderating variables to use with this model from zList. As so:
zList <- c("numVar1","numVar2","facVar1","facVar2")
And then call variables from the Z list
for (z in zList) {
  lmfit <- lm(as.formula(paste("yVar ~ xVar*", z)), data = studyData)
  print(z)
  print(typeof(z))
  print(levels(z))
  print(summary.aov(lmfit))
}
This gives the output below:
[1] "numVar1"
[1] "character"
NULL
Df Sum Sq Mean Sq F value Pr(>F)
xVar 1 0.000 0.000 0.000 1.000
numVar1 1 9.484 9.484 33.194 0.109
xVar:numVar1 1 0.230 0.230 0.806 0.534
Residuals 1 0.286 0.286
[1] "numVar2"
[1] "character"
NULL
Df Sum Sq Mean Sq F value Pr(>F)
xVar 1 0 0 2.200e-02 0.906
numVar2 1 10 10 1.781e+31 <2e-16 ***
xVar:numVar2 1 0 0 7.560e-01 0.544
Residuals 1 0 0
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
[1] "facVar1"
[1] "character"
NULL
Df Sum Sq Mean Sq
xVar 1 0 0.000
facVar1 3 10 3.333
[1] "facVar2"
[1] "character"
NULL
Df Sum Sq Mean Sq F value Pr(>F)
xVar 1 0 0.000 0 1
Residuals 3 10 3.333
As can be seen, for the numeric variables this solution seems to work (the number of levels is NULL, as it should be, and the lm output looks fine). However, for the factor variables the number of levels is also NULL, so R doesn't know that the variable is of type factor and has levels.
What could I do so that I can run my linear model with variables changing dynamically on the fly, while R still knows each variable's type? Is there a better way of solving this problem?
Thank you in advance for any replies.
If you want the loop to print information about z while fitting the several models, the following code will do it. The vector zList is a character vector, so z is a character string; the underlying variable can be accessed with get(z).
The fitted models will be in list lm_list. Then a sequence of simpler lapply instructions can produce aov objects (in a list, aov_list) or summary statistics.
lm_list <- lapply(zList, function(z) {
  cat("\n", "name:", z, "\n")
  zvar <- get(z)
  cat("typeof:", typeof(zvar), "\n")
  cat("class:", class(zvar), "\n")
  if (is.factor(zvar)) cat("levels:", levels(zvar), "\n")
  fmla <- as.formula(paste("yVar ~ xVar *", z))
  lm(fmla, data = studyData)
})
lm_smry <- lapply(lm_list, summary)
lm_smry
aov_list <- lapply(lm_list, aov)
lapply(aov_list, summary)
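One caveat, as an aside: get(z) succeeds here only because the example defines the vectors in the global environment as well as inside studyData. If the variables existed only as columns of the data frame, indexing it directly would be the safer lookup:
zvar <- studyData[[z]]   # works whether or not z also exists as a free-standing vector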
I am trying to perform a pairwise manova analysis where I loop through all the possible pairs of my columns. I think this is best communicated with an example:
varList <- colnames(iris)
m1 <- manova(cbind(varList[1], varList[2]) ~ Species, data = iris)
# Error in model.frame.default(formula = cbind(varList[1], varList[2]) ~ :
# variable lengths differ (found for 'Species')
m2 <- manova(cbind(noquote(varList[1]), noquote(varList[2])) ~ Species,
data = iris)
# Error in model.frame.default(formula = cbind(noquote(varList[1]), noquote(varList[2])) ~ :
# variable lengths differ (found for 'Species')
m3 <- manova(cbind(Sepal.Length, Petal.Length) ~ Species, data = iris)
m4 <- manova(cbind(iris[ ,1], iris[ ,3]) ~ Species, data = iris)
summary(m3)
# Df Pillai approx F num Df den Df Pr(>F)
# Species 2 0.9885 71.829 4 294 < 2.2e-16 ***
# Residuals 147
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R.version.string
# [1] "R version 3.4.2 (2017-09-28)"
RStudio.Version()$version
# [1] ‘1.1.383’
I think this is more related to referring to column names from a vector inside my cbind() call. I saw something about using parentheses in this question here, but can't get that to work for my case. I can call the columns by their number (see m4), but I'd prefer to use column names if possible.
You need to wrap each of the entries from the vector that you are calling with eval(as.symbol()).
So:
m1 <- manova(cbind(eval(as.symbol(varList[1])), eval(as.symbol(varList[2]))) ~ Species,
             data = iris)
should work.
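To cover every pair of columns, as the question asks, here is one possible sketch (my own wrapper, not part of the answer) that loops over combn() of the variable names and builds each formula as a string, the same idiom used elsewhere in this thread:
# all 2-column combinations of the four numeric measurement variables
varList <- colnames(iris)[1:4]
pairList <- combn(varList, 2, simplify = FALSE)
manovas <- lapply(pairList, function(p) {
  fmla <- as.formula(sprintf("cbind(%s, %s) ~ Species", p[1], p[2]))
  manova(fmla, data = iris)
})
names(manovas) <- sapply(pairList, paste, collapse = " & ")
lapply(manovas, summary)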
How can I use the ddply function for a linear model?
x1 <- c(1:10, 1:10)
x2 <- c(1:5, 1:5, 1:5, 1:5)
x3 <- c(rep(1,5), rep(2,5), rep(1,5), rep(2,5))
set.seed(123)
y <- rnorm(20, 10, 3)
mydf <- data.frame(x1, x2, x3, y)
require(plyr)
ddply(mydf, mydf$x3, .fun = lm(mydf$y ~ mydf$X1 + mydf$x2))
This generates this error:
Error in model.frame.default(formula = mydf$y ~ mydf$X1 + mydf$x2,
drop.unused.levels = TRUE) :
invalid type (NULL) for variable 'mydf$X1'
Appreciate your help.
Here is what you need to do.
mods = dlply(mydf, .(x3), lm, formula = y ~ x1 + x2)
mods is a list of two objects containing the regression results. You can extract what you need from mods. For example, if you want to extract the coefficients, you could write
coefs = ldply(mods, coef)
This gives you
x3 (Intercept) x1 x2
1 1 11.71015 -0.3193146 NA
2 2 21.83969 -1.4677690 NA
EDIT. If you want ANOVA, then you can just do
ldply(mods, anova)
x3 Df Sum Sq Mean Sq F value Pr(>F)
1 1 1 2.039237 2.039237 0.4450663 0.52345980
2 1 8 36.654982 4.581873 NA NA
3 2 1 43.086916 43.086916 4.4273907 0.06849533
4 2 8 77.855187 9.731898 NA NA
What Ramnath explained is exactly right. But I'll elaborate a bit.
ddply expects a data frame in and returns a data frame out. The lm() function takes a data frame as input but returns a linear model object. You can see that by looking at the docs for lm via ?lm:
Value
lm returns an object of class "lm" or for multiple responses of class
c("mlm", "lm").
So you can't just shove the lm objects into a data frame. Your choices are either to coerce the output of lm into a data frame, or to shove the lm objects into a list instead of a data frame.
So to illustrate both options:
Here's how to shove the lm objects into a list (very much like what Ramnath illustrated):
outlist <- dlply(mydf, "x3", function(df) lm(y ~ x1 + x2, data=df))
On the flip side, if you want to extract only the coefficients you can create a function that runs the regression and then returns only the coefficients in the form of a data frame like this:
myLm <- function(formula, df) {
  lmList <- lm(formula, data = df)
  lmOut <- data.frame(t(lmList$coefficients))
  names(lmOut) <- c("intercept", "x1coef", "x2coef")
  return(lmOut)
}
outDf <- ddply(mydf, "x3", function(df) myLm(y ~ x1 + x2, df))
Use this
mods <- dlply(mydf, .(x3), lm, formula = y ~ x1 + x2)
coefs <- llply(mods, coef)
$`1`
(Intercept) x1 x2
11.7101519 -0.3193146 NA
$`2`
(Intercept) x1 x2
21.839687 -1.467769 NA
anovas <- llply(mods, anova)
$`1`
Analysis of Variance Table
Response: y
Df Sum Sq Mean Sq F value Pr(>F)
x1 1 2.039 2.0392 0.4451 0.5235
Residuals 8 36.655 4.5819
$`2`
Analysis of Variance Table
Response: y
Df Sum Sq Mean Sq F value Pr(>F)
x1 1 43.087 43.087 4.4274 0.0685 .
Residuals 8 77.855 9.732
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1