I use model.matrix to create a matrix used by GLM.
formula_test <- as.formula("Y ~ x1 + x2")
data_test <- expand.grid(
Y = 1:100
, x1 = c("A","B")
, x2 = 1:20
)
result_test <- data.frame(model.matrix(
object = formula_test
, data = data_test
))
names(result_test)
Interestingly, the column names of result_test are "X.Intercept." "x1B" "x2".
How come the second column name is not "x1A"?
I then tried data_test$x1 <- factor(x = data_test$x1, levels = c("A","B")), but the result is still the same.
That is because if you had c("X.Intercept.", "x1A", "x1B", "x2"), you would have perfect multicollinearity: x1A + x1B would be a column of ones, exactly like the X.Intercept. column. If, for the sake of interpretation, you prefer having x1A instead of the intercept, you can drop the intercept with
formula_test <- as.formula("Y ~ -1 + x1 + x2")
giving
names(result_test)
# [1] "x1A" "x1B" "x2"
and
all(rowSums(result_test[, c("x1A", "x1B")]) == 1)
# [1] TRUE
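As for why it is x1A that is dropped rather than x1B: under R's default treatment contrasts, the first factor level serves as the reference level and gets no column of its own. If instead we make "B" the first level with
data_test$x1 <- factor(data_test$x1, levels = c("B", "A"))  # reorders the levels; levels(x) <- c("B", "A") would relabel the data instead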
then this gives
names(result_test)
# [1] "X.Intercept." "x1A" "x2"
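The collinearity argument can be checked numerically. The following is a minimal sketch on a small made-up data frame d (not the data from the question): it adds the missing x1A dummy by hand, compares the number of columns with the matrix rank, and then uses relevel() to change the reference level.

```r
# Small illustrative data set (hypothetical, not the asker's data)
d <- data.frame(x1 = factor(c("A", "B", "A", "B")), x2 = c(1, 2, 3, 5))

# Default design matrix: intercept plus a dummy for every level but the first
X <- model.matrix(~ x1 + x2, data = d)
colnames(X)
# [1] "(Intercept)" "x1B"         "x2"

# Add the x1A dummy by hand: four columns, but the rank stays at three,
# because (Intercept) = x1A + x1B
X_full <- cbind(X, x1A = as.numeric(d$x1 == "A"))
ncol(X_full)
qr(X_full)$rank

# Making "B" the reference level yields an x1A column instead of x1B
d$x1 <- relevel(d$x1, ref = "B")
X_ref <- model.matrix(~ x1 + x2, data = d)
colnames(X_ref)
# [1] "(Intercept)" "x1A"         "x2"
```

relevel() only changes which level comes first; the observations keep their original labels.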
I want to add extra variables to a formula, with the use of a separate object part_B. As an example:
part_A <- as.formula("y ~ x1")
part_B <- c("x2", "x3")
I tried a couple of things, but one issue is that you cannot call as.formula on the object part_B (because in that case I could have created the formula by combining character vectors).
Desired Result
as.formula("y ~ x1 + x2 + x3")
Is there any way to do this? I guess one solution would be to create a function that writes the character vector as "y ~ x1 + x2 + x3" so it can be fed to as.formula.
Use reformulate and update like this:
update(part_A, reformulate(c(".", part_B)))
## y ~ x1 + x2 + x3
This also works:
v <- all.vars(part_A)
reformulate(c(v[-1], part_B), v[1])
## y ~ x1 + x2 + x3
If you store part_A as a character vector instead, you can paste the pieces together:
part_A <- c("y", "x1")
part_B <- c("x2", "x3")
new_formula <- as.formula(paste(part_A[1], paste(c(part_A[2], part_B), collapse = " + "), sep = " ~ "))
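The question mentions wrapping this in a function. A minimal sketch of such a helper, built on the update/reformulate approach above, could look like this; the name add_terms is hypothetical and not part of base R.

```r
# Hypothetical helper: append extra term labels to an existing formula
add_terms <- function(f, extra) {
  # reformulate(c(".", extra)) builds  ~ . + extra1 + extra2 ...,
  # and update() substitutes the old right-hand side for the "."
  update(f, reformulate(c(".", extra)))
}

part_A <- y ~ x1
part_B <- c("x2", "x3")

new_formula <- add_terms(part_A, part_B)
new_formula
## y ~ x1 + x2 + x3
```

Because the helper works on the formula object itself, the response and existing terms never have to be converted to character strings.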
Apologies if this is a repeat question. If the answer exists somewhere, I would appreciate being pointed to it.
I have a large data frame with many factors, mix of categorical and continuous. Here is a shortened example:
x1 = sample(x = c("A", "B", "C"), size = 50, replace = TRUE)
x2 = sample(x = c(5, 10, 27), size = 50, replace = TRUE)
y = rnorm(50, mean=0)
dat = as.data.frame(cbind(y, x1, x2))
dat$x2 = as.numeric(dat$x2)
dat$y = as.numeric(dat$y)
> head(dat)
y x1 x2
1 9 C 2
2 7 C 2
3 8 B 1
4 21 A 2
5 48 A 1
6 19 A 3
I want to subset this dataset for each level of x1, so I end up with 3 new datasets for each level of factor x1. I can do this the following way:
#A
dat.A = dat[which(dat$x1== "A"),,drop=T]
dat.A$x1 = factor(dat.A$x1)
#B
dat.B = dat[which(dat$x1== "B"),,drop=T]
dat.B$x1 = factor(dat.B$x1)
#C
dat.C = dat[which(dat$x1== "C"),,drop=T]
dat.C$x1 = factor(dat.C$x1)
This is somewhat tedious as my real data have 7 levels of the factor of interest so I have to repeat the code 7 times. Once I have each new data frame in my global environment, I want to perform several functions to each one (graphing, creating tables, fitting linear models). Here is a simple example:
#same plot for each dataset
A.plot = plot(dat.A$y, dat.A$x2)
B.plot = plot(dat.B$y, dat.B$x2)
C.plot = plot(dat.C$y, dat.C$x2)
#same models for each dataset
mod.A = lm(y ~ x2, data = dat.A)
summary(mod.A)
mod.B = lm(y ~ x2, data = dat.B)
summary(mod.B)
mod.C = lm(y ~ x2, data = dat.C)
summary(mod.C)
This is a lot of copying and pasting. Is there a way I can write out one line of code for each thing I want to do and loop over each dataset? Something like below, which I know is wrong but it's what I am trying to do:
for (i in datasets) {
[i].plot = plot(dat.[i]$y, dat.[i]$x2)
mod.[i] = lm(y ~ x2, data = dat[i])
}
We can split the data into a list of data.frames and then loop over that list with lapply:
lst1 <- split(dat, dat$x1)
lst2 <- lapply(lst1, function(d) {
  # base plot() draws as a side effect and returns NULL, so there is nothing to store
  plot(d$y, d$x2)
  # the fitted model is the return value for this subset
  lm(y ~ x2, data = d)
})
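Once the models live in a list, the "several functions for each dataset" step becomes one more lapply or sapply call. Here is a minimal sketch on simulated data (the data frame below is made up for illustration) that fits one model per level of x1 and collects the coefficients in a single matrix.

```r
# Simulated stand-in for the asker's data (illustrative only)
set.seed(1)
dat <- data.frame(
  y  = rnorm(30),
  x1 = rep(c("A", "B", "C"), each = 10),
  x2 = rep(c(5, 10, 27), times = 10)
)

# One model per level of x1, stored in a named list
models <- lapply(split(dat, dat$x1), function(d) lm(y ~ x2, data = d))

# One row of coefficients per level
coefs <- t(sapply(models, coef))
coefs

# Full summaries, again without copy and paste
summaries <- lapply(models, summary)
```

The list names come from the factor levels, so models$A or models[["B"]] replace the separate mod.A and mod.B objects.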
For completeness' sake, here's how I would do this in the tidyverse, producing two lists: one with the plots and one with the models.
library(dplyr)
library(ggplot2)
model_list <- dat %>%
group_by(x1) %>%
group_map( ~ lm(y ~ x2, data = .x))
plot_list <- dat %>%
group_by(x1) %>%
group_map( ~ ggplot(.x, aes(x2, y)) + geom_point())
I have a response variable y.
I also have a list of 5 independent variables (predictors):
x <- list(x1, x2, x3, x4, x5)
Lastly, I have a logical vector z of length 5, e.g.
z <- c(TRUE, TRUE, FALSE, FALSE, TRUE)
Given this, I want R to automatically fit the linear regression
lm(y ~ x1 + x2 + x5)
Basically, each TRUE/FALSE indicates whether the corresponding independent variable should be included.
I have not been able to do this.
I tried lm(y ~ x[z]), but it does not work.
You may do
lm(y ~ do.call(cbind, x[z]))
do.call(cbind, x[z]) will convert x[z] into a matrix, which is an acceptable input format for lm. One problem with this is that the names of the regressors (assuming that x is a named list) in the output are a little messy. So, instead you may do
lm(y ~ ., data = data.frame(y = y, do.call(cbind, x[z])))
that would give nice names in the output (again, assuming that x is a named list).
Try binding y and the selected variables into a data.frame (or a matrix via cbind) before running the regression. You can select the independent variables like this:
x <- list(x1 = 1:5, x2 = 1:5, x3 = 1:10, x4 = 1:5, x5 = 1:5)
z <- c(TRUE, TRUE, FALSE, FALSE, TRUE)
b <- data.frame(x[which(z == TRUE)])
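Another way to keep clean coefficient names is to build the formula from the names of the selected list elements. The following is a minimal sketch assuming x is a named list of equal-length numeric vectors; the simulated values are made up for illustration.

```r
# Simulated stand-ins for y and the named list x (illustrative only)
set.seed(42)
n <- 20
x <- list(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n),
          x4 = rnorm(n), x5 = rnorm(n))
y <- rnorm(n)
z <- c(TRUE, TRUE, FALSE, FALSE, TRUE)

# Put everything into one data frame; the list names become column names
dat <- data.frame(y = y, x)

# Build y ~ x1 + x2 + x5 from the names selected by z
form <- reformulate(names(x)[z], response = "y")
fit <- lm(form, data = dat)

names(coef(fit))
# [1] "(Intercept)" "x1"          "x2"          "x5"
```

The fitted object then carries an ordinary formula, which keeps predict() and summary() output readable.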
With a data frame like below
df1 <- data.frame(a=seq(1.1,9.9,1.1), b=seq(0.1,0.9,0.1),
c=rev(seq(10.1, 99.9, 11.1)))
I want to aggregate columns b and c by a,
so I would do something like this:
aggregate(cbind(b,c) ~ a, data = df1, mean)
This gets it done. However, I want to generalize it into a function without hard-coded column names:
myAggFunction <- function (df, col_main, col_1, col_2){
  return (aggregate(cbind(df[,col_1], df[,col_2]) ~ df[,col_main], df, mean))
}
myAggFunction(df1, 1, 2, 3)
The issue I have is that the column names of the returned data frame are as below
df2[, 1] V1 V2
How do I get the column names in the original data frame in the returned data frame?
I will be assuming a general case, where you have multiple LHS (left hand sides) as well as multiple RHS (right hand sides).
Using "data.frame" method
## S3 method for class 'data.frame'
aggregate(x, by, FUN, ..., simplify = TRUE, drop = TRUE)
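If you pass the data as named lists, the names are preserved. Subsetting a data frame with [] returns a data frame (a named list), whereas [, ] with a single column drops to an unnamed vector, so index with [] rather than [, ]. You may construct your function as: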
## `LHS` and `RHS` are vectors of column names or numbers giving column positions
fun1 <- function (df, LHS, RHS){
## call `aggregate.data.frame`
aggregate.data.frame(df[LHS], df[RHS], mean)
}
Still using "formula" method?
## S3 method for class 'formula'
aggregate(formula, data, FUN, ...,
subset, na.action = na.omit)
It is slightly tedious, but we want to construct a nice formula via:
as.formula( paste(paste0("cbind(", toString(LHS), ")"),
paste(RHS, collapse = " + "), sep = " ~ ") )
For example:
LHS <- c("y1", "y2", "y3")
RHS <- c("x1", "x2")
as.formula( paste(paste0("cbind(", toString(LHS), ")"),
paste(RHS, collapse = " + "), sep = "~") )
# cbind(y1, y2, y3) ~ x1 + x2
If you feed this formula to aggregate, you will get decent column names preserved.
So construct your function as such:
fun2 <- function (df, LHS, RHS){
  ## ideally, `LHS` and `RHS` should already be vectors of column names,
  ## but vectors of numeric positions are allowed as well
  if (is.numeric(LHS)) LHS <- names(df)[LHS]
  if (is.numeric(RHS)) RHS <- names(df)[RHS]
  ## make a formula
  form <- as.formula( paste(paste0("cbind(", toString(LHS), ")"),
                            paste(RHS, collapse = " + "), sep = "~") )
  ## dispatch to the "formula" method of `aggregate`
  aggregate(form, data = df, FUN = mean)
}
Remark
The "data.frame" method is the most direct choice: aggregate.formula is a wrapper that calls model.frame internally to construct a data frame first and then hands over to aggregate.data.frame.
I give the "formula" method as an option because the way I construct the formula is also useful for lm, etc.
Simple, reproducible example
set.seed(0)
dat <- data.frame(y1 = rnorm(10), y2 = rnorm(10),
x1 = gl(2,5, labels = letters[1:2]))
## "data.frame" method with `fun1`
fun1(dat, 1:2, 3)
# x1 y1 y2
#1 a 0.79071819 -0.3543499
#2 b -0.07287026 -0.3706127
## "formula" method with `fun2`
fun2(dat, 1:2, 3)
# x1 y1 y2
#1 a 0.79071819 -0.3543499
#2 b -0.07287026 -0.3706127
fun2(dat, c("y1", "y2"), "x1")
# x1 y1 y2
#1 a 0.79071819 -0.3543499
#2 b -0.07287026 -0.3706127
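To tie this back to the call style in the question, where columns are given by position, here is a minimal sketch of the "data.frame" approach with that signature. The function name agg_by_position and the small data frame are hypothetical.

```r
# Small illustrative data frame (made up for this sketch)
d <- data.frame(g = c("u", "u", "v", "v"),
                b = c(1, 3, 5, 7),
                c = c(10, 30, 50, 70))

# Hypothetical wrapper: first position is the grouping column,
# the remaining positions are the columns to average
agg_by_position <- function(df, col_main, ...) {
  value_cols <- c(...)
  # df[...] keeps the column names, which aggregate() then reuses
  aggregate(df[value_cols], by = df[col_main], FUN = mean)
}

res <- agg_by_position(d, 1, 2, 3)
res
#   g b  c
# 1 u 2 20
# 2 v 6 60
```

Since both arguments to aggregate() are named lists, the result carries the original column names without any renaming step.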
I am using lavaan package and my intention is to get my model residuals as dataframes for further use. I run several models that have grouping variables. Here's the basic workflow:
library(lavaan)
df <- data.frame(
y1 = sample(1:100),
y2 = sample(1:100),
x1 = sample(1:100),
x2 = sample(1:100),
x3 = sample(1:100),
grpvar = sample(c("grp1","grp2"), 100, replace = TRUE))
semModel <- list(length = 2)
semModel[1] <- 'y1 ~ c(a,b)*x1 + c(a,b)*x2'
semModel[2] <- 'y1 ~ c(a,b)*x1
y2 ~ c(a,b)*x2 + c(a,b)*x3'
funEstim <- function(model){
sem(model, data = df, group = "grpvar", estimator = "MLM")}
fits <- lapply(semModel, funEstim)
residuals <- lapply(fits, function(x) resid(x, "obs"))
Now the resulting residuals object bugs me. It is a list of matrices nested a few levels deep. How do I get each of the matrices as a separate data frame without any hardcoding? I don't want to unlist them completely, as that would lose some information.
You can use list2env along with unlist to make the grp1, grp2, length.grp1, and length.grp2 directly available in the global environment.
list2env(unlist(residuals, recursive=FALSE), envir=.GlobalEnv)
ls()
#[1] "df" "fits" "funEstim" "grp1" "grp2"
#[6] "length.grp1" "length.grp2" "residuals" "semModel"
But they won't be data frames. For that you could convert them to data frames before calling list2env:
df.list <- lapply(unlist(residuals, recursive=FALSE), data.frame)
list2env(df.list, envir=.GlobalEnv)
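If you would rather not write into the global environment, the same flattening step can simply return a named list of data frames. The following is a minimal sketch on a mock nested list; the object nested and its element names are made up to stand in for the lavaan output, whose real structure may be nested more deeply.

```r
# Mock nested list standing in for the residuals object (illustrative only)
nested <- list(
  model1 = list(grp1 = matrix(1:4, 2),  grp2 = matrix(5:8, 2)),
  model2 = list(grp1 = matrix(9:12, 2), grp2 = matrix(13:16, 2))
)

# Remove one level of nesting; outer and inner names are joined with "."
flat <- unlist(nested, recursive = FALSE)
names(flat)
# [1] "model1.grp1" "model1.grp2" "model2.grp1" "model2.grp2"

# Convert every matrix to a data frame, keeping the list names
df_list <- lapply(flat, as.data.frame)
df_list[["model2.grp1"]]
```

Keeping the results in one list makes it easy to loop over them later, and list2env(df_list, envir = .GlobalEnv) can still be called afterwards if separate objects are needed.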