The code below creates a linear model with R's lm, then a weighted model with a weights column. Finally, I try to pass in the weight column name with a variable weight_col and that fails. I'm pretty sure it's looking for "weight_col" in df, then the caller's environment, finds a variable of length 1, and the lengths don't match.
How do I get it to use weight_col as a name for the weights column in df?
I've tried several combinations of things without success.
> df <- data.frame(
x=c(1,2,3),
y=c(4,5,7),
w=c(1,3,5)
)
> lm(y ~ x, data=df)
Call:
lm(formula = y ~ x, data = df)
Coefficients:
(Intercept) x
2.333 1.500
> lm(y ~ x, data=df, weights=w)
Call:
lm(formula = y ~ x, data = df, weights = w)
Coefficients:
(Intercept) x
1.947 1.658
> weight_col <- 'w'
> lm(y ~ x, data=df, weights=weight_col)
Error in model.frame.default(formula = y ~ x, data = df, weights = weight_col, :
variable lengths differ (found for '(weights)')
> R.version.string
[1] "R version 3.6.3 (2020-02-29)"
You can index the data frame by column name with the [[ extraction operator:
lm(y ~ x, data = df, weights = df[[weight_col]])
Or you can use the get function:
lm(y ~ x, data = df, weights = get(weight_col))
We can use [[ to extract the value of the column
lm(y ~ x, data=df, weights=df[[weight_col]])
Or with tidyverse
library(dplyr)
df %>%
summarise(model = list(lm(y ~ x, weights = .data[[weight_col]])))
Your first example is weights = w, which uses non-standard evaluation to find w in the context of df. So far, this is normal for interactive use.
Your second set is weights = weight_col which resolves to weights = "w", which is very different. There is nothing in R's non-standard (or standard) evaluation in which that makes sense.
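To make the difference concrete, here is a rough sketch of the two lookups. It only mimics the way the weights argument is resolved (first among the columns of the data, then in the enclosing environment); it is not lm's actual code, and found_w and found_name are names made up for this illustration.

```r
df <- data.frame(x = c(1, 2, 3), y = c(4, 5, 7), w = c(1, 3, 5))
weight_col <- "w"

# `w` is found as a column of df: a numeric vector with one entry per row
found_w <- eval(quote(w), df, environment())

# `weight_col` is not a column of df, so it is found in the enclosing
# environment: a character vector of length 1
found_name <- eval(quote(weight_col), df, environment())

length(found_w)     # one value per row of df
length(found_name)  # a single string, hence "variable lengths differ"
```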
As I said in my comment, use the standard-evaluation form with [[.
lm(y ~ x, data=df, weights=df[[weight_col]])
# Call:
# lm(formula = y ~ x, data = df, weights = df[[weight_col]])
# Coefficients:
# (Intercept) x
# 1.947 1.658
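If the column name has to be passed through a function, one possible sketch is to turn the string into a symbol and splice it into the call, so that lm sees weights = w exactly as in the interactive case. fit_weighted is a hypothetical helper written for this illustration, not part of any package.

```r
df <- data.frame(x = c(1, 2, 3), y = c(4, 5, 7), w = c(1, 3, 5))

# Hypothetical helper: builds lm(<formula>, data = data, weights = <symbol>)
# and evaluates it here, so `data` is resolved inside this function and the
# weights symbol is resolved among the columns of `data`.
fit_weighted <- function(formula, data, weight_col) {
  stopifnot(is.character(weight_col), length(weight_col) == 1,
            weight_col %in% names(data))
  eval(bquote(lm(.(formula), data = data, weights = .(as.name(weight_col)))))
}

fit <- fit_weighted(y ~ x, df, "w")
coef(fit)
```

Writing weights = data[[weight_col]] inside such a helper can fail when the formula was created elsewhere, because the weights expression is looked up in the formula's environment rather than in the helper; splicing in the symbol sidesteps that.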
I want to run linear models (in this case, multivariate models with two response variables) within a for loop in which a new data frame called bc_applied, as well as the vector targets, is created at each iteration. In my code, the column names "target1" and "target2" change at every iteration, so I can't write the variable names explicitly. Instead, I want to extract them from the vector targets.
Here is an example:
targets <- c("target1","target2")
bc_applied <- data.frame("dsRNA" = c(rep("gene1",5),rep("gene2",5),rep("gene3",5)),
"target1" = runif(15), "target2" = runif(15))
But when running
lm(bc_applied[,targets] ~ dsRNA, data = bc_applied)
The following error is returned:
Error in model.frame.default(formula = bc_applied[, targets] ~ dsRNA, :
invalid type (list) for variable 'bc_applied[, targets]'
The desired output is given by
lm(cbind(target1, target2) ~ dsRNA, data = bc_applied)
According to ?lm
If response is a matrix a linear model is fitted separately by least-squares to each column of the matrix.
With cbind, it is creating a matrix. So, we need an option that takes a matrix. After subsetting the dataset with the columns, convert it to a matrix with as.matrix and it should work
lm(as.matrix(bc_applied[,targets]) ~ dsRNA, data = bc_applied)
-output
#Call:
#lm(formula = as.matrix(bc_applied[, targets]) ~ dsRNA, data = bc_applied)
#Coefficients:
# target1 target2
#(Intercept) 0.45161 0.47457
#dsRNAgene2 0.36341 0.29226
#dsRNAgene3 -0.07115 -0.03003
Or another option is to create a formula with paste
lm(paste0('cbind(', toString(targets),') ~ dsRNA'), data = bc_applied)
-output
#Call:
#lm(formula = paste0("cbind(", toString(targets), ") ~ dsRNA"),
# data = bc_applied)
#Coefficients:
# target1 target2
#(Intercept) 0.45161 0.47457
#dsRNAgene2 0.36341 0.29226
#dsRNAgene3 -0.07115 -0.03003
or create the formula with glue
lm(glue::glue('cbind({toString(targets)}) ~ dsRNA'), bc_applied)
or another option is
lm(do.call(cbind, asplit(bc_applied[, targets], 2)) ~ dsRNA, bc_applied)
Crosschecking with cbind
lm(cbind(target1, target2)~ dsRNA, data = bc_applied)
-output
#Call:
#lm(formula = cbind(target1, target2) ~ dsRNA, data = bc_applied)
#Coefficients:
# target1 target2
#(Intercept) 0.45161 0.47457
#dsRNAgene2 0.36341 0.29226
#dsRNAgene3 -0.07115 -0.03003
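Since the question is about a loop in which targets changes, here is a minimal sketch of the pasted-formula approach inside a for loop. The third column target3, the list target_sets and the seed are assumptions added only to give the loop something to iterate over.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
bc_applied <- data.frame(
  dsRNA   = rep(c("gene1", "gene2", "gene3"), each = 5),
  target1 = runif(15),
  target2 = runif(15),
  target3 = runif(15)   # extra column, assumed for this illustration
)

# Hypothetical response sets, one per iteration
target_sets <- list(c("target1", "target2"), c("target2", "target3"))

fits <- vector("list", length(target_sets))
for (i in seq_along(target_sets)) {
  targets <- target_sets[[i]]
  # build cbind(a, b) ~ dsRNA from the current column names
  form <- as.formula(paste0("cbind(", toString(targets), ") ~ dsRNA"))
  fits[[i]] <- lm(form, data = bc_applied)
}

lapply(fits, function(f) colnames(coef(f)))
```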
I use a best subset selection package to determine the best independent variables from which to build my model (I do have a specific reason for doing this instead of using the best subset object directly). I want to programmatically extract the feature names and use the resulting string to build my model formula. The result would be something like this:
x <- "x1 + x2 + x3"
y <- "Surv(time, event)"
Because I'm building a coxph model, the formula is as follows:
coxph(Surv(time, event) ~ x1 + x2 + x3)
Using these string fields, I tried to construct the formula like so:
form <- y ~ x
This creates an object of class formula, but when I call coxph it is not evaluated using the strings that y and x refer to. I get the following error:
Error in model.frame.default(formula = y ~ x) : object is not a matrix
If I call eval on the objects y and x within the coxph call, I get the following:
Error in model.frame.default(formula = eval(y) ~ eval(x), data = df) :
variable lengths differ (found for 'eval(x)')
I'm not really sure how to proceed. Thanks for your input.
Couldn't find a good dupe, so posting comment as an answer.
If you build the full formula as a string, including the ~, you can use as.formula on it, e.g.,
x = "x1 + x2 + x3"
y = "Surv(time, event)"
form = as.formula(paste(y, "~", x))
coxph(form, data = your_data)
For a reproducible example, consider the first example at the bottom of the ?coxph help page:
library(survival)
test1 <- list(time=c(4,3,1,1,2,2,3),
status=c(1,1,1,0,1,1,0),
x=c(0,2,1,1,1,0,0),
sex=c(0,0,0,0,1,1,1))
# Fit a stratified model
coxph(Surv(time, status) ~ x + strata(sex), test1)
# Call:
# coxph(formula = Surv(time, status) ~ x + strata(sex), data = test1)
#
# coef exp(coef) se(coef) z p
# x 0.802 2.231 0.822 0.98 0.33
#
# Likelihood ratio test=1.09 on 1 df, p=0.3
# n= 7, number of events= 5
lhs = "Surv(time, status)"
rhs = "x + strata(sex)"
form = as.formula(paste(lhs, "~", rhs))
form
# Surv(time, status) ~ x + strata(sex)
## formula looks good
coxph(form, test1)
# Call:
# coxph(formula = form, data = test1)
#
# coef exp(coef) se(coef) z p
# x 0.802 2.231 0.822 0.98 0.33
Same results either way.
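As a small illustration of why form <- y ~ x did not work in the question: a formula written literally stores the symbols y and x, not the strings they hold, whereas as.formula parses the pasted text. The names literal and built are made up for this sketch, and no survival functions are called here.

```r
x <- "x1 + x2 + x3"
y <- "Surv(time, event)"

literal <- y ~ x                         # refers to the symbols y and x
built   <- as.formula(paste(y, "~", x))  # parsed from the pasted string

all.vars(literal)  # "y" "x"
all.vars(built)    # "time" "event" "x1" "x2" "x3"
```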
I am trying to master building functions in R. Say I have a data frame or data.table,
dummy <- data.frame(y, x, a, b, who)
Where the vector "who" is like so,
who <- c("Joseph", "Kim", "Billy")
I would like to use the character vector to perform various regression models and name the outputs and their summary statistics. So for the entry, "Billy" in the vector above, I would like something like this:
function() {
ols.reg.Billy <- lm(y ~ x + a + b, data = dummy[dummy$who == "Billy", ])
dw.Billy <- dwtest(ols.reg.Billy)
output.Billy <- list(ols.reg.Billy, dw.Billy)
return(output.Billy)
}
But for 500 different entries of the who vector above.
Is there some way to do this? What's the most efficient way? I keep getting errors and I feel I am seriously missing something. Is there some way to use paste?
If this doesn't solve it, please provide a reproducible example. It makes it easier to help you.
library(lmtest)
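outputs <- lapply(who, function(name) {
# keep only the rows for this person (note the comma: rows, all columns)
ols.reg <- lm(y ~ x + a + b, data = dummy[dummy$who == name, ])
dw <- dwtest(ols.reg)
# return the fitted objects themselves, labelled with the person's name
output <- list(ols.reg, dw)
names(output) <- paste(c("ols.reg", "dw"), name, sep = "_")
return(output)
})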
1) Map. Using the built-in CO2 data set, suppose we wish to regress uptake on conc separately for each Type. Note that this names the components by the Type.
Map(function(x) lm(uptake ~ conc, CO2, subset = Type == x), levels(CO2$Type))
giving this two-component list (one component for each level of Type -- Quebec and Mississippi) -- continued after output.
$Quebec
Call:
lm(formula = uptake ~ conc, data = CO2, subset = Type == x)
Coefficients:
(Intercept) conc
23.50304 0.02308
$Mississippi
Call:
lm(formula = uptake ~ conc, data = CO2, subset = Type == x)
Coefficients:
(Intercept) conc
15.49754 0.01238
2) Map/do.call. We may wish to not only name the components using the Type but also have x substituted with the actual Type in the Call: line of the output. In that case use do.call to invoke lm and use quote to ensure that the name of the data frame rather than its value is displayed and use bquote to perform the substitution for x.
reg <- function(x) {
do.call("lm", list(uptake ~ conc, quote(CO2), subset = bquote(Type == .(x))))
}
Map(reg, levels(CO2$Type))
giving:
$Quebec
Call:
lm(formula = uptake ~ conc, data = CO2, subset = Type == "Quebec")
Coefficients:
(Intercept) conc
23.50304 0.02308
$Mississippi
Call:
lm(formula = uptake ~ conc, data = CO2, subset = Type == "Mississippi")
Coefficients:
(Intercept) conc
15.49754 0.01238
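To see what bquote contributes here, this small sketch builds the subset condition on its own. The variable cond is a name chosen for this illustration.

```r
x <- "Quebec"

# .(x) is replaced by the value of x; Type stays an unevaluated symbol
cond <- bquote(Type == .(x))
cond         # Type == "Quebec"
class(cond)  # "call"

# the resulting call can be evaluated against the data like any expression
n_rows <- sum(eval(cond, CO2))
```

Because the value is already spliced into the call, the printed Call: line shows "Quebec" instead of x.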
3) lmList. The nlme package has lmList for doing this:
library(nlme)
lmList(uptake ~ conc | Type, CO2, pool = FALSE)
giving:
Call:
Model: uptake ~ conc | Type
Data: CO2
Coefficients:
(Intercept) conc
Quebec 23.50304 0.02308005
Mississippi 15.49754 0.01238113
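For comparison, a base R sketch with split and lapply gives the same per-group fits and collects the coefficients into one matrix. This is an alternative offered here as an assumption about what may be convenient, not part of the numbered approaches above; co2, fits and coef_table are names made up for the example.

```r
# plain data frame copy of the built-in data set
co2 <- as.data.frame(CO2)

# one data frame per level of Type, then one lm per piece
fits <- lapply(split(co2, co2$Type), function(d) lm(uptake ~ conc, data = d))

# one row per Type, one column per coefficient
coef_table <- t(sapply(fits, coef))
coef_table
```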
I want to estimate an equation such as:
(where the bar denotes the mean of a variable). That is, I want to automatically have interactions between Z and a demeaned version of X. So far I just demean the variables manually beforehand and estimate:
lm(Y ~ .*Z, data= sdata)
This seems to be working, but I would rather use a solution that does not require manual demeaning beforehand because I would also like to include the means of more complex terms, such as:
Edit:
As requested, a working code sample. Note that in the actual problem I have large (and varying) numbers of X variables, so I don't want to use a hard-coded variant:
x1 <- runif(100)
x2 <- runif(100)
Z <- runif(100)
Y <- exp(x1) + exp(x2) + exp(Z)
##current way of estimating the first equation:
sdata <- data.frame(Y=Y,Z=Z,x1=x1-mean(x1),x2=x2-mean(x2))
lm(Y ~ .*Z, data= sdata)
##basically what I want is that the following terms, and their interactions with Z are also used:
# X1^2 - mean(X1^2)
# X2^2 - mean(X2^2)
# X1*X2 - mean(X1*X2)
Edit 2:
Now, what I want to achieve is basically what
lm(Y ~ .^2*Z, data= sdata)
would do. However, because the variables were demeaned beforehand, a term such as Z:x1:x2 would be built from (x1-mean(x1))*(x2-mean(x2)), while what I want is x1*x2-mean(x1*x2).
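A quick numeric check of the distinction, as a sketch with simulated data (the seed and variable names are arbitrary choices for this illustration):

```r
set.seed(42)  # arbitrary seed
x1 <- runif(100)
x2 <- runif(100)

prod_of_demeaned <- (x1 - mean(x1)) * (x2 - mean(x2))
demeaned_prod    <- x1 * x2 - mean(x1 * x2)

mean(demeaned_prod)     # zero up to floating point error
mean(prod_of_demeaned)  # sample covariance times (n - 1) / n, generally not zero
```

The two columns differ by more than a constant, so they are not interchangeable regressors.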
To show that scale works inside a formula:
lm(mpg ~ cyl + scale(disp*hp, scale=F), data=mtcars)
Call:
lm(formula = mpg ~ cyl + scale(disp * hp, scale = F), data = mtcars)
Coefficients:
(Intercept) cyl scale(disp * hp, scale = F)
3.312e+01 -2.105e+00 -4.642e-05
Now for comparison let's scale the interaction outside the formula:
mtcars$scaled_interaction <- with(mtcars, scale(disp*hp, scale=F))
lm(mpg ~ cyl + scaled_interaction, data=mtcars)
Call:
lm(formula = mpg ~ cyl + scaled_interaction, data = mtcars)
Coefficients:
(Intercept) cyl scaled_interaction
3.312e+01 -2.105e+00 -4.642e-05
At least in these examples, it seems as if scale inside formulae is working.
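As a further cross-check, sketched under the assumption that centering is all that is wanted, scale(..., scale = FALSE) inside the formula should give the same coefficients as subtracting the mean by hand with I(). The object names fit_scale and fit_manual are made up for this example.

```r
fit_scale  <- lm(mpg ~ cyl + scale(disp * hp, scale = FALSE), data = mtcars)
fit_manual <- lm(mpg ~ cyl + I(disp * hp - mean(disp * hp)), data = mtcars)

# coefficient names differ, so compare the values only
same_coefs <- all.equal(unname(coef(fit_scale)), unname(coef(fit_manual)))
same_coefs
```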
To provide a solution to your specific issue:
Alternative 1: Use formulae
# fit without Z
mod <- lm(Y ~ (.)^2, data= sdata[, names(sdata) != "Z" ])
vars <- attr(mod$terms, "term.labels")
vars <- gsub(":", "*", vars) # needed so that scale works later
vars <- paste0("scale(", vars, ", scale=F)")
newf <- as.formula(paste0("Y ~ ", paste0(vars, collapse = "+")))
# now interact with Z
f2 <- update.formula(newf, . ~ .*Z)
# This gives the following formula:
f2
Y ~ scale(x1, scale = F) + scale(x2, scale = F) + scale(x1*x2, scale = F) +
Z + scale(x1, scale = F):Z + scale(x2, scale = F):Z + scale(x1*x2, scale = F):Z
Alternative 2: Use Model Matrices
# again fit without Z and get model matrix
mod <- lm(Y ~ (.)^2, data= sdata[, names(sdata) != "Z" ])
modmat <- apply(model.matrix(mod), 2, function(x) scale(x, scale=F))
Here, all x's and the interactions are demeaned:
> head(modmat)
(Intercept) x1 x2 x1:x2
[1,] 0 0.1042908 -0.08989091 -0.01095459
[2,] 0 0.1611867 -0.32677059 -0.05425087
[3,] 0 0.2206845 0.29820499 0.06422944
[4,] 0 0.3462069 -0.15636463 -0.05571430
[5,] 0 0.3194451 -0.38668844 -0.12510551
[6,] 0 -0.4708222 -0.32502269 0.15144812
> round(colMeans(modmat), 2)
(Intercept) x1 x2 x1:x2
0 0 0 0
You can use the model matrix as follows:
modmat <- modmat[, -1] # remove intercept
lm(sdata$Y ~ modmat*sdata$Z)
It is not beautiful, but it should work with any number of explanatory variables. You can also add Y and Z to the matrix so that the output looks prettier, if this is a concern. Note that you can also create the model matrix directly without fitting the model. I took it from the fitted model because I had already fitted it for the first approach.
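A sketch of building the demeaned model matrix directly, without fitting a model first. The simulated data, the seed and the helper variable x_cols are assumptions for this illustration; unlike the question's sdata, the X columns here are not demeaned beforehand.

```r
set.seed(1)  # arbitrary seed
sdata <- data.frame(x1 = runif(100), x2 = runif(100), Z = runif(100))
sdata$Y <- exp(sdata$x1) + exp(sdata$x2) + exp(sdata$Z)

# every column except the response and Z is treated as an X variable
x_cols <- setdiff(names(sdata), c("Y", "Z"))

# main effects and pairwise interactions of the X columns
mm <- model.matrix(~ .^2, data = sdata[x_cols])
mm <- mm[, -1]                                       # drop the intercept column
modmat <- scale(mm, center = TRUE, scale = FALSE)    # demean every column

fit <- lm(Y ~ modmat * Z, data = sdata)
round(colMeans(modmat), 10)
```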
As a side note, this may not be implemented in a more straightforward fashion because it is difficult to imagine situations in which demeaning the interaction is more desirable than interacting the demeaned variables.
Comparing both approaches:
Here is the output of both approaches for comparison. As you can see, apart from the coefficient names everything is identical.
> lm(sdata$Y ~ modmat*sdata$Z)
Call:
lm(formula = sdata$Y ~ modmat * sdata$Z)
Coefficients:
(Intercept) modmatx1 modmatx2 modmatx1:x2 sdata$Z
4.33105 1.56455 1.43979 -0.09206 1.72901
modmatx1:sdata$Z modmatx2:sdata$Z modmatx1:x2:sdata$Z
0.25332 0.38155 -0.66292
> lm(f2, data=sdata)
Call:
lm(formula = f2, data = sdata)
Coefficients:
(Intercept) scale(x1, scale = F) scale(x2, scale = F)
4.33105 1.56455 1.43979
scale(x1 * x2, scale = F) Z scale(x1, scale = F):Z
-0.09206 1.72901 0.25332
scale(x2, scale = F):Z scale(x1 * x2, scale = F):Z
0.38155 -0.66292