Create formula using the name of a data frame column - r

Given a data.frame, I would like to (dynamically) create a formula y ~ ., where y is the name of the first column of the data.frame.
What complicates this beyond the approach of as.formula(paste(names(df)[1], "~ .")) is that the name of the column might be a function, e.g.:
names(model.frame(lm(I(Sepal.Length/Sepal.Width) ~ Species, data = iris)))[1] is "I(Sepal.Length/Sepal.Width)"
So I need the column name to be quoted, i.e. in the above example I would want the formula to be `I(Sepal.Length/Sepal.Width)` ~ ..
This works:
df <- model.frame(lm(I(Sepal.Length/Sepal.Width) ~ Species, data = iris))
fm <- . ~ .
fm[[2]] <- as.name(names(df)[1])
But is there a neat way to do it in one step?

We could use reformulate
reformulate(".", response = sprintf("`%s`", names(df)[1]))
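Since reformulate also accepts a symbol for response, the manual backtick-quoting can be avoided. A minimal sketch, assuming df is the model frame from the question:

```r
# Model frame whose first column has a non-syntactic name
df <- model.frame(lm(I(Sepal.Length/Sepal.Width) ~ Species, data = iris))

# as.name() turns the column name into a symbol, so reformulate()
# does not try to parse it as a call to I()
fm <- reformulate(".", response = as.name(names(df)[1]))

# The formula can be used directly against the model frame
fit <- lm(fm, data = df)
names(coef(fit))
```

Passing a symbol keeps the whole thing in one step and does not depend on how the column name happens to be spelled.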

Related

Using variable to select covariates for glm

I am running a simulation of multiple experiments using random data to create glm models. In each individual experiment I need to select different covariates to build the glm. Is there a way to use variable names to specify which covariates to use in the formula? For example, for a data frame called data that will contain the heading y plus a set of other headings that changes with each iteration, something like:
data <- data.frame(x1 = 1:100, x2 = 2:101, x3 = 3:102, x4 = 4:103, x5 = 5:104, y = 6:105)
#Experiment #1:
covars = c(x1,x2,x4)
glm(y ~ sum(covars),data=data)
#Experiment #2:
covars = c(x1,x3,x4,x5)
glm(y ~ sum(covars),data=data)
#Experiment #3:
covars = c(x2,x4,x5)
glm(y ~ sum(covars),data=data)
#etc...
So far, I have tried using this approach with the sum & colnames functions but I get the following error: "invalid 'type' (character) of argument"
Thank you!
We can use . to represent all the columns except the dependent column 'y'
glm(y ~ ., data = data)
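If only a subset of the columns should enter the model in each experiment, the formula can be built from a character vector of names instead. A minimal sketch, assuming the covariate names are stored as strings (the random columns below are invented for illustration):

```r
set.seed(1)
data <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100),
                   x4 = rnorm(100), x5 = rnorm(100), y = rnorm(100))

# Covariate names must be character strings, not bare symbols
covars <- c("x1", "x2", "x4")

# reformulate() builds y ~ x1 + x2 + x4 from the names
fm <- reformulate(covars, response = "y")
fit <- glm(fm, data = data)
names(coef(fit))
```

Changing covars for each experiment is then enough to change the model.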

Pass dynamically variable names in lm formula inside a function

I have a function that asks for two parameters:
dataRead (dataframe from the user)
variableChosen (which dependent variable the user wants to utilize in the model)
Note: the independent variable will always be the first column
For example, the user might give me a data frame called dataGiven whose column names are "Doses" and "Weight".
I want the model to carry these names in my results.
My current function fits the lm correctly, but the column names from the data frame are lost: the results show the expressions I used to extract the data inside the function instead.
Results_REG<- function (dataRead, variableChosen){
fit1 <- lm(formula = dataRead[,1]~dataRead[,variableChosen])
return(fit1)
}
When I call:
test1 <- Results_REG(dataGiven, "Weight")
names(test1$model)
shows:
"dataRead[, 1]" "dataRead[, variableChosen]"
I wanted to show my dataframe columns names, like:
"Doses" "Weight"
First off, it's always difficult to help without a reproducible code example. For future posts I recommend familiarising yourself with how to provide such a minimal reproducible example.
I'm not entirely clear on what you're asking, so I assume this is about how to create a function that fits a simple linear model based on data with a single user-chosen predictor var.
Here is an example based on mtcars
results_LM <- function(data, var) {
lm(data[, 1] ~ data[, var])
}
results_LM(mtcars, "disp")
#Call:
#lm(formula = data[, 1] ~ data[, var])
#
#Coefficients:
#(Intercept) data[, var]
# 29.59985 -0.04122
You can confirm that this gives the same result as
lm(mpg ~ disp, data = mtcars)
Or perhaps you're asking how to carry through the column names for the predictor? In that case we can use as.formula to construct a formula that we use together with the data argument in lm.
results_LM <- function(data, var) {
fm <- as.formula(paste(colnames(data)[1], "~", var))
lm(fm, data = data)
}
fit <- results_LM(mtcars, "disp")
fit
#Call:
#lm(formula = fm, data = data)
#
#Coefficients:
#(Intercept) disp
# 29.59985 -0.04122
names(fit$model)
#[1] "mpg" "disp"
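If the printed call should also show the actual column names rather than fm, one option is to splice the formula into the call before evaluating it. A sketch, where results_LM2 is a hypothetical name used only for this example:

```r
results_LM2 <- function(data, var) {
  # Response is the first column; the predictor is chosen by name
  fm <- reformulate(var, response = colnames(data)[1])
  # bquote() substitutes the formula into the call, so the stored call shows it
  eval(bquote(lm(.(fm), data = data)))
}

fit2 <- results_LM2(mtcars, "disp")
fit2$call
names(fit2$model)
```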
outcome <- 'mpg'
model <- lm(mtcars[, outcome] ~ ., mtcars)
is meant to give the same result as:
data(mtcars)
model <- lm(mpg ~ ., mtcars)
while allowing you to pass the column name as a variable. However, the two are not equivalent: because the response is written as mtcars[, outcome], the . no longer recognises mpg as the response, so mpg is also included on the right-hand side of the formula. Not sure if anyone knows how to fix that.
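One way to avoid having the outcome on both sides is to build the formula from the column name, so that . can exclude it. A minimal sketch, assuming outcome holds a valid column name:

```r
outcome <- "mpg"

# Builds mpg ~ . ; lm() then expands . to every column except mpg
fm <- reformulate(".", response = outcome)
model <- lm(fm, data = mtcars)

# Compare with the formula written by hand
all.equal(coef(model), coef(lm(mpg ~ ., data = mtcars)))
```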

Referencing factor names in R for ANOVA

I'm relatively new to R and am trying to streamline an ANOVA script to read a set of factor names from a table, and perform statistical tests on the interactions between these factors.
My basic question is how to not have to manually write the name of factors when I call aov, like this:
aov2 <- aov(no_gap ~ Diag*Age, data=data)
But instead, to index a variable which contains the names of the factors of interest, like this (but this doesn't work):
aov2 <- aov(get(vars[5]) ~ get(vars[1])*get(vars[2]), data=data)
Here's my whole script:
#Load data
library(readr)  # provides read_file()
outName <- read_file("fileNameToWrite.txt")
data <- read.table(header=TRUE, "testDataTable.txt",stringsAsFactors = TRUE)
vars <- colnames(data)
# Make sure the subject and Diag columns are factors
cols <- vars[1:2]
data[cols] <- lapply(data[cols], as.factor)
##
# 2x2 between:
aov2 <- aov(get(vars[5]) ~ get(vars[1])*get(vars[2]), data=data)
aov2 <- aov(no_gap ~ Diag*Age, data=data)
aov2 <- aov(apply(vars[5]) ~ get(vars[1])*get(vars[2]), data=data)
summary(aov2)
For reference, this is what "vars" looks like when evaluated:
> vars
[1] "subject" "Diag" "Age" "gap" "no_gap"
Thanks so much for your help!!
The argument no_gap ~ Diag*Age you are passing to aov is a formula object. You can create a formula object from vars as follows. Note that, with vars as printed in the question, Diag and Age are vars[2] and vars[3]; vars[1] is subject.
myform <- as.formula(sprintf("%s ~ %s * %s", vars[5], vars[2], vars[3]))
aov2 <- aov(myform, data=data)
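The same idea extends to any number of factors by pasting the names together. A sketch using made-up data with the column names from the question (the values are assumptions, for illustration only):

```r
set.seed(42)
data <- data.frame(
  subject = factor(1:40),
  Diag    = factor(rep(c("a", "b"), each = 20)),
  Age     = factor(rep(c("young", "old"), times = 20)),
  gap     = rnorm(40),
  no_gap  = rnorm(40)
)

# Pick the response and the factors by name rather than by position
response <- "no_gap"
factors  <- c("Diag", "Age")

# Builds no_gap ~ Diag * Age
myform <- as.formula(paste(response, "~", paste(factors, collapse = " * ")))
aov2 <- aov(myform, data = data)
summary(aov2)
```

Selecting by name avoids off-by-one mistakes when the column order of the table changes.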

How do I exclude specific variables from a glm in R?

I have 50 variables. This is how I use them all in my glm.
var = glm(Stuff ~ ., data=mydata, family=binomial)
But I want to exclude 2 of them. So how do I exclude 2 in specific? I was hoping there would be something like this:
var = glm(Stuff ~ . # notthisstuff, data=mydata, family=binomial)
thoughts?
In addition to using the - like in the comments
glm(Stuff ~ . - var1 - var2, data= mydata, family=binomial)
you can also subset the data frame passed in
glm(Stuff ~ ., data=mydata[ , !(names(mydata) %in% c('var1','var2'))], family=binomial)
or
glm(Stuff ~ ., data=subset(mydata, select=c( -var1, -var2 ) ), family=binomial )
(be careful with that last one, the subset function sometimes does not work well inside of other functions)
You could also use the paste function to create a string representing the formula with the terms of interest (subsetting to the group of predictors that you want), then use as.formula to convert it to a formula.
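The paste approach might look like the following sketch. The data frame is invented for illustration; only the names Stuff, var1 and var2 come from the question.

```r
set.seed(1)
mydata <- data.frame(Stuff = rbinom(200, 1, 0.5),
                     var1 = rnorm(200), var2 = rnorm(200),
                     var3 = rnorm(200), var4 = rnorm(200))

# All columns except the response and the two unwanted predictors
keep <- setdiff(names(mydata), c("Stuff", "var1", "var2"))

# Paste the remaining names into a formula string, then convert it
fm <- as.formula(paste("Stuff ~", paste(keep, collapse = " + ")))
fit <- glm(fm, data = mydata, family = binomial)
names(coef(fit))
```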

How to pass data frame columns into a column vector to be used on RHS of regression model?

I have a list of 100 columns in a data frame Data1. One of these variables is the dependent variable. The others are predictors.
I need to extract 99 predictors into a column (say varlist) to be used in the equation below
equation <- as.formula(paste('y', "~", paste(varlist, collapse="+"),collapse=""))
I can use dput on the dataframe to extract all the columns but I could not get rid of the dependent variable y from the list:
Varlist <- dput(names(Data1))
It would be much more appropriate to go a different route. If you want to include all of the other variables in your data frame besides the response variable you can just use y ~ . to specify that.
fakedata <- as.data.frame(matrix(rnorm(100000), ncol = 100))
names(fakedata)[1] <- "y"
o <- lm(y ~ ., data = fakedata)
This fits a regression using the 99 other columns in fakedata as the predictors and 'y' as the response, and stores it in 'o'.
Edit: If you want to exclude some variables you can exclude those from the data set. The following removes the 10th through the 100th columns, leaving a regression of y on columns 2-9.
o <- lm(y ~ ., data = fakedata[,-(10:100)])
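If an explicit list of predictor names is still needed, as in the question, setdiff removes the response from the column names. A small sketch, with a reduced random data frame standing in for Data1:

```r
set.seed(1)
Data1 <- as.data.frame(matrix(rnorm(1000), ncol = 10))
names(Data1)[1] <- "y"

# Every column name except the response
varlist <- setdiff(names(Data1), "y")

equation <- as.formula(paste("y", "~", paste(varlist, collapse = "+")))
o2 <- lm(equation, data = Data1)

# Compare with the y ~ . fit
all.equal(coef(o2), coef(lm(y ~ ., data = Data1)))
```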
