Setting dependent variable via numeric indexing in linear model in R

I'm trying to set the name of a column (or a specific vector element) as my dependent variable (DV) in a linear model in R.
When I do this manually by typing "ITEM26", there are no errors. The DV (y) is ITEM26, and the predictors are every other variable in the data frame.
> lm(ITEM26 ~ ., data = M.compsexitems)
I now want to set the DV in a linear model using the colnames function and numeric indexing, which provides the output of "ITEM26" when I refer to the first element. (My ultimate goal is to set up a for loop so that I can quickly set all column names as the DV of separate linear models.)
> colnames(M.compsexitems)[1]
[1] "ITEM26"
When I try setting up a linear model by using the colnames function and numeric indexing, however, I get an error.
> lm(colnames(M.compsexitems)[1] ~ ., data = M.compsexitems)
Error in model.frame.default(formula = colnames(M.compsexitems)[1] ~ ., :
variable lengths differ (found for 'ITEM26')
I get the same error if I manually create a vector of item names (sexitems), and refer to a specific element in the vector via indexing.
> sexitems
[1] "ITEM26" "ITEM27"
> summary(lm(sexitems[1] ~ ., data = M.compsexitems))$r.squared
Error in model.frame.default(formula = sexitems[1] ~ ., data = M.compsexitems, :
variable lengths differ (found for 'ITEM26')
Does anyone know why this error might exist, or how to overcome this error? I have a feeling that the lm function isn't treating the indexed vector element like it's the same as a variable in the data frame, but I'm not sure why.
Example dummy data frame on which the above problems hold true:
> M.compsexitems
  ITEM26 ITEM27
1      2      4
2      3      5
Thank you in advance for your assistance.

The error occurs because colnames(M.compsexitems)[1] evaluates to the length-1 character vector "ITEM26"; lm does not substitute that string for the column it names, so the formula tries to regress a length-1 vector on length-2 variables. Running lm using the first column as dependent variable and all other columns as independent variables can be done like this:
fm <- lm(M.compsexitems)
giving:
> fm
Call:
lm(formula = M.compsexitems)
Coefficients:
(Intercept)       ITEM27
         -2            1
If you need to get the formula explicitly:
fo <- formula(fm)
giving:
> fo
ITEM26 ~ ITEM27
<environment: 0x000000000e2f2b50>
If you want the above formula to explicitly appear in the output of lm then:
do.call("lm", list(fo, quote(M.compsexitems)))
giving:
Call:
lm(formula = ITEM26 ~ ITEM27, data = M.compsexitems)
Coefficients:
(Intercept)       ITEM27
         -2            1
If it's a huge regression and you don't want to run the large computation twice, run it the first time using head(M.compsexitems), or alternately construct the formula from character strings:
fo <- formula(paste(names(M.compsexitems)[1], "~."))
do.call("lm", list(fo, quote(M.compsexitems)))
giving:
Call:
lm(formula = ITEM26 ~ ., data = M.compsexitems)
Coefficients:
(Intercept)       ITEM27
         -2            1
Note
The input M.compsexitems in reproducible form used was:
Lines <- "
ITEM26 ITEM27
1 2 4
2 3 5"
M.compsexitems <- read.table(text = Lines)
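Putting the pieces together for the stated goal (a separate model with each column as the DV), here is a sketch using reformulate() to build each formula, assuming only the dummy M.compsexitems from the Note:

```r
# reproducible dummy data from the Note above
M.compsexitems <- read.table(text = "
ITEM26 ITEM27
1 2 4
2 3 5")

# build "ITEM26 ~ ." etc. from each column name and collect the R-squared values
r2 <- sapply(names(M.compsexitems), function(v) {
  fo <- reformulate(".", response = v)
  summary(lm(fo, data = M.compsexitems))$r.squared
})
r2
# each fit is exact on two points, so both values are 1
```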

Related

linear regression function creating a list instead of a model

I'm trying to fit an lm model using R. However, for some reason this code creates a list of the data instead of the usual regression model.
The code I use is this one
lm(Soc_vote ~ Inclusivity + Gini + Un_den, data = general_inclusivity_sweden )
But instead of the usual coefficients, the title of the variable appears mixed with the data in this way:
(Intercept) Inclusivity0.631 Inclusivity0.681 Inclusivity0.716 Inclusivity0.9
      35.00            -4.00            -6.74            -4.30           4.90
Does anybody know why this happened and how it can be fixed?
What you are seeing is called a named num (a numeric vector with names). You can do the following:
Model <- lm(Soc_vote ~ Inclusivity + Gini + Un_den, data = general_inclusivity_sweden) # Assign the model to an object called Model
summary(Model) # Summary table
Model$coefficients # See only the coefficients as a named numeric vector
Model$coefficients[[1]] # See the first coefficient without name
If you want all the coefficients without names (so just a numeric vector), try:
unname(coef(Model))
It would be good if you could provide a sample of your data but I'm guessing that the key problem is that the numeric data in Inclusivity is stored as a factor. e.g.,
library(tidyverse)
x <- tibble(incl = as.factor(c(0.631, 0.681, 0.716)),
            soc_vote = 1:3)
lm(soc_vote ~ incl, x)
Call:
lm(formula = soc_vote ~ incl, data = x)
Coefficients:
(Intercept)   incl0.681   incl0.716
          1           1           2
Whereas, if you first convert the Inclusivity column to double, you get
y <- x %>% mutate(incl = as.double(as.character(incl)))
lm(soc_vote ~ incl, y)
Call:
lm(formula = soc_vote ~ incl, data = y)
Coefficients:
(Intercept)        incl
     -13.74       23.29
Note that I needed to convert to character first since otherwise I just get the ordinal equivalent of each factor.

How to use data frames to conduct ANOVAs in R

I am currently learning R and am playing around with a dataset that has four nominal variables (Hour.Of.Arrival, Mode, Unit, Weekday), and a continuous dependent variable (Overall). This is all imported from a .csv in a data frame named basic. What I am trying to do is run an ANOVA just using this data frame, without creating separate vectors (e.g. Mode<-basic$Mode). "Fit" holds the results of the ANOVA. Here is the code that I wrote:
Fit<-aov(basic["Overall"],basic["Unit"],data=basic)
However, I keep getting the error
Error in terms.default(formula, "Error", data = data) :
  no terms component nor attribute
I hope this question isn't too basic!!
Thanks :)
I think you want something more like Fit<-aov(Overall ~ Unit,data=basic). The Overall ~ Unit tells R to treat Overall as an outcome being predicted by Unit; you already specify that the dataframe to find these variables is basic.
Here's an example to show you how it works:
> y <- rnorm(100)
> x <- factor(rep(c('A', 'B', 'C', 'D'), each = 25))
> dat <- data.frame(x, y)
> aov(y ~ x, data = dat)
Call:
aov(formula = y ~ x, data = dat)
Terms:
                       x Residuals
Sum of Squares   2.72218 114.54631
Deg. of Freedom        3        96
Residual standard error: 1.092333
Estimated effects may be unbalanced
Note, you don't need to use the data argument, you could also use aov(dat$y ~ dat$x), but the first argument to the function should be a formula.

Getting a subset of variables in R summary

When using the summary function in R, is there an option I can pass in there to present only a subset of the variables?
In my example, I ran a panel regression with several explanatory variables and many dummy variables whose coefficients I do not want to present. I suppose there is a simple way to do this, but I couldn't find it in the function documentation. Thanks
It is in the documentation, but you have to look for the associated print method for summary.plm. The argument is subset. Use it as in the following example:
library(plm)
data("Grunfeld", package = "plm")
mod <- plm(inv ~ value + capital, data = Grunfeld)
print(summary(mod), subset = c("capital"))
Assuming the regression you ran behaves similarly as the summary() of a basic lm() model:
# set up data
x <- 1:100 * runif(100, .01, .02)
y <- 1:100 * runif(100, .01, .03)
# run a very basic linear model
mylm <- lm(x ~ y)
summary(mylm)
# we can save summary of our linear model as a variable
mylm_summary <- summary(mylm)
# we can then isolate coefficients from this summary (summary is just a list)
mylm_summary$coefficients
#output:
             Estimate Std. Error   t value     Pr(>|t|)
(Intercept) 0.2007199 0.04352267  4.611846 1.206905e-05
y           0.5715838 0.03742379 15.273273 1.149594e-27
# note that the class of this "coefficients" object is a matrix
class(mylm_summary$coefficients)
# output
[1] "matrix"
# we can convert that matrix into a data frame so it is easier to work with and subset
mylm_df_coefficients <- data.frame(mylm_summary$coefficients)
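Once the coefficients are in a data frame, you can subset by row name to present only the terms you care about. A minimal sketch, rebuilding the example above (the set.seed call is an assumption, added only to make it reproducible):

```r
set.seed(1)  # assumed seed, not in the original example
x <- 1:100 * runif(100, .01, .02)
y <- 1:100 * runif(100, .01, .03)
mylm_summary <- summary(lm(x ~ y))
mylm_df_coefficients <- data.frame(mylm_summary$coefficients)
# keep only the rows (terms) of interest, e.g. drop the intercept
mylm_df_coefficients[rownames(mylm_df_coefficients) %in% c("y"), ]
```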

passing model parameters to R's predict() function robustly

I am trying to use R to fit a linear model and make predictions. My model includes some constant side parameters that are not in the data frame. Here's a simplified version of what I'm doing:
dat <- data.frame(x=1:5,y=3*(1:5))
b <- 1
mdl <- lm(y~I(b*x),data=dat)
Unfortunately the model object now suffers from a dangerous scoping issue: lm() does not save b as part of mdl, so when predict() is called, it has to reach back into the environment where b was defined. Thus, if subsequent code changes the value of b, the predict value will change too:
y1 <- predict(mdl,newdata=data.frame(x=3)) # y1 == 9
b <- 5
y2 <- predict(mdl,newdata=data.frame(x=3)) # y2 == 45
How can I force predict() to use the original b value instead of the changed one? Alternatively, is there some way to control where predict() looks for the variable, so I can ensure it gets the desired value? In practice I cannot include b as part of the newdata data frame, because in my application, b is a vector of parameters that does not have the same size as the data frame of new observations.
Please note that I have greatly simplified this relative to my actual use case, so I need a robust general solution and not just ad-hoc hacking.
Use eval() and substitute() to substitute the value into the quoted expression:
mdl <- eval(substitute(lm(y~I(b*x),data=dat), list(b=b)))
mdl
# Call:
# lm(formula = y ~ I(1 * x), data = dat)
# ...
We could also use bquote
mdl <- eval(bquote(lm(y~I(.(b)*x), data=dat)))
mdl
#Call:
#lm(formula = y ~ I(1 * x), data = dat)
#Coefficients:
#(Intercept)    I(1 * x)
#  9.533e-15   3.000e+00
According to the ?bquote documentation:
"bquote quotes its argument except that terms wrapped in .() are evaluated in the specified where environment."
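A quick check that the substituted value is baked into the model, so later changes to b no longer affect predict() (reusing the bquote approach shown above):

```r
dat <- data.frame(x = 1:5, y = 3 * (1:5))
b <- 1
# b is evaluated now; the stored formula is literally y ~ I(1 * x)
mdl <- eval(bquote(lm(y ~ I(.(b) * x), data = dat)))
y1 <- predict(mdl, newdata = data.frame(x = 3))  # ~9
b <- 5                                           # changing b later...
y2 <- predict(mdl, newdata = data.frame(x = 3))  # ...no longer changes the prediction
identical(y1, y2)
# TRUE
```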

cv.glm variable lengths differ

I am trying to cv.glm on a linear model however each time I do I get the error
Error in model.frame.default(formula = lindata$Y ~ 0 + lindata$HomeAdv + :
variable lengths differ (found for 'air-force-falcons')
air-force-falcons is the first variable in the dataset lindata. When I run glm I get no errors. All the variables are in a single dataset and there are no missing values.
> linearmod5<- glm(lindata$Y ~ 0 + lindata$HomeAdv + ., data=lindata, na.action="na.exclude")
> set.seed(1)
> cv.err.lin=cv.glm(lindata,linearmod5,K=10)
Error in model.frame.default(formula = lindata$Y ~ 0 + lindata$HomeAdv + :
variable lengths differ (found for 'air-force-falcons')
I do not know what is driving this error or the solution. Any ideas? Thank you!
What is causing this error is a mistake in the way you specify the formula.
This will produce the error:
mod <- glm(mtcars$cyl ~ mtcars$mpg + .,
data = mtcars, na.action = "na.exclude")
cv.glm(mtcars, mod, K=11) #nrow(mtcars) is a multiple of 11
This not:
mod <- glm(cyl ~ ., data = mtcars)
cv.glm(mtcars, mod, K=11)
neither this:
mod <- glm(cyl ~ + mpg + disp, data = mtcars)
cv.glm(mtcars, mod, K=11)
What happens is that a variable specified like mtcars$cyl has a number of rows equal to that of the original dataset. When you use cv.glm, the data frame is partitioned into K parts; but when the model is refit on the resampled data, variables written as data.frame$var are evaluated at their original (non-partitioned) length, while the others (those covered by the dot) have the partitioned length.
So you have to use bare variable names in the formula (without $).
Other advice on the formula:
Avoid mixing explicitly named variables with the dot, or you will double those variables: the dot already stands for all variables in the data frame except those on the left of the tilde.
Why do you add a zero? If the intent is to remove the intercept, use -1 instead. However, removing the intercept is bad practice in my opinion.
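To illustrate that last point, 0 + and -1 are two spellings of the same thing (drop the intercept); a small check on the built-in mtcars:

```r
# both forms remove the intercept; the slope estimates are identical
m1 <- lm(cyl ~ 0 + mpg, data = mtcars)
m2 <- lm(cyl ~ mpg - 1, data = mtcars)
all.equal(coef(m1), coef(m2))
# TRUE
```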
