linear regression function creating a list instead of a model - r

I'm trying to fit an lm model in R. However, for some reason this code produces what looks like a list of the data instead of the usual regression model.
The code I use is this:
lm(Soc_vote ~ Inclusivity + Gini + Un_den, data = general_inclusivity_sweden)
But instead of the usual coefficients, the variable name appears mixed in with the data values, like this:
(Intercept) Inclusivity0.631 Inclusivity0.681 Inclusivity0.716 Inclusivity0.9
35.00 -4.00 -6.74 -4.30 4.90
Does anybody know why this happened and how it can be fixed?

What you are seeing is called a named num (a numeric vector with names). You can do the following:
Model <- lm(Soc_vote ~ Inclusivity + Gini + Un_den, data = general_inclusivity_sweden) # Assign the model to an object called Model
summary(Model) # Summary table
Model$coefficients # See only the coefficients as a named numeric vector
Model$coefficients[[1]] # See the first coefficient without name
If you want all the coefficients without names (so just a numeric vector), try:
unname(coef(Model))
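If you also want the standard errors and p-values rather than just the point estimates, the coefficient matrix can be pulled from the summary (a small addition to the code above):
coef(summary(Model)) # matrix with Estimate, Std. Error, t value and Pr(>|t|) columns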

It would be good if you could provide a sample of your data, but I'm guessing that the key problem is that the numeric data in Inclusivity is stored as a factor, e.g.:
library(tidyverse)
x <- tibble(incl = as.factor(c(0.631, 0.681, 0.716)),
            soc_vote = 1:3)
lm(soc_vote ~ incl, x)
Call:
lm(formula = soc_vote ~ incl, data = x)
Coefficients:
(Intercept) incl0.681 incl0.716
1 1 2
Whereas, if you first convert the Inclusivity column to double, you get
y <- x %>% mutate(incl = as.double(as.character(incl)))
lm(soc_vote ~ incl, y)
Call:
lm(formula = soc_vote ~ incl, data = y)
Coefficients:
(Intercept) incl
-13.74 23.29
Note that I needed to convert to character first; otherwise as.double() would just return the underlying integer level codes of the factor rather than the original values.
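To see what that ordinal equivalent looks like, compare converting the factor directly with going through character first (a quick sketch using the x tibble above):
as.double(x$incl) # 1 2 3 -- the underlying level codes
as.double(as.character(x$incl)) # 0.631 0.681 0.716 -- the actual values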

Related

Too many coefficients with lm

I'm studying the relationship between expenditure per student and performance on PISA (a standardized test). I know that this regression can't give me a ceteris paribus relationship, but that is the point of my exercise: I have to explain why it will not work.
I was running the regression in R with the basic code:
lm1 = lm(a ~ b)
but the problem is that R reports 32 coefficients, which is the number of units in my population, while I should only receive the slope and the intercept, given that this is a simple regression.
This is the output that R gives me:
Call:
lm(formula = a ~ b)
Coefficients:
(Intercept) b10167.3 b10467.8 b10766.4 b10863.4 b10960.1 b11.688.4 b11028.1 b11052 b11207.3 b11855.9 b12424.3 b13930.8
522.9936 5.9561 0.3401 -20.6884 -14.8603 -15.0777 -3.5752 -23.0459 -27.1021 -42.2692 -20.4485 -35.3906 -30.7468
b14353.3 b2.997.9 b20450.9 b3714.8 b4996.3 b5291.6 b5851.7 b6190.7 b6663.3 b6725.3 b6747.2 b7074.9 b8189.1
-18.4412 -107.2872 -39.6793 -98.2315 -80.2505 -36.2202 -48.6179 -64.2414 1.3887 -19.0389 -59.9734 -32.0751 -31.5962
b8406.2 b8533.5 b8671.1 b8996.3 b9265.7 b9897.2
-13.4219 -26.0155 -13.9045 -37.9996 -17.0271 -27.2954
As you can see, there are 32 coefficients when I should receive only two. It seems that R is reading each unit of the population as a separate variable, even though the dataset is, as always, arranged with one variable per column. I can't figure out what the problem is.
It's not a problem with the lm function. It appears that R is treating b as a categorical variable (the levels in your output, such as b2.997.9 and b11.688.4, suggest the numbers were read with thousands separators, which forces the column to be read as text and then treated as a factor).
I have made a small dataset with 5 observations: a (a numeric variable) and b (a categorical variable).
When I fit the model you will see a similar output to yours (5 estimated coefficients):
data = data.frame(a = 1:5, b = as.factor(rnorm(5)))
lm(a~b, data)
Call:
lm(formula = a ~ b, data = data)
Coefficients:
(Intercept) b-0.16380292500502 b0.213340249988902 b0.423891299272316 b0.63738307939327
4 -3 -1 1 -2
To correct this you need to convert b into a numeric vector:
data$b = as.numeric(as.character(data$b))
lm(a~b, data)
Call:
lm(formula = a ~ b, data = data)
Coefficients:
(Intercept) b
2.9580 0.2772
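To confirm this diagnosis on your own data before refitting, you can inspect how b is stored; a minimal sketch, assuming b is the vector you passed to lm:
class(b) # "factor" (or "character") explains the one-coefficient-per-value output
head(levels(factor(b))) # the distinct values that lm is dummy-coding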

Setting dependent variable via numeric indexing in linear model in R

I'm trying to set the name of a column (or a specific vector element) as my dependent variable (DV) in a linear model in R.
When I do this manually by typing "ITEM26", there are no errors. The DV (y) is ITEM26, and the predictors are every other variable in the data frame.
> lm(ITEM26 ~ ., data = M.compsexitems)
I now want to set the DV in a linear model using the colnames function and numeric indexing, which returns "ITEM26" when I refer to the first element. (My ultimate goal is to set up a for loop so that I can quickly set each column name as the DV of a separate linear model.)
> colnames(M.compsexitems)[1]
[1] "ITEM26"
When I try setting up a linear model by using the colnames function and numeric indexing, however, I get an error.
> lm(colnames(M.compsexitems)[1] ~ ., data = M.compsexitems)
Error in model.frame.default(formula = colnames(M.compsexitems)[1] ~ ., :
variable lengths differ (found for 'ITEM26')
I get the same error if I manually create a vector of item names (sexitems), and refer to a specific element in the vector via indexing.
> sexitems
[1] "ITEM26" "ITEM27"
> summary(lm(sexitems[1] ~ ., data = M.compsexitems))$r.squared
Error in model.frame.default(formula = sexitems[1] ~ ., data = M.compsexitems, :
variable lengths differ (found for 'ITEM26')
Does anyone know why this error occurs, or how to overcome it? I have a feeling that the lm function isn't treating the indexed vector element as equivalent to a variable in the data frame, but I'm not sure why.
Example dummy data frame on which the above problems hold true:
> M.compsexitems
ITEM26 ITEM27
1 2 4
2 3 5
Thank you in advance for your assistance.
The error arises because the formula records the literal expression colnames(M.compsexitems)[1], which evaluates to the length-1 character vector "ITEM26" rather than to the ITEM26 column itself, hence the complaint that the variable lengths differ. Running lm using the first column as dependent variable and all other columns as independent variables can be done like this:
fm <- lm(M.compsexitems)
giving:
> fm
Call:
lm(formula = M.compsexitems)
Coefficients:
(Intercept) ITEM27
-2 1
If you need to get the formula explicitly:
fo <- formula(fm)
giving:
> fo
ITEM26 ~ ITEM27
<environment: 0x000000000e2f2b50>
If you want the above formula to explicitly appear in the output of lm then:
do.call("lm", list(fo, quote(M.compsexitems)))
giving:
Call:
lm(formula = ITEM26 ~ ITEM27, data = M.compsexitems)
Coefficients:
(Intercept) ITEM27
-2 1
If it's a huge regression and you don't want to run the large computation twice, then run it the first time using head(M.compsexitems), or alternatively construct the formula from character strings:
fo <- formula(paste(names(M.compsexitems)[1], "~."))
do.call("lm", list(fo, quote(M.compsexitems)))
giving:
Call:
lm(formula = ITEM26 ~ ., data = M.compsexitems)
Coefficients:
(Intercept) ITEM27
-2 1
Note
The input M.compsexitems in reproducible form used was:
Lines <- "
ITEM26 ITEM27
1 2 4
2 3 5"
M.compsexitems <- read.table(text = Lines)
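For the for-loop goal in the question, the same string-based construction generalizes. A minimal sketch (the loop and the r2 name are illustrative, not part of the original answer) that uses each column in turn as the DV:
# fit one model per column, regressing it on all the other columns,
# and collect each fit's R-squared
r2 <- sapply(names(M.compsexitems), function(dv) {
  fo <- reformulate(".", response = dv) # builds e.g. ITEM26 ~ .
  summary(lm(fo, data = M.compsexitems))$r.squared
})
r2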

Getting a subset of variables in R summary

When using the summary function in R, is there an option I can pass to present only a subset of the variables?
In my example, I ran a panel regression with several explanatory variables and many dummy variables whose coefficients I do not want to present. I suppose there is a simple way to do this, but I couldn't find it in the function documentation. Thanks
It is in the documentation, but you have to look at the associated print method for summary.plm. The argument is subset. Use it as in the following example:
library(plm)
data("Grunfeld", package = "plm")
mod <- plm(inv ~ value + capital, data = Grunfeld)
print(summary(mod), subset = c("capital"))
Assuming the regression you ran behaves similarly to the summary() of a basic lm() model:
# set up data
x <- 1:100 * runif(100, .01, .02)
y <- 1:100 * runif(100, .01, .03)
# run a very basic linear model
mylm <- lm(x ~ y)
summary(mylm)
# we can save summary of our linear model as a variable
mylm_summary <- summary(mylm)
# we can then isolate coefficients from this summary (summary is just a list)
mylm_summary$coefficients
#output:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.2007199 0.04352267 4.611846 1.206905e-05
y 0.5715838 0.03742379 15.273273 1.149594e-27
# note that the class of this "coefficients" object is a matrix
class(mylm_summary$coefficients)
# output
[1] "matrix"
# we can convert that matrix into a data frame so it is easier to work with and subset
mylm_df_coefficients <- data.frame(mylm_summary$coefficients)
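From there, presenting only a subset of the variables is plain data-frame subsetting; for example (vars_to_show is a hypothetical name for the terms you want to keep):
# keep only the rows for the coefficients we want to present
vars_to_show <- c("y")
mylm_df_coefficients[rownames(mylm_df_coefficients) %in% vars_to_show, ]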

How to ignore linearly correlated variables introduced by factor reference cell coding

Assume I have a dataset containing two categorical predictor variables (a,b) and a binary target (y) variable.
df <- data.frame(
  a = factor(c("cat1","cat2","cat3","cat1","cat2")),
  b = factor(c("cat1","cat1","cat3","cat2","cat2")),
  y = factor(c(T,F,T,F,T))
)
The following logical relations exist in the data:
if (a = cat3) then (b = cat3 and y = true)
else if (a = b) then (y = true) else y = false
I want to use glm to build a model for my dataset.
glm will automatically apply reference cell coding on my categorical variables a and b. It will also take care of finding the right number of codes for each factor variable, so that no alias variables are introduced (explained here).
However it can happen, as for the dataset above, that a linear relationship exists between one reference code generated for variable a and one reference code of variable b.
See the output of my model:
> model <- glm(y ~ ., family=binomial(link='logit'), data=df)
> summary(model)
...
Coefficients: (1 not defined because of singularities)
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.965e-16 1.732e+00 0.000 1.000
acat2 -2.396e-16 2.000e+00 0.000 1.000
acat3 1.857e+01 6.523e+03 0.003 0.998
bcat2 0.000e+00 2.000e+00 0.000 1.000
bcat3 NA NA NA NA # <- get rid of this?
How should I handle this case?
Is there a way to tell glm to omit some of the generated reference codes?
In the real problem my "cat3" value corresponds to NA. I have two meaningful factor variables which are NA in exactly the same instances of my dataset.
EDIT:
The accepted answer solves the question; however, in this specific case the singularities can simply be ignored, as pointed out in the comments.
The comments made under the question are pertinent, but it may still be useful to try eliminating the NA model-matrix columns so that you can compare the result to not doing such elimination, in order to satisfy yourself regarding the equivalence.
In particular, you could run glm twice removing the redundant model matrix columns on the second run:
model <- glm(y ~ ., family=binomial(link='logit'), data=df) # as in question
mm <- model.matrix(model)[, !is.na(coef(model)) ]
df0 <- data.frame(y = df$y, mm[, -1])
update(model, data = df0)
giving:
Call: glm(formula = y ~ ., family = binomial(link = "logit"), data = df0)
Coefficients:
(Intercept) acat2 acat3 bcat2
1.965e-16 -2.396e-16 1.857e+01 0.000e+00
Degrees of Freedom: 4 Total (i.e. Null); 1 Residual
Null Deviance: 6.73
Residual Deviance: 5.545 AIC: 13.55
Note that if you don't want to use the fact that we know that the response is named y then we could extract the response and its name replacing the assignment to df0 above with:
df0 <- data.frame(model.response(model.frame(model)), mm[, -1])
names(df0)[1] <- as.character(attr(terms(model), "variables")[[2]])
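To satisfy yourself that dropping the aliased column changes nothing, you could compare the remaining coefficients of the two fits directly; a quick sketch using the objects above:
model2 <- update(model, data = df0)
# the non-NA coefficients of the original fit should match the refit
all.equal(coef(model)[!is.na(coef(model))], coef(model2), tolerance = 1e-6)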

passing model parameters to R's predict() function robustly

I am trying to use R to fit a linear model and make predictions. My model includes some constant side parameters that are not in the data frame. Here's a simplified version of what I'm doing:
dat <- data.frame(x=1:5,y=3*(1:5))
b <- 1
mdl <- lm(y~I(b*x),data=dat)
Unfortunately the model object now suffers from a dangerous scoping issue: lm() does not save b as part of mdl, so when predict() is called, it has to reach back into the environment where b was defined. Thus, if subsequent code changes the value of b, the predict value will change too:
y1 <- predict(mdl,newdata=data.frame(x=3)) # y1 == 9
b <- 5
y2 <- predict(mdl,newdata=data.frame(x=3)) # y2 == 45
How can I force predict() to use the original b value instead of the changed one? Alternatively, is there some way to control where predict() looks for the variable, so I can ensure it gets the desired value? In practice I cannot include b as part of the newdata data frame, because in my application, b is a vector of parameters that does not have the same size as the data frame of new observations.
Please note that I have greatly simplified this relative to my actual use case, so I need a robust general solution and not just ad-hoc hacking.
We can use eval and substitute to substitute the value of b into the quoted expression:
mdl <- eval(substitute(lm(y~I(b*x),data=dat), list(b=b)))
mdl
# Call:
# lm(formula = y ~ I(1 * x), data = dat)
# ...
We could also use bquote
mdl <- eval(bquote(lm(y~I(.(b)*x), data=dat)))
mdl
#Call:
#lm(formula = y ~ I(1 * x), data = dat)
#Coefficients:
#(Intercept) I(1 * x)
# 9.533e-15 3.000e+00
According to the description in ?bquote: ‘bquote’ quotes its argument except that terms wrapped in ‘.()’ are evaluated in the specified ‘where’ environment.
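A quick check, using the question's setup, that the baked-in value now survives later changes to b:
b <- 1
mdl <- eval(bquote(lm(y ~ I(.(b) * x), data = dat)))
b <- 5
predict(mdl, newdata = data.frame(x = 3)) # still 9: the literal 1 is stored in the call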
