Apparently the vif() function from the car package and the check_collinearity() function from the performance package both calculate the generalized variance inflation factor (GVIF) automatically when a categorical variable is entered into a regression. As an example, this can be done with the iris dataset using the following code:
#### Categorical Fit ####
library(performance)  # for check_collinearity()

fit.cat <- lm(
  Petal.Length ~ Species + Petal.Width + Sepal.Width,
  iris
)
check_collinearity(fit.cat)
This gives the expected value of 26.10, which I had already calculated by hand.
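For reference, here is a sketch of that hand calculation, assuming the determinant definition of the GVIF from Fox and Monette (1992), which is the definition car::vif uses:
#### GVIF By Hand (sketch) ####
X <- model.matrix(fit.cat)[, -1]            # predictor columns, intercept dropped
R <- cor(X)                                 # correlation matrix of all predictors
sp <- grep("^Species", colnames(X))         # columns belonging to the Species dummies
det(R[sp, sp]) * det(R[-sp, -sp]) / det(R)  # should reproduce the 26.10 above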
However, Jamovi doesn't allow one to add factors to a regression automatically, so I dummy coded Species and set up the same regression there. As the arrow in my screenshot shows, the value Jamovi reports doesn't match the one obtained from the R function. I also double-checked in R to see whether Jamovi is just calculating the ordinary VIF instead:
1 / (1 - summary(lm(as.numeric(Species) ~ 1 + Petal.Width + Sepal.Width,
                    iris))$r.squared)
But the values don't match either, as this gives a VIF of 12.78. Why is Jamovi doing this, and is there a way to work around it in Jamovi?
I need to fit a regression in R which has the form of
lm(Binary_value ~ Age, data=dataframe)
But my age variable starts at 15 years old, so I'm not interested in ages of 14 or below. How can I specify that I only want the regression to apply from age 15 upward, without worrying about smaller values? I tried it this way:
lm(Binary_value ~ Age, data=dataframe)
But I get nonsense results for higher ages.
First things first, remember that R is case-sensitive, so the function is written lm, not LM; I edited your question to fix that. Second, a regression only includes the data that is available to it. It will not invent observations below age 15 if none are present, so there is no issue there. However, the regression line will not be confined to ages >= 15, because abline() draws the full line implied by the fitted intercept and slope. An example below with fake data:
#### Create Fake Data ####
set.seed(123)
x <- 15:100                     # use these numbers for age
age <- sample(x,                # sample from x
              size = 1000,      # 1000 draws
              replace = TRUE)   # sample with replacement
outcome <- age * 0.60 + rnorm(n = 1000, sd = 15)  # make fake outcome variable
df <- data.frame(age, outcome)

#### Fit Data ####
fit <- lm(outcome ~ age, data = df)
summary(fit)
plot(age, outcome)
abline(fit, col = "red")
You will see that the regression line, even though the data only include ages 15 and above, is still drawn to the left where there are no data. This is because the line is drawn from the fitted coefficients (intercept and slope), not from the range of the data.
P.S. I used an ordinary Gaussian regression for this example because you used the lm function in your question, even though you described a binary response. For a logistic regression the rationale would be the same, but you would use glm instead.
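A minimal sketch of the logistic version, reusing the fake data from above with a hypothetical binary outcome (the median split is just for illustration):
#### Logistic Version (sketch) ####
df$bin_outcome <- as.numeric(df$outcome > median(df$outcome))  # hypothetical binary outcome
fit.glm <- glm(bin_outcome ~ age,    # same formula as before
               family = binomial,    # logistic regression
               data = df)
summary(fit.glm)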
Hello (first timer here),
I would like to estimate a "two-way" cluster-robust variance-covariance matrix in R, using the cluster.vcov function from the multiwayvcov package. My question relates solely to the set-up of that function. I have panel data on various crime outcomes. My cross-sectional unit is the precinct (over 40 precincts), and I observe crime in those precincts over several months (24 months). I am evaluating an intervention that 'turns on' (dummy coded) for only a few months throughout the year.
I include "precinct" and "month" fixed effects (i.e., a full set of precinct and month dummies enter the model). I have only one independent variable I am assessing. I want to cluster on "both" dimensions but I am unsure how to set it up.
Do I estimate all the fixed effects with lm first? Or do I simply regress crime on the independent variable (excluding the fixed effects) and then apply cluster.vcov with ~ precinct + month_year? The latter seems like it would give the wrong standard errors, right? I hope this is clear; see my set-up below.
library(multiwayvcov)
library(lmtest)  # for coeftest()

model <- lm(crime ~ as.factor(precinct) + as.factor(month_year) + policy,
            data = DATASET_full)
boot_both <- cluster.vcov(model, ~ precinct + month_year)
coeftest(model, boot_both)
### What the documentation offers as an example
### https://cran.r-project.org/web/packages/multiwayvcov/multiwayvcov.pdf
library(lmtest)
data(petersen)
m1 <- lm(y ~ x, data = petersen)
### Double cluster by firm and year using a formula
vcov_both_formula <- cluster.vcov(m1, ~ firmid + year)
coeftest(m1, vcov_both_formula)
Is it appropriate to first estimate a model that ignores the fixed effects?
First, the answer: you should estimate your lm model with the fixed effects included. This gives you asymptotically correct parameter estimates; the standard errors, however, are incorrect, because they are calculated from a vcov matrix that assumes iid errors.
To replace the iid covariance matrix with a cluster-robust one, you can use cluster.vcov on the fitted model, i.e. my_new_vcov_matrix <- cluster.vcov(model, ~ precinct + month_year).
Then a recommendation: I warmly recommend the felm function from the lfe package for both multi-way fixed effects and cluster-robust standard errors.
The syntax is as follows:
library(multiwayvcov)  # for the petersen example data
library(lfe)

data(petersen)
# formula parts: response ~ covariates | fixed effects | instruments (0 = none) | cluster variables
my_fe_model <- felm(y ~ x | firmid + year | 0 | firmid + year, data = petersen)
summary(my_fe_model)  # reports two-way cluster-robust standard errors
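As a check, the clustered standard errors reported by summary(my_fe_model) should closely match those from the cluster.vcov/coeftest route in the documentation example above; small differences can arise from finite-sample corrections.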
I am using the effects package to find the effect of variables in my linear model.
library(effects)
data(iris)
lm1 <- lm(Sepal.Length ~ Sepal.Width + Petal.Length*Petal.Width,data=iris)
For a simple term in the model, I can get the effects for each data point using
effect("Sepal.Width", lm1, xlevels=iris['Sepal.Width'])
How can I get a similar 1-dimensional vector of values for my interaction term at each point? Does this even make sense? Everything I've tried returns a 2-d matrix, e.g.
effect("Petal.Length:Petal.Width", lm1 ,xlevels=iris['Petal.Length']*iris['Petal.Width'])
I'm not sure what should be used for the xlevels argument in this case to give me more than just the default 5 equally spaced points.
I think I've figured out something which gives me what I want.
# Create a data frame with effects for all observed combinations
eff_df <- data.frame(effect("Petal.Length:Petal.Width", lm1,
                            xlevels = list(Petal.Length = iris$Petal.Length,
                                           Petal.Width = iris$Petal.Width)))
# Create a key column to merge on in eff_df (the separator avoids ambiguous concatenations)
eff_df$merge_col <- paste(eff_df$Petal.Length, eff_df$Petal.Width, sep = "_")
# Create the matching key column in iris
iris$merge_col <- paste(iris$Petal.Length, iris$Petal.Width, sep = "_")
# Only eff_df$fit values whose key appears in iris will be merged
iris <- merge(iris, eff_df[, c("merge_col", "fit")], by = "merge_col", all.x = TRUE)
Then the effects vector is stored in iris$fit.
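As a quick sanity check (a sketch), every observed pair should have received a fitted value; note that merge() may reorder the rows:
sum(is.na(iris$fit))  # expect 0
head(iris[, c("Petal.Length", "Petal.Width", "fit")])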
I'm constructing a linear model to evaluate the effect of distance from a habitat boundary on the richness of an order of insects. There are some differences in the equipment used, so I am including equipment as a categorical variable to ensure that it hasn't had a significant effect on richness.
The categorical factor has 3 levels, so I asked R to produce dummy variables in the lm using the code:
lm(Richness ~ Distances + factor(Equipment), data = Data)
When I ask for the summary of the model I can see two of the levels with their coefficients. I am assuming this means R is using one of the levels as the reference against which the coefficients of the other levels are compared.
How can I find the coefficient for the third level in order to see what effect it has on the model?
Thank you
You can fit lm(y ~ x - 1) to remove the intercept, which in your case is the reference level of the factor. That said, there are statistical reasons for keeping one of the levels as a reference.
To determine how to extract your coefficient, here is a simple example:
# load data
data(mtcars)
head(mtcars)
# what are the means of wt given the factor carb?
(means <- with(mtcars, tapply(wt, factor(carb), mean)))
# run the lm
mod <- with(mtcars, lm(wt~factor(carb)))
# extract the coefficients
coef(mod)
# the intercept is the reference level (i.e., carb 1)
coef(mod)[1]
coef(mod)[2:6]
coef(mod)[1] + coef(mod)[2:6]
means
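For completeness, here is a sketch of the no-intercept parameterization mentioned at the start; each coefficient is then a level mean directly:
# drop the intercept: each coefficient is now the mean of one carb level
mod.noint <- with(mtcars, lm(wt ~ factor(carb) - 1))
coef(mod.noint)  # should match `means` from above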
So you can see that, in this simple case, the coefficients are simply added to the reference level (i.e., the intercept). However, if you have a covariate, it gets more complicated:
mod2 <- lm(wt ~ factor(carb) + disp, data=mtcars)
summary(mod2)
The intercept is now the carb 1 when disp = 0.
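If you want the adjusted level means in that situation, one option (a sketch) is to predict at a fixed value of the covariate:
# expected wt for each carb level, holding disp at its mean
newd <- data.frame(carb = sort(unique(mtcars$carb)), disp = mean(mtcars$disp))
predict(mod2, newdata = newd)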
I am using these commands in R for building decision trees:
> library(party)
> ind = sample(2,nrow(iris),replace=TRUE,prob=c(0.8,0.2))
> myFormula <- Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
> iris_ctree <- ctree(myFormula,data = iris[ind==1,])
> predict(iris_ctree)
What exactly does the predict function compute, and how does it perform the computation?
The example first constructs ind by sampling 1's with probability 0.8 and 2's with probability 0.2. It then specifies a formula that defines the model, and fits the conditional inference tree, estimating its parameters from the sampled data, which here means only the rows where ind == 1.
Calling predict(iris_ctree) with no newdata argument then returns the predicted species for the training data, i.e., the same rows where ind == 1. It does not automatically score the held-out rows where ind == 2; for that you must pass them explicitly via the newdata argument.
So basically it trains on the 1's, and by default predict also runs on the 1's.
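A sketch of both calls, assuming the objects from the question:
# without newdata: fitted classes for the training rows (ind == 1)
table(predict(iris_ctree), iris$Species[ind == 1])
# scoring the held-out rows (ind == 2) explicitly
pred_test <- predict(iris_ctree, newdata = iris[ind == 2, ])
table(pred_test, iris$Species[ind == 2])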