Remove linearly dependent variables while using the bife package

Some model-fitting functions in R (e.g. lm()) automatically remove linearly dependent variables from their regression output. With the bife package, this does not seem to be possible. As stated on page 5 of the package documentation on CRAN:
If bife does not converge this is usually a sign of linear dependence between one or more regressors
and the fixed effects. In this case, you should carefully inspect your model specification.
Now, suppose the problem at hand involves running many regressions and each regression output cannot be inspected adequately -- one has to rely on some sort of rule of thumb regarding the regressors. What are some alternatives for removing linearly dependent regressors more or less automatically while still achieving an adequate model specification?
I set up some code as an example below:
# sample code
x = 10*rnorm(40)
z = 100*rnorm(40)
df1 = data.frame(a=rep(c(0,1), times=20), x=x, y=x, z=z, ID=1:40, date=1, Region=rep(c(1,2,3,4), 10))
df2 = data.frame(a=c(rep(c(1,0), times=15), rep(c(0,1), times=5)), x=1.4*x+4, y=1.4*x+4, z=1.2*z+5, ID=1:40, date=2, Region=rep(c(1,2,3,4), 10))
df3 = rbind(df1, df2)
for (i in 1:4) {
  sub = df3[df3$Region == i, ]
  model = bife::bife(a ~ x + y + z | ID, data = sub)
  results = data.frame(Region = i)
  results$Model = list(model)
  if (i == 1) {
    df4 = results
    next
  }
  df4 = rbind(df4, results)
}
Error: Linear dependent terms detected!

Since you're only looking at linear dependencies, you can simply leverage methods that detect them, for instance lm.
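For instance, a minimal sketch (not part of the original answer): lm drops collinear terms by pivoting, so the discarded regressors show up as NA coefficients and can be collected by name:
# sketch: detect linearly dependent regressors with lm; dropped terms get NA coefficients
fit_lm <- lm(a ~ x + y + z + factor(ID), data = df3)
names(coef(fit_lm))[is.na(coef(fit_lm))]
# with the example data above, `y` (an exact copy of `x`) is the term reported here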
Here's an example of a solution with the fixest package:
library(bife)
library(fixest)
x = 10*rnorm(40)
z = 100*rnorm(40)
df1 = data.frame(a=rep(c(0,1),times=20), x=x, y=x, z=z, ID=c(1:40), date=1, Region=rep(c(1,2, 3, 4),10))
df2 = data.frame(a=c(rep(c(1,0),times=15),rep(c(0,1),times=5)), x=1.4*x+4, y=1.4*x+4, z=1.2*z+5, ID=c(1:40), date=2, Region=rep(c(1,2,3,4),10))
df3 = rbind(df1, df2)
vars = c("x", "y", "z")
res_all = list()
for(i in 1:4) {
  x = df3[df3$Region == i, ]
  coll_vars = feols(a ~ x + y + z | ID, x, notes = FALSE)$collin.var
  new_fml = xpd(a ~ ..vars | ID, ..vars = setdiff(vars, coll_vars))
  res_all[[i]] = bife::bife(new_fml, data = x)
}

# Display all results
for(i in 1:4) {
  cat("\n#\n# Region: ", i, "\n#\n\n")
  print(summary(res_all[[i]]))
}
The functions needed here are feols and xpd, both from fixest. Some explanations:
feols, like lm, removes variables on the fly when they are found to be collinear. It stores the names of the collinear variables in the slot $collin.var (if none is found, it's NULL).
Contrary to lm, feols also allows fixed-effects, so you can include them when you look for linear dependencies: this way you can spot complex linear dependencies that also involve the fixed-effects.
I've set notes = FALSE, otherwise feols would have printed a note about collinearity.
feols is fast (actually faster than lm for large data sets), so it won't be a strain on your analysis.
The function xpd expands the formula and replaces any variable name starting with two dots with the associated argument that the user provides.
When the arguments of xpd are vectors, the behavior is to collapse them with pluses, so if ..vars = c("x", "y") is provided, the formula a ~ ..vars | ID will become a ~ x + y | ID.
Here it replaces ..vars in the formula by setdiff(vars, coll_vars), which is the vector of variables that were not found to be collinear.
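To see the two building blocks in isolation, a small illustration (not part of the original answer; the variable reported as collinear here is an assumption based on the example data, where y is an exact copy of x):
# which variables does feols flag as collinear on the full data? (here: "y")
feols(a ~ x + y + z | ID, df3, notes = FALSE)$collin.var
# formula expansion: ..vars is replaced by the supplied vector, collapsed with "+"
xpd(a ~ ..vars | ID, ..vars = c("x", "z"))
# a ~ x + z | ID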
So you get an algorithm with automatic variable removal before performing bife estimations.
Finally, just a side comment: in general it's better to store results in lists since it avoids copies.
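For example, a trivial sketch of that pattern: pre-allocate a list, fill it inside the loop, and bind once at the end.
res <- vector("list", 4)
for (i in 1:4) {
  res[[i]] <- data.frame(Region = i)  # whatever you compute for region i
}
df4 <- do.call(rbind, res)            # a single rbind at the end, no repeated copies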
Update
I forgot, but if you don't need bias correction (bife::bias_corr), then you can directly use fixest::feglm which automatically removes collinear variables:
res_bife = bife::bife(a ~ x + z | ID, data = df3)
res_feglm = fixest::feglm(a ~ x + y + z | ID, df3, family = binomial)
rbind(coef(res_bife), coef(res_feglm))
#> x z
#> [1,] -0.02221848 0.03045968
#> [2,] -0.02221871 0.03045990
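As a small check (an assumption on my part, not shown in the original answer): fixest estimations store the names of the removed variables, so you can inspect what feglm dropped in the same way as with feols:
# assumed to behave like feols: the dropped regressor(s), here presumably "y"
res_feglm$collin.var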

Related

How to make regression based on grouped rows and loop over columns?

What I want to do is to perform a regression loop that always has the same predictor but loops over responses (here: y1, y2 and y3). The problem is that I also want it done for each category of a grouping variable. In the example data below, I want to run the regression y_i ~ x for all three y variables, which would result in three regressions. But I want this done separately for group=a, group=b and group=c, resulting in 9 different regressions (preferably stored as lists). Can't figure out how to do it! Does anyone have an idea of how to do this?
My idea so far was to use a for loop or lapply combined with dplyr::group_by, but I can't get it to work.
Example data (I have a much larger data set for the actual analysis).
set.seed(123)
dat <- data.frame(group=c(rep("a",10), rep("b",10), rep("c",10)),
x=rnorm(30), y1=rnorm(30), y2=rnorm(30), y3=rnorm(30))
1) Use lmList in nlme (which comes with R so you don't have to install it).
library(nlme)
regs <- lmList(cbind(y1, y2, y3) ~ x | group, dat)
giving an lmList object having a component for each group. We show the component for group a and the other groups are similar.
> regs$a
Call:
lm(formula = object, data = dat, na.action = na.action)
Coefficients:
                y1      y2      y3
(Intercept) 0.2943  0.1395  0.4539
x           0.3721 -0.2206 -0.2255
2) Another approach is to perform one overall lm giving an lm object having the same coefficients as above.
lm(cbind(y1, y2, y3) ~ group + x:group + 0, dat)
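To check this, you can inspect the coefficient matrix of the overall fit (one column per response, with per-group intercepts and slopes as rows); a quick sketch:
fit2 <- lm(cbind(y1, y2, y3) ~ group + x:group + 0, dat)
coef(fit2)  # per-group intercepts and slopes for y1, y2, y3; the group a rows match regs$a above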
3) We could also use one of several list comprehension packages. This gives a list of 9 components. The names of the components identify the combination used, as does the call component (shown in the Call: line of the output) within each main component. Note that the current CRAN version is 0.1.0 but the code below relies on listcompr 0.1.1, which can be obtained from GitHub until it is put on CRAN.
# remotes::install_github("patrickroocks/listcompr")
library(listcompr)
packageVersion("listcompr") # need version 0.1.1 or later
regs <- gen.named.list("{y}.{g}",
  do.call("lm",
    list(reformulate("x", y), quote(dat), subset = bquote(dat$group == .(g)))
  ),
  y = c("y1", "y2", "y3"), g = unique(dat$group)
)
If you don't mind that the Call: line in the output is less descriptive, then it can be simplified to:
gen.named.list("{y}.{g}", lm(reformulate("x", y), dat, subset = group == g),
y = c("y1", "y2", "y3"), g = unique(dat$group))
Note
The input was corrected from the question, which had two y2's.
set.seed(123)
dat <- data.frame(group=c(rep("a",10), rep("b",10), rep("c",10)),
x=rnorm(30), y1=rnorm(30), y2=rnorm(30), y3=rnorm(30))

How to do a Granger causality test after panel vector autoregression (pVAR) in R?

How does one do a Granger causality test after running a panel vector autoregression in R (using the panelvar package)?
In order to run the panel VAR, one could do the following:
library(plm)
library(panelvar)
set.seed(12345)
x = rnorm(240)
z = x + rnorm(240)
y = rep(rnorm(15), each=16) + 2*x + 3*z + rnorm(240)
country = rep(c("A","B","C","D","E","F","G","H","I","J","K","L","M","N","O"), each=16 )
year = rep(seq(1995, 2010), 15)
panel = cbind.data.frame(country,year,x,z,y)
model <- pvargmm(dependent_vars = c("y", "x", "z"),
lags = 1,
transformation = "fod",
data = panel,
panel_identifier=c("country", "year"),
steps = c("twostep"),
system_instruments = FALSE,
max_instr_dependent_vars = 99,
max_instr_predet_vars = 99,
min_instr_dependent_vars = 2L,
min_instr_predet_vars = 1L,
collapse = TRUE
)
My question then is how to perform the Granger causality test (panelvar does not offer this as a function).
It seems that one would need to use the function pgrangertest from the plm package. However, I am not sure what the "formula" would be, since a pVAR model is different from a simple linear model. Also, should the "order" be the number of lags found to be best after running the pVAR with several lag options and selecting the one that provided the best model fit (based on the BIC, AIC, etc. provided by the Andrews_Lu_MMSC function)?
pgrangertest(inv ~ value, data = Grunfeld, order = 2L)
In other words, I need to replace "inv ~ value" with something else, and I am not clear on how to do that.
Given that I am interested in the interrelationship between y, x, and z, should I run the pgrangertest six times? Would the following make sense?
pgrangertest(y ~ x, data = panel, order = 2L)
pgrangertest(y ~ z, data = panel, order = 2L)
pgrangertest(x ~ z, data = panel, order = 2L)
pgrangertest(x ~ y, data = panel, order = 2L)
pgrangertest(z ~ x, data = panel, order = 2L)
pgrangertest(z ~ y, data = panel, order = 2L)
I know that pgrangertest only allows for two variables at a time, but shouldn't I control for the third one as well?
This is just a suggestion, so it may or may not help. Although the function only allows for two variables, you might be able to explore exactly what the function is doing to discover the "formula" and adapt/modify it to your needs using edit(pgrangertest). Analogously, you can overcome errors about collinearity for the function grangertest by manually specifying the type of test you want to use and mimicking what the actual function is doing (see my own question and answer here). Perhaps that would allow you to specify all the variables you want? The other question here might be helpful too (though it's about regular multivariate Granger causality).
Alternatively, try emailing the creator of the package. It's a long shot, but it could also be super helpful, and they might be able to actually solve your issue.
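If you do end up running the six pairwise tests from the question, here is a small sketch of how to loop over the pairs and collect the p-values (my own suggestion, not part of the answer; it assumes plm can pick up the panel structure from the first two columns of panel, and that the result exposes a $p.value element like other htest objects):
library(plm)
vars  <- c("y", "x", "z")
pairs <- expand.grid(lhs = vars, rhs = vars, stringsAsFactors = FALSE)
pairs <- pairs[pairs$lhs != pairs$rhs, ]  # the six ordered pairs
tests <- lapply(seq_len(nrow(pairs)), function(i) {
  fml <- reformulate(pairs$rhs[i], response = pairs$lhs[i])
  pgrangertest(fml, data = panel, order = 2L)
})
names(tests) <- paste(pairs$lhs, "~", pairs$rhs)
sapply(tests, function(t) t$p.value)  # assumed htest-like structure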

fixed effects in R: plm vs lm + factor()

I'm trying to run a fixed effects regression model in R. I want to control for heterogeneity in variables C and D (neither is a time variable).
I tried the following two approaches:
1) Use the plm package: Gives me the following error message
formula = Y ~ A + B + C + D
reg = plm(formula, data = data, index = c('C','D'), model = 'within')
Error in pdim.default(index[[1]], index[[2]]) : duplicate couples (time-id)
I also tried first creating a panel using
data_p = pdata.frame(data,index=c('C','D'))
But I have repeated observations in both columns.
2) Use factor() and lm: works well
formula = Y ~ A + B + factor(C) + factor(D)
reg = lm(formula, data= data)
What is the difference between the two methods? Why is plm not working for me? Is it because one of the indices should be time?
That error is saying you have repeated id-time pairs formed by variables C and D.
Let's say you have a third variable F which, jointly with C, keeps individuals distinct from one another (or whatever your first dimension is). Then with dplyr you can create a unique index, say id:
data$id <- data %>% group_indices(C, F)
Then the index argument in plm becomes index = c("id", "D").
The lm + factor() approach is a solution only if you have distinct observations. If that is not the case, it will not properly weight the results within each id, that is, the fixed effect is not properly identified.
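Putting the pieces together, the within regression described above would look something like this (a sketch under the answer's assumptions: F is the hypothetical extra identifier variable, and C and D are absorbed as fixed effects rather than entered as regressors):
library(dplyr)
library(plm)
data$id <- data %>% group_indices(C, F)  # unique cross-sectional identifier (group_indices is superseded in recent dplyr)
reg <- plm(Y ~ A + B, data = data, index = c("id", "D"), model = "within")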

Plotting SVM Linear Separator in R

I'm trying to plot the 2-dimensional hyperplanes (lines) separating a 3-class problem with e1071's svm. I used the default method (so there is no formula involved) like so:
library('e1071')
## S3 method for class 'default':
machine <- svm(x, y, kernel="linear")
I cannot seem to plot it by using the plot.svm method:
plot(machine, x)
Error in plot.svm(machine, x) : missing formula.
But I did not use the formula method, I used the default one, and if I pass '~' or '~.' as a formula argument it'll complain about the matrix x not being a data.frame.
Is there a way of plotting the fitted separator/s for the 2D problem while using the default method?
How may I achieve this?
Thanks in advance.
It appears that although svm() allows you to specify your input using either the default or formula method, plot.svm() only allows a formula method. Also, by only giving x to plot.svm(), you are not giving it all the info it needs. It also needs y.
Try this:
library(e1071)
x <- prcomp(iris[,1:4])$x[,1:2]
y <- iris[,5]
df <- data.frame(x, y = y)
machine <- svm(y ~ PC1 + PC2, data=df)
plot(machine, data=df)
It appears that your x has more than two feature-variables or columns.
Since plot.svm() plots only 2-Dimensions at a time, you need to specify these dimensions explicitly by providing a formula argument.
Example:
## more than two variables: fix 2 dimensions
data(iris)
m2 <- svm(Species~., data = iris)
plot(m2, iris, Petal.Width ~ Petal.Length, slice = list(Sepal.Width = 3, Sepal.Length = 4))
In cases where the data frame has only two feature dimensions, you can omit the formula argument.
Example:
## a simple example
data(cats, package = "MASS")
m <- svm(Sex~., data = cats)
plot(m, cats)
These details can be found in the plot.svm() documentation: https://www.rdocumentation.org/packages/e1071/versions/1.7-3/topics/plot.svm

Specifying formula in R with glm without explicit declaration of each covariate

I would like to force specific variables into glm regressions without fully specifying each one. My real data set has ~200 variables. I haven't been able to find examples of this in my online searching so far.
For example (with just 3 variables):
n=200
set.seed(39)
samp = data.frame(W1 = runif(n, min = 0, max = 1), W2=runif(n, min = 0, max = 5))
samp = transform(samp, # add A
A = rbinom(n, 1, 1/(1+exp(-(W1^2-4*W1+1)))))
samp = transform(samp, # add Y
Y = rbinom(n, 1,1/(1+exp(-(A-sin(W1^2)+sin(W2^2)*A+10*log(W1)*A+15*log(W2)-1+rnorm(1,mean=0,sd=.25))))))
If I want to include all main terms, this has an easy shortcut:
glm(Y~., family=binomial, data=samp)
But say I want to include all main terms (W1, W2, and A) plus W2^2:
glm(Y~A+W1+W2+I(W2^2), family=binomial, data=samp)
Is there a shortcut for this?
[editing self before publishing:] This works! glm(formula = Y ~ . + I(W2^2), family = binomial, data = samp)
Okay, so what about this one!
I want to omit one of the main-term variables and include only two main terms (A, W2) plus W2^2 and W2^2:A:
glm(Y~A+W2+A*I(W2^2), family=binomial, data=samp)
Obviously with just a few variables no shortcut is really needed, but I work with high dimensional data. The current data set has "only" 200 variables, but some others have thousands and thousands.
Your creative use of . to build the formula containing all or almost all variables is a good and clean approach. Another option that is sometimes useful is to build the formula programmatically as a string, and then convert it to a formula using as.formula:
vars <- paste("Var",1:10,sep="")
fla <- paste("y ~", paste(vars, collapse="+"))
as.formula(fla)
Of course, you can make the fla object way more complicated.
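For completeness, a hypothetical usage sketch tying this back to the samp data above (the variable selection is illustrative, not from the original answer):
# build the right-hand side from the column names of samp, then fit
rhs_vars <- setdiff(names(samp), "Y")
fla2 <- paste("Y ~", paste(rhs_vars, collapse = " + "))
glm(as.formula(fla2), family = binomial, data = samp)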
Aniko answered your question. To extend a bit:
You can also exclude variables using -:
glm(Y~.-W1+A*I(W2^2), family=binomial, data=samp)
For large groups of variables, I often make a data frame for grouping the variables, which allows you to do something like:
vars <- data.frame(
  names = names(samp),
  main = c(T, F, T, F),
  quadratic = c(F, T, T, F),
  main2 = c(T, T, F, F),
  stringsAsFactors = FALSE
)
regform <- paste(
  "Y ~",
  paste(
    paste(vars[vars$main, 1], collapse = "+"),
    paste(vars[1, 1], paste("*I(", vars[vars$quadratic, 1], "^2)"), collapse = "+"),
    sep = "+"
  )
)
> regform
[1] "Y ~ W1+A+W1 *I( W2 ^2)+W1 *I( A ^2)"
> glm(as.formula(regform),data=samp,family=binomial)
Using all kinds of conditions (on name, on structure, whatever) to fill the data frame allows me to quickly select groups of variables in large datasets.
