Allowing for aliased coefficients when running `grangertest()` in R - r

I'm currently trying to run a granger causality analysis in R/R Studio. I am receiving errors about aliased coefficients when using the function grangertest(). From my understanding, this occurs because there is perfect multicolinearity between the variables.
Due to having a very large number of pairwise comparisons (e.g. 200+), I would like to simply run the granger with the aliased coefficients as per normal rather than returning an error. According to one answer here, the solution is or was to add set singular.ok=TRUE, but either I am doing it incorrectly the answer s out of date. I've tried checking the documentation, but have come up empty. Any help would be appreciated.
library(lmtest)
x <- c(0,1,2,3)
y <- c(0,3,6,9)
grangertest(x,y,1) # I want this to run successfully even if there are aliased coefficients.
grangertest(x,y,1, singular.ok=TRUE) # this also doesn't work
"Error in waldtest.lm(fm, 2, ...) :
there are aliased coefficients in the model"
Additionally is there a way to flag x and y are actually aliased variables? There seem to be some answers like here but I'm having issues getting it working properly.
alias((x~ y))
Thanks in advance.

After some investigation and emailing the creator of the grangertest package, they sent me this solution. The solution should run on aliased variables when granger test does not. When the variables are not aliased, the solution should give the same values as the normal granger test.
library(lmtest)
library(dynlm)
# Some data that is multicolinear
x <- c(0,1,2,3,4)
y <- c(0,3,6,9,12)
# Some data that is not multicolinear
# x <- c(0,125,200,230,777)
# y <- c(0,3,6,9,200)
# Convert to time series (this is an important step)
x=ts(x)
y=ts(y)
# This will run even when the data is multicolinear (but also when it is not)
# and is functionally the same as running the granger test (which by default uses the waldtest
m1 = dynlm(x ~ L(x, 1:1) + L(y, 1:1))
m2 = dynlm(x ~ L(x, 1:1))
result <-anova(m1, m2, test="F")
# This will fail if the data is multicolinear or aliased but should give the same results as the anova otherwise (F value and P value etc)
#grangertest(y,x,1)

Related

Why does R and PROCESS render different result of a mediation model (one is significant, the other one is not)?

As a newcomer who just gets started in R, I am confused about the result of the mediation analysis.
My model is simple: IV 'T1Incivi', Mediator 'T1Envied', DV 'T2PSRB'. I ran the same model in SPSS using PROCESS, but the result was insignificant in PROCESS; however, the indirect effect is significant in R. Since I am not that familiar with R, could you please help me to see if there is anything wrong with my code? And tell me why the result is significant in R but not in SPSS?Thanks a bunch!!!
My code in R:
X predict M
apath <- lm(T1Envied~T1Incivi, data=dat)
summary(apath)
X and M predict Y
bpath <- lm(T2PSRB~T1Envied+T1Incivi, data=dat)
summary(bpath)
Bootstrapping for indirect effect
getindirect <- function(dataset,random){
d=dataset[random,]
apath <- lm(T1Envied~T1Incivi, data=d)
bpath <- lm(T2PSRB~T1Envied+T1Incivi, data=dat)
indirect <- apath$coefficients["T1Incivi"]*bpath$coefficients["T1Envied"]
return(indirect)
}
library(boot)
set.seed(6452234)
Ind1 <- boot(data=dat,
statistic=getindirect,
R=5000)
boot.ci(Ind1,
conf = .95,
type = "norm")`*PSRB as outcome*
In your function getindirect all linear regressions should be based on the freshly shuffled data in d.
However there is the line
bpath <- lm(T2PSRB~T1Envied+T1Incivi, data=dat)
that makes the wrong reference to the variable dat which should really not be used within this function. That alone can explain incoherent results.

predict.lm with arbitrary coefficients r

I'm trying to predict an lm object using predict.lm. However, I would like to use manually inserted coefficients.
To do this I tried:
model$coefficients <- coeff
(where "coeff" is a vector of correct coefficients)
which would indeed modify the coefficients as I want. Nevertheless, when I execute
predict.lm(model, new.data)
I just get predictions calculated with the "old" parameters. Is there a way I could force predict.lm to use the new ones?
Post Scriptum: I need to do this to fit a bin-smooth (also called regressogram).
In addition, when I predict "by hand" (i.e. using matrix multiplication) the results are fine, hence I'm quite sure that the problem lies in the predict.lm not recognizing my new coefficients.
Thanks in advance for the help!
Hacking the $coefficients element does indeed seem to work. Can you show what doesn't work for you?
dd <- data.frame(x=1:5,y=1:5)
m1 <- lm(y~x,dd)
m1$coefficients <- c(-2,1)
m1
## Call:
## lm(formula = y ~ x, data = dd)
##
## Coefficients:
## [1] -2 1
predict(m1,newdata=data.frame(x=7)) ## 5 = -2+1*7
predict.lm(...) gives the same results.
I would be very careful with this approach, checking each time you do something different with the hacked model.
In general it would be nice if predict and simulate methods took a newparams argument, but they don't in general ...

computing ridge estimate manually in R, simple

I'm trying to learn about ridge regression, and I am using R. From what I understand the following should be the same beta.r1 and beta.r2 in the code below are the same
library(MASS)
n=50
v1=runif(n)
v2=v1+2
V=cbind(1,v1,v2)
w=3+v1+v2
I=diag(3)
lambda=2 #arbitrarily chosen
beta.r1=solve(t(V)%*%V+lambda*I)%*%t(V)%*%w
#Using library(MASS)
fit=lm.ridge(w~v1+v2,lambda=2, Inter=FALSE)
beta.r2=coef(fit)
#Shouldn't beta.r1 and beta.r2 be the same?
I think the variable scaling performed in the lm.ridge code (which you can access by typing lm.ridge into your R console) that likely cause differences. The code scales each variable by its root-mean-squared value:
Xscale <- drop(rep(1/n, n) %*% X^2)^0.5
X <- X/rep(Xscale, rep(n, p))
Your code does not perform any variable scaling.
The variable scaling is hinted at on the ?lm.ridge help page in the description of what is returned by lm.ridge:
scales: scalings used on the X matrix.
Therefore you can access the scaling used by lm.ridge:
fit$scales
# v1 v2
# 0.2650311 0.2650311

predict in caret ConfusionMatrix is removing rows

I'm fairly new to using the caret library and it's causing me some problems. Any
help/advice would be appreciated. My situations are as follows:
I'm trying to run a general linear model on some data and, when I run it
through the confusionMatrix, I get 'the data and reference factors must have
the same number of levels'. I know what this error means (I've run into it before), but I've double and triple checked my data manipulation and it all looks correct (I'm using the right variables in the right places), so I'm not sure why the two values in the confusionMatrix are disagreeing. I've run almost the exact same code for a different variable and it works fine.
I went through every variable and everything was balanced until I got to the
confusionMatrix predict. I discovered this by doing the following:
a <- table(testing2$hold1yes0no)
a[1]+a[2]
1543
b <- table(predict(modelFit,trainTR2))
dim(b)
[1] 1538
Those two values shouldn't disagree. Where are the missing 5 rows?
My code is below:
set.seed(2382)
inTrain2 <- createDataPartition(y=HOLD$hold1yes0no, p = 0.6, list = FALSE)
training2 <- HOLD[inTrain2,]
testing2 <- HOLD[-inTrain2,]
preProc2 <- preProcess(training2[-c(1,2,3,4,5,6,7,8,9)], method="BoxCox")
trainPC2 <- predict(preProc2, training2[-c(1,2,3,4,5,6,7,8,9)])
trainTR2 <- predict(preProc2, testing2[-c(1,2,3,4,5,6,7,8,9)])
modelFit <- train(training2$hold1yes0no ~ ., method ="glm", data = trainPC2)
confusionMatrix(testing2$hold1yes0no, predict(modelFit,trainTR2))
I'm not sure as I don't know your data structure, but I wonder if this is due to the way you set up your modelFit, using the formula method. In this case, you are specifying y = training2$hold1yes0no and x = everything else. Perhaps you should try:
modelFit <- train(trainPC2, training2$hold1yes0no, method="glm")
Which specifies y = training2$hold1yes0no and x = trainPC2.

Computing all subsets of a vector in R

I want do write a small function which I can use for automatic feature selection in a logistic regression in R, by testing in a brute force manner all subsets of predictor variables and then evaluate via CV their classification performance.
Surprisingly I did not find a package which does this "all subset feature selection" and thus I would like to implement it myself.
Unfortunately my limited R knowledge makes me fail to write a loop which generates all subsets of a given vector and I was wondering if someone could point me in the right direction
Caveat incernor
The bestglm package is what you are after
The function bestglm selects the best subset of inputs for the glm family. The selec-
tion methods available include a variety of information criteria as well as cross-validation
The vignette goes through a number of examples.
library(bestglm)
data(SAHeart)
# using Cross valiation for selection
out<-bestglm(SAheart,IC = 'CV', family=binomial, t = 10)
out
# CVd(d = 373, REP = 10)
# BICq equivalent for q in (0.190525988534159, 0.901583162187443)
# Best Model:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) -6.44644451 0.92087165 -7.000372 2.552830e-12
# tobacco 0.08037533 0.02587968 3.105731 1.898095e-03
# ldl 0.16199164 0.05496893 2.946967 3.209074e-03
# famhistPresent 0.90817526 0.22575844 4.022774 5.751659e-05
# typea 0.03711521 0.01216676 3.050542 2.284290e-03
# age 0.05046038 0.01020606 4.944159 7.647325e-07
Wouldn't drop1() and add1() be helpful for your purpose? They come with the usual caution that automatic feature selection may not always be the most appropriate thing to do, but I presume you have made an informed choice on this.
You can use paste() + combn(), e.g.
varnames <- c("a","b","c")
rhs <- unlist( sapply(1:length(varnames),function(k) apply(combn(varnames,k),2,paste,collapse=" + ") ) )
formulae <- as.formula( quote( paste("z ~", rhs) ) )
... but perhaps there is a more elegant way?

Resources