Multiple variable names in VAR causality function in R

I'm working on some code to determine Granger causalities for a set of financial and public-interest data. I've run into an issue with the syntax of the causality() function in the vars package. Here's a sample of the code and its output:
data = cbind(x, y, z, price, vol)
data_VAR = VAR(data, type="both", lag.max=30, ic="AIC")
causality(data_VAR, cause="x")$Granger
Granger causality H0: x do not Granger-cause y z price vol
data: VAR object data_VAR
F-Test = 1.6696, df1 = 120, df2 = 185, p-value = 0.0008476
This gives me results against the hypothesis that x does not Granger-cause changes in y, z, price and vol.
If I wanted to test x and y jointly as variables that Granger-cause the others, what would the syntax be? According to the documentation I found online, it's possible to run this with multiple variables as the "causers", if you will, but from the code of the function I can't figure out how multiple variables should be passed.
Thanks for any help in advance!

You need to put all the causes into a vector.
> library(vars)
> data(Canada)
> var.2c <- VAR(Canada, p = 2, type = "const")
> causality(var.2c, cause = c("e", "prod"))$Granger
Granger causality H0: e prod do not Granger-cause rw U
data: VAR object var.2c
F-Test = 6.8545, df1 = 8, df2 = 292, p-value = 2.919e-08
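Applied to the original model, the same pattern tests x and y jointly (assuming the fitted object is data_VAR as in the question):
> causality(data_VAR, cause = c("x", "y"))$Granger
This tests H0: x and y do not Granger-cause z, price and vol.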

Related

Multiple imputation and mlogit for a multinomial regression

I am trying to run a multinomial regression with imputed data. I can do this with the nnet package; however, I want to use mlogit. With the mlogit package I keep getting the following error: "Error in 1:nrow(data) : argument of length 0".
Making the data:
library(mlogit)
library(nnet)
library(tidyverse)
library(mice)
df <- data.frame(vax = sample(1:6, 500, replace = T),
age = runif(500, 12, 18),
var1 = sample(1:2, 500, replace = T),
var2 = sample(1:5, 500, replace = T))
# Create missing data using the mice package:
df2 <- ampute(df, prop = 0.15)
df3 <- df2$amp
df3$vax <- as.factor(df3$vax)
df3$var1 <- as.factor(df3$var1)
df3$var2 <- as.factor(df3$var2)
# Impute missing data:
df4 <- mice(df3, m = 5, print = T, seed = 123)
It works using nnet's multinom:
multinomtest <- with(df4, multinom(vax ~ age + var1 + var2, data = df, model = T))
summary(pool(multinomtest))
But it throws an error when I try to reshape the data into mlogit format:
test <- with(df4, dfidx(data = df4, choice = "vax", shape = "wide"))
Does anyone have any idea how I can get the imputed data into mlogit format, or even whether mlogit has compatibility with mice or any other imputation package?
Answer
You are using with.mids incorrectly, and thus both lines of code are wrong; the multinom line just doesn't give an error. If you want to apply multiple functions to the imputed datasets, you're better off using something like lapply:
analyses <- lapply(seq_len(df4$m), function(i) {
data.i <- complete(df4, i)
data.idx <- dfidx(data = data.i, choice = "vax", shape = "wide")
mlogit(vax ~ 1 | age + var1 + var2,
data = data.idx,
reflevel = "1",
nests = list(type1 = c("1", "2"), type2 = c("3","4"), type3 = c("5","6")))
})
test <- list(call = "", call1 = df4$call, nmis = df4$nmis, analyses = analyses)
oldClass(test) <- c("mira", "matrix")
summary(pool(test))
How with.mids works
When you apply with to a mids object (AKA the output of mice::mice), then you are actually calling with.mids.
If you use getAnywhere(with.mids) (or just type mice:::with.mids), you'll find that it does a couple of things:
It loops over all imputed datasets.
It uses complete to get one dataset.
It runs the expression with the dataset as the environment.
The third step is the problem. For functions that use formulas (like lm, glm and multinom), you can use that formula within a given environment. If the variables are not in the current environment (but rather in e.g. a data frame), you can specify a new environment by setting the data variable.
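A stripped-down sketch of those three steps (not the actual mice source, just the gist):
with_mids_sketch <- function(data, expr) {
  expr <- substitute(expr)               # capture the unevaluated expression
  lapply(seq_len(data$m), function(i) {  # 1. loop over all imputed datasets
    data.i <- mice::complete(data, i)    # 2. extract one completed dataset
    eval(expr, envir = data.i)           # 3. evaluate the expression with the
  })                                     #    dataset as the environment
}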
The problems
This is where both your problems derive from:
In your multinom call, you set the data variable to be df. Hence, you are actually running your multinom on the original df, NOT the imputed dataset!
In your dfidx call, you are again filling in data directly. This is also wrong. However, leaving it empty also gives an error. This is because with.mids doesn't fill in the data argument, but only the environment. That isn't sufficient for you.
Fixing multinom
The solution for your multinom line is simple: just don't specify data:
multinomtest <- with(df4, multinom(vax ~ age + var1 + var2, model = T))
summary(pool(multinomtest))
As you will see, this will yield very different results! But it is important to realise that this is what you are trying to obtain.
Fixing dfidx (and mlogit)
We cannot do this with with.mids, since it uses the imputed dataset as the environment, but you want to use the modified dataset (after dfidx) as your environment. So, we have to write our own code. You could just do this with any looping function, e.g. lapply:
analyses <- lapply(seq_len(df4$m), function(i) {
data.i <- complete(df4, i)
data.idx <- dfidx(data = data.i, choice = "vax", shape = "wide")
mlogit(vax ~ 1 | age + var1 + var2, data = data.idx, reflevel = "1", nests = list(type1 = c("1", "2"), type2 = c("3","4"), type3 = c("5","6")))
})
From there, all we have to do is make something that looks like a mira object, so that we can still use pool:
test <- list(call = "", call1 = df4$call, nmis = df4$nmis, analyses = analyses)
oldClass(test) <- c("mira", "matrix")
summary(pool(test))
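As an aside, mice also exports as.mira(), which wraps a plain list of fitted models into a mira object, so the hand-built list above can likely be replaced by (assuming a reasonably recent mice version):
test <- as.mira(analyses)
summary(pool(test))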
Offering this as a way forward to circumvent the error with dfidx():
df5 <- df4$imp %>%
# work with a list, where each top-element is a different imputation run (imp_n)
map(~as.list(.x)) %>%
transpose %>%
# for each run, impute and return the full (imputed) data set
map(function(imp_n.x) {
df_out <- df4$data
df_out$vax[is.na(df_out$vax)] <- imp_n.x$vax
df_out$age[is.na(df_out$age)] <- imp_n.x$age
df_out$var1[is.na(df_out$var1)] <- imp_n.x$var1
df_out$var2[is.na(df_out$var2)] <- imp_n.x$var2
return(df_out)
}) %>%
# No errors with dfidx() now
map(function(imp_n.x) {
dfidx(data = imp_n.x, choice = "vax", shape = "wide")
})
However, I'm not too familiar with mlogit(), so can't help beyond this.
Update 8/2/21
As @slamballais mentioned in their answer, the issue is with the dataset you refer to when fitting the model. I assume that mldata (from your code in the comments section) is a data.frame? That is probably why you are seeing the same coefficients: you are not referring to the imputed data sets (which I've called imp_n.x in the functions). The function purrr::map() is very similar to lapply(), in that you apply a function to the elements of a list. So to get the code working properly, change mldata to imp_n.x:
# To fit mlogit() for each imputed data set
df5 %>%
map(function(imp_n.x) {
# form as specified in the comments
mlogit(vax ~ 1 | age + var1 + var2,
data = imp_n.x,
reflevel = "1",
nests = list(type1 = c('1', '2'),
type2 = c('3','4'),
type3 = c('5','6')))
})
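If you assign the result of this pipeline to a variable, say fits (a name introduced here for illustration), the list of fitted models can then be pooled the same way as in the answer above:
# assuming: fits <- df5 %>% map(function(imp_n.x) { mlogit(...) }) as shown above
summary(pool(as.mira(fits)))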

How to perform Johansen cointegration test iteratively for 2 variables taking some rows at a time?

I want to test cointegration between two time series using the Johansen cointegration test. I want to perform the test on a rolling window: the first 120 observations, then dropping one observation from the top and adding one at the bottom, over 2250 observations in total. I want to automate this using a for loop, but the code gives an error. Please help.
library(urca)
x= BDICOM$BDI
y= BDICOM$Soybn
for(i in 1:2666){
A = x[i:i+120]; B = y[i:i+120]
jocot[i] = ca.jo(data.frame(A,B), type = "eigen", ecdet = "none",K = 2, spec = "longrun");i=i+1
}
Try this, but without a sample of your data I can't be sure it's correct:
library(urca)
x = BDICOM$BDI
y = BDICOM$Soybn
jocot <- vector('numeric', (2000-120))
for(i in 1:(2000-120)){
A = x[i:(i+120)]
B = y[i:(i+120)]
jocot[i] = ca.jo(
data.frame(A, B),
type = "eigen",
ecdet = "none",
K = 2,
spec = "longrun"
)@teststat[2]  # extract the test statistic so jocot stays numeric
}
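Since ca.jo() returns an S4 object, you can also pull the critical values from its cval slot to compare each window's statistic against them (a sketch, using the A and B from one window):
res <- ca.jo(data.frame(A, B), type = "eigen", ecdet = "none", K = 2, spec = "longrun")
res@teststat  # the test statistics
res@cval      # the 10%, 5% and 1% critical values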

Consistency of categorical encodings in h2o (and R) for training and new test sample

I'm having trouble understanding whether I need to be consistent with the categorical/factor encodings of variables. By consistency I mean ensuring that the mapping between integers and levels is the same in the training sample and in a new testing sample.
This answer seems to suggest that it is not necessary. On the contrary, this answer suggests that it is indeed necessary.
Suppose I have a training sample with a variable xcat that can take the values a, b, c. The expected result is that the y variable will tend to take values close to 1 when xcat is a, 2 when xcat is b, and 3 when xcat is c.
First I'll create the dataframe, pass it to h2o and then encode with the function as.factor:
library(h2o)
localH2O = h2o.init(ip = "localhost", port = 54321, startH2O = TRUE)
n = 20
y <- sample(1:3, size = n, replace = T)
xcat <- letters[y]
xnum <- sample(1:10, size = n, replace = T)
y <- y + rnorm(n, mean = 0, sd = 0.3)  # note: 'dep' in the original was undefined; y itself is the intended base
df <- data.frame(xcat=xcat, xnum=xnum , y=y)
df.hex <- as.h2o(df, destination_frame="df.hex")
#Encode as factor. You will get: a=1, b=2, c=3
df.hex[ , "xcat"] = as.factor(df.hex[, "xcat"])
Now I'll estimate a GLM and predict on the same sample:
x = c("xcat", "xnum")
glm <- h2o.glm( y = c("y"), x = x, training_frame=df.hex,
family="gaussian", seed=1234)
glm.fit <- h2o.predict(object=glm, newdata=df.hex)
glm.fit gives the expected results (no surprises here).
Now I'll create a new test dataset that only has a and c, no b value:
xcat2 = c("c", "c", "a")
xnum2 = c(2, 3, 1)
y = c(1, 2, 1) #not really needed
df.test = data.frame(xcat=xcat2, xnum=xnum2, y=y)
df.test.hex <- as.h2o(df.test, destination_frame="df.test.hex")
df.test.hex[ , "xcat"] = as.factor(df.test.hex[, "xcat"])
Running str(df.test.hex$xcat) shows that this time the factor encoding has assigned 2 to c and 1 to a. This looked like it could be trouble, but prediction works as expected:
test.fit = h2o.predict(object=glm, newdata=df.test.hex)
test.fit
#gives 2.8, 2.79, 1.21 as expected
What's going on here? Does the GLM model carry around the level information of the x variables, so that it doesn't mind if the internal encoding differs between the training and the new test data? Is that the case for all h2o models in general?
From looking at one of the answers I linked above, it seems that at least some R models do require consistency.
Thanks and best!

How to get df2 in causality() Granger test in R

I am using a VAR(1) model with two variables (f, m), each with 59 observations.
I have already checked the R help and several books on this topic, but I can't figure out how df2 = 108 is obtained.
library(vars)
var.causal.m <- causality(ajustVAR1FM, cause = "m")
> var.causal.m
$Granger
Granger causality H0: m do not Granger-cause f
data: VAR object ajustVAR1FM
F-Test = 5.9262, df1 = 1, df2 = 108, p-value = 0.01656
If you look at the package manual, it is clearly stated that the test statistic is distributed as F(pK1K2, KT - n*), where K = K1 + K2 and n* is the total number of parameters in the VAR(p) above (including deterministic regressors). For the test, the vector of endogenous variables yt is split into two subvectors y1t and y2t with dimensions (K1 x 1) and (K2 x 1).
You can also type causality in console and see the following:
df1 <- p * length(y1.names) * length(y2.names)
df2 <- K * obs - length(PI)
Example: using Canada data
library(vars)
var.2c <- VAR(Canada, p = 2, type = "const")
causality(var.2c, cause = "e")
> dim(Canada)
[1] 84 4
> causality(var.2c, cause = "e")
$Granger
Granger causality H0: e do not Granger-cause prod rw U
data: VAR object var.2c
F-Test = 6.2768, df1 = 6, df2 = 292, p-value = 3.206e-06
The cause variable is a single variable, so K1 = 1 and K2 = 3 (4 - 1, where 4 is the total number of variables). T is the effective number of observations, here 84 - 2 (lag = 2) = 82, and n* = 36 (4 equations with 9 parameters each: 2 lags x 4 variables + constant). So df1 = 2 x 1 x 3 = 6 and df2 = 4 x 82 - 36 = 292.
Note:
In your case the lag is p = 1 and n* = 8: you estimate two equations with 4 parameters each (I suspect you also have a trend, giving 1 lag x 2 variables + constant + trend = 4). The effective number of observations is 59 - 1 (lag p = 1) = 58, with K1 = 1, K2 = 1 and K = 2. So df1 = 1 x 1 x 1 = 1 and df2 = 2 x 58 - 8 = 108.
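The same bookkeeping in R, for both the Canada example and your VAR(1), just to make the arithmetic explicit:
# Canada example: p = 2, K = 4, effective T = 84 - 2 = 82
p <- 2; K1 <- 1; K2 <- 3; K <- K1 + K2; T_eff <- 82
n_star <- K * (p * K + 1)                        # 4 equations x 9 parameters = 36
c(df1 = p * K1 * K2, df2 = K * T_eff - n_star)   # 6, 292
# Your VAR(1): 2 variables, constant + trend, effective T = 59 - 1 = 58
p <- 1; K1 <- 1; K2 <- 1; K <- 2; T_eff <- 58
n_star <- K * (p * K + 2)                        # 2 equations x 4 parameters = 8
c(df1 = p * K1 * K2, df2 = K * T_eff - n_star)   # 1, 108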

random model formula object

I want to pass a formula built from function arguments into a random-effects model, but I think the following error is due to a wrongly constructed formula object, and I could not fix it.
set.seed(1234)
mydata <- data.frame (A = rep(1:3, each = 20), B = rep(1:2, each = 30),
C = rnorm(60, 10, 5))
mydata$A <- as.factor(mydata$A)
mydata$B <- as.factor(mydata$B)
myfunction <- function (mydata, yvars, genovar, replication) {
require("lme4")
formula = paste ("yvars" ~ 1|"genovar" + 1|"replication")
model1 <- lmer(formula, data = dataframe, REML = TRUE)
return(ranef(model2))
}
myfunction(mydata=dataf, yvars = "C", genovar = "A", replication = "B")
Error: length(formula <- as.formula(formula)) == 3 is not TRUE
There were several wonky things in here, but I think this is close to what you want.
set.seed(1234)
mydata <- data.frame (A = factor(rep(1:3, each = 20)),
B = factor(rep(1:2, each = 30)),
C = rnorm(60, 10, 5))
require("lme4")
myfunction <- function (mydata, yvars, genovar, replication) {
formula <- paste (yvars,"~ (1|",genovar,") + (1|",replication,")")
model1 <- lmer(as.formula(formula), data = mydata, REML = TRUE)
return(ranef(model1))
}
myfunction(mydata=mydata, yvars = "C", genovar = "A", replication = "B")
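A slightly tidier way to build the same formula is base R's reformulate(), which avoids the manual paste() (a sketch of the same function, under the same assumptions):
myfunction2 <- function (mydata, yvars, genovar, replication) {
  f <- reformulate(c(sprintf("(1|%s)", genovar), sprintf("(1|%s)", replication)),
                   response = yvars)
  model1 <- lmer(f, data = mydata, REML = TRUE)
  ranef(model1)
}
myfunction2(mydata = mydata, yvars = "C", genovar = "A", replication = "B")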
Beware, however, that lmer doesn't work the way that classical random-effects ANOVA does -- it may perform very badly with such small numbers of replicates. (In the example I tried, it set the variance of A to zero, which is at least not unreasonable.) The GLMM FAQ has some discussion of this issue. (Random-effects ANOVA would have exceedingly low power in that case, but might not be quite as bad.) If you really want to fit random-effects models on such small samples, you might want to consider reconstructing the classical method-of-moments approach (as I recall there is/was a raov function in S-PLUS that did random-effects ANOVA, but I don't know if it was ever implemented in R).
Finally, for future questions along these lines you may do better on the r-sig-mixed-models@r-project.org mailing list -- Stack Overflow is nice, but there is more R/mixed-model expertise over there.
