Multivariate Granger's causality - r

I'm having issues doing a multivariate Granger causality test. I'd like to check whether conditioning on a third variable affects the results of a causal test.
Here's one example for a single dependent and independent variable, based on an earlier question I asked that was answered by @Alex:
Granger's causality test by column
library(lmtest)
M1<- matrix( c(2,3, 1, 4, 3, 3, 1,1, 5, 7), nrow=5, ncol=2)
M2<- matrix( c(7,3, 6, 9, 1, 2, 1,2, 8, 1), nrow=5, ncol=2)
M3<- matrix( c(1, 3, 1,5, 7,3, 1, 3, 3, 4), nrow=5, ncol=2)
For example, the formula for a conditioned linear regression would be
formula = y ~ w + x * z
How do I carry out this test as a function of a third or fourth variable please?

1. The solution for stationary variables is well-established: see the FIAR (v 0.3) package.
This is the paper related to the package; it includes a concrete example of multivariate Granger causality (for the case where all of the variables are stationary). Page 12: theory, Page 15: practice. (A minimal sketch of this stationary case is given at the end of this answer.)
2. In the case of mixed (stationary and nonstationary) variables, make all the variables stationary first (via differencing etc.). Leave the already-stationary ones as they are. Then finish with the procedure above (case 1).
3. In the case of "non-cointegrated nonstationary" variables, there is no need for a VECM. Run a VAR on the variables (after making them stationary first, of course) and apply FIAR::condGranger etc.
4. In the case of "cointegrated nonstationary" variables, the answer is considerably longer:
Run the Johansen procedure (detect the cointegration rank via urca::ca.jo).
Apply vec2var to convert the VECM to a VAR (since FIAR is based on VAR).
John Hunter's latest book nicely summarizes what can happen and what can be done in this last case.
You may want to read this as well.
To my knowledge, conditional/partial Granger causality supersedes GC via the "block exogeneity Wald test over VAR".
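For intuition, here is a minimal sketch of the stationary case (point 1) done by hand: a conditional Granger test of x -> y given z via nested regressions and a Wald test, using base R and lmtest rather than FIAR. The series, the lag order p, and the coefficients below are invented for illustration.
library(lmtest)
set.seed(1)
n <- 200
z <- rnorm(n)
x <- rnorm(n)
y <- c(0, 0.5 * x[-n]) + c(0, 0.3 * z[-n]) + rnorm(n)  # y driven by lagged x and lagged z
p <- 1  # lag order (assumed; choose via AIC/BIC in practice)
dat <- data.frame(
  y    = y[-(1:p)],
  ylag = y[1:(n - p)],
  xlag = x[1:(n - p)],
  zlag = z[1:(n - p)]
)
restricted   <- lm(y ~ ylag + zlag, data = dat)         # y's own past plus z's past
unrestricted <- lm(y ~ ylag + zlag + xlag, data = dat)  # ... plus x's past
# Conditional Granger test of x -> y given z: are the lagged-x terms jointly significant?
waldtest(restricted, unrestricted)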

Related

Which statistical test in R to use to detect differential expression on a simulated dataset when there are three replicates

I was asked to begin this exercise in bioinformatics (https://uclouvain-cbio.github.io/WSBIM1322/sec-testing.html) by simulating a dataset of log2 fold-changes measured in triplicate for 1000 genes (the abs function is used to avoid taking the log of negative values):
sim <- abs(rnorm(3000, mean = 0, sd = 1))
simlog <- log2(sim)
simlog_mat <- matrix(simlog, ncol = 3,
dimnames = list(paste0("gene", 1:1000), paste0("repl", 1:3)))
What statistical test should I use to test for 'differential expression'? The way the question is phrased, it seems I need to compare each replicate against the others. As there are 3 replicates, I don't think I can use a t-test, although the course material I'm using has only covered the t-test and FDR in this chapter.
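One route the question itself hints at is a one-sample t-test per gene (is the mean log2 fold-change different from 0?) followed by FDR adjustment. Whether that is what the exercise intends is an assumption, but a minimal sketch using the simulated matrix above looks like this.
# One-sample t-test per gene against a mean of 0
pvals <- apply(simlog_mat, 1, function(g) t.test(g, mu = 0)$p.value)
# Adjust for multiple testing across the 1000 genes using the false discovery rate
padj <- p.adjust(pvals, method = "fdr")
sum(padj < 0.05)  # number of genes flagged as 'differentially expressed'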

Specifying truncation point in glmmTMB R package

I am working with a large dataset that contains longitudinal data on the gambling behavior of 184,113 participants. The data are based on complete tracking of electronic gambling behavior within a gambling operator. Gambling behavior data are aggregated on a monthly level, for a total of 70 months. I have an ID variable separating participants, a time variable (months), as well as numerous gambling behavior variables such as active days played for a given month, bets placed for a given month, total losses for a given month, etc. Participants vary in when they have been actively gambling: one participant may have gambled in months 2, 3, 4, and 7, another in months 3, 5, and 7, and a third in months 23, 24, 48, 65, etc.
I am attempting to run a truncated negative binomial 2 model in glmmTMB and I am wondering how the package handles the lack of zeros. I have longitudinal data on gambling behavior, days played for each month (for a total of 70 months). The variable can take values between 1 and 31 (depending on the month); there are no zeros, because months in which a participant did not play are absent from the dataset. Example of how the data are structured with just two participants:
# Example variables and data frame in long form
# Includes id variable, time variable and example variable
id <- c(1, 1, 1, 1, 2, 2, 2)
time <- c(2, 3, 4, 7, 3, 5, 7)
daysPlayed <- c(2, 2, 3, 3, 2, 2, 2)
dfLong <- data.frame(id = id, time = time, daysPlayed = daysPlayed)
My question: how do I specify where the truncation happens in glmmTMB? Does it default to 0? I want to truncate at 0 and have run the following code (I am going to compare models; the first one is a simple unconditional one):
DaysPlayedUnconditional <- glmmTMB(daysPlayed ~ 1 + (1 | id), dfLong, family = truncated_nbinom2)
Will it do the trick?
From Ben Bolker via r-sig-mixed-models@r-project.org:
"I'm not 100% clear on your question, but: glmmTMB only does zero-truncation, not k-truncation with k>0, i.e. you can only specify the model Prob(x==0) = 0 Prob(x>0) = Prob(NBinom(x))/Prob(NBinom(x>0)) (terrible notation, but hopefully you get the idea)"

R: How do I aggregate losses by a Poisson observation?

I'm new to R, but I am trying to use it to aggregate losses observed from a severity distribution by an observation from a frequency distribution (essentially what rcompound does). However, I need a more granular approach, as I need to manipulate the severity distribution before 'aggregation'.
Let's take an example. Suppose you have:
rpois(10,lambda=3)
This gives you something like:
[1] 2 2 3 5 2 5 6 4 3 1
Additionally, suppose we have severity of losses determined by:
rgamma(20,shape=1,scale=10000)
So that we also have the following output:
[1] 233.0257 849.5771 7760.4402 731.5646 8982.7640 24172.2369 30824.8424 22622.8826 27646.5168 1638.2333 6770.9010 2459.3722 782.0580 16956.1417 1145.4368 5029.0473 3485.6412 4668.1921 5637.8359 18672.0568
My question is: what is an efficient way to get R to take each Poisson observation in turn and then aggregate losses from my severity distribution? For example, the first Poisson observation is 2. Therefore, adding two observations (the first two) from my Gamma distribution gives 1082.60.
I say this needs to be 'efficient' (in run time) because:
- The Poisson parameter may become significantly large, i.e. up to 1000 or so.
- The realisations are likely to be up to 1,000,000, i.e. up to a million Poisson and Gamma observations to sort through.
Any help would be greatly appreciated.
Thanks, Dave.
It looks like you want to split the gamma vector at positions indicated by the cumulative sums of the Poisson vector.
The following function (from here) does the splitting:
splitAt <- function(x, pos) unname(split(x, cumsum(seq_along(x) %in% pos)))
pois <- c(2, 2, 3, 5, 2, 5, 6, 4, 3, 1)
gam <- c(233.0257, 849.5771, 7760.4402, 731.5646, 8982.7640, 24172.2369, 30824.8424, 22622.8826, 27646.5168, 1638.2333, 6770.9010, 2459.3722, 782.0580, 16956.1417, 1145.4368, 5029.0473, 3485.6412, 4668.1921, 5637.8359, 18672.0568)
posits <- cumsum(pois)
Then do the following:
sapply(splitAt(gam, posits + 1), sum)
[1] 1082.603 8492.005 63979.843 61137.906 17738.200 19966.153 18672.057
According to the post I linked to above, the splitAt() function slows down for large arrays, so you could (if necessary) consider the alternatives proposed in that post. For my part, I generated 1e6 Poisson and 1e6 gamma observations, and the above function ran in 0.78 sec on my machine.
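If the splitting itself ever becomes the bottleneck, another sketch (not benchmarked here) is to build a group index from the Poisson counts and sum the severities per group; this assumes the gamma vector holds sum(pois) draws, so the index is truncated to match the shorter example vector above.
# Observation i of the gamma vector belongs to Poisson draw grp[i]
grp <- rep(seq_along(pois), pois)[seq_along(gam)]
rowsum(gam, grp)  # or: tapply(gam, grp, sum)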

How can I minimize error between estimates and actuals by multiplying by a constant (in R)?

I have two large datasets in R: one of actual measurements and one of the predictions I made for these measurements. I found that the trends of my predictions were accurate, but the amplitude was off. I am wondering if there is a way to find a constant in R such that, when the predictions are multiplied by it, the error between the actuals and the predictions is minimized.
For example:
predictions <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
actuals <- c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
The constant I would want to generate in this case would be 2.
I have looked into using the optim() function, but get the warning message that "one-dimensional optimization by Nelder-Mead is unreliable: use 'Brent' or optimize() directly."
f <- function(p) cor(p * predictions, actuals)
optim(
  c(1),
  f,
  control = list(fnscale = -1)
)
I am not familiar with optimization, so it is probable that I am approaching this problem the wrong way. I appreciate the help!
First let's define an error function to minimize:
MultError <- function(constant, predictions, actuals) {
  return(sum((constant * predictions - actuals)^2))
}
This is the sum of squared errors... you could use a different one!
optimize() expects a function, an interval to search in (which you could get by inspecting the min and max of predictions / actuals), and any extra parameters. It minimizes by default.
optimize(MultError, interval=c(0, 5), predictions=predictions, actuals=actuals)
This returns
$minimum
[1] 2
$objective
[1] 0
Which is the value of the minimum and the value of the error function, respectively.
Presumably, your match is not perfect, so I also tried it with artificial noise
set.seed(1)
actuals <- rnorm(length(predictions), 2, 0.4) * predictions
Then it returns
$minimum
[1] 2.087324
$objective
[1] 22.21434
Pretty good!
EDIT:
I answered this question using optimize() because of the title and the direction the OP had gone, but on thinking about it harder, it seemed like it might be overkill. What's wrong with simply taking mean(actuals / predictions)?
So I decided to test them both...
set.seed(1)
arithmetic <- opt <- numeric(10000)
for (trial in 1:10000) {
  actuals <- rnorm(length(predictions), 2, 0.4) * predictions
  arithmetic[trial] <- mean(actuals / predictions)
  opt[trial] <- optimize(MultError, interval = c(0, 5), predictions = predictions, actuals = actuals)$minimum
}
For 10,000 possible datasets, we've recovered the constant using the average and by minimizing the sum of squared errors. What are the mean and variance of our estimators?
> mean(arithmetic)
[1] 1.999102
> mean(opt)
[1] 1.998695
Both do pretty well on average.
> var(arithmetic)
[1] 0.0159136
> var(opt)
[1] 0.02724814
The arithmetic mean estimator has a tighter spread, however. So I would argue that you should just take the average!
You might get a pretty good approximation using linear regression with the lm() function:
m <- lm(actuals ~ predictions)
m is the object where the linear regression model is stored.
coef(m) will give you the constant to multiply by, plus an offset (the intercept).
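If the goal is a single multiplicative constant with no offset, a minimal sketch (assuming the predictions and actuals vectors defined above) is to drop the intercept, which gives the least-squares constant directly and matches the optimize() result:
# Force the regression through the origin: one coefficient, no intercept
m0 <- lm(actuals ~ predictions + 0)
coef(m0)  # the least-squares constant, i.e. sum(predictions * actuals) / sum(predictions^2)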

modeling a beta-binomial regression

Assume this easy example:
treatment <- factor(rep(c(1, 2), c(43, 41)), levels = c(1, 2), labels = c("placebo", "treated"))
improved <- factor(rep(c(1, 2, 3, 1, 2, 3), c(29, 7, 7, 13, 7, 21)), levels = c(1, 2, 3), labels = c("none", "some", "marked"))
numberofdrugs <- rpois(84, 50) + 1
healthvalue <- rpois(84, 5)
y <- data.frame(healthvalue, numberofdrugs, treatment, improved)
test <- lm(healthvalue ~ numberofdrugs + treatment + improved, y)
What am I supposed to do when I'd like to estimate a beta-binomial regression in R? Is anybody familiar with it? Any thoughts are appreciated!
I don't see how this example relates to beta-binomial regression (i.e., you have generated count data, rather than (number out of total possible)). To simulate beta-binomial data, see rbetabinom in either the emdbook or the rmutil packages ...
library(sos); findFn("beta-binomial") finds a number of useful starting points, including
aod (analysis of overdispersed data), betabin function
betabinomial family in VGAM
hglm package
emdbook package (for dbetabinom) plus the bbmle package (for mle2)
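As a minimal sketch of the simulate-then-fit route (the covariate, sample size, and parameter values below are invented; emdbook supplies rbetabinom and VGAM the betabinomial family):
library(emdbook)  # rbetabinom()
library(VGAM)     # vglm() and the betabinomial family
set.seed(1)
n    <- 84
size <- 20                                   # trials per observation (arbitrary)
x    <- rnorm(n)
prob <- plogis(-0.5 + 0.8 * x)               # true success probabilities
y    <- rbetabinom(n, prob = prob, size = size, theta = 5)  # overdispersed counts
dat  <- data.frame(y = y, x = x, size = size)
# Beta-binomial regression: cbind(successes, failures) on the left-hand side
fit <- vglm(cbind(y, size - y) ~ x, family = betabinomial, data = dat)
summary(fit)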
