I use a VAR(1) model with two variables (f, m), each with 59 observations.
I have already consulted the R help and several books on this topic, but I can't figure out how df2 = 108.
library(vars)
var.causal.m <- causality(ajustVAR1FM, cause = "m")
> var.causal.m
$Granger
Granger causality H0: m do not Granger-cause f
data: VAR object ajustVAR1FM
F-Test = 5.9262, df1 = 1, df2 = 108, p-value = 0.01656
If you look at the package manual, it is clearly stated that the test statistic is distributed as F(p*K1*K2, K*T - n*), where K = K1 + K2 and n* equals the total number of parameters in the VAR(p) above (including deterministic regressors). Further, for the test, the vector of endogenous variables yt is split into two subvectors y1t and y2t with dimensions (K1 x 1) and (K2 x 1), with K = K1 + K2.
You can also type causality at the console and see the following:
df1 <- p * length(y1.names) * length(y2.names)
df2 <- K * obs - length(PI)
Example: using Canada data
library(vars)
var.2c <- VAR(Canada, p = 2, type = "const")
causality(var.2c, cause = "e")
> dim(Canada)
[1] 84 4
> causality(var.2c, cause = "e")
$Granger
Granger causality H0: e do not Granger-cause prod rw U
data: VAR object var.2c
F-Test = 6.2768, df1 = 6, df2 = 292, p-value = 3.206e-06
The cause variable is 1, so K1 = 1 and K2 = 3 (4 - 1, where 4 is the total number of variables). T is the effective number of observations, here 84 - 2 (lag = 2) = 82, and n* = 36 (4 equations with 9 parameters each). So df1 = 2*1*3 = 6 and df2 = 4*82 - 36 = 292.
Note:
In your case, lag p = 1 and n* = 8: you estimate two equations with 4 parameters each (I suspect you also have a trend, so it should be 4). The effective number of observations is 59 - 1 (lag p = 1) = 58, with K1 = 1, K2 = 1, and K = 2. So df1 = 1*1*1 = 1 and df2 = 2*58 - 8 = 108.
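A quick sketch to verify these numbers in R (assuming, as noted above, a constant and a trend in each equation, i.e. 4 parameters per equation):
# Degrees of freedom for the Granger F-test, per the manual's
# F(p*K1*K2, K*T - n*) formula
p  <- 1                 # lag order
K1 <- 1; K2 <- 1        # dimensions of the two subvectors
K  <- K1 + K2           # number of endogenous variables
obs    <- 59 - p        # effective observations after losing p lags
n_star <- 2 * 4         # 2 equations x 4 parameters (lags of f and m, constant, trend)
p * K1 * K2             # df1 = 1
K * obs - n_star        # df2 = 2 * 58 - 8 = 108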
I have a table with frequencies for control and treatment groups for a multinomial factor (response), with three levels (Negative, Neutral, Positive). I want to calculate, for each level, the difference between treatment and control, with confidence intervals, and add them to the table.
I am looking for something that can be applied to several similar frequency tables that compare treatment and control groups, where the response categories vary (e.g. unlikely, 50-50, likely).
Here is the table:
N_A <- data.frame(response = c("Negative", "Neutral", "Positive"),
                  n_T = c(48, 43, 42),  # treatment group
                  n_C = c(36, 40, 51))  # control group
I have tried to use the BinomDiffCI function from the DescTools package. I managed to write a function that runs BinomDiffCI for the first row, and extracts the lower CI.
library(DescTools)
lci.diff <- function() {
  xci <- BinomDiffCI(x1 = N_A[1, 2], n1 = sum(N_A[2]),
                     x2 = N_A[1, 3], n2 = sum(N_A[3]),
                     method = "waldcc")
  xci[, 2]  # lower CI bound
}
It's not great, but maybe a start. I want to 1) add difference and upper CI, 2) do the same for all rows, 3) attach this to the dataset, and 4) apply the same to other frequency tables comparing treatment and control.
Here is code to create the lower and upper bounds of the confidence interval:
library(DescTools)
ci_diff <- function(df, i) {
  tbl <- BinomDiffCI(x1 = df[i, 2], n1 = sum(df[2]),
                     x2 = df[i, 3], n2 = sum(df[3]),
                     method = "waldcc")
  tbl[, c("lwr.ci", "upr.ci")]
}
N_A <- cbind(N_A, t(sapply(1:nrow(N_A), \(i) ci_diff(N_A, i))))
response n_T n_C lwr.ci upr.ci
1 Negative 48 36 -0.04342071 0.1982961
2 Neutral 43 40 -0.11268594 0.1293812
3 Positive 42 51 -0.20971246 0.0381418
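To also get the point estimate of the difference (item 1) and reuse the code for other treatment/control tables (item 4), one possible sketch; add_diff_ci is a hypothetical helper that assumes every table shares this layout (labels in column 1, treatment counts in column 2, control counts in column 3):
library(DescTools)
add_diff_ci <- function(df) {
  # BinomDiffCI returns est, lwr.ci, and upr.ci for each row
  ci <- t(sapply(seq_len(nrow(df)), function(i)
    BinomDiffCI(x1 = df[i, 2], n1 = sum(df[2]),
                x2 = df[i, 3], n2 = sum(df[3]),
                method = "waldcc")))
  colnames(ci) <- c("diff", "lwr.ci", "upr.ci")
  cbind(df, ci)
}
add_diff_ci(N_A)  # the same call works for any table with this layout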
I am doing a simulation of linear regression in R.
The regression model I consider is y_i = a + b_1 * x_1i + b_2 * x_2i + e_i.
The parameter design is as follows:
x_1i ~ N(2, 1), x_2i ~ Poisson(4), e_i ~ N(0, 1), theta = (a, b_1, b_2)
With the code below, I generate 100 independent random samples of (y, x_1, x_2) 1000 times using the distributions mentioned above, and I estimate theta_hat (the estimator of theta) for each sample. After getting theta_hat, I would like to plot the distributions of the estimators of a (a_hat), b_1 (b_1_hat), and b_2 (b_2_hat), respectively.
## Construct 1000 replications of x_1, x_2, and the error term
x_1_1000 <- as.data.frame(replicate(n = 1000, expr = rnorm(n = 100, mean = 2, sd = 1)))
colnames(x_1_1000) <- paste("x_1", 1:1000, sep = "_")
x_2_1000 <- as.data.frame(replicate(n = 1000, expr = rpois(n = 100, lambda = 4)))
colnames(x_2_1000) <- paste("x_2", 1:1000, sep = "_")
error_1000 <- as.data.frame(replicate(n = 1000, expr = rnorm(n = 100, mean = 0, sd = 1)))
colnames(error_1000) <- paste("e", 1:1000, sep = "_")
## Generate y with a = 1, b_1 = 1, b_2 = -2
y_1000 <- 1 + x_1_1000 * 1 + x_2_1000 * (-2) + error_1000
colnames(y_1000) <- paste("y", 1:1000, sep = "_")
######################################################################
lms <- lapply(1:1000, function(x) lm(y_1000[,x] ~ x_1_1000[,x] + x_2_1000[,x]))
theta_hat_1000 <- as.data.frame(sapply(lms, coef))
After running the regressions, I store the results in lms, which is a list. Because I only want the coefficients, I store them in theta_hat_1000. However, when I try to plot the distributions, I cannot get what I need. I have tried two ways to solve the problem but am still confused.
The first way: I renamed the data frame theta_hat_1000. I successfully renamed column i, for i from 1 to 1000, but I could not rename the rows.
rownames(theta_hat_1000[1,]) <- "ahat"
rownames(theta_hat_1000[2,]) <- "x1hat"
rownames(theta_hat_1000[3,]) <- "x2hat"
The code listed above ran without any error message but failed to change the row names. I then tried the following code:
rownames(theta_hat_1000) <- c("ahat", "x1hat", "x2hat")
This renamed the rows successfully. However, when I check whether anything is stored under that name, it returns NULL:
theta_hat_1000$ahat
NULL
Therefore, I noticed something weird and tried a second approach: I unlisted theta_hat_1000. However, that did not give me what I want either. I expected three rows, each with 1000 values, but I actually got 3000 observations in 1 column.
The ideal result is three columns, each with 1000 values, in a data frame, so I can do further processing, like using ggplot to show the distributions of the estimated coefficients.
I have been stuck on this for several hours. I would appreciate any help or suggestions.
The line theta_hat_1000$ahat in your code does not work because "ahat" is a row name, not a column name, in the data frame. You would get the result by calling theta_hat_1000["ahat", ].
However, I understand that your desired result is actually a data frame with 3 columns (and 1000 rows) representing the 3 parameters of your regression model (intercept, x1, x2). The line as.data.frame(sapply(lms, coef)) produces a data frame with 3 rows and 1000 columns. You can transpose the matrix before converting it to a data frame to get 1000 rows and 3 columns:
theta_hat_1000 <- sapply(lms, coef)
theta_hat_1000 <- as.data.frame(t(theta_hat_1000))
colnames(theta_hat_1000) <- c("ahat", "x1hat", "x2hat")
head(theta_hat_1000)
ahat x1hat x2hat
1 2.0259326 0.7417404 -2.111874
2 0.7827929 0.9437324 -1.944320
3 1.1034906 1.0091594 -2.035405
4 0.9677150 0.8168757 -1.905367
5 1.0518646 0.9616123 -1.985357
6 0.8600449 1.0781489 -2.017061
Now you could also call the variables with theta_hat_1000$ahat.
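From there, a minimal ggplot2 sketch of the three sampling distributions (assuming the reshaped theta_hat_1000 above):
library(ggplot2)
library(tidyr)
# Pivot to long format so each estimator gets its own panel
theta_long <- pivot_longer(theta_hat_1000,
                           cols = c(ahat, x1hat, x2hat),
                           names_to = "estimator",
                           values_to = "estimate")
ggplot(theta_long, aes(x = estimate)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ estimator, scales = "free_x")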
I am trying to compare two percentages/proportions for statistical significance in R, using a Chi-Square test. I am familiar with a SAS method for Chi Square in which I supply a dataset column for a numerator, another column for denominator, and a categorical variable to distinguish distributions (A/B).
However, I am getting unexpected values in R using some example sets. When I test two similar populations with low sample sizes, I am getting p-values of (approximately) zero, where I would expect the p-values to be very high (~1).
My test set is below, where I went with sugar content in a batch of water: e.g. "does group A use the same ratio of sugar as group B?". My actual problem is similar, where this isn't a pass-fail type test and the numerator and denominator values can vary wildly between samples (different sugar and/or water weights per sample). My first objective is to verify that I can get a high p-value from two similar sets. The next question is, at what sample size does the p-value become low enough to indicate significance?
# CREATE 2 NEARLY-EQUAL DISTRIBUTIONS (EXPECTING HIGH P-VALUE FROM PROP.TEST)
library(dplyr)  # also provides tibble() and %>% used below
set.seed(108)
group_A = tibble(group = "A", sugar_lbs = rnorm(mean = 10, sd = 3, n = 50), batch_lbs = rnorm(mean = 30, sd = 6, n = 50))
group_B = tibble(group = "B", sugar_lbs = rnorm(mean = 10, sd = 3, n = 50), batch_lbs = rnorm(mean = 30, sd = 6, n = 50))
batches <- rbind(group_A, group_B)
I then use summarize to calculate the overall sugar percentage for each group:
# SUMMARY TOTALS
totals <- batches %>%
group_by(group) %>%
summarize(batch_count = n(),
batch_lbs_sum = sum(batch_lbs),
sugar_lbs_sum = sum(sugar_lbs),
sugar_percent_overall = sugar_lbs_sum / batch_lbs_sum) %>%
glimpse()
I then supply the sugar percentage between groups to a prop.test, expecting a high p-value
# ADD P-VALUE & CONFIDENCE INTERVAL
stats <- totals %>%
rowwise() %>%
summarize(p_val = prop.test(x = sugar_percent_overall, n = batch_count, conf.level = 0.95, alternative = "two.sided")$p.value) %>%
mutate(p_val = round(p_val, digits = 3)) %>%
mutate(conf_level = 1 - p_val) %>%
select(p_val, conf_level) %>%
glimpse()
# FINAL SUMMARY TABLE
cbind(totals, stats) %>%
glimpse()
Unfortunately, the final table gives me a p-value of 0, suggesting the two nearly identical sets are independent/different. Shouldn't I get a p-value of ~1?
Observations: 2
Variables: 7
$ group <chr> "A", "B"
$ batch_count <int> 50, 50
$ batch_lbs_sum <dbl> 1475.579, 1475.547
$ sugar_lbs_sum <dbl> 495.4983, 484.6928
$ sugar_percent_overall <dbl> 0.3357992, 0.3284833
$ p_val <dbl> 0, 0
$ conf_level <dbl> 1, 1
From another angle, I also tried to compare the recommended sample size from power.prop.test with an actual prop.test using that recommended sample size. This gave me the reverse problem: I was expecting a low p-value, since I am using the recommended sample size, but instead I get a p-value of ~1.
# COMPARE POWER.PROP.TEST RECOMMENDED COUNTS WITH AN ACTUAL PROP.TEST
power.prop.test(p1 = 0.33, p2 = 0.34, sig.level = 0.10, power = 0.80, alternative = "two.sided") ## n = 38154
prop.test(x = c(0.33, 0.34), n = c(38154, 38154), conf.level = 0.90, alternative = "two.sided") ## p = 1 -- shouldn't p be < 0.10?
Am I using prop.test wrong or am I misinterpreting something? Ideally, I would prefer to skip the summarize step and simply supply the dataframe, the numerator column 'sugar_lbs', and the denominator 'batch_lbs' as I do in SAS -- is this possible in R?
(Apologies for any formatting issues as I'm new to posting)
---------------------------------
EDIT - EXAMPLE WITH ONLY PROPORTIONS & SAMPLE SIZE
I think my choice of normal distributions may have distracted from the original question. I found an example that gets to the heart of what I was trying to ask: how to use prop.test given only a proportion/percentage and the sample size. Instead of city_percent and city_total below, I could simply rename these to sugar_percent and batch_lbs. I think this reference answers my question, and prop.test appears to be the correct test to use.
My actual problem has an extremely non-normal distribution, but it is not easily replicated in code.
STANFORD EXAMPLE (pages 37-50)
- https://web.stanford.edu/class/psych10/schedule/P10_W7L1
df <- tibble(city = c("Atlanta", "Chicago", "NY", "SF"), washed = c(1175, 1329, 1169, 1521), not_washed = c(413, 180, 334, 215)) %>%
mutate(city_total = washed + not_washed,
city_percent = washed / city_total) %>%
select(-washed, -not_washed) %>%
glimpse()
# STANFORD CALCULATION (p = 7.712265e-35)
pchisq(161.74, df = 3, lower.tail = FALSE)
# PROP TEST VERSION (SAME RESULT, p = 7.712265e-35)
prop.test(x = df$city_percent * df$city_total, n = df$city_total, alternative = "two.sided", conf.level = 0.95)$p.value
The documentation for prop.test says:
Usage:
prop.test(x, n, p = NULL,
          alternative = c("two.sided", "less", "greater"),
          conf.level = 0.95, correct = TRUE)
Arguments:
x    a vector of counts of successes, a one-dimensional table with two entries,
     or a two-dimensional table (or matrix) with 2 columns, giving the counts
     of successes and failures, respectively.
n    a vector of counts of trials; ignored if x is a matrix or a table.
So if you want a "correct" test, you would have to use sugar_lbs_sum as the x instead of sugar_percent_overall. You should still receive some kind of warning that x is non-integral, but that's not my major concern here.
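A sketch of that mechanical fix, assuming the totals summary from the question:
# Pass counts, not proportions: sugar lbs as "successes", batch lbs as "trials"
prop.test(x = totals$sugar_lbs_sum, n = totals$batch_lbs_sum)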
But from a statistical perspective this is completely the wrong way of doing things. You are directly inducing spurious correlation in a test of the difference between two quantities by arbitrarily dividing each by a sum. If the samples (sugar_lbs_sum) are independent, but you divide by their sums, you have made the ratios dependent. This violates the assumptions of the statistical test in a critical way. Kronmal (1993), "Spurious correlation and the fallacy of the ratio standard revisited", covers this.
The data you generated are independent normal, so don't sum them, rather test for a difference with the t-test.
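For the simulated data in the question, a direct comparison might look like this (a sketch, assuming the batches tibble defined above):
# Test whether mean sugar_lbs differs between groups A and B; for two
# nearly identical normal samples this yields a high p-value
t.test(sugar_lbs ~ group, data = batches)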
The Stanford link I added to my original post answered my question. I modified the Stanford example to rename the variables from city to group and washed counts to sugar_lbs, and I doubled one batch (like comparing a small versus a large city). I now get the expected high p-value (0.65), indicating no statistically significant difference between the proportions.
When I add more groups (for more degrees of freedom) and continue to vary batch sizes proportionally, I continue to get high p-values as expected, confirming the recipe is the same. If I modify the sugar percent of any one group, the p-value immediately drops to zero, indicating one of the groups is different, as expected.
Finally, when running prop.test within a dplyr pipe, I found I should not have used the rowwise() step, which caused my p-values to fall to zero; removing it gives the correct p-value, as shown in the sketch below. The only downside is that I don't know which group is different until I compare the groups two at a time.
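That is, a sketch of the corrected pipe, using the summed counts as the answer above suggests (assuming the totals frame from the original post):
stats <- totals %>%
  summarize(p_val = prop.test(x = sugar_lbs_sum, n = batch_lbs_sum)$p.value)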
#---------------------------------------------------------
# STANFORD EXAMPLE - MODIFIED TO SUGAR & ONE DOUBLE BATCHED
#---------------------------------------------------------
df <- tibble(group = c("A", "B"), sugar_lbs = c(495.5, 484.7), water_lbs = c(1475.6 - 495.5, 1475.6 - 484.7)) %>%
mutate(sugar_lbs = ifelse(group == "B", sugar_lbs * 2, sugar_lbs),
water_lbs = ifelse(group == "B", water_lbs * 2, water_lbs)) %>%
mutate(batch_lbs = sugar_lbs + water_lbs,
sugar_percent = sugar_lbs / batch_lbs) %>%
glimpse()
sugar_ratio_all <- sum(df$sugar_lbs) / (sum(df$sugar_lbs) + sum(df$water_lbs))
water_ratio_all <- sum(df$water_lbs) / (sum(df$sugar_lbs) + sum(df$water_lbs))
dof <- (2 - 1) * (length(df$group) - 1)
df <- df %>%
mutate(sugar_expected = (sugar_lbs + water_lbs) * sugar_ratio_all,
water_expected = (sugar_lbs + water_lbs) * water_ratio_all) %>%
mutate(sugar_chi_sq = (sugar_lbs - sugar_expected)^2 / sugar_expected,
water_chi_sq = (water_lbs - water_expected)^2 / water_expected) %>%
glimpse()
q <- sum(df$sugar_chi_sq) + sum(df$water_chi_sq)
# STANFORD CALCULATION
pchisq(q, df = dof, lower.tail = F)
# PROP TEST VERSION (SAME RESULT)
prop.test(x = df$sugar_percent * df$batch_lbs, n = df$batch_lbs, alternative = "two.sided", conf.level = 0.95)$p.value
I'm working on some code to determine Granger causalities for a set of financial and public-interest data. I've run into a bit of an issue with the syntax of the causality() function within the vars package. Here's a sample of code and a potential result:
data = cbind(x, y, z, price, vol)
data_VAR = VAR(data, type="both", lag.max=30, ic="AIC")
causality(data_VAR, cause="x")$Granger
Granger causality H0: x do not Granger-cause y z price vol
data: VAR object data_VAR
F-Test = 1.6696, df1 = 120, df2 = 185, p-value = 0.0008476
This gives me results against the hypothesis that x does not Granger-cause changes in y, z, price, and vol.
If I wanted to test x and y together as variables that Granger-cause the others, what would the syntax be? According to the documentation I found online, it's possible to run this with multiple variables as the "causers", if you will, but based on the code for the function I can't figure out exactly how multiple variables should be passed.
Thanks for any help in advance!
You need to put all the causes into a vector.
> library(vars)
> data(Canada)
> var.2c <- VAR(Canada, p = 2, type = "const")
> causality(var.2c, cause = c("e", "prod"))$Granger
Granger causality H0: e prod do not Granger-cause rw U
data: VAR object var.2c
F-Test = 6.8545, df1 = 8, df2 = 292, p-value = 2.919e-08
I am trying to run a Monte Carlo simulation of a difference in differences estimator, but I am running into an error. Here is the code I am running:
# Set the random seed
set.seed(1234567)
library(MonteCarlo)
#Set up problem, doing this before calling the function
# set sample size
n<- 400
# set true parameters: betas and sd of u
b0 <- 1 # intercept for control data (b0 in diffndiff)
b1 <- 1 # shift on both control and treated after treatment (b1 in diffndiff)
b2 <- 2 # difference between intercepts on control vs. treated (b2 - the level
        # difference pre-treatment, to compare to the coef on treat)
b3 <- 3 # shift after treatment that applies only to the treated group (b3 -
        # the coefficient of interest in diffndiff)
b4 <- 0 # parallel time trend (not measured in diffndiff); biases b0, b1 but
        # not the b3 we care about
b5 <- 0 # allows the treated group trend to shift after treatment (0 if
        # parallel trends holds)
su <- 4 # std. dev. for errors
dnd <- function(n, b0, b1, b2, b3, b4, b5, su){
  # initialize a time vector (set observations equal to n)
  timelength <- 10
  t <- c(1:timelength)
  # allow for multiple observations in one time period (can simulate
  # multiple states within one group or something)
  num_obs_per_period <- n / timelength
  t0 <- c(1:timelength)
  for (p in 1:(num_obs_per_period - 1)){
    t <- c(t, t0)
  }
  T <- 5 # set treatment period
  g <- t > T
  post <- as.numeric(g)
  # assign equal numbers of observations to each state to start with (would
  # like to allow selection into treatment at some point)
  treat <- vector()
  for (m in 1:(round(n / 2))){
    treat <- c(treat, 0)
  }
  for (m in 1:(round(n / 2))){
    treat <- c(treat, 1)
  }
  u <- rnorm(n, 0, su) # this assumes the mean error is zero
  # create the y vector from the data
  y <- b0 + b1 * post + b2 * treat + b3 * treat * post + b4 * t + b5 * (t - T) * treat * post + u
  interaction <- treat * post
  # run the regression
  olsres <- lm(y ~ post + treat + interaction)
  # assign the coefficients
  bhat0 <- olsres$coefficients[1]
  bhat1 <- olsres$coefficients[2]
  bhat2 <- olsres$coefficients[3]
  bhat3 <- olsres$coefficients[4]
  bhat3_stderr <- coef(summary(olsres))[4, "Std. Error"] # row 4 = interaction
  # use bhat3 to conduct a t-test and determine pass or fail
  tval <- (bhat3 - b3) / bhat3_stderr
  # decision at the 5% significance level (FALSE indicates the t-stat was
  # less than 1.96, and we fail to reject the null)
  decision <- abs(tval) > 1.96
  decision <- unname(decision)
  return(list(decision))
}
#Define a parameter grid to simulate over
from <- -5
to <- 5
increment <- .25
gridparts<- c(from , to , increment)
b5_grid <- seq(from = gridparts[1], to = gridparts[2], by = gridparts[3])
parameter <- list("n" = n, "b0" = b0, "b1" = b1, "b2" = b2, "b3" = b3,
                  "b4" = b4, "b5" = b5_grid, "su" = su)
#Now simulate this multiple times in a monte carlo setting
results <- MonteCarlo(func = dnd ,nrep = 100, param_list = parameter)
And the error that comes up is:
Error in results[[i]] <- array(NA, dim = c(dim_vec, nrep)) :
  attempt to select less than one element in integerOneIndex
This leads me to believe that something is attempting to access the "0th" element of a vector, which doesn't exist in R as far as I understand. However, I don't think this comes from my own code rather than from inside the package, and I can't make sense of the code that runs when I call it.
I am also open to hearing about other methods that would essentially replace simulate() from Stata.
The function passed to MonteCarlo must return a list with named components. Changing the last line of dnd from return(list(decision)) to
return(list("decision" = decision))
should work.
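With that change, the simulation runs, and the results across the b5 grid can then be summarized with the package's MakeTable helper; a minimal sketch, assuming the parameter list defined above:
results <- MonteCarlo(func = dnd, nrep = 100, param_list = parameter)
# Prints a LaTeX table of the mean of "decision" (the rejection rate),
# with one row per b5 value
MakeTable(output = results, rows = "b5", cols = "n", digits = 3)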