Testing and adjusting for autocorrelation / serial correlation in R

Unfortunately I'm not able to provide a reproducible example, but hopefully you get the idea regardless.
I am conducting some regression analyses where the dependent variable is the DCC (dynamic conditional correlation) of a pair of return series, i.e. two stocks. I'm using dummies to represent shocks in the return series, i.e. the worst 1% of observed returns. In sum:
DCC = c + 1%Dummy
When I run DurbinWatsonTest() I get the output:
Autocorrelation: 0.9987
D-W statistic: 0
p-value: 0
HA: rho !=0
Does this just mean that there is a highly significant presence of autocorrelation?
I also tried dwtest(), but that yields NA values for both the p-value and the DW statistic.
To correct for autocorrelation I used the code:
spx10 = lm(bit_sp500 ~ Spx_0.1)
spx10_hc = coeftest(spx10, vcov. = vcovHC(spx10, method = "arellano",type = "HC3"))
How can I be certain that it had any effect, given that I cannot run the DW test on spx10_hc, and the regression output did not change noticeably? Is it common that a regression with one independent variable changes only ever so slightly when adjusting for autocorrelation?
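For reference, a minimal sketch of how a Newey-West (HAC) correction is commonly applied to a plain lm() fit, using hypothetical names (dat, return_dcc, shock_dummy); corrections of this kind adjust only the standard errors, t-values, and p-values, while the OLS coefficient estimates stay the same:
library(lmtest)    # dwtest(), coeftest()
library(sandwich)  # NeweyWest() HAC covariance
# hypothetical data frame with the DCC series and the 1% shock dummy
fit <- lm(return_dcc ~ shock_dummy, data = dat)
dwtest(fit)                             # Durbin-Watson test on the OLS residuals
# HAC (Newey-West) standard errors: coefficients unchanged, SEs adjusted
coeftest(fit, vcov. = NeweyWest(fit))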

Related

DHARMa outlier test is significant, what are my next steps?

I'm looking for information and guidance to help me understand the outlier test in DHARMa for negative binomial regression. Here is the diagnostic plot from DHARMa using the function simulateResiduals().
First off, the dispersion test is significant in the plot. Using testDispersion() on the model and on the residuals, I get a value of 2.495. Visually, the dots seem to align pretty well with the QQ line. The developer stated: 'If you see a dispersion parameter of 1.01, I would not worry, even if the test is significant. A significant value of 5, however, is clearly a reason to move to a model that accounts for overdispersion.' From this I conclude that the deviation is within the acceptable range for the NB regression.
Second, the outlier test is also significant. I have never had this before, and I can't find much information on how many outliers are okay versus not okay to have. Following the recommendation of DHARMa's developer (reference), I looked at the magnitude of the outliers to investigate this. Here is the code and output:
library(MASS)    # glm.nb()
library(DHARMa)
ModelNB <- glm.nb(BUD ~ Treatment*YEAR, data = Data_Bud)
simulationOutput <- simulateResiduals(fittedModel = ModelNB, plot = T)
testOutliers(simulationOutput, type = "binomial")
DHARMa outlier test based on exact binomial test with approximate expectations
data:  simulationOutput
outliers at both margin(s) = 12, observations = 576, p-value = 0.00269
alternative hypothesis: true probability of success is not equal to 0.007968127
95 percent confidence interval:
 0.01081011 0.03610864
sample estimates:
frequency of outliers (expected: 0.00796812749003984 )
 0.02083333
**Can someone help me understand this output?** Is having 12 outliers out of 576 observations okay? In statistics classes, I was told that removing outliers is a big no-no. What does "true probability of success is not equal to 0.007968127" mean? Does it mean I can't accept H1 and need to accept H0 for the outlier test?
Information on my model:
ModelNB <- glm.nb(BUD ~ Treatment*YEAR, data=Data_Bud)
BUD = The number of floral buds on a twig
Treatment = 5 different fertiliser treatments
YEAR = 2 different years (2020 and 2021)
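For intuition, the numbers in that output can be reproduced approximately with a plain binomial test; this is only a sketch of what the DHARMa test is effectively doing (assuming the default 250 simulations), not its exact internal code:
# With 250 simulations, a point can fall outside the simulated range on
# either side, so the expected outlier frequency is 2 / (250 + 1)
expected <- 2 / 251      # = 0.00796812749..., matching the output above
observed <- 12           # outliers reported at both margins
n        <- 576          # number of observations
observed / n             # about 0.0208, the reported outlier frequency
binom.test(observed, n, p = expected)   # asks whether 0.0208 is compatible with 0.008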

GAM smooths interaction differences - calculate p value using mgcv and gratia 0.6

I am using the useful gratia package by Gavin Simpson to extract the difference between two smooths for two different levels of a factor variable. The smooths are generated by the wonderful mgcv package. For example:
library(mgcv)
library(gratia)
m1 <- gam(outcome ~ s(dep_var, by = fact_var) + fact_var, data = my.data)
diff1 <- difference_smooths(m1, smooth = "s(dep_var)")
draw(diff1)
This gives me a graph of the difference between the two smooths for each level of the "by" variable in the gam() call. The graph has a shaded 95% credible interval (CI) for the difference.
Statistical significance, or regions of statistical significance at the 0.05 level, is assessed by whether and where the y = 0 line crosses the CI, where the y-axis represents the difference between the smooths.
Here is an example from Gavin's site where the "by" factor variable had 3 levels.
The differences are clearly statistically significant (at 0.05) over nearly all of the graphs.
Here is another example I have generated using a "by" variable with 2 levels.
The difference in my example is clearly not statistically significant anywhere.
In the mgcv package, an approximate p value is output for each fitted smooth, testing the null hypothesis that its coefficients are all equal to 0, based on a chi-squared test.
My question is, can anyone suggest a way of calculating a p value that similarly assesses the difference between the two smooths instead of solely relying on graphical evidence?
The output from difference_smooths() is a data frame with differences between the smooth functions at 100 points in the range of the smoothed variable, the standard error for the difference and the upper and lower limits of the CI.
Here is a link to the release notes for gratia 0.4, which explain the difference_smooths() function, but gratia is now at version 0.6.
Thanks in advance for taking the time to consider this.
Don
One way of getting a p value for the interaction between the by-factor levels is to use the ci_level argument of difference_smooths(). The default is 0.95. The ci_level can be varied to find a level at which y = 0 is no longer within the CI bands. If, for example, this occurred at ci_level = my_level, the p value for testing the hypothesis that the difference is zero everywhere would be 1 - my_level.
This is not totally satisfactory. For example, it takes a little manual experimentation, and it may be difficult to discern accurately when zero drops out of the CI. A function could, however, be written to search the data frame returned by difference_smooths() as ci_level is varied. This is not totally satisfactory either, because the detection of a non-zero CI would depend on the 100 points chosen by difference_smooths() to assess the difference between the two curves. Then again, the standard errors are approximate for a GAM fitted with mgcv, so that shouldn't be too much of a problem.
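A rough sketch of such a search, assuming the interval columns in the difference_smooths() output are named lower and upper (check names(diff1) for your version of gratia):
find_p <- function(model, smooth, levels = seq(0.99, 0.50, by = -0.01)) {
  for (lev in levels) {
    d <- difference_smooths(model, smooth = smooth, ci_level = lev)
    if (any(d$lower > 0 | d$upper < 0)) {
      return(1 - lev)   # first level at which 0 leaves the band somewhere
    }
  }
  NA_real_              # zero never left the band over the levels tried
}
find_p(m1, "s(dep_var)")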
Here is a graph where the zero first drops out of the CI.
Zero dropped out at ci_level = 0.88 and was still in the interval at ci_level = 0.89. So an approximate p value would be 1 - 0.88 = 0.12.
Can anyone think of a better way?
Reply to Gavin Simpson's comments Feb 19
Thanks very much Gavin for taking the time to make your comments.
I am not sure that the criterion >= 0 (for negative diffs) is a good way to go, because, given the draws from the posterior, many diffs are likely to meet this criterion. I am interpreting your criterion as: sample the posterior distribution, count how many differences meet the criterion, and take that percentage as the p value. Correct me if I have misunderstood. Using this approach, I consistently got p values of around 0.45-0.5 for different GAM models, even when it was clear the difference in the smooths should be statistically significant, at least at p = 0.05, because the confidence band around the smooth did not contain zero at a number of points.
Instead, I was thinking perhaps it would be better to compare the means of the posterior distribution of each of the diffs. For example
# get coefficients for the by smooths
coeff.level1 <- coef(gam.model1)[31:38]
coeff.level0 <- coef(gam.model1)[23:30]
# these indices are specific to my multi-variable gam.model1
# in my case 8 coefficients per smooth
# get posterior coefficients variances for the by smooths' coefficients
vp_level1 <- gam.model1$Vp[31:38, 31:38]
vp_level0 <- gam.model1$Vp[23:30, 23:30]
#run the simulation to get the distribution of each
#difference coefficient using the joint variance
library(MASS)
no.draws = 1000
sim <- mvrnorm(n = no.draws, (coeff.level1 - coeff.level0),
(vp_level1 + vp_level0))
# sim is a no.draws X no. of coefficients (8 in my case) matrix
# put the results into a data.frame.
y.group <- data.frame(y = as.vector(sim),
group = c(rep(1,no.draws), rep(2,no.draws),
rep(3,no.draws), rep(4,no.draws),
rep(5,no.draws), rep(6,no.draws),
rep(7,no.draws), rep(8,no.draws)) )
# y has the differences sampled from their posterior distributions.
# group is just a grouping name for the 8 sets of differences,
# (one set for each difference in coefficients)
# compare means with a linear regression
lm.test <- lm(y ~ as.factor(group), data = y.group)
summary(lm.test)
# The p value for the F statistic tells you how
# compatible the data are with the null hypothesis that
# all the group means are equal to each other.
# Same F statistic and p value from
anova(lm.test)
One could argue that if all coefficients are not equal to each other then they can't all be equal to zero, but that isn't what we want here.
The basis of the smooth tests of fit given by summary() on an mgcv gam model is a joint test that all of the smooth's coefficients are equal to 0. This is akin to a likelihood-ratio-type test in which the model fit with and without the term is compared.
I would appreciate some ideas on how to do this for the difference between two smooths.
Now that I have got this far, I have had a rethink about your original suggestion of using the criterion >= 0 (for negative diffs). I reinterpreted this as meaning: for each simulated coefficient-difference distribution (in my case 8), count how often this occurs, and build a table in which each row (in my case, 8 rows) corresponds to one of these distributions, with two columns holding this count and (number of simulation draws minus count). Then run a chi-squared test on this table. When I did this, I got a very low p value when I believe I shouldn't have, as 0 was well within the smooth-difference CI across almost all levels of the exposure. Maybe I am still misunderstanding your suggestion.
Follow up thought Feb 24
In a follow up thought, we could create a variable that represents the interaction between the by factor and continuous variable
library(dplyr)
my.dat <- my.dat %>%
  mutate(interact.var = ifelse(factor.2levels == "yes", 1, 0) * cont.var)
Here I am assuming that factor.2levels has the levels ("no", "yes"), with "no" as the reference level. The ifelse() function creates a dummy variable, which is multiplied by the continuous variable to generate the interaction variable.
Then we place this interaction variable in the GAM and get the usual statistical test of fit, that is, a test that all of its coefficients are equal to 0.
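A minimal sketch of one reading of that idea, using the hypothetical variable names from above (outcome, cont.var, factor.2levels, interact.var); this is only illustrative, not a definitive specification:
library(mgcv)
m_int <- gam(outcome ~ factor.2levels + s(cont.var) + s(interact.var),
             data = my.dat)
summary(m_int)   # the approximate p value for s(interact.var) tests whether
                 # that smooth (the "difference" term) is zero everywhere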
@GavinSimpson actually posted a method for getting the difference between two smooths and assessing its statistical significance here in 2017. Thanks to Matteo Fasiolo for pointing me in that direction.
In that approach, the by variable is converted to an ordered categorical variable which causes mgcv::gam to produce difference smooths in comparison to the reference level. Statistical significance for the difference smooths is then tested in the usual way with the summary command for the gam model.
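A minimal sketch of the ordered-factor approach, using the hypothetical variable names from the question (outcome, dep_var, fact_var):
library(mgcv)
my.data$fact_ord <- as.ordered(my.data$fact_var)
m_ord <- gam(outcome ~ fact_ord +            # parametric term for the factor
               s(dep_var) +                  # smooth for the reference level
               s(dep_var, by = fact_ord),    # difference smooth(s) vs the reference
             data = my.data)
summary(m_ord)   # the p value(s) for the by-smooth(s) test whether each
                 # difference smooth is zero everywhere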
However, and correct me if I have misunderstood, the ordered-factor approach causes the smooth for the main effect to become the smooth for the reference level of the ordered factor.
The approach I suggested (see the main post under the heading "Follow up thought Feb 24"), where the interaction variable is created, gives an almost identical p value for the difference smooth but does not change the smooth for the main effect. It also does not change the intercept or the linear term for the by categorical variable, both of which changed with the ordered-factor approach.

meanBEINF vs predict(model, type = "response") in BEINF GAMLSS, and determining odds from a predictor variable coefficient

A variation of this question has been asked, but certain items remain unanswered -
I am modeling the proportion of mortality (Prop) using a single continuous predictor variable which is temperature (Temp). I have three questions.
1.) Should I be using meanBEINF for my model-predicted estimates of the response? If so, how would I extract the associated standard errors? One would think the specification I have currently would give the response estimates; however, running predict(beinf_mod, type = "response", what = "mu") yields the same results, which has me questioning this.
2.) If I exponentiate the predictor variable coefficient (contained within the mu parameter), does this give me the odds on (0,1)? nu and tau currently don't have predictor variable coefficients, so I'm not sure whether those need to be worked in to get odds over the whole domain [0,1].
3.) Is my interpretation of the odds correct in this scenario? I am familiar with regular beta regression and logistic models, but the uncertainties in question 2 make me wonder whether this is appropriate.
Thanks in advance for the help, and it is greatly appreciated.
# generate DB
DB <- data.frame(Prop = c(0.688888889, 0.519230769, 0.378294574, 0.253644315, 0.234200744, 0.156626506,
0.191011236, 0.0625, 0.064516129, 0, 0, 0),
Temp = c(62.90857143, 62.75428571, 60.05428571, 60.23428571, 59.64285714, 57.94571429,
57.71428571, 57.14857143, 54.39714286, 51.87714286, 50.38571429, 49.1))
# beta inflated model. I understand na.omit works on the data, and that na.exclude is not really useful.
# I removed the NA's for this reproduction of the problem
beinf_mod <- gamlss(Prop ~ cs(Temp),
family=BEINF,data=na.omit(DB),na.action=na.exclude)
# obtain predictions for the estimated/expected value of y
predict(beinf_mod, type="response", se.fit=TRUE)
# get the odds of the explanatory variable. exponentiation gets us 1.47,
# so a one unit increase in temperature results in a 47% increase in the odds of mortality within the domain (0,1)
exp(coef(beinf_mod)[2])
Allow me to answer my own questions:
1.) Yes, meanBEINF() provides the estimated response values, and bootstrapping would get you the standard errors.
2.) Yes. nu and tau are not worked in because odds can only be calculated for the mu parameter, which lives on (0,1): odds are undefined at the boundaries, since a probability of 1 gives division by zero and a probability of 0 gives odds of 0 against any other probability.
3.) Yes.
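A rough sketch of the bootstrap mentioned in 1.): resample rows, refit the model, predict the distribution parameters at the original Temp values, and take the spread of the implied mean response as the standard error. This assumes the model and data from the question, uses the BEINF mean E(Y) = p1 + (1 - p0 - p1) * mu (the quantity meanBEINF() computes), and is only illustrative; predictAll() with newdata can be fragile with some smoothers:
library(gamlss)
dat <- na.omit(DB)
set.seed(1)
B <- 200
boot_mat <- replicate(B, {
  idx   <- sample(nrow(dat), replace = TRUE)
  fit_b <- gamlss(Prop ~ cs(Temp), family = BEINF, data = dat[idx, ],
                  control = gamlss.control(trace = FALSE))
  p     <- predictAll(fit_b, newdata = dat, type = "response")
  p0    <- p$nu  / (1 + p$nu + p$tau)   # P(Y = 0)
  p1    <- p$tau / (1 + p$nu + p$tau)   # P(Y = 1)
  p1 + (1 - p0 - p1) * p$mu             # expected response at each original Temp
})
boot_se <- apply(boot_mat, 1, sd)       # bootstrap SE of the fitted response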

predict.coxph() and survC1::Est.Cval -- type for predict() output

Given a coxph() model, I want to use predict() to predict hazards and then use survC1::Est.Cval( . . . nofit=TRUE) to get a c-value for the model.
The Est.Cval() documentation is rather terse, but says that "nofit=TRUE: If TRUE, the 3rd column of mydata is used as the risk score directly in calculation of C."
Say, for simplicity, that I want to predict on the same data I built the model on. For
coxModel a Cox regression model from coxph();
time a vector of times (positive reals), the same times that coxModel was built on; and
event a 0/1 vector, the same length, of event/censor indicators, the same events that coxModel was built on --
does this indicate that I want
predictions <- predict(coxModel, type="risk")
dd <- cbind(time, event, predictions)
Est.Cval(mydata=dd, tau=tau, nofit=TRUE)
or should that first line be
predictions <- predict(coxModel, type="lp")
?
Thanks for any help,
The answer is that it doesn't matter.
Basically, the concordance value asks, for all comparable pairs of times (events and censored observations), how probable it is that the later time goes with the lower risk score (for a really good model, almost always). But since e^u is a monotonic function of real u, and the c-value only uses comparisons, it doesn't matter whether you provide the hazard ratio, exp(sum_i beta_i * x_i), or the linear predictor, sum_i beta_i * x_i.
Since #42 motivated me to come up with a minimal working example, we can test this. We'll compare the values that Est.Cval() provides using one input versus using the other; and we can compare both to the value we get from coxph().
(That last value won't match exactly, because Est.Cval() uses the method of Uno et al. 2011 (Uno, H., Cai, T., Pencina, M. J., D’Agostino, R. B. & Wei, L. J. On the C-statistics for evaluating overall adequacy of risk prediction procedures with censored survival data. Statist. Med. 30, 1105–1117 (2011), https://onlinelibrary.wiley.com/doi/full/10.1002/sim.4154) but it can serve as a sanity check, since the values should be close.)
The following is based on the example worked through in Survival Analysis with R, 2017-09-25, by Joseph Rickert, https://rviews.rstudio.com/2017/09/25/survival-analysis-with-r/.
library("survival")
library("survC1")
# Load dataset included with survival package
data("veteran")
# The variable `time` records survival time; `status` indicates whether the
# patient’s death was observed (status=1) or that survival time was censored
# (status = 0).
# The model they build in the example:
coxModel <- coxph(Surv(time, status) ~ trt + celltype + karno + diagtime +
age + prior, data=veteran)
# The results
summary(coxModel)
Note the c-score it gives us:
Concordance= 0.736 (se = 0.021 )
Now, we calculate the c-score given by Est.Cval() on the two types of values:
# The value from Est.Cval(), using a risk input
cvalByRisk <- Est.Cval(mydata=cbind(time=veteran$time, event=veteran$status,
predictions=predict(object=coxModel, newdata=veteran, type="risk")),
tau=2000, nofit=TRUE)
# The value from Est.Cval(), using a linear predictor input
cvalByLp <- Est.Cval(mydata=cbind(time=veteran$time, event=veteran$status,
predictions=predict(object=coxModel, newdata=veteran, type="lp")),
tau=2000, nofit=TRUE)
And we compare the results:
cvalByRisk$Dhat
[1] 0.7282348
cvalByLp$Dhat
[1] 0.7282348

Weighted censored regression in R?

I am very new to R (mostly program in SQL) but was faced with a problem that SQL couldn't help me with. I'll try to simplify the problem below.
Assume I have a dataset with 100 rows, where each row has a different weight associated with it. Out of those 100 rows, 5 have an X value that is top-coded at 1000. Also assume that X can be represented by the linear equation X ~ Y + Z + U + 0 (I want a positive value, so I don't want a Y-intercept).
Now, without taking the weights of each row of data into consideration, the formula I used in R was:
fit = censReg(X ~ Y + Z + U + 0, left = -Inf, right = 1000, data = dataset)
If I compute summary(fit), I get 0 left-censored values, 95 uncensored values, and 5 right-censored values, which is exactly what I want, except that the weights haven't been taken into account. I checked the reference manual for the censReg function, and it doesn't seem to accept a weights argument.
Is there something I'm missing about the censReg function, or is there another function that would be of better use to me? My end goal is to estimate X in the cases where it is censored (i.e. the 5 cases where it is 1000).
You should use Tobit regression for this situation; it is designed specifically for linear modelling of censored latent variables such as the one you describe.
The regression accounts for your weights and the censored observations, as can be seen in the derivation of the log-likelihood function for the Type I Tobit (upper- and lower-bounded).
Tobit regression is available in the VGAM package via the vglm() function with the tobit() family. An excellent example can be found here:
http://www.ats.ucla.edu/stat/r/dae/tobit.htm
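A minimal sketch of how that might look for the setup described above, assuming the data frame is called dataset and the per-row weights live in a hypothetical column w; right-censoring at the top-coded value of 1000 and no intercept, mirroring the censReg call:
library(VGAM)
fit_tobit <- vglm(X ~ Y + Z + U - 1,                           # "- 1" drops the intercept, like "+ 0"
                  family = tobit(Lower = -Inf, Upper = 1000),  # right-censored at 1000
                  weights = dataset$w,                         # per-row prior weights (assumed column)
                  data = dataset)
summary(fit_tobit)
head(fitted(fit_tobit))   # fitted values of the latent mean, including for the censored rows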
