I have two variables (a financial stress index "CISS" and output growth).
Using the tsDyn package in R, I first estimated the TVAR. paper is the time series consisting of CISS and the output growth.
tvarpaper = TVAR(paper, lag=2, nthresh=1, thDelay=2, thVar= paper[,1])
I want to calculate the impulse response functions. I tried https://github.com/MatthieuStigler/tsDyn_GIRF, but it does not plot exactly what I want: the IRFs for the low-stress and the high-stress regimes separately, each with its confidence bands.
I first thought of splitting up the sample and then calculating the IRF with the normal irf function. Below I tried this for the high-stress regime.
SplitUPCISS <- paper[paper[,1] > -42.9926,]
tsSplitUPCISS <- ts(SplitUPCISS)
library(vars)  # VAR() and irf() come from the vars package
growthUPCISS <- VAR(SplitUPCISS, p=2)
SplitUPCISSIRF <- irf(growthUPCISS, impulse="tsyCISS12", response="tslogygdp12")
However, I am not sure this is right, since the plotted response shows hardly any movement. Do I still need to estimate a VAR on the split sample at all, given that I already estimated the TVAR to find the threshold?
I am using the useful gratia package by Gavin Simpson to extract the difference in two smooths for two different levels of a factor variable. The smooths are generated by the wonderful mgcv package. For example
library(mgcv)
library(gratia)
m1 <- gam(outcome ~ s(dep_var, by = fact_var) + fact_var, data = my.data)
diff1 <- difference_smooths(m1, smooth = "s(dep_var)")
draw(diff1)
This gives me a graph of the difference between the two smooths for each level of the "by" variable in the gam() call. The graph has a shaded 95% credible interval (CI) for the difference.
Statistical significance, or areas of statistical significance at the 0.05 level, is assessed by whether, and where, the CI excludes the line y = 0, where the y axis represents the difference between the smooths.
Here is an example from Gavin's site where the "by" factor variable had 3 levels.
The differences are clearly statistically significant (at 0.05) over nearly all of the graphs.
Here is another example I have generated using a "by" variable with 2 levels.
The difference in my example is clearly not statistically significant anywhere.
In the mgcv package, an approximate p value is outputted for a smooth fit that tests the null hypothesis that the coefficients are all = 0, based on a chi square test.
My question is, can anyone suggest a way of calculating a p value that similarly assesses the difference between the two smooths instead of solely relying on graphical evidence?
The output from difference_smooths() is a data frame with differences between the smooth functions at 100 points in the range of the smoothed variable, the standard error for the difference and the upper and lower limits of the CI.
The release announcement for gratia 0.4 explains the difference_smooths() function, although gratia is now at version 0.6.
Thanks in advance for taking the time to consider this.
Don
One way of getting a p value for the interaction between the by factor variables is to manipulate the difference_smooths() function by activating the ci_level option. Default is 0.95. The ci_level can be manipulated to find a level where the y = 0 is no longer within the CI bands. If for example this occurred when ci_level = my_level, the p value for testing the hypothesis that the difference is zero everywhere would be 1 - my_level.
This is not totally satisfactory. For example, it takes some manual experimentation, and it may be difficult to discern accurately when zero drops out of the CI. A function could be written to search the data frame returned by difference_smooths() as ci_level is varied, but that is not totally satisfactory either, because detecting a CI that excludes zero depends on the 100 points chosen by difference_smooths() to assess the difference between the two curves. Then again, the standard errors are approximate for a GAM fitted with mgcv, so that shouldn't be too much of a problem.
Here is a graph where the zero first drops out of the CI.
Zero dropped out at ci_level = 0.88 and was still in the interval at ci_level = 0.89. So an approximate p value would be 1 - 0.88 = 0.12.
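The manual search over ci_level can be avoided if the band is the usual pointwise normal interval, diff ± z * se. Under that assumption, zero leaves the band first at the point with the largest |diff / se|, so the threshold level and the corresponding 1 - level follow directly from the difference and standard error columns. The sketch below is only an illustration: the helper name and the toy numbers are hypothetical, and the column names in the gratia output depend on the gratia version. Because the band is pointwise rather than simultaneous, the result is a rough guide, not a calibrated test of zero difference everywhere.

```r
# Hypothetical helper: the level at which 0 first leaves a pointwise
# band of the form diff +/- z * se, and the matching 1 - level.
zero_exit_level <- function(diff, se) {
  z_max <- max(abs(diff / se))      # most extreme standardised difference
  level <- 2 * pnorm(z_max) - 1     # band excludes 0 for any ci_level below this
  c(ci_level = level, p_approx = 1 - level)
}

# Toy values standing in for the difference and standard error columns
# of the data frame returned by difference_smooths() (assumed, not real output).
toy_diff <- c(0.10, 0.50, -0.20)
toy_se   <- c(0.20, 0.25,  0.20)

res <- zero_exit_level(toy_diff, toy_se)
print(res)
```

With real output, the two vectors would be taken from the difference and standard error columns of the returned data frame.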
Can anyone think of a better way?
Reply to Gavin Simpson's comments Feb 19
Thanks very much Gavin for taking the time to make your comments.
I am not sure if using the criterion, >= 0 (for negative diffs), is a good way to go. Because of the draws from the posterior, there are likely to be many diffs that meet this criterion. I interpret your criterion as: sample from the posterior distribution, count how many differences meet the criterion, and take that proportion as the p value. Correct me if I have misunderstood. Using this approach, I consistently got p values of around 0.45 - 0.5 for different gam models, even when it was clear the difference in the smooths should be statistically significant, at least at p = 0.05, because the confidence band around the smooth did not contain zero at a number of points.
Instead, I was thinking perhaps it would be better to compare the means of the posterior distribution of each of the diffs. For example
# get coefficients for the by smooths
coeff.level1 <- coef(gam.model1)[31:38]
coeff.level0 <- coef(gam.model1)[23:30]
# these indices are specific to my multi-variable gam.model1
# in my case 8 coefficients per smooth
# get posterior coefficients variances for the by smooths' coefficients
vp_level1 <- gam.model1$Vp[31:38, 31:38]
vp_level0 <- gam.model1$Vp[23:30, 23:30]
# run the simulation to get the distribution of each
# difference coefficient; note that summing the two Vp blocks
# ignores the posterior covariance between the two smooths
library(MASS)
no.draws <- 1000
sim <- mvrnorm(n = no.draws, mu = coeff.level1 - coeff.level0,
               Sigma = vp_level1 + vp_level0)
# sim is a no.draws X no. of coefficients (8 in my case) matrix
# put the results into a data.frame.
y.group <- data.frame(y = as.vector(sim),
                      group = rep(seq_len(ncol(sim)), each = nrow(sim)))
# y has the differences sampled from their posterior distributions.
# group is just a grouping name for the 8 sets of differences,
# (one set for each difference in coefficients)
# compare means with a linear regression
lm.test <- lm(y ~ as.factor(group), data = y.group)
summary(lm.test)
# The p value for the F statistic tells you how
# compatible the data are with the null hypothesis that
# all the group means are equal to each other.
# Same F statistic and p value from
anova(lm.test)
One could argue that if all coefficients are not equal to each other then they all can't be equal to zero but that isn't what we want here.
The tests for smooth terms given by summary(gam.model1) are joint tests of all of a smooth's coefficients == 0. In mgcv these are Wald-type tests computed from the fitted coefficients and their covariance matrix, not likelihood ratio tests comparing fits with and without the term.
I would appreciate some ideas how to do this with the difference between two smooths.
Now that I got this far, I had a rethink of your original suggestion of using the criterion, >= 0 (for negative diffs). I reinterpreted this as meaning: for each simulated coefficient difference distribution (in my case 8), count when this occurs and make a table where each row (in my case, 8) is for one of these distributions, with two columns holding this count and (number of simulation draws minus count). Then run a chi square test on this table. When I did this, I got a very low p value when I believe I shouldn't have, as 0 was well within the smooth difference CI across almost all the levels of the exposure. Maybe I am still misunderstanding your suggestion.
Follow up thought Feb 24
In a follow up thought, we could create a variable that represents the interaction between the by factor and continuous variable
library(dplyr)
my.dat <- my.dat %>%
  mutate(interact.var = ifelse(factor.2levels == "yes", 1, 0) * cont.var)
Here I am assuming that factor.2levels has the levels ("no", "yes"), and "no" is the reference level. The ifelse function creates a dummy variable which is multiplied by the continuous variable to generate the interactive variable.
Then we place this interactive variable in the GAM and get the usual statistical test for fit, that is, testing all the coefficients == 0.
@GavinSimpson actually posted a method for getting the difference between two smooths and assessing its statistical significance here in 2017. Thanks to Matteo Fasiolo for pointing me in that direction.
In that approach, the by variable is converted to an ordered categorical variable which causes mgcv::gam to produce difference smooths in comparison to the reference level. Statistical significance for the difference smooths is then tested in the usual way with the summary command for the gam model.
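A minimal sketch of the ordered-factor approach on simulated data. The variable names (x, g, og) and the simulated effect are assumptions for illustration only. With an ordered by factor, mgcv fits a reference smooth s(x) plus one difference smooth for each non-reference level, and summary() reports a test for each smooth.

```r
library(mgcv)

set.seed(1)
n <- 400
x <- runif(n)
g <- factor(sample(c("no", "yes"), n, replace = TRUE), levels = c("no", "yes"))
# the "yes" group differs from "no" by a shift plus a curved term
y <- sin(2 * pi * x) + ifelse(g == "yes", 0.5 + 2 * x^2, 0) + rnorm(n, sd = 0.3)
dat <- data.frame(y = y, x = x, g = g, og = as.ordered(g))

# s(x) is the smooth for the reference level "no";
# s(x, by = og) is the difference smooth for "yes" relative to "no"
m_diff <- gam(y ~ og + s(x) + s(x, by = og), data = dat, method = "REML")

s_tab <- summary(m_diff)$s.table
print(s_tab)
```

The second row of s_tab is the difference smooth, and its p value is the test that the difference between the two levels is flat.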
However, and correct me if I have misunderstood, the ordered factor approach causes the smooth for the main effect to now be the smooth for the reference level of the ordered factor.
The approach I suggested, see the main post under the heading, Follow up thought Feb 24, where the interaction variable is created, gives an almost identical result for the p value for the difference smooth but does not change the smooth for the main effect. It also does not change the intercept and the linear term for the by categorical variable which also both changed with the ordered variable approach.
I have a large set of high-frequency wind data. I use this data in a model to calculate gas exchange between atmosphere and water. I am using the average wind of a 10-day series of measurements to represent gas exchange at a given time. Since the wind is an average value from a 10-day series, I want to propagate the uncertainty in that average to the model output by adding error to the input:
#fictional time series, manually created by me.
wind <- c(0,0,0,0,0,4,3,2,4,3,2,0,0,1,0,0,0,0,1,1,4,5,4,3,2,1,0,0,0,0,0)
I then draw 100 values from a normal distribution with the mean and sd of the wind vector:
df <- as.data.frame(mapply(rnorm,mean=mean(wind),sd=sd(wind),n=100))
Because the standard deviation is large relative to the mean, rnorm() generates negative values. If these are run through the gas exchange model I get a disproportionately large error, simply because wind speed can't be negative and the model is not built to handle negative wind measurements. It has been suggested that I log-transform the raw data, run rnorm() on the logged values, and then transform back. But since there are several zeros in the data (0 = no wind) I can't simply log the values. Hence I use the log(x+c) method:
wind.log <- log(wind+1)
df.log <- as.data.frame(mapply(rnorm,
                               mean = mean(wind.log),
                               sd = sd(wind.log), n = 100))
However, I will need to convert values back to actual wind measurements before running them in the model.
This is where it gets problematic, since I will need to use exp(x)-c to convert values back and then I end up with negative values again.
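To make the problem concrete, here is a small sketch using the fictional wind vector from above (the seed and the object names draws.log and wind.back are only for illustration). Any draw that falls below 0 on the log(x+1) scale maps back to a negative wind speed, because exp(x) - 1 < 0 exactly when x < 0. expm1(x) is used here as a numerically stable way of writing exp(x) - 1.

```r
set.seed(42)
wind <- c(0,0,0,0,0,4,3,2,4,3,2,0,0,1,0,0,0,0,1,1,4,5,4,3,2,1,0,0,0,0,0)
wind.log <- log(wind + 1)

# draw on the log(x + 1) scale, then back-transform with exp(x) - 1
draws.log <- rnorm(100, mean = mean(wind.log), sd = sd(wind.log))
wind.back <- expm1(draws.log)

# every negative back-transformed value comes from a log-scale draw below 0
same_sign <- all((wind.back < 0) == (draws.log < 0))
share_negative <- mean(wind.back < 0)
print(c(same_sign = same_sign, share_negative = share_negative))
```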
Is there a way to work around this without truncating the 0's and screwing up the generated distribution around the mean?
Otherwise my only alternative is to calculate gas exchange directly at every time point and generate a distribution from that; those values would never be negative or zero and can hence be log-transformed.
Suggestion: use a zero-inflated/altered model, where you generate some proportion of zero values and draw the rest from a log-normal distribution (to make sure you don't get negative values):
wind <- c(0,0,0,0,0,4,3,2,4,3,2,0,0,1,0,0,0,0,1,1,4,5,4,3,2,1,0,0,0,0,0)
prop_nonzero <- mean(wind>0)
lmean <- mean(log(wind[wind>0]))
lsd <- sd(log(wind[wind>0]))
n <- 500
vals <- rbinom(n, size=1,prob=prop_nonzero)*rlnorm(n,meanlog=lmean,sdlog=lsd)
Alternatively you could use a Tweedie distribution (as suggested by @aosmith), or fit a censored model to estimate the distribution of wind values that get measured as zero (assuming that the wind speed is never exactly zero, just too small to measure).
Good afternoon,
I have a series of annual maxima data (say "AMdata") I'd like to model through a non-stationary GEV distribution. In particular, I want the location to vary linearly in time, i.e.:
mu = mu0 + mu1*t.
To this end, I am using the ismev package in R, computing the parameters as follows:
require(ismev)
ydat = cbind(1:length(AMdata)) ### Co-variates - years from 1 to number of annual maxima in the data
GEV_fit_1_loc = gev.fit(xdat=AMdata,ydat=ydat,mul=1)
This gives me four parameters: mu0, mu1, scale and shape.
My question is: can I apply the gev.fit function with the value of mu1 fixed? Not as a starting value for the successive iterations, but as a given parameter (thus estimating the three remaining parameters)?
Any tip would be really appreciated!
Francesco
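One possible workaround, sketched under assumptions: if mu1 is treated as known, then AMdata - mu1*t follows a stationary GEV with location mu0, so the detrended series can be fitted with only three free parameters. In ismev that would presumably be a call to gev.fit on the detrended series with no covariates. The sketch below instead uses a hand-written likelihood and optim() on simulated data so that it runs without extra packages. The simulated series, the value of mu1_fixed and the helper gev_nll are all hypothetical.

```r
# Negative log-likelihood of a stationary GEV; par = (mu0, log(scale), shape)
gev_nll <- function(par, x) {
  mu <- par[1]; sigma <- exp(par[2]); xi <- par[3]
  z <- 1 + xi * (x - mu) / sigma
  if (any(z <= 0)) return(1e10)            # outside the support
  if (abs(xi) < 1e-8) {                    # Gumbel limit as shape -> 0
    y <- (x - mu) / sigma
    return(length(x) * log(sigma) + sum(y) + sum(exp(-y)))
  }
  length(x) * log(sigma) + (1 + 1 / xi) * sum(log(z)) + sum(z^(-1 / xi))
}

# Simulated annual maxima with a known linear trend in the location
set.seed(123)
n <- 100
tt <- 1:n
mu1_fixed <- 0.05                          # the slope that is held fixed
u <- runif(n)
AMsim <- 10 + mu1_fixed * tt + 2 * ((-log(u))^(-0.1) - 1) / 0.1

# Remove the fixed trend, then estimate mu0, scale and shape only
x_adj <- AMsim - mu1_fixed * tt
start <- c(mean(x_adj), log(sd(x_adj)), 0.1)
fit <- optim(start, gev_nll, x = x_adj, control = list(maxit = 2000))
est <- c(mu0 = fit$par[1], scale = exp(fit$par[2]), shape = fit$par[3])
print(est)
```

The scale is optimised on the log scale so that it stays positive without needing box constraints.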
I'm using the 'spatstat' package in R and obtained a set of Ripley's K functions (or L functions). I want to find a good way to average out this set of graphs on a single average line, as well as graphing out the standard deviation or confidence interval around this average line.
So far I've tried:
env.A <- envelope(A, fun=Lest, correction=c("Ripley"), nsim=99, nrank=1, global=TRUE)
Aa <- env.A
avg <- eval.fv((Aa+Bb+Cc+Dd+Ee+Ff+Gg+Hh+Ii+Jj+Kk+Ll+Mm+Nn+Oo+Pp+Qq+Rr+Ss+Tt+Uu+Vv+Ww+Xx)/24)
plot(avg, xlim=c(0,200), . - r ~ r, ylab='', legend='')
With this, I got the average line from the data set.
However, I'm now stuck on finding the confidence interval around this average line.
Does anyone know a good way to do this?
The help file for envelope explains how to do this.
E <- envelope(A, Lest, correction="Ripley", nsim=100, VARIANCE=TRUE)
plot(E, . - r ~ r)
See help(envelope) for more explanation.
In this example, the average or middle curve is computed using a theoretical formula, because the simulations are generated from Complete Spatial Randomness, and the theoretical value of the L function is known. If you want the middle curve to be determined by the sample averages instead, set use.theory = FALSE in the call to envelope.
Can I also point out that the bands you get from envelope are not confidence intervals. A confidence interval would be centred around the estimated L function for the data point pattern A. The bands you get from the envelope command are centred around the mean value of the simulated curves. They are significance bands and their interpretation is related to a statistical significance test. This is also explained in the help file.
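If what is wanted is the spread of the 24 estimated curves around their own average, and the curves are evaluated on a common grid of r values, a pointwise mean and standard error can be computed directly. The sketch below uses simulated curves in place of real L-function estimates: the matrix L_mat and the r grid are hypothetical stand-ins for values extracted from the fitted objects. The band is the pointwise mean ± 2 standard errors across patterns, which treats the patterns as independent replicates. It is a different object from the simulation envelopes described above.

```r
set.seed(1)
r <- seq(0, 200, length.out = 51)
n_pat <- 24

# one column per pattern: stand-in values for L(r) - r (simulated noise here)
L_mat <- sapply(seq_len(n_pat),
                function(i) cumsum(rnorm(length(r), sd = 0.5)))

avg <- rowMeans(L_mat)                     # pointwise average curve
se  <- apply(L_mat, 1, sd) / sqrt(n_pat)   # pointwise standard error of the mean

band <- data.frame(r = r, mean = avg,
                   lower = avg - 2 * se, upper = avg + 2 * se)
print(head(band))
```

The result can be drawn with plot(band$r, band$mean, type = "l") followed by lines() for the lower and upper columns.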
I'm new to R. I have a set of samples and I want to fit a function to them so that I can evaluate it for new values. My samples are times in seconds, each giving the duration of a user's stay at a place:
>b <- c(101,25711,13451,19442,26,3083,133,184,4403,9713,6918,10056,12201,10624,14984,5241,
+21619,44285,3262,2115,1822,11291,3243,12989,3607,12882,4462,11553,7596,2926,12955,
+1832,3539,6897,13571,16668,813,1824,10304,2508,1493,4407,7820,507,15866,7442,7738,
+5705,2869,10137,11276,12884,11298,...)
Firstly, I convert them to hours dividing by 3600, and I want to fit a function as pdf of the duration:
> b <- b/3600
> hist(b, xlim=c(0,13), prob=TRUE, breaks=seq(0,24,by=0.5))
> lines(density(b), col="red")
I want to fit the red line on the figure, and interpolate new values to find the probability of the specific duration on this place say p(duration = 1.5hours).
Thanks for your attention!
As suggested above, you can fit a distribution with fitdistr in MASS package.
If you use a continuous distribution you will have the probability that the time is within an interval. If you use a discrete distribution, you may compute the probability of a certain time (in hours).
For the continuous case, you can use a Gamma distribution: fitdistr(b, "Gamma") will give you the parameter estimates, and then you can use pgamma with those estimates and an interval.
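For the discrete case, you can use a Poisson distribution: fitdistr(b, "Poisson") and then the dpois function with the estimate and the value you want. Note that the Poisson is a distribution on non-negative integers, so the durations would first need rounding to whole hours.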
To decide which one to use, I'd just plot the pdf with the histogram and take a look.
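A minimal sketch of the continuous case. Because the full data vector is not shown in the question, the durations here are simulated from a Gamma distribution (an assumption made purely so the example runs), and the object names are hypothetical. For a continuous distribution, P(duration = 1.5 hours) is zero, so the sketch reports the probability of an interval and, separately, the fitted density at 1.5.

```r
library(MASS)

set.seed(2)
# stand-in for b/3600: simulated durations in hours (not the real data)
dur_hours <- rgamma(500, shape = 2, rate = 0.5)

fit <- fitdistr(dur_hours, "gamma")
shape_hat <- fit$estimate[["shape"]]
rate_hat  <- fit$estimate[["rate"]]

# probability that a stay lasts between 1 and 2 hours
p_interval <- pgamma(2, shape = shape_hat, rate = rate_hat) -
  pgamma(1, shape = shape_hat, rate = rate_hat)

# height of the fitted density at 1.5 hours (a density, not a probability)
dens_1.5 <- dgamma(1.5, shape = shape_hat, rate = rate_hat)
print(c(p_interval = p_interval, dens_1.5 = dens_1.5))
```

To compare the fit with the histogram, curve(dgamma(x, shape_hat, rate_hat), add = TRUE) can be drawn on top of hist(dur_hours, prob = TRUE).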