Stationarity Tests in R, checking mean, variance and covariance

Preamble
I am not sure whether this question belongs on Stack Overflow or Stack Exchange; I posted it on Stack Overflow as it contains R code. However, feel free to move it if you think it belongs on Stack Exchange.
Question
I was modelling some time-series and playing around with different stationarity tests in R when I noticed something strange. Do we have stationarity tests that check all three components of a stationary time-series (according to Wikipedia: constant mean, constant variance, and an autocovariance that depends only on the lag, not on time)?
Simulating some time-series, I found that the usual tests are not capable of finding some non-stationary time-series. Now the question is, am I missing something, or did I get something wrong, or are there better options than the tests mentioned?
Simulating some data
The first time-series should be stationary, whereas the other five are simulated to be non-stationary, with either a changing mean or a changing variance.
set.seed(123)
dat <- data.frame(x = 1:1000,
                  # stationary
                  ts1 = rnorm(1000),
                  # random walk
                  ts2 = cumsum(rnorm(1000)),
                  # jumps in mean
                  ts3 = c(rnorm(200), rnorm(400, mean = 5), rnorm(400)),
                  # jumps in variance
                  ts4 = c(rnorm(100), rnorm(300, sd = 5), rnorm(500), rnorm(100)),
                  # increasing variance
                  ts5 = sapply(1:1000, function(x) rnorm(1, sd = x/1000)),
                  # variance as a squared function
                  ts6 = sapply(1:1000, function(x) rnorm(1, sd = (x - 500)^2))/1000000)
Plotted, the series look like this (a sketch that reproduces the plots is given below). I would say it is clear that only the first series has a constant mean and variance, and that the other series are non-stationary.
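For reference, a quick base-graphics sketch to reproduce the plots referred to here (nothing beyond the simulated dat from above is assumed):

op <- par(mfrow = c(3, 2), mar = c(2, 2, 2, 1))
for (s in paste0("ts", 1:6)) plot(dat$x, dat[[s]], type = "l", main = s)
par(op)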
The issues (?) with tests
So far, I have looked at four different tests: adf.test, Box.test, kpss.test, and PP.test (adf.test and kpss.test are from the tseries package; Box.test and PP.test are from base R's stats package).
While a significant p-value indicates a stationary series for adf.test and PP.test, a significant p-value indicates non-stationarity for the other two tests.
library(tseries)
tests <- c("adf.test", "Box.test", "kpss.test", "PP.test")
sapply(tests, function(test) {
  apply(dat[, c("ts1", "ts2", "ts3", "ts4", "ts5", "ts6")], 2, function(x) {
    get(test)(x)$p.value
  })
})
# resulting in
# adf.test Box.test kpss.test PP.test # stat:non_stat
# ts1 0.0100000 0.386053779 0.10 0.0100000 # 4:0 clearly stat
# ts2 0.4195604 0.000000000 0.01 0.3260713 # 0:4 clearly non-stat
# ts3 0.5467517 0.000000000 0.01 0.0100000 # 1:3 most-likely non-stat
# ts4 0.0100000 0.004360365 0.10 0.0100000 # 2:2 ?!
# ts5 0.0100000 0.033007310 0.10 0.0100000 # 2:2 ?!
# ts6 0.0100000 0.307453035 0.10 0.0100000 # 4:0 stationary ?!
While all four tests find the first series to be stationary, they have trouble with the other series; in particular, series with changing variance are misclassified in many cases.
Do you know if there are better tests that account for changes in variance and/or covariance?
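As a rough illustration of what such a check could look at (an informal diagnostic rather than a formal test; this sketch assumes the zoo package), rolling standard deviations make the changing variance in ts4-ts6 visible:

library(zoo)
# rolling standard deviation over a window of 100 observations, per series
roll_sd <- sapply(dat[, paste0("ts", 1:6)],
                  function(x) rollapply(x, width = 100, FUN = sd))
matplot(roll_sd, type = "l", lty = 1,
        xlab = "window start", ylab = "rolling SD (window = 100)")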
Thank you very much for ideas, thoughts, and solutions.

Related

Calculating 95% confidence intervals for a weighted median over grouped data in dplyr

I have a dataset with several groups, where I want to calculate a median value for each group using dplyr. The data are weighted, and the weights need to be taken into account in calculating the median. I found the weighted.median function from spatstat which seems to work fine. Consider the following simplified example:
library(spatstat)
library(dplyr)
tst <- data.frame(group = rep(c(1:5), each = 100))
tst$val = runif(500) * tst$group
tst$wt = runif(500) * tst$val
tst %>%
  group_by(group) %>%
  summarise(weighted.median(val, wt))
# A tibble: 5 × 2
#   group `weighted.median(val, wt)`
#   <int>                      <dbl>
# 1     1                      0.752
# 2     2                      1.36
# 3     3                      1.99
# 4     4                      2.86
# 5     5                      3.45
However, I would also like to add 95% confidence intervals to these values, and this has me stumped. Things I've considered:
Spatstat also has a weighted.var function but there's no documentation, and it's not even clear to me whether this is variance around the median or mean.
This rcompanion post suggests various methods for calculating CIs around medians, but as far as I can tell none of them handle weights.
This blog post suggests a function for calculating CIs and a median for weighted data, and is the closest I can find to what I need. However, it doesn't work with my dplyr groupings. I suppose I could write a loop to do this one group at a time and build the output data frame, but that seems cumbersome. I'm also not totally sure I understand the function in the post, and I'm slightly suspicious of its results: for instance, testing it out I get wider estimates for alpha=0.1 than for alpha=0.05, which seems backwards to me. Edit to add: upon further investigation, I think this function works as intended if I use alpha=0.95 for 95% CIs, rather than alpha=0.05 (at least, this returns values that feel intuitively about right). I can also make it work with dplyr by editing it to return just a single moe value rather than a pair of high/low estimates. So this may be a good option, but I'm also considering others.
Is there an existing function in some library somewhere that can do what I want, or an otherwise straightforward way to implement this?
There are several approaches.
You could use the asymptotic formula for the standard error of the sample median. The sample median is asymptotically normal with standard error 1/sqrt(4 * n * f(m)^2), where n is the number of observations, m is the true median, and f is the probability density of the (weighted) random variable. You could estimate the probability density using the base R function density.default, which accepts a weights argument (the weights should sum to 1). If x is the vector of observed values and w the corresponding vector of weights, then
med <- weighted.median(x, w)
f <- density(x, weights = w/sum(w))     # weights normalised to sum to 1
fmed <- approx(f$x, f$y, xout = med)$y  # estimated density at the median
samplesize <- length(x)
se <- 1/sqrt(4 * samplesize * fmed^2)   # i.e. 1/(2 * sqrt(n) * fmed)
ci <- med + c(-1, 1) * 1.96 * se
This relies on several asymptotic approximations so it may be inaccurate. Also the sample size depends on the interpretation of the weights. In some cases the sample size could be equal to sum(w).
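If you want this inside a dplyr pipeline, you can wrap the calculation above in a small helper and apply it per group. This is only a sketch: weighted_median_ci is a name introduced here, weighted.median comes from spatstat as in the question, and the standard-error formula is the one above.

weighted_median_ci <- function(x, w, conf = 0.95) {
  med  <- weighted.median(x, w)
  f    <- density(x, weights = w/sum(w))
  fmed <- approx(f$x, f$y, xout = med)$y
  se   <- 1/(2 * sqrt(length(x)) * fmed)
  zq   <- qnorm(1 - (1 - conf)/2)
  c(median = med, lo = med - zq * se, hi = med + zq * se)
}

tst %>%
  group_by(group) %>%
  group_modify(~ as.data.frame(as.list(weighted_median_ci(.x$val, .x$wt))))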
If there is very little data in each group, you could use the even simpler normal reference approximation,
med <- weighted.median(x, w)
v <- weighted.var(x, w)
sdm <- sqrt(pi/2) * sqrt(v)
samplesize <- length(x)
se <- sdm/sqrt(samplesize)
ci <- med + c(-1,1) * 1.96 * se
Alternatively you could use bootstrapping - generate random resamples of the input data (by choosing random resamples of the indices 1, 2, ..., n), extract the corresponding weighted observations (x_i, w_i), compute the weighted median of each resampled dataset, and construct the 95% confidence interval.
(This approach implicitly assumes the sample size is equal to n)
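A minimal sketch of that bootstrap (assuming x, w and spatstat's weighted.median as above; the 2000 resamples and the percentile interval are arbitrary choices):

set.seed(1)
boot_meds <- replicate(2000, {
  i <- sample(seq_along(x), replace = TRUE)  # resample (x_i, w_i) pairs
  weighted.median(x[i], w[i])
})
ci_boot <- quantile(boot_meds, c(0.025, 0.975))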

R function to find difference in mean greater than or equal to a specific number

I have just started my basic statistics course using R and we're studying how to use R for paired t-tests. I have come across questions where we're given two sets of data and asked to test whether the difference in means is equal to 0, greater than 0, and so forth. The function we use for two samples x and y with unknown variance is similar to the one below:
t.test(x, y, var.equal=TRUE, alternative="greater")
My question is, how would we do this if we wanted to test that the difference in means is greater than or equal to a specified number, against the alternative that it is less than that number, rather than 0?
For example, say we're given two datasets of before and after weights for 10 people. How do we test that the mean difference in weight is greater than or equal to, say, 3 kg against the alternative that the mean difference in weight is less than 3 kg? Is there a way to do this? I would really appreciate any guidance on this matter.
It might be worthwhile posting on https://stats.stackexchange.com/ as well if you need more theoretical background. Is it OK to add/subtract the 3 kg from either x or y and then use the t-test to check for similarity? I think this would at least tell you which outcome is more likely, if that's the end goal. It would be good to get feedback on this.
# number of obs, and rnorm dist for simulating
N <- 10
mu <- 70
sd <- 10
set.seed(1)
x <- round(rnorm(N, mu, sd), 1)
# three outcomes
# (1) no change
y_same <- x + round(rnorm(N, 0, 5), 1)
# (2) average increase of 3
y_imp <- x + rnorm(N, 3, 5)
# (3) average decrease of 3
y_dec <- x + rnorm(N, -3, 5)
# say y_imp is true
y_act <- y_imp
# can we test whether we're closer to the output by altering
# the original data? or conversely, altering y_imp
t_inc <- t.test(x+3, y_act, var.equal=TRUE, alternative="two.sided")
t_dec <- t.test(x-3, y_act, var.equal=TRUE, alternative="two.sided")
t_inc$p.value
[1] 0.8279801
t_dec$p.value
[1] 0.0956033
# one with the highest p.value has the closest distribution, so
# +3 kg more likely than -3kg
You can set mu = 3 to change the hypothesised difference from 0 to 3, assuming your variables are in the units described above. For the stated alternative (mean difference less than 3 kg) use alternative = "less"; note that the paired test works on mean(x - y), so order the arguments so that x - y is the difference you care about.
t.test(x, y, mu = 3, alternative = "less", paired = TRUE)
More (general) information on Stack Exchange: https://stats.stackexchange.com/questions/206316/can-a-paired-or-two-group-t-test-test-if-the-difference-between-two-means-is-l/206317#206317
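As a quick illustration with the simulated data from the previous answer (y_act was generated with a true average increase of about 3 kg, and the difference tested here is mean(y_act - x)):

# H0: mean(y_act - x) >= 3  vs.  H1: mean(y_act - x) < 3
t.test(y_act, x, mu = 3, alternative = "less", paired = TRUE)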

Calculating standard error after a log-transform

Consider a random set of numbers that are normally distributed:
x <- rnorm(n=1000, mean=10)
We'd like to know the mean and the standard error on the mean so we do the following:
se <- function(x) { sd(x)/length(x) }
mean(x) # something near 10.0 units
se(x) # something near 0.001 units
Great!
However, let's assume we don't necessarily know that our original distribution follows a normal distribution. We log-transform the data and perform the same standard error calculation.
z <- log(x, base=10)
mean(z) # something near 1 log units
se(z) # something near 0.000043 log units
Cool, but now we need to back-transform to get our answer in units NOT log units.
10^mean(z) # something near 10.0 units
10^se(z) # something near 1.00 units
My question: Why, for a normal distribution, does the standard error differ depending on whether it was calculated from the distribution itself or if it was transformed, calculated, and back-transformed? In this example, it is interesting that the difference is almost exactly 3 orders of magnitude. Note: the means came out the same regardless of the transformation.
EDIT #1: Ultimately, I am interested in calculating a mean and confidence intervals for non-normally distributed data, so if you can give some guidance on how to calculate 95% CI's on transformed data including how to back-transform to their native units, I would appreciate it!
END EDIT #1
EDIT #2: I tried using the quantile function to get the 95% confidence intervals:
quantile(x, probs = c(0.05, 0.95)) # around [8.3, 11.6]
10^quantile(z, probs = c(0.05, 0.95)) # around [8.3, 11.6]
So, that converged on the same answer, which is good. However, using this method doesn't provide the exact same interval using non-normal data with "small" sample sizes:
t <- rlnorm(10)
mean(t) # around 1.46 units
10^mean(log(t, base=10)) # around 0.92 units
quantile(t, probs = c(0.05, 0.95)) # around [0.211, 4.79]
10^(quantile(log(t, base=10), probs = c(0.05, 0.95))) # around [0.209, 4.28]
Which method would be considered "more correct"? I assume one would pick the more conservative estimate?
As an example, would you report this result for the non-normal data (t) as having a mean of 0.92 units with a 95% confidence interval of [0.211, 4.79]?
END EDIT #2
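For reference, a sketch (not part of the original question) of one common approach: compute a t-based interval on the log scale and back-transform it. Note that this gives a confidence interval for the geometric mean rather than the arithmetic mean; it uses the lognormal sample t from EDIT #2.

z  <- log(t, base = 10)
n  <- length(t)
ci <- mean(z) + c(-1, 1) * qt(0.975, df = n - 1) * sd(z)/sqrt(n)
10^mean(z)  # point estimate (geometric mean)
10^ci       # back-transformed 95% interval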
Thanks for your time!

Errors running Maximum Likelihood Estimation on a three parameter Weibull cdf

I am working with the cumulative emergence of flies over time (taken at irregular intervals) over many summers (though first I am just trying to make one year work). The cumulative emergence follows a sigmoid pattern and I want to create a maximum likelihood estimation of a 3-parameter Weibull cumulative distribution function. The three-parameter models I've been trying to use in the fitdistrplus package keep giving me an error. I think this must have something to do with how my data is structured, but I cannot figure it out. Obviously I want it to read each point as an x (degree days) and a y (emergence) value, but it seems to be unable to read two columns. The main error I'm getting says "Non-numeric argument to mathematical function" or (with slightly different code) "data must be a numeric vector of length greater than 1". Below is my code including added columns in the df_dd_em dataframe for cumulative emergence and percent emergence in case that is useful.
degree_days <- c(998.08,1039.66,1111.29,1165.89,1236.53,1293.71,
1347.66,1387.76,1445.47,1493.44,1553.23,1601.97,
1670.28,1737.29,1791.94,1849.20,1920.91,1967.25,
2036.64,2091.85,2152.89,2199.13,2199.13,2263.09,
2297.94,2352.39,2384.03,2442.44,2541.28,2663.90,
2707.36,2773.82,2816.39,2863.94)
emergence <- c(0,0,0,1,1,0,2,3,17,10,0,0,0,2,0,3,0,0,1,5,0,0,0,0,
0,0,0,0,1,0,0,0,0,0)
cum_em <- cumsum(emergence)
df_dd_em <- data.frame (degree_days, emergence, cum_em)
df_dd_em$percent <- ave(df_dd_em$emergence, FUN = function(df_dd_em) 100*(df_dd_em)/46)
df_dd_em$cum_per <- ave(df_dd_em$cum_em, FUN = function(df_dd_em) 100*(df_dd_em)/46)
x <- pweibull(df_dd_em[c(1,3)],shape=5)
dframe2.mle <- fitdist(x, "weibull",method='mle')
Here's my best guess at what you're after:
Set up data:
dd <- data.frame(degree_days=c(998.08,1039.66,1111.29,1165.89,1236.53,1293.71,
1347.66,1387.76,1445.47,1493.44,1553.23,1601.97,
1670.28,1737.29,1791.94,1849.20,1920.91,1967.25,
2036.64,2091.85,2152.89,2199.13,2199.13,2263.09,
2297.94,2352.39,2384.03,2442.44,2541.28,2663.90,
2707.36,2773.82,2816.39,2863.94),
emergence=c(0,0,0,1,1,0,2,3,17,10,0,0,0,2,0,3,0,0,1,5,0,0,0,0,
0,0,0,0,1,0,0,0,0,0))
dd <- transform(dd,cum_em=cumsum(emergence))
We're actually going to fit an "interval-censored" model, i.e. model the probability of emergence between successive degree-day observations. This version assumes that the first interval covers everything before the first degree-day observation; you could change it to refer to observations after the last observation instead.
library(bbmle)
## y*log(p) allowing for 0/0 occurrences:
y_log_p <- function(y,p) ifelse(y==0 & p==0,0,y*log(p))
NLLfun <- function(scale, shape, x = dd$degree_days, y = dd$emergence) {
  prob <- pmax(diff(pweibull(c(-Inf, x),  ## or c(x, Inf)
                             shape = shape, scale = scale)), 1e-6)
  ## multinomial probability
  -sum(y_log_p(y, prob))
}
I should probably have used something more systematic like the method of moments (i.e. matching the mean and variance of a Weibull distribution with the mean and variance of the data), but I just hacked around a bit to find plausible starting values:
## preliminary look (method of moments would be better)
scvec <- 10^(seq(0,4,length=101))
plot(scvec,sapply(scvec,NLLfun,shape=1))
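For reference, here is what a method-of-moments starting guess might look like (a sketch, not from the original answer: it reconstructs the implied sample of degree-day values from the emergence counts and matches the two-parameter Weibull mean and variance):

dd_sample <- rep(dd$degree_days, dd$emergence)  # one degree-day value per emerged fly
m   <- mean(dd_sample)
v   <- var(dd_sample)
cv2 <- v/m^2
## for a Weibull, CV^2 = gamma(1 + 2/k)/gamma(1 + 1/k)^2 - 1; solve for the shape k
shape0 <- uniroot(function(k) gamma(1 + 2/k)/gamma(1 + 1/k)^2 - 1 - cv2,
                  interval = c(0.1, 50))$root
scale0 <- m/gamma(1 + 1/shape0)
startvals_mm <- list(scale = scale0, shape = shape0)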
It's important to use parscale to let R know that the parameters are on very different scales:
startvals <- list(scale=1000,shape=1)
m1 <- mle2(NLLfun,start=startvals,
control=list(parscale=unlist(startvals)))
Now try with a three-parameter Weibull (as originally requested) -- requires only a slight modification of what we already have:
library(FAdist)
NLLfun2 <- function(scale, shape, thres,
                    x = dd$degree_days, y = dd$emergence) {
  prob <- pmax(diff(pweibull3(c(-Inf, x), shape = shape, scale = scale,
                              thres = thres)), 1e-6)
  ## multinomial probability
  -sum(y_log_p(y, prob))
}
startvals2 <- list(scale=1000,shape=1,thres=100)
m2 <- mle2(NLLfun2,start=startvals2,
control=list(parscale=unlist(startvals2)))
Looks like the three-parameter fit is much better:
library(emdbook)
AICtab(m1,m2)
## dAIC df
## m2 0.0 3
## m1 21.7 2
And here's the graphical summary:
with(dd,plot(cum_em~degree_days,cex=3))
with(as.list(coef(m1)),curve(sum(dd$emergence)*
pweibull(x,shape=shape,scale=scale),col=2,
add=TRUE))
with(as.list(coef(m2)),curve(sum(dd$emergence)*
pweibull3(x,shape=shape,
scale=scale,thres=thres),col=4,
add=TRUE))
(could also do this more elegantly with ggplot2 ...)
These don't seem like spectacularly good fits, but they're sane. (You could in principle do a chi-squared goodness-of-fit test based on the expected number of emergences per interval, and accounting for the fact that you've fitted a three-parameter model, although the values might be a bit low ...)
Confidence intervals on the fit are a bit of a nuisance; your choices are (1) bootstrapping; (2) parametric bootstrapping (resample parameters assuming a multivariate normal distribution of the parameter estimates); (3) the delta method.
Using bbmle::mle2 makes it easy to do things like get profile confidence intervals:
confint(m1)
## 2.5 % 97.5 %
## scale 1576.685652 1777.437283
## shape 4.223867 6.318481
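To illustrate option (2) above, a sketch (not from the original answer) that resamples parameter vectors from a multivariate normal with the fitted mean vector and covariance matrix and turns the spread of the implied curves into a rough band; it assumes MASS and the m1 fit and cum_em plot from above:

library(MASS)
pars   <- mvrnorm(1000, mu = coef(m1), Sigma = vcov(m1))
xgrid  <- seq(min(dd$degree_days), max(dd$degree_days), length.out = 200)
curves <- apply(pars, 1, function(p)
  sum(dd$emergence) * pweibull(xgrid, shape = p["shape"], scale = p["scale"]))
band <- apply(curves, 1, quantile, c(0.025, 0.975))
## add the rough 95% band to the existing cum_em plot
matlines(xgrid, t(band), lty = 2, col = 2)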
dd <- data.frame(degree_days=c(998.08,1039.66,1111.29,1165.89,1236.53,1293.71,
1347.66,1387.76,1445.47,1493.44,1553.23,1601.97,
1670.28,1737.29,1791.94,1849.20,1920.91,1967.25,
2036.64,2091.85,2152.89,2199.13,2199.13,2263.09,
2297.94,2352.39,2384.03,2442.44,2541.28,2663.90,
2707.36,2773.82,2816.39,2863.94),
emergence=c(0,0,0,1,1,0,2,3,17,10,0,0,0,2,0,3,0,0,1,5,0,0,0,0,
0,0,0,0,1,0,0,0,0,0))
dd$cum_em <- cumsum(dd$emergence)
dd$percent <- ave(dd$emergence, FUN = function(z) 100*z/46)
dd$cum_per <- ave(dd$cum_em, FUN = function(z) 100*z/46)
#start 3 parameter model
library(FAdist)
## y*log(p) allowing for 0/0 occurrences:
y_log_p <- function(y,p) ifelse(y==0 & p==0,0,y*log(p))
NLLfun2 <- function(scale, shape, thres,
                    x = dd$degree_days, y = dd$percent) {
  prob <- pmax(diff(pweibull3(c(-Inf, x), shape = shape, scale = scale,
                              thres = thres)), 1e-6)
  ## multinomial probability
  -sum(y_log_p(y, prob))
}
startvals2 <- list(scale=1000,shape=1,thres=100)
m2 <- mle2(NLLfun2,start=startvals2,
control=list(parscale=unlist(startvals2)))
summary(m2)
#graphical summary
windows(5,5)
with(dd,plot(cum_per~degree_days,cex=3))
with(as.list(coef(m2)),curve(sum(dd$percent)*
pweibull3(x,shape=shape,
scale=scale,thres=thres),col=4,
add=TRUE))

Simulating a t-test in R

I am looking for a way to simulate the power of a simple t-test in different sample sizes. My idea is to generate 400 random normal distribution samples, each with mean 5 and variance 1, and perform a t-test on each one concerning the hypothesis that the true mean is 4, i.e. the t-test would take the form:
t=(mean(x)-4)*sqrt(n)/sd(x) # for each sample x which consists of n observations.
For comparison, I would like the first 100 samples to consist of 10 observations, the next 100 of 100 observations, the next 100 of 1000, and the last 100 of 5000, which I think is the upper limit. A t-test will have to be performed on each and every sample.
Lastly, I would like to see for what percentage of each sample group (call them n10, n100, n1000, n5000, according to how many observations they comprise) my (false) hypothesis is rejected.
Could you please help me write the corresponding R-code? I know the small commands but have trouble putting it all together. This is a nice exercise and hopefully I shall then be able to modify it a bit and use it for different purposes as well.
Thank you in advance.
Here's a one-liner for 400 t-tests with n = 10:
R> simulations <- replicate(400, t.test(rnorm(10, mean=5, sd=1), mu=4),
                            simplify=FALSE)
Then you can analyze it:
R>table(sapply(simulations, "[[", "p.value") < .05)
FALSE TRUE
75 325
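To get the exact grouping asked for in the question (100 replicates at each of n = 10, 100, 1000, 5000), the same idea extends to a short sapply; a sketch:

sizes <- c(10, 100, 1000, 5000)
power <- sapply(sizes, function(n) {
  p <- replicate(100, t.test(rnorm(n, mean = 5, sd = 1), mu = 4)$p.value)
  mean(p < 0.05)   # proportion of replicates rejecting H0: mu = 4
})
names(power) <- paste0("n", sizes)
power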
I'm still learning R, too, so handle with care:
n <- 5                               # number of sample-size groups
N <- 100                             # replicates per group
samplesizes <- as.integer(10^(1:n))  # 10, 100, 1000, 10000, 100000
set.seed(1)
# generate samples (note: here the true mean is 4 and mu = 5 is tested,
# i.e. the roles of 4 and 5 are swapped relative to the question)
samples <- replicate(N, mapply(rnorm, samplesizes, mean=4, sd=sqrt(1)))
# perform t-tests
t.tests <- lapply(samples, function(x) t.test(x, mu=5, alternative="two.sided"))
# get p-values
t.test.pvalues <- sapply(t.tests, function(x) x$p.value)
notRejected <- t.test.pvalues > .05   # TRUE when H0 (mu = 5) is NOT rejected
sampleIndices <- rep(1:n, N)
res <- aggregate(notRejected, list(sample = sampleIndices),
                 FUN = function(x) sum(x)/length(x))
names(res)[2] <- "propNotRejected"
print(res, row.names = FALSE)
#  sample propNotRejected
#       1            0.16
#       2            0.00
#       3            0.00
#       4            0.00
#       5            0.00
# the rejection rate (power) for each group is 1 minus these values
