Breakpoints and F statistics in strucchange package

As far as I know, the breakpoint should correspond to the observation that maximizes the F statistic, but I can't see any meaningful association between the F statistics and the timing of the break. What am I getting wrong?
y <- c(rnorm(30), 2+rnorm(20)) # 1 breakpoint
f <- Fstats(y ~ 1) # calculate F statistics
f$breakpoint # breakpoint Fstats suggests
which(f$Fstats == max(f$Fstats)) # observation with max F statistics
order(f$Fstats) # observations ordered by F statistics
As can be seen, the breakpoint is not the observation with the highest F statistic.

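Your y is not of class ts, so Fstats stores the F statistics as a time series with an odd-looking time index, which makes the output easy to misread. F statistics are only computed for a trimmed range of candidate breakpoints, so position i in f$Fstats is not observation i. Converting y to a ts makes the mapping visible: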
set.seed(1)
y <- c(rnorm(30), 2+rnorm(20))
ts.y <- ts(y, start = 1, frequency = 1) # change `y` into class `ts`
ts.f <- Fstats(ts.y ~ 1)
ts.f$breakpoint # [1] 30
ts.f$Fstats
# Time Series:
# Start = 7 # this means ts.f$Fstats[1] is 7th
which(ts.f$Fstats == max(ts.f$Fstats)) # [1] 24 # ts.f$Fstats[24] is 30th
plot(ts.f)
lines(breakpoints(ts.f))
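To make the index offset concrete, here is a minimal base-R sketch that computes Chow-type F statistics for y ~ 1 by hand. It is not strucchange's implementation: the trimming bounds `first` and `last` are assumptions chosen to mirror the "Start = 7" shown above, and the statistic may differ from Fstats by a constant factor, which does not affect where the maximum is.

```r
set.seed(1)
y <- c(rnorm(30), 2 + rnorm(20))
n <- length(y)

first <- 7            # assumed first candidate break (mirrors "Start = 7")
last  <- n - first    # assumed symmetric trimming at the other end
candidates <- first:last

# Residual sum of squares of the no-break model y ~ 1
rss0 <- sum((y - mean(y))^2)

# F statistic for a mean shift after observation i
fstat <- sapply(candidates, function(i) {
  left  <- y[1:i]
  right <- y[(i + 1):n]
  rss1  <- sum((left - mean(left))^2) + sum((right - mean(right))^2)
  (rss0 - rss1) / (rss1 / (n - 2))
})

pos <- which.max(fstat)   # position inside the trimmed vector
obs <- candidates[pos]    # observation number it corresponds to
c(position = pos, observation = obs)
```

The point is that `pos` and `obs` differ by `first - 1`, which is the same offset that separates `which(ts.f$Fstats == max(ts.f$Fstats))` from `ts.f$breakpoint` above.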

Related

How to code the permutation equivalent of Mood's Median Test in R? (get the p values using permutation)

I can do it for the two-sample t test, but not for the median test, the Wilcoxon test, or the Hodges–Lehmann test.
data_2000 <- c(500,450,600,700,550,551,552)
data_2019 <- c(560,460,620,720,540,600,750)
mean(data_2000)
mean(data_2019)
mean(data_2019) - mean(data_2000)
combined_data <- c(data_2000, data_2019)
set.seed(123)
null_dist <- c()
for (i in 1:100000) {
shuffled_data <- sample(combined_data)
shuffled_2000 <- shuffled_data[1:7]
shuffled_2019 <- shuffled_data[8:14]
null_dist[i] <- mean(shuffled_2019) - mean(shuffled_2000)
}
(p_value <- (sum(null_dist >= 49.57143) + sum(null_dist <= -49.57143)) / length(null_dist))

I think this is what you're trying to do. I altered your code as little as possible. There are packages like infer that will do this for you and the for loop is not the most efficient but it's plenty good enough and may help you learn. As long as we're looping I did mean and median at the same time since all other parts of the code are identical. ifelse is a nice easy way to make 1s and 0s to sum.
data_2000 <- c(500,450,600,700,550,551,552)
data_2019 <- c(560,460,620,720,540,600,750)
delta_mean <- mean(data_2019) - mean(data_2000)
delta_median <- median(data_2019) - median(data_2000)
combined_data <- c(data_2000, data_2019)
trials <- 100000
set.seed(123)
mean_diff <- c()
median_diff <- c()
for (i in 1:trials) {
shuffled_data <- sample(combined_data)
shuffled_2000 <- shuffled_data[1:7]
shuffled_2019 <- shuffled_data[8:14]
mean_diff[i] <- mean(shuffled_2019) - mean(shuffled_2000)
median_diff[i] <- median(shuffled_2019) - median(shuffled_2000)
}
p_mean <- sum(ifelse(mean_diff > delta_mean | mean_diff < -1 * delta_mean, 1, 0)) / trials
p_median <- sum(ifelse(median_diff > delta_median | median_diff < -1 * delta_median, 1, 0)) / trials
p_mean
#> [1] 0.31888
p_median
#> [1] 0.24446
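As an alternative sketch of the same idea, the loop can be wrapped in a small helper that takes the statistic as an argument. The helper name `perm_p` is hypothetical. Unlike the code above, it compares absolute values with `>=`, so permutations that tie with the observed difference count as at least as extreme; the p-values can therefore differ from the numbers above, especially for the median, where ties are common.

```r
data_2000 <- c(500, 450, 600, 700, 550, 551, 552)
data_2019 <- c(560, 460, 620, 720, 540, 600, 750)
combined  <- c(data_2000, data_2019)
n1 <- length(data_2000)

# Two-sided permutation p-value for the difference in a location statistic
# (hypothetical helper; `stat` is any function such as mean or median)
perm_p <- function(stat, trials = 10000) {
  observed <- stat(data_2019) - stat(data_2000)
  null <- replicate(trials, {
    shuffled <- sample(combined)
    stat(shuffled[-seq_len(n1)]) - stat(shuffled[seq_len(n1)])
  })
  mean(abs(null) >= abs(observed))
}

set.seed(123)
p_mean   <- perm_p(mean)
p_median <- perm_p(median)
p_mean
p_median
```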
Following up on your question about the HL test. Quoting Wikipedia:
The Hodges–Lehmann statistic also estimates the difference between two populations. For two sets of data with m and n observations, the set of two-element sets made of them is their Cartesian product, which contains m × n pairs of points (one from each set); each such pair defines one difference of values. The Hodges–Lehmann statistic is the median of the m × n differences.
You could run it on your data with the following code...
Do NOT run it 100,000 times; the answer is the same every time because you're already making all 49 possible pairings.
hl_df <- expand.grid(data_2019, data_2000)
hl_df$pair_diffs <- hl_df$Var1 - hl_df$Var2
median(hl_df$pair_diffs)
[1] 49
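For what it's worth, the same pairwise differences can also be built with `outer()` instead of `expand.grid()`. This is just an equivalent sketch of the calculation above, not a different estimator:

```r
data_2000 <- c(500, 450, 600, 700, 550, 551, 552)
data_2019 <- c(560, 460, 620, 720, 540, 600, 750)

# 7 x 7 matrix of all differences data_2019[i] - data_2000[j]
pair_diffs <- outer(data_2019, data_2000, "-")
hl_outer <- median(pair_diffs)

# Same calculation via expand.grid, as in the code above
hl_df <- expand.grid(data_2019, data_2000)
hl_grid <- median(hl_df$Var1 - hl_df$Var2)

c(outer = hl_outer, grid = hl_grid)
```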

You can do the Wilcoxon test with wilcox.test in the stats package (loaded by default as part of R core). You need to set exact = FALSE because an exact p-value is not possible if there are ties.
wilcox.test(data_2019, data_2000, exact = FALSE)
Wilcoxon rank sum test with continuity correction
data: data_2019 and data_2000
W = 33.5, p-value = 0.2769
alternative hypothesis: true location shift is not equal to 0
I'll update this when I figure out how to do the other tests.

Error in my math formula for implementing CUSUM in R

I'm trying to implement a check for decreasing values of avg temperatures to see when the temperature starts falling. See the chart of temperatures here:
Here is the formula I'm trying to implement:
Here is my code to implement that formula:
temps <- read.delim("temps.txt")
date_avgs <- rowMeans(temps[2:length(temps)], dims=1, na.rm=T)
mu <- 87
threshold <- 86
constant <- 3
date_avgs
S <- 0 * date_avgs
for (i in 2:length(date_avgs)) {
value <- S[i-1] + (mu - date_avgs[i] - constant)
cat("\nvalue", value, "si", date_avgs[i], i)
S[i] <- max(0, value)
if(S[i] >= threshold){
#Once I hit this for the first time, that indicates at this index the temp is decreasing
cat("\nDecreased past my threshold!!!", S[i] ,i)
}
}
But I'm not able to detect the change as I expect. My formula doesn't get over the threshold until index 108, when it should get there around index 60.
Here is the plot of my S (or CUSUM) values:
Any ideas what I'm doing wrong in my formula?

I think the problem is mu: something like mu <- mean(date_avgs) is basically the mean of all the observations, but mu should be the "mean of X if no change". Thus mu should be about 87, whereas according to your code and the plotted data it seems to be 80 or less.
# simulated data
set.seed(4422)
date_avgs <- c(runif(60, 84, 92), 88-(1:50)-rnorm(50,0,4))
plot(date_avgs)
# setting constants
mu <- 87
threshold <- 86
constant <- 3
# after running the for loop from the question
Index <- match(S[S >= threshold][1], S)
Index
[1] 75
# for data
> date_avgs[74]
[1] 73.41981
# Considering a lower threshold
# (as maximum allowable difference to detect trend 2 * C)
mu <- 87
threshold <- 6 # arbitrary
constant <- 3
# after running the for loop from the question
Index <- match(S[S >= threshold][1], S)
Index
[1] 66
So I think the code is fine; maybe the interpretation is not.
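To make the experiment easier to repeat with different settings, the loop from the question can be wrapped in a function. This is a minimal sketch: the names `cusum_decrease` and `first_alarm` are hypothetical, and the recursion is the one used in the question's code, S[i] = max(0, S[i-1] + (mu - x[i] - C)), starting from S[1] = 0.

```r
# CUSUM for detecting a decrease, following the loop in the question
cusum_decrease <- function(x, mu, C) {
  S <- numeric(length(x))
  for (i in seq_along(x)[-1]) {
    S[i] <- max(0, S[i - 1] + (mu - x[i] - C))
  }
  S
}

# First index where S reaches the threshold, or NA if it never does
first_alarm <- function(S, threshold) {
  hit <- which(S >= threshold)
  if (length(hit) == 0) NA_integer_ else hit[1]
}

# Simulated data as in the answer above
set.seed(4422)
date_avgs <- c(runif(60, 84, 92), 88 - (1:50) - rnorm(50, 0, 4))

S <- cusum_decrease(date_avgs, mu = 87, C = 3)
first_alarm(S, threshold = 86)  # strict threshold: detection comes late
first_alarm(S, threshold = 6)   # lower threshold: detection comes earlier
```

A lower threshold can only move the first alarm earlier or leave it unchanged, which is why the choice of threshold, rather than the recursion itself, drives how late the change is detected.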

Carrying out a PBIB.test

I have a data set from an incomplete lattice design study that I have imported into R from Excel, and I would like to conduct a PBIB.test. However, after running the function as shown below, the output says that object Area is not found, even after repeated attempts.
library("agricolae", lib.loc = "~/R/win-library/3.3")
Rdata2 <- PBIB.test("BlockNo", "AccNo", "Rep", Area, k = 9, c("REML"), console = TRUE)
Error in data.frame(v1 = 1, y) : object 'Area' not found
What is the problem?

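See below for a sample application of PBIB.test, based on the agricolae tutorial. Note that the block, treatment, replication and response are passed as vectors (here via with()), not as quoted column names.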
First, create some sample data.
# Construct the alpha design with 30 treatments, 2 repetitions, and block size = 3
library(agricolae)
Genotype <- c(paste("gen0", 1:9, sep = ""), paste("gen", 10:30, sep = ""))
r <- 2
k <- 3
s <- 10
b <- s * r
book <- design.alpha(Genotype, k, r, seed = 5)
# Source dataframe
df <- book$book
Create a vector of response values.
# Response variable
response <- c(
5,2,7,6,4,9,7,6,7,9,6,2,1,1,3,2,4,6,7,9,8,7,6,4,3,2,2,1,1,2,
1,1,2,4,5,6,7,8,6,5,4,3,1,1,2,5,4,2,7,6,6,5,6,4,5,7,6,5,5,4);
Run PBIB.test
model <- with(df, PBIB.test(block, Genotype, replication, response, k = 3, method="REML"))
head(model);
#$ANOVA
#Analysis of Variance Table
#
#Response: yield
# Df Sum Sq Mean Sq F value Pr(>F)
#Genotype 29 72.006 2.4830 1.2396 0.3668
#Residuals 11 22.034 2.0031
#
#$method
#[1] "Residual (restricted) maximum likelihood"
#
#$parameters
# test name.t treatments blockSize blocks r alpha
# PBIB-lsd Genotype 30 3 10 2 0.05
#
#$statistics
# Efficiency Mean CV
# 0.6170213 4.533333 31.22004
#
#$model
#Linear mixed-effects model fit by REML
# Data: NULL
# Log-restricted-likelihood: -73.82968
# Fixed: y ~ trt.adj
# (Intercept) trt.adjgen02 trt.adjgen03 trt.adjgen04 trt.adjgen05 trt.adjgen06
# 6.5047533 -3.6252940 -0.7701618 -2.5264354 -3.1633495 -1.9413054
#trt.adjgen07 trt.adjgen08 trt.adjgen09 trt.adjgen10 trt.adjgen11 trt.adjgen12
# -3.0096514 -4.0648738 -3.5051139 -2.8765561 -1.7111335 -1.6308755
#trt.adjgen13 trt.adjgen14 trt.adjgen15 trt.adjgen16 trt.adjgen17 trt.adjgen18
# -2.2187974 -2.3393290 -2.0807215 -0.3122845 -3.4526453 -1.0320169
#trt.adjgen19 trt.adjgen20 trt.adjgen21 trt.adjgen22 trt.adjgen23 trt.adjgen24
# -3.1257616 0.2101325 -1.7632411 -1.9177848 -1.0500345 -2.5612960
#trt.adjgen25 trt.adjgen26 trt.adjgen27 trt.adjgen28 trt.adjgen29 trt.adjgen30
# -4.3184716 -2.3071359 1.2239927 -1.3643068 -1.4354599 -0.4726870
#
#Random effects:
# Formula: ~1 | replication
# (Intercept)
#StdDev: 8.969587e-05
#
# Formula: ~1 | block.adj %in% replication
# (Intercept) Residual
#StdDev: 1.683459 1.415308
#
#Number of Observations: 60
#Number of Groups:
# replication block.adj %in% replication
# 2 20
#
#$Fstat
# Fit Statistics
#AIC 213.65937
#BIC 259.89888
#-2 Res Log Likelihood -73.82968
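The "object 'Area' not found" error in the question looks like a plain scoping issue: a column of a data frame is not visible as a free-standing variable. The base-R sketch below illustrates this with a toy data frame (the data frame `toy` and its values are made up for illustration, and it assumes no global variable called Area exists). The same reasoning is why the PBIB.test call above is wrapped in with().

```r
# Toy data frame standing in for the imported Excel data (values are made up)
toy <- data.frame(
  Rep  = rep(1:2, each = 3),
  Area = c(10.2, 11.5, 9.8, 12.1, 10.9, 11.3)
)

# Referring to the column as a bare name fails: Area only exists inside toy
bare <- tryCatch(mean(Area), error = function(e) conditionMessage(e))
bare

# Either evaluate the call inside the data frame, or index the column
inside  <- with(toy, mean(Area))
indexed <- mean(toy$Area)
c(inside = inside, indexed = indexed)
```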

How can I perform multiple pairwise t.test in R using the same reference vector?

Let's consider the following vectors in the dataframe:
ctrl <- rnorm(50)
x1 <- rnorm(30, mean=0.2)
x2 <- rnorm(100,mean=0.1)
x3 <- rnorm(100,mean=0.4)
x <- data.frame(data=c(ctrl,x1,x2,x3),
Group=c(
rep("ctrl", length(ctrl)),
rep("x1", length(x1)),
rep("x2", length(x2)),
rep("x3", length(x3))) )
I know I could use
pairwise.t.test(x$data,
x$Group,
pool.sd=FALSE)
to get pairwise comparison like
Pairwise comparisons using t tests with non-pooled SD
data: x$data and x$Group
ctrl x1 x2
x1 0.08522 - -
x2 0.99678 0.10469 -
x3 0.00065 0.99678 2.8e-05
P value adjustment method: holm
However, I am not interested in every possible combination of vectors. I am seeking a way to compare the ctrl vector with every other vector while taking alpha inflation into account. I'd like to avoid
t.test((x$data[x$Group=='ctrl']), (x$data[x$Group=='x1']), var.equal=T)
t.test((x$data[x$Group=='ctrl']), (x$data[x$Group=='x2']), var.equal=T)
t.test((x$data[x$Group=='ctrl']), (x$data[x$Group=='x3']), var.equal=T)
And then perform manual correction for multiple comparisons. What would be the best way to do so ?

You can use p.adjust to correct multiple p-values; note that its default method is Holm, so pass method = "bonferroni" if you specifically want a Bonferroni adjustment. You should not bundle those unequal-length vectors into a dataframe but rather use a list.
ctrl <- rnorm(50)
x1 <- rnorm(30, mean=0.2)
x2 <- rnorm(100,mean=0.1)
x3 <- rnorm(100,mean=0.4)
> lapply( list(x1,x2,x3), function(x) t.test(x,ctrl)$p.value)
[[1]]
[1] 0.2464039
[[2]]
[1] 0.8576423
[[3]]
[1] 0.0144275
> p.adjust( .Last.value)
[1] 0.4928077 0.8576423 0.0432825
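A slightly more explicit sketch of the same approach, with a seed (an assumption, the original has none) and named groups so the adjusted p-values stay labelled. It computes both the Holm and Bonferroni corrections for comparison:

```r
set.seed(42)  # assumed seed, only so the sketch is reproducible
ctrl <- rnorm(50)
x1 <- rnorm(30, mean = 0.2)
x2 <- rnorm(100, mean = 0.1)
x3 <- rnorm(100, mean = 0.4)

groups <- list(x1 = x1, x2 = x2, x3 = x3)

# Welch t-test of each group against the control; keep only the p-value
p_raw <- sapply(groups, function(g) t.test(g, ctrl)$p.value)

# Correct for the three comparisons actually made
p_holm <- p.adjust(p_raw, method = "holm")
p_bonf <- p.adjust(p_raw, method = "bonferroni")

rbind(raw = p_raw, holm = p_holm, bonferroni = p_bonf)
```

Holm is never more conservative than Bonferroni, which is one reason it is the default in p.adjust.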

@BondedDust's answer looks great. I provide a slightly more complicated solution in case you really need to work with dataframes.
library(dplyr)
ctrl <- rnorm(50)
x1 <- rnorm(30, mean=0.2)
x2 <- rnorm(100,mean=0.1)
x3 <- rnorm(100,mean=0.4)
x <- data.frame(data=c(ctrl,x1,x2,x3),
Group=c(
rep("ctrl", length(ctrl)),
rep("x1", length(x1)),
rep("x2", length(x2)),
rep("x3", length(x3))), stringsAsFactors = F )
# provide the combinations you want
# set1 with all from set2
set1 = c("ctrl")
set2 = c("x1","x2","x3")
dt_res =
data.frame(expand.grid(set1,set2)) %>% # create combinations
mutate(test_id = row_number()) %>% # create a test id
group_by(test_id) %>% # group by test id, so everything from now on is performed for each test separately
do({x_temp = x[(x$Group==.$Var1 | x$Group==.$Var2),] # for each test id keep groups of interest
x_temp = data.frame(x_temp)}) %>%
do(test = t.test(data~Group, data=.)) # perform the test and save it
# you create a dataset that has the test id and a column with t.tests results as elements
dt_res
# Source: local data frame [3 x 2]
# Groups: <by row>
#
# test_id test
# 1 1 <S3:htest>
# 2 2 <S3:htest>
# 3 3 <S3:htest>
# get all tests as a list
dt_res$test
# [[1]]
#
# Welch Two Sample t-test
#
# data: data by Group
# t = -1.9776, df = 58.36, p-value = 0.05271
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
# -0.894829477 0.005371207
# sample estimates:
# mean in group ctrl mean in group x1
# -0.447213560 -0.002484425
#
#
# [[2]]
#
# Welch Two Sample t-test
#
# data: data by Group
# t = -2.3549, df = 100.68, p-value = 0.02047
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
# -0.71174095 -0.06087081
# sample estimates:
# mean in group ctrl mean in group x2
# -0.44721356 -0.06090768
#
#
# [[3]]
#
# Welch Two Sample t-test
#
# data: data by Group
# t = -5.4235, df = 101.12, p-value = 4.001e-07
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
# -1.2171386 -0.5652189
# sample estimates:
# mean in group ctrl mean in group x3
# -0.4472136 0.4439652
PS: It's always interesting to work with p-values and alpha corrections. It's a bit of a philosophical issue how to approach that, and some people agree while others disagree. Personally, I tend to correct alpha based on all possible comparisons I can do after an experiment, because you never know when you'll come back to investigate other pairs. Imagine what happens if in the future people decide that you have to go back and compare the winning group (let's say x1) with x2 and x3. You'll focus on those pairs and you'll again correct alpha based on those comparisons. But on the whole you performed all possible comparisons, apart from x2 vs x3! You may write your reports or publish findings that should have been a bit more strict on the alpha correction.

Using split function in R

I am trying to simulate three small datasets, which contain x1, x2, x3, x4, trt and IND.
However, when I try to split the simulated data by IND using "split" in R, I get warning messages and the outputs are incorrect. Could someone please give me a hint as to what I did wrong in my R code?
# Step 2: simulate data
Alpha = 0.05
S = 3 # number of replicates
x = 8 # number of covariates
G = 3 # number of treatment groups
N = 50 # number of subjects per dataset
tot = S*N # total subjects for a simulation run
# True parameters
alpha = c(0.5, 0.8) # intercepts
b1 = c(0.1,0.2,0.3,0.4) # for pi_1 of trt A
b2 = c(0.15,0.25,0.35,0.45) # for pi_2 of trt B
b = c(1.1,1.2,1.3,1.4);
##############################################################################
# Scenario 1: all covariates are independent standard normally distributed #
##############################################################################
set.seed(12)
x1 = rnorm(n=tot, mean=0, sd=1);x2 = rnorm(n=tot, mean=0, sd=1);
x3 = rnorm(n=tot, mean=0, sd=1);x4 = rnorm(n=tot, mean=0, sd=1);
###############################################################################
p1 = exp(alpha[1]+b1[1]*x1+b1[2]*x2+b1[3]*x3+b1[4]*x4)/
(1+exp(alpha[1]+b1[1]*x1+b1[2]*x2+b1[3]*x3+b1[4]*x4) +
exp(alpha[2]+b2[1]*x1+b2[2]*x2+b2[3]*x3+b2[4]*x4))
p2 = exp(alpha[2]+b2[1]*x1+b2[2]*x2+b2[3]*x3+b2[4]*x4)/
(1+exp(alpha[1]+b1[1]*x1+b1[2]*x2+b1[3]*x3+b1[4]*x4) +
exp(alpha[2]+b2[1]*x1+b2[2]*x2+b2[3]*x3+b2[4]*x4))
p3 = 1/(1+exp(alpha[1]+b1[1]*x1+b1[2]*x2+b1[3]*x3+b1[4]*x4) +
exp(alpha[2]+b2[1]*x1+b2[2]*x2+b2[3]*x3+b2[4]*x4))
# To assign subjects to one of treatment groups based on response probabilities
tmp = function(x){sample(c("A","B","C"), 1, prob=x, replace=TRUE)}
trt = apply(cbind(p1,p2,p3),1,tmp)
IND=rep(1:S,each=N) #create an indicator for split simulated data
sim=data.frame(x1,x2,x3,x4,trt, IND)
Aset = subset(sim, trt=="A")
Bset = subset(sim, trt=="B")
Cset = subset(sim, trt=="C")
Anew = split(Aset, f = IND)
Bnew = split(Bset, f = IND)
Cnew = split(Cset, f = IND)
The warning message:
> Anew = split(Aset, f = IND)
Warning message:
In split.default(x = seq_len(nrow(x)), f = f, drop = drop, ...) :
data length is not a multiple of split variable
and the output becomes
$`2`
x1 x2 x3 x4 trt IND
141 1.0894068 0.09765185 -0.46702047 0.4049424 A 3
145 -1.2953113 -1.94291045 0.09926239 -0.5338715 A 3
148 0.0274979 0.72971804 0.47194731 -0.1963896 A 3
$`3`
[1] x1 x2 x3 x4 trt IND
<0 rows> (or 0-length row.names)
I have checked my R code several times; however, I can't figure out what I did wrong. Many thanks in advance.

IND is the global variable for the full data, sim. You want to use the specific one for the subset, eg
Anew <- split(Aset, f = Aset$IND)
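Here is a small self-contained sketch of the difference, using a toy data frame (made-up data, not the simulation from the question). The global IND has one entry per row of the full data, so it no longer lines up with the rows of a subset, whereas the subset's own IND column does. Splitting on both variables at once is shown as an optional shortcut.

```r
set.seed(12)
sim <- data.frame(
  x1  = rnorm(12),
  trt = rep(c("A", "B", "C"), times = 4),
  IND = rep(1:3, each = 4)
)
IND <- sim$IND                 # global vector, length 12

Aset <- subset(sim, trt == "A")  # only 4 rows, so IND no longer matches

# Split on the subset's own column: every piece has a single IND value
Anew <- split(Aset, f = Aset$IND)
sapply(Anew, nrow)

# Optional: split the full data by treatment and replicate in one step
by_both <- split(sim, list(sim$trt, sim$IND))
names(by_both)
```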

It's a warning, not an error, which means split executed successfully, but may not have done what you wanted to do.
From the "details" section of the help file:
f is recycled as necessary and if the length of x is not a multiple of
the length of f a warning is printed. Any missing values in f are
dropped together with the corresponding values of x.
Try checking the length of your IND against the size of your dataframe, maybe.

Not sure what your goal is once you have your data split, but this sounds like a good candidate for the plyr package.
> library(plyr)
> ddply(sim, .(trt,IND), summarise, x1mean=mean(x1), x2sum=sum(x2), x3min=min(x3), x4max=max(x4))
trt IND x1mean x2sum x3min x4max
1 A 1 -0.49356448 -1.5650528 -1.016615 2.0027822
2 A 2 0.05908053 5.1680463 -1.514854 0.8184445
3 A 3 0.22898716 1.8584443 -1.934188 1.6326763
4 B 1 0.01531230 1.1005720 -2.002830 2.6674931
5 B 2 0.17875088 0.2526760 -1.546043 1.2021935
6 B 3 0.13398967 -4.8739380 -1.565945 1.7887837
7 C 1 -0.16993037 -0.5445507 -1.954848 0.6222546
8 C 2 -0.04581149 -6.3230167 -1.491114 0.8714535
9 C 3 -0.41610973 0.9085831 -1.797661 2.1174894
>
Here you can replace summarise and its following arguments with any function that returns a data.frame or something that can be coerced to one. If lists are the target, ldply is your friend.
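If plyr is not an option, a similar per-group summary can be sketched in base R with aggregate(). The toy data frame below is made up for illustration, and this version applies one function (the mean) to all listed columns rather than a different function per column.

```r
set.seed(12)
toy <- data.frame(
  x1  = rnorm(12),
  x2  = rnorm(12),
  trt = rep(c("A", "B", "C"), times = 4),
  IND = rep(1:3, each = 4)
)

# Mean of x1 and x2 for every treatment / replicate combination
group_means <- aggregate(cbind(x1, x2) ~ trt + IND, data = toy, FUN = mean)
group_means
```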
