I can do it for the two sample t test but not for Median test or Wilcoxon test or Hodges Lehmann test
data_2000 <- c(500,450,600,700,550,551,552)
data_2019 <- c(560,460,620,720,540,600,750)
mean(data_2000)
mean(data_2019)
mean(data_2019) - mean(data_2000)
combined_data <- c(data_2000, data_2019)
set.seed(123)
null_dist <- c()
for (i in 1:100000) {
shuffled_data <- sample(combined_data)
shuffled_2000 <- shuffled_data[1:7]
shuffled_2019 <- shuffled_data[8:14]
null_dist[i] <- mean(shuffled_2019) - mean(shuffled_2000)
}
(p_value <- (sum(null_dist >= 49.57143) + sum(null_dist <=
`enter code here`-49.57143))/length(null_dist))
I think this is what you're trying to do. I altered your code as little as possible. There are packages like infer that will do this for you and the for loop is not the most efficient but it's plenty good enough and may help you learn. As long as we're looping I did mean and median at the same time since all other parts of the code are identical. ifelse is a nice easy way to make 1s and 0s to sum.
data_2000 <- c(500,450,600,700,550,551,552)
data_2019 <- c(560,460,620,720,540,600,750)
delta_mean <- mean(data_2019) - mean(data_2000)
delta_median <- median(data_2019) - median(data_2000)
combined_data <- c(data_2000, data_2019)
trials <- 100000
set.seed(123)
mean_diff <- c()
median_diff <- c()
for (i in 1:trials) {
shuffled_data <- sample(combined_data)
shuffled_2000 <- shuffled_data[1:7]
shuffled_2019 <- shuffled_data[8:14]
mean_diff[i] <- mean(shuffled_2019) - mean(shuffled_2000)
median_diff[i] <- median(shuffled_2019) - median(shuffled_2000)
}
p_mean <- sum(ifelse(mean_diff > delta_mean | mean_diff < -1 * delta_mean, 1, 0)) / trials
p_median <- sum(ifelse(median_diff > delta_median | median_diff < -1 * delta_median, 1, 0)) / trials
p_mean
#> [1] 0.31888
p_median
#> [1] 0.24446
Following up on your question about HL test. Quoting Wikipedia
The Hodges–Lehmann statistic also estimates the difference between two populations. For two sets of data with m and n observations, the set of two-element sets made of them is their Cartesian product, which contains m × n pairs of points (one from each set); each such pair defines one difference of values. The Hodges–Lehmann statistic is the median of the m × n differences.
You could run it on your data with the following code...
Do NOT run it 100,000 times the answer is the same everytime because you're already making all 49 possible pairings
hl_df <- expand.grid(data_2019, data_2000)
hl_df$pair_diffs <- hl_df$Var1 - hl_df$Var2
median(hl_df$pair_diffs)
[1] 49
You can do the Wilcoxon test with wilcox.test in the stats package (loaded by default as part of R core). You need to set exact = FALSE because an exact p-value is not possible if there are ties.
wilcox.test(data_2019, data_2000, exact = FALSE)
Wilcoxon rank sum test with continuity correction
data: data_2019 and data_2000
W = 33.5, p-value = 0.2769
alternative hypothesis: true location shift is not equal to 0
I'll update this when I figure out how to do the other tests.
I'm trying to implement a check for decreasing values of avg temperatures to see when the temperature starts falling. See the chart of temperatures here:
Here is the formula I'm trying to implement:
Here is my code to implement that formula:
temps <- read.delim("temps.txt")
date_avgs <- rowMeans(temps[2:length(temps)], dims=1, na.rm=T)
mu <- 87
threshold <- 86
constant <- 3
date_avgs
S <- 0 * date_avgs
for (i in 2:length(date_avgs)) {
value <- S[i-1] + (mu - date_avgs[i] - constant)
cat("\nvalue", value, "si", date_avgs[i], i)
S[i] <- max(0, value)
if(S[i] >= threshold){
#Once I hit this for the first time, that indicates at this index the temp is decreasing
cat("\nDecreased past my threshold!!!", S[i] ,i)
}
}
But I'm not able to detect the change as I expect. My formula doesn't get over the threshold until index 108, when it should get there around index 60.
Here is the plot of my S (or CUSUM) values:
Any ideas what I'm doing wrong in my formula?
I think the problem is mu <- mean(date_avgs) basically means of all the observations. But mu should be "mean of X if no change". Thus mu should be about 87 but according your code and plotted data seems to be 80 or less.
# simulated data
set.seed(4422)
date_avgs <- c(runif(60, 84, 92), 88-(1:50)-rnorm(50,0,4))
plot(date_avgs)
# setting constants
mu <- 87
threshold <- 86
constant <- 3
# after running for cycle
Index <- match(S[S >= threshold][1], S)
Index
[1] 75
# for data
> date_avgs[74]
[1] 73.41981
# Considering a lower threshold
# (as maximum allowable difference to detect trend 2 * C)
mu <- 87
threshold <- 6 # arbitrary
constant <- 3
# after running for cycle
Index <- match(S[S >= threshold][1], S)
Index
[1] 66
So I think code is fine, maybe the interpretation is not
As far as I know the breakpoint should correspond to the observation which maximizes the F statistics, but I can't see any meaningful association between the F statistics and the timing of the break. What do I get wrong?
y <- c(rnorm(30), 2+rnorm(20)) # 1 breakpoint
f <- Fstats(y ~ 1) # calculate F statistics
f$breakpoint # breakpoint Fstats suggests
which(f$Fstats == max(f$Fstats)) # observation with max F statistics
order(f$Fstats) # observations ordered by F statistics
As can be seen the observation of the breakpoint is not the observation with the highest F statistics.
Your y is not class ts. So the output became a bit curious ts data and unfortunately you failed to interpret it.
set.seed(1)
y <- c(rnorm(30), 2+rnorm(20))
ts.y <- ts(y, start = 1, frequency = 1) # change `y` into class `ts`
ts.f <- Fstats(ts.y ~ 1)
ts.f$breakpoint # [1] 30
ts.f$Fstats
# Time Series:
# Start = 7 # this means ts.f$Fstats[1] is 7th
which(ts.f$Fstats == max(ts.f$Fstats)) # [1] 24 # ts.f$Fstats[24] is 30th
plot(ts.f)
lines(breakpoints(ts.f))
If I think I understand something I like to verify, so in this case I was trying to verify the calculation of the Partial Autocorrelation. pacf().
what I end up with is something a little different. My understanding is that the pacf would be the coefficient of the regression of the last/furthest lag given all of the previous lags. So to set up some code, I'm using Canadian employment data and the book Elements of Forecasting by F. Diebold (1998) Chapter6
#Obtain Canadian Employment dataset
caemp <- c(83.090255, 82.7996338824, 84.6344380294, 85.3774583529, 86.197605, 86.5788438824, 88.0497240294, 87.9249263529, 88.465131, 88.3984638824, 89.4494320294, 90.5563753529, 92.272335, 92.1496788824, 93.9564890294, 94.8114863529, 96.583434, 96.9646728824, 98.9954360294, 101.138164353, 102.882122, 103.095394882, 104.006386029, 104.777404353, 104.701732, 102.563504882, 103.558486029, 102.985774353, 102.098281, 101.471734882, 102.550696029, 104.021564353, 105.093652, 105.194954882, 104.594266029, 105.813184353, 105.149642, 102.899434882, 102.354736029, 102.033974353, 102.014299, 101.835654882, 102.018806029, 102.733834353, 103.134062, 103.263354882, 103.866416029, 105.393274353, 107.081242, 108.414274882, 109.297286029, 111.495994353, 112.680072, 113.061304882, 112.376636029, 111.244054353, 107.305192, 106.678644882, 104.678246029, 105.729204353, 107.837082, 108.022364882, 107.281706029, 107.016934353, 106.045452, 106.370704882, 106.049966029, 105.841184353, 106.045452, 106.650644882, 107.393676029, 108.668584353, 109.628702, 110.261894882, 110.920946029, 110.740154353, 110.048622, 108.190324882, 107.057746029, 108.024724353, 109.712692, 111.409654882, 108.765396029, 106.289084353, 103.917902, 100.799874882, 97.3997700294, 93.2438143529, 94.123068, 96.1970798824, 97.2754290294, 96.4561423529, 92.674237, 92.8536228824, 93.4304540294, 93.2055593529, 93.955896, 94.7296738824, 95.5665510294, 95.5459793529, 97.09503, 97.7573598824, 96.1609430294, 96.5861653529, 103.874812, 105.094384882, 106.804276029, 107.786744353, 106.596022, 107.310354882, 106.897156029, 107.210924353, 107.134682, 108.829774882, 107.926196029, 106.298904353, 103.365872, 102.029554882, 99.3000760294, 95.3045073529, 90.50099, 88.0984848824, 86.5150710294, 85.1143943529, 89.033584, 88.8229008824, 88.2666710294, 87.7260053529, 88.102896, 87.6546968824, 88.4004090294, 88.3618013529, 89.031151, 91.0202948824, 91.6732820294, 92.0149173529)
# create time series with the canadian employment dataset
caemp.ts<-ts(caemp, start=c(1961, 1), end=c(1994, 4), frequency=4)
caemp.ts2<-window(caemp.ts,start=c(1961,5), end=c(1993,4))
# set up max lag the book says use sqrt(T) but in this case i'm using 3 for the example
lag.max <- 3
# R Code using pacf()
pacf(caemp.ts2, lag.max=3, plot=F)
# initialize vector to capture the partial autocorrelations
pauto.corr <- rep(0, lag.max)
# Set up lagged data frame
pa.mat <- as.data.frame(caemp.ts2)
for(i in 1:lag.max){
a <- c(rep(NA, i), pa.mat[1:(length(caemp.ts2) - i),1])
pa.mat <- cbind(pa.mat, a)
}
names(pa.mat) <- c("0":lag.max)
# Set up my base linear model
base.lm <- lm(pa.mat[, 1] ~ 1)
### I could not get the for loop to work successfully here
i <- 1
base.lm <- update(base.lm, .~. + pa.mat[,2])
pauto.corr[i]<-base.lm$coefficients[length(base.lm$coefficients)]
i<-2
base.lm <-update(base.lm, .~. + pa.mat[,3])
pauto.corr[i]<-base.lm$coefficients[length(base.lm$coefficients)]
i<-3
base.lm <-update(base.lm, .~. + pa.mat[,4])
pauto.corr[i]<-base.lm$coefficients[length(base.lm$coefficients)]
# Compare results...
round(pauto.corr,3)
pacf(caemp.ts2, lag.max=3, plot=F)
For the output
> round(pauto.corr,3)
[1] 0.971 -0.479 -0.072
> pacf(caemp.ts2, lag.max=3, plot=F)
Partial autocorrelations of series ‘caemp.ts2’, by lag
0.25 0.50 0.75
0.949 -0.244 -0.100
Maybe it is because my example is quarterly and not monthly data, or I could just be wrong?
I am trying to generate 100 samples of Z, where Z is the summation of 8 independent uniformly distributed random variables in the interval [0;1]
I have the following code so far, but I'm not sure if it's correct. I am not sure if my loop is correct
eight<-runif(8,0,1) #Uses the runif function to generate 8 uniform 0-1 random variables
Z_1<-sum(eight) #caclulates the sum and stores it in the variable Z_1
sample <-NA
for (i in 1:100 ) { #Function continues the loop for 100 different values
eight<-runif(8,0,1); #Creates sum loop for 8 independent values uniform 0-1 random variables.
Z_1<-sum(eight); # stores in the sum loop in the variable Z
sample[i] = Z_1;
}`
Thanks
I would vectorize the whole thing. There is no real reason to run 100 iterations when you can just generate 800 observations in one run. Then just use matrix and colSums and you done
set.seed(123)
n <- 100
Z <- colSums(matrix(runif(8 * n), 8, n))
Z
# [1] 4.774425 4.671171 4.787691 4.804041 3.513257 2.330163 3.300135 3.568657 5.503481 2.861764 4.533283 3.353658
# [13] 4.230073 4.690739 4.364708 3.094156 4.933616 3.942834 3.712522 2.587036 3.731474 4.388749 4.484030 4.315968
# [25] 4.800758 4.252631 2.716972 5.159044 4.146881 3.244546 4.418618 4.350035 5.344182 3.176801 3.903337 2.261935
# [37] 3.646572 4.286075 3.074900 4.210506 3.172513 4.535665 4.245856 4.184848 4.532286 2.899883 4.473629 4.751224
# [49] 3.498038 3.337437 4.238989 3.349812 3.888696 4.748254 3.029479 4.246619 3.330434 3.879168 3.786216 3.839956
# [61] 3.878997 4.546531 2.863010 3.417072 4.266108 3.141875 4.960758 3.989613 4.373042 4.295742 4.459014 5.561066
# [73] 4.401990 4.121301 3.830575 3.412446 3.812347 5.591238 3.801587 4.454336 4.213343 5.222007 4.300991 2.765003
# [85] 3.293251 5.362586 2.954080 3.036312 3.655677 3.373603 5.575184 4.167740 3.904038 3.884440 2.901452 3.079311
# [97] 4.927770 3.930943 4.169907 2.922618