I'm performing a time series analysis with multiple breakpoints in R.
I managed to identify three breakpoints using the procedure suggested in the strucchange package, but I'm struggling to get the significance (p-value) for these breakpoints.
Here is a dummy dataset and the code I was working with.
The dataset:
x=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,
29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,
54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,
79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94)
z=c(128,103,29,117,53,49,84,67,76,111,81,38,36,-35,-12,21,121,38,84,173,153,99,
91,110,69,50,15,-50,15,-97,-2,13,107,47,137,25,-19,54,4,87,72,58,32,-4,75,50,
80,65,124,56,58,-3,30,42,55,212,245,18,106,128,88,216,205,234,120,171,195,230,
237,143,225,253,202,218,283,227,291,192,179,197,337,259,261,215,290,293,255,
316,355,312,337,341,388,338)
df=data.frame(z,x)
plot(x,z)
The code:
# https://www.marinedatascience.co/blog/2019/09/28/comparison-of-change-point-detection-methods/
library(strucchange)
library(sandwich)
library(fxregime)
# get best model
opt_bpts <- function(x) {
#x = bpts_sum$RSS["BIC",]
n <- length(x)
# a local minimum needs a neighbour on both sides, so both endpoints stay FALSE
lowest <- vector("logical", length = n)
for (i in seq_len(n)[-c(1, n)]) {
lowest[i] <- x[i] < x[i-1] & x[i] < x[i+1]
}
out <- as.integer(names(x)[lowest])
return(out)
}
#################################################################
#marinedatascience.co/blog/2019/09/28/comparison-of-change-point-detection-methods/
#Zeileis, A., Leisch, F., Hornik, K. & Kleiber, C. (2002), strucchange: An R Package for Testing for Structural Change in Linear Regression Models. J Stat Softw 7(2), 38p., doi: 10.18637/jss.v007.i02
z_ts <- as.ts(df$z) #create time series
bpts <- breakpoints(z ~ x, data = df)
plot(bpts)
bpts_sum <- summary(bpts)
opt_brks <- opt_bpts(bpts_sum$RSS["BIC",])
opt_brks
# Nested syntax with 3 breaks:
ci=confint(bpts,breaks = 3)#, level = 0.99)
bpts <- breakpoints(breakpoints(z ~ x, data = df), breaks = 3)
Fst=Fstats(z_ts~1)
#here I get a p-value for the analysis with three breakpoints
plot(Fst)
sctest(Fst)
The output I get for Fst is:
supF test
data: Fst
sup.F = 203.23, p-value < 2.2e-16
I would like to obtain (if it's possible) the p-value of each breakpoint.
Something like this:
F test
data: Fst
breakpoint1:p-value < brk1.pvalue
breakpoint2:p-value < brk2.pvalue
breakpoint3:p-value < brk3.pvalue
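One possible way to attach a p-value to each individual break is a Chow-type F test on the two segments adjacent to that break: fit one pooled line across both segments, fit separate lines on each side, and compare the two fits. The sketch below uses base R only; `chow_p`, the simulated `demo` data, and the break positions in `brks` are all hypothetical. Because the break positions in a real analysis are themselves estimated from the data, p-values obtained this way are optimistic and should be read as descriptive rather than as formal tests.

```r
# Chow-type F test for one break between two adjacent segments.
# `chow_p` is a hypothetical helper, not part of strucchange.
chow_p <- function(data, start, brk, end) {
  seg <- data[start:end, ]
  seg$after <- factor(seq(start, end) > brk)   # FALSE before the break, TRUE after
  pooled   <- lm(z ~ x, data = seg)            # one line across both segments
  separate <- lm(z ~ x * after, data = seg)    # own intercept and slope per segment
  anova(pooled, separate)[2, "Pr(>F)"]
}

# Simulated series with two level shifts (stand-in for the real data)
set.seed(1)
demo <- data.frame(x = 1:90)
demo$z <- c(rep(50, 30), rep(120, 30), rep(250, 30)) + rnorm(90, sd = 20)

brks   <- c(30, 60)                  # assumed breaks: last observation of each segment
bounds <- c(0, brks, nrow(demo))
p_vals <- sapply(seq_along(brks), function(k) {
  chow_p(demo, bounds[k] + 1, brks[k], bounds[k + 2])
})
names(p_vals) <- paste0("breakpoint", seq_along(brks))
p_vals
```

With the real data, `brks` would presumably come from the fitted breakpoints object (for example its `breakpoints` component), and `demo` would be replaced by `df`.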
So, just a touch of backstory. I've been learning biostatistics in the past 4-5 months in university, 6 months of biomathematics before that. I only started deep diving into programming around 5 days ago.
I've been trying to redo t.test() with my own function.
test2 = function(t,u){
T = (mean(t) - u) / ( sd(t) / sqrt(length(t)))
t1=round(T, digits=5)
df=length(t)
cat(paste('t - value =', t1,
'\n','df =', df-1,
'\n','Alternative hipotézis: a minta átlag nem egyenlő a hipotetikus átlaggal'))
}
I searched for the formula for the p-value and found one, but when I used it, my value was different from the one given by t.test().
The t-value and the df do match t.test().
I would highly appreciate any help, thank you.
P.S.: Don't worry about the last line of the function; it's in Hungarian and means "Alternative hypothesis: the sample mean is not equal to the hypothesized mean".
The p-value can be derived from the cumulative distribution function of the t distribution, pt(). Using this, and switching to the more common notation of a sample x and a hypothesized population mean u, we can use something like:
test2 <- function(x, u){
t <- (mean(x) - u) / (sd(x) / sqrt(length(x)))
df <- length(x) - 1
# 2 * pt(-abs(t), df) is the two-sided p-value and stays valid when t is negative
cat('t-value =', t, ', df =', df, ', p =', 2 * pt(-abs(t), df = df), '\n')
}
set.seed(123) # remove this for other random values
## random sample
x <- rnorm(10, mean=5.5)
## population mean
mu <- 5
## own function
test2(x, mu)
## one sample t-test from R
t.test(x, mu=mu)
Our own test2 gives:
t-value = 1.905175 , df = 9, p = 0.08914715
and R's t.test gives:
One Sample t-test
data: x
t = 1.9052, df = 9, p-value = 0.08915
alternative hypothesis: true mean is not equal to 5
95 percent confidence interval:
4.892330 6.256922
sample estimates:
mean of x
5.574626
The definitive source of what R is doing is the source code. If you look at the source code for stats:::t.test.default (which you can get by typing stats:::t.test.default into the console, without parentheses at the end and hitting enter), you'll see that for a single-sample test like the one you're trying to do above, you would get the following:
nx <- length(x)
mx <- mean(x)
vx <- var(x)
df <- nx - 1
stderr <- sqrt(vx/nx)
tstat <- (mx - mu)/stderr
if (alternative == "less") {
pval <- pt(tstat, df)
}
else if (alternative == "greater") {
pval <- pt(tstat, df, lower.tail = FALSE)
}
else {
pval <- 2 * pt(-abs(tstat), df)
}
These are the relevant pieces (there's a lot more code in there, too).
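As a rough check, the three branches of the excerpt can be wrapped in a small function and compared with t.test() itself. This is only a sketch: `t_pvalue` and `x_demo` are names made up for illustration, and the sample is drawn so that its mean lies below the hypothesized mean, which makes the t statistic negative and exercises the `-abs(tstat)` part.

```r
# Hypothetical helper mirroring the p-value branches of the excerpt above.
t_pvalue <- function(x, mu, alternative = c("two.sided", "less", "greater")) {
  alternative <- match.arg(alternative)
  df    <- length(x) - 1
  tstat <- (mean(x) - mu) / sqrt(var(x) / length(x))
  switch(alternative,
         less      = pt(tstat, df),
         greater   = pt(tstat, df, lower.tail = FALSE),
         two.sided = 2 * pt(-abs(tstat), df))
}

set.seed(123)
x_demo <- rnorm(10, mean = 4.5)   # sample mean below mu = 5, so t is negative

# Each comparison should report TRUE
all.equal(t_pvalue(x_demo, 5),
          t.test(x_demo, mu = 5)$p.value)
all.equal(t_pvalue(x_demo, 5, "less"),
          t.test(x_demo, mu = 5, alternative = "less")$p.value)
all.equal(t_pvalue(x_demo, 5, "greater"),
          t.test(x_demo, mu = 5, alternative = "greater")$p.value)
```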
I have run a multiple imputation (m=45, 10 iterations) using the mice package, and want to calculate Cronbach's alpha for a number of ordinal scales in the data. Is there a function in R that could help me calculate the alpha coefficient across the imputed datasets in a way that satisfies Rubin's rules for pooling estimates?
We may exploit pool.scalar from the mice package, which performs pooling of univariate estimates according to Rubin's rules.
Since you have not provided a reproducible example yourself, I will provide one.
set.seed(123)
# sample survey responses
df <- data.frame(
x1 = c(1,2,2,3,2,2,3,3,2,3,
1,2,2,3,2,2,3,3,2,3,
1,2,2,3,2,2,3,3,2,3),
x2 = c(1,1,1,2,3,3,2,3,3,3,
1,1,1,2,3,3,2,3,3,3,
1,2,2,3,2,2,3,3,2,3),
x3 = c(1,1,2,1,2,3,3,3,2,3,
1,1,2,1,2,3,3,3,2,3,
1,2,2,3,2,2,3,3,2,3)
)
# function to column-wise generate missing values (MCAR)
create_missings <- function(data, prob) {
x <- replicate(ncol(data),rbinom(nrow(data), 1, prob))
for(k in 1:ncol(data)) {
data[, k] <- ifelse(x[, k] == 1, NA, data[,k])
}
data
}
df <- create_missings(df, prob = 0.2)
# multiple imputation ----------------------------------
library(mice)
imp <- mice(df, m = 10, maxit = 20)
# extract the completed data in long format
implong <- complete(imp, 'long')
We need a function to compute Cronbach's alpha and obtain an estimate of the standard error of alpha, which can be used in a call to pool.scalar() later on. Since we do not have a convenient analytical formula for the standard error of alpha, we use a bootstrap procedure to estimate it.
The function cronbach_fun() takes the following arguments:
list_compl_data: a data frame holding one completed data set from a mids object.
boot: a logical indicating whether a nonparametric bootstrap should be conducted.
B: an integer specifying the number of bootstrap samples to be taken.
ci: a logical indicating whether a confidence interval around alpha should be estimated (requires boot = TRUE).
cronbach_fun <- function(list_compl_data, boot = TRUE, B = 1e4, ci = FALSE) {
n <- nrow(list_compl_data); p <- ncol(list_compl_data)
total_variance <- var(rowSums(list_compl_data))
item_variance <- sum(apply(list_compl_data, 2, sd)^2)
alpha <- (p/(p - 1)) * (1 - (item_variance/total_variance))
out <- list(alpha = alpha)
boot_alpha <- numeric(B)
if (boot) {
for (i in seq_len(B)) {
boot_dat <- list_compl_data[sample(seq_len(n), replace = TRUE), ]
total_variance <- var(rowSums(boot_dat))
item_variance <- sum(apply(boot_dat, 2, sd)^2)
boot_alpha[i] <- (p/(p - 1)) * (1 - (item_variance/total_variance))
}
out$var <- var(boot_alpha)
}
if (ci){
out$ci <- quantile(boot_alpha, c(.025,.975))
}
return(out)
}
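To sanity-check the point estimate, the variance form of alpha used in the function can be compared with the equivalent covariance-matrix form, since the variance of the row sums equals the sum of all entries of the item covariance matrix. The small `items` data frame below is made up for illustration and has no missing values.

```r
# Hypothetical complete item responses (no missing values)
items <- data.frame(
  a = c(1, 2, 2, 3, 2, 3, 1, 3),
  b = c(1, 1, 2, 3, 3, 3, 1, 2),
  c = c(1, 2, 2, 3, 2, 3, 2, 3)
)
p <- ncol(items)

# Variance form, as in cronbach_fun()
alpha_var <- (p / (p - 1)) * (1 - sum(apply(items, 2, var)) / var(rowSums(items)))

# Covariance-matrix form: item variances on the diagonal, total variance = sum of all entries
S <- cov(items)
alpha_cov <- (p / (p - 1)) * (1 - sum(diag(S)) / sum(S))

all.equal(alpha_var, alpha_cov)   # both forms should agree
```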
Now that we have our function to do the 'heavy lifting', we can run it on all m completed data sets, after which we can obtain Q and U (which are required for the pooling of the estimates). Consult ?pool.scalar for more information.
m <- length(unique(implong$.imp))
boot_alpha <- rep(list(NA), m)
for (i in seq_len(m)) {
set.seed(i) # fix random number generator
sub <- implong[implong$.imp == i, -c(1,2)]
boot_alpha[[i]] <- cronbach_fun(sub)
}
# obtain Q and U (see ?pool.scalar)
Q <- sapply(boot_alpha, function(x) x$alpha)
U <- sapply(boot_alpha, function(x) x$var)
# pooled estimates
pool_estimates <- function(x) {
out <- c(
alpha = x$qbar,
lwr = x$qbar - qt(0.975, x$df) * sqrt(x$t),
upr = x$qbar + qt(0.975, x$df) * sqrt(x$t)
)
return(out)
}
Output
# Pooled estimate of alpha (95% CI)
> pool_estimates(pool.scalar(Q, U))
alpha lwr upr
0.7809977 0.5776041 0.9843913
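For readers who want to see what the pooling step does, here is a minimal base-R sketch of Rubin's rules for a scalar estimate. `rubin_pool`, `Q_demo`, and `U_demo` are hypothetical names and values; the degrees of freedom use the classic large-sample formula, so they may differ from what pool.scalar() reports if it applies a small-sample adjustment.

```r
# Rubin's rules for a scalar: Q = per-imputation estimates, U = their variances
rubin_pool <- function(Q, U) {
  m    <- length(Q)
  qbar <- mean(Q)                    # pooled point estimate
  ubar <- mean(U)                    # within-imputation variance
  b    <- var(Q)                     # between-imputation variance
  t    <- ubar + (1 + 1 / m) * b     # total variance
  r    <- (1 + 1 / m) * b / ubar     # relative increase in variance due to missingness
  df   <- (m - 1) * (1 + 1 / r)^2    # classic Rubin degrees of freedom
  list(qbar = qbar, ubar = ubar, b = b, t = t, df = df)
}

Q_demo <- c(0.78, 0.80, 0.76, 0.79, 0.77)        # made-up alpha estimates
U_demo <- c(0.010, 0.011, 0.009, 0.010, 0.012)   # made-up bootstrap variances
pooled <- rubin_pool(Q_demo, U_demo)
pooled$qbar                                       # 0.78
pooled$t                                          # 0.0107
```

The total variance is always at least as large as the average within-imputation variance, which is why the pooled interval is wider than any interval computed from a single completed data set.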
I have an array of outputs from hundreds of segmented linear models (made using the segmented package in R). I want to be able to use these outputs on new data, using the predict function. To be clear, I do not have the segmented linear model objects in my workspace; I just saved and reimported the relevant outputs (e.g. the coefficients and breakpoints). For this reason I can't simply use the predict.segmented function from the segmented package.
Below is a toy example based on this link that seems promising, but does not match the output of the predict.segmented function.
library(segmented)
set.seed(12)
xx <- 1:100
zz <- runif(100)
yy <- 2 + 1.5*pmax(xx-35,0) - 1.5*pmax(xx-70,0) +
15*pmax(zz-0.5,0) + rnorm(100,0,2)
dati <- data.frame(x=xx,y=yy,z=zz)
out.lm<-lm(y~x,data=dati)
# S3 method for class 'lm':
o <- segmented(out.lm,seg.Z=~x,psi=list(x=c(30,60)),
control=seg.control(display=FALSE))
# Note that coefficients with U in the name are differences in slopes, not slopes.
# Compare:
slope(o)
coef(o)[2] + coef(o)[3]
coef(o)[2] + coef(o)[3] + coef(o)[4]
# prediction
pred <- data.frame(x = 1:100)
pred$dummy1 <- pmax(pred$x - o$psi[1,2], 0)
pred$dummy2 <- pmax(pred$x - o$psi[2,2], 0)
pred$dummy3 <- I(pred$x > o$psi[1,2]) * (coef(o)[2] + coef(o)[3])
pred$dummy4 <- I(pred$x > o$psi[2,2]) * (coef(o)[2] + coef(o)[3] + coef(o)[4])
names(pred)[-1]<- names(model.frame(o))[-c(1,2)]
# compute the prediction, using standard predict function
# computing confidence intervals further
# suppose that the breakpoints are fixed
pred <- data.frame(pred, predict(o, newdata= pred,
interval="confidence"))
# Try prediction using predict.segmented to compare
test <- predict.segmented(o)
plot(pred$fit, test, ylim = c(0, 100))
abline(0,1, col = "red")
# At least one segment not being predicted correctly?
Can I use the base R predict() function (not the predict.segmented() function) with the coefficients and breakpoints saved from segmented linear models?
UPDATE
I figured out that the code above has issues (don't use it). By reverse engineering the predict.segmented() function, I produced the design matrix and used it to predict values instead of calling predict() directly. I do not consider this a full answer to the original question yet, because predict() can also produce confidence intervals for the prediction, and I have not implemented that. The question is still open for someone to add confidence intervals.
library(segmented)
## Define function for making matrix of dummy variables (this is based on code from predict.segmented())
dummy.matrix <- function(x.values, x_names, psi.est = TRUE, nameU, nameV, diffSlope, est.psi) {
# This function creates a model matrix with dummy variables for a segmented lm with two breakpoints.
# Inputs:
# x.values: the x values of the segmented lm
# x_names: the name of the column of x values
# psi.est: this is legacy from the predict.segmented function, leave it set to 'TRUE'
# obj: not an argument of this function; in the examples below it stands for the fitted segmented lm object
# nameU: names (class character) of 3rd and 4th coef, which are "U1.x" "U2.x" for lm with two breaks. Example: names(c(obj$coef[3], obj$coef[4]))
# nameV: names (class character) of 5th and 6th coef, which are "psi1.x" "psi2.x" for lm with two breaks. Example: names(c(obj$coef[5], obj$coef[6]))
# diffSlope: the coefficients (class numeric) with the slope differences; called U1.x and U2.x for lm with two breaks. Example: c(o$coef[3], o$coef[4])
# est.psi: the estimated break points (class numeric); these are the estimated breakpoints from segmented.lm. Example: c(obj$psi[1,2], obj$psi[2,2])
#
n <- length(x.values)
k <- length(est.psi)
PSI <- matrix(rep(est.psi, rep(n, k)), ncol = k)
newZ <- matrix(x.values, nrow = n, ncol = k, byrow = FALSE)
dummy1 <- pmax(newZ - PSI, 0)
if (psi.est) {
V <- ifelse(newZ > PSI, -1, 0)
dummy2 <- if (k == 1)
V * diffSlope
else V %*% diag(diffSlope)
newd <- cbind(x.values, dummy1, dummy2)
colnames(newd) <- c(x_names, nameU, nameV)
} else {
newd <- cbind(x.values, dummy1)
colnames(newd) <- c(x_names, nameU)
}
# if (!x_names %in% names(coef(obj.seg)))
# newd <- newd[, -1, drop = FALSE]
return(newd)
}
## Test dummy matrix function----------------------------------------------
set.seed(12)
xx<-1:100
zz<-runif(100)
yy<-2+1.5*pmax(xx-35,0)-1.5*pmax(xx-70,0)+15*pmax(zz-.5,0)+rnorm(100,0,2)
dati<-data.frame(x=xx,y=yy,z=zz)
out.lm<-lm(y~x,data=dati)
#1 segmented variable, 2 breakpoints: you have to specify starting values (vector) for psi:
o<-segmented(out.lm,seg.Z=~x,psi=c(30,60),
control=seg.control(display=FALSE))
slope(o)
plot.segmented(o)
summary(o)
# Test dummy matrix fn with the same dataset
newdata <- dati
nameU1 <- c("U1.x", "U2.x")
nameV1 <- c("psi1.x", "psi2.x")
diffSlope1 <- c(o$coef[3], o$coef[4])
est.psi1 <- c(o$psi[1,2], o$psi[2,2])
test <- dummy.matrix(x.values = newdata$x, x_names = "x", psi.est = TRUE,
nameU = nameU1, nameV = nameV1, diffSlope = diffSlope1, est.psi = est.psi1)
# Predict response variable using matrix multiplication
col1 <- matrix(1, nrow = dim(test)[1])
test <- cbind(col1, test) # Now test is the same as model.matrix(o)
predY <- coef(o) %*% t(test)
plot(predY[1,])
lines(predict.segmented(o), col = "blue") # good, predict.segmented gives same answer
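If only point predictions are needed, the fitted line can also be rebuilt directly from the saved numbers, without any model object: the intercept, the first slope, the slope differences (the "U" coefficients), and the breakpoints. The sketch below assumes that the coefficients on the psi terms are negligible at convergence, so they are left out; `predict_piecewise` is a hypothetical helper and the numbers are made up. It does not produce confidence intervals.

```r
# Hypothetical helper: predict from saved pieces of a segmented fit.
#   b0, b1 : intercept and slope of the first segment
#   U      : slope differences at each breakpoint (the "U" coefficients)
#   psi    : estimated breakpoints
predict_piecewise <- function(x, b0, b1, U, psi) {
  hinge <- sapply(psi, function(p) pmax(x - p, 0))   # one hinge term per breakpoint
  hinge <- matrix(hinge, nrow = length(x))           # keep matrix shape for any lengths
  as.vector(b0 + b1 * x + hinge %*% U)
}

# Flat at 2 until x = 35, slope 1.5 until x = 70, flat again afterwards
predict_piecewise(c(35, 45, 80), b0 = 2, b1 = 0, U = c(1.5, -1.5), psi = c(35, 70))
# 2.0 17.0 54.5
```

With saved output from a real fit, `b0` and `b1` would be the first two coefficients, `U` the coefficients named U1.x, U2.x, and so on, and `psi` the estimated breakpoints.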
I have built a logistic regression model and now I want to calculate the predictive accuracy for various cutoff values ranging from 0 to 1. This is the for loop I have been using, but I am getting the error
subscript out of bounds
Here predtrain contains 300 predicted probabilities, each between 0 and 1. Each value is compared to the cutoff eff[i]. Finally, a table (confusion matrix) has to be generated comparing the original values (train$CAN) with f1. Something like this:
tab
# pred2
# 0 1
# 0 1 93
# 1 0 206
The code I have written is this:
predtrain <- predict(logreg1, newdata = train, type = 'response')
eff<-seq(0,1,by = 0.05)
for (i in 1:length(eff)) {
f1 <- ifelse(predtrain > eff[i], 1, 0)
t1 <- table(train$CAN, f1)
effy <- (t1[1,1]+t1[2,2])/(t1[1,1]+t1[1,2]+t1[2,2]+t1[2,1])
eff[[i]] <-effy
}
The reason you're getting subscript out of bounds errors is that cutoffs like 0 and 1 make every prediction positive or every prediction negative. The resulting confusion matrix has a single column, so an index like t1[2,2] is out of bounds.
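A tiny made-up example shows the collapse, along with one possible way to keep the table at 2 x 2 by fixing the factor levels of the predictions. The vectors `actual` and `probs` are hypothetical stand-ins for train$CAN and predtrain.

```r
actual <- c(0, 1, 1, 0, 1)
probs  <- c(0.2, 0.9, 0.6, 0.4, 0.7)    # hypothetical predicted probabilities

pred_at_0 <- ifelse(probs > 0, 1, 0)     # cutoff 0: every prediction is 1
collapsed <- table(actual, pred_at_0)
dim(collapsed)                           # 2 rows but only 1 column

# Fixing the levels keeps both columns even when one class is never predicted
fixed <- table(actual, factor(pred_at_0, levels = c(0, 1)))
dim(fixed)                               # 2 rows, 2 columns
```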
In reality all you're trying to do is to compute the predictive accuracy at different cutoffs, which can be accomplished without creating tables at all with something like:
cutoffs <- seq(0, 1, by=0.05)
eff <- sapply(cutoffs, function(cutoff) {
sum((predtrain > cutoff) == train$CAN) / length(predtrain)
})
To see this in action, let's consider a small example model:
set.seed(144)
x <- runif(100)
train <- data.frame(x, CAN=as.numeric(runif(100)+x >= 1))
logreg1 <- glm(CAN~x, data=train, family="binomial")
predtrain <- predict(logreg1, newdata = train, type = 'response')
Now we can get the predictive accuracy at each cutoff:
eff <- sapply(cutoffs, function(cutoff) {
sum((predtrain > cutoff) == train$CAN) / length(predtrain)
})
plot(cutoffs, eff)
You could alternately use a package like the ROCR package to grab metrics. For instance, here is how you could grab the sensitivity at each cutoff:
library(ROCR)
pred <- prediction(predtrain, train$CAN)
perf <- performance(pred, "sens")
eff <- sapply(cutoffs, function(cutoff) max(perf@y.values[[1]][perf@x.values[[1]] >= cutoff]))
plot(cutoffs, eff)
But doesn't it become more difficult to calculate something like specificity and sensitivity? I have written this using two for loops. I know it is not very efficient, but I do get the table from which I can calculate performance measures. Can this method be improved?
z <- seq(0,1,by = 0.05)
t1 <- vector(mode = "list", length = length(z))
for(i in 1:length(z)) {
predtrain <- predict(logreg1, newdata = train, type = 'response')
for(j in 1:length(predtrain)){
predtrain[j] <- ifelse(predtrain[j]>z[i], 1, 0)
}
t1[[i]] <- table(train$CAN, predtrain)
}
t1
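One possible improvement, sketched here under the assumption that only sensitivity and specificity are needed, is to skip the tables and count the relevant cases directly for each cutoff. The vectors `actual` and `probs` are simulated stand-ins for train$CAN and predtrain.

```r
# Simulated stand-ins for train$CAN and predtrain
set.seed(1)
actual <- rbinom(200, 1, 0.5)
probs  <- plogis(2 * actual - 1 + rnorm(200))

cutoffs <- seq(0, 1, by = 0.05)
metrics <- t(sapply(cutoffs, function(cutoff) {
  pred <- probs > cutoff                      # logical vector of predicted positives
  c(cutoff      = cutoff,
    sensitivity = sum(pred & actual == 1) / sum(actual == 1),
    specificity = sum(!pred & actual == 0) / sum(actual == 0))
}))
head(metrics)
```

Each row of `metrics` corresponds to one cutoff, and no indexing into a possibly degenerate table is involved, so the extreme cutoffs 0 and 1 work as well.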
I am trying to build a rolling regression function based on the example here, but in addition to returning the predicted values, I would like to return some rolling model diagnostics (i.e. coefficients, t-values, and maybe R^2). I would like the results to be returned in discrete objects based on the type of result. The example provided in the link above successfully creates the rolling predictions, but I need some assistance packaging and writing out the rolling model diagnostics:
In the end, I would like the function to return the following objects (R^2 is optional):
Predictions
Coefficients
T values
R^2
Below is the code:
require(zoo)
require(dynlm)
## Create Some Dummy Data
set.seed(12345)
x <- rnorm(mean=3,sd=2,100)
y <- rep(NA,100)
y[1] <- x[1]
for(i in 2:100) y[i]=1+x[i-1]+0.5*y[i-1]+rnorm(1,0,0.5)
int <- 1:100
dummydata <- data.frame(int=int,x=x,y=y)
zoodata <- as.zoo(dummydata)
rolling.regression <- function(series) {
mod <- dynlm(formula = y ~ L(y) + L(x), data = as.zoo(series)) # get model
nextOb <- max(series[,'int'])+1 # To get the first row that follows the window
if (nextOb<=nrow(zoodata)) { # You won't predict the last one
# 1) Make Predictions
predicted <- predict(mod,newdata=data.frame(x=zoodata[nextOb,'x'],y=zoodata[nextOb,'y']))
attributes(predicted) <- NULL
c(predicted=predicted,square.res=(predicted-zoodata[nextOb,'y'])^2)
# 2) Extract coefficients
#coefficients <- coef(mod)
# 3) Extract rolling coefficient t values
#tvalues <- ????(mod)
# 4) Extract rolling R^2
#rsq <-
}
}
rolling.window <- 20
results.z <- rollapply(zoodata, width=rolling.window, FUN=rolling.regression, by.column=F, align='right')
So after figuring out how to extract t-values from the model (i.e. mod), what do I need to do to make the function return three (3) separate objects (i.e. Predictions, Coefficients, and T-values)?
I am fairly new to R, really new to functions, and extremely new to zoo, and I'm stuck.
Any assistance would be greatly appreciated.
I hope I understood you correctly; here is a small edit of your function:
rolling.regression <- function(series) {
mod <- dynlm(formula = y ~ L(y) + L(x), data = as.zoo(series)) # get model
nextOb <- max(series[,'int'])+1 # To get the first row that follows the window
if (nextOb<=nrow(zoodata)) { # You won't predict the last one
# 1) Make Predictions
predicted=predict(mod,newdata=data.frame(x=zoodata[nextOb,'x'],y=zoodata[nextOb,'y']))
attributes(predicted)<-NULL
#Solution 1; Quicker to write
# c(predicted=predicted,
# square.res=(predicted-zoodata[nextOb,'y'])^2,
# summary(mod)$coef[, 1],
# summary(mod)$coef[, 3],
# AdjR = summary(mod)$adj.r.squared)
#Solution 2; Get column names right
c(predicted=predicted,
square.res=(predicted-zoodata[nextOb,'y'])^2,
coef_intercept = summary(mod)$coef[1, 1],
coef_Ly = summary(mod)$coef[2, 1],
coef_Lx = summary(mod)$coef[3, 1],
tValue_intercept = summary(mod)$coef[1, 3],
tValue_Ly = summary(mod)$coef[2, 3],
tValue_Lx = summary(mod)$coef[3, 3],
AdjR = summary(mod)$adj.r.squared)
}
}
rolling.window <- 20
results.z <- rollapply(zoodata, width=rolling.window, FUN=rolling.regression, by.column=F, align='right')
head(results.z)
predicted square.res coef_intercept coef_Ly coef_Lx tValue_intercept tValue_Ly tValue_Lx AdjR
20 10.849344 0.721452 0.26596465 0.5798046 1.049594 0.38309211 7.977627 13.59831 0.9140886
21 12.978791 2.713053 0.26262820 0.5796883 1.039882 0.37741499 7.993014 13.80632 0.9190757
22 9.814676 11.719999 0.08050796 0.5964808 1.073941 0.12523824 8.888657 15.01353 0.9340732
23 5.616781 15.013297 0.05084124 0.5984748 1.077133 0.08964998 9.881614 16.48967 0.9509550
24 3.763645 6.976454 0.26466039 0.5788949 1.068493 0.51810115 11.558724 17.22875 0.9542983
25 9.433157 31.772658 0.38577698 0.5812665 1.034862 0.70969330 10.728395 16.88175 0.9511061
To see how it works, make a small example with a regression:
x <- rnorm(1000); y <- 2*x + rnorm(1000)
reg <- lm(y ~ x)
summary(reg)$coef
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.02694322 0.03035502 0.8876033 0.374968
x 1.97572544 0.03177346 62.1816310 0.000000
As you can see, calling summary first and then getting the coefficients of it (coef(summary(reg)) works as well) gives you a table with estimates, standard errors, and t-values. So estimates are saved in column 1 of that table, t-values in column 3. And that's how I obtain them in the updated rolling.regression function.
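Indexing the table by column name instead of position is a little more robust and makes the intent clearer. A short sketch with simulated data (`x_demo`, `y_demo`, and `fit_demo` are names chosen for illustration), which also confirms that each t-value is the estimate divided by its standard error:

```r
set.seed(42)
x_demo <- rnorm(100)
y_demo <- 2 * x_demo + rnorm(100)
fit_demo <- lm(y_demo ~ x_demo)

tab <- coef(summary(fit_demo))       # matrix: one row per coefficient
estimates <- tab[, "Estimate"]
t_values  <- tab[, "t value"]

# t value = estimate / standard error
all.equal(t_values, estimates / tab[, "Std. Error"])
```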
EDIT
I updated my solution; now it also contains the adjusted R2. If you just want the ordinary R2, use summary(mod)$r.squared instead of summary(mod)$adj.r.squared.
EDIT 2
A quick and dirty hack to name the columns:
rolling.regression <- function(series) {
mod <- dynlm(formula = y ~ L(y) + L(x), data = as.zoo(series)) # get model
nextOb <- max(series[,'int'])+1 # To get the first row that follows the window
if (nextOb<=nrow(zoodata)) { # You won't predict the last one
# 1) Make Predictions
predicted=predict(mod,newdata=data.frame(x=zoodata[nextOb,'x'],y=zoodata[nextOb,'y']))
attributes(predicted)<-NULL
#Get variable names
strVar <- c("Intercept", paste0("L", 1:(nrow(summary(mod)$coef)-1)))
vec <- c(predicted=predicted,
square.res=(predicted-zoodata[nextOb,'y'])^2,
AdjR = summary(mod)$adj.r.squared,
summary(mod)$coef[, 1],
summary(mod)$coef[, 3])
names(vec)[4:length(vec)] <- c(paste0("Coef_", strVar), paste0("tValue_", strVar))
vec
}
}
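Since rollapply() returns everything as one object with named columns, separate objects can be obtained afterwards by splitting on the column names. The sketch below uses a plain matrix, `res`, as a hypothetical stand-in for results.z; the same column indexing is assumed to carry over to the zoo object.

```r
# Hypothetical stand-in for results.z: one row per window, named columns
res <- matrix(1:18, nrow = 2, byrow = TRUE,
              dimnames = list(c("20", "21"),
                              c("predicted", "square.res",
                                "coef_intercept", "coef_Ly", "coef_Lx",
                                "tValue_intercept", "tValue_Ly", "tValue_Lx",
                                "AdjR")))
cols <- colnames(res)

# Split by column-name prefix; ignore.case also covers the "Coef_" naming variant
out <- list(
  predictions  = res[, c("predicted", "square.res"), drop = FALSE],
  coefficients = res[, grepl("^coef_", cols, ignore.case = TRUE), drop = FALSE],
  t_values     = res[, grepl("^tValue_", cols), drop = FALSE],
  r_squared    = res[, "AdjR", drop = FALSE]
)
str(out)
```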