I need some guidance regarding how changepoints work in time series. I am trying to detect some changepoints using R, and the package called "changepoint" (https://cran.r-project.org/web/packages/changepoint/changepoint.pdf).
There are options for how to detect when the variance (cpt.var) and the mean (cpt.mean) change, but what I'm trying to look for is when the time series changes trend.
Maybe I'm confused with what changepoints really are, but is there any way to get this information?
I am showing the result of using cpt.var() function, and I have added some arrows, showing what I would like to achieve.
Is there any way to achieve this? I guess should be somehow like inflection points...
I would appreciate any light on this.
Thanks beforehand,
Jon
EDIT
I have tried with the approach of using diff(), but is not detecting the change correctly:
The data I am using is the following:
[1] 10.695 10.715 10.700 10.665 10.830 10.830 10.800 11.070 11.145 11.270 11.015 11.060 10.945 10.965 10.780 10.735 10.705 10.680 10.600 10.335 10.220 10.125
[23] 10.370 10.595 10.680 11.000 10.980 11.065 11.060 11.355 11.445 11.415 11.350 11.310 11.330 11.360 11.445 11.335 11.275 11.300 11.295 11.470 11.445 11.325
[45] 11.300 11.260 11.200 11.210 11.230 11.240 11.300 11.250 11.285 11.215 11.260 11.395 11.410 11.235 11.320 11.475 11.470 11.685 11.740 11.740 11.700 11.905
[67] 11.720 12.230 12.285 12.505 12.410 11.995 12.110 12.005 11.915 11.890 11.820 11.730 11.700 11.660 11.685 11.615 11.360 11.425 11.185 11.275 11.265 11.375
[89] 11.310 11.250 11.050 10.880 10.775 10.775 10.805 10.755 10.595 10.700 10.585 10.510 10.290 10.255 10.395 10.290 10.425 10.405 10.365 10.010 10.305 10.185
[111] 10.400 10.700 10.725 10.875 10.750 10.760 10.905 10.680 10.670 10.895 10.790 10.990 10.925 10.980 10.975 11.035 10.895 10.985 11.035 11.295 11.245 11.535
[133] 11.510 11.430 11.450 11.390 11.520 11.585
And when I do diff() I get this data:
[1] 0.020 -0.015 -0.035 0.165 0.000 -0.030 0.270 0.075 0.125 -0.255 0.045 -0.115 0.020 -0.185 -0.045 -0.030 -0.025 -0.080 -0.265 -0.115 -0.095 0.245
[23] 0.225 0.085 0.320 -0.020 0.085 -0.005 0.295 0.090 -0.030 -0.065 -0.040 0.020 0.030 0.085 -0.110 -0.060 0.025 -0.005 0.175 -0.025 -0.120 -0.025
[45] -0.040 -0.060 0.010 0.020 0.010 0.060 -0.050 0.035 -0.070 0.045 0.135 0.015 -0.175 0.085 0.155 -0.005 0.215 0.055 0.000 -0.040 0.205 -0.185
[67] 0.510 0.055 0.220 -0.095 -0.415 0.115 -0.105 -0.090 -0.025 -0.070 -0.090 -0.030 -0.040 0.025 -0.070 -0.255 0.065 -0.240 0.090 -0.010 0.110 -0.065
[89] -0.060 -0.200 -0.170 -0.105 0.000 0.030 -0.050 -0.160 0.105 -0.115 -0.075 -0.220 -0.035 0.140 -0.105 0.135 -0.020 -0.040 -0.355 0.295 -0.120 0.215
[111] 0.300 0.025 0.150 -0.125 0.010 0.145 -0.225 -0.010 0.225 -0.105 0.200 -0.065 0.055 -0.005 0.060 -0.140 0.090 0.050 0.260 -0.050 0.290 -0.025
[133] -0.080 0.020 -0.060 0.130 0.065
What I get is the next results:
> cpt =cpt.mean(diff(vector), method="PELT")
> (cpt.pts <- attributes(cpt)$cpts)
[1] 137
Appearly does not make sense... Any clue?
In R, there are many packages available for time series changepoint detection. changepoint is definitely a very useful one. A partial list of the packages is summarized in CRAN Task View:
Change point detection is provided in strucchange (using linear regression models), and in trend (using nonparametric tests). The changepoint package provides many popular changepoint methods, and ecp does nonparametric changepoint detection for univariate and multivariate series. changepoint.np implements the nonparametric PELT algorithm, while changepoint.mv detects changepoints in multivariate time series. InspectChangepoint uses sparse projection to estimate changepoints in high-dimensional time series. robcp provides robust change-point detection using Huberized cusum tests, and Rbeast provides Bayesian change-point detection and time series decomposition.
Here is also a great blog comparing several alternative packages: https://www.marinedatascience.co/blog/2019/09/28/comparison-of-change-point-detection-methods/. Another impressive comparison is from Dr. Jonas Kristoffer Lindeløv who developed the mcp package: https://lindeloev.github.io/mcp/articles/packages.html.
Below I used your sample time series to generate some quick results using the Rbeast package developed by myself (chosen here apparently for ego of self-promoting as well as perceived relvance). Rbeast is a Baysian changepoint detection algorithm and it can estimate the probability of changepoint occurrence. It can also be used for decomposing time series into seasonality and trend, but apparently, your time series is trend-only, so in the beast function below, season='none' is specified.
y = c(10.695,10.715,10.700,10.665,10.830,10.830,10.800,11.070,11.145,11.270,11.015,11.060,10.945,10.965,10.780,10.735,10.705,
10.680,10.600,10.335,10.220,10.125,10.370,10.595,10.680,11.000,10.980,11.065,11.060,11.355,11.445,11.415,11.350,11.310,11.330,
11.360,11.445,11.335,11.275,11.300,11.295,11.470,11.445,11.325,11.300,11.260,11.200,11.210,11.230,11.240,11.300,11.250,11.285,
11.215,11.260,11.395,11.410,11.235,11.320,11.475,11.470,11.685,11.740,11.740,11.700,11.905,11.720,12.230,12.285,12.505,12.410,
11.995,12.110,12.005,11.915,11.890,11.820,11.730,11.700,11.660,11.685,11.615,11.360,11.425,11.185,11.275,11.265,11.375,11.310,
11.250,11.050,10.880,10.775,10.775,10.805,10.755,10.595,10.700,10.585,10.510,10.290,10.255,10.395,10.290,10.425,10.405,10.365,
10.010,10.305,10.185,10.400,10.700,10.725,10.875,10.750,10.760,10.905,10.680,10.670,10.895,10.790,10.990,10.925,10.980,10.975,
11.035,10.895,10.985,11.035,11.295,11.245,11.535 ,11.510,11.430,11.450,11.390,11.520,11.585)
library(Rbeast)
out=beast(y, season='none')
plot(out)
print(out)
In the figure above, dashed vertical lines mark the most likely locations of changepoints; the green curve of Pr(tcp) shows the point-wise probability of changepoint occurrence over time. The order_t curve gives the estimated mean order of the piecewise polynomials needed to adequately fit the trend (the 0-th order is constant and the 1st order is linear): An average order toward 0 means that the trend is more likely to be flat and an order close to 1 means that the trend is linear. The output can be also printed as some ascii outputs, as shown below. Again, it says that the time series is most likely to have 8 changepoints; their most probable locations are given in out$trend$cp.
Result for time series #1 (total number of time series in 'out': 1)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+ SEASONAL CHANGEPOINTS +
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
No seasonal/periodic component present (i.e., season='none')
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+ TREND CHANGEPOINTS +
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
An ascii plot of the probability dist for number of chgpts(ncp)
---------------------------------------------------------------
Pr(ncp=0 )=0.000|* |
Pr(ncp=1 )=0.000|* |
Pr(ncp=2 )=0.000|* |
Pr(ncp=3 )=0.000|* |
Pr(ncp=4 )=0.000|* |
Pr(ncp=5 )=0.000|* |
Pr(ncp=6 )=0.055|***** |
Pr(ncp=7 )=0.074|****** |
Pr(ncp=8 )=0.575|******************************************** |
Pr(ncp=9 )=0.240|******************* |
Pr(ncp=10)=0.056|***** |
---------------------------------------------------------------
Max ncp : 10 | A parameter you set (e.g., maxTrendKnotNum) |
Mode ncp: 8 | Pr(ncp= 8)=0.57; there is a 57.5% probability|
| that the trend componet has 8 chngept(s). |
Avg ncp : 8.17 | Sum[ncp*Pr(ncp)] |
---------------------------------------------------------------
List of most probable trend changepoints (avg number of changpts: 8.17)
--------------------------------.
tcp# |time (cp) |prob(cpPr)|
-----|---------------|----------|
1 |8.0000 | 0.92767|
2 |112.0000 | 0.91433|
3 |68.0000 | 0.84213|
4 |21.0000 | 0.80188|
5 |32.0000 | 0.78171|
6 |130.0000 | 0.76938|
7 |101.0000 | 0.66404|
8 |62.0000 | 0.61171|
--------------------------------'
If the signal isn't too noisy, you could use diff to detect changepoints in slope instead of mean:
library(changepoint)
set.seed(1)
slope <- rep(sample(10,10)-5,sample(100,10))
sig <- cumsum(slope)+runif(n=length(slope),min = -1, max = 1)
cpt =cpt.mean(diff(sig),method="PELT")
# Show change point
(cpt.pts <- attributes(cpt)$cpts)
#> [1] 58 109 206 312 367 440 447 520 599
plot(sig,type="l")
lines(x=cpt.pts,y=sig[cpt.pts],type="p",col="red",cex=2)
Another option which seems to work better with the data you provided is to use piecewise linear segmentation:
library(ifultools)
changepoints <- linearSegmentation(x=1:length(data),y=data,angle.tolerance = 90,n.fit=10,plot=T)
changepoints
#[1] 13 24 36 58 72 106
I am currently trying to obtain equivalent results with the proc princomp command in SAS and the princomp() command in R (in the stats package). The results I am getting are very similar, leading me to suspect that this isn't a problem with different options settings in the two commands. However, the outpus are also different enough that the component scores for each data row are notably different. They are also sign-reversed, but this doesn't matter, of course.
The end goal of this analysis is to produce a set of coefficients from the PCA to score data outside the PCA routine (i.e. a formula that can be applied to new datasets to easily produce scored data).
Without posting all my data, I'm hoping someone can provide some information on how these two commands may differ in their calculations. I don't know enough about the PCA math to determine if this is a conceptual difference in the processes or just something like an internal rounding difference. For simplicity, I'll post the eigenvectors for PC1 and PC2 only.
In SAS:
proc princomp data=climate out=pc_out outstat=pc_outstat;
var MAT MWMT MCMT logMAP logMSP CMI cmiJJA DD_5 NFFD;
run;
returns
Eigenvectors
Prin1 Prin2 Prin3 Prin4 Prin5 Prin6 Prin7 Prin8 Prin9
MAT 0.372 0.257 -.035 -.033 -.106 0.270 -.036 0.216 -.811
MWMT 0.381 0.077 0.160 -.261 0.627 0.137 -.054 0.497 0.302
MCMT 0.341 0.324 -.229 0.046 -.544 0.421 0.045 0.059 0.493
logMAP -.184 0.609 -.311 -.357 -.041 -.548 0.183 0.183 0.000
logMSP -.205 0.506 0.747 -.137 -.040 0.159 -.156 -.266 0.033
CMI -.336 0.287 -.451 0.096 0.486 0.499 0.050 -.318 -.031
cmiJJA -.365 0.179 0.112 0.688 -.019 0.012 0.015 0.588 0.018
DD_5 0.379 0.142 0.173 0.368 0.183 -.173 0.725 -.282 0.007
NFFD 0.363 0.242 -.136 0.402 0.158 -.351 -.637 -.264 0.052
In R:
PCA.model <- princomp(climate[,c("MAT","MWMT","MCMT","logMAP","logMSP","CMI","cmiJJA","DD.5","NFFD")], scores=T, cor=T)
PCA.model$loadings
returns
Eigenvectors
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
MAT -0.372 -0.269 0.126 -0.250 0.270 0.789
MWMT -0.387 -0.171 0.675 0.494 -0.325
MCMT -0.339 -0.332 0.250 0.164 -0.500 -0.414 -0.510
logMAP 0.174 -0.604 0.309 0.252 0.619 -0.213 0.125
logMSP 0.202 -0.501 -0.727 0.223 -0.162 0.175 -0.268
CMI 0.334 -0.293 0.459 -0.222 0.471 -0.495 -0.271
cmiJJA 0.365 -0.199 -0.174 -0.612 -0.247 0.590
DD.5 -0.382 -0.143 -0.186 -0.421 -0.695 -0.360
NFFD -0.368 -0.227 -0.487 0.309 0.655 -0.205
As you can see, the values are similar (sign reversed), but not identical. The differences matter in the scored data, the first row of which looks like this:
Prin1 Prin2 Prin3 Prin4 Prin5 Prin6 Prin7 Prin8 Prin9
SAS -1.95 1.68 -0.54 0.72 -1.07 0.10 -0.66 -0.02 0.05
R 1.61 -1.99 0.52 -0.42 -1.13 -0.16 0.79 0.12 -0.09
If I use a GLM (in SAS) or lm() (in R) to calculate the coefficients from the scored data, I get very similar numbers (inverse sign), with the exception of the intercept. Like so:
in SAS:
proc glm order=data data=pc_out;
model Prin1 = MAT MWMT MCMT logMAP logMSP CMI cmiJJA DD_5 NFFD;
run;
in R:
scored <- cbind(PCA.model$scores, climate)
pca.lm <- lm(Comp.1~MAT+MWMT+MCMT+logMAP+logMSP+CMI+cmiJJA+DD.5+NFFD, data=scored)
returns
Coefficients:
(Int) MAT MWMT MCMT logMAP logMSP CMI cmiJJA DD.5 NFFD
SAS 0.42 0.04 0.06 0.03 -0.65 -0.69 -0.003 -0.01 0.0002 0.004
R -0.59 -0.04 -0.06 -0.03 0.62 0.68 0.004 0.02 -0.0002 -0.004
So it would seem that the model intercept is changing the value in the scored data. Any thoughts on why this happens (why the intercept is different) would be appreciated.
Thanks again to all those who commented. Embarrassingly, the differences I found between the SAS proc princomp and R princomp() procedures was actually a product of a data error that I made. Sorry to those who took time to help answer.
But rather than let this question go to waste, I will offer what I found to be statistically equivalent procedures for SAS and R when running a principal component analysis (PCA).
The following procedures are statistically equivalent, with data named 'mydata' and variables named 'Var1', 'Var2', and 'Var3'.
In SAS:
* Run the PCA on your data;
proc princomp data=mydata out=pc_out outstat=pc_outstat;
var Var1 Var2 Var3;
run;
* Use GLM on the individual components to obtain the coefficients to calculate the PCA scoring;
proc glm order=data data=pc_out;
model Prin1 = Var1 Var2 Var3;
run;
In R:
PCA.model <- princomp(mydata[,c("Var1","Var2","Var3")], scores=T, cor=T)
scored <- predict(PCA.model, mydata)
scored <- cbind(PCA.model$scores, mydata)
lm(Comp.1~Var1+Var2+Var3, data=scored)