I have a single time series in which you can clearly see a process change (denoted by the manually drawn lines). I am trying to detect and cluster these changes so that I can be notified when a new cluster is beginning. I have already attempted K-means and agglomerative clustering; they do a decent job, but they do not seem to cluster based on time, only on the value. I expect to have 6 clusters in the time series. You can see that the algorithms typically ignore time.
I have googled a lot and discovered DTW; however, every article I read compares multiple time series instead of detecting changes within a single time series.
Does anyone have any references I can read up on this or have any solutions?
I am unable to provide actual data however here is some example data that you can use:
library(tidyverse)

example_data <- tibble(
  date_seq = 1:300,
  value = c(
    rnorm(65, .1, .1),
    rnorm(65, -.25, .1),
    rnorm(20, 4, .25),
    rnorm(80, -.25, .1),
    rnorm(20, 4, .25),
    rnorm(50, 0, .1)
  )
)
Thank you!
I needed to solve a problem similar to yours. However, I used a Markov regime-switching model to identify the regime-change moments instead of opting for a clustering method.
Here are good articles about it:
RPubs by Majeed Simaan: https://rpubs.com/simaan84/regime_switching
R-bloggers by Andrej Pivcevic: https://www.r-bloggers.com/2019/02/switching-regressions-cluster-time-series-data-and-understand-your-development/
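To make the idea concrete, here is a minimal sketch using the depmixS4 package (my choice for illustration here; the linked articles use other implementations). It fits a Gaussian hidden Markov model with three states (roughly one per distinct level in the example data) and marks the time points where the most likely state changes, which should recover the segments:

library(depmixS4)  ## hidden Markov / regime-switching models

set.seed(1)
mod <- depmix(value ~ 1, data = example_data, nstates = 3, family = gaussian())
fm  <- fit(mod)

## most likely state at each time point; a change of state marks a regime change
states <- posterior(fm)$state

plot(example_data$date_seq, example_data$value, type = "l")
abline(v = example_data$date_seq[which(diff(states) != 0) + 1], lty = 2)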
I have about 50 datasets that include all trades within a time frame of 30 days for about 10 pairs on 5 exchanges. All pairs are of the same asset class, meaning they are strongly correlated and expected to have similar properties, but they are on different scales. An example of this data would be
set.seed(1)
n <- 1000
dates <- seq(as.POSIXct("2019-08-05 00:00:00", tz="UTC"), as.POSIXct("2019-08-05 23:59:00", tz="UTC"), by="1 min")
x <- data.frame(t = sort(sample(dates, n)), p = cumsum(sample(c(-1, 1), n, TRUE)))
Roughly, I need to identify the relevant local minima and maxima, which happen daily. The yellow marks are my points of interest. Unlike this example, there is usually only one such point per day and I consider each day separately. However, it is hard to filter out noise from my actual points of interest.
My actual goal is to find the exact point at which the pair started to make a jump and the exact point at which the jump is over. This needs to be as accurate as possible, because I want to observe which asset moved first and which asset followed at which point in time (as said, they are highly correlated).
Between two extreme values, I want to minimize the distance and maximize the relative/absolute change, as my points of interest are usually close to each other and their difference is quite large.
I already looked at other questions like Finding local maxima and minima and Algorithm to locate local maxima, and also this algorithm that has the same goal. However, my dataset is extremely noisy. I already reduced the dataset to 5-minute intervals; however, this caused the functions for identifying local minima and maxima to miss the relevant points, so it was not a good solution given my goal.
How can I achieve my goal with a reasonably accurate algorithm? Manually skimming through all the time series is not an option, since this would require me to evaluate 50 * 30 time series by hand, which is too time-consuming. I'm really puzzled and have been trying to find a suitable solution for a week.
If more code snippets are needed, I'm happy to share them; however, they didn't give me meaningful results, which would go against the idea of providing a minimal working example, so I have left them out for now.
EDIT:
First off, I updated the plot and added timestamps to the dataset to give you an idea of the actual resolution. Ideally, the algorithm would detect both jumps on the left: the inner two dots because they're closer together and the jump between them is uninterrupted, and the outer dots because they're more extreme in value. In fact, this perhaps answers the question of whether the algorithm is allowed to look into the future: yes, if there's another local extremum within a range of, say, 30 observations (or 30 minutes), then ignore the intermediate local extremum.
In my data, jumps have ranged from 2% to roughly 15%, so a move needs to be at least 2% to be considered a jump, and only if a threshold of 15 consecutive steps (this might be adaptable) in the same direction before/after the peaks and valleys is reached.
A very naive approach was to subset the data around the global minimum and maximum of a day. In most cases, this denoised the data and worked as an indicator. However, it is not robust when the global extrema are not within the range of the jump.
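As a sketch of that naive subsetting (using the example x from above; the window width of 30 observations is my own assumption for illustration):

## keep only observations within a window around the day's global extrema
win   <- 30
i_min <- which.min(x$p)
i_max <- which.max(x$p)
keep  <- sort(unique(c(max(1, i_min - win):min(nrow(x), i_min + win),
                       max(1, i_max - win):min(nrow(x), i_max + win))))
x_sub <- x[keep, ]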
Hope this clarifies why this isn't a statistical question (there are some tests to determine whether a jump has happened, but, as far as I know, not for the jump arrival time).
In case anyone wants a real example:
this is a corresponding graph, this is the raw data of the relevant period and this is the reduced dataset.
Perhaps as a starting point, look at the function streaks in package PMwR (which I maintain). A streak is defined as a move of a specified size that is uninterrupted by a countermove of the same size. The function works with returns, not differences, so I add 100 to your data.
For instance:
library("PMwR")  ## provides streaks()

set.seed(1)
n <- 1000
x <- 100 + cumsum(sample(c(-1, 1), n, TRUE))
plot(x, type = "l")

s <- streaks(x, up = 0.12, down = -0.12)
abline(v = s[, 1])
abline(v = s[, 2])
The vertical lines show the starts and ends of streaks.
Perhaps you can then filter the identified streaks by required criteria, such as length (see the sketch after the next code example). Or you may play around with different thresholds for up and down moves (this is not really recommended in the current implementation, but perhaps the results are good enough). For instance, up streaks might look as follows; a green vertical line shows the start of a streak, and a red line shows its end.
plot(x, type = "l")
s <- streaks(x, up = 0.12, down = -0.05)
s <- s[!is.na(s$state) & s$state == "up", ]
abline(v = s[, 1], col = "green")
abline(v = s[, 2], col = "red")
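For the length filtering mentioned above, a minimal sketch could look like this (it reuses the start/end columns of the streaks result as in the examples above; the 30-observation threshold is just an illustration):

## keep only (up) streaks that span at least 30 observations
long <- s[(s[, 2] - s[, 1]) >= 30, ]
abline(v = long[, 1], col = "darkgreen")
abline(v = long[, 2], col = "darkred")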
I'm working with some data that has a few major outliers, mostly due to the technology used to capture the data. I removed these to normalize the data; however, for the nature of the work, I've been asked to visualize every participant's results in a series of graphs in order to compare performances. I'm a little new to R, so while the normalization wasn't difficult, I'm a little stumped as to how I might go about re-introducing these outliers to the scale of the normalized data. Is there a way to scale outliers to previously normalized data (mean=0) without skewing the data?
EDIT: I realize I left a lot of info out (still new to asking questions here), so here's an example of what my process looks like right now:
#example data of 20 participants, 18 of which are normal-range and 2 of which
#are outliers in a data frame
time <- rnorm(18, mean = 30, sd = 10)
distance <- rnorm(18, mean = 100, sd = 20)
time <- c(time, 2, 100)
distance <- c(distance, 30, 1000)
df <- data.frame(time, distance)
The outliers were mostly known due to the nature of the data collection, so I removed them:
dfClean <- df[-c(19, 20),]
And plotted the data to check for normality afterwards (step skipped here because the data was generated to be normal).
From there, I normalized the columns in the data set so that each variable would have a mean of 0 and an sd of 1, so they could be plotted together. The goal is to use this as a "normal" range to be able to visualize spread and outliers in future data (the emphasis is on visualization).
#using package clusterSim
library(clusterSim)
dfNorm <- data.Normalization(dfClean, type = "n13", normalization = "column")
The problem is, I'm not sure how to scale outliers to this range afterwards...or if I'm even understanding the scale function correctly. So, how do I plot all the subjects in the original df, including outliers, on a normalized mean=0 scale?
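One sketch of how this could work, assuming the plain mean-0 / sd-1 standardization described above (estimate the parameters on dfClean only, then apply them to the full df so the outliers land on the same scale; the same idea applies if you keep clusterSim's n13 normalization, just with that normalization's parameters):

## parameters estimated from the cleaned data only
centers <- colMeans(dfClean)
spreads <- apply(dfClean, 2, sd)

## standardize ALL participants (outliers included) with those parameters
dfAllNorm <- as.data.frame(scale(df, center = centers, scale = spreads))

## the 18 clean participants now have mean ~0 and sd ~1; the outliers keep their extremeness
plot(dfAllNorm$time, dfAllNorm$distance)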
I am not sure whether we can provide external links to solve a Stack Overflow issue. Still, you can refer to the link below to resolve your problem:
https://www.r-bloggers.com/identify-describe-plot-and-remove-the-outliers-from-the-dataset/
I have used it many times and found it useful.
So I'm trying to analyze a square region and cluster all of the points of certain groups together. I'm thinking that pdfCluster is the best way to go, since I need to measure the density of the points through a kernel density estimator to get the correct clusters, and then I need to actually group them together to create a plot (I have the long/lat of the points). I'm really stuck on this and any help would be greatly appreciated.
I’m running across an issue with my code while trying to use a Kernel Density Estimator to cluster points. I am working with my data in two different ways trying to find the most optimal. First, I have my data in the form of a matrix. An example of this is below, and I have my latitude and longitude in my code attached to the columns and rows in the matrix.
## build the matrix row by row (byrow = TRUE keeps each listed row as a row)
m <- matrix(c(
  8.83, 8.89, 8.81, 8.87, 8.90, 8.87,
  8.89, 8.94, 8.85, 8.94, 8.96, 8.92,
  8.84, 8.90, 8.82, 8.92, 8.93, 8.91,
  8.79, 8.85, 8.79, 8.90, 8.94, 8.92,
  8.79, 8.88, 8.81, 8.90, 8.95, 8.92,
  8.80, 8.82, 8.78, 8.91, 8.94, 8.92,
  8.75, 8.78, 8.77, 8.91, 8.95, 8.92,
  8.80, 8.80, 8.77, 8.91, 8.95, 8.94,
  8.74, 8.81, 8.76, 8.93, 8.98, 8.99,
  8.89, 8.99, 8.92, 9.10, 9.13, 9.11,
  8.97, 8.97, 8.91, 9.09, 9.11, 9.11,
  9.04, 9.08, 9.05, 9.25, 9.28, 9.27,
  9.00, 9.01, 9.00, 9.20, 9.23, 9.20,
  8.99, 8.99, 8.98, 9.18, 9.20, 9.19,
  8.93, 8.97, 8.97, 9.18, 9.20, 9.18
), nrow = 15, ncol = 6, byrow = TRUE)
I also have my data in a data table where column 1 is my latitude, column 2 is my longitude, and column 3 is the value.
## build the table row by row: each row is (lat, long, value)
z <- matrix(c(
  8.83, 8.89,  2,
  8.89, 8.94,  4,
  8.84, 8.90,  1,
  8.79, 8.852, 4,
  8.79, 8.88,  5,
  8.80, 8.82,  2,
  8.75, 8.78,  1,
  8.80, 8.80,  2,
  8.74, 8.81,  7,
  8.89, 8.99,  1,
  8.97, 8.97,  6,
  9.04, 9.08,  8,
  9.00, 9.01,  1,
  8.99, 8.99,  8,
  8.93, 8.97,  2
), ncol = 3, byrow = TRUE)
The actual data I am using is from larger rasters and shapefiles.
The raster is from http://beta.sedac.ciesin.columbia.edu/data/set/gpw-v4-population-count/data-download.
And the shapefiles are from http://www.gadm.org/download — I am using Nigeria.
The main question of this post is about clustering and the optimal data format for the clustering functions. I currently have all of the grid points of the entire country with their (lat, long, value). I want to run a kernel density estimator across all of the points and then cluster based on certain values. Looking at the pdfCluster package, it seems to do just that, except I'm not sure how to let it accept (lat/long) values and run across a geographic plane. Since my data is spread across a geographic area and isn't completely continuous, I'm running into errors. Any hints on how to adapt the pdfCluster package to accept such values, or on which data format is best to use, would be greatly appreciated.
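As a rough sketch of the kind of call pdfCluster expects (using the toy z table above; the value threshold is my own simplification, and with this few points the sketch only shows the shape of the workflow; the real raster grid provides enough points for the density estimate):

library(pdfCluster)

pts <- as.data.frame(z)
names(pts) <- c("lat", "long", "value")

## keep only the grid cells above some value of interest,
## then cluster their coordinates via the kernel density estimate
high <- pts[pts$value >= 4, c("long", "lat")]

cl <- pdfCluster(as.matrix(high))
plot(cl)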
I have a series of data that I'm going to use clustering on, and I want to see how this data clusters over time.
So essentially everyone starts in a single group, as they have done nothing, but over time as they do different things they will be put into different groups based on their behavior, and I want to track this.
I've been looking for a way to do this in R (with some preprocessing of the data in Python) and to represent it graphically. The only way I can currently think of doing this is breaking the time period into, say, 3 weeks and then clustering each of the 3 weeks separately. The only problem with this is that I don't really know how to track movements of people between clusters over those 3 weeks (e.g., to see if someone's actions move them from group A to group B). I could put it in a table, but it would be nice to somehow show it graphically (like red lines between clusters over time or something).
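To make the table idea concrete, here is a toy sketch (the label vectors are hypothetical stand-ins for the per-person cluster assignments from two consecutive windows):

## hypothetical per-person cluster assignments from two consecutive 3-week windows
set.seed(1)
labels_window1 <- sample(c("A", "B", "C"), 100, replace = TRUE)
labels_window2 <- sample(c("A", "B", "C"), 100, replace = TRUE)

## rows = cluster in window 1, columns = cluster in window 2;
## off-diagonal counts are the people who moved between groups
table(window1 = labels_window1, window2 = labels_window2)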
Any ideas on how to do this would be much appreciated, or if there is a good way to track clusters over time that I've been missing please point me towards it.
I've used the Mfuzz package in R for clustering time-course microarray data sets. Mfuzz uses "soft" clustering: basically, individuals can appear in more than one group. Here is an example with some simulated data:
library(Mfuzz)
tps = 6; cases = 90
d = rpois(tps*cases, 1) ##Poisson distribution with mean 1
m = matrix(d, ncol=tps, nrow=cases)
##First 30 individuals have increasing trends
m[1:30,] = t(apply(m[1:30,], 1, cumsum))
##Next 30 have decreasing trends
##A bit hacky, sorry
m[31:60,] = t(apply(t(apply(m[31:60,], 1, cumsum)), 1, rev))
##Last 30 individuals have random numbers from a Po(1)
##Create an expressionSet object
tmp_expr = new('ExpressionSet', exprs=m)
##Specify c=3 clusters
cl = mfuzz(tmp_expr, c=3, m=1.25)
mfuzz.plot(tmp_expr,cl=cl, mfrow=c(2, 2))
This gives the cluster plots produced by mfuzz.plot (figure omitted).
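To see where each individual sits, the fitted object also carries membership values (a sketch, assuming mfuzz returns the usual fuzzy c-means components cluster and membership):

head(cl$cluster)               ## hard assignment: the cluster with the highest membership
head(round(cl$membership, 2))  ## full membership matrix; each row sums to 1

## individuals whose strongest membership is below 0.7 sit "between" clusters
which(apply(cl$membership, 1, max) < 0.7)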
I'm using the fpc package in R to do a DBSCAN analysis of some very dense data (3 sets of 40,000 points, with values in the range -3 to 6).
I've found some clusters, and I need to graph just the significant ones. The problem is that I have a single cluster (the first) with about 39,000 points in it. I need to graph all other clusters but this one.
dbscan() returns a special data type that stores all of this cluster data. It's not indexed like a data frame would be (but maybe there is a way to represent it as such?).
I can graph the dbscan type using a basic plot() call. But, like I said, this will graph the irrelevant 39,000 points.
tl;dr:
how do I graph only specific clusters of a dbscan data type?
If you look at the help page (?dbscan) it is organized like all others into sections labeled Description, Usage, Arguments, Details and Value. The Value section describes what the function dbscan returns. In this case it is simply a list (a standard R data type) with a few components.
The cluster component is simply an integer vector whose length is equal to the number of rows in your data and that indicates which cluster each observation is a member of. So you can use this vector to subset your data, extract only those clusters you'd like, and then plot just those data points.
For example, if we use the first example from the help page:
library(fpc)

set.seed(665544)
n <- 600
x <- cbind(runif(10, 0, 10) + rnorm(n, sd = 0.2),
           runif(10, 0, 10) + rnorm(n, sd = 0.2))
ds <- dbscan(x, 0.2)
we can then use the result, ds, to plot only the points in clusters 1-3:
#Plot only clusters 1, 2 and 3
plot(x[ds$cluster %in% 1:3,])
Without knowing the specifics of dbscan, I can recommend that you look at the function smoothScatter. It is very useful for examining the main patterns in a scatterplot when you would otherwise have too many points to make sense of the data.
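For example, a minimal sketch that reuses the x and ds objects from the fpc example above (dropping the huge first cluster, as in the question):

## density-shaded scatterplot of everything outside cluster 1
smoothScatter(x[ds$cluster != 1, ])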
The probably most sensible way of plotting DBSCAN results is using alpha shapes, with the radius set to the epsilon value. Alpha shapes are closely related to convex hulls, but they are not necessarily convex. The alpha radius controls the amount of non-convexity allowed.
This is quite closely related to the DBSCAN cluster model of density connected objects, and as such will give you a useful interpretation of the set.
As I'm not using R, I don't know about the alpha shape capabilities of R. There supposedly is a package called alphahull, from a quick check on Google.
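As a rough illustration of that idea in R (my own sketch, reusing x and ds from the fpc example above; I have not verified the alphahull arguments beyond a quick look, so treat the plotting call as an assumption):

library(alphahull)

## alpha shape of one cluster, with alpha set to the DBSCAN eps
p2  <- x[ds$cluster == 2, ]
as2 <- ashape(p2, alpha = 0.2)

plot(x, col = "grey", pch = 20)          ## all points for context
plot(as2, add = TRUE, wpoints = FALSE)   ## overlay the cluster's alpha-shape boundary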