I have a series of data that I'm going to use clustering on, and I want to see how this data clusters over time.
So essentially everyone starts in a single group, as they have done nothing, but over time as they do different things they will be put into different groups based on their behavior, and I want to track this.
I've been looking for a way to do this in R (with some preprocessing of data in Python), and represent it graphically. The only way I can currently think of doing this is breaking the time period into, say, 3 weeks, and then clustering each of the 3 weeks. The only problem with this is I don't really know how to track movements of people between clusters over those 3 weeks (e.g. to see if someone's actions move them from group A to group B). I could put it in a table, but it would be nice to somehow show it graphically (like red lines between clusters over time or something).
Any ideas on how to do this would be much appreciated, or if there is a good way to track clusters over time that I've been missing please point me towards it.
I've used the Mfuzz package in R for clustering time-course microarray data sets. Mfuzz uses "soft clustering": basically, individuals can appear in more than one group. Here is an example with some simulated data:
library(Mfuzz)
tps = 6;cases = 90
d = rpois(tps*cases, 1) ##Poisson distribution with mean 1
m = matrix(d, ncol=tps, nrow=cases)
##First 30 individuals have increasing trends
m[1:30,] = t(apply(m[1:30,], 1, cumsum))
##Next 30 have decreasing trends
##A bit hacky, sorry
m[31:60,] = t(apply(t(apply(m[31:60,], 1, cumsum)), 1, rev))
##Last 30 individuals have random numbers from a Po(1)
##Create an expressionSet object
tmp_expr = new('ExpressionSet', exprs=m)
##Specify c=3 clusters
cl = mfuzz(tmp_expr, c=3, m=1.25)
mfuzz.plot(tmp_expr,cl=cl, mfrow=c(2, 2))
This gives a plot with one panel per cluster (figure not shown).
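For the per-week approach described in the question, one way to see movements between groups is a transition table. The following is a minimal sketch in base R, not part of the Mfuzz example: the simulated data, the use of k-means, and k = 3 are all assumptions made for illustration. Week 2 is initialised from the week 1 centres so that cluster labels stay roughly comparable between weeks.

```r
set.seed(42)
n <- 60

# Hypothetical behaviour features for the same n people in two consecutive weeks
week1 <- matrix(rnorm(n * 2), ncol = 2)
week2 <- week1 + matrix(rnorm(n * 2, sd = 0.5), ncol = 2)

# Cluster week 1 (k = 3 is an arbitrary choice)
fit1 <- kmeans(week1, centers = 3, nstart = 10)

# Start week 2 from the week 1 centres so label 1 in week 2
# corresponds (approximately) to label 1 in week 1
fit2 <- kmeans(week2, centers = fit1$centers)

# Rows: cluster in week 1; columns: cluster in week 2
transitions <- table(week1 = fit1$cluster, week2 = fit2$cluster)
print(transitions)
```

Off-diagonal counts are people who changed group between the two weeks; the same table could be the input for an alluvial or Sankey-style plot.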
I'm using the dtwclust package in R for multivariate time series clustering. Here's my code.
library(dtwclust)
data("uciCT")
mvc <- tsclust(CharTrajMV, k = 4L, distance = "gak", seed = 390L)
plot(mvc)
The CharTrajMV data set has 100 observations with 3 variables. As I understand, clusters are determined based on 3 variables as opposed to univariate time series clustering.
Each cluster graph shows several similarly patterned time series (observations) belonging to that cluster. How is this graph drawn? There are 3 time series variables used for clustering, so how does a single pattern graph come out? I mean the input is a 3-dimensional (3-variable) dataset, but the output is 1-dimensional.
Moreover, I can get the 3 variables' centroids for each cluster (using mvc@centroids)
plot(mvc, labels = list(nudge_x = -10, nudge_y = 1), type="centroids")
This code shows only one centroid for each cluster. Can I get centroid graphs for all 3 variables for each cluster with a plot option? Or is this the right approach?
This is covered in the documentation. Plotting so many different series in separate panes would get very congested, so, for multivariate plots, the variables are appended one after the other, and you get vertical dotted lines to see the place where that happened, maybe injecting some missing values in some places to account for differences in length. This does mean the x axis isn't so meaningful anymore, but it's only meant to be a quick visualization aid.
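To illustrate the idea of appending variables one after the other, here is a small base-R sketch. It is not the package's plotting code; the simulated series and the names `series`, `flat`, and `breaks` are made up for the example.

```r
set.seed(1)

# Hypothetical multivariate series: 50 time points, 3 variables
series <- cbind(x = cumsum(rnorm(50)),
                y = cumsum(rnorm(50)),
                z = cumsum(rnorm(50)))

# Append the variables one after the other into a single long vector
flat <- as.vector(series)

# Positions where one variable ends and the next begins
breaks <- nrow(series) * seq_len(ncol(series) - 1)

plot(flat, type = "l", xlab = "appended index", ylab = "value")
abline(v = breaks + 0.5, lty = 3)  # dotted separators between variables
```

The x axis here runs from 1 to 150 and no longer represents time, which is why such a plot is only a quick visual aid.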
I am trying to perform DBSCAN clustering on the data https://www.kaggle.com/arjunbhasin2013/ccdata. I have cleaned the data and applied the algorithm.
data1 <- read.csv('C:\\Users\\write\\Documents\\R\\data\\Project\\Clustering\\CC GENERAL.csv')
head(data1)
data1 <- data1[,2:18]
dim(data1)
colnames(data1)
head(data1,2)
#to check if data has empty col or rows
library(purrr)
is_empty(data1)
#to check if data has duplicates
library(dplyr)
any(duplicated(data1))
#to check if data has NA values
any(is.na(data1))
data1 <- na.omit(data1)
any(is.na(data1))
dim(data1)
Algorithm was applied as follows.
#DBSCAN
data1 <- scale(data1)
library(fpc)
library(dbscan)
set.seed(500)
#to find optimal eps
kNNdistplot(data1, k = 34)
abline(h = 4, lty = 3)
The figure shows the 'knee' to identify the 'eps' value. Since there are 17 attributes to be considered for clustering, I have taken k=17*2 =34.
db <- dbscan(data1,eps = 4,minPts = 34)
db
The result I obtained is "The clustering contains 1 cluster(s) and 147 noise points."
No matter what values I choose for eps and minPts, the result is the same.
Can anyone tell where I have gone wrong?
Thanks in advance.
You have two options:
Increase the radius around each point (given by the eps parameter)
Decrease the minimum number of points (minPts) needed to define a core point.
I would start by decreasing minPts, since I think it is very high: when too few points fall within the eps radius, no further points are grouped together into clusters.
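To see how the two parameters interact, it can help to count how many points qualify as core points for a few settings. This is a rough base-R sketch on made-up data (not the credit card data set); `count_core` is a hypothetical helper, and it only counts core points rather than running DBSCAN itself.

```r
set.seed(500)

# Hypothetical stand-in for the scaled data: 200 points, 4 variables
x <- scale(matrix(rnorm(200 * 4), ncol = 4))
d <- as.matrix(dist(x))

# A point is a core point if at least minPts points (itself included)
# lie within distance eps of it
count_core <- function(d, eps, minPts) sum(rowSums(d <= eps) >= minPts)

grid <- expand.grid(eps = c(0.5, 1, 2), minPts = c(5, 10, 34))
grid$core_points <- mapply(function(e, m) count_core(d, e, m),
                           grid$eps, grid$minPts)
print(grid)
```

Settings with zero core points label everything as noise, while settings where nearly every point is a core point tend to merge everything into one cluster.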
A typical problem with using DBSCAN (and clustering in general) is that real data typically does not fall into nice clusters, but forms one connected point cloud. In this case, DBSCAN will always find only a single cluster. You can check this with several methods. The most direct method would be to use a pairs plot (a scatterplot matrix):
plot(as.data.frame(data1))
Since you have many variables, the scatterplot panels are very small, but you can see that the points are very close together in almost all panels. DBSCAN will connect all points in these dense areas into a single cluster. k-means will just partition the dense area.
Another option is to check for clusterability with methods like VAT or iVAT (https://link.springer.com/chapter/10.1007/978-3-642-13657-3_5).
library("seriation")
## calculate distances for a small sample
d <- dist(data1[sample(seq(nrow(data1)), size = 1000), ])
iVAT(d)
You will see that the plot shows no block structure around the diagonal, indicating that clustering will not find much.
To improve clustering, you need to work on the data. You can remove irrelevant variables, and you may have very skewed variables that should be transformed first. You could also try a non-linear embedding before clustering.
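As a small illustration of transforming a skewed variable before scaling, here is a base-R sketch. The variable `spend` is simulated and the `skewness` helper is hypothetical; whether a log transform suits the real columns would need to be checked on the actual data.

```r
set.seed(1)

# Hypothetical right-skewed variable (e.g. an amount spent)
spend <- rlnorm(500, meanlog = 5, sdlog = 1.5)

# Simple moment-based skewness, defined here for the example
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

# log1p() handles zeros safely and pulls in the long right tail
spend_log <- log1p(spend)

print(c(raw = skewness(spend), log1p = skewness(spend_log)))

# Scale after transforming, so the tail no longer dominates the distances
spend_scaled <- scale(spend_log)
```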
I have a made up dataset of polling stations in Wales and I've attached a date column to it. We can imagine this date is the date this polling station was visited to check the facilities (for example).
What I'd like to work out is:
whether geographic points are within a certain distance of each other
(this I've managed by self-joining and using st_buffer and st_within to find points within 1000 m, and then counting the number of neighbours)
and also the interval between the sample dates
(this I'm having a bit of a problem with)
What I'd like to do, I think, is
for each polling station
calculate the number of neighbours (so far so easy)
for each neighbour determine the interval between the sampling dates
return a spatial object (for plotting in tmap probably)
Here's some test code that I've got that generates the sf dataset, calculates the number of neighbours and returns that.
It's really the date interval that's stumping me. It's not so much the calculation of the date interval but it's the way to generate these clusters of polling stations with date intervals.
Is it better to generate the (in this case) 147 polling station clusters?
What I'm trying to do in my larger dataset is calculate clusters of points over time.
I have ~2000 records with a date. I'd like to say :
for each of these 2000 records calculate the number of neighbours within a distance and within a timeframe.
I think it's probably better to
calculate each cluster of neighbouring points and visualise
then
remove neighbours from the cluster that are outside of the time frame and visualise that
Although, on typing this, I wonder if excluding points that didn't fall within a timeframe first and then calculating neighbours would be more efficient?
library(sf)
library(dplyr)
polls <- st_as_sf(read.csv(url("https://www.caerphilly.gov.uk/CaerphillyDocs/FOI/Datasets_polling_stations_csv.aspx")),
  coords = c("Easting","Northing"), crs = 27700) %>%
  mutate(date = sample(seq(as.Date('2020/01/01'), as.Date('2020/05/31'), by="day"), 147))
## self-join against 1000 m buffers, drop self-matches, count neighbours per station
test_stack <- polls %>% st_join(polls %>% st_buffer(dist=1000), join=st_within) %>%
  filter(Ballot.Box.Polling.Station.x != Ballot.Box.Polling.Station.y) %>%
  add_count(Ballot.Box.Polling.Station.x) %>%
  rename(number_of_neighbours = n) %>%
  mutate(interval_date = date.x - date.y) %>%
  subset(select = -c(6:8,10,11,13:18))
## to summarise the data so that only number of neighbours is returned,
## append %>% to the line above and uncomment the next two lines
## distinct(Ballot.Box.Polling.Station.x, number_of_neighbours, date.x) %>%
## filter(number_of_neighbours >= 2)
I think it might be as simple as
tm_shape(test_stack)+tm_dots(col = "number_of_neighbours", clustering =T, size = 0.5)
I'm not sure how clustering works in leaflet, but that works quite nicely on this test data.
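The "neighbours within a distance and within a timeframe" count can also be sketched with plain distance matrices. This is a base-R illustration on simulated points, not the polling station data; the 1000 m radius comes from the question, while the 14-day window and all column names are assumptions. It presumes projected coordinates in metres and builds n-by-n matrices, which is manageable for around 2000 records.

```r
set.seed(1)
n <- 50

# Hypothetical projected coordinates (metres) and visit dates
pts <- data.frame(
  easting  = runif(n, 0, 5000),
  northing = runif(n, 0, 5000),
  date     = sample(seq(as.Date("2020-01-01"), as.Date("2020-05-31"),
                        by = "day"), n)
)

# Pairwise spatial distance (m) and absolute gap between dates (days)
dist_m   <- as.matrix(dist(pts[, c("easting", "northing")]))
gap_days <- abs(outer(as.numeric(pts$date), as.numeric(pts$date), "-"))

near_space <- dist_m <= 1000
near_both  <- near_space & gap_days <= 14   # 14 days is an assumed window

# A point is not its own neighbour
diag(near_space) <- FALSE
diag(near_both)  <- FALSE

pts$neighbours        <- rowSums(near_space)
pts$neighbours_in_14d <- rowSums(near_both)
head(pts)
```

Because both conditions are combined with a logical AND, the order of filtering (time first or distance first) does not change the result, only the amount of work done.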
I have about 50 datasets that include all trades within a timeframe of 30 days for about 10 pairs on 5 exchanges. All pairs are of the same asset class, meaning they are strongly correlated and are expected to have similar properties, but are on different scales. An example of this data would be
set.seed(1)
n <- 1000
dates <- seq(as.POSIXct("2019-08-05 00:00:00", tz="UTC"), as.POSIXct("2019-08-05 23:59:00", tz="UTC"), by="1 min")
x <- data.frame("t" = sort(sample(dates, n)),"p" = cumsum(sample(c(-1, 1), n, TRUE)))
Roughly, I need to identify the relevant local minima and maxima, which happen daily. The yellow marks are my points of interest. Unlike this example, there is usually only one such point per day and I consider each day separately. However, it is hard to filter out noise from my actual points of interest.
My actual goal is to find the exact point, at which the pair started to make a jump and the exact point, at which the jump is over. This needs to be as accurate as possible, as I want to observe which asset moved first and which asset followed at which point in time (as said, they are highly correlated).
Between two extreme values, I want to minimize the distance and maximize the relative/absolute change, as my points of interest are usually close to each other and their difference is quite large.
I already looked at other questions like
Finding local maxima and minima and Algorithm to locate local maxima and also this algorithm that has the same goal. However, my dataset is extremely noisy. I already reduced the dataset to 5-minute intervals; however, this has led to omitting the relevant points in the functions to identify local minima & maxima. Therefore, this was not a good solution given my goal.
How can I achieve my goal with a reasonably accurate algorithm? Manually skimming through all the time series is not an option, since this would require me to evaluate 50 * 30 time series manually, which is too time-consuming. I'm really puzzled and have been trying to find a suitable solution for a week.
If more code snippets are demanded, I'm happy to share, however they didn't give me meaningful results, which would be opposed to the idea of providing a minimum working example, therefore I decided to leave them out for now.
EDIT:
First off, I updated the plot and added timestamps to the dataset to give you an idea (the actual resolution). Ideally, the algorithm would detect both jumps on the left. The inner two dots because they're closer together and jump without interruption, and the outer dots because they're more extreme in values. In fact, this maybe answers the question whether the algorithm is allowed to look into the future. Yes, if there's another local extremum in the range of, say, 30 observations (or 30 minutes), then ignore the intermediate local extrema.
In my data, jumps have ranged from 2% to about 15%, so a jump needs to be at least 2% to be considered, and only if a threshold of 15 (this might be adaptable) consecutive steps in the same direction before / after the peaks and valleys is reached.
A very naive approach was to subset the data around the global minimum and maximum of a day. In most cases, this has denoised data and worked as an indicator. However, this is not robust when the global extrema are not in the range of the jump.
Hope this clarifies why this isn't a statistical question (there are some tests to determine whether a jump has happened, but not for jump arrival time afaik).
In case anyone wants a real example:
this is a corresponding graph, this is the raw data of the relevant period and this is the reduced dataset.
Perhaps as a starting point, look at function streaks in package PMwR (which I maintain). A streak is defined as a move of a specified size that is uninterrupted by a countermove of the same size. The function works with returns, not differences, so I add 100 to your data.
For instance:
library(PMwR)
set.seed(1)
n <- 1000
x <- 100 + cumsum(sample(c(-1, 1), n, TRUE))
plot(x, type = "l")
s <- streaks(x, up = 0.12, down = -0.12)
abline(v = s[, 1])
abline(v = s[, 2])
The vertical lines show the starts and ends of streaks.
Perhaps you can then filter the identified streaks by required criteria such as length. Or you may play around with different thresholds for up and down moves (though this is not really recommended in the current implementation, but perhaps the results are good enough). For instance, up streaks might look as follows. A green vertical line shows the start of a streak; a red line shows its end.
plot(x, type = "l")
s <- streaks(x, up = 0.12, down = -0.05)
s <- s[!is.na(s$state) & s$state == "up", ]
abline(v = s[, 1], col = "green")
abline(v = s[, 2], col = "red")
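As a rough base-R complement to filtering streaks by length, runs of consecutive steps in the same direction can be found with rle(). This sketch is not how streaks works internally; `min_run = 5` is an arbitrary stand-in for the 15-step threshold mentioned in the question, and the names `long_runs`, `starts`, and `ends` are made up here.

```r
set.seed(1)
x <- 100 + cumsum(sample(c(-1, 1), 1000, TRUE))

# Direction of each step: +1 for up, -1 for down
steps <- sign(diff(x))

# Run-length encode the directions
runs   <- rle(steps)
ends   <- cumsum(runs$lengths)        # last step index of each run
starts <- ends - runs$lengths + 1     # first step index of each run

min_run <- 5                          # assumed threshold for illustration
keep <- runs$lengths >= min_run

# Step i goes from x[i] to x[i + 1], so a run over steps s..e
# spans observations s..(e + 1)
long_runs <- data.frame(start     = starts[keep],
                        end       = ends[keep] + 1,
                        direction = runs$values[keep])
head(long_runs)
```

The start and end columns are positions in x, so they can be passed to abline(v = ...) in the same way as the streak boundaries above.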
I'm working with some data that has a few major outliers, mostly due to the technology used to capture the data. I removed these to normalize the data; however, for the nature of the work, I've been asked to visualize every participant's results in a series of graphs in order to compare performances. I'm a little new to R, so while the normalization wasn't difficult, I'm a little stumped as to how I might go about re-introducing these outliers to the scale of the normalized data. Is there a way to scale outliers to previously normalized data (mean=0) without skewing the data?
EDIT: I realize I left a lot of info out (still new to asking questions here), so here's an example of what my process looks like right now:
#example data of 20 participants, 18 of which are normal-range and 2 of which
#are outliers in a data frame
time <- rnorm (18, mean = 30, sd = 10)
distance <- rnorm(18, mean = 100, sd = 20)
time <- c(time, 2, 100)
distance <- c(distance, 30, 1000)
df <- data.frame(time, distance)
The outliers were mostly known due to the nature of the data collection, so I removed them:
dfClean <- df[-c(19, 20),]
I then plotted the data to check for normality (step skipped here because the data was generated to be normal).
From there, I normalized the columns in the data set so that each variable would have a mean of 0 and an SD of 1 so they could be plotted together. The goal is to use this as a "normal" range to be able to visualize spread and outliers in future data (accent on visualization).
#using package clusterSim; type "n1" is standardization, (x - mean) / sd
library(clusterSim)
dfNorm <- data.Normalization(dfClean, type="n1", normalization = "column")
The problem is, I'm not sure how to scale outliers to this range afterwards...or if I'm even understanding the scale function correctly. So, how do I plot all the subjects in the original df, including outliers, on a normalized mean=0 scale?
I am not sure whether external links are welcome as answers on Stack Overflow.
Still, you can refer to the link below to resolve your problem:
https://www.r-bloggers.com/identify-describe-plot-and-remove-the-outliers-from-the-dataset/
I used this many times and found it useful.
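One way to put the outliers on the same scale, sketched here under the assumption that the "normal" range is defined by the cleaned data, is to compute the mean and standard deviation from dfClean and apply them to every row of df with base R's scale(). The names `centre`, `spread`, and `dfAll` are made up for this example.

```r
set.seed(1)

# Example data from the question: 18 normal-range participants + 2 outliers
time     <- c(rnorm(18, mean = 30,  sd = 10), 2, 100)
distance <- c(rnorm(18, mean = 100, sd = 20), 30, 1000)
df       <- data.frame(time, distance)
dfClean  <- df[-c(19, 20), ]

# Reference parameters come from the cleaned data only
centre <- colMeans(dfClean)
spread <- apply(dfClean, 2, sd)

# Apply the same shift and scale to all rows, outliers included
dfAll <- as.data.frame(scale(df, center = centre, scale = spread))

# Rows 1-18 now have mean 0 and SD 1; rows 19-20 show how many
# "normal" SDs each outlier sits from the normal-range mean
plot(dfAll$time, dfAll$distance,
     col = ifelse(seq_len(nrow(dfAll)) > 18, "red", "black"))
```

Because the outliers do not contribute to `centre` or `spread`, adding them back does not shift or stretch the normal-range points.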