If I have 2 lists of time intervals:
List1:
1. 2010-06-06 to 2010-12-12
2. 2010-05-04 to 2010-11-02
3. 2010-02-04 to 2010-10-08
4. 2010-04-01 to 2010-08-02
5. 2010-01-03 to 2010-02-02
and
List2:
1. 2010-06-08 to 2010-12-14
2. 2010-04-04 to 2010-10-10
3. 2010-02-02 to 2010-12-16
What would be the best way to calculate some sort of correlation or similarity factor between the two lists?
Thanks!
Is that the extent of the data or just a sample to give an idea of the structure you have?
Just a few ideas about how to look at this... My apologies if any of it is redundant with where you already are on this data set.
Two basic ideas come to mind for comparing intervals like this: absolute or relative. A relative comparison would ignore absolute time and look for repeating structures or signatures that occur in both groups, but not necessarily at the same time. The absolute version would treat simultaneous events as what matters: it wouldn't help that something happens every week in both groups if the occurrences are separated by a year. You can probably make this distinction by knowing something about the origin of the data.
If this is the grand total of data available for your decision about associations, it will come down to some assumptions about what constitutes "correlation". For instance, if you have a specific model for what is going on - e.g. a time-to-start, time-to-stop (failure) model - you could evaluate the likelihood of observing one sequence given the other. However, without more example data it seems unlikely you'd be able to draw any firm conclusions.
The first intervals in the two groups are nearly identical, so they will contribute strongly to any correlation measure I can think of for the two groups. If there is a random model for this set, I would expect many such models to flag these two observations as "unlikely" just because of that.
One way to assess "similarity" would be to ask what portion of the time axis is covered (possibly generalized to multiple coverage) and compare the two groups on that basis.
Another possibility is to define a function that, for each day in the overall span of these events, adds one for every interval covering that day. That way you have a function over the whole time span with a rudimentary description of multiple events covering the same date. Calculating a correlation between the two groups might give you hints of structural similarity, but again you would need more groups of data to draw any conclusions.
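As a hedged sketch of the coverage ideas in the last two paragraphs, the snippet below (plain R, using the intervals from the question) builds a daily coverage count for each list and then compares the lists by correlation and by the fraction of the time axis covered; the variable names are just for illustration.

    ## intervals from the question
    list1 <- data.frame(
      start = as.Date(c("2010-06-06", "2010-05-04", "2010-02-04", "2010-04-01", "2010-01-03")),
      end   = as.Date(c("2010-12-12", "2010-11-02", "2010-10-08", "2010-08-02", "2010-02-02")))
    list2 <- data.frame(
      start = as.Date(c("2010-06-08", "2010-04-04", "2010-02-02")),
      end   = as.Date(c("2010-12-14", "2010-10-10", "2010-12-16")))

    days <- seq(as.Date("2010-01-01"), as.Date("2010-12-31"), by = "day")

    ## for each day, count how many intervals in 'ivls' cover it
    coverage <- function(ivls, days)
      colSums(outer(ivls$start, days, "<=") & outer(ivls$end, days, ">="))

    c1 <- coverage(list1, days)
    c2 <- coverage(list2, days)

    cor(c1, c2)                  # correlation of the two daily coverage profiles
    mean(c1 > 0); mean(c2 > 0)   # fraction of the time axis covered by each list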
Ok that was a little rambling. Good luck with your project!
You may try cross-correlation.
However, you should be aware that you have vector data (start, length), while the algorithms assume a functional dependency. Whether that is appropriate depends on the semantics of your data, which is not clear from the question.
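If you do go the cross-correlation route, one hedged way to apply it to interval data is to first turn each list into a daily coverage series (as in the sketch further up; c1 and c2 below refer to that sketch) and then cross-correlate those series.

    ## cross-correlation of the two daily coverage series from the earlier sketch
    ccf(c1, c2, lag.max = 60)   # peaks away from lag 0 would suggest a shifted relationship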
HTH!
A more useful link for your current problem is here.
I have a very specific dataset with 50 people. Each person has a response (sex) and ~2000 measurements of some biological stuff.
We have three independent replicates from each person, so 3 rows per person.
I can easily use caret and groupKFold() to keep each person in either training or test sets - that works fine.
Then I simply predict each replicate separately (so 3 predictions per person).
I want to use these three predictions together and make a combined prediction per person, using majority vote and/or some other scheme.
I.e., for each person I take the 3 predictions and predict the response to be the one with the most votes. That's pretty easy to do for the final model, but it should also be used in the tuning step (i.e. in the cross-validation that picks parameter values).
I think I can do that via summaryFunction=... when calling caret::trainControl(), but I would simply like to ask:
Is there a simpler way of doing this?
I have googled around, but I keep failing to find people with similar problems. And I really hope someone can point me in the right direction.
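For what it's worth, here is a minimal, untested sketch of the summaryFunction idea above; my_data, its person column, and the method = "rf" choice are all made-up placeholders, and it assumes data$rowIndex is passed to the summary function (which caret does during train's resampling).

    library(caret)

    ## person IDs in the same row order as the training data (placeholder names)
    person_id <- my_data$person

    majorityVoteSummary <- function(data, lev = NULL, model = NULL) {
      ## data$rowIndex maps held-out rows back to the original data,
      ## so we can recover which person each replicate belongs to
      ids   <- person_id[data$rowIndex]
      voted <- tapply(as.character(data$pred), ids,
                      function(p) names(which.max(table(p))))   # majority vote
      truth <- tapply(as.character(data$obs), ids, function(o) o[1])
      c(PersonAccuracy = mean(voted == truth))
    }

    folds <- groupKFold(my_data$person, k = 5)   # keep each person in one fold
    ctrl  <- trainControl(index = folds, summaryFunction = majorityVoteSummary)

    fit <- train(sex ~ . - person, data = my_data,
                 method = "rf", metric = "PersonAccuracy", trControl = ctrl)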
I have recently posted a "very new to R" question about the correct way of doing this; if you are interested, you can find it here.
I have now managed to develop a simple R script that does the job, but now the results are what troubles me.
Long story short, I'm using R to analyze lpp (linear point pattern) objects with mad.test. That function performs a hypothesis test where the null hypothesis is that the points are randomly distributed. Currently I have 88 lpps to analyze, and according to the p.value 86 of them are randomly distributed and 2 of them are not.
These are the two lpps that are not randomly distributed.
Looking at them you can see some kind of clustering in the first one, but the second one only has three points, and it seems to me that there is no way one can claim that only three points do not correspond to a random distribution. There are other tracks with one, two or three points, but they all fall into the "random" lpp category, so I don't know why this one is different.
So here is the question: how many points are too few for CSR testing?
I have also noticed that these two lpps have a much lower $statistic$rank than the others. I have tried to find out what that means but I'm clueless right now, so here is another newbie question: is $statistic$rank some kind of quality indicator, and can I therefore use it to split my lpp analyses into "significant" ones and "too few points" ones?
My R script and all the shp files can be downloaded from here (850 KB).
Thank you so much for your help.
It is impossible to give a universal answer to the question of how many points are needed for an analysis. Usually 0, 1 and 2 are too few for a standalone analysis. However, if they are part of repeated measurements of the same thing they might still be interesting. Also, I would normally say that your example with 3 points is too few to say anything interesting. However, an extreme example would be a single long line segment where one point occurs close to one end and two others occur close to each other at the other end. This is not very likely to happen under CSR, and you may be inclined not to believe that hypothesis. This appears to be what happened in your case.
Regarding your question about the rank, you might want to read up a bit more on the Monte Carlo test you are performing. Basically, you summarise the point pattern by a single number (the maximum absolute deviation of the linear K function) and then you look at how extreme this number is compared with numbers generated at random from CSR. Assuming you use 99 simulations of CSR, you have 100 numbers in total. If your data ranks as the most extreme ($statistic$rank==1) among these, it has p-value 1%. If it ranks as the 50th number, the p-value is 50%. If you use another number of simulations you have to calculate accordingly, i.e. with 199 simulations rank 1 is 0.5%, rank 2 is 1%, etc.
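As a small illustration of that rank/p-value relationship (on a simulated pattern, since I don't have your data in front of me):

    library(spatstat)

    ## a Poisson pattern on the built-in 'simplenet' network;
    ## replace X with one of your own lpp objects
    X   <- rpoislpp(5, simplenet)
    tst <- mad.test(X, linearK, nsim = 99)

    tst$statistic$rank   # rank of the observed MAD among the 1 + nsim values
    tst$p.value          # rank / (nsim + 1), e.g. rank 1 out of 100 gives 0.01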
There is a fundamental problem here with multiple testing. You are applying a hypothesis test 88 times. The test is (by default) designed to give a false positive in 5 percent (1 in 20) of applications, so if the null hypothesis is true, you should expect 88/20 = 4.4 false positives among your 88 tests. Getting only 2 positive ("non-random") results is therefore entirely consistent with the null hypothesis that ALL of the patterns are random. My conclusion is that the patterns are random.
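A tiny simulation makes the point: under the null, roughly 88/20 of 88 tests come out "significant" at the 5% level by chance alone, and if you want to guard against this you can adjust the p-values, e.g. with p.adjust.

    set.seed(1)
    pvals <- runif(88)               # stand-in for 88 p-values when CSR is true
    sum(pvals < 0.05)                # typically around 4-5 "significant" results by chance
    p.adjust(pvals, method = "holm") # one standard multiple-testing correction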
We have hourly time series data with 2 columns: one is the timestamp and the other is the error rate. We used an H2O deep-learning model to learn and predict the future error rate, but it looks like it requires at least 2 features (besides the timestamp) to create the model.
Is there any way H2O can learn this type of data (time, value), with only one feature, and predict the value for a future time?
Not in the current release of H2O, but ARIMA models are in development. You can follow the progress here.
Interesting question,
I have read about declaring additional variables that represent previous values of the time series, similar to the regression methodology of ARIMA models. But I'm not sure whether this is a valid way to do it, so please correct me if I am wrong.
Consequently you could try to extend your dataset to something like this:
    t  value(t)  value(t-1)  value(t-2)  value(t-3)  ...
    1     10         NA          NA          NA      ...
    2     14         10          NA          NA      ...
    3     27         14          10          NA      ...
    ...
After this, value(t) is your response (output neuron) and the others are your predictor variables, each referring to an input neuron.
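A hedged sketch of this lagged-predictor idea with H2O (the error-rate series here is simulated; swap in your own column):

    library(h2o)

    ## build lagged copies of a series: value(t-1), value(t-2), ...
    make_lags <- function(x, n_lags = 3) {
      d <- data.frame(value = x)
      for (k in seq_len(n_lags))
        d[[paste0("lag", k)]] <- c(rep(NA, k), head(x, -k))
      d
    }

    err      <- runif(200)                    # stand-in for your hourly error rate
    train_df <- na.omit(make_lags(err, 3))

    h2o.init()
    fit <- h2o.deeplearning(x = paste0("lag", 1:3), y = "value",
                            training_frame = as.h2o(train_df))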
I have tried to use many of the default methods inside H2O with time series data. If you treat the system as a state machine where the state variables are a series of lagged prior states, it's possible, but not entirely effective, as the prior states don't maintain their causal order. One way to alleviate this is to assign weights to each lagged state set based on the time elapsed, similar to how an EMA gives precedence to more recent data.
If you are looking to see how easy or effective DL/ML can be for a non-linear time series model, I would start with an easy problem to validate that the DL approach gives any improvement over a simple one-period ARIMA/GARCH-type process.
I have used this technique with varying success. What I have had success with is taking well-known non-linear time series models and improving their predictive qualities with additional factors, using the handcrafted non-linear model as an input into the DL method. It seems that certain qualities of the parameter space that I haven't manually worked out are able to supplement a decent foundation.
The real question at that point is that you have now introduced immense complexity that isn't entirely understood. Is that complexity warranted when the handcrafted non-linear model already encapsulates about 95% of the information shared between the two stages?
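A rough sketch of that hybrid setup, with auto.arima from the forecast package standing in for the handcrafted model and all data simulated (so purely illustrative):

    library(forecast)
    library(h2o)

    y    <- as.numeric(arima.sim(list(ar = 0.7), n = 300))  # simulated series
    base <- auto.arima(y)                                    # "handcrafted" first stage

    feat <- na.omit(data.frame(y         = y,
                               model_fit = fitted(base),        # stage-one output as a feature
                               lag1      = c(NA, head(y, -1)))) # an extra factor

    h2o.init()
    fit <- h2o.deeplearning(x = c("model_fit", "lag1"), y = "y",
                            training_frame = as.h2o(feat))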
I have time-series data for 12 consumers. The data corresponding to the 12 consumers (named a ... l) is:
I want to cluster these consumers so that I can tell which of them have the most similar consumption behavior. Accordingly, I found the clustering method pamk, which automatically calculates the number of clusters in the input data.
I assume that I have only two options to calculate the distance between any two time series, i.e., Euclidean and DTW. I tried both of them and I get different clusters. Now the question is: which one should I rely upon, and why?
When I use Euclidean distance I get the following clusters:
and using DTW distance I get:
Conclusion:
How would you decide which clustering approach is best in this case?
Note: I have asked the same question on Cross-Validated also.
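For reference, a minimal sketch of running pamk with both distance choices on a stand-in matrix (one row per consumer a ... l); your real series would replace the random numbers.

    library(fpc)    # pamk
    library(dtw)    # registers the "DTW" distance with the proxy package

    ts_mat <- matrix(rnorm(12 * 24), nrow = 12,
                     dimnames = list(letters[1:12], NULL))  # stand-in series

    res_euc <- pamk(dist(ts_mat))                           # Euclidean distance
    res_dtw <- pamk(proxy::dist(ts_mat, method = "DTW"))    # DTW distance

    res_euc$nc; res_dtw$nc   # number of clusters chosen under each distance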
None of the time series above look similar to me. Do you see any pattern? Maybe there is no pattern.
The clustering visualizations also indicate that there are no clusters. b and l appear to be the most unusual outliers, followed by d, e, h; but there are no clusters there.
Also try hierarchical clustering. The dendrogram may be more understandable.
But either way, there may be no clusters. You need to be prepared for this outcome and consider it a valid hypothesis. Double-check any result. As you have seen, pam will always return a result, and you have absolutely no means to decide which result is more "correct" than the other (most likely, neither is correct, and, to answer your question, you should rely on neither).
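A minimal sketch of the hierarchical-clustering suggestion, again on stand-in data shaped like yours:

    ts_mat <- matrix(rnorm(12 * 24), nrow = 12,
                     dimnames = list(letters[1:12], NULL))

    hc <- hclust(dist(ts_mat), method = "average")
    plot(hc)   # if merges happen at similar heights with no clear branches,
               # that is another hint that there are no real clusters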
I'm working on a project about identifying sales dynamics. This is what a piece of my database looks like: http://imagizer.imageshack.us/a/img854/1958/zlco.jpg. There are three columns:
Product - the product group
Week - time since the product launch (in weeks), first 26 weeks
Sales_gain - how the product's sales change by week
In the database there are 3302 observations = 127 time series.
My aim is to cluster the time series into groups that show different sales dynamics. Before clustering I want to use the Fast Fourier Transform to turn the time series into vectors, taking amplitude etc. into consideration, and then use a distance algorithm to group the products.
It's the first time I've dealt with FFT and clustering, so I would be grateful if anybody could point out the steps I have to do before/after using the FFT to group sales dynamics. I want to do all steps in R, so it would be wonderful if somebody could say which procedures I should use.
This is what my time series look like now: http://imageshack.com/a/img703/6726/sru7.jpg
Please note that I am relatively new to time series analysis (that's why I cannot put my code here), so any clarity you could provide in R, or any package you could recommend that would accomplish this task efficiently, would be appreciated.
P.S. Instead of the FFT I found code for the DWT here -> www.rdatamining.com/examples/time-series-clustering-classification but I cannot use it on my database and time series (it suggests having R analyze a new time series after 26 weeks). Can somebody explain it to me?
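In case it helps, here is one possible and deliberately simple pipeline sketch in R: reshape to one row per product, take the magnitudes of the first few Fourier coefficients as features, then cluster on those. All data below is simulated, and the choices of 6 frequencies, 4 clusters and Ward linkage are arbitrary illustrations.

    set.seed(1)
    df <- data.frame(Product    = rep(paste0("P", 1:127), each = 26),
                     Week       = rep(1:26, times = 127),
                     Sales_gain = rnorm(127 * 26))           # stand-in for your data

    wide <- reshape(df, idvar = "Product", timevar = "Week", direction = "wide")
    mat  <- as.matrix(wide[, -1])
    rownames(mat) <- wide$Product

    ## amplitude spectrum per product: drop the DC term, keep 6 frequencies
    amp <- t(apply(mat, 1, function(x) Mod(fft(x))[2:7]))

    hc     <- hclust(dist(amp), method = "ward.D2")
    groups <- cutree(hc, k = 4)
    table(groups)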
You may have too little data for FFT/DWT to make sense. DTW may be better, but I also don't think it makes sense for sales data - why would there be an x-week temporal offset from one series to another? It's not as if the data were captured at unknown starting weeks.
FFT and DWT are good when your data will have interesting repetitive patterns, and you have A) a good temporal resolution (for audio data, e.g. 16000 Hz - I am talking about thousands of data points!) and B) you have no idea of what frequencies to expect. If you know e.g. you will have weekly patterns (e.g. no sales on sundays) then you should filter them with other algorithms instead.
DTW (dynamic time warping) is good when you don't know when events start and how they align. Say you are capturing heart measurements: you cannot expect the hearts of two subjects to beat in synchronization. DTW will try to align the data, and may (or may not) succeed in matching e.g. an anomaly in the heartbeats of two subjects. In theory...
Maybe you don't need specialized time-series methods here at all, because:
A) your data has too low a temporal resolution, and
B) your data is already perfectly aligned.
Maybe all you need is to spend more time preprocessing your data, in particular on normalization, so that you can capture similarity.
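A last small sketch of that normalization point: z-scoring each product's series before computing plain Euclidean distances makes the clustering compare shapes rather than scales (mat is a stand-in products-by-weeks matrix).

    mat   <- matrix(rnorm(127 * 26), nrow = 127)   # stand-in: one row per product
    mat_z <- t(apply(mat, 1, scale))               # per-series z-score normalization

    hc <- hclust(dist(mat_z), method = "ward.D2")  # plain Euclidean distance afterwards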