Detect patterns in data set - r

Given a data set with several records that are similar to this one:
I want to detect the green dots. This pattern recurs in a lot of the data records but is not completely identical (sd, variance, min, max, etc.). These data points are near the minimum and show low variance.
I tried clustering (kmeans, dbscan, mclust) but the result was not very good.
How can I solve this problem? Any ideas?

Dare I say a simple threshold based on the minimum and a percentage?
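For what it's worth, a minimal sketch of that threshold idea in R (the vector x and the 10% margin are made-up placeholders; the rule is just "within some percentage of the observed minimum"):

    # Flag points that lie within `pct` of the observed range above the minimum.
    # x is a numeric vector of one record's values; pct is an arbitrary margin.
    flag_low_points <- function(x, pct = 0.10) {
      threshold <- min(x, na.rm = TRUE) + pct * diff(range(x, na.rm = TRUE))
      x <= threshold   # TRUE for the candidate "green dots"
    }

    # Example: points within 10% of the range above the minimum
    # low <- flag_low_points(record$value)

You could tighten the rule by also requiring a low rolling standard deviation around each flagged point, since the pattern you describe is "near the minimum and low variance".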

Define sample size using simple random sampling

I am trying to run a PCA, but I have too much data (20k observations) and the resolution is too low.
I am using sample_n(df, replace = TRUE, n) [from dplyr] to reduce the size and have a better fit.
My question is: what is the best technique to define (or estimate) the sample size (n)?
If I have 20k observations (different sites, different times of the year, relatively well homogeneous), which cutoff should I use: 5%, 10%, 20%?
Could you give me a reference to your suggestion?
Thank you in advance for your comments.
I would make a loop over different sample sizes; I don't believe there is a clear cutoff, just as there isn't one for train/test splits (we have pipelines these days, but you know what I mean: the 70/30 cutoff). The only thing I would check is whether the sample_n output is not too clustered and the values are relatively evenly represented.
If you are familiar with k-means clustering, there is the "elbow method", where the best number of clusters is somewhat subjective (even though we measure the within-cluster sum of squares); you just have to try a lot of iterations and loops.
With neural networks, for example, when you have a million observations you can reduce the test set to, say, 5 or 10%, because in absolute terms you still have plenty of cases.
In summary:
I think it needs a practical test, like the elbow method in clustering, because the answer can be very specific to your data.
I hope my answer is of at least some value to you; I have no journal reference at the moment.
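To make the loop idea concrete, a rough sketch in R (assuming df is an all-numeric data frame; the candidate fractions are arbitrary):

    library(dplyr)

    fractions <- c(0.05, 0.10, 0.20, 0.50, 1.00)

    pve <- sapply(fractions, function(f) {
      sub <- sample_n(df, size = round(nrow(df) * f))
      p   <- prcomp(sub, scale. = TRUE)
      sum(p$sdev[1:2]^2) / sum(p$sdev^2)  # variance explained by PC1 + PC2
    })

    data.frame(fraction = fractions, pve_first_two = round(pve, 3))

If the explained-variance figures (and the loadings, if you check those too) stabilise well before the full data set, the corresponding fraction is probably a safe sample size for your purposes.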

How to sort many time series by how trending each series is

Hi, I am recording data for around 150k items in InfluxDB. I have tried grouping by item id and using some of the functions from the docs, but they don't seem to show "trend".
As there are a lot of series to group by, I am currently performing a query on each series to calculate a value, storing it, and sorting by that.
I have tried to use linear regression (the average angle of the line), but it's not quite meant for this: the X-axis values are timestamps, which do not correlate with the Y-axis values, so I end up with a near-vertical line. Maybe I can transform the X values into something else?
The other issue I have is that some series take much higher values than others, so a jump of 1000 might be huge (very trending) for one series and not a big deal for another series whose values are always much higher.
Is there a way I can generate a single value from a series that represents how trending the series is, e.g. that it has just jumped up quite a lot compared to normal?
Here is an example of one series that is not trending and one that was trending a couple of days ago, so the latter would have a higher trend value than the first:
Thanks!
I think similar problems arise naturally in the stock market and in general when detecting outliers.
So there are different ways to proceed; probably option 1) is good enough.
1) It looks like you have a moving average in the graphs. You could just take the difference from the moving average and look at its distribution to work out the thresholds at which you want to pay attention. It looks like the first graph has a possibly relevant event. You could place a threshold at, say, two standard deviations of the difference between the real series and the moving average (a sketch of this is given after the list).
2) De-trend each series. Even 1) could be good enough (I mean just subtracting the moving average of the last X days from the actual value), but you could de-trend using more sophisticated ideas. That may need more attention for each case; for instance, you should be careful with seasonality and so on. Perhaps something like the Hodrick-Prescott filter, or along the lines of this: https://machinelearningmastery.com/decompose-time-series-data-trend-seasonality/.
3) The idea from 1) is perhaps more formally described as Bollinger Bands, which help you know where the time series should be with some probability.
4) There are more sophisticated ways to identify outliers in time series, as in https://towardsdatascience.com/effective-approaches-for-time-series-anomaly-detection-9485b40077f1, or see https://arxiv.org/pdf/2002.04236.pdf for a literature review.
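A minimal sketch of option 1) in R, assuming each series is a numeric vector sampled at regular intervals (the window length and the names are made up):

    # Score a series by how far its latest value is from its own moving
    # average, measured in standard deviations of the historical deviations.
    trend_score <- function(values, window = 24) {
      ma  <- stats::filter(values, rep(1 / window, window), sides = 1)
      dev <- (values - ma)[!is.na(ma)]      # deviation from the moving average
      as.numeric(tail(dev, 1)) / sd(head(dev, -1))
    }

    # One score per series; the scale of each series no longer matters
    # because everything is expressed in its own standard-deviation units.
    # scores <- sapply(series_list, trend_score)
    # head(sort(scores, decreasing = TRUE))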

Find the radius of a cluster, given that its center is the average of the centers of two other clusters

I do not know if it is possible to find it, but I am using k-means clustering with Mahout, and I am stuck on the following.
In my implementation, two different threads create the following clusters:
CL-1{n=4 c=[1.75] r=[0.82916]}
CL-1{n=2 c=[4.5] r=[0.5]}
So, I would like to finally combine these two clusters into one final cluster.
In my code, I manage to find that for the final cluster the total number of points is n=6 and the new average of the centers is c=2.666, but I am not able to find the final combined radius.
I know that the radius is the population standard deviation, and I could calculate it if I knew in advance each point that belongs to the cluster.
However, in my case I do not have prior knowledge of the points, so I need the "average" of the two radii mentioned above, in order to finally have this: CL-1{n=6 c=[2.666] r=[???]}.
Any ideas?
Thanks for your help.
It's not hard. Remember how the "radius" (not a very good name) is computed.
It's probably the standard deviation; so if you square this value and multiply it by the number of objects, you get a sum of squares. You can aggregate the sums of squares and then reverse this process to get a standard deviation again. It's pretty basic statistics: you want to compute a weighted quadratic mean, just like you computed the weighted arithmetic mean for the center.
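For concreteness, a small sketch in R of that aggregation, assuming r really is the population standard deviation of each 1-D cluster. Sums of values and sums of squared values both add across clusters, and working with squared raw values (r^2 + c^2 per point, on average) automatically accounts for the two centers being different:

    # Combine several 1-D clusters given their sizes, centers and radii (SDs).
    combine_clusters <- function(sizes, centres, radii) {
      sum_x  <- sum(sizes * centres)                # aggregated sum of values
      sum_x2 <- sum(sizes * (radii^2 + centres^2))  # aggregated sum of squared values
      n      <- sum(sizes)
      centre <- sum_x / n
      radius <- sqrt(sum_x2 / n - centre^2)         # population SD of the union
      list(n = n, c = centre, r = radius)
    }

    # The two clusters from the question:
    combine_clusters(sizes = c(4, 2), centres = c(1.75, 4.5), radii = c(0.82916, 0.5))
    # gives n = 6, c = 2.666..., r = about 1.49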
However, since your data is 1 dimensional, I'm pretty sure it will fit into main memory. As long as your data fits into memory, stay away from Mahout. It's slooooow. Use something like ELKI instead, or SciPy, or R. Run benchmarks. Mahout will perform several orders of magnitude slower than all the others. You won't need all of this Canopy-thing then either.

How to aggregate / roll up percentile measures

There is a dataset which contains aggregated data - aggregated to various dimensions, and down to the hourly level. The main measure is speed which is simply the file size divided by the duration.
The requirement is to see Percentile, Median and Average/Mean summaries.
Mean is simple because we just create a calculated measure in the MDX, and then it works at all aggregation levels, i.e. daily/monthly etc.
However, percentile and median are hard. Is there any way to define a calculation for these functions that will roll up correctly? We could add the percentile speed as a column in the ETL when we're reading the raw data, but we'd still need to find a way to roll it up further.
What is the proper way to roll up these types of measures? It's not uncommon to ask for percentile numbers, so I'm surprised to not see much information on this when I look around.
Maybe the only approach is to have various aggregated tables at the right level, with the right calculation, and then make Mondrian use them as agg tables? Or, worst case, have multiple cubes (!)
OK, so it turns out you cannot roll up percentiles (and therefore medians, a median being just the 50th percentile). I understand others have had this problem; see this tweet from Kasper here: https://twitter.com/kaspersor/status/308189242788560896
So our solution was a couple of different agg tables to store the relevant stats, plus storing the pre-computed percentile and median stats on the main (already aggregated) fact table.
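A tiny R illustration of why the roll-up fails (the numbers are made up): the percentile of the combined data is not recoverable from the hourly percentiles alone.

    set.seed(1)
    hour1 <- rexp(1000, rate = 1 / 5)    # speeds in hour 1
    hour2 <- rexp(1000, rate = 1 / 20)   # speeds in hour 2

    quantile(hour1, 0.9)                                  # hourly p90
    quantile(hour2, 0.9)                                  # hourly p90
    mean(c(quantile(hour1, 0.9), quantile(hour2, 0.9)))   # naive "rolled up" p90
    quantile(c(hour1, hour2), 0.9)                        # true p90 of the day: different

That is why the pre-computed percentile columns are only valid at the exact level they were computed for, and each additional roll-up level needs its own agg table.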

Correct way to standardize/scale/normalize multiple variables following power law distribution for use in linear combination

I'd like to combine a few metrics of nodes in a social network graph into a single value for rank ordering the nodes:
in_degree + betweenness_centrality = informal_power_index
The problem is that in_degree and betweenness_centrality are measured on different scales, say 0-15 vs 0-35000, and follow a power-law distribution (or at least definitely not a normal distribution).
Is there a good way to rescale the variables so that one won't dominate the other in determining the informal_power_index?
Three obvious approaches are:
1) Standardizing the variables (subtract the mean and divide by the standard deviation). This seems like it would squash the distribution too much, hiding the massive difference between a value in the long tail and one near the peak.
2) Re-scaling the variables to the range [0,1] by subtracting min(variable) and dividing by max(variable). This seems closer to fixing the problem since it won't change the shape of the distribution, but maybe it won't really address the issue? In particular, the means will still be different.
3) Equalizing the means by dividing each value by mean(variable). This won't address the difference in scales, but perhaps the means are what matter for the comparison?
Any other ideas?
You seem to have a strong sense of the underlying distributions. A natural rescaling is to replace each variate with its probability. Or, if your model is incomplete, choose a transformation that approximately achieves that. Failing that, here's a related approach: if you have a lot of univariate data from which to build a histogram (of each variate), you could convert each to a 10-point scale based on whether it is in the 0-10% percentile, the 10-20% percentile, ..., or the 90-100% percentile. These transformed variates have, by construction, a uniform distribution on 1, 2, ..., 10, and you can combine them however you wish.
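A small R sketch of that decile idea (in_degree and betweenness_centrality are assumed to be numeric vectors over the same nodes; the 1-10 score follows the percentile bins described above):

    # Map each value to its decile via the empirical CDF, giving a 1-10 score
    # that is roughly uniform regardless of the original power-law scale.
    to_decile <- function(x) {
      ceiling(ecdf(x)(x) * 10)
    }

    informal_power_index <- to_decile(in_degree) + to_decile(betweenness_centrality)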
You could translate each to a percentage and then apply each to a known quantity, then use the sum of the new values:
((1 - (in_degree / 15)) * 2000) + ((1 - (betweenness_centrality / 35000)) * 2000) = ?
Very interesting question. Could something like this work?
Let's assume that we want to scale both variables to the range [-1, 1], and take the example of betweenness_centrality, which has a range of 0-35000.
1) Choose a large number on the order of the range of the variable. As an example, let's choose 25,000.
2) Create 25,000 bins in the original range [0, 35000] and 25,000 bins in the new range [-1, 1].
3) For each number x_i, find the bin it falls into in the original range. Call this bin B_i.
4) Find the range covered by the corresponding bin B_i in [-1, 1].
5) Use either the max or the min of that bin's range in [-1, 1] as the scaled version of x_i.
This preserves the power-law distribution while also scaling it down to [-1, 1], and does not have the problem experienced with (x - mean)/sd.
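A rough sketch of that scheme in R (the bin count and names are placeholders); with a large number of bins it behaves like a quantised min-max rescale to [-1, 1], so the shape of the distribution is preserved:

    # Quantise the original range into `bins` equal-width bins and map each
    # bin to the matching bin in [lower, upper], using the bin's lower edge.
    rescale_binned <- function(x, bins = 25000, lower = -1, upper = 1) {
      edges <- seq(min(x), max(x), length.out = bins + 1)
      b     <- findInterval(x, edges, all.inside = TRUE)   # bin index B_i
      lower + (b - 1) * (upper - lower) / bins
    }

    # betweenness_scaled <- rescale_binned(betweenness_centrality)
    # in_degree_scaled   <- rescale_binned(in_degree)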
Normalizing to [0, 1] would be my short-answer recommendation for combining the two values, as it will maintain the distribution shape, as you mentioned, and should solve the problem of combining the values.
However, if the distributions of the two variables are different, which sounds likely, this won't really give you what I think you're after: a combined measure of where each variable sits within its own distribution. You would have to come up with a metric that determines where in its distribution a given value lies. This could be done in many ways; one would be to determine how many standard deviations away from the mean the value is, and then combine these two numbers in some way to get your index (simple addition may no longer be sufficient).
You'd have to work out what makes the most sense for the data sets you're looking at. Standard deviations may well be meaningless for your application, but you should look for statistical measures that relate to the distribution and combine those, rather than combining absolute values, normalized or not.
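If the standard-deviations route does make sense for your data, a minimal sketch of that idea (with the caveat above that a z-score may mean little for a power-law variable):

    # Express each variable in standard deviations from its own mean, then add.
    z_score <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

    informal_power_index <- z_score(in_degree) + z_score(betweenness_centrality)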
