So I am given a sequence of numbers (~600) in the range 0-255.
Is there a way to predict whether a new number x is an anomaly, given such a sequence?
I was thinking about treating the sequence as a Gaussian distribution and calculating the probability of the value x occurring within it, but this leads to choosing a threshold, which is difficult to set.
The sequences change and can't be manually investigated, so the method has to adapt quickly. Labeling data within the sequence is also not possible, which means the algorithm has to detect anomalies given only non-anomalous data in the sequence.
Does anybody have an idea or some resources to dig into?
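For concreteness, a rough R sketch of the Gaussian idea described above, with an exponentially weighted mean and variance so the model can track a changing sequence. The smoothing factor alpha, the cutoff z_max, and the 50-sample warm-up are arbitrary placeholders, and x is assumed to be the historical sequence:
# Adaptive Gaussian anomaly check: alpha, z_max and the warm-up length are guesses to tune.
alpha <- 0.05                      # adaptation rate
z_max <- 4                         # flag values more than ~4 "adaptive" SDs from the mean

mu <- mean(x[1:50])                # warm-up estimates from the first 50 values
s2 <- var(x[1:50])

is_anomaly <- function(value) {
  z <- abs(value - mu) / sqrt(s2)
  anomalous <- z > z_max
  if (!anomalous) {                # only update the model with non-anomalous samples
    mu <<- (1 - alpha) * mu + alpha * value            # exponentially weighted mean
    s2 <<- (1 - alpha) * s2 + alpha * (value - mu)^2   # crude exponentially weighted variance
  }
  anomalous
}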
I am using R (RStudio) to construct an index/synthetic indicator to evaluate, say, commercial efficiency. I am using the PCA() command from the FactoMineR package, with 7 distinct variables. I have previously created similar indexes by calculating the weight of each particular variable over the first component (which can be obtained through PCA()$var$coord[,1]), with no problems, since each variable had a positive weight. However, there is one particular variable whose weight has an undesired sign: negative. The variable is ‘delivery speed’, and this sign would imply that the greater the speed, the less efficient the process. Then, what is going on? How would you amend this issue, preferably still using PCA?
The sign of a variable's weight shouldn't matter in PCA. Taken together, the components represent the original data perfectly (when p < n), so within any given component it is natural that some weights are positive and some are negative. That doesn't mean the variable has an undesired weight; it means that for that particular extracted signal (say, the first principal component) the variable's weight happens to be negative.
For a better understanding, let's take the classical 2 dimensional example, which I took from this very useful discussion:
Can you see from the graph that one of the weights will necessarily be negative for the 2nd principal component?
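You can reproduce this with a quick simulation; this is just an illustrative sketch using base R's prcomp() on made-up correlated data:
# Two positively correlated variables: the two PC1 loadings share a sign,
# while the two PC2 loadings necessarily have opposite signs.
set.seed(1)
x1 <- rnorm(200)
x2 <- 0.8 * x1 + rnorm(200, sd = 0.5)
p  <- prcomp(cbind(x1, x2), scale. = TRUE)
p$rotation                    # loadings: note the mixed signs in PC2
# Flipping the sign of a whole component changes nothing substantive:
# p$rotation[, 1] <- -p$rotation[, 1]; p$x[, 1] <- -p$x[, 1]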
Finally, if that variable does actually disturb your analysis, one possible solution would be to apply Sparse PCA. Under cross-validated regularization that method is able to make some of the weights equal to zero. If in your case that negative weight is not significant enough, it might get reduced to zero under SPCA.
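As a sketch of the sparse PCA route, assuming the elasticnet package's spca() (other implementations exist, and the sparsity settings below are arbitrary choices you would tune, e.g. by cross-validation):
library(elasticnet)                        # one sparse PCA implementation (Zou & Hastie)
# X is assumed to be your n x 7 matrix of variables; ask for sparse loadings on the
# first 2 components, keeping at most 5 and 3 nonzero loadings respectively.
sp <- spca(X, K = 2, para = c(5, 3), type = "predictor", sparse = "varnum")
sp$loadings                                # weights that don't help get driven to zero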
I am using genetic matching in R using GenMatch in order to find comparable treatment and control groups to estimate a treatment effect. The default code for matching looks as follows:
GenMatch(Tr, X, BalanceMatrix=X, estimand="ATT", M=1, weights=NULL,
pop.size = 100, max.generations=100,...)
The description for the pop.size argument in the package is:
Population Size. This is the number of individuals genoud uses to
solve the optimization problem. The theorems proving that genetic
algorithms find good solutions are asymptotic in population size.
Therefore, it is important that this value not be small. See genoud
for more details.
Looking at genoud, the additional description is:
...There are several restrictions on what the value of this number can
be. No matter what population size the user requests, the number is
automatically adjusted to make certain that the relevant restrictions
are satisfied. These restrictions originate in what is required by
several of the operators. In particular, operators 6 (Simple
Crossover) and 8 (Heuristic Crossover) require an even number of
individuals to work on—i.e., they require two parents. Therefore, the
pop.size variable and the operators sets must be such that these three
operators have an even number of individuals to work with. If this
does not occur, the population size is automatically increased until
this constraint is satisfied.
I want to know how genoud (resp. GenMatch) incorporates the population size argument. Does the algorithm randomly select n individuals from the population for the optimization?
I had a look at the package description and the source code, but did not find a clear answer.
The word "individuals" here does not refer to individuals in the sample (i.e., individual units your dataset), but rather to virtual individuals that the genetic algorithm uses. These individuals are individual draws of a set of the variables to be optimized. They are unrelated to your sample.
The goal of genetic matching is to choose a set of scaling factors (which the Matching documentation calls weights), one for each covariate, that weight the importance of that covariate in a scaled Euclidean distance match. I'm no expert on the genetic algorithm, but my understanding of what it does is that it makes a bunch of guesses at the optimal values of these scaling factors, keeps the ones that "do the best" in the sense of optimizing the criterion (which is determined by fit.func in GenMatch()), and creates new guesses as slight perturbations of the kept guesses. It then repeats this process many times, simulating what natural selection does to optimize traits in living things. Each guess is what the word "individual" refers to in the description for pop.size, which corresponds to the number of guesses at each generation of the algorithm.
GenMatch() always uses your entire sample (unless you have provided a restriction like a caliper, exact matching requirement, or common support rule); it does not sample units from your sample to form each guess (which is what bagging is in other machine learning contexts).
Results will change over many runs because the genetic algorithm itself is a stochastic process. It may converge to a solution asymptotically, but because it is optimizing over a lumpy surface, it will find different solutions each time in finite samples with finite generations and a finite population size (i.e., pop.size).
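For reference, a hedged sketch of where pop.size enters a typical call (treat, X, and y are placeholder objects, and the numbers are arbitrary). Because the search is stochastic, fixing the seed is what makes a run reproducible:
library(Matching)
set.seed(123)                                   # the genetic search is stochastic
gen <- GenMatch(Tr = treat, X = X, BalanceMatrix = X, estimand = "ATT",
                M = 1, pop.size = 200, max.generations = 100)
# pop.size = 200 means 200 candidate sets of covariate scaling factors per generation,
# not 200 sampled units; the whole sample is used to evaluate every candidate.
m <- Match(Y = y, Tr = treat, X = X, estimand = "ATT", Weight.matrix = gen)
summary(m)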
I have a data set with 20 classes, and it has a pretty non-uniform distribution. Is there any functionality in R that allows us to balance the data set (weighted perhaps)?
I want to use the balanced data with Weka for classification. Since my class distribution is skewed, I am hoping to get better results if there's no single majority class.
I have tried to use the SMOTE filter and Resample filter but they don't quite do what I want.
I don't want any instances to be removed; repetition is fine.
I think there's a misunderstanding in your terminology. Your question's title refers to sampling, and yet the question text involves weighting.
To clarify:
With sampling, you either have fewer, the same, or more instances than in the original set; the unique membership of a sample can be either a strict subset of the original set or can be identical to the original set (with replacement - i.e., duplicates).
By weighting, you simply adjust weights that may be used for some further purpose (e.g. sampling, machine learning) to address or impose some (im)balance relative to a uniform weighting.
I believe that you are referring to weighting, but the same answer should work in both cases. If the total # of observations is N and the frequency of each class is an element of the 20-long vector freq (e.g. the count of items in class 1 is freq[1]*N), then simply use a weight vector of 1/freq to normalize the weights. You can scale it by some constant, e.g. N, though it wouldn't matter. In case any frequency is 0 or very close to it, you might address this by using a vector of smoothed counts (e.g. Good-Turing smoothing).
As a result, each set will have an equal proportion of the total weight.
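A quick R sketch of that inverse-frequency weighting, assuming your data sit in a data frame df with a (20-level) factor column class; these names are placeholders:
freq <- table(df$class) / nrow(df)          # proportion of each of the 20 classes
w    <- 1 / freq[as.character(df$class)]    # inverse-frequency weight per instance
w    <- w * nrow(df) / sum(w)               # optional: rescale so the weights sum to N
# Each class now carries an equal share of the total weight; pass w on to whatever
# downstream step (sampling or a learner) accepts instance weights.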
I am working with a large amount of time series.
These time series are basically network measurements arriving every 10 minutes; some of them are periodic (e.g. bandwidth), while others aren't (e.g. the amount of routing traffic).
I would like a simple algorithm for doing an online "outlier detection". Basically, I want to keep in memory (or on disk) the whole historical data for each time series, and I want to detect any outlier in a live scenario (each time a new sample is captured).
What is the best way to achieve these results?
I'm currently using a moving average in order to remove some noise, but then what next? Simple things like the standard deviation, MAD, ... computed against the whole data set don't work well (I can't assume the time series are stationary), and I would like something more "accurate", ideally a black box like:
double outlier_detection(double* vector, double value);
where vector is the array of doubles containing the historical data, and the return value is the anomaly score for the new sample "value".
This is a big and complex subject, and the answer will depend on (a) how much effort you want to invest in this and (b) how effective you want your outlier detection to be. One possible approach is adaptive filtering, which is typically used for applications like noise cancelling headphones, etc. You have a filter which constantly adapts to the input signal, effectively matching its filter coefficients to a hypothetical short term model of the signal source, thereby reducing mean square error output. This then gives you a low level output signal (the residual error) except for when you get an outlier, which will result in a spike, which will be easy to detect (threshold). Read up on adaptive filtering, LMS filters, etc, if you're serious about this kind of technique.
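If you want to experiment with that idea, here is a rough sketch of a normalized-LMS predictor in R; the filter order p, step size mu, and threshold multiplier k are placeholders to tune, not recommended values, and the threshold here is computed offline over the whole residual series for simplicity:
# Predict each sample from the previous p samples, adapt the coefficients online,
# and flag samples whose prediction error (residual) spikes.
lms_outliers <- function(x, p = 5, mu = 0.1, k = 4) {
  w <- rep(0, p)
  resid <- rep(NA_real_, length(x))                  # first p entries stay NA
  for (t in (p + 1):length(x)) {
    u <- x[(t - 1):(t - p)]                          # the last p samples
    e <- x[t] - sum(w * u)                           # prediction error
    w <- w + (mu * e / (sum(u * u) + 1e-8)) * u      # normalized LMS update
    resid[t] <- e
  }
  abs(resid) > k * sd(resid, na.rm = TRUE)           # TRUE where the residual spikes
}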
I suggest the scheme below, which should be implementable in a day or so:
Training
Collect as many samples as you can hold in memory
Remove obvious outliers using the standard deviation for each attribute
Calculate and store the correlation matrix and also the mean of each attribute
Calculate and store the Mahalanobis distances of all your samples
Calculating "outlierness":
For the single sample whose "outlierness" you want to know:
Retrieve the means, correlation matrix and Mahalanobis distances from training
Calculate the Mahalanobis distance "d" for your sample
Return the percentile in which "d" falls (using the Mahalanobis distances from training)
That will be your outlier score: 100% is an extreme outlier.
PS. In calculating the Mahalanobis distance, use the correlation matrix, not the covariance matrix. This is more robust if the sample measurements vary in unit and number.
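A compact R sketch of that scheme, using base R's mahalanobis() and standardizing the attributes so the correlation matrix plays the role of the covariance matrix; train is assumed to be a numeric matrix of training samples (rows = samples):
mu   <- colMeans(train)
sds  <- apply(train, 2, sd)
R    <- cor(train)                                       # correlation matrix, per the PS
z    <- scale(train, center = mu, scale = sds)
d_tr <- mahalanobis(z, center = rep(0, ncol(train)), cov = R)

outlier_score <- function(x) {                           # x: one new sample (vector)
  d <- mahalanobis((x - mu) / sds, center = rep(0, length(x)), cov = R)
  100 * mean(d_tr <= d)                                  # percentile; ~100 = extreme outlier
}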
I'd like to combine a few metrics of nodes in a social network graph into a single value for rank ordering the nodes:
in_degree + betweenness_centrality = informal_power_index
The problem is that in_degree and betweenness_centrality are measured on different scales, say 0-15 vs. 0-35000, and follow a power-law distribution (at the very least, definitely not a normal distribution).
Is there a good way to rescale the variables so that one won't dominate the other in determining the informal_power_index?
Three obvious approaches are:
Standardizing the variables (subtract mean and divide by stddev). This seems like it would squash the distribution too much, hiding the massive difference between a value in the long tail and one near the peak.
Re-scaling variables to the range [0,1] by subtracting min(variable) and dividing by max(variable). This seems closer to fixing the problem since it won't change the shape of the distribution, but maybe it won't really address the issue? In particular the means will be different.
Equalize the means by dividing each value by mean(variable). This won't address the difference in scales, but perhaps the mean values are more important for the comparison?
Any other ideas?
You seem to have a strong sense of the underlying distributions. A natural rescaling is to replace each variate with its probability (i.e. its value under the empirical CDF). Or, if your model is incomplete, choose a transformation that approximately achieves that. Failing that, here's a related approach: if you have a lot of univariate data from which to build a histogram (of each variate), you can convert each value to a 10-point scale based on whether it falls in the 0-10th percentile, the 10-20th percentile, ..., or the 90-100th percentile. These transformed variates have, by construction, a uniform distribution on 1, 2, ..., 10, and you can combine them however you wish.
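In R, the empirical-percentile (or decile) version of that idea is only a couple of lines; this sketch assumes in_degree and betweenness_centrality are numeric vectors over the same nodes:
to_percentile <- function(v) rank(v, ties.method = "average") / length(v)   # empirical CDF value
to_decile     <- function(v) cut(to_percentile(v), breaks = seq(0, 1, 0.1),
                                 include.lowest = TRUE, labels = FALSE)     # bins 1..10
informal_power_index <- to_percentile(in_degree) + to_percentile(betweenness_centrality)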
You could translate each to a percentage and then apply each to a known quantity, then use the sum of the new values:
((1 - (in_degree / 15)) * 2000) + ((1 - (betweenness_centrality / 35000)) * 2000) = ?
Very interesting question. Could something like this work:
Let's assume that we want to scale both variables to the range [-1, 1].
Take the example of betweenness_centrality, which has a range of 0-35000.
Choose a large number on the order of the range of the variable. As an example, let's choose 25,000.
Create 25,000 bins over the original range [0, 35000] and 25,000 bins over the new range [-1, 1].
For each value x_i, find the bin it falls into in the original range. Call this bin B_i.
Find the bin corresponding to B_i in the new range [-1, 1].
Use either the max or the min of that bin in [-1, 1] as the scaled version of x_i.
This preserves the shape of the power-law distribution while also scaling it down to [-1, 1], and does not have the problem experienced by (x - mean)/sd.
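A sketch of that binning scheme in R for betweenness_centrality; the 25,000 bins and the 0-35000 range are taken from the example above:
n_bins   <- 25000
old_bins <- seq(0, 35000, length.out = n_bins + 1)        # bin edges in the original range
new_bins <- seq(-1, 1, length.out = n_bins + 1)           # matching bin edges in [-1, 1]
b        <- findInterval(betweenness_centrality, old_bins, rightmost.closed = TRUE)
scaled   <- new_bins[b + 1]                               # upper edge of the matching bin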
Normalizing to [0, 1] would be my short-answer recommendation for combining the two values, as it maintains the distribution shape, as you mentioned, and should solve the problem of combining the values.
If the distributions of the two variables are different, which sounds likely, this won't really give you what I think you're after: a combined measure of where each value sits within its own distribution. You would have to come up with a metric that determines where in its distribution a given value lies. This could be done in many ways; one would be to determine how many standard deviations away from the mean the given value is, and then combine these two values in some way to get your index (simple addition may no longer be sufficient).
You'd have to work out what makes the most sense for the data sets you're looking at. Standard deviations may well be meaningless for your application, but you need to look at statistical measures that relate to the distribution and combine those, rather than combining absolute values, normalized or not.
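For what it's worth, the "standard deviations from the mean" version of that looks like this in R; whether simple addition is the right way to combine the two z-scores is exactly the open question above:
z_deg <- (in_degree - mean(in_degree)) / sd(in_degree)
z_bc  <- (betweenness_centrality - mean(betweenness_centrality)) / sd(betweenness_centrality)
informal_power_index <- z_deg + z_bc    # simple additive combination of the two z-scores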