I am experimenting with Opt4J to solve an optimization problem. The problem involves a collection of dates, and I need an objective function that favors solutions whose dates are evenly spaced over a fixed range.
What would be a good objective function for this?
Thoughts so far...
My first thought was to take the intervals between all neighboring dates and define the objective function as the statistical variance of those intervals. In a perfect solution all the intervals would be equal, so their variance would be zero; the more uneven the intervals, the larger the variance and the worse the solution.
But I noticed a couple of problems with variance.
If the dates are evenly spaced but clustered in only part of the fixed range, the objective function still rates them quite highly.
If I add another term that penalizes solutions without dates close to the boundaries of the fixed range, I'm not sure how to "balance" it against the variance...
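One way to fold both concerns into a single number (just a sketch, not specific to Opt4J's API): include the two boundaries of the fixed range as extra points, compute the gaps between consecutive points, and take the variance of those gaps. A solution whose dates are evenly spaced but bunched into one part of the range is then penalized automatically, because its gaps to the boundaries are large. A minimal R illustration, assuming the dates are numeric day offsets:

    # Variance of the gaps between consecutive dates, with the range boundaries
    # included as virtual endpoints so poor coverage of the range is penalized too.
    evenness_objective <- function(dates, range_start, range_end) {
      points <- sort(c(range_start, dates, range_end))
      gaps <- diff(points)   # intervals between neighboring points
      var(gaps)              # 0 for perfectly even coverage; larger is worse
    }

    # Evenly spread over the range vs. evenly spaced but bunched at one end
    evenness_objective(c(25, 50, 75), 0, 100)   # 0
    evenness_objective(c(10, 20, 30), 0, 100)   # 900

The same computation should be straightforward to port into whatever evaluator class Opt4J expects.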
Related
Hi, I am recording data for around 150k items in InfluxDB. I have tried grouping by item ID and using some of the functions from the docs, but they don't seem to show a "trend".
Since there are a lot of series to group by, I am currently running a query on each series to calculate a value, storing it, and sorting by that.
I have tried to use linear regression (the average angle of the line), but it's not quite meant for this since the X-axis values are timestamps, which do not correlate with the Y-axis values, so I end up with a near-vertical line. Maybe I can transform the X values into something else?
The other issue I have is that some series have much higher values than others, so one series jumping up by 1000 might be huge (very trending), while the same jump is not a big deal for other series that are always much higher.
Is there a way I can generate a single value from a series that represents how "trending" the series is, e.g. it has just jumped up quite a lot compared to normal?
Here is an example of one series that is not trending and one that was trending a couple of days ago, so the latter would have a higher trend value than the first:
Thanks!
I think similar problems arise naturally in the stock market and in general when detecting outliers.
There are a few different ways you could go; option 1 is probably good enough.
1. It looks like you already have a moving average in the graphs. You could take the difference between the real series and that moving average and look at the distribution of that difference to pick an appropriate threshold for when to pay attention. It looks like the first graph has an event that may be relevant. For example, you could place the threshold at two standard deviations of the difference between the real series and the moving average.
2. De-trend each series. Even option 1 could be good enough (I mean just subtracting the average of the last X days from the real value), but you could also de-trend using more sophisticated ideas. That needs more attention per series; for instance, you should be careful with seasonality and so on. Perhaps something like the Hodrick-Prescott filter, or along the lines of this: https://machinelearningmastery.com/decompose-time-series-data-trend-seasonality/.
3. The idea from option 1 is perhaps more formally described as Bollinger Bands, which tell you where the time series should be with some probability.
4. There are more sophisticated ways to identify outliers in time series, as in here: https://towardsdatascience.com/effective-approaches-for-time-series-anomaly-detection-9485b40077f1, or here for a literature review: https://arxiv.org/pdf/2002.04236.pdf
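To make option 1 concrete, here is a small R sketch (the window length is just a placeholder): it computes a trailing moving average, takes the deviation of the latest point from it, and expresses that deviation in units of the typical deviation, so series with very different absolute levels become comparable.

    # Trend score for one series: how far the latest value sits above its trailing
    # moving average, measured in standard deviations of that difference.
    trend_score <- function(values, window = 20) {
      ma <- stats::filter(values, rep(1 / window, window), sides = 1)  # trailing moving average
      resid <- values - ma                                             # deviation from the moving average
      as.numeric(tail(resid, 1)) / sd(resid, na.rm = TRUE)             # e.g. > 2 means "jumped ~2 sd above normal"
    }

    # A flat, noisy series that jumps at the end scores well above 2
    set.seed(1)
    x <- c(rnorm(100, mean = 50, sd = 2), 65)
    trend_score(x)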
I have tried several packages in R and I am really lost as to which one I should be using. I just need help with the general direction; I can work out the exact code myself.
I am attempting portfolio optimization in R. I need a weights vector to be calculated, where each weight represents the percentage of that stock in the portfolio.
Given the weights, I calculate total return, variance, and Sharpe ratio (a function of return and variance).
There could be constraints, such as the weights summing to 1 (100%), and maybe some others on a case-by-case basis.
I am trying to make my code flexible enough that I can optimize with different objectives (one at a time). For example, I might want minimum variance in one run, maximum return in another, and maximum Sharpe ratio in yet another.
This is pretty straightforward in Excel with the Solver add-in. Once I have the formulas entered, whichever cell I pick as the objective, it calculates the weights for that objective and then the other quantities from those weights (e.g., if I optimize for minimum variance, it calculates the weights for minimum variance and then the return and Sharpe ratio from those weights).
I am wondering how to go about this in R. I am lost in the documentation of several R packages and functions (lpSolve, optim, constrOptim, PortfolioAnalytics, etc.) and unable to find a starting point. My specific questions are:
Which would be the right R package for this kind of analysis?
Do I need to define a separate function for each possible objective (variance, return, Sharpe) and optimize whichever one I choose? This is a little tricky because Sharpe depends on variance and return, so if I want to optimize the Sharpe function, do I need to nest it within the variance and return functions?
I just need some ideas on how to start and I can give it a try. Even just the right package and a suitable example to follow would be great. I have searched a lot on the web but I am really lost.
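One way to get that kind of flexibility with nothing more than base R's optim, sketched below with made-up returns data (so not a recommendation of the "right" package): write one small function per quantity, with Sharpe simply calling the return and variance functions, and pass whichever function you want to optimize to a generic wrapper. The sum-to-one constraint is handled here by renormalizing the weights inside the objective rather than with an explicit constraint.

    # Sketch: flexible objective choice for portfolio weights using base R optim().
    # 'returns' is a matrix of periodic returns, one column per stock (made up here).
    set.seed(42)
    returns <- matrix(rnorm(200 * 4, mean = 0.001, sd = 0.02), ncol = 4)

    port_return   <- function(w, returns) sum(colMeans(returns) * w)
    port_variance <- function(w, returns) as.numeric(t(w) %*% cov(returns) %*% w)
    port_sharpe   <- function(w, returns) port_return(w, returns) / sqrt(port_variance(w, returns))

    # Minimizes 'objective'; pass sign = -1 to maximize instead.
    optimize_portfolio <- function(objective, returns, sign = 1) {
      n <- ncol(returns)
      fn <- function(raw) {
        w <- abs(raw) / sum(abs(raw))   # enforce w >= 0 and sum(w) == 1
        sign * objective(w, returns)
      }
      fit <- optim(rep(1 / n, n), fn)
      abs(fit$par) / sum(abs(fit$par))
    }

    w_minvar <- optimize_portfolio(port_variance, returns)             # minimum variance
    w_sharpe <- optimize_portfolio(port_sharpe,   returns, sign = -1)  # maximum Sharpe
    c(return   = port_return(w_minvar, returns),
      variance = port_variance(w_minvar, returns),
      sharpe   = port_sharpe(w_minvar, returns))

Packages such as PortfolioAnalytics provide a more structured way to declare objectives and constraints, but the pattern of "one function per objective, passed to the optimizer" is the same.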
I have thousands of factors (categorical variables) that I am using in a Naive Bayes classification.
My problem is that I have many factors that appear very few times in my dataset, and they seem to decrease the performance of my predictions.
Indeed, I noticed that if I removed the categorical variables that occur very few times, I got a significant improvement in accuracy. But ideally I would like to keep all my factors; do you know what the best practice is for doing so?
Big Thanks.
This is too long for a comment.
The lowest frequency terms may be adversely affecting the accuracy simply because there is not enough data to make an accurate prediction. Hence, the observations in the training set may say nothing about the validation set.
You could combine all the lowest frequency observations into a single value. Off-hand, I don't know what the right threshold is. You can start by taking everything that occurs 5 or fewer times and lumping them together.
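A base R sketch of that lumping idea (the threshold of 5 and the "OTHER" label are arbitrary, as noted above):

    # Collapse every level that occurs min_count times or fewer into one "OTHER" level.
    lump_rare <- function(x, min_count = 5, other = "OTHER") {
      counts <- table(x)
      rare <- names(counts)[counts <= min_count]
      x <- as.character(x)
      x[x %in% rare] <- other
      factor(x)
    }

    # Applied to every factor column of a data frame before fitting Naive Bayes:
    # df[] <- lapply(df, function(col) if (is.factor(col)) lump_rare(col) else col)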
I have two different density plots in R: one of them is the observed data (x1), and the other is randomly generated data from a Poisson distribution with the observed mean (x2). I would like to bring the curves closer together, i.e. make the expected curve look more like the observed data, as it over- and under-estimates it in certain areas. How do I go about doing this? I know you can get the absolute difference between the curves by using
abs(x1 - x2)
However, I'm not too sure how to proceed. Does anybody have any ideas?
I think if you want to find an analytical solution, you might just have to play with the functions for a while. Otherwise, it seems you could use the calculus of variations: take the difference between the areas under your two functions and minimize it (set the derivative to zero). Formally, you need the second derivative to determine whether you have a maximum, minimum, or inflection point, but you don't need that here if the function fits the data. I'm not sure what the best program would be for finding an analytical solution, but maybe that will put you on the right track. Just an idea to bounce around.
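If a numerical measure of the mismatch is enough, one option (just a sketch) is to evaluate both densities on a common grid and integrate the absolute difference, which makes the abs(x1 - x2) idea concrete; here x1 and x2 are assumed to be the raw samples rather than the plotted density objects:

    # Approximate integral of |f1 - f2| between the two density curves,
    # evaluated on a shared grid so the curves are directly comparable.
    density_gap <- function(x1, x2, n = 512) {
      lo <- min(x1, x2); hi <- max(x1, x2)
      d1 <- density(x1, from = lo, to = hi, n = n)   # observed data
      d2 <- density(x2, from = lo, to = hi, n = n)   # Poisson-simulated data
      dx <- d1$x[2] - d1$x[1]
      sum(abs(d1$y - d2$y)) * dx                     # 0 would mean identical curves
    }

    # Example: observed counts vs. Poisson draws with the same mean
    x1 <- rpois(500, lambda = 4) + rbinom(500, 1, 0.3)   # "observed", deliberately a bit off-Poisson
    x2 <- rpois(500, lambda = mean(x1))
    density_gap(x1, x2)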
I'd like to combine a few metrics of nodes in a social network graph into a single value for rank ordering the nodes:
in_degree + betweenness_centrality = informal_power_index
The problem is that in_degree and betweenness_centrality are measured on different scales, say 0-15 vs. 0-35000, and follow a power-law distribution (or at least definitely not a normal distribution).
Is there a good way to rescale the variables so that one won't dominate the other in determining the informal_power_index?
Three obvious approaches are:
Standardizing the variables (subtract the mean and divide by the standard deviation). This seems like it would squash the distribution too much, hiding the massive difference between a value in the long tail and one near the peak.
Re-scaling the variables to the range [0,1] by subtracting min(variable) and dividing by max(variable) - min(variable). This seems closer to fixing the problem since it won't change the shape of the distribution, but maybe it won't really address the issue? In particular the means will still be different.
Equalize the means by dividing each value by mean(variable). This won't address the difference in scales, but perhaps the mean values are more important for the comparison?
Any other ideas?
You seem to have a strong sense of the underlying distributions. A natural rescaling is to replace each variate with its probability. Or, if your model is incomplete, choose a transformation that approximately achieves that. Failing that, here's a related approach: if you have a lot of univariate data from which to build a histogram of each variate, you could convert each to a 10-point scale based on whether it falls in the 0-10th percentile, the 10th-20th percentile, ..., or the 90th-100th percentile. These transformed variates have, by construction, a uniform distribution on 1, 2, ..., 10, and you can combine them however you wish.
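In R, that percentile/decile transform is essentially the empirical CDF; a sketch with toy data standing in for the two metrics:

    # Toy vectors standing in for the real metrics (roughly power-law shaped)
    set.seed(7)
    in_degree <- rpois(200, 3)
    betweenness_centrality <- round(rlnorm(200, meanlog = 6, sdlog = 2))

    # Replace each value by its decile (1-10) within its own distribution,
    # then combine the two comparable scores.
    to_decile <- function(x) cut(ecdf(x)(x), breaks = seq(0, 1, 0.1),
                                 include.lowest = TRUE, labels = FALSE)

    informal_power_index <- to_decile(in_degree) + to_decile(betweenness_centrality)
    head(order(informal_power_index, decreasing = TRUE))   # indices of the top-ranked nodes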
You could translate each to a percentage and then apply each to a known quantity, then use the sum of the new values:
((1 - (in_degree / 15)) * 2000) + ((1 - (betweenness_centrality / 35000)) * 2000) = ?
Very interesting question. Could something like this work?
Let's assume that we want to scale both variables to a range of [-1, 1].
Take the example of betweenness_centrality, which has a range of 0-35000.
Choose a large number on the order of the range of the variable; as an example, let's choose 25,000.
Create 25,000 bins over the original range [0, 35000] and 25,000 bins over the new range [-1, 1].
For each value x_i, find the bin it falls into in the original range. Call this bin B_i.
Find the interval that B_i corresponds to in [-1, 1].
Use either the max or the min of that interval as the scaled version of x_i.
This preserves the power-law shape while scaling the values down to [-1, 1], and it does not have the problem experienced with (x - mean) / sd.
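A sketch of that binning in R (the bin count and the observed range are parameters; in effect this is a quantized min-max rescaling):

    # Map x from [lo, hi] into [-1, 1] through n_bins equal-width bins,
    # returning the upper edge of the bin each value falls into.
    rescale_binned <- function(x, lo, hi, n_bins = 25000) {
      bin <- pmin(floor((x - lo) / (hi - lo) * n_bins), n_bins - 1)  # bin index 0 .. n_bins - 1
      -1 + (bin + 1) * (2 / n_bins)                                  # upper edge of that bin in [-1, 1]
    }

    rescale_binned(c(0, 17500, 35000), lo = 0, hi = 35000)   # -0.99992  0.00008  1.00000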
Normalizing to [0, 1] would be my short-answer recommendation for combining the two values, as it maintains the shape of the distribution, as you mentioned, and should solve the problem of combining the values.
However, if the distributions of the two variables are different, which sounds likely, this won't really give you what I think you're after: a combined measure of where each value sits within its own distribution. You would have to come up with a metric that determines where in its distribution a value lies. This could be done in many ways; one would be to determine how many standard deviations away from the mean the given value is, and you could then combine those two scores in some way to get your index (simple addition may no longer be sufficient).
You'd have to work out what makes the most sense for the data sets you're looking at. Standard deviations may well be meaningless for your application, but the point is to look at statistical measures that relate to each distribution and combine those, rather than combining absolute values, normalized or not.
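For the "standard deviations from the mean" version, base R's scale() already does the per-distribution part; a tiny sketch, assuming in_degree and betweenness_centrality are the two metric vectors:

    # Score each value by how many standard deviations it sits from the mean
    # of its own metric (a z-score), then combine the two scores.
    combine_z <- function(a, b) as.numeric(scale(a)) + as.numeric(scale(b))

    # informal_power_index <- combine_z(in_degree, betweenness_centrality)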