TSFRESH: Get N most relevant features - tsfresh

Is there any way to get the N most relevant features in TSFRESH? Currently, the method extract_relevant_features has a parameter fdr_level, but for a big amount of time series (>1000), the function with a very low fdr_level parameter (< 0.01) returns more than 400 features. I would like to return the 20 or 40 most relevant features.

You could use the function calculate_relevance_table (link to the docu) (which is called internally in the select_features method, which in turn is called in the extract_relevant_features method) to get the p-value for each of the features and then only use the TOP-N sorted by p-value.
So the general flow would be:
extract all features with extract_features
call calculate_relevance_table
sort by p-value
get only the top N
You could even tell tsfresh the next time to only extract those features (to save a lot of computation time) following this.

Related

compute the tableau's nonbasic term in SCIP separator

In traditional Simplex Algorithm notation, we have x at the current basis selection B as so:
xB = AB-1b - AB-1ANxN. How can I compute the AB-1AN term inside a separator in SCIP, or at least iterate over its columns?
I see three helpful methods: getLPColsData, getLPRowsData, getLPBasisInd. I'm just not sure exactly what data those methods represent, particularly the last one, with its negative row indexes. How do I use those to get the value I want?
Do those methods return the same data no matter what LP algorithm is used? Or do I need to account for dual vs primal? How does the use of the "revised" algorithm play into my calculation?
Update: I discovered the getLPBInvARow and getLPBInvRow. That seems to be much closer to what I'm after. I don't yet understand their results; they seem to include more/less dimensions than expected. I'm still looking for understanding at how to use them to get the rays away from the corner.
you are correct that getLPBInvRow or getLPBInvARow are the methods you want. getLPBInvARow directly returns you a of the simplex tableau, but it is not more efficient to use than getLPBInvRow and doing the multiplication yourself since the LP solver needs to also compute the actual tableau first.
I suggest you look into either sepa_gomory.c or sepa_gmi.c for examples of how to use these methods. How do they include less dimensions than expected? They both return sparse vectors.

what if the FD steps varied w.r.t output/input

I am using the finite difference scheme to find gradients.
Lets say i have 2 outputs (y1,y2) and 1 input (x) in a single component. And in advance I know that the sensitivity of y1 with respect to x is not same as the sensitivity of y2 to x. And thus i could potentially have two different steps for those as in ;
self.declare_partials(of=y1, wrt=x, method='fd',step=0.01, form='central')
self.declare_partials(of=y2, wrt=x, method='fd',step=0.05, form='central')
There is nothing that stops me (algorithmically) but it is not clear what would openmdao gradient calculation exactly do in this case?
does it exchange information from the case where the steps are different by looking at the steps ratios or simply treating them independently and therefore doubling computational time ?
I just tested this, and it does the finite difference twice with the two different step sizes, and only saves the requested outputs for each step. I don't think we could do anything with the ratios as you suggested, as the reason for using different stepsizes to resolve individual outputs is because you don't trust the accuracy of the outputs at the smaller (or large) stepsize.
This is a fair question about the effect of the API. In typical FD applications you would get only 1 function call per design variable for forward and backward difference and 2 function calls for central difference.
However in this case, you have asked for two different step sizes for two different outputs, both with central difference. So here, you'll end up with 4 function calls to compute all the derivatives. dy1_dx will be computed using the step size of .01 and dy2_dx will be computed with a step size of .05.
There is no crosstalk between the two different FD calls, and you do end up with more function calls than you would have if you just specified a single step size via:
self.declare_partials(of='*', wrt=x, method='fd',step=0.05, form='central')
If the cost is something you can bear, and you get improved accuracy, then you could use this method to get different step sizes for different outputs.

How to improve igraph single_paths calculation efficiency

I'm using the function all_simple_paths from the igraph R package: (1) to generate the list of all simple paths in networks (object List_paths_Mp); and (2) to calculate the total number of simple paths (object n_paths).
I'm using the function in the form:
pathsMp <- unlist(lapply(V(graphMp), function(x)all_simple_paths(graphMp, from = x)), recursive =FALSE)
List_paths_Mp <- lapply(1:length(pathsMp), function(x)as_ids(pathsMp[[x]]))
n_paths<-length(List_paths_Mp)
Where:
Mp is a square matrix with either 1 or 0 values, and graphMp is the igraph graph objected obtained through the function graph_from_adjacency_matrix.
The function does what I need, but with the increase in the number of variables and interactions the processing time to identify and store the different single paths in the network grows too much and it takes very long to get the results.
In particular, using a network with 11 variables and 60 interactions, there is a total of 146338 possible simple paths. And this already takes a long time to compute. Using a bigger network, with 13 variables and 91 interactions, causes the program to take even longer times to process (after 2 hours the function still didn't run its course, and when called to stop it crashed R).
Is there a way to increase the efficiency of the task (i.e. to get results in a faster way)? Has anyone ever encountered a similar problem and found a solution? And, I know, I could use a CPU with higher processing power, but the point is to have the function to run efficiently (as much as possible) in a normal personal computer.
Edit: here I do the calculations from the graph object, but if someone else has any idea of doing the same from the adjacency matrix, I would welcome it too!

Turning a random number generating loop into a one equation?

Here's some pseudocode:
count = 0
for every item in a list
1/20 chance to add one to count
This is more or less my current code, but there could be hundreds of thousands of items in that list; therefore, it gets inefficient fast. (isn't this called like, 0(n) or something?)
Is there a way to compress this into one equation?
Let's look at the properties of the random variable you've described. Quoting Wikipedia:
The binomial distribution with parameters n and p is the discrete probability distribution of the number of successes in a sequence of n independent yes/no experiments, each of which yields success with probability p.
Let N be the number of items in the list, and C be a random variable that represents the count you're obtaining from your pseudocode. C will follow a binomial probability distribution (as shown in the image below), with p = 1/20:
The remaining problem is how to efficently poll a random variable with said probability distribution. There are a number of libraries that allow you to draw samples from random variables with a specified PDF. I've never had to implement it myself, so I don't exactly know the details, but many are open source and you can refer to the implementation for yourself.
Here's how you would calculate count with the numpy library in Python:
n, p = 10, 0.05 # 10 trials, probability of success is 0.05
count = np.random.binomial(n, p) # draw a single sample
Apparently the OP was asking for a more efficient way to generate random numbers with the same distribution this will give. I though the question was how to do the exact same operation as the loop, but as a one liner (and preferably with no temporary list that exists just to be iterated over).
If you sample a random number generator n times, it's going to have at best O(n) run time, regardless of how the code looks.
In some interpreted languages, using more compact syntax might make a noticeable difference in the constant factors of run time. Other things can affect the run time, like whether you store all the random values and then process them, or process them on the fly with no temporary storage.
None of this will allow you to avoid having your run time scale up linearly with n.

number of k when build a yai object in yaiImpute package

I'm trying to find the best number of nearest neighbor to use in mahalanobis method in yai function offered by yaImpute package (1.0-19). I tried to run the yai function with 'mal' method with different number of k:
mal<-yai(x=x,y=y,method="mahalanobis", k=5, noTrgs= FALSE, nVec=NULL, pVal=.05, ann=F)
mal<-yai(x=x,y=y,method="mahalanobis", k=20, noTrgs= FALSE, nVec=NULL, pVal=.05, ann=F)
But, when I'm looking the rmsd (root mean square distance) of each, there are exactly the same. The process found effectively the number of k that I asked (when I print the 'mal' object), but it seems like it does not use them.
My aim is to use AsciiGridImpute function to impute values on my entire map. But I don't understand what is the utility of the k number in my yai object. How the AscciGridImpute use them?
Thank you
Sorry for my bad english!!
I finally found why the RMSD was the same whatever the number of k used in the yai object.
The function rmsd.yai called automatically an other function call impute.yai. That function is supposed to allow methods as mean or dstWeighted to compute the imputed values for continuous variables. The use of these methods changed effectively the rmsd values depending on the number of k.
But the automatically call of this function (impute.yai) compute imputed values with the default method:closest. So only one k is used.
I think it's the same thing that happen with the function AsciiGridImpute.

Resources