Filter - Calculated fields relation in Tableau - count

I have 20 lists of servers. Suppose we have 50 servers, and every day (for 20 days) we get a list of the active servers.
Given these lists, I want to calculate the number of times each server has appeared. Suppose that Server1 has appeared in 16 out of these 20 lists. Here's how I'm doing it:
new calculated field: {FIXED [Server]:COUNT([Server])}
move this calculated field to columns
calculate CNTD (count distinct) and put it in rows
Here are the results:
Now here comes the question:
What if I want to draw the very same chart, but based only on the last 5 lists (the lists we've got in the last 5 days)? If I filter based on paths and keep only the last 5 lists, the numbers in the calculated field don't update: they are still 6, 8, ..., 16, even though there are only 5 lists (the maximum number of appearances should be 5). Any ideas?

Instead of using the FIXED level-of-detail (LOD) expression, use INCLUDE. In Tableau's order of operations, FIXED calculations are computed before dimension filters are applied, whereas INCLUDE/EXCLUDE calculations are computed after filtering.
{INCLUDE [Server]:COUNT([Server])}
The order-of-operations diagram in the Tableau online help shows where LOD calculations sit relative to filters.
See https://onlinehelp.tableau.com/current/pro/desktop/en-us/calculations_calculatedfields_lod_overview.html for more details.

Related

Need to get combinations of records from a data frame in R that satisfy a specific target

Let's say I have a data frame in R with 500 player records and the following columns:
PlayerID
TotalRuns
RunRate
AuctionCost
Now, out of the 500 players, I want my code to give me multiple combinations of 3 players that satisfy the following criteria (something like a Moneyball problem):
The sum of auction cost of all the 3 players shouldn't exceed X
They should have a minimum of Y TotalRuns
Their RunRate must be higher than the average run rate of all the players.
Kindly help with this. Thank you.
So there are choose(500,3) ways to choose 3 players, which is 20,708,500. It's not impossible to generate all these combinations; combn might do it for you, but I couldn't be bothered waiting to find out. If you do this with player IDs and then test your three conditions, that would be one way to solve your problem. An alternative would be a Monte Carlo method: select three players that initially satisfy your conditions, then randomly select another player who doesn't belong to the current trio; if swapping him in still satisfies the conditions, save the combination and repeat. If you're optimizing (it's not clear, but your question has optimization in the tag), then the new player has to result in a trio that's better than the last, so if he doesn't improve your objective function (whatever it might be), you don't accept the trade.
choose(500,3)
Shows there are almost 21,000,000 combinations of 3 players drawn from a pool of 500, which means a complete sweep of the entire search space ought to be doable in a reasonable time on a modern machine.
You can generate the indices of these combinations using iterpc() and getnext() from the iterpc package, as in:
library(iterpc)    # install.packages("iterpc") if not already installed
I <- iterpc(5, 3)  # small demo: 3-combinations from a pool of 5; use iterpc(500, 3) for the full player pool
getnext(I)         # returns the indices of the next combination, e.g. 1 2 3
You can also drastically cut the search space in a number of ways: by setting up initial filtering criteria, and/or by taking the first solution (a while loop whose condition is meeting the criteria). Or you can get and rank-order all of them (loop through all combinations), or something in between where you collect n solutions. Preprocessing can also help reduce the search space. For example, ordering by auction cost in ascending order first will give you the cheapest solutions first, and ordering by descending runs will give you the highest-run solutions first.
NOTE: While this works fine, iterpc is now superseded by the arrangements package, where the relevant iterator is icombinations(). getnext() is still the access method for stepping through the iterator.
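A minimal sketch of the brute-force filter-and-test approach in base R (the data frame players and the thresholds X and Y are placeholders, not values from the question):
# players: hypothetical data frame with columns PlayerID, TotalRuns, RunRate, AuctionCost
X <- 100                                   # placeholder: maximum combined auction cost
Y <- 500                                   # placeholder: minimum combined total runs
avg_rr <- mean(players$RunRate)            # average run rate over all players

idx <- combn(nrow(players), 3)             # one trio of row indices per column (~20.7M columns for 500 players)
keep <- apply(idx, 2, function(i) {
  trio <- players[i, ]
  sum(trio$AuctionCost) <= X &&
    sum(trio$TotalRuns) >= Y &&
    all(trio$RunRate > avg_rr)             # or mean(trio$RunRate) > avg_rr, depending on how criterion 3 is read
})
valid_trios <- idx[, keep, drop = FALSE]   # each column is one qualifying trio
Pre-filtering the players first (as the follow-up below describes) shrinks nrow(players) and makes the full enumeration far cheaper.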
Thanks, I used a combination of both John's and James's answers.
I filtered out all the players who don't satisfy the criteria, which boiled it down to only 90+ players.
Then I picked players at random until all the variations were exhausted.
Finally, I computed combined metrics for each variation (set) of players to arrive at the optimized set.
The code is a bit messy, so I won't post it here.

Using Predefined Splits in PCR function R PLS package

In order to ensure good population representation I have created custom validation sets from my training data. However, I am not sure how to interface this with pcr in R.
I have tried passing the segments argument an index for each sample, similar to a predefined-splits CV iterator in Python. It runs, but takes forever, so I feel I must be making an error somewhere:
pcr(y ~ X, scale = FALSE, data = tdata, validation = "CV", segments = test_fold)
where test_fold is a vector giving, for each sample, the index of the validation set it belongs to.
For example, if the training data is composed of 9 samples and I want to use the first three as the first validation set, and so on:
test_fold<-c(1,1,1,2,2,2,3,3,3)
This runs, but it is very slow, whereas regular "CV" runs in minutes. So far the results look okay, but I have over a thousand runs to do and it took an hour to get through one. If anybody knows how I can speed this up I would be grateful.
The segments parameter needs to be a list of vectors, one vector per validation segment. Going again with 9 samples, if I want the first three in the first validation set, the next three in the second, and so on, it should be:
test_vec <- list(c(1, 2, 3), c(4, 5, 6), c(7, 8, 9))
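For completeness, a minimal sketch of the resulting call (tdata, y and X are the question's own placeholders):
library(pls)
# test_vec as above: one vector per CV segment; the rows in a vector are held out together
fit <- pcr(y ~ X, scale = FALSE, data = tdata, validation = "CV", segments = test_vec)
summary(fit)   # reports cross-validated RMSEP per number of components, using the predefined segments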

Queries on the same big data dataset

Let's say I have a very big dataset (billions of records), one that doesn't fit on a single machine, and I want to serve multiple, unknown queries (it's a service where a user can choose a certain subset of the dataset and I need to return the max of that subset).
For the computation itself I was thinking about Spark or something similar. The problem is that I'm going to have a lot of IO/network activity, since Spark has to keep re-reading the dataset from disk and distributing it to the workers, instead of, for instance, dividing the data among the workers when the cluster goes up and then just asking each worker to do the work on certain records (by their number, for example).
So, to the big data people here, what do you usually do? Just have Spark redo the read and distribution for every request?
If I want to do what I said above, do I have no choice but to write something of my own?
If the queries are known but the subsets unknown, you could precalculate the max (or whatever the operator) for many smaller windows / slices of the data. This gives you a small and easily queried index of sorts, which might allow you to calculate the max for an arbitrary subset. In case a subset does not start and end neatly where your slices do, you just need to process the ‘outermost’ partial slices to get the result.
If the queries are unknown, you might want to consider storing the data in a MPP database or use OLAP cubes (Kylin, Druid?) depending on the specifics; or you could store the data in a columnar format such as Parquet for efficient querying.
Here's a precalculating solution based on the problem description in the OP's comment to my other answer:
Million entries, each has 3k name->number pairs. Given a subset of the million entries and a subset of the names, you want the average for each name for all the entries in the subset. So each possible subset (of each possible size) of a million entries is too much to calculate and keep.
Precalculation
First, we split the data into smaller 'windows' (shards, pages, partitions).
Let's say each window contains around 10k rows with roughly 20k distinct names and 3k (name,value) pairs in each row (choosing the window size can affect performance, and you might be better off with smaller windows).
Assuming ~24 bytes per name and 2 bytes for the value, each window contains 10k*3k*(24+2 bytes) = 780 MB of data plus some overhead that we can ignore.
For each window, we precalculate the number of occurrences of each name, as well as the sum of the values for that name. With those two values we can calculate the average for a name over any set of windows as:
Average for name N = (sum of sums for N)/(sum of counts for N)
Here's a small example with much less data:
Window 1
{'aaa':20,'abcd':25,'bb':10,'caca':25,'ddddd':50,'bada':30}
{'aaa':12,'abcd':31,'bb':15,'caca':24,'ddddd':48,'bada':43}
Window 2
{'abcd':34,'bb':8,'caca':22,'ddddd':67,'bada':9,'rara':36}
{'aaa':21,'bb':11,'caca':25,'ddddd':56,'bada':17,'rara':22}
Window 3
{'caca':20,'ddddd':66,'bada':23,'rara':29,'tutu':4}
{'aaa':10,'abcd':30,'bb':8,'caca':42,'ddddd':38,'bada':19,'tutu':6}
The precalculated Window 1 'index' with sums and counts:
{'aaa':[32,2],'abcd':[56,2],'bb':[25,2],'caca':[49,2],'ddddd':[98,2],'bada':[73,2]}
This 'index' will contain around 20k distinct names and two values for each name, or 20k*(24+2+2 bytes) = 560 KB of data. That's one thousand times less than the data itself.
Querying
Now let's put this in action: given an input spanning 1 million rows, you'll need to load (1M/10k)=100 indices or 56 MB, which fits easily in memory on a single machine (heck, it would fit in memory on your smartphone).
But since you are aggregating the results, you can do even better; you don't even need to load all of the indices at once, you can load them one at a time, filter and sum the values, and discard the index before loading the next. That way you could do it with just a few megabytes of memory.
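A rough R sketch of that one-index-at-a-time pass (load_index() and the file naming are hypothetical; each window index is assumed to be a data frame with columns name, sum and count):
# Hypothetical loader: reads the precomputed (sum, count) index for one window.
load_index <- function(window_id) readRDS(sprintf("index_%06d.rds", window_id))

average_for_names <- function(window_ids, wanted_names) {
  totals <- setNames(numeric(length(wanted_names)), wanted_names)
  counts <- setNames(numeric(length(wanted_names)), wanted_names)
  for (w in window_ids) {
    idx <- load_index(w)                        # load one index at a time
    idx <- idx[idx$name %in% wanted_names, ]    # keep only the requested names
    nm  <- as.character(idx$name)
    totals[nm] <- totals[nm] + idx$sum
    counts[nm] <- counts[nm] + idx$count
  }                                             # each index is discarded before the next is loaded
  totals / counts                               # average per name over the selected windows
}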
More importantly, the calculation should take no more than a few seconds for any set of windows and names. If the names are sorted alphabetically (another worthwhile pre-optimization) you get the best performance, but even with unsorted lists it should run more than fast enough.
Corner cases
The only thing left to do is handle the case where the input span doesn't line up exactly with the precalculated windows. This requires a little bit of logic for the two 'ends' of the input span, but it can be easily built into your code.
Say each window contains exactly one week of data, from Monday through Sunday, but your input specifies a period starting on a Wednesday. In that case you would have to load the actual raw data from Wednesday through Sunday of the first week (a few hundred megabytes as we noted above) to calculate the (count,sum) tuples for each name first, and then use the indices for the rest of the input span.
This does add some processing time to the calculation, but with an upper bound of 2*780MB it still fits very comfortably on a single machine.
At least that's how I would do it.

A very complex combinations task that has been bugging me for 7 months

Some seven months ago I went to a job interview at a very big company. They gave me this task to solve, and for the past 7 months I haven't been able to find a solution.
Here is the task:
Some database has A entries. How many combinations (without repetition) with B amount (B < A) elements made out of A are there, that for any given B (contained in those A) different elements always contain at least X% of C entries (C < B) out of B given (C/B)? Include pattern for obtaining all of them. In short, we need:
Formula for calculation of how many combinations satisfy the above conditions
Formula for listing all of them. (any programming language or just detailed descriptive and any format)
Note: Both are mandatory, as those need to be set in a separate table in db.
After 2 hours of being totally clueless I was given a simplified version:
Some database has 50 entries. How many combinations (without repetition) with 9 elements made out of those 50 are there, that for any given 9 different elements (contained in those 50) always contain at least 15% of 6 entries out of given 9 (6/9)? Include pattern for obtaining all of them. In short, we need:
Formula for calculation of how many combinations satisfy the above conditions
Formula for listing all of them. (any programming language or just detailed descriptive and any format)
Note: Both are mandatory, as those need to be set in a separate table in db.
Edit: To explain further. Let us say the result of (1.) is D possible subsets (combinations without repetition) with 9 elements from A. Some user of the database (or software using it) enters a random 9 elements (from the |A| = 50 set). This always needs to result in at least 15% of those D subsets containing 6 out of the 9 that the user entered.
It doesn't matter how many of those D have 1/9, 2/9, 3/9, 4/9, 5/9, 7/9, 8/9 or 9/9; the only thing that matters is that 15% and above have 6/9, for any 9 out of the 50 entered. Oh, and D needs to be the minimal possible.
Edit 2: Even further. Example: a set of A = 50 entries is given. We need the minimal number of combinations/subsets without repetition with B = 9 elements from those 50 that satisfy the following: when a user enters a random 9 entries, 15%+ of the resulting subsets must contain 6 out of the 9 that the user entered. And the resulting subsets must be uniform for any 9 the user can enter.
And I still failed. And I am still clueless of how to solve something like this.
Let me explain the simplified version: let's name your database A, with |A| = 50 elements in it. Now, 6 of these 50 elements are special somehow and we want to keep track of them; we call the set of these 6 elements C.
Now to our job: we should count all subsets X of A with exactly 9 elements, of which at least 15% come from C. Since 15% of 9 is 1.35, we need at least 2 elements of C in each set X.
We know that there are binomial(50,9) = 2505433700 subsets of A with 9 elements. Now let's count how many of them violate your criterion: there are 44 elements of A which are not in C, so there are binomial(44,9) subsets of A that contain no element of C. Next we count how many 9-element subsets of A contain exactly one element of C: we take an 8-element subset of A without C and add exactly one element of C to it, which gives 6*binomial(44,8) possibilities.
Now we can write our result by taking all 9-element subsets of A and subtracting those that violate your criterion:
binomial(50,9) - binomial(44,9) - 6*binomial(44,8) = 733107430.
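A quick check in R, using choose() for the binomial coefficient:
choose(50, 9) - choose(44, 9) - 6 * choose(44, 8)
# [1] 733107430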
OK... now we know how many there are. But how do we list them?
Let's do it with a short R sketch (the pseudo-code made runnable):
# A: the 50 entries, C: the 6 special ones, e.g. A <- 1:50; C <- 1:6
AminC <- setdiff(A, C)                          # the 44 elements of A not in C
for (j in 2:6) {                                # j = number of elements taken from C
  X1s <- combn(C, j, simplify = FALSE)          # all j-subsets of C
  X2s <- combn(AminC, 9 - j, simplify = FALSE)  # all (9-j)-subsets of A \ C
  for (X1 in X1s)
    for (X2 in X2s)
      print(c(X1, X2))                          # one valid 9-element subset
}
This algorithm yields an alternative way of counting your sets:
binomial(6,2)*binomial(44,7) +...+ binomial(6,6)*binomial(44,3)=733107430.
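The same alternative count can be verified in R with one line:
sum(choose(6, 2:6) * choose(44, 9 - (2:6)))
# [1] 733107430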
Hope this helps.

A Neverending cforest

How can I decouple the time cforest/ctree takes to construct a tree from the number of columns in the data?
I thought the option mtry could be used to do just that, i.e. the help says:
number of input variables randomly sampled as candidates at each node for random forest like algorithms.
But while that does randomize the output trees it doesn't decouple the CPU time from the number of columns, e.g.
p <- proc.time()
ctree(gs.Fit ~ .,
      data = Aspekte.Fit[, 1:60],
      controls = ctree_control(mincriterion = 0,
                               maxdepth = 2,
                               mtry = 1))
proc.time() - p
takes twice as long as the same with Aspekte.Fit[,1:30] (btw. all variables are boolean). Why? Where does it scale with the number of columns?
As I see it the algorithm should:
At each node randomly select two columns.
Use them to split the response. (no scaling because of mincriterion=0)
Proceed to the next node (for a total of 3 due to maxdepth=2)
without being influenced by the column total.
Thanks for pointing out the error of my ways.
