Ab Initio graph: partitioning by key behavior with Replicate - bigdata

I am asking myself a question concerning partitioning by key when a Replicate component is involved.
Let's suppose I have a flow F which is replicated X times.
Each replicated flow is then joined on the same key, but with a different dataset each time.
I want the joins to run in a parallel layout. For this particular case, do I need X "Partition by Key" components (one per Replicate output), or can I put a single one at the input of the Replicate?
TL;DR:
Is this graph
https://ibb.co/hHmk5e
equivalent to
https://ibb.co/i2NNJz
supposing all joins occur on the same key?
Thank you,

Use Replicate into multiple Partition by Key components. Pay attention to the checkpoints: if you have 3 checkpoints after the Replicate, consider removing them and placing a single checkpoint before the Replicate.

Related

Need to get combinations of records from a Data Frame in R that satisfy a specific target

Let's say I have a Data Frame in R with 500 player records and the following columns:
PlayerID
TotalRuns
RunRate
AutionCost
Now out of the 500 players, I want my code to give me multiple combinations of 3 players that would satisfy the following criteria. Something like a Moneyball problem.
The sum of the auction costs of the 3 players shouldn't exceed X.
They should have a minimum of Y TotalRuns
Their RunRate must be higher than the average run rate of all the players.
Kindly help with this. Thank you.
So there are choose(500,3) ways to choose 3 players, which is 20,708,500. It's not impossible to generate all of these combinations; combn() might do it for you, but I couldn't be bothered waiting to find out. If you do this with player IDs and then test your three conditions, that would be one way to solve your problem. An alternative would be a Monte Carlo method: select three players that initially satisfy your conditions, then randomly select another player who doesn't belong to the current trio; if he satisfies the conditions, save the combination and repeat. If you're optimizing (it's not clear, but your question has optimization in the tag), then the new player has to result in a trio that's better than the last, so if he doesn't improve your objective function (whatever it might be), you don't accept the trade.
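A rough sketch of that Monte Carlo idea, assuming a data frame called players with the columns listed above and hypothetical limits X and Y (none of these names come from actual code in the question):
# Sketch only: players, X and Y are assumed to exist; adjust the criteria as needed.
avg_rr <- mean(players$RunRate)
valid <- function(trio) {
  sum(trio$AutionCost) <= X &&
    sum(trio$TotalRuns) >= Y &&
    all(trio$RunRate > avg_rr)        # interpreting "their RunRate" per player
}
repeat {                              # draw until an initial valid trio is found
  trio_ids <- sample(players$PlayerID, 3)
  if (valid(players[players$PlayerID %in% trio_ids, ])) break
}
found <- list(trio_ids)
for (k in 1:10000) {                  # number of random swaps to attempt
  newcomer <- sample(setdiff(players$PlayerID, trio_ids), 1)
  new_ids  <- c(sample(trio_ids, 2), newcomer)   # replace one member at random
  if (valid(players[players$PlayerID %in% new_ids, ])) {
    trio_ids <- new_ids
    found[[length(found) + 1]] <- new_ids
  }
}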
choose(500,3)
Shows there are almost 21,000,000 combinations of 3 players drawn from a pool of 500, which means a complete analysis of the entire search space ought to be doable in a reasonable time on a modern machine.
You can generate the indices of these combinations using iterpc() and getnext() from the iterpc package. As in
library(iterpc)        # load the iterpc package if not already loaded
I <- iterpc(500, 3)    # iterator over all 3-from-500 combinations
getnext(I)             # returns the next set of indices, starting with 1 2 3
You can also drastically cut the search space in a number of ways, by setting up initial filtering criteria and/or by taking the first solution (a while loop whose condition is meeting the criteria). Or you can get and rank-order all of them (loop through all combinations), or something in between where you stop after n solutions. Preprocessing can also help reduce the search space: for example, ordering salaries in ascending order first will give you the cheapest solution first, and ordering the file by descending runs will give you the highest-runs solutions first.
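For example, a sketch of the pre-filtering and ordering idea (players, X and Y are again assumed names, not taken from the question's code):
avg_rr <- mean(players$RunRate)
pool   <- players[players$RunRate > avg_rr, ]   # drop players failing the per-player criterion
pool   <- pool[order(pool$AutionCost), ]        # cheapest solutions tend to surface first
combos <- combn(pool$PlayerID, 3, simplify = FALSE)
keep   <- Filter(function(ids) {
  trio <- pool[pool$PlayerID %in% ids, ]
  sum(trio$AutionCost) <= X && sum(trio$TotalRuns) >= Y
}, combos)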
NOTE: While this works fine, I see that iterpc is now superseded by the arrangements package, where the relevant iterator is icombinations(). getnext() is still the access method for the resulting iterators.
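A minimal sketch with arrangements, under the same assumption of a 500-player pool:
library(arrangements)
it <- icombinations(500, 3)   # iterator over all 3-from-500 index combinations
it$getnext()                  # first combination of indices: 1 2 3
it$getnext()                  # next one: 1 2 4, and so on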
Thanks, I used a combination of both John's and James's answers.
I filtered out all the players who don't satisfy the criteria, which boiled the pool down to just over 90 players.
Then I picked players at random until all the variations were exhausted.
Finally, I computed combined metrics for each variation (set) of players to arrive at the optimized set.
The code is a bit messy, so I won't post it here.

Randomised matching methodology in R

I am looking to run a model that randomly matches 2 datasets (person and vacancy) based on matching characteristics of both datasets.
A person can have a role type, location, and other attributes, and the vacancies will be looking for these characteristics.
The current methodology uses a for loop to work through the vacancies, subset the person table based on the matching characteristics, and randomly pick a person.
Rough outline of current code:
for (i in 1:nrow(Vacancy)) {
  individual_vacancy <- Vacancy[i, ]
  # subset to people matching this vacancy's characteristics who are still available
  available_person <- person[<matching conditions from individual_vacancy> & person$Available == 1, ]
  chosen <- sample(available_person$personid, 1)    # random pick among the matches
  Vacancy$personid[i] <- chosen
  person$Available[person$personid == chosen] <- 0  # mark that person as taken
}
This is very slow and computationally expensive due to the size of the dataset and, from what I can tell, the looping and writing back to the original datasets.
Are there any methodologies for this kind of problem/ R packages I'd be able to take advantage of?
Edit: This is a 1:1 matching problem, which is where the difficulty lies, i.e. one person must be allocated to exactly one vacancy. The Available flag update ensures this at the moment.

Multiplying two complex SeriesLists by a specific node

Let's say I have 4 different SeriesLists:
foo.total
foo.succesful
bar.total
bar.succesful
I have generated a complex Graphite query for both of them, so it goes like:
function1(function2(foo.total))
function3(function4(foo.succesful))
I want to multiply these by each other. Well, that's not very difficult:
multiplySeries(function1(function2(foo.total)),function3(function4(foo.succesful)))
This draws one graph and works as intended.
The problem I am facing is when trying to wildcard the foo part, so I can do *.total. In this case I want to draw 2 graphs, because there are 2 wildcarded variables.
So my question is, how can I generalize the above query to not only work with foo but with n number of variables?
Thank you!
You need some kind of template function; the only one Graphite provides is applyByNode, and it should be sufficient in your case. Here % is replaced by the first node of each matched series (foo, bar, ...), and f1 to f4 stand for your function1 to function4:
applyByNode(
  *.{total,succesful},
  0,
  'multiplySeries(f1(f2(%.total)), f3(f4(%.succesful)))',
  '% some name'
)
The last argument is the optional newName template for the resulting series.

data.table and hash -- speed and flexibility to handle multiple values per key

I have 2 questions:
Is hash faster than data.table for Big Data?
How can I deal with multiple values per key, if I want to use a hash-based approach?
I looked at the vignette of the related packages and Googled some potential solutions, but I'm still not sure about the answers to the questions above.
Considering the following post,
R fast single item lookup from list vs data.table vs hash
it seems that a single lookup in a data.table object is actually quite slow, even slower than in a list in Base R?
However, a lookup using a hash object from the hash package is very speedy based on that benchmark; is that accurate?
It also looks like a hash object handles only unique keys?
In the following only 2 (key,value) pairs are created.
library(hash)
> h <- hash(c("A","B","A"),c(1,2,3))
> h
<hash> containing 2 key-value pair(s).
A : 3
B : 2
So, if I have a table of (key, value) pairs where a key can have several values, and I want to do a (quick) lookup of the values corresponding to a key, what is the best object/data structure in R to do that?
Can we still use a hash object, or is data.table the most appropriate in this case?
Let's say we are in the context of dealing with a problem involving very large tables; otherwise this discussion is irrelevant.
Related link:
http://www.r-bloggers.com/hash-table-performance-in-r-part-i/
You're referring to the question in the SO post you link to, not the answer.
As you'll see in the answer there, the results you get from the benchmark can change a lot depending on how you utilize data.table, or any given big data package.
You are correct that with the simplest use of hash() each key has one value. There are, of course, work-arounds for this. One would be to make the value a string and append your multiple values to it:
h <- hash(c("Key 1","Key 2","Key 3"),c("1","2","1 and 2"))
h
<hash> containing 3 key-value pair(s).
Key 1 : 1
Key 2 : 2
Key 3 : 1 and 2
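Another possibility, if I'm not mistaken about the hash package, is that values can be arbitrary R objects, so each key can simply hold a vector of its values (a sketch, not part of the original answer):
library(hash)
h <- hash()
h[["A"]] <- c(1, 3)   # one key, several values stored as a vector
h[["B"]] <- 2
h[["A"]]              # returns 1 3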
Another may be to use a hash table via a hashed environment in R, or perhaps via hashmap().
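A minimal sketch of the hashed-environment route (the keys and values here are made up):
e <- new.env(hash = TRUE)          # R environments are backed by a hash table
assign("A", c(1, 3), envir = e)    # multiple values per key, stored as a vector
assign("B", 2, envir = e)
get("A", envir = e)                # fast lookup by key, returns 1 3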
I do not know that there is a single, definitive proof that hash or data.table will always be faster. It could always vary by your use case, data, and how you implement them in your code.
In general, I'd say data.table is the more common choice when your use case is not a strict one-value-per-key lookup, since no work-around is needed to handle multiple values per key.
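For comparison, a keyed data.table handles repeated keys without any work-around; a lookup uses binary search and returns every row for that key (a sketch with made-up data):
library(data.table)
dt <- data.table(k = c("A", "B", "A"), value = c(1, 2, 3))
setkey(dt, k)     # sort and index by the key column
dt["A"]           # binary-search lookup; returns both rows for key "A"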

Wrapping Tableau R Calls

According to Tableau there is a way to optimize R querying by addressing the partitioning of data: http://kb.tableau.com/articles/knowledgebase/r-implementation-notes
The solution is not clear to me. Does anyone know of an example of this? I would love to see how it works.
The recommendation is to pass the values as a vector (a column/row in Tableau) instead of a single cell, in order to reduce the number of Rserve calls. If your table in Tableau is structured to do calculations along cells, each cell becomes a partition, and to compute the result of a calculation that would apply to a whole column, Tableau calls Rserve once for each cell.
Here's what is happening (from the official documentation):
If your table calculations are set to Cell, Tableau makes one call to Rserve per partition:
Cell
This option sets the addressing to the individual cells in the table.
All fields become partitioning fields. This option is generally most
useful when computing a percent of total calculation.
Instead of a call for every row / column:
Optimizing R scripts
SCRIPT_ functions in Tableau are table calculations functions, so
addressing and partitioning concepts apply. Tableau makes one call to
Rserve per partition. Because connecting to Rserve involves some
overhead, try to pass values as vectors rather than as individual
values whenever possible. For example if you set addressing to Cell
(that is, Set Calculate the difference along in the Table Calculation
dialog box to Cell), Tableau will make a separate call per row to
Rserve; depending on the size of the data, this can result in a very
large number of individual Rserve calls. If you instead use a column
that identifies each row that you would use in level of detail, you
could "compute along" that column so that Tableau could pass those
values in a single call.
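As an illustration of the difference (the field and measure names below are made up, not taken from the article), a table calculation such as:
// hypothetical calculated field "R Avg Sales"
SCRIPT_REAL("mean(.arg1)", SUM([Sales]))
makes one Rserve call per partition. With addressing set to Cell, every mark is its own partition and .arg1 arrives as a length-one vector; computing along a row-identifying dimension instead sends the whole column of SUM([Sales]) values to R in a single call.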
