According to tableau there is a way to optimize R querying.
By addressing the partitioning of data: http://kb.tableau.com/articles/knowledgebase/r-implementation-notes
The solution is not clear to me. Does anyone know of an example of this as I would love to see how this works
The recommendation is to pass the values as a vector (column / row in Tableau) instead of a single cell in order to reduce the number of RServe calls. If your table in Tableau is structured to do calculations along cells, each cell becomes a partition. In order to compute the result of a calculation that would apply to a whole column, Tableau calls Rserve for each cell.
Here's what is happening (from the official documentation):
If your table calculations are set to Cell, Tableau makes one call to Rserve per partition:
Cell
This option sets the addressing to the individual cells in the table.
All fields become partitioning fields. This option is generally most
useful when computing a percent of total calculation.
Instead of a call for every row / column:
Optimizing R scripts
SCRIPT_ functions in Tableau are table calculations functions, so
addressing and partitioning concepts apply. Tableau makes one call to
Rserve per partition. Because connecting to Rserve involves some
overhead, try to pass values as vectors rather than as individual
values whenever possible. For example if you set addressing to Cell
(that is, Set Calculate the difference along in the Table Calculation
dialog box to Cell), Tableau will make a separate call per row to
Rserve; depending on the size of the data, this can result in a very
large number of individual Rserve calls. If you instead use a column
that identifies each row that you would use in level of detail, you
could "compute along" that column so that Tableau could pass those
values in a single call.
Related
I connect Tableau to R and execute an R function for recommending products. When R ends, the return value is a string which will have all products details, like below:
ID|Existing_Prod|Recommended_Prod\nC001|NA|PROD008\nC002|PROD003|NA\nF003|NA|PROD_ABC\nF004|NA|PROD_ABC1\nC005|PROD_ABC2|NA\nC005|PRODABC3|PRODABC4
(Each line separated by \n indicating end of line)
On Tableau, I display the calculated field which is as below:
ID|Existing_Prod|Recommended_Prod
C001|NA|PROD008
C002|PROD003|NA
F003|NA|PROD_ABC
F004|NA|PROD_ABC1
C005|PROD_ABC2|NA
C005|PRODABC3|PRODABC4
Above data reaches Tableau through a calculated field as a single string which I want to split based on pipeline ('|'). Now, I need to split this into three columns, separated by the pipeline.
I used Split function on the calculated field :
SPLIT([R_Calculated_Field],'|',1)
SPLIT([R_Calculated_Field],'|',2)
SPLIT([R_Calculated_Field],'|',3)
But the error says "SPLIT function cannot be applied on Table calculations", which is self explanatory. Are there any alternatives to solve this ?? I googled to check for best practices to handle integration between R and Tableau and all I could find was simple kmeans clustering codes.
Make sure you understand how partitioning and addressing work for table calcs. Table calcs pass vectors of arguments to the R script, and receive a single vector in response. The cardinality of those vectors depends on the partitioning of the table calc. You can view that by editing the table calc, clicking specific dimensions. The fields that are not checked determine the partitioning - and thus the cardinality of the arguments you send and receive from R
This means it might be tricky to map your problem onto this infrastructure. Not necessarily impossible. It was designed to send a series of vector arguments with one cell per partitioning dimension, say, Manufacturer and get back one vector with one result per Manufacturer (or whatever combination of fields partition your data for the table calc). Sounds like you are expecting an arbitrary length list of recommendations. It shouldn’t be too hard to have your R script turn the string into a vector before returning, but the size of the vector has to make sense.
As an example of an approach that fits this model more easily, say you had a Tableau view that had one row per Product (and you had N products) - and some other aggregated measure fields in the view per Product. (In Tableau speak, the view’s level of detail is at the Product level.)
It would be straightforward to pass those measures as a series of argument vectors to R - each vector having N values, and then have R return a vector of reals of length N where the value returned at each location was a recommender score for the product at that position. (Which is why the ordering aka addressing of the vectors also matters)
Then you could filter out low scoring products from the view and visually distinguish highly recommended products.
So the first step to understanding R integration is to understand how table calcs operate with partitioning and addressing and to think in terms of vectors of fixed lengths passed in both directions.
If this model doesn’t support your use case well, you might be able to do something useful with URL actions or the JavaScript API.
Lets say I have a very big dataset (billions of records), one that doesnt fit on a single machine and I want to have multiple unknown queries (its a service where a user can choose a certain subset of the dataset and I need to return the max of that subset).
For the computation itself I was thinking about Spark or something similar, problem is Im going to have a lot of IO/network activity since Spark is going to have to keep re-reading the data set from the disk and distributing it to the workers, instead of, for instance, having Spark divide the data among the workers when the cluster goes up and then just ask from each worker to do the work on certain records (by their number, for example).
So, to the big data people here, what do you usually do? Just have Spark redo the read and distribution for every request?
If I want to do what I said above I have no choice but to write something of my own?
If the queries are known but the subsets unknown, you could precalculate the max (or whatever the operator) for many smaller windows / slices of the data. This gives you a small and easily queried index of sorts, which might allow you to calculate the max for an arbitrary subset. In case a subset does not start and end neatly where your slices do, you just need to process the ‘outermost’ partial slices to get the result.
If the queries are unknown, you might want to consider storing the data in a MPP database or use OLAP cubes (Kylin, Druid?) depending on the specifics; or you could store the data in a columnar format such as Parquet for efficient querying.
Here's a precalculating solution based on the problem description in the OP's comment to my other answer:
Million entries, each has 3k name->number pairs. Given a subset of the million entries and a subset of the names, you want the average for each name for all the entries in the subset. So each possible subset (of each possible size) of a million entries is too much to calculate and keep.
Precalculation
First, we split the data into smaller 'windows' (shards, pages, partitions).
Let's say each window contains around 10k rows with roughly 20k distinct names and 3k (name,value) pairs in each row (choosing the window size can affect performance, and you might be better off with smaller windows).
Assuming ~24 bytes per name and 2 bytes for the value, each window contains 10k*3k*(24+2 bytes) = 780 MB of data plus some overhead that we can ignore.
For each window, we precalculate the number of occurrences of each name, as well as the sum of the values for that name. With those two values we can calculate the average for a name over any set of windows as:
Average for name N = (sum of sums for N)/(sum of counts for N)
Here's a small example with much less data:
Window 1
{'aaa':20,'abcd':25,'bb':10,'caca':25,'ddddd':50,'bada':30}
{'aaa':12,'abcd':31,'bb':15,'caca':24,'ddddd':48,'bada':43}
Window 2
{'abcd':34,'bb':8,'caca':22,'ddddd':67,'bada':9,'rara':36}
{'aaa':21,'bb':11,'caca':25,'ddddd':56,'bada':17,'rara':22}
Window 3
{'caca':20,'ddddd':66,'bada':23,'rara':29,'tutu':4}
{'aaa':10,'abcd':30,'bb':8,'caca':42,'ddddd':38,'bada':19,'tutu':6}
The precalculated Window 1 'index' with sums and counts:
{'aaa':[32,2],'abcd':[56,2],'bb':[25,2],'caca':[49,2],'ddddd':[98,2],'bada':[73,2]}
This 'index' will contain around 20k distinct names and two values for each name, or 20k*(24+2+2 bytes) = 560 KB of data. That's one thousand times less than the data itself.
Querying
Now let's put this in action: given an input spanning 1 million rows, you'll need to load (1M/10k)=100 indices or 56 MB, which fits easily in memory on a single machine (heck, it would fit in memory on your smartphone).
But since you are aggregating the results, you can do even better; you don't even need to load all of the indices at once, you can load them one at a time, filter and sum the values, and discard the index before loading the next. That way you could do it with just a few megabytes of memory.
More importantly, the calculation should take no more than a few seconds for any set of windows and names. If the names are sorted alphabetically (another worthwhile pre-optimization) you get the best performance, but even with unsorted lists it should run more than fast enough.
Corner cases
The only thing left to do is handle the case where the input span doesn't line up exactly with the precalculated windows. This requires a little bit of logic for the two 'ends' of the input span, but it can be easily built into your code.
Say each window contains exactly one week of data, from Monday through Sunday, but your input specifies a period starting on a Wednesday. In that case you would have to load the actual raw data from Wednesday through Sunday of the first week (a few hundred megabytes as we noted above) to calculate the (count,sum) tuples for each name first, and then use the indices for the rest of the input span.
This does add some processing time to the calculation, but with an upper bound of 2*780MB it still fits very comfortably on a single machine.
At least that's how I would do it.
I am benchmarking spark in R via "sparklyr" and "SparkR". I test different functions on different Testdata. In two particular cases, where I count the amount of zeros in a column and the amount of NA's in a column, I realized that no matter how big the data is, the result is there in less than a second. All the other computations scale with the size of the data.
So I don't think that Spark computes anything there, but that those cases are stored somewhere in the meta data, and that it computed these results while loading the data. I tested my functions and they always give me the right result.
Can anyone confirm whether the number of zeros and number of nulls in a column is stored in a dataframe's metadata, and if not, why does it return so quickly with the correct value?
There is no metadata associated to a Spark DataFrame that would contain columnar data; therefore, my guess is that the performance difference you measured is caused by something else. Hard to tell without a reproducible example.
I have extensively read and re-read the Troubleshooting R Connections and Tableau and R Integration help documents, but as a new Tableau user they just aren't helping me.
I need to be able to calculate Kaplan-Meier survival probabilities across any dimensions that are dragged onto the sheet. Ideally, I would be able to retrieve this in a tabular format at multiple time points, but for now, I would be happy just to get it at a single time point.
My data in Tableau have columns for [event-boolean] and [time to event]. Let's say I also have columns for Gender and District.
Currently, I have a calculated field [surv] as:
SCRIPT_REAL('
library(survival);
fit <- summary(survfit(Surv(.arg2,.arg1) ~ 1), times=365);
fit$surv'
, min([event-boolean])
, min([time to event])
)
I have messed with Computed Using, Addressing, Partitions, Aggregate Measures, and parameters to the R function, but no combination I have tried has worked.
If [District] is in Columns, do I need to change my SCRIPT_REAL call or do I just need to change some other combination of levers?
I used Andrew's solution to solve this problem. Essentially,
- Turn off Aggregate Measures
- In the Measure Values shelf, select Compute Using > Cell
- In the calculated field, start with If FIRST() == 0 script_*() END
- Ctrl+drag the measure to the Filters shelf and use a Special > Non-null filter.
I want to test some of the newer sparse linear solvers and I want to know if there is a fast way of filling in the matrix. The format I'm interested is CSR (http://goo.gl/hLXYd). Let's say the matrix, in CSR format, is given by:
values(num non-zero elements)
columns(num non-zero elements)
rowIndex(num rows + 1)
The sparse matrix under consideration derives from networks. So, I have thousands of nodes and some of them are connected between them by lines. So, the matrix is structurally symmetric. Each connection (i,j) adds something to the diagonal terms (i,i) and (j,j) and to the off-diagonal (i,j) and (j,i). I could have several connections between the same nodes (i,j,1), (i,j,2)... So, I might need to revisit the 2 diagonal and 2 off-diagonal elements more than once.
I know I can get the beginning of the row by doing rowIndex(i). Then, I would have to run through the elements columns(rowIndex(i):rowIndex(i+1)-1) to find where is j situated.
The question:
Is there a way of accessing the elements faster, while in CSR format, without having to do a search every time I want to update an element?
Some clarifications:
I just need to fill in the matrix from scratch. The matrix is structurally symmetric and not really symmetric. The values saved have to do with network data (impedances, resistances etc), they have real values. In general Value(i,j)<>Value(j,i). I have tuples of the form (name1,i1,j1,value1), (name2,i2,j2,value2) etc. These tuples are not sorted, and 2 tuples can refer to the same i,j values, meaning they need to be added
Thanks in advance!
What you have is so called triplet sparse format. Creation of CRS, including removing duplicate entries and summing the values, can be implemented very efficiently. Before programing it yourself, have a look at the SuiteSparse library. It is written in C, but I'm sure you will understand the principle. What interests you is the cholmod_triplet.c file, which implements the functionality you need.
Essentially, the conversion is performed using two phase bucket sort on your row and column indices. This algorithm has linear complexity, which is important if you are interested in processing large data sets.
Edit If you want to skip explicit creation of the triplet format all together, you can do that by generating the (row, col) connectivities on the fly and adding them to a dynamic sparse structure. I usually do it using insertion sort and sorted lists, which is in practice the fastest. It is also faster than triplet to CRS conversion, and uses much less memory. The method goes as follows:
if you know approximately, how many non-zero entries there are in every row, for every row you pre-allocate an array of (empty) column indices, and a separate array for the values (not linked list, but a simple array) of that size. Something like
static_lists_cols[row] = malloc(sizeof(int)*expected_number_of_non_zeros)
static_lists_vals[row] = malloc(sizeof(double)*expected_number_of_non_zeros)
If you do not know that, you choose an initial size and reallocate as needed (using some block size large enough to avoid reallocation overhead) when the row lists are full.
for every (row, col) pair you insert the col into the sorted list corresponding to row using insertion sort. For small (up to a few hundred) non-zeros per row linear search is the fastest. For larger number of non-zeros per row you can use bisection to locate the correct place to insert the col index.
col is inserted into rowth sorted list by moving the non-zero entries with higher column index in the sorted list. This is cache-friendly, since the rows are in practice small enough to fit into any cache nowadays.
After you finish you need to assemble the individual sorted lists into a valid CRS structure by copying the individual row lists into the final columns. The same with values.
You could actually avoid the last step by pre-allocating a static 'array of lists' if you are ok that some of the rows can have zero entries. You will hence have a constant number of entries per row, some of which might be zero. Sometimes that is ok.
This method is faster than using triplet to sparse conversion, at least for FEM models, for which I use it. The general reason is that memory bandwidth is the bottleneck here, and the above scheme uses much less memory:
creating the triplet format takes time, and you need to write the triplets to memory
conversion to CRS requires reading and writing the triplets at least once to sort them (actually a bit more than once, if you look at the algorithm. You sort twice, and you need auxiliary data structures.)
depending on the connectivity structure, you may end up having a large number of (row, col) duplicates in the triplet format, which are removed during the assembly by adding the corresponding values. This overhead does not exist in the method above - if the col already exists in the row list, you simply update the corresponding value.
updating the sorted lists can be done in parallel if you assign row ranges to individual workers. No communication, nor synchronization is needed. Assuring load balancing is another story...
Have a look at a performance comparison of using those two methods (Figure 1) for triangular elements in 2D. Note that the performance difference depends on the ratio of the number of entries in the triplet to assembled sparse matrix format (Table 2). But in general, the method is never worse than triplet to crs conversion, and triplets need to be created in the first place. You can also download a MATLAB MEX function sparse_create, which is a part of mutils package (see the downloads section).
Your question seems to confuse 2 rather different questions:
What is a fast way of creating a matrix in CSR form ?
Is there a faster way of reading values from a matrix already stored in CSR form ? (Faster, that is, than the straightforward approach you describe)
So here are 2 answers:
In general, read the network data from whatever form it is in into something like a dictionary of keys (other intermediate forms are available and may be more appealing to you for speed or other reasons); then turn that intermediate structure into the CSR form of the matrix. More on this below.
I don't believe so, not with a matrix stored in CSR form. This relative slowness of access is part of the price you pay for saving space. You've traded time for space, or space for time, depending on your point of view.
Your description of your input data suggests that you should consider devising your own intermediate form into which to marshal the raw data. Since your adjacency matrix is symmetric you only need to store, in any form, half of it. Further, you probably don't need to store the elements along the main diagonal -- I'm guessing either that node i is always connected to node i or never so that the nature of the network determines the value stored at (i,i). I'm a little uncertain of the information you want to store at each node of the matrix, is it the number of connections between i and j or something else ?