Using predefined splits in the pcr() function (R pls package)

To ensure a good population representation, I have created custom validation sets from my training data. However, I am not sure how to pass these to pcr() in R.
I have tried putting the fold indices into the segments argument, similar to the predefined-splits CV iterator in Python. This runs, but takes forever, so I feel I must be making an error somewhere:
pcr(y ~ X, scale = FALSE, data = tdata, validation = "CV", segments = test_fold)
where test_fold indicates which validation set each sample belongs to.
For example, if the training data consists of 9 samples and I want to use the first three as the first validation set, and so on:
test_fold<-c(1,1,1,2,2,2,3,3,3)
This runs, but it is very slow, whereas plain validation = "CV" finishes in minutes. So far the results look okay, but I have over a thousand runs to do and it took an hour to get through just one. If anybody knows how I can speed this up I would be grateful.

The segments parameter needs to be a list of vectors, one vector per validation set. Going again with 9 samples, if I want the first three in the first validation set, the next three in the second, and so on, it should be:
test_vec <- list(c(1, 2, 3), c(4, 5, 6), c(7, 8, 9))
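Putting this together, a minimal sketch of the full call with predefined folds (the data frame tdata and the formula y ~ X are carried over from the question and assumed to exist):

library(pls)

# one list element per held-out segment
test_vec <- list(c(1, 2, 3), c(4, 5, 6), c(7, 8, 9))

# cross-validate on exactly these segments instead of random CV folds
fit <- pcr(y ~ X, scale = FALSE, data = tdata,
           validation = "CV", segments = test_vec)

# validation error per number of components, evaluated on the predefined segments
RMSEP(fit)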

Does column order matter in RNN?

My question is somewhat similar to this one, but I want to ask whether the column order matters or not. I have some time series data, and for each cycle I computed some features (let's call them var1, var2, ...). I now train the model using the following column order, which of course will be consistent for the test set:
X_train = data[['var1', 'var2', 'var3', 'var4']]
After watching this video I've concluded that the order in which the columns appear is significant, i.e. that swapping var1 and var3 as
X_train = data[['var3', 'var2', 'var1', 'var4']]
would give me a different loss.
If the above is true, how does one figure out the correct feature order to minimize the loss, especially when the number of features could be in the dozens?

Handle a string return from R to Tableau and SPLIT it

I connect Tableau to R and execute an R function that recommends products. When R finishes, the return value is a string containing all product details, like below:
ID|Existing_Prod|Recommended_Prod\nC001|NA|PROD008\nC002|PROD003|NA\nF003|NA|PROD_ABC\nF004|NA|PROD_ABC1\nC005|PROD_ABC2|NA\nC005|PRODABC3|PRODABC4
(Each line is separated by \n, indicating the end of a line.)
On Tableau, I display the calculated field which is as below:
ID|Existing_Prod|Recommended_Prod
C001|NA|PROD008
C002|PROD003|NA
F003|NA|PROD_ABC
F004|NA|PROD_ABC1
C005|PROD_ABC2|NA
C005|PRODABC3|PRODABC4
The above data reaches Tableau through a calculated field as a single string, which I need to split into three columns on the pipe character ('|').
I used the SPLIT function on the calculated field:
SPLIT([R_Calculated_Field],'|',1)
SPLIT([R_Calculated_Field],'|',2)
SPLIT([R_Calculated_Field],'|',3)
But the error says "SPLIT function cannot be applied on Table calculations", which is self-explanatory. Are there any alternatives? I googled for best practices on handling the integration between R and Tableau, and all I could find were simple k-means clustering examples.
Make sure you understand how partitioning and addressing work for table calcs. Table calcs pass vectors of arguments to the R script and receive a single vector in response. The cardinality of those vectors depends on the partitioning of the table calc. You can view that by editing the table calc and clicking "Specific Dimensions"; the fields that are not checked determine the partitioning, and thus the cardinality of the arguments you send to and receive from R.
This means it might be tricky, though not necessarily impossible, to map your problem onto this infrastructure. It was designed to send a series of vector arguments with one cell per partitioning dimension, say Manufacturer, and to get back one vector with one result per Manufacturer (or whatever combination of fields partitions your data for the table calc). It sounds like you are expecting an arbitrary-length list of recommendations. It shouldn't be too hard to have your R script turn the string into a vector before returning, but the size of that vector has to make sense.
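For the string-to-vector step just mentioned, a minimal sketch in R (the payload is copied from the question; this only covers the parsing, and the length of whatever you return still has to match the table calc's partitioning):

result_string <- "ID|Existing_Prod|Recommended_Prod\nC001|NA|PROD008\nC002|PROD003|NA"

# split into one element per line, then one element per pipe-separated field
rows   <- strsplit(result_string, "\n", fixed = TRUE)[[1]]
fields <- strsplit(rows, "|", fixed = TRUE)

# fields is now a list of character vectors, one per row:
# [[1]] "ID"   "Existing_Prod" "Recommended_Prod"
# [[2]] "C001" "NA"            "PROD008"
# ...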
As an example of an approach that fits this model more easily, say you had a Tableau view that had one row per Product (and you had N products) - and some other aggregated measure fields in the view per Product. (In Tableau speak, the view’s level of detail is at the Product level.)
It would be straightforward to pass those measures as a series of argument vectors to R - each vector having N values, and then have R return a vector of reals of length N where the value returned at each location was a recommender score for the product at that position. (Which is why the ordering aka addressing of the vectors also matters)
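As a rough sketch of what such a call might look like (everything here is hypothetical: [Sales] and [Quantity] stand in for whatever aggregated measures are in the view, and the scoring rule is just a toy example):

// hypothetical calculated field; .arg1/.arg2 each arrive as a vector with one value per Product
SCRIPT_REAL("
  sales    <- .arg1
  quantity <- .arg2
  # toy recommender score, returned as one value per Product in the view
  0.5 * scale(sales)[, 1] + 0.5 * scale(quantity)[, 1]
", SUM([Sales]), SUM([Quantity]))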
Then you could filter out low scoring products from the view and visually distinguish highly recommended products.
So the first step to understanding R integration is to understand how table calcs operate with partitioning and addressing and to think in terms of vectors of fixed lengths passed in both directions.
If this model doesn’t support your use case well, you might be able to do something useful with URL actions or the JavaScript API.

Integration of Time series model of R in Tableau

I am trying to integrate a time series model from R into Tableau, and I am new to this kind of integration. Please help me resolve the error mentioned below. Here is my calculated field in Tableau for the integration with R; the calculation is valid, but I get an error when it runs.
SCRIPT_REAL(
  "library(forecast);
   cln_count_ts <- ts(.arg1, frequency = 7);
   arima.fit <- auto.arima(log10(cln_count_ts));
   forecast_ts <- forecast(arima.fit, h = 10);",
  SUM([Count]))
Error : Error in auto.arima(log10(cln_count_ts)) : No suitable ARIMA model found
When Tableau calls R, Python, or another tool, it does so as a "table calc". That means it sends the external system one or more vectors as arguments and expects a single vector in response.
Depending on your data and calculation, you may want to send all your data to R in a single call, passing a very large vector, or call it several times with different vectors - say forecasting each region separately. Or even call R multiple times with many vectors of size one (aka scalars).
So with table calcs, you have other decisions to make beyond just choosing the function to invoke. Chiefly, you have to decide how to partition your data for analysis. And in some cases, you also need to determine the order that the data appears in the vectors you send to R - say if the order implies a time series.
The Tableau terms for specifying how to divide and order data for table calculations are "partitioning and addressing". See the section on that topic in the online help. You can change those settings by using the "Edit Table Calc" menu item.
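To make the vector-in/vector-out contract concrete, here is a sketch of how the calculation above could be shaped so that it hands one value back per input row (this illustrates the mechanics only, it does not address the "No suitable ARIMA model found" error itself, and it assumes the view is partitioned so that .arg1 is the daily count series):

// return the back-transformed in-sample fit, one value per row sent to R
SCRIPT_REAL("
  library(forecast)
  cln_count_ts <- ts(.arg1, frequency = 7)
  arima.fit <- auto.arima(log10(cln_count_ts))
  # Tableau expects a vector as long as .arg1, so return the fitted values
  # rather than the 10-step-ahead forecast
  as.numeric(10 ^ fitted(arima.fit))
", SUM([Count]))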

NetLogo: create a plot of the average of various runs

I have been trying to create an average plot of various runs (if possible with their variation).
So far the only way I have found is to use the spreadsheet output from BehaviorSpace and do it externally.
Is there a way to do this in NetLogo?
Many thanks for your help!
It is possible, but not entirely convenient. To get started, you could look at the "Simple Birth Rates" model from the NetLogo Models Library. In this model, the setup procedure is split into a basic setup, which is executed once when the model is initialized for the first time, and a second procedure, setup-experiment, which is executed in between multiple runs. This lets you control which things get cleared between runs (turtles, patches, plots, ...).
For performing multiple runs, the model uses a second go-procedure, named go-experiment. This procedure runs the model (go) until a stop condition is true. Then it calls the setup-experiment procedure and continues with the next simulation run (go).
To store data for a plot, you would only need to store the final result of interest from each run in a global list (right after the stop condition becomes true and right before the setup-experiment for the next run is performed). This list can then be used by a plot on your interface to summarize the data from the various runs. You just have to make sure that you don't clear this global list in your setup-experiment procedure, and that setup-experiment resets all other global variables (if any) to their initial state.
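If you do end up taking the external route mentioned in the question, averaging BehaviorSpace output across runs is straightforward in R. A sketch, with hypothetical file, column, and reporter names; the number of metadata lines to skip in a BehaviorSpace "table" export may differ between NetLogo versions:

# BehaviorSpace "table" output starts with several metadata lines; adjust skip if needed
results <- read.csv("experiment-table.csv", skip = 6)

# BehaviorSpace's "[step]" column arrives as "X.step."; "my.reporter" is a placeholder
avg_by_tick <- aggregate(my.reporter ~ X.step., data = results, FUN = mean)
sd_by_tick  <- aggregate(my.reporter ~ X.step., data = results, FUN = sd)

plot(avg_by_tick$X.step., avg_by_tick$my.reporter, type = "l",
     xlab = "tick", ylab = "mean of my.reporter across runs")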

A Neverending cforest

How can I decouple the time cforest/ctree takes to construct a tree from the number of columns in the data?
I thought the option mtry could be used to do just that; the help says:
number of input variables randomly sampled as candidates at each node for random forest like algorithms.
But while that does randomize the output trees, it doesn't decouple the CPU time from the number of columns, e.g.
library(party)  # provides ctree/cforest with the controls argument used here

p <- proc.time()
ctree(gs.Fit ~ .,
      data = Aspekte.Fit[, 1:60],
      controls = ctree_control(mincriterion = 0,
                               maxdepth = 2,
                               mtry = 1))
proc.time() - p
takes twice as long as the same call with Aspekte.Fit[, 1:30] (by the way, all variables are boolean). Why? Where does it scale with the number of columns?
As I see it the algorithm should:
At each node randomly select two columns.
Use them to split the response. (no scaling because of mincriterion=0)
Proceed to the next node (for a total of 3 due to maxdepth=2)
without being influenced by the column total.
Thanks for pointing out the error of my ways.
