Correlation calculation in Tableau and R

I have a huge dataset and have to calculate the correlation matrix for different indices based on the user's selection (filter). I applied the formula in both a Tableau calculated field and in R.
Both are successful if the indices have the required data, and the output results are the same. The problem arises when any of the indices I choose has less data than required (e.g., only 2 years of data are available, whereas I want to see the 3-year correlation).
R automatically ignores those indices which don't have values for the full time frame, whereas Tableau still calculates and shows a result for those indices, even if only one data point is available. Those indices shouldn't show any result, just as in R.
How can I remove those indices from the Tableau calculation? Any help in this regard would certainly be appreciated.
Note: It would be very easy for me to use R instead of Tableau, but our Tableau Server doesn't connect with Rserve due to a technology restriction. Hence I am limited to Tableau calculations only.
Tableau Calculation Code:
(WINDOW_SUM(SIZE()*[ValueAcross]*[ValueDown]) - WINDOW_SUM([ValueDown])*WINDOW_SUM([ValueAcross])) / (SQRT((WINDOW_SUM(SIZE()*[ValueAcross]^2) - WINDOW_SUM([ValueAcross])^2) * (WINDOW_SUM(SIZE()*[ValueDown]^2) - WINDOW_SUM([ValueDown])^2)))
R Calculation Code:
Script_Real("cor(.arg1,.arg2, method='pearson')",([ValueAcross]),([ValueDown]))

Related

Tableau to R connection - script_real returning rounded fraction numbers

I'm pretty new to Tableau but have a lot of experience with R. Every time I use SCRIPT_REAL to call an R function based on Tableau aggregates, I get back a number that seems to be the closest fraction approximation. For example, if raw R gives me .741312, Tableau will spit out .777778, and so on. Does anyone have any experience with this issue?
I'm pretty sure this is an aggregation issue.
From the Tableau and R Integration post by Jonathan Drummey on their community site:
Using Every Row of Data - Disaggregated Data

For accurate results for the R functions, sometimes those R functions need to be called with every row in the underlying data. There are two solutions to this:

Disaggregate the measures using Analysis->Aggregate Measures->Off. This doesn’t actually cause the measures to stop their aggregations, instead it tells Tableau to return every row in the data without aggregating by the dimensions on the view (which gives the wanted effect). Using this with R scripts can get the desired results, but can cause problems for views that we want to have R work on the non-aggregated data and then display the data with some level of aggregation.

The second solution deals with this situation: Add a dimension such as a unique Row ID to the view, and set the Compute Using (addressing) of the R script to be along that dimension. If we’re doing some sort of aggregation with R, then we might need to reduce the number of values returned by filtering them out with something like:

IF FIRST()==0 THEN SCRIPT_REAL('insert R script here') END

If we need to then perform additional aggregations on that data, we can do so with table calculations with the appropriate Compute Usings that take into account the increased level of detail in the view.
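A simulated R example (not from the post) of the aggregation effect described above: feeding per-group aggregates to cor(), as an aggregated Tableau view does, gives a noticeably different value than feeding every underlying row.

# Row-level versus aggregated correlation on made-up data
set.seed(42)
df <- data.frame(grp = rep(1:5, each = 20),
                 x   = rnorm(100),
                 y   = rnorm(100))

cor(df$x, df$y)                                      # correlation over all 100 rows
agg <- aggregate(cbind(x, y) ~ grp, data = df, FUN = mean)
cor(agg$x, agg$y)                                    # correlation over only 5 aggregated points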

Princomp error in R : covariance matrix is not non-negative definite

I have this script which does a simple PCA analysis on a number of variables and at the end attaches two coordinates and two other columns (presence, NZ_Field) to the output file. I have done this many times before, but now it's giving me this error:
covariance matrix is not non-negative definite
I understand that it means there are negative eigenvalues. I looked at similar posts which suggest using na.omit, but it didn't work.
I have uploaded the "biodata.Rdata" file here:
https://www.dropbox.com/s/1ex2z72lilxe16l/biodata.rdata?dl=0
I am pretty sure it is not because of missing values in the data, because I have used the same data with different "presence" and "NZ_Field" columns.
Any help is highly appreciated.
load("biodata.rdata")
#save data separately
coords=biodata[,1:2]
biovars=biodata[,3:21]
presence=biodata[,22]
NZ_Field=biodata[,23]
#Do PCA
bpc=princomp(biovars ,cor=TRUE)
#re-attach data with auxiliary data..coordinates, presence and NZ location data
PCresults=cbind(coords, bpc$scores[,1:3], presence, NZ_Field)
write.table(PCresults,file= "hlb_pca_all.txt", sep= ",",row.names=FALSE)
This does appear to be an issue with missing data, so there are a few ways to deal with it. One way is to manually do listwise deletion on the data before running the PCA, which in your case would be:
biovars <- biovars[complete.cases(biovars), ]
The other option is to use another package; specifically, psych seems to work well here and you can use principal(biovars). While the output is a bit different, it does work using pairwise deletion, so basically it comes down to whether you want to use pairwise or listwise deletion. Thanks!
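A short sketch of the two options above, assuming biovars is the numeric matrix loaded from biodata.rdata (the nfactors value is only illustrative):

# Option 1: listwise deletion, then princomp as in the original script
biovars_complete <- biovars[complete.cases(biovars), ]
bpc <- princomp(biovars_complete, cor = TRUE)

# Option 2: pairwise deletion via the psych package
# install.packages("psych")
library(psych)
bpc_pairwise <- principal(biovars, nfactors = 3)   # nfactors = 3 chosen only for illustration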

R Project - summary output of glht function in tab delimited form?

I've used the function lme() on a dataset I've got. After that, I've used glht() in order to see the multiple comparisons of means. The summary of glht() gives me, in the linear hypotheses, 4 columns, with the last one (Pr(>|z|)) being the one I'm interested in at the moment.
Because I'm dealing with 20 parameters and 4 different treatments in 10 different groups, our plan is to create a "heat map" of the p values obtained in the last column of summary(glht()).
My original plan is to copy and paste these values into Excel and make the heat map there. However, summary() in R separates the columns with spaces, which is not helpful for sending to Excel.
I've got 2 questions, then: (1) is it possible to change summary() in such a way that all the values are tab-delimited? (2) does anyone have a better/faster way of preparing this heat map? For this last question, I've thought about populating a matrix with the p values in R as a starter. However, I'm not sure how to extract this Pr(>|z|) column from the summary() output.
Many thanks in advance for the help!
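One possible sketch (simulated data, not from the post): the object returned by summary(glht(...)) stores the adjusted p-values in its test component, so the Pr(>|z|) column can be pulled out directly and written tab-delimited for Excel. The data frame, variable names, and output file name below are assumptions for illustration.

# Simulated data with 4 treatments in 10 groups
library(nlme)
library(multcomp)

set.seed(1)
d <- data.frame(y         = rnorm(120),
                treatment = factor(rep(c("A", "B", "C", "D"), 30)),
                group     = factor(rep(1:10, each = 12)))

fit      <- lme(y ~ treatment, random = ~ 1 | group, data = d)
glht_fit <- glht(fit, linfct = mcp(treatment = "Tukey"))
glht_sum <- summary(glht_fit)

pvals <- glht_sum$test$pvalues                      # the Pr(>|z|) values
out   <- data.frame(comparison = names(glht_sum$test$coefficients),
                    p_value    = as.numeric(pvals))
write.table(out, file = "glht_pvalues.txt", sep = "\t", row.names = FALSE)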

R: how to use long vectors with randomForest?

One of the new features of R 3.0.0 was the introduction of long vectors. However, .C() and .Fortran() do not accept long vector inputs. On R-bloggers I find:
This is a precaution as it is very unlikely that existing code will have been written to handle long vectors (and the R wrappers often assume that length(x) is an integer)
I work with the R package randomForest, and this package evidently relies on .Fortran(), since it crashes with the error message
Error in randomForest.default: long vectors (argument 20) are not supported in .Fortran
How can I overcome this problem? I use randomForest 4.6-7 (built under R 3.0.2) on a Windows 7 64-bit computer.
The only way to guarantee that your input data frame will be accepted by randomForest is to ensure that the vectors inside the data frame do not have a length which exceeds 2^31 - 1 (i.e., are not long). If you must start off with a data frame containing long vectors, then you would have to subset the data frame to bring the vectors down to an acceptable length. Here is one way you could subset a data frame to make it suitable for randomForest:
# given a data frame 'df' whose columns are long vectors
maxLen <- 2^31 - 1
df <- head(df, maxLen)   # keep only the first 2^31 - 1 rows
However, there is a major problem with doing this, which is that you would be throwing away every observation in row 2^31 or higher. In practice, you probably do not need so many observations to run a random forest calculation. The easy workaround to your problem is to simply take a statistically valid subsample of the original dataset with a size which does not exceed 2^31 - 1. Store the data in R vectors that are not long, and your randomForest calculation should run without any issues.
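A hedged sketch of that subsampling workaround (df, the response column y, and the sample size are assumptions for illustration):

# Draw a random subset of rows small enough that no vector handed to
# .Fortran() is a long vector, then fit the forest on the subset.
library(randomForest)

n_keep <- 1e6                                    # illustrative size, far below 2^31 - 1
idx    <- sample(nrow(df), size = min(nrow(df), n_keep))
df_sub <- df[idx, ]

rf <- randomForest(y ~ ., data = df_sub)         # assumes the response column is named 'y'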

R - 'princomp' can only be used with more units than variables

I am using R software (R Commander) to cluster my data. I have a smaller subset of my data containing 200 rows and about 800 columns. I am getting the following error when trying to k-means cluster the data and plot it on a graph:
"'princomp' can only be used with more units than variables"
I then created a test dataset of 10 rows and 10 columns, which plots fine, but when I add an extra column I get the error again.
Why is this? I need to be able to plot my clusters. When I view my data set after performing kmeans on it, I can see the extra results column which shows which cluster each row belongs to.
Is there anything I am doing wrong? Can I get rid of this error and plot my larger sample?
Please help, this has been wrecking my head for a week now.
Thanks guys.
The problem is that you have more variables than sample points and the principal component analysis that is being done is failing.
In the help file for princomp it explains (read ?princomp):

‘princomp’ only handles so-called R-mode PCA, that is feature extraction of variables. If a data matrix is supplied (possibly via a formula) it is required that there are at least as many units as variables. For Q-mode PCA use ‘prcomp’.
Principal component analysis is underspecified if you have fewer samples than dimensions.
Every data point will be its own principal component. For PCA to work, the number of instances should be significantly larger than the number of dimensions.
Simply speaking, you can look at the problem like this:
If you have n dimensions, you can encode up to n+1 instances using vectors that are all 0 or that have at most a single 1. And this is optimal, so PCA will do this! But it is not very helpful.
You can use prcomp instead of princomp.
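For example, a quick sketch with simulated data of the same shape as yours (200 rows, 800 columns) shows the difference:

# princomp() refuses a matrix with more columns than rows, while prcomp() runs.
set.seed(1)
m <- matrix(rnorm(200 * 800), nrow = 200, ncol = 800)

# princomp(m, cor = TRUE)          # would stop with "'princomp' can only be used
                                   # with more units than variables"
pc <- prcomp(m, scale. = TRUE)     # works
summary(pc)$importance[, 1:3]      # variance explained by the first three components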

Resources