I'm using R package "cluster" in Power BI to cluster customers from their historical transactions.
My data looks like this:
Using Run R-Script command in Power Query Editor, I have the following code:
library(cluster)
k.means.fit_log <- kmeans(dataset[2:4], 3)
output <- dataset
output$cluster <- k.means.fit_log$cluster
After this code is executed, I get an additional column with the number of the cluster and all looks good.
However, k.means.fit_log is a list with 9 elements that contains the cluster centers, sizes, etc., so I'd like to be able to create another table or tables in Power BI with the contents of that object.
How can I achieve this?
Thanks in advance for your help!
If it is a list and delimited, you can use the Split Column function in Power Query. It would be good to share the R output table in your question.
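A minimal sketch of one other way to get those extra tables (untested in Power BI itself, and assuming the same Run R-Script step as in the question; the data frame names centers and sizes are only illustrative): every data frame the script creates is offered by Power Query as a separate table, so the pieces of the kmeans object can be exposed as their own data frames alongside output:
# appended to the existing Run R-Script code; each data frame becomes its own loadable table
centers <- as.data.frame(k.means.fit_log$centers)                 # one row per cluster centre
sizes   <- data.frame(cluster = seq_along(k.means.fit_log$size),  # number of points per cluster
                      size    = k.means.fit_log$size)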
I would like to display an R dataframe as a table in Power BI. Using the "R script visual" widget, here is one possible solution:
library(gridExtra)
library(grid)
d<-head(df[,1:3])
grid.table(d)
reference: https://cran.r-project.org/web/packages/gridExtra/vignettes/tableGrob.html
As stated in the reference - this approach only works for small tables.
Is there an alternative approach that will allow an R data frame to be displayed as a table in Power BI - specifically for larger tables that can be 'scrolled'?
Don't use an R Visualization. Use the Run R Script functionality in the Power Query Editor instead (Home > Edit Queries).
If you follow the steps in post1 and/or post2, you'll see how you can import and transform any data into any table you want using R.
So with a simple dataset such as:
A,B
1,11
2,19
3,18
4,19
5,18
6,12
7,12
8,19
9,13
10,19
... you can produce a scrollable table of any format in Power BI:
R script:
# 'dataset' holds the input data for this script
dataset$C <- dataset$B * 2   # add a derived column C
dataset2 <- dataset          # every data frame created here can be loaded as a table in Power Query
Power Query Editor:
Power BI Desktop table:
And you can easily make the table interactive by introducing a slicer:
Power BI Desktop interactive table:
It appears that you now also have the possibility of importing an R DataTable visualization from the Power BI Marketplace:
With the same dataset, you'll end up with this:
I would like to know how to use table data as a parameter in Tableau's R integration.
Example
Tableau has a built-in data set called "Superstore" that is used for reproducible examples. Suppose I've used the Superstore data set to create a "text table" (i.e. a spreadsheet) with Region as rows and SUM(Profit) as the data.
That looks like this:
Now, suppose I wanted to pass the data in this table to R for a calculation. I would start Rserve
library(Rserve)
Rserve()
and establish a connection with Tableau's UI.
Next I would want to create a calculated field to send the data to R and retrieve the results. I'm not sure how to do this.
My attempt looks like this:
SCRIPT_REAL('
output <- anRFunction(.arg1)
',
[someTableauMeasure])
which should be fine, except that I don't know how to represent the table data where it currently says someTableauMeasure. This is just an arbitrary example, but one reason I might want to do this is that I could provide the user with a filter, such as Country, so that they can filter the results at will and get an updated result from R.
For testing purposes that function anRFunction could be replaced with something like abs.
Tableau will pass the aggregated values to R, depending on the settings of your worksheet.
So in your case if you use:
SCRIPT_REAL('
output <- anRFunction(.arg1)
',
SUM([Profit]))
You will get the output according to the dimensions you have on your worksheet, in your case [Region]. If you set up a filter by country, R will only receive and return the values for that country, and if you use [Category] instead, you will get the results of your R function broken down by category.
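As a concrete version of the testing suggestion from the question (a sketch only; it assumes Rserve is running and connected, and simply swaps anRFunction for abs), the calculated field could look like:
SCRIPT_REAL('
abs(.arg1)
',
SUM([Profit]))
Tableau then sends one aggregated value per mark (here, per Region) to R and reads back a vector of the same length.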
I would like to use the PAM algorithm from the cluster package in R to cluster a dataset of around 6,000 rows.
I want the PAM algorithm to ignore a column called "ID" (i.e. not use it in the clustering), but I do not want to delete that column, because I want to use it later on to combine my clustered data with the original dataset.
Basically, what I want is to add a cluster column to the original dataset.
I want to use PAM as a data compression/variable reduction method. I have 220 variables and I would like to cluster some of them to reduce the dimensionality of my dataset, so I can apply a classification algorithm (most likely a tree) to the problem I am trying to solve.
If anyone knows a way around this or a better approach, please let me know.
Thank you
# import data
library(cluster)
data <- read.table("sampleiris.txt")
# execution
result <- pam(data[2:4], 3, FALSE, "euclidean")
Here the subset [2:4] is used because the ID is the first column. The code below should fetch the cluster values from PAM; you can then add this as a column to your data.
result$silinfo[[1]][1:nrow(result$silinfo[[1]])]
There is a small problem with the above code.
You should not use the silhouette information, because it re-orders the rows in preparation for the plot.
If you want to extract the cluster assignments while preserving the original dataset order, and just add a column of cluster assignments, you should use $cluster. I tried it and it works like a charm.
This is the code:
library(cluster)
data <- swiss[, 4:6]
result <- pam(data, 3)
summary(result)
export <- result$cluster
swiss[, "Clus"] <- export
View(export)
View(swiss)
Cheers
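To tie this back to the original ID-column question, here is a minimal sketch (the data frame and column names are placeholders, not your real data): cluster on every column except ID, then attach $cluster to the untouched original data frame, which works because pam keeps the original row order.
library(cluster)
df <- data.frame(ID = 1:10, x = rnorm(10), y = rnorm(10))  # placeholder data
fit <- pam(df[, setdiff(names(df), "ID")], 3)              # cluster on everything except ID
df$cluster <- fit$cluster                                  # ID column is kept, cluster column added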
I have a 3GB csv file. It is too large to load into R on my computer. Instead I would like to load a sample of the rows (say, 1000) without loading the full dataset.
Is this possible? I cannot seem to find an answer anywhere.
If you don't want to pay thousands of dollars to Revolution R so that you can load/analyze your data in one go, sooner or later you need to figure out a way to sample your data.
And that step is easier to do outside R.
(1) Linux Shell:
Assuming your data has a consistent format and each row is one record, you can do:
sort -R data | head -n 1000 >data.sample
This will randomly sort all the rows and write the first 1000 of them to a separate file, data.sample.
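Back in R, that sample can then be read normally (a small follow-up sketch; note that sort -R also shuffles any header line into a random position, so this assumes a file without a header, or that you re-attach column names yourself):
sample_df <- read.csv("data.sample", header = FALSE)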
(2) Database:
If the data does not fit into memory, there is also the option of storing it in a database. For example, I have many tables stored in a MySQL database in a nice tabular format. I can take a sample by doing:
select * from tablename order by rand() limit 1000
You can easily communicate between MySQL and R using RMySQL, and you can index your columns to guarantee query speed. You can also verify the mean or standard deviation of the whole dataset against your sample, if you want, by taking advantage of the database's power.
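For the MySQL route, a sketch of what that looks like from R (the connection details and table name are placeholders):
library(RMySQL)
con <- dbConnect(MySQL(), dbname = "mydb", user = "user", password = "pass")
sample_df <- dbGetQuery(con, "SELECT * FROM tablename ORDER BY RAND() LIMIT 1000")
dbDisconnect(con)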
These are the two most commonly used ways based on my experience for dealing with 'big' data.
I have a dataset in SPSS that has 100K+ rows and over 100 columns. I want to filter both the rows and columns at the same time into a new SPSS dataset.
I can accomplish this very easily using the subset command in R. For example:
new_data <- subset(old_data, subset = ColumnA > 10, select = c(ColumnA, ColumnC, ColumnZZ))
Even easier would be:
new_data <- old_data[old_data$ColumnA > 10, c(1, 4, 89)]
where I am passing the column indices instead.
What is the equivalent in SPSS?
I love R, but the read/write and data management speed of SPSS is significantly better.
I am not sure what exactly you are referring to when you write that the read/write and data management speed of SPSS is "significantly better" than R's. Your question itself demonstrates how flexible R is at data management! And a dataset of 100k rows and 100 columns is by no means a large one.
But, to answer your question, perhaps you are looking for something like this. I'm providing a "programmatic" solution, rather than the GUI one, because you're asking the question on Stack Overflow, where the focus is more on the programming side of things. I'm using a sample data file that can be found here: http://www.ats.ucla.edu/stat/spss/examples/chp/p004.sav
Save that file to your SPSS working directory, open up your SPSS syntax editor, and type the following:
GET FILE='p004.sav'.
SELECT IF (lactatio <= 3).
SAVE OUTFILE= 'mynewdatafile.sav'
/KEEP currentm previous lactatio.
GET FILE='mynewdatafile.sav'.
More likely, though, you'll have to go through something like this:
FILE HANDLE directoryPath /NAME='C:\path\to\working\directory\' .
FILE HANDLE myFile /NAME='directoryPath/p004.sav' .
GET FILE='myFile'.
SELECT IF (lactatio <= 3).
SAVE OUTFILE= 'directoryPath/mynewdatafile.sav'
/KEEP currentm previous lactatio.
FILE HANDLE myFile /NAME='directoryPath/mynewdatafile.sav'.
GET FILE='myFile'.
You should now have a new file created that has just three columns, and where no value in the "lactatio" column is greater than 3.
So, the basic steps are:
Load the data you want to work with.
Subset for all columns from all the cases you're interested in.
Save a new file with only the variables you're interested in.
Load that new file before you proceed.
With R, the basic steps are:
Load the data you want to work with.
Create an object with your subset of rows and columns (which you know how to do).
Hmm.... I don't know about you, but I know which method I prefer ;)
If you're using the right tools with R, you can also read in just the specific subset you are interested in without first loading the whole dataset, if speed really is an issue.
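As one hedged illustration of that last point (assuming the data is also available as a CSV export; the file and column names are placeholders), sqldf::read.csv.sql applies both the row filter and the column selection while reading, so the full file never has to be loaded:
library(sqldf)
new_data <- read.csv.sql("old_data.csv",
                         sql = "select ColumnA, ColumnC, ColumnZZ from file where ColumnA > 10")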
In SPSS you can't combine the two actions in one command, but it's easy enough to do it in two:
dataset copy old_data. /* delete this if you don't need to keep both old and new data.
select if ColumnA>10.
add files /file=* /keep=ColumnA ColumnC ColumnZZ.