Kmeans on a million observations in R - trouble plotting clusters

I am trying to perform KMeans clustering on over a million rows with 4 variables, all numeric. I am using the following code:
kmeansdf<-as.data.frame(rbind(train$V3,train$V5,train$V8,train$length))
km<-kmeans(kmeansdf,2)
As can be seen, I would like to divide my data into two clusters. The object km is populated, but I am having trouble plotting the results. Here is the code I am using to plot:
plot(kmeansdf,col=km$cluster)
This piece of code gives me the following error:
Error in plot.new() : figure margins too large
I tried researching online but could not find a solution. I also tried running it from the command line, but I still get the same error (I am using RStudio at the moment).
Any help to resolve the error would be highly appreciated.
TIA.

When I run your code on a df with 1e6 rows, I don't get the same error, but the system hangs (interrupted after 10 min). It may be that creating a scatterplot matrix with 1e6 points per frame is just too much.
You might consider taking a random sample:
# all this to create a df with two distinct clusters
set.seed(1)
center.1 <- c(2,2,2,2)
center.2 <- c(-2,-2,-2,-2)
n <- 5e5
f <- function(x) {
  data.frame(V1 = rnorm(n, mean = x[1]),
             V2 = rnorm(n, mean = x[2]),
             V3 = rnorm(n, mean = x[3]),
             V4 = rnorm(n, mean = x[4]))
}
df <- do.call("rbind",lapply(list(center.1,center.2),f))
km <- kmeans(df,2) # run kmeans on full dataset
df$cluster <- km$cluster # append cluster column to df
# sample is 10% of population (100,000 rows)
s <- 1e5
df <- df[sample(nrow(df),s),]
plot(df[,1:4],col=df$cluster)
The same approach also works with a 1% sample (s <- 1e4, i.e. 10,000 of the 1e6 rows).
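A further point, not raised in the answer above: the question builds kmeansdf with rbind, which stacks the four vectors as rows. The result has 4 rows and one column per observation, so kmeans clusters only 4 points and plot() attempts a scatterplot matrix with roughly a million panels, which is a plausible source of "figure margins too large". A minimal sketch of the difference, assuming a small made-up train data frame (the column names come from the question, the values are simulated):

```r
# Hypothetical stand-in for the asker's data: 1,000 simulated rows,
# with column names taken from the question.
set.seed(1)
train <- data.frame(V3 = rnorm(1000), V5 = rnorm(1000),
                    V8 = rnorm(1000), length = rnorm(1000))

# rbind stacks the vectors as rows: 4 rows, 1000 columns
wide <- as.data.frame(rbind(train$V3, train$V5, train$V8, train$length))
dim(wide)

# Selecting the columns keeps one row per observation: 1000 rows, 4 columns
long <- train[, c("V3", "V5", "V8", "length")]
dim(long)

km <- kmeans(long, 2)
length(km$cluster)  # one cluster label per observation

# A 4 x 4 scatterplot matrix; with millions of rows, plot a random sample instead
if (interactive()) plot(long, col = km$cluster)
```

With the data in this shape, the sampling approach from the answer applies unchanged.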

Related

The R package BosonSampling keeps running without result

I tried to generate boson sampling data using the R package BosonSampling. I know it takes a long time to generate samples for larger values of n and m, so I tried smaller values, but I still didn't get any output from the code. I don't know what the problem is.
The documentation is available at this link:
https://cran.r-project.org/web/packages/BosonSampling/index.html
The code from the documentation:
library('BosonSampling')
library('Rcpp')
set.seed(7)
n <- 10
m <- 20
A <- randomUnitary(m)[,1:n]
valueList <- bosonSampler(A, sampleSize=10, perm = FALSE)$values
valueList

How to customise a ctree (package 'party')?

I have a problem using ctree in the R package party. I can't use the package partykit because it can't search for unordered splits in factors with >= 31 levels.
I used this code:
set.seed(1234) #To get reproducible result
ind <- sample(2,nrow(newnew_compressed_data), replace=TRUE, prob=c(0.7,0.3))
trainData <- newnew_compressed_data[ind==1,]
testData <- newnew_compressed_data[ind==2,]
myFormula <- MA ~ .
abundance_ctree <- party::ctree(myFormula, data=trainData)
abundance_ctree2 <- party::ctree(myFormula, data=testData)
print(abundance_ctree)
plot (abundance_ctree)
plot(abundance_ctree, type="simple")
plot (abundance_ctree2)
where MA is my y-variable and newnew_compressed_data is my dataset. The dataset has 1032 observations and 7 variables, which are being tested for importance.
This is what the tree currently looks like:
You can see the labels list every item in the category, which I'd rather print separately or put into a table! In addition, I'm not sure what each of the nodes corresponds to; the output said I had 13 nodes...
Does anyone know of a way to reduce the levels and produce a better legend to explain what is represented in each of the nodes? I just can't interpret anything from this, and I am struggling to find examples with big datasets.

R: Silhouette function results in table

Is there a way to receive the results of the silhouette function in R in a table showing 1) the number of the clusters and 2) the average silhouette width for each cluster?
You can use the summary() function to see all the details of the silhouette() result for a clustering analysis. Building on @G5W's answer, you can also get the number of observations in each cluster.
library(cluster)
Iris_KM3 <- kmeans(iris[,1:4],3)
SIL <- silhouette(Iris_KM3$cluster, dist(iris[,1:4]))
summary_SIL <- summary(SIL)
cluster_SIL <- t(rbind(summary_SIL[["clus.sizes"]], summary_SIL[["clus.avg.widths"]]))
colnames(cluster_SIL) <- c("No. of Obs", "Avg. Silh. Width")
Everything that you need is returned by the silhouette function. Just capture that and summarize it any way that you want. Here is an example using the built-in iris data.
library(cluster)
Iris_KM3 = kmeans(iris[,1:4],3)
SIL = silhouette(Iris_KM3$cluster, dist(iris[,1:4]))
aggregate(SIL[,3], list(SIL[,1]), mean)
  Group.1          x
1       1 0.07624005
2       2 0.49471909
3       3 0.62148628
If you run the above code, try typing just SIL or str(SIL) to see what the function is giving you.
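Combining the two answers above, the per-cluster sizes and average widths can be collected into a single data frame. This is only a sketch using the cluster package and base R; the names sil and sil_table are made up for illustration, and the seed is added purely so that the kmeans result is repeatable.

```r
library(cluster)

set.seed(42)  # assumption: fixed seed so the clustering is repeatable
km <- kmeans(iris[, 1:4], 3)
sil <- silhouette(km$cluster, dist(iris[, 1:4]))

# Column 1 of the silhouette matrix is the cluster id,
# column 3 is the silhouette width of each observation.
sil_table <- data.frame(
  cluster   = sort(unique(sil[, 1])),
  n_obs     = as.vector(table(sil[, 1])),
  avg_width = as.vector(tapply(sil[, 3], sil[, 1], mean))
)
print(sil_table)
```

The avg_width column should match summary(sil)$clus.avg.widths, so either route gives the same table.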

Spatial correlogram

I am trying to run a spatial correlogram for a project looking at deforestation in the Atlantic Forest, Brazil.
However, I am confused as to why I am hitting this problem.
Problem
When I run the initial part of my code I receive this error:
Error: ncol(x) == 2 is not TRUE
My code is
r.nb <- dnearneigh(as.matrix(shapeS$POINT_X,shapeS$POINT_Y),
d1=200, d2=100000, latlong=FALSE)
and then I hope to run this code
p.cor <- sp.correlogram(r.nb, deforestation, order=15,
method="I", randomisation=FALSE)
r.nb <- dnearneigh(as.matrix(shapeS$POINT_X,shapeS$POINT_Y),
d1=200, d2=100000, latlong=FALSE)
My data is
A vector data set with the headings
POINTID GRID_CODE POINT_X POINT_Y
You need to use cbind, not as.matrix, or the approach that I show below. Always identify the R packages you are using. You say that your data set is a 'vector data set'. I doubt that; I am assuming it is a matrix.
If it is a matrix, you can do
m <- shapeS[, c('POINT_X', 'POINT_Y')]
library(spdep)
r.nb <- dnearneigh(m, d1=200, d2=100000, latlong=FALSE)
If it is a data.frame, you can do
m <- as.matrix(shapeS[, c('POINT_X', 'POINT_Y')])
or
m <- cbind(shapeS$POINT_X, shapeS$POINT_Y)
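The reason as.matrix fails here is that it converts a single object; the second vector passed in the question is not bound as a second column, so the result has only one column, which is consistent with the ncol(x) == 2 error. A small base R illustration with made-up coordinates (spdep is not needed to see the difference):

```r
# Made-up coordinates standing in for the asker's shapeS data
shapeS <- data.frame(POINT_X = c(1, 2, 3), POINT_Y = c(10, 20, 30))

# as.matrix converts only its first argument: 3 rows, 1 column
one_col <- as.matrix(shapeS$POINT_X, shapeS$POINT_Y)
dim(one_col)

# cbind binds the two vectors as columns: 3 rows, 2 columns
two_col <- cbind(shapeS$POINT_X, shapeS$POINT_Y)
dim(two_col)

# Converting a two-column subset of the data frame also gives 3 rows, 2 columns
from_df <- as.matrix(shapeS[, c("POINT_X", "POINT_Y")])
dim(from_df)
```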

