What is the difference between stats::kmeans and "naive" k-means - r

I am trying to understand what stats::kmeans does differently from the simple version explained, e.g., on Wikipedia. I am honestly completely clueless here.
Reading the help for kmeans, I learned that the default algorithm is Hartigan–Wong rather than the more basic method, so there should be a difference, but playing around with some normally distributed variables I couldn't find a case where they differed substantially and predictably.
For reference, this is the (utterly horrible) code I tested it against:
## square of the Euclidean metric
my_metric <- function(x = vector(), y = vector()) {
  stopifnot(length(x) == length(y))
  sum((x - y)^2)
}

## data: xy data
## k: number of groups
my_kmeans <- function(data, k, maxIt = 10) {
  ## get the length, check that the column lengths are equal and that enough data is provided
  l <- length(data[, 1])
  stopifnot(l == length(data[, 2]))
  stopifnot(l > k)
  ## generate the starting points
  ms <- data[sample(1:l, k), ]
  ## append a g (group) column to the data and initialize last
  data$g <- 0
  last <- data$g
  it <- 0
  repeat {
    it <- it + 1
    ## iterate through the data points and assign each to the nearest cluster
    for (i in 1:l) {
      distances <- rep(Inf, k)   # one distance per cluster
      for (j in 1:k) {
        distances[j] <- my_metric(data[i, c(1, 2)], ms[j, ])
      }
      data$g[i] <- which.min(distances)
    }
    ## update the cluster centers
    for (i in 1:k) {
      points_in_cluster <- data[data$g == i, 1:2]
      ms[i, ] <- c(mean(points_in_cluster[, 1]), mean(points_in_cluster[, 2]))
    }
    ## break condition: nothing changed (or the iteration limit was reached)
    if (my_metric(last, data$g) == 0 | it > maxIt) {
      break
    }
    last <- data$g
  }
  data
}

First off, this was a duplicate (as I just found out) of this post.
But I will still try to give an example: when the clusters are well separated, Lloyd's algorithm tends to leave the centers inside the clusters they start in, meaning that some true clusters may end up split while others get lumped together.
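For what it's worth, here is a minimal sketch of how one might look for that difference (this is illustrative, not from the original post; whether the two algorithms actually disagree depends on the draw of the starting centers, so several seeds may be needed):
set.seed(1)
## three well separated 2-d clusters
x <- rbind(matrix(rnorm(100, mean = 0),  ncol = 2),
           matrix(rnorm(100, mean = 5),  ncol = 2),
           matrix(rnorm(100, mean = 10), ncol = 2))
## use the same starting centers for both algorithms
starts <- x[sample(nrow(x), 3), ]
fit_hw    <- kmeans(x, centers = starts, algorithm = "Hartigan-Wong")
fit_lloyd <- kmeans(x, centers = starts, algorithm = "Lloyd")
## cross-tabulate the two partitions and compare the within-cluster sums of squares
table(hw = fit_hw$cluster, lloyd = fit_lloyd$cluster)
c(hw = fit_hw$tot.withinss, lloyd = fit_lloyd$tot.withinss)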

Related

time complexity issue when checking the connectivity between nodes in R igraph

I'd like to see which first-degree nodes are connected to which second-degree nodes for every node in a given graph. Suppose I generate a graph with 1000 nodes.
library(igraph)
g <- erdos.renyi.game(1000, 0.2)
When I calculate the set of adjacent nodes and second-degree nodes for every node in the graph, there's no issue.
The code runs pretty quickly, independent of the size of the network. Once I add an if statement to check whether two nodes are connected, as in:
for(j in adjacent){
  for(k in secondDegreNodes){
    if( are.connected(g, j, k) ){
    }
  }
}
My code takes forever. What is the exact complexity issue that I am facing? Is there a better way to conduct this operation?
Something odd is definitely happening. Even the following code block does not finish, though it is the simplest of operations.
g <- erdos.renyi.game(1000, 0.3)
A <- as_adjacency_matrix(g)
from <- 1
to <- nrow(A)
for(i in from:to){
  for(j in i:to){
    if(A[i,j] == 1){
      # do nothing
    }
  }
}
EDIT: I believe there could be some issue with the igraph package. I generated the graph in R with igraph, coded everything in Java, and it worked. As I expected, there is no complexity issue with the algorithm itself. However, I have no idea what is wrong with igraph.
In the example that you give, indexing matrix A repeatedly in a for-loop is rather inefficient. In this particular instance it is due to A being of class dgCMatrix from package Matrix.
When you compare the performance before and after converting A to another class, you will notice the difference. With N = 300 nodes, the duration of the for-loop decreases from 23.5 seconds to 0.1 seconds on my machine once I convert to the standard matrix class. There are moreover (N^2 + N) / 2 comparisons to be made. The squared term means going from 300 nodes to 3 x 300 = 900 nodes will roughly increase the computing time ninefold (at the very least).
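As a quick sanity check on that count (a tiny illustrative snippet, not part of the original answer):
comparisons <- function(N) (N^2 + N) / 2
comparisons(300)                      # 45150
comparisons(900)                      # 405450
comparisons(900) / comparisons(300)   # ~ 8.98, i.e. roughly ninefold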
If you further profile the code with Rprof, you will see that subsetting an object of class dgCMatrix (i.e. A[i, j]) calls a number of further R functions, whereas the function [ is implemented straight in C for the basic matrix class. In addition, dgCMatrix belongs to the S4 object system, which means, among other things, that finding the right method to use for [ is a little more costly than usual.
Finally, if you rely on R, you will in general be much better off using vectorised operations. These typically avoid a deep call stack of further (inefficient) R functions and are often implemented in C. With the adjacency matrix, you can quickly find second-degree nodes by inspecting A2 = A %*% A, which will be very fast also (or, I suspect, especially so) for an object of class dgCMatrix.
Timing:
library(igraph)
N <- 300
g <- erdos.renyi.game(N, 0.3)
A <- as_adjacency_matrix(g)
from <- 1
to <- nrow(A)
class(A)
# [1] "dgCMatrix"
# attr(,"package")
# [1] "Matrix"
# run through matrix via for loop
# 23.5 seconds
system.time({
  for(i in from:to){
    for(j in i:to){
      if(A[i,j] == 1) {}
    }
  }
})
# change class
A <- as.matrix(A)
class(A)
# [1] "matrix"
# run for loop again
# 0.097 seconds
system.time({
  for(i in from:to){
    for(j in i:to){
      if(A[i,j] == 1) {}
    }
  }
})
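To illustrate the vectorised route mentioned above (a sketch, not part of the original answer; second_degree is just an illustrative helper name), the squared adjacency matrix gives all two-hop connections at once:
A2 <- A %*% A   # A2[i, j] > 0 means j is reachable from i in exactly two hops

## nodes reachable in two hops that are neither i itself nor direct neighbours of i
second_degree <- function(i) {
  two_hop <- which(A2[i, ] > 0)
  setdiff(two_hop, c(i, which(A[i, ] == 1)))
}

second_degree(1)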

Speeding up Monte Carlo simulations?

I am currently doing research to predict where kudzu (an invasive vine) will spread in Oklahoma over a five-year period using Monte Carlo simulation. I have created a raster with presence points and loaded it into R.
For each Monte Carlo simulation (each year), I am running 6000 iterations to provide accurate results. However, due to the nested "for" loops, this is taking a long time. The first year usually finishes running in about 3 days; the second year, however, has been running for over 3 weeks and is still not complete.
Is there any way to speed this process up?
Each year builds off the previous one. I have provided the code below for the first two years:
library(raster)   # needed for raster()
OK.rast16 <- raster("OK_rast20161.tif")
p.a16 <- as.matrix(OK.rast16)
table(p.a16)
# Set the random number seed so results can be reproduced if needed
set.seed(10)
drow.pa16 <- 133.7197873       # distance of grid cell (meters) in n-s direction
trow.pa16 <- length(p.a16[,1]) # total number of rows
dcol.pa16 <- 133.7197873       # distance of grid cell (meters) in e-w direction
tcol.pa16 <- length(p.a16[1,]) # total number of columns
##### Year 1 of infection in 2016 #####
kudzu_sim1_16 <- matrix(0, trow.pa16, tcol.pa16)
for(m in 1:6000)
{
  OK.kudzu_16 <- p.a16 # initialize matrix of annual dispersal
  for(i in 1:trow.pa16)
  {
    for(j in 1:tcol.pa16)
    {
      if(!is.na(p.a16[i,j]) & p.a16[i,j] == 1)
      {
        for(k in 1:trow.pa16)
        {
          for(l in 1:tcol.pa16)
          {
            if(!is.na(OK.kudzu_16[k,l]) & OK.kudzu_16[k,l] == 0)
            {
              distcalc <- sqrt((abs(i-k)*drow.pa16)^2 + (abs(j-l)*dcol.pa16)^2)
              prob <- exp(0.0369599 - 0.00474401*distcalc)
              if(prob > runif(1)) {OK.kudzu_16[k,l] <- 1}
            }
          }
        }
      }
    }
  }
  kudzu_sim1_16 <- OK.kudzu_16 + kudzu_sim1_16
}
##### Year 2 of infection in 2016 #####
kudzu_sim2_16 <- matrix(0, trow.pa16, tcol.pa16)
for(m in 1:6000)
{
  OK.kudzu1_16 <- OK.kudzu_16 # initialize matrix of annual dispersal
  for(i in 1:trow.pa16)
  {
    for(j in 1:tcol.pa16)
    {
      if(!is.na(OK.kudzu_16[i,j]) & OK.kudzu_16[i,j] == 1)
      {
        for(k in 1:trow.pa16)
        {
          for(l in 1:tcol.pa16)
          {
            if(!is.na(OK.kudzu1_16[k,l]) & OK.kudzu1_16[k,l] == 0)
            {
              distcalc <- sqrt((abs(i-k)*drow.pa16)^2 + (abs(j-l)*dcol.pa16)^2)
              prob <- exp(0.0369599 - 0.00474401*distcalc)
              if(prob > runif(1)) {OK.kudzu1_16[k,l] <- 1}
            }
          }
        }
      }
    }
  }
  kudzu_sim2_16 <- OK.kudzu1_16 + kudzu_sim2_16
}
Here is the raster to load to start the code:
kudzu in OK
Originally a comment, but it rapidly exceeded the length limit:
1) 133 m x 133 m is a very small grid size for spatial simulations on something as large as an entire state. It might help to find a way to make resolution a parameter of your simulation (rather than a hard-wired number), streamline the code so that it runs well at a coarser resolution, then increase the resolution. The raster function has an optional parameter named res which can be used to control the resolution.
2) While vectorization will surely help, it is unlikely to transform an algorithm which runs for weeks with no output into one which runs in a reasonable amount of time. Perhaps you need to fundamentally rethink your algorithm. You seem to be comparing every grid cell with every other grid cell. That doesn't strike me as biologically plausible. Kudzu spreads locally. Why should what happens next year to a given 133 m x 133 m cell in the Oklahoma panhandle depend on the current state of another cell over by Lake Texoma? If your simulation has any biological realism, exp(0.0369599 - 0.00474401*distcalc) should be negligibly small for two such cells, but your code doesn't neglect it. An algorithm which is in some sense localized might be better (see the sketch after this list).
3) There are an awful lot of entries in your matrix which correspond to points outside of Oklahoma. Unless your model is designed to see how kudzu also diffuses over a large part of Texas, those might be irrelevant to your program. If so, a fundamentally different data structure (e.g. a list of locations) which only has entries for points in Oklahoma might be preferable. Or maybe not; just something to think about.
4) For more detailed help, it would help if you explain what your algorithm actually is (and not just what it intends to do). It isn't completely obvious in a quick read of your code.
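Expanding on point 2), here is a minimal, hypothetical sketch of a localized update: each infected cell only "sees" cells within a window where the dispersal kernel exceeds a cutoff. The names p_cut, max_cells and spread_local are made up for illustration; the kernel coefficients and cell size are taken from the question's code.
## probability below which dispersal is treated as impossible (illustrative value)
p_cut <- 1e-6
## exp(0.0369599 - 0.00474401*d) < p_cut  =>  d > (0.0369599 - log(p_cut)) / 0.00474401
max_dist  <- (0.0369599 - log(p_cut)) / 0.00474401    # metres, ~2900 m here
max_cells <- ceiling(max_dist / 133.7197873)          # window half-width in cells

spread_local <- function(occ, drow, dcol, max_cells) {
  out <- occ
  infected <- which(occ == 1, arr.ind = TRUE)
  for (n in seq_len(nrow(infected))) {
    i <- infected[n, 1]; j <- infected[n, 2]
    rows <- max(1, i - max_cells):min(nrow(occ), i + max_cells)
    cols <- max(1, j - max_cells):min(ncol(occ), j + max_cells)
    ## distances from (i, j) to every cell in the local window, vectorised
    dmat <- sqrt(outer((rows - i) * drow, (cols - j) * dcol,
                       function(a, b) a^2 + b^2))
    prob <- exp(0.0369599 - 0.00474401 * dmat)
    hit  <- !is.na(occ[rows, cols]) & occ[rows, cols] == 0 &
            prob > matrix(runif(length(prob)), nrow = nrow(dmat))
    out[rows, cols][hit] <- 1
  }
  out
}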

Function to calculate Euclidean distance in R

I am trying to implement a KNN classifier in R from scratch on the iris data set, and as part of this I have written a function to calculate the Euclidean distance. Here is my code.
known_data <- iris[1:15, c("Sepal.Length", "Petal.Length", "Class")]
unknown_data <- iris[16, c("Sepal.Length", "Petal.Length")]
# euclidean distance
euclidean_dist <- function(k, unk) {
  distance <- 0
  for(i in 1:nrow(k))
    distance[i] <- sqrt((k[,1][i] - unk[,1][i])^2 + (k[,2][i] - unk[,2][i])^2)
  return(distance)
}
euclidean_dist(known_data, unknown_data)
However, when I call the function it returns the first value correctly and the rest as NA.
Could anyone show me where I could have gone wrong with the code?
Thanks in advance.
The aim is to calculate the distance between the ith row of known_data, and the single unknown_data point.
How to fix your code
When you calculate distance[i], you're trying to access the ith row of the unknown data point, which doesn't exist and is hence NA. I believe your code should run fine if you make the following edits:
known_data <- iris[1:15, c("Sepal.Length", "Petal.Length", "Class")]
unknown_data <- iris[16, c("Sepal.Length", "Petal.Length")]
# euclidean distance
euclidean_dist <- function(k, unk) {
  # Make distance a vector [although not technically required]
  distance <- rep(0, nrow(k))
  for(i in 1:nrow(k))
    # Change unk[,1][i] to unk[1,1] and similarly for unk[,2][i]
    distance[i] <- sqrt((k[,1][i] - unk[1,1])^2 + (k[,2][i] - unk[1,2])^2)
  return(distance)
}
euclidean_dist(known_data, unknown_data)
One final note - in the version of R I'm using, the iris dataset has a Species column as opposed to a Class column.
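For instance, with the stock iris data the selection would be:
known_data <- iris[1:15, c("Sepal.Length", "Petal.Length", "Species")]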
An alternative method
As suggested by @Roman Luštrik, the entire aim of getting the Euclidean distances can be achieved with a simple one-liner:
sqrt((known_data[, 1] - unknown_data[, 1])^2 + (known_data[, 2] - unknown_data[, 2])^2)
This is very similar to the function you wrote, but does it in vectorised form, rather than through a loop, which is often a preferable way of doing things in R.
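If you need more than two feature columns, the same idea generalises (a small sketch; euclidean_dist_vec is just an illustrative name):
euclidean_dist_vec <- function(k, unk) {
  ## subtract the single unknown row from every known row, column by column
  sqrt(rowSums(sweep(as.matrix(k), 2, unlist(unk))^2))
}
euclidean_dist_vec(known_data[, c("Sepal.Length", "Petal.Length")], unknown_data)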
The best and fastest way is to use the h2o package:
# load the library
library(h2o)
# initialize the node
h2o.init()
# transform the data frames to h2o type
known_data <- as.h2o(known_data)
unknown_data <- as.h2o(unknown_data)
# create a frame in which the distances are going to be recorded
matrix1 <- h2o.createFrame(rows = nrow(known_data), cols = nrow(unknown_data))
# loop to calculate the distance between all the rows of both frames
for(i in 1:nrow(unknown_data)){
  matrix1[, i] <- as.data.frame(h2o.distance(known_data, unknown_data[i, ], "l2"))
}

vectorize random failure of a graph in R

I have the following function that I intend to run on a large scale with 10000 random networks.
m.ci.fail <- function(graph){
  fail <- function(net) {
    vids <- sample(V(net), 1)
    net <- delete.vertices(net, vids)
    return(net)
  }
  compsize <- function(net, graph) {
    b <- clusters(graph)
    c <- clusters(net)
    S <- max(c$csize) / max(b$csize)
    return(S)
  }
  nodes <- 1:vcount(graph)
  R <- 10000
  cpmat <- matrix(nrow = length(nodes), ncol = R)
  for(i in 1:R){
    gr <- rewire(graph, mode = "simple", niter = 10000)
    E(gr)$weight <- E(graph)$weight
    grr <- gr
    E(grr)$weight <- E(graph)$weight
    cp <- numeric(length(nodes))
    for(t in 1:(length(nodes)-1)){
      gr <- fail(gr)
      grcp <- compsize(gr, grr)
      cp[t] <- grcp
    }
    cpmat[,i] <- cp
  }
  return(cpmat)
}
EDIT: I'm trying to create a randomly re-sampled distribution based on the original graph so I can get a confidence interval and later compare the range of random failures to centrality-based sequential deletions. Testing it as is took hours even with a small graph (30 nodes). I figure that if I could find a way to vectorize the random failure function this would be faster. I'm trying to vectorize the two 'for' loops, but the fail function is making it a pain. Any suggestions on how I can do that?
Thanks in advance,
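One common speed-up for code like this (a hedged sketch, not an answer from the thread): the R replicates are independent, so they can be farmed out to several cores with parallel::mclapply, even though each deletion sequence itself stays sequential. one_replicate and m.ci.fail.par are illustrative names; the baseline component size is computed once per replicate rather than at every deletion.
library(igraph)
library(parallel)

one_replicate <- function(graph) {
  gr <- rewire(graph, mode = "simple", niter = 10000)
  E(gr)$weight <- E(graph)$weight
  base_size <- max(clusters(gr)$csize)   # largest component before any failures
  n <- vcount(graph)
  cp <- numeric(n)
  for (t in 1:(n - 1)) {
    gr <- delete.vertices(gr, sample(V(gr), 1))   # one random failure
    cp[t] <- max(clusters(gr)$csize) / base_size
  }
  cp
}

m.ci.fail.par <- function(graph, R = 10000, cores = max(1, detectCores() - 1)) {
  ## mclapply forks on Unix-alikes; on Windows use parLapply with a cluster instead
  res <- mclapply(seq_len(R), function(i) one_replicate(graph), mc.cores = cores)
  do.call(cbind, res)
}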

Document Term Matrix for Naive Bayes classifier: unexpected results in R

I'm having some very annoying problems getting a Naive Bayes Classifier to work with a document term matrix. I'm sure I'm making a very simple mistake but can't figure out what it is. My data is from accounts spreadsheets. I've been asked to figure out which categories (in text format: mostly names of departments or names of budgets) are more likely to spend money on charities and which ones mostly (or only) spend on private companies. They suggested I use Naive Bayes classifiers to do this. I have a thousand or so rows of data to train a model and many hundreds of thousands of rows to test the model against. I have prepared the strings, replacing spaces with underscores and ands/&s with +, then treated each category as one term: so 'alcohol and drug addiction' becomes: alcohol+drug_addiction.
Some example rows:
"environment+housing strategy+commissioning third_party_payments supporting_ppl_block_gross_chargeable" -> This row went to a charity
"west_north_west customer+tenancy premises h.r.a._special_maintenance" -> This row went to a private company.
Using this example as a template, I wrote the following function to come up with my document term matrix (using tm), both for training and test data.
library(tm)
library(e1071)
getMatrix <- function(chrVect){
  testsource <- VectorSource(chrVect)
  testcorpus <- Corpus(testsource)
  testcorpus <- tm_map(testcorpus, stripWhitespace)
  testcorpus <- tm_map(testcorpus, removeWords, stopwords("english"))
  testmatrix <- t(TermDocumentMatrix(testcorpus))
}
trainmatrix <- getMatrix(traindata$cats)
testmatrix <- getMatrix(testdata$cats)
So far, so good. The problem is when I try to a) apply a Naive Bayes model and b) predict from that model. Using the klaR package I get a zero-probability error, since many of the terms have zero instances in one category, and playing around with the Laplace terms does not seem to fix this. Using e1071, the model worked, but when I tested the model using:
model <- naiveBayes(as.matrix(trainmatrix),as.factor(traindata$Code))
rs<- predict(model, as.matrix(testdata$cats))
... every single item predicted the same category, even though they should be roughly equal. Something in the model clearly isn't working. Looking at some of the terms in model$tables, I can see that many have high values for private and zero for charity, and others vice versa. I have used as.factor for the Code variable.
output:
rs    1  2
   1  0  0
   2 19 17
Any ideas on what is going wrong? Do dtm matrices not play nice with naivebayes? Have I missed a step out in preparing the data? I'm completely out of ideas. Hope this is all clear. Happy to clarify if not. Any suggestions would be much appreciated.
I have run into this problem myself. You have done (as far as I can see) everything right; the naive Bayes implementation in e1071 (and thus klaR) is buggy.
But there is an easy and quick fix so that naive Bayes as implemented in e1071 works again: you should change your text vectors to categorical variables, i.e. as.factor. You have already done this with your target variable traindata$Code, but you also have to do it for your trainmatrix and, of course, for your testdata.
I could not track the bug down 100%, but it lies in this part of the naive Bayes implementation in e1071 (note that klaR is only a wrapper around e1071):
L <- log(object$apriori) + apply(log(sapply(seq_along(attribs),
    function(v) {
        nd <- ndata[attribs[v]]
        ## nd is now a cell, row i, column attribs[v]
        if (is.na(nd) || nd == 0) {
            rep(1, length(object$apriori))
        } else {
            prob <- if (isnumeric[attribs[v]]) {
                ## we select table for attribute
                msd <- object$tables[[v]]
                ## if stddev is eqlt eps, assign threshold
                msd[, 2][msd[, 2] <= eps] <- threshold
                dnorm(nd, msd[, 1], msd[, 2])
            } else {
                object$tables[[v]][, nd]
            }
            prob[prob <= eps] <- threshold
            prob
        }
    })), 1, sum)
You can see that there is an if-else condition: if there are no numeric variables, naive Bayes is used as we expect it to work. If there are numerics - and here comes the bug - this naive Bayes automatically assumes a normal distribution. If you only have 0 and 1 in your term matrix, dnorm behaves very badly. I assume that, due to the very low values produced by dnorm, the probabilities are always replaced by the threshold, and thus the variable with the higher a priori factor will always "win".
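To get a feel for why, here is a tiny illustration with made-up per-class means and standard deviations of the kind a 0/1 term column produces:
## hypothetical per-class mean and sd estimated from a column of 0s and 1s
dnorm(1, mean = 0.02, sd = 0.14)   # roughly 7e-11
dnorm(1, mean = 0.05, sd = 0.21)   # roughly 7e-05
## densities this small carry almost no information about the term, so the class
## priors dominate (or, with a nonzero eps, the values are clipped to the threshold)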
However, if I understand your problem correctly, you do not even need prediction, but rather the probabilities stored in the model for identifying which department gives money to whom. Then all you have to do is take a close look at your model: for every term the model stores the conditional probabilities, which I assume is what you are looking for. Let's do this, plus the factor fix mentioned above, with a slightly modified version of your sample:
## I have changed the vectors slightly
first <- "environment+housing strategy+commissioning third_party_payments supporting_ppl_block_gross_chargeable"
second <- "west_north_west customer+tenancy premises h.r.a._special_maintenance"
categories <- c("charity", "private")
library(tm)
library(e1071)
getMatrix <- function(chrVect){
  testsource <- VectorSource(chrVect)
  testcorpus <- Corpus(testsource)
  testcorpus <- tm_map(testcorpus, stripWhitespace)
  testcorpus <- tm_map(testcorpus, removeWords, stopwords("english"))
  ## testmatrix <- t(TermDocumentMatrix(testcorpus))
  ## instead just use DocumentTermMatrix; the assignment is superfluous
  return(DocumentTermMatrix(testcorpus))
}
## since you did not supply more data, I cannot do anything about these lines
## trainmatrix <- getMatrix(traindata$cats)
## testmatrix <- getMatrix(testdata$cats)
## instead only
trainmatrix <- getMatrix(c(first, second))
## I prefer this over as.matrix as I can add the categories more easily
traindf <- data.frame(categories, as.data.frame(inspect(trainmatrix)))
## now turn every term column into a factor, since numeric columns trigger the dnorm problem described above
for (cols in names(traindf[-1])) traindf[[cols]] <- factor(traindf[[cols]])
## traindf <- apply(traindf, 2, as.factor) did not result in factors
## check if it's as we wished
str(traindf)
## it is
## let's create a model (with formula syntax)
model <- naiveBayes(categories ~ ., data = traindf)
## if you look at the output (doubled to see it more clearly)
predict(model, newdata = rbind(traindf[-1], traindf[-1]))
But as I have already said, you do not need to predict. A look at the model is enough; e.g. model$tables$premises will give you the likelihood for the premises term being associated with money going to private corporations: 100%.
If you are dealing with very large datasets, you should specify threshold and eps when predicting. eps defines the limit at or below which a probability is replaced by the threshold. E.g. eps = 0 and threshold = 0.000001 can be of use.
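In e1071 these are arguments of predict.naiveBayes, so a minimal sketch (reusing the model and traindf objects from above) would be:
predict(model, newdata = rbind(traindf[-1], traindf[-1]),
        eps = 0, threshold = 0.000001)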
Furthermore, you should stick to term-frequency weighting; tf-idf, for example, will not work because of the dnorm call in the naive Bayes implementation.
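For example (a sketch, assuming a corpus object like the testcorpus built inside getMatrix above), plain term-frequency weighting is the tm default and can be made explicit:
## weightTf keeps integer counts; weightTfIdf would produce fractional values,
## which sends every term down the dnorm branch shown earlier
dtm_tf <- DocumentTermMatrix(testcorpus, control = list(weighting = weightTf))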
Hope I can finally get my 50 reputation :P
