How to implement the Jaccard distance in the kproto function - r

I am trying to implement the Jaccard distance in the kproto function (package clustMixType in R), but without any success. The aim is to run a cluster analysis on my dataset.
The Jaccard distance that I want to use is the complement of the Jaccard similarity coefficient, so
Jaccard distance = 1-[a/(a+b+c)] = [(b+c)/(a+b+c)], or
Jaccard distance = 1-[M11/(M01+M10+M11)] = [(M01+M10)/(M01+M10+M11)].
The source code of the kproto function is presented below. The variable d1 is the Euclidean distance for the numeric variables and the variable d2 is the distance from the simple matching coefficient for the categorical variables (as factors).
It computes the distances between the observations and the prototypes, not between observations. The prototypes are computed values, not observations from the data set itself.
So my two questions are:
1) d2 is what I want to modify, but how?
2) should d1 be the sqrt of what is being calculated?
Thank you for all your help. It will be much appreciated.
Here is an excerpt of the dataset I'm working on, where V1 to V4 are factor (binary) variables (to use the Jaccard distance) and V5 to V8 are numeric variables (to use the Euclidean distance):
V1;V2;V3;V4;V5;V6;V7;V8
1;1;0;0;6;2;3;3
0;1;0;1;3;5;2;1
1;1;0;0;1;3;2;1
1;1;0;0;4;3;3;1
1;1;1;0;1;4;1;1
1;0;1;0;4;3;1;1
1;1;0;0;2;4;2;1
1;1;0;0;2;4;2;1
1;1;0;0;6;2;1;1
1;1;0;0;6;2;2;1
1;1;0;0;5;2;3;1
1;1;0;0;4;3;3;1
1;1;0;0;4;4;2;1
1;1;0;0;4;3;3;1
1;1;0;0;4;3;3;1
1;1;0;0;3;4;2;1
1;1;0;0;4;3;2;1
1;1;0;0;5;2;3;1
1;1;0;0;4;3;4;1
1;1;0;0;4;3;2;1
1;1;0;0;4;3;2;1
1;1;0;0;3;3;2;1
1;1;0;0;3;3;3;1
1;1;0;0;5;2;3;1
1;1;0;0;5;2;3;1
1;1;0;0;5;2;2;1
1;1;0;0;3;3;2;1
1;1;0;0;5;2;3;1
1;1;0;0;5;2;2;1
1;0;0;0;3;4;2;1
1;1;0;0;7;2;4;1
1;1;0;0;7;2;2;1
1;1;0;0;5;2;4;1
1;1;0;0;5;3;4;1
1;1;0;0;5;3;2;1
1;1;0;0;5;3;4;1
1;0;0;0;3;5;3;1
0;1;0;0;6;2;4;1
1;1;0;0;6;2;3;1
1;1;0;0;6;2;4;1
Let's take the first two observations from the dataset provided above as an example:
V1;V2;V3;V4;V5;V6;V7;V8
1;1;0;0;6;2;3;3
0;1;0;1;3;5;2;1
The algorithm first selects the k prototypes from the data set at random, so let's suppose that the second observation is one of the initial prototypes. As I understand it, the algorithm creates a data frame called "protos", initially holding k random observations from the data set provided, so the second observation would be one of the rows of the "protos" data frame.
The combined distance used to cluster the observations would be d=d1+lambda*d2. Lambda can also be a vector of individual weights for each variable. d is the distance between the observations in the data set provided and the "protos" data frame created initially with k random observations.
In this case, considering the first two observations presented, the calculated distances between the observation (yi) and the prototype (yk) would be as follows:
Euclidean for the numeric variables (V5 to V8):
d1=[sum_j (yij-ykj)^2]^0.5
where,
k=1 to k clusters
i=1 to n observations
j=5 to 8 (index of the numeric variable)
d1=[[(6-3)^2]+[(2-5)^2]+[(3-2)^2]+[(3-1)^2]]^0.5
d1=[9+9+1+4]^0.5
d1=4.796
Jaccard, for the set of binary variables (V1 to V4):
d2=[(b+c)/(a+b+c)]
where,
a=1
b=1
c=1
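are the match counts between the observation and the prototype over variables V1 to V4: a counts variables where both are 1, b counts variables where the observation is 1 and the prototype is 0, and c counts variables where the observation is 0 and the prototype is 1.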
d2=[(1+1)/(1+1+1)]
d2=2/3
d2=0.667
So the combined distance between this specific observation and the initial prototype of that cluster (taking lambda = 1) is:
d=d1+d2
d=4.796+0.667
d=5.463
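The worked example above can be checked in a few lines of R. This is only a sketch for the two observations shown, with lambda taken as 1; the vector names obs and proto and the index vectors bin and num are made up for the illustration.

```r
# First observation and the prototype (second observation) from the example
obs   <- c(V1 = 1, V2 = 1, V3 = 0, V4 = 0, V5 = 6, V6 = 2, V7 = 3, V8 = 3)
proto <- c(V1 = 0, V2 = 1, V3 = 0, V4 = 1, V5 = 3, V6 = 5, V7 = 2, V8 = 1)
bin <- 1:4  # binary variables V1..V4
num <- 5:8  # numeric variables V5..V8

# Euclidean distance on the numeric part (with the square root)
d1 <- sqrt(sum((obs[num] - proto[num])^2))

# Jaccard counts on the binary part
a  <- sum(obs[bin] == 1 & proto[bin] == 1)  # both 1
b  <- sum(obs[bin] == 1 & proto[bin] == 0)  # observation only
cc <- sum(obs[bin] == 0 & proto[bin] == 1)  # prototype only
d2 <- (b + cc) / (a + b + cc)

d <- d1 + d2  # lambda assumed to be 1
round(c(d1 = d1, d2 = d2, d = d), 3)
```

The rounded values agree with the hand calculation (4.796, 0.667 and 5.462; the 5.463 above comes from adding the already rounded terms).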
The results, as I understood, are then stored in a matrix called "d", line by line, the size of [number of lines=number of observations, number of columns = number of clusters k].
I'm expecting to correctly calculate the Euclidean and Jaccard distances by modifying the kproto function, while maintaining the steps and results provided by the original function.
NOTE: the final function should work on any number of observations, variables and prototypes, and not only to my specific dataset.
I've also tried to mix the codes from kproto (clustMixType package) and dist.binary (ade4 package), but they work in different ways.
#K-Prototypes algorithm
kproto.default <- function(x, k, lambda = NULL, iter.max = 100, nstart = 1, na.rm = TRUE, keep.data = TRUE, verbose = TRUE, ...){
# initial error checks
if(!is.data.frame(x)) stop("x should be a data frame!")
if(ncol(x) < 2) stop("For clustering x should contain at least two variables!")
if(iter.max < 1 | nstart < 1) stop("iter.max and nstart must not be specified < 1!")
if(!is.null(lambda)){
if(any(lambda < 0)) stop("lambda must be specified >= 0!")
if(!any(lambda > 0)) stop("lambda must be specified > 0 for at least one variable!")
}
# check for numeric and factor variables
numvars <- sapply(x, is.numeric)
anynum <- any(numvars)
catvars <- sapply(x, is.factor)
anyfact <- any(catvars)
if(!anynum) stop("\n No numeric variables in x! Try using kmodes() from package klaR...\n\n")
if(!anyfact) stop("\n No factor variables in x! Try using kmeans()...\n\n")
# treatment of missings
NAcount <- apply(x, 2, function(z) sum(is.na(z)))
if(verbose){
cat("# NAs in variables:\n")
print(NAcount)
}
if(any(NAcount == nrow(x))) stop(paste("Variable(s) have only NAs please remove them:",names(NAcount)[NAcount == nrow(x)],"!"))
if(na.rm) {
miss <- apply(x, 1, function(z) any(is.na(z)))
if(verbose){
cat(sum(miss), "observation(s) with NAs.\n")
if(sum(miss) > 0) message("Observations with NAs are removed.\n")
cat("\n")
}
x <- x[!miss,]
} # remove missings
if(!na.rm){
allNAs <- apply(x,1,function(z) all(is.na(z)))
if(sum(allNAs) > 0){
if(verbose) cat(sum(allNAs), "observation(s) where all variables NA.\n")
warning("No meaningful cluster assignment possible for observations where all variables NA.\n")
if(verbose) cat("\n")
}
}
if(nrow(x) == 1) stop("Only one observation clustering not meaningful.")
k_input <- k # store input k for nstart > 1 as clusters can be merged
# initialize prototypes
if(!is.data.frame(k)){
if (length(k) == 1){
if(as.integer(k) != k){k <- as.integer(k); warning(paste("k has been set to", k,"!"))}
if(nrow(x) < k) stop("Data frame has less observations than clusters!")
ids <- sample(nrow(x), k)
protos <- x[ids,]
}
if (length(k) > 1){
if(nrow(x) < length(k)) stop("Data frame has less observations than clusters!")
ids <- k
k <- length(ids)
if(length(unique(ids)) != length(ids)) stop("If k is specified as a vector it should contain different indices!")
if(any(ids<1)|any(ids>nrow(x))) stop("If k is specified as a vector all elements must be valid indices of x!")
#check for integer
protos <- x[ids,]
}
rm(ids)
}
if(is.data.frame(k)){
if(nrow(x) < nrow(k)) stop("Data frame has less observations than clusters!")
if(length(names(k)) != length(names(x))) stop("k and x have different numbers of columns!")
if(any(names(k) != names(x))) stop("k and x have different column names!")
if(anynum) {if( any(sapply(k, is.numeric) != numvars)) stop("Numeric variables of k and x do not match!")}
if(anyfact) {if( any(sapply(k, is.factor) != catvars)) stop("Factor variables of k and x do not match!")}
protos <- k
k <- nrow(protos)
}
if(k < 1) stop("Number of clusters k must not be smaller than 1!")
# automatic calculation of lambda
if(length(lambda) > 1) {if(length(lambda) != sum(c(numvars,catvars))) stop("If lambda is a vector, its length should be the sum of numeric and factor variables in the data frame!")}
if(is.null(lambda)){
if(anynum & anyfact){
vnum <- mean(sapply(x[,numvars, drop = FALSE], var, na.rm = TRUE))
vcat <- mean(sapply(x[,catvars, drop = FALSE], function(z) return(1-sum((table(z)/sum(!is.na(z)))^2))))
if (vnum == 0){
if(verbose) warning("All numerical variables have zero variance.")
anynum <- FALSE
}
if (vcat == 0){
if(verbose) warning("All categorical variables have zero variance.")
anyfact <- FALSE
}
if(anynum & anyfact){
lambda <- vnum/vcat
if(verbose) cat("Estimated lambda:", lambda, "\n\n")
}else{
lambda <- 1
}
}
}
# initialize clusters
clusters <- numeric(nrow(x))
tot.dists <- NULL
moved <- NULL
iter <- 1
# check for any equal prototypes and reduce cluster number in case of occurence
if(k > 1){
keep.protos <- rep(TRUE,k)
for(l in 1:(k-1)){
for(m in (l+1):k){
d1 <- sum((protos[l,numvars, drop = FALSE]-protos[m,numvars, drop = FALSE])^2) # euclidean for numerics
d2 <- sum(protos[l,catvars, drop = FALSE] != protos[m,catvars, drop = FALSE]) # wtd simple matching for categorics
if((d1+d2) == 0) keep.protos[m] <- FALSE
}
}
if(!all(keep.protos)){
protos <- protos[keep.protos,]
k <- sum(keep.protos)
if(verbose) message("Equal prototypes merged. Cluster number reduced to:", k, "\n\n")
}
}
# special case only one cluster
if(k == 1){clusters <- rep(1, nrow(x)); size <- table(clusters); iter <- iter.max} # REM: named vector size is needed later...
# start iterations for standard case (i.e. k > 1)
while(iter < iter.max){
# compute distances
nrows <- nrow(x)
dists <- matrix(NA, nrow=nrows, ncol = k)
for(i in 1:k){
#a0 <- proc.time()[3]
#d1 <- apply(x[,numvars],1, function(z) sum((z-protos[i,numvars])^2)) # euclidean for numerics
d1 <- (x[,numvars, drop = FALSE] - matrix(rep(as.numeric(protos[i, numvars, drop = FALSE]), nrows), nrow=nrows, byrow=T))^2
if(length(lambda) == 1) d1 <- rowSums(d1, na.rm = TRUE)
if(length(lambda) > 1) d1 <- as.matrix(d1) %*% lambda[numvars]
#a1 <- proc.time()[3]
#d2 <- lambda * apply(x[,catvars],1, function(z) sum((z != protos[i,catvars]))) # wtd simple matching for categorics
d2 <- sapply(which(catvars), function(j) return(x[,j] != rep(protos[i,j], nrows)) )
d2[is.na(d2)] <- FALSE
if(length(lambda) == 1) d2 <- lambda * rowSums(d2)
if(length(lambda) > 1) d2 <- as.matrix(d2) %*% lambda[catvars]
#a2 <- proc.time()[3]
dists[,i] <- d1 + d2
#cat(a1-a0, a2-a1, "\n")
}
# assign clusters
old.clusters <- clusters
# clusters <- apply(dists, 1, function(z) which.min(z))
clusters <- apply(dists, 1, function(z) {a <- which.min(z); if (length(a)>1) a <- sample(a,1); return(a)}) # sample in case of multiple minima
size <- table(clusters)
min.dists <- apply(cbind(clusters, dists), 1, function(z) z[z[1]+1])
within <- as.numeric(by(min.dists, clusters, sum))
tot.within <- sum(within)
# prevent from empty classes
#tot.within <- numeric(k)
#totw.list <- by(min.dists, clusters, sum)
#tot.within[names(totw.list)] <- as.numeric(totw.list)
# ...check for empty clusters and eventually reduce number of prototypes
if (length(size) < k){
k <- length(size)
protos <- protos[1:length(size),]
if(verbose) cat("Empty clusters occur. Cluster number reduced to:", k, "\n\n")
}
# trace
tot.dists <- c(tot.dists, sum(tot.within))
moved <- c(moved, sum(clusters != old.clusters))
# compute new prototypes
remids <- as.integer(names(size))
for(i in remids){
protos[which(remids == i), numvars] <- sapply(x[clusters==i, numvars, drop = FALSE], mean, na.rm = TRUE)
protos[which(remids == i), catvars] <- sapply(x[clusters==i, catvars, drop = FALSE], function(z) levels(z)[which.max(table(z))])
}
if(k == 1){clusters <- rep(1, length(clusters)); size <- table(clusters); iter <- iter.max; break}
# check for any equal prototypes and reduce cluster number in case of occurence
if(iter == (iter.max-1)){ # REM: for last iteration equal prototypes are allowed. otherwise less prototypes than assigned clusters.
keep.protos <- rep(TRUE,k)
for(l in 1:(k-1)){
for(m in (l+1):k){
d1 <- sum((protos[l,numvars, drop = FALSE]-protos[m,numvars, drop = FALSE])^2) # euclidean for numerics
d2 <- sum(protos[l,catvars, drop = FALSE] != protos[m,catvars, drop = FALSE]) # wtd simple matching for categorics
if((d1+d2) == 0) keep.protos[m] <- FALSE
}
}
if(!all(keep.protos)){
protos <- protos[keep.protos,]
k <- sum(keep.protos)
if(verbose) cat("Equal prototypes merged. Cluster number reduced to:", k, "\n\n")
}
}
# add stopping rules
if(moved[length(moved)] == 0) break
if(k == 1){clusters <- rep(1, length(clusters)); size <- table(clusters); iter <- iter.max; break}
#cat("iter", iter, "moved", moved[length(moved)], "tot.dists",tot.dists[length(tot.dists)],"\n" )
iter <- iter+1
}
### Final update of prototypes and dists
if(iter == iter.max){ # otherwise there have been no moves anymore and prototypes correspond to cluster assignments
# compute new prototypes
remids <- as.integer(names(size))
for(i in remids){
protos[which(remids == i), numvars] <- sapply(x[clusters==i, numvars, drop = FALSE], mean, na.rm = TRUE)
protos[which(remids == i), catvars] <- sapply(x[clusters==i, catvars, drop = FALSE], function(z) levels(z)[which.max(table(z))])
}
# compute distances
nrows <- nrow(x)
dists <- matrix(NA, nrow=nrows, ncol = k)
for(i in 1:k){
d1 <- (x[,numvars, drop = FALSE] - matrix(rep(as.numeric(protos[i, numvars, drop = FALSE]), nrows), nrow=nrows, byrow=T))^2
if(length(lambda) == 1) d1 <- rowSums(d1, na.rm = TRUE)
if(length(lambda) > 1) d1 <- as.matrix(d1) %*% lambda[numvars]
d2 <- sapply(which(catvars), function(j) return(x[,j] != rep(protos[i,j], nrows)) )
d2[is.na(d2)] <- FALSE
if(length(lambda) == 1) d2 <- lambda * rowSums(d2)
if(length(lambda) > 1) d2 <- as.matrix(d2) %*% lambda[catvars]
dists[,i] <- d1 + d2
}
size <- table(clusters)
min.dists <- apply(cbind(clusters, dists), 1, function(z) z[z[1]+1])
within <- as.numeric(by(min.dists, clusters, sum))
tot.within <- sum(within)
}
names(clusters) <- row.names(dists) <- row.names(x)
rownames(protos) <- NULL
# create result:
res <- list(cluster = clusters,
centers = protos,
lambda = lambda,
size = size,
withinss = within,
tot.withinss = tot.within,
dists = dists,
iter = iter,
trace = list(tot.dists = tot.dists, moved = moved))
# loop: if nstart > 1:
if(nstart > 1)
for(j in 2:nstart){
res.new <- kproto(x=x, k=k_input, lambda = lambda, iter.max = iter.max, nstart=1, verbose=verbose)
if(res.new$tot.withinss < res$tot.withinss) res <- res.new
}
if(keep.data) res$data = x
class(res) <- "kproto"
return(res)
}

I've managed to modify the functions to accept a variety of similarity measures and uploaded the R file at http://dx.doi.org/10.17632/63nyn9tjcd.1, in case someone needs it.
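For readers who only need the Jaccard part, here is a minimal sketch of how a d2 replacement could look. It is not the code from the linked file. The function name jaccard_to_proto, the present = "1" level label, and the choice to return 0 when both the observation and the prototype contain no 1 at all (the 0/0 case) are assumptions made for this illustration.

```r
# Sketch only: Jaccard distance between every row of a block of binary
# factors and a single prototype row. Names and conventions are assumptions.
jaccard_to_proto <- function(x_bin, proto_bin, present = "1") {
  xb <- as.matrix(x_bin) == present                            # n x p logical
  pb <- vapply(proto_bin, as.character, character(1)) == present
  pm <- matrix(pb, nrow = nrow(xb), ncol = length(pb), byrow = TRUE)
  a  <- rowSums(xb & pm)    # both 1
  bc <- rowSums(xb != pm)   # exactly one of the two is 1 (b + c)
  ifelse(a + bc == 0, 0, bc / (a + bc))  # 0/0 treated as distance 0
}

x_bin <- data.frame(
  V1 = factor(c(1, 0, 0), levels = 0:1),
  V2 = factor(c(1, 1, 0), levels = 0:1),
  V3 = factor(c(0, 0, 0), levels = 0:1),
  V4 = factor(c(0, 1, 0), levels = 0:1)
)
proto <- x_bin[2, ]  # use the second row as the prototype
d2 <- jaccard_to_proto(x_bin, proto)
d2
```

Inside the kproto loop, a call such as jaccard_to_proto(x[, catvars, drop = FALSE], protos[i, catvars, drop = FALSE]) could stand in for the simple matching count when lambda is a single number; the vector-lambda branch has no obvious Jaccard equivalent, because the Jaccard distance is not a sum of per-variable terms.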

Related

Genetic algorithm (GA) to select the optimal n values of a vector

I have to choose 10 elements of a vector to maximize a function. Since the vector is pretty long, there are too many possibilities (~1000 choose 10) to compute them all. So I started to look into the GA package to use a genetic algorithm.
I came up with this MWE:
values <- 1:1000
# Fitness function which I want to maximise
f <- function(x){
# Choose values
y <- values[x]
# From the first 10 sum up the odd values.
y <- ifelse(y %% 2 != 0, y, 0)
y <- y[1:10]
return(sum(y))
}
# Maximum value of f for this example
y <- ifelse(values %% 2 != 0, values, 0)
sum(sort(y, decreasing = TRUE)[1:10])
# [1] 9900
# Genetic algorithm
GA <- ga(type = "permutation", fitness = f, lower = rep(1, 10), upper = rep(1000, 10), maxiter = 100)
summary(GA)
The results are a bit underwhelming. From summary(GA), I get the feeling that the algorithm always permutes all 1000 values (the solution goes from x1 to x1000), which leads to an inefficient optimization. How can I tell the algorithm that it should only use 10 values (so the solution is x1 .. x10)?
You should read https://www.jstatsoft.org/article/view/v053i04. This is not a permutation problem but a selection problem, so you should use the binary type of genetic algorithm. Because you want to select exactly 10 values (10 ones and 990 zeroes), you should probably write your own genetic operators, because that constraint will hardly ever be satisfied by the default operators (even if the fitness function returns -Inf when there are more than 10 ones). One approach:
Population (k tells how many ones you want):
myInit <- function(k){
  function(GA){
    m <- matrix(0, ncol = GA@nBits, nrow = GA@popSize)
    for(i in seq_len(GA@popSize))
      m[i, sample(GA@nBits, k)] <- 1
    m
  }
}
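The point of a custom initializer is that every individual starts with exactly k ones. A small sketch of that invariant, which runs without the GA package, is shown here. MockGA is a made-up S4 class that only mimics the two slots the initializer reads, and init_k_ones is a hypothetical stand-in for the initializer.

```r
library(methods)

# Hypothetical stand-in for the GA object: only the slots used here
setClass("MockGA", representation(nBits = "numeric", popSize = "numeric"))

# Same idea as the population function: k ones per row
init_k_ones <- function(k) {
  function(object) {
    m <- matrix(0, nrow = object@popSize, ncol = object@nBits)
    for (i in seq_len(object@popSize)) {
      m[i, sample(object@nBits, k)] <- 1
    }
    m
  }
}

set.seed(1)
pop <- init_k_ones(10)(new("MockGA", nBits = 1000, popSize = 5))
rowSums(pop)  # every individual selects exactly 10 positions
```

The crossover and mutation operators have to preserve the same row sum, otherwise the constraint drifts after the first generation.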
Crossover
library(magrittr) # provides %>%
myCrossover <- function(GA, parents){
  # one row per parent, holding the indices of its selected bits
  parents <- GA@population[parents,] %>%
    apply(1, function(x) which(x == 1)) %>%
    t()
  # indices that the other parent has and this parent lacks
  parents_diff <- vector("list", 2)
  parents_diff[[1]] <- setdiff(parents[2,], parents[1,])
  parents_diff[[2]] <- setdiff(parents[1,], parents[2,])
  children_ind <- vector("list", 2)
  for(i in 1:2){
    k <- length(parents_diff[[i]])
    change_k <- sample(k, sample(ceiling(k/2), 1))
    children_ind[[i]] <- if(length(change_k) > 0){
      c(parents[i, -change_k], parents_diff[[i]][change_k])
    } else {
      parents[i,]
    }
  }
  children <- matrix(0, nrow = 2, ncol = GA@nBits)
  for(i in 1:2)
    children[i, children_ind[[i]]] <- 1
  list(children = children, fitness = c(NA, NA))
}
Mutation
myMutation <- function(GA, parent){
  ind <- which(GA@population[parent,] == 1)
  n_change <- sample(3, 1)
  ind[sample(length(ind), n_change)] <- sample(setdiff(seq_len(GA@nBits), ind), n_change)
  parent <- integer(GA@nBits)
  parent[ind] <- 1
  parent
}
Fitness (your function adapted for binary GA):
f <- function(x, values){
ind <- which(x == 1)
y <- values[ind]
y <- ifelse(y %% 2 != 0, y, 0)
y <- y[1:10]
return(sum(y))
}
GA:
GA <- ga(
type = "binary",
fitness = f,
values = values,
nBits = length(values),
population = myInit(10),
crossover = myCrossover,
mutation = myMutation,
run = 300,
pmutation = 0.3,
maxiter = 10000,
popSize = 100
)
Chosen values
values[which(GA@solution[1,] == 1)]

Speeding up linear model fitting on complete pairwise observations in large sparse matrix in R

I have a numeric data.frame df with 134946 rows x 1938 columns.
99.82% of the data are NA.
For each pair of (distinct) columns "P1" and "P2", I need to find which rows have non-NA values for both and then do some operations on those rows (linear model).
I wrote a script that does this, but it seems quite slow.
This post seems to discuss a related task, but I can't immediately see if or how it can be adapted to my case.
Borrowing the example from that post:
set.seed(54321)
nr = 1000;
nc = 900;
dat = matrix(runif(nr*nc), nrow=nr)
rownames(dat) = paste(1:nr)
colnames(dat) = paste("time", 1:nc)
dat[sample(nr*nc, nr*nc*0.9)] = NA
df <- as.data.frame(dat)
df_ps <- names(df)
N_ps <- length(df_ps)
My script is:
tic = proc.time()
out <- do.call(rbind,sapply(1:(N_ps-1), function(i) {
if (i/10 == floor(i/10)) {
cat("\ni = ",i,"\n")
toc = proc.time();
show(toc-tic);
}
do.call(rbind,sapply((i+1):N_ps, function(j) {
w <- which(complete.cases(df[,i],df[,j]))
N <- length(w)
if (N >= 5) {
xw <- df[w,i]
yw <- df[w,j]
if ((diff(range(xw)) != 0) & (diff(range(yw)) != 0)) {
s <- summary(lm(yw~xw))
o <- c(i,j,N,s$adj.r.squared,s$coefficients[2],s$coefficients[4],s$coefficients[8],s$coefficients[1],s$coefficients[3],s$coefficients[7])} else {
o <- c(i,j,N,rep(NA,7))
}
} else {o <- NULL}
return(o)
},simplify=F))
}
,simplify=F))
toc = proc.time();
show(toc-tic);
This takes about 10 minutes on my machine.
You can imagine what happens when I need to handle a much larger (although more sparse) data matrix. I never managed to finish the calculation.
Question: do you think this could be done more efficiently?
The thing is I don't know which operations take more time (subsetting of df, in which case I would remove duplications of that? appending matrix data, in which case I would create a flat vector and then convert it to matrix at the end? ...).
Thanks!
EDIT following up from minem's post
As shown by minem, the speed of this calculation strongly depended on the way linear regression parameters were calculated. Therefore changing that part was the single most important thing to do.
My own further trials showed that: 1) it was essential to use sapply in combination with do.call(rbind, rather than any flat vector, to store the data (I am still not sure why - I might make a separate post about this); 2) on the original matrix I am working on, much more sparse and with a much larger nrows/ncolumns ratio than the one in this example, using the information on the x vector available at the start of each i iteration to reduce the y vector at the start of each j iteration increased the speed by several orders of magnitude, even compared with minem's original script, which was already much better than mine above.
I suppose the advantage comes from filtering out many rows a priori, thus avoiding costly xna & yna operations on very long vectors.
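The index trick can be seen on a toy example. This is only an illustration with made-up logical vectors: x_ok and y_ok play the role of naIds[[i]] and naIds[[j]].

```r
# TRUE marks a non-NA entry in the respective column
x_ok <- c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE)
y_ok <- c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE)

xna <- which(x_ok)       # computed once per i: rows where x is present
yna <- which(y_ok[xna])  # per j: only look at y on those rows
w   <- xna[yna]          # rows where both are present

w
identical(w, which(x_ok & y_ok))
```

With very sparse columns, xna is short, so the per-j work touches a handful of elements instead of the full column length.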
The modified script is the following:
set.seed(54321)
nr = 1000;
nc = 900;
dat = matrix(runif(nr*nc), nrow = nr)
rownames(dat) = paste(1:nr)
colnames(dat) = paste("time", 1:nc)
dat[sample(nr*nc, nr*nc*0.90)] = NA
df <- as.data.frame(dat)
df_ps <- names(df)
N_ps <- length(df_ps)
tic = proc.time()
naIds <- lapply(df, function(x) !is.na(x))
dl <- as.list(df)
rl <- sapply(1:(N_ps - 1), function(i) {
if ((i-1)/10 == floor((i-1)/10)) {
cat("\ni = ",i,"\n")
toc = proc.time();
show(toc-tic);
}
x <- dl[[i]]
xna <- which(naIds[[i]])
rl2 <- sapply((i + 1):N_ps, function(j) {
y <- dl[[j]][xna]
yna <- which(naIds[[j]][xna])
w <- xna[yna]
N <- length(w)
if (N >= 5) {
xw <- x[w]
yw <- y[yna]
if ((min(xw) != max(xw)) && (min(yw) != max(yw))) {
# extracts from lm/lm.fit/summary.lm functions
X <- cbind(1L, xw)
m <- .lm.fit(X, yw)
# calculate adj.r.squared
fitted <- yw - m$residuals
rss <- sum(m$residuals^2)
mss <- sum((fitted - mean(fitted))^2)
n <- length(m$residuals)
rdf <- n - m$rank
# rdf <- df.residual
r.squared <- mss/(mss + rss)
adj.r.squared <- 1 - (1 - r.squared) * ((n - 1L)/rdf)
# calculate se & pvals
p1 <- 1L:m$rank
Qr <- m$qr
R <- chol2inv(Qr[p1, p1, drop = FALSE])
resvar <- rss/rdf
se <- sqrt(diag(R) * resvar)
est <- m$coefficients[m$pivot[p1]]
tval <- est/se
pvals <- 2 * pt(abs(tval), rdf, lower.tail = FALSE)
res <- c(m$coefficients[2], se[2], pvals[2],
m$coefficients[1], se[1], pvals[1])
o <- c(i, j, N, adj.r.squared, res)
} else {
o <- c(i,j,N,rep(NA,7))
}
} else {o <- NULL}
return(o)
}, simplify = F)
do.call(rbind, rl2)
}, simplify = F)
out2 <- do.call(rbind, rl)
toc = proc.time();
show(toc - tic)
E.g. try with nr=100000; nc=100.
I should probably mention that I tried using indices, i.e.:
naIds <- lapply(df, function(x) which(!is.na(x)))
and then obviously generating w by intersection:
w <- intersect(xna,yna)
N <- length(w)
This however is slower than the above.
The largest bottleneck is the lm function, because it performs a lot of checks and additional calculations that you do not necessarily need. So I extracted only the needed parts.
I got this to run in about 18 seconds.
set.seed(54321)
nr = 1000;
nc = 900;
dat = matrix(runif(nr*nc), nrow = nr)
rownames(dat) = paste(1:nr)
colnames(dat) = paste("time", 1:nc)
dat[sample(nr*nc, nr*nc*0.9)] = NA
df <- as.data.frame(dat)
df_ps <- names(df)
N_ps <- length(df_ps)
tic = proc.time()
naIds <- lapply(df, function(x) !is.na(x)) # computed once, outside the loop
dl <- as.list(df) # subsetting list elements is faster than subsetting columns
rl <- sapply(1:(N_ps - 1), function(i) {
x <- dl[[i]]
xna <- naIds[[i]] # logical vector marking the non-NA elements
rl2 <- sapply((i + 1):N_ps, function(j) {
y <- dl[[j]]
yna <- naIds[[j]]
w <- xna & yna
N <- sum(w)
if (N >= 5) {
xw <- x[w]
yw <- y[w]
if ((min(xw) != max(xw)) && (min(yw) != max(yw))) { # faster than diff(range())
# extracts from lm/lm.fit/summary.lm functions
X <- cbind(1L, xw)
m <- .lm.fit(X, yw)
# calculate adj.r.squared
fitted <- yw - m$residuals
rss <- sum(m$residuals^2)
mss <- sum((fitted - mean(fitted))^2)
n <- length(m$residuals)
rdf <- n - m$rank
# rdf <- df.residual
r.squared <- mss/(mss + rss)
adj.r.squared <- 1 - (1 - r.squared) * ((n - 1L)/rdf)
# calculate se & pvals
p1 <- 1L:m$rank
Qr <- m$qr
R <- chol2inv(Qr[p1, p1, drop = FALSE])
resvar <- rss/rdf
se <- sqrt(diag(R) * resvar)
est <- m$coefficients[m$pivot[p1]]
tval <- est/se
pvals <- 2 * pt(abs(tval), rdf, lower.tail = FALSE)
res <- c(m$coefficients[2], se[2], pvals[2],
m$coefficients[1], se[1], pvals[1])
o <- c(i, j, N, adj.r.squared, res)
} else {
o <- c(i,j,N,rep(NA,7))
}
} else {o <- NULL}
return(o)
}, simplify = F)
do.call(rbind, rl2)
}, simplify = F)
out2 <- do.call(rbind, rl)
toc = proc.time();
show(toc - tic);
# user system elapsed
# 17.94 0.11 18.44
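As a sanity check on the extracted formulas, the following sketch compares the .lm.fit-based standard errors and p-values with what summary(lm()) reports. The simulated data and the variable names are made up for the check.

```r
set.seed(1)
xw <- runif(20)
yw <- 2 * xw + rnorm(20, sd = 0.1)

# Low-level fit, as in the loop
X <- cbind(1, xw)
m <- .lm.fit(X, yw)
rss <- sum(m$residuals^2)
rdf <- length(yw) - m$rank
p1  <- 1:m$rank
R   <- chol2inv(m$qr[p1, p1, drop = FALSE])
se  <- sqrt(diag(R) * rss / rdf)
pvals <- 2 * pt(abs(m$coefficients / se), rdf, lower.tail = FALSE)

# Reference values from the full lm machinery
ref <- coef(summary(lm(yw ~ xw)))
all.equal(unname(m$coefficients), unname(ref[, "Estimate"]))
all.equal(unname(se), unname(ref[, "Std. Error"]))
all.equal(unname(pvals), unname(ref[, "Pr(>|t|)"]))
```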

Reducing a network by appending strong links to the spanning tree

I am trying to reduce a full graph so that it only shows the strongest links.
To do that I am doing this:
1. Multiply the original matrix by -1 (the values are in the interval [0,1]).
2. Obtain the minimum spanning tree; because of step 1, this is the spanning tree with the strongest links.
3. Add other strong links that are not part of the spanning tree until the average number of links per node is <= 5.
To perform these steps I have defined two functions that should be equivalent, and I have to determine whether the output of the fast function is the same as the output of the slow function.
Setup
# number of links
n <- 10
n2 <- sqrt(n)
library(igraph)
set.seed(1234)
mat <- matrix(rnorm(n, 1, 0.5), nrow = n2, ncol = n2)
mat[mat < 0] <- 0
colnames(mat) <- letters[1:n2]
rownames(mat) <- letters[1:n2]
diag(mat) <- 1
Function 1: binding rows after converting graph to data.frame
trim_network_1 <- function(mat, avg_links = 5, tolerance = 0.01) {
mat <- (-1) * mat
g <- graph_from_adjacency_matrix(mat, weighted = TRUE, mode = "undirected", diag = FALSE)
g_mst <- mst(g, algorithm = "prim")
threshold <- 0
avg_links_n <- FALSE
while (avg_links_n == FALSE) {
if (threshold < 1) {
message(sprintf("%s threshold...", threshold))
g_not_in_mst <- delete.edges(g, which(abs(E(g)$weight) <= threshold))
g_not_in_mst <- graph.difference(g_not_in_mst, g_mst)
g <- rbind(
as_data_frame(g_mst),
as_data_frame(g_not_in_mst)
)
g <- graph_from_data_frame(g, directed = F)
avg_links_n <- ifelse(mean(degree(g)) <= avg_links, TRUE, FALSE)
threshold <- threshold + tolerance
if (avg_links_n == TRUE) {
message(sprintf("%s threshold achieves the avg number of connections", threshold))
E(g)$weight <- (-1) * E(g)$weight
return(g)
}
} else {
warning("no threshold achieves the avg number of connections\nreturning maximum spanning tree")
avg_links_n <- TRUE
E(g_mst)$weight <- (-1) * E(g_mst)$weight
return(g_mst)
}
}
}
Function 2: Combine weights after merging graphs
trim_network_2 <- function(mat, avg_links = 5, tolerance = 0.01) {
mat <- (-1) * mat
g <- graph_from_adjacency_matrix(mat, weighted = TRUE, mode = "undirected", diag = FALSE)
g_mst <- mst(g, algorithm = "prim")
threshold <- 0
avg_links_n <- FALSE
while (avg_links_n == FALSE) {
if (threshold < 1) {
message(sprintf("%s threshold...", threshold))
g_not_in_mst <- delete.edges(g, which(abs(E(g)$weight) <= threshold))
g_not_in_mst <- graph.difference(g_not_in_mst, g_mst)
g <- graph.union(g_mst, g_not_in_mst)
E(g)$weight <- pmin(E(g)$weight_1, E(g)$weight_2, na.rm = T)
g <- remove.edge.attribute(g, "weight_1")
g <- remove.edge.attribute(g, "weight_2")
avg_links_n <- ifelse(mean(degree(g)) <= avg_links, TRUE, FALSE)
threshold <- threshold + tolerance
if (avg_links_n == TRUE) {
message(sprintf("%s threshold achieves the avg number of connections", threshold))
E(g)$weight <- (-1) * E(g)$weight
return(g)
}
} else {
warning("no threshold achieves the avg number of connections\nreturning maximum spanning tree")
avg_links_n <- TRUE
E(g_mst)$weight <- (-1) * E(g_mst)$weight
return(g_mst)
}
}
}
Comparing the outputs
g1 <- trim_network_1(mat)
g2 <- trim_network_2(mat)
g1 <- as_data_frame(g1)
g2 <- as_data_frame(g2)
g1w <- order(g1$weight)
g1 <- g1[g1w, ]
g2w <- order(g2$weight)
g2 <- g2[g2w, ]
# this is symmetric so A-B is the same as B-A (undirected graph)
# all.equal(g1, g2) doesn't always hold because of symmetry
all.equal(g1$weight, g2$weight)
all.equal(nrow(g1), nrow(g2))
I tried graph.union(g_mst, g_not_in_mst), and even after doing E(g)$weight <- pmin(E(g)$weight_1, E(g)$weight_2, na.rm = T) I cannot find a formal way to prove that the two results are equal. The point of comparing the two results is that one function is more efficient than the other.
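One possible way to make the comparison independent of edge direction is to put both edge lists into a canonical form first. The sketch below works on plain data frames with from, to and weight columns, which is the shape as_data_frame returns for the edges; canon_edges, e1 and e2 are hypothetical names used only here.

```r
# Sort the two endpoints of every edge, then sort the rows, so that
# A-B and B-A end up as the same record in the same position.
canon_edges <- function(df) {
  f <- as.character(df$from)
  t <- as.character(df$to)
  out <- data.frame(
    from = ifelse(f <= t, f, t),
    to = ifelse(f <= t, t, f),
    weight = df$weight,
    stringsAsFactors = FALSE
  )
  out <- out[order(out$from, out$to), ]
  rownames(out) <- NULL
  out
}

# Same undirected graph, written with different endpoint and row order
e1 <- data.frame(from = c("a", "b"), to = c("b", "c"), weight = c(0.9, 0.4))
e2 <- data.frame(from = c("c", "b"), to = c("b", "a"), weight = c(0.4, 0.9))

isTRUE(all.equal(canon_edges(e1), canon_edges(e2)))
```

Comparing only the sorted weights, as in the code above, can report equality for two different graphs that happen to share the same weights, which this form avoids.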

Manual simulation of Markov Chain in R (3)

I have tried to improve my previous code so that I can incorporate conditional probability.
Source Code
states <- c(1, 2)
alpha <- c(1, 1)/2
mat <- matrix(c(0.5, 0.5,
0, 1), nrow = 2, ncol = 2, byrow = TRUE)
# this function calculates the next state, if present state is given.
# X = present states
# pMat = probability matrix
nextX <- function(X, pMat)
{
#set.seed(1)
probVec <- vector() # initialize vector
if(X == states[1]) # if the present state is 1
{
probVec <- pMat[1,] # take the 1st row
}
if(X==states[2]) # if the present state is 2
{
probVec <- pMat[2,] # take the 2nd row
}
return(sample(states, 1, replace=TRUE, prob=probVec)) # calculate the next state
}
# this function simulates n1 steps
steps <- function(alpha1, mat1, n1)
{
vec <- vector(mode="numeric", length = n1+1) # initialize an empty vector
X <- sample(states, 1, replace=TRUE, prob=alpha1) # initial state
vec[1] <- X
for (i in 2:(n1+1))
{
X <- nextX(X, mat1)
vec[i] <- X
}
return (vec)
}
# this function repeats the simulation n1 times.
# steps(alpha1=alpha, mat1=mat, n1=5)
simulate <- function(alpha1, mat1, n1)
{
mattt <- matrix(nrow=n1, ncol=6, byrow=T);
for (i in 1:(n1))
{
temp <- steps(alpha1, mat1, 5)
mattt[i,] <- temp
}
return (mattt)
}
Execution
I created this function so that it can handle any conditional probability:
prob <- function(simMat, fromStep, toStep, fromState, toState)
{
mean(simMat[toStep+1, simMat[fromStep+1, ]==fromState]==toState)
}
sim <- simulate(alpha, mat, 10)
p <- prob(sim, 0,1,1,1) # P(X1=1|X0=1)
p
Output
NaN
Why is this source code giving NaN?
How can I correct it?
I didn't inspect the rest of your code, but it seems that only prob has a mistake; you are mixing up rows with columns and instead it should be
prob <- function(simMat, fromStep, toStep, fromState, toState)
mean(simMat[simMat[, fromStep + 1] == fromState, toStep + 1] == toState)
Then NaN still remains a valid possibility for the following reason. We are looking at a conditional probability P(X1=1|X0=1) which, by definition, is well defined only when P(X0=1)>0. The same holds with sample estimates: if there are no cases where X0=1, then the "denominator" in the mean inside of prob is zero. Thus, it cannot and should not be fixed (i.e., returning 0 in those cases would be wrong).
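Both points can be seen on a small hand-made simulation matrix, where each row is one run and column s + 1 holds the state at step s. The matrix sim_small is made up for the illustration.

```r
prob <- function(simMat, fromStep, toStep, fromState, toState) {
  # rows (runs) that are in fromState at fromStep
  rows <- simMat[, fromStep + 1] == fromState
  mean(simMat[rows, toStep + 1] == toState)
}

sim_small <- rbind(
  c(1, 1, 2),
  c(1, 2, 2),
  c(2, 2, 2),
  c(1, 1, 1)
)

# Three runs start in state 1; two of them are still in state 1 at step 1
p <- prob(sim_small, 0, 1, 1, 1)
p

# No run starts in state 1 here, so the estimate is undefined
p_nan <- prob(sim_small[3, , drop = FALSE], 0, 1, 1, 1)
p_nan
```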

k-means clustering with constraint based on the node values

Maybe I'm missing something as this seems to be a simple problem, but I looked this up online and haven't found anything in the literature.
Basically what I need to do is to do a clustering of a set of destination cities based on their location (so latitude/longitude as features of each node, Euclidean distances for the similarity metric), with fixed number of clusters. All seems good and a k-means would do the trick. However, I have the following constraint for each cluster: Every city (node) has a corresponding value assigned to it, and the sum of these values in each cluster should not exceed a fixed threshold (same threshold for all clusters). Is there an easy way to do so?
You have 2 options:
- You could instead use rpart for the clustering, with weights and the minbucket option. However, the clusters that predict returns will be rectangular.
- You could have a look at the source code, which I found on
https://searchcode.com/codesearch/view/18689414/ :
kmeans <-
function(x, centers, iter.max = 10, nstart = 1,
algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen"))
{
do_one <- function(nmeth) {
Z <-
switch(nmeth,
{ # 1
Z <- .Fortran(C_kmns, x, m, p,
centers = centers,
as.integer(k), c1 = integer(m), integer(m),
nc = integer(k), double(k), double(k), integer(k),
double(m), integer(k), integer(k),
as.integer(iter.max), wss = double(k),
ifault = 0L)
switch(Z$ifault,
stop("empty cluster: try a better set of initial centers",
call.=FALSE),
warning(gettextf("did not converge in %d iterations",
iter.max), call.=FALSE, domain =NA),
stop("number of cluster centres must lie between 1 and nrow(x)",
call.=FALSE)
)
Z
},
{ # 2
Z <- .C(C_kmeans_Lloyd, x, m, p,
centers = centers, as.integer(k),
c1 = integer(m), iter = as.integer(iter.max),
nc = integer(k), wss = double(k))
if(Z$iter > iter.max)
warning("did not converge in ",
iter.max, " iterations", call.=FALSE)
if(any(Z$nc == 0))
warning("empty cluster: try a better set of initial centers", call.=FALSE)
Z
},
{ # 3
Z <- .C(C_kmeans_MacQueen, x, m, p,
centers = as.double(centers), as.integer(k),
c1 = integer(m), iter = as.integer(iter.max),
nc = integer(k), wss = double(k))
if(Z$iter > iter.max)
warning("did not converge in ",
iter.max, " iterations", call.=FALSE)
if(any(Z$nc == 0))
warning("empty cluster: try a better set of initial centers", call.=FALSE)
Z
})
Z
}
x <- as.matrix(x)
m <- as.integer(nrow(x))
if(is.na(m)) stop("invalid nrow(x)")
p <- as.integer(ncol(x))
if(is.na(p)) stop("invalid ncol(x)")
if(missing(centers))
stop("'centers' must be a number or a matrix")
nmeth <- switch(match.arg(algorithm),
"Hartigan-Wong" = 1,
"Lloyd" = 2, "Forgy" = 2,
"MacQueen" = 3)
if(length(centers) == 1L) {
if (centers == 1) nmeth <- 3
k <- centers
## we need to avoid duplicates here
if(nstart == 1)
centers <- x[sample.int(m, k), , drop = FALSE]
if(nstart >= 2 || any(duplicated(centers))) {
cn <- unique(x)
mm <- nrow(cn)
if(mm < k)
stop("more cluster centers than distinct data points.")
centers <- cn[sample.int(mm, k), , drop=FALSE]
}
} else {
centers <- as.matrix(centers)
if(any(duplicated(centers)))
stop("initial centers are not distinct")
cn <- NULL
k <- nrow(centers)
if(m < k)
stop("more cluster centers than data points")
}
if(iter.max < 1) stop("'iter.max' must be positive")
if(ncol(x) != ncol(centers))
stop("must have same number of columns in 'x' and 'centers'")
if(!is.double(x)) storage.mode(x) <- "double"
if(!is.double(centers)) storage.mode(centers) <- "double"
Z <- do_one(nmeth)
best <- sum(Z$wss)
if(nstart >= 2 && !is.null(cn))
for(i in 2:nstart) {
centers <- cn[sample.int(mm, k), , drop=FALSE]
ZZ <- do_one(nmeth)
if((z <- sum(ZZ$wss)) < best) {
Z <- ZZ
best <- z
}
}
centers <- matrix(Z$centers, k)
dimnames(centers) <- list(1L:k, dimnames(x)[[2L]])
cluster <- Z$c1
if(!is.null(rn <- rownames(x)))
names(cluster) <- rn
totss <- sum(scale(x, scale = FALSE)^2)
structure(list(cluster = cluster, centers = centers, totss = totss,
withinss = Z$wss, tot.withinss = best,
betweenss = totss - best, size = Z$nc),
class = "kmeans")
}
## modelled on print methods in the cluster package
print.kmeans <- function(x, ...)
{
cat("K-means clustering with ", length(x$size), " clusters of sizes ",
paste(x$size, collapse=", "), "\n", sep="")
cat("\nCluster means:\n")
print(x$centers, ...)
cat("\nClustering vector:\n")
print(x$cluster, ...)
cat("\nWithin cluster sum of squares by cluster:\n")
print(x$withinss, ...)
cat(sprintf(" (between_SS / total_SS = %5.1f %%)\n",
100 * x$betweenss/x$totss),
"Available components:\n", sep="\n")
print(names(x))
invisible(x)
}
fitted.kmeans <- function(object, method = c("centers", "classes"), ...)
{
method <- match.arg(method)
if (method == "centers") object$centers[object$cl, , drop=FALSE]
else object$cl
}
Please note that the code checks whether an improvement happened with these lines:
if((z <- sum(ZZ$wss)) < best) {
Z <- ZZ
best <- z
}
Here you can add your constraint.
You can use the same principle as KMeans. Start with step 1, then iterate steps 2-3 until convergence:
1. Assign cities to clusters (randomly)
2. Compute the centroids of the clusters
3. Assign points to centroids such that:
- the sum of distances from the points to their assigned centroids is minimized
- the threshold constraints are respected
In standard KMeans there are no constraints, so the assignment step (step 3) is performed easily by assigning each point to the closest centroid. Here you will have to solve an optimization problem in that step.
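As a rough illustration of the constrained assignment step, here is a greedy heuristic in R. It is not an exact solver for the optimization problem described above: each point simply goes to the nearest centroid that still has capacity, and a point stays NA if no cluster can take it. The function name assign_with_cap and its arguments are hypothetical.

```r
# pts: n x p matrix of coordinates, centers: k x p matrix of centroids,
# w: value attached to each point, cap: maximum total value per cluster
assign_with_cap <- function(pts, centers, w, cap) {
  k <- nrow(centers)
  # n x k matrix of Euclidean distances from every point to every centroid
  d <- matrix(unlist(lapply(seq_len(k), function(j) {
    sqrt(rowSums(sweep(pts, 2, centers[j, ])^2))
  })), nrow = nrow(pts))
  load <- numeric(k)
  cl <- rep(NA_integer_, nrow(pts))
  # handle the points closest to some centroid first
  for (i in order(apply(d, 1, min))) {
    for (j in order(d[i, ])) {
      if (load[j] + w[i] <= cap) {
        cl[i] <- j
        load[j] <- load[j] + w[i]
        break
      }
    }
  }
  cl
}

pts <- rbind(c(0, 0), c(0.1, 0), c(0.2, 0), c(5, 5))
centers <- rbind(c(0, 0), c(5, 5))
w <- c(1, 1, 1, 1)
cl <- assign_with_cap(pts, centers, w, cap = 2)
cl
tapply(w, cl, sum)  # total value per cluster stays within the cap
```

In this toy case the third point is pushed to the far cluster because the near one is full, which shows the trade-off the exact formulation would optimize.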
It is probably faster if you just model it as an integer programming problem. OR Tools has facilities for solving integer programming problems.
Here is a python implementation that does K-means clustering with different constraints, including one with a maximum on total weight of instances in a cluster.
