I am searching for an R package to calculate the adjusted mutual information (AMI) between two clusterings. So far I have only found some Python code via Google. Is there a built-in R package or function that compares clusterings via mutual information?
Here is a link: https://github.com/defleury/adjusted_mutual_information
According to its description, it "contains code for the fast & parallelized calculation of Adjusted Mutual Information (AMI), Normalized Mutual Information (NMI) and Adjusted Rand Index (ARI) between clusterings in R."
For small clusterings, here are three functions. f_rez() takes as input two vectors in which each entry gives the partition label of that element, and returns the NMI and AMI. It takes about 30 s for 3 pairs of clusterings of length N = 11,117, on a dual-core machine without parallelization.
f_nij <- function(v1, v2, l1, l2) {  # contingency table n(i,j) = t(i,j)
  m <- matrix(0, l1, l2)
  for (i in 1:length(v1)) {
    m[v1[i], v2[i]] <- m[v1[i], v2[i]] + 1
  }
  m
}
f_emi <- function(s1, s2, l1, l2, n) {  # expected mutual information
  s_emi <- 0
  for (i in 1:l1) {
    for (j in 1:l2) {
      min_nij <- max(1, s1[i] + s2[j] - n)
      max_nij <- min(s1[i], s2[j])
      n.ij <- seq(min_nij, max_nij)  # sequence of consecutive numbers
      t1 <- (n.ij / n) * log((n.ij * n) / (s1[i] * s2[j]))
      t2 <- exp(lfactorial(s1[i]) + lfactorial(s2[j]) + lfactorial(n - s1[i]) + lfactorial(n - s2[j])
                - lfactorial(n) - lfactorial(n.ij) - lfactorial(s1[i] - n.ij)
                - lfactorial(s2[j] - n.ij) - lfactorial(n - s1[i] - s2[j] + n.ij))
      emi <- sum(t1 * t2)
      s_emi <- s_emi + emi
    }
  }
  return(s_emi)
}
f_rez <- function(v1, v2) {
  library(infotheo)  # for mutinformation()
  s1 <- tabulate(v1)
  s2 <- tabulate(v2)
  l1 <- length(s1)
  l2 <- length(s2)
  N  <- length(v1)
  tij <- f_nij(v1, v2, l1, l2)   # contingency table n(i,j) = t(i,j); equivalent to table(v1, v2)
  mi  <- mutinformation(v1, v2)  # mutual information, from package infotheo
  h1  <- -sum(s1 * log(s1 / N)) / N
  h2  <- -sum(s2 * log(s2 / N)) / N
  nmi <- mi / max(h1, h2)          # NMI: normalized mutual information
  emi <- f_emi(s1, s2, l1, l2, N)  # EMI: expected mutual information
  ami <- (mi - emi) / max(h1, h2)  # AMI: adjusted mutual information
  return(c(nmi, ami))
}
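A minimal usage sketch with two toy partition vectors (the labels are made up, just to show the call; the infotheo package must be installed):
v1 <- c(1, 1, 2, 2, 3, 3)  # partition labels of 6 elements under clustering 1
v2 <- c(1, 1, 1, 2, 2, 2)  # partition labels of the same elements under clustering 2
f_rez(v1, v2)              # returns c(NMI, AMI)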
I found a MATLAB implementation of Adjusted Mutual Information (AMI) here. According to this thread on Stack Overflow, it is possible to translate a .m file into an .r file; while this isn't trivial, it is at least a way to get an AMI function for R.
The R package "clue" provides Normalized Mutual Information (NMI), a "less good" (not chance-corrected) variant of AMI. According to the paper "A Novel Approach for Automatic Number of Clusters Detection in Microarray Data based on Consensus Clustering" by Nguyen Xuan Vinh and Julien Epps, the Adjusted Rand Index (ARI) is a good substitute for AMI.
Luckily, several R packages implement ARI, such as RRand() in the phyclust package, randIndex() in the flexclust package, adjustedRandIndex() in the mclust package, and a.rand.index2() in the clustergas package.
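For example, with mclust (a quick sketch on made-up toy labels, assuming the package is installed):
library(mclust)
v1 <- c(1, 1, 2, 2, 3, 3)  # toy partition labels
v2 <- c(1, 1, 1, 2, 2, 2)
adjustedRandIndex(v1, v2)  # ARI between the two clusterings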
Hope this helps.
In R, I am generating uncorrelated values in the time domain with rnorm(). Then I apply fft() to these values; however, the mean amplitude of the resulting spectrum is only about 0.88 instead of 1. Is there anything I am not aware of?
Here is a MWE:
# dt <- 0.01 # time step
nSteps <- 100000 # Number of time steps
# df <- 1/(nSteps*dt) # frequency resolution
# t <- 0:(nSteps-1)*dt #
y <- rnorm(nSteps, mean=0, sd=1) # generate uncorrelated data. Should result in a white noise spectrum with sd=1
y_sq_sum <- sum(y^2)
# We ignore cutting to the Nyquist frequency.
# f <- 0:(nSteps-1)*df
fft_y <- abs(fft(y))/sqrt(length(y))
fft_y_sq_sum <- sum(fft_y^2)
print(paste("Check for Parseval's theorem: y_sq_sum = ", y_sq_sum, "; fft_y_sq_sum = ", fft_y_sq_sum, sep=""))
print(paste("Mean amplitude of my fft spectrum: ", mean(fft_y)))
print(paste("The above is typically around 0.88, why is it not 1?"))
This question arguably belongs on Cross Validated rather than Stack Overflow, but here's an answer anyway:
Parseval's theorem says that the mean of fft_y^2 should be 1. The square root function is a concave function, so Jensen's inequality says the mean of sqrt(fft_y^2) will be less than 1. Since fft_y is positive in your definition, fft_y = sqrt(fft_y^2).
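To see this numerically, here is a minimal sketch mirroring the variable names in the question (for Gaussian white noise the FFT magnitudes are approximately Rayleigh-distributed, which puts the mean amplitude near sqrt(pi)/2 ≈ 0.886 rather than 1):
set.seed(1)
y <- rnorm(100000)
fft_y <- abs(fft(y)) / sqrt(length(y))
mean(fft_y^2)  # ~1, as Parseval's theorem requires
mean(fft_y)    # ~0.886, below sqrt(mean(fft_y^2)) by Jensen's inequality
sqrt(pi) / 2   # 0.8862..., the approximate expected mean amplitude here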
I am trying to compute the Jaccard and NMI indexes in R, using the NMI package and the cluster package. The thing is, I am getting what look like wrong index values: the primary data contains 3 clusters, whereas the two test clusterings have 5 and 2 clusters respectively. Is there any way that NMI and Jaccard indexes can be correctly calculated for data like this? The following is the code I am presently using:
clarax <- clara(a, 3)
clV1 <- clarax$clustering
clara1 <- clara(e, 5)
clV2 <- clara1$clustering
ci <- cindex(clV1=clV1, clV2=clV2, self=FALSE, minSZ=1, method="jaccard")
ci[2:3]  # returns Jaccard index and variables used to compute it
ci$intersects
a4 <- ci$Jaccard_Index
sprintf("%.16f", a4)

clarax <- clara(a, 3)
clV1 <- clarax$clustering
clara1 <- clara(f, 2)
clV2 <- clara1$clustering
ci <- cindex(clV1=clV1, clV2=clV2, self=TRUE, minSZ=1, method="jaccard")
ci[2:3]  # returns Jaccard index and variables used to compute it
ci$intersects
a5 <- ci$Jaccard_Index
sprintf("%.16f", a5)
And for NMI:
b4<-NMI(groundtruth,clustering_4)
sprintf("%.16f", b4)
Here are the links to the datasets:
http://www.mediafire.com/file/brar5msd12j552z/clustering_4.txt
http://www.mediafire.com/file/xd12xkd0h4jx6my/clustering_5.txt
http://www.mediafire.com/file/bkwat6edkcyp4wj/partitions.txt
The partitions dataset is the primary (ground-truth) dataset, and the indexes have to be computed against its values.
Please help me out.
Regards
I don't really see the problem here: NMI doesn't require the two clusterings to have the same number of clusters. Using the NMI package (since it isn't entirely clear from your question which one you used), it's as simple as this:
library(NMI)
NMI(partitions, clustering4)
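If the linked files still need to be read in, something like the following should work. This is only a sketch: it assumes each file has two columns, an element id and a cluster label, which is the two-column layout NMI() takes.
library(NMI)
partitions  <- read.table("partitions.txt")    # ground truth: id, cluster label (assumed layout)
clustering4 <- read.table("clustering_4.txt")  # test clustering: id, cluster label (assumed layout)
NMI(partitions, clustering4)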
: ) I previously wrote an R function, "LeastSquaresDegreeN.R", that computes a least-squares polynomial of arbitrary order to fit whatever data I put into it. The code works, in the sense that I can reproduce results I got previously. However, when I try to put new data into it I get a "non-conformable arguments" error:
"Error in Conj(t(Q))%*%t(b) : non-conformable arguments"
An extremely simple example of data that should work:
t <- seq(1,100,1)
fifthDegree <- t^5
LeastSquaresDegreeN(t,fifthDegree,5)
This should output and plot a polynomial f(t) = t^5 (up to rounding errors).
However, I get the "non-conformable arguments" error even if I explicitly coerce these to vectors:
t <- as.vector(t)
fifthDegree <- as.vector(fifthDegree)
LeastSquaresDegreeN(t,fifthDegree,5)
I've tried putting in the transpose of these vectors too - but nothing works.
Surely the solution is really simple. Help!? Thank you!
Here's the function:
LeastSquaresDegreeN <- function(t, b, deg)
{
  # Usage: t is the independent variable vector, b is the function data,
  # i.e., b = f(t); deg is the desired polynomial order.
  # deg <- deg + 1 is a little adjustment to make the R loops index correctly.
  deg <- deg + 1
  t <- t(t)
  dataSize <- length(b)
  A <- mat.or.vec(dataSize, deg)  # built-in R function to create a zero
                                  # matrix or zero vector of arbitrary size

  # Given basis phi(z) = 1 + z + z^2 + z^3 + ...
  # Define matrix A
  for (i in 0:deg-1) {
    A[1:dataSize, i+1] = t^i
  }

  # Compute QR decomposition of A. Pull Q and R out of QRdecomp
  QRdecomp <- qr(A)
  Q <- qr.Q(QRdecomp, complete=TRUE)
  R <- qr.R(QRdecomp, complete=TRUE)

  # Perform Q^* b^T (conjugate transpose of Q)
  c <- Conj(t(Q)) %*% t(b)

  # Find x. R isn't square - so we have to use qr.solve
  x <- qr.solve(R, c)

  # Create xPlot (which is general enough to plot any degree
  # polynomial output)
  xPlot = x[1,1]
  for (i in 1:deg-1) {
    xPlot = xPlot + x[i+1,1]*t^i
  }

  # Now plot it. Least squares "l" plot first, then the points in red.
  plot(t, xPlot, type='l', xlab="independent variable t", ylab="function values f(t)",
       main="Data Plotted with Nth Degree Least Squares Polynomial", col="blue")
  points(t, b, col="red")
} # End
Disclaimer
Hello everyone! I recently started programming in R. My code works just fine, but in terms of speed some of it takes far too long to be of practical use. I hope someone can help me make this code run faster, either by optimising the code or by using one of the multicore packages.
About the code
I have large datasets, each containing about 15000 numeric values. The code takes two parameters (p, n), where p >= n, and makes subsets of the data. It applies the zyp.yuepilon function (from the zyp package) to each row of the subsets. The parameter n is then used to apply the same function to an n-sized subset of each row.
The problem is that I run this code in a nested for loop (p in 10:40, n in 10:40), so it takes an eternity to get the results, and this is just one dataset among many others.
sp <- function(p, n){
  library(zyp)
  data <- runif(15000, 1, 4)
  lower <- seq(80 - p + 1, by=1, length.out=length(data)-81)
  upper <- lower + p - 1
  subsets <- matrix(nrow=length(lower), ncol=p)
  for(j in 1:length(lower)){
    subsets[j, ] <- data[lower[j]:upper[j]]
  }
  ret <- apply(subsets, 1, zyp.yuepilon)
  subset_n <- subsets[, 1:n]
  ret2 <- apply(subset_n, 1, zyp.yuepilon)
  return(list(ret, ret2))
}
Benchmark results in seconds:
     expr      min       lq   median       uq      max neval
 sp(7, 6) 92.77266 94.24901 94.53346 95.10363 95.64914    10
Here is a series of comments, rather than an answer.
Looking at the zyp.yuepilon function body (print it by typing the function name without parentheses in an R session), you can see that this function, and the function zyp.sen, are written in plain R code (as opposed to compiled code).
The biggest speed-up is likely attained by using the Rcpp package, which facilitates calling compiled C++ code from within R. In fact, there is a small linear model example here: Fast LM model using Rcpp/RcppArmadillo.
I would be inclined to rewrite the two functions zyp.yuepilon and zyp.sen in C++ using Rcpp, including the loop over subset vectors (which you currently handle with apply).
For general R speed-up issues, see this question on R loop performance, as well as the R package plyr, which may provide an entry point for taking a map-reduce type of approach to your problem.
If you want to steer clear of C++, a series of micro-optimisations would be your quickest win. To speed up the apply part of your code, you could use something like this:
library(doParallel)
library(parallel)
library(foreach)
library(zyp)

cl <- makeCluster(4)
registerDoParallel(cl)

sp_1 <- function(p=7, n=6){
  N_ob <- 15000
  off_set <- 81
  N_ob_o <- N_ob - off_set
  am <- matrix(runif(N_ob*p), ncol=p)
  subsets <- am[-(1:off_set), ]
  ret <- matrix(unlist(foreach(i=1:N_ob_o) %dopar% zyp::zyp.yuepilon(subsets[i, ]),
                       use.names=FALSE), ncol=11, byrow=TRUE)
  subset_n <- subsets[, 1:n]
  ret2 <- matrix(unlist(foreach(i=1:N_ob_o) %dopar% zyp::zyp.yuepilon(subset_n[i, ]),
                        use.names=FALSE), nrow=11)
  return(list(ret, ret2))
}
sp <- function(p=7, n=6){
  data <- runif(15000, 1, 4)
  lower <- seq(80 - p + 1, by=1, length.out=length(data)-81)
  upper <- lower + p - 1
  subsets <- matrix(nrow=length(lower), ncol=p)
  for(j in 1:length(lower)){
    subsets[j, ] <- data[lower[j]:upper[j]]
  }
  ret <- apply(subsets, 1, zyp.yuepilon)
  subset_n <- subsets[, 1:n]
  ret2 <- apply(subset_n, 1, zyp.yuepilon)
  return(list(ret, ret2))
}
system.time(sp_1())
system.time(sp())
This gives me a speed-up of around a factor of 2, but the gain will depend on your platform, etc. Check the help files for the functions and packages above, and tune the number of workers passed to makeCluster to see what works best for your set-up (in the absence of any information about your particular machine).
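For instance, a common starting point (just a sketch) is to size the cluster from the detected core count:
library(parallel)
library(doParallel)
cl <- makeCluster(max(1, detectCores() - 1))  # leave one core free for the rest of the system
registerDoParallel(cl)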
Another route might be to make use of the byte-code compiler via library(compiler) to see if the various functions can be optimised that way.
library(compiler)
enableJIT(3)
zyp_comp <- cmpfun(zyp.yuepilon)
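A quick way to check whether the byte-compiled version buys you anything (a sketch; the test vector below is arbitrary):
y <- runif(30, 1, 4)                           # arbitrary test series
system.time(for (k in 1:100) zyp.yuepilon(y))  # plain version
system.time(for (k in 1:100) zyp_comp(y))      # byte-compiled version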
I'm developing automatic forecasting software with Java & R. The following steps are used in R to forecast the next 18 values:
trends <- scan("c:/data_for_R/trends.dat")            # step 1
auto.arima(trends)                                    # step 2: suggests the arima(p,d,q) order
trendsarima <- arima(trends, order=c(2,1,3))          # step 3: (2,1,3) was found in step 2
trendsforecasts <- forecast.Arima(trendsarima, h=18)  # step 4
trendsforecasts                                       # step 5
plot.forecast(trendsforecasts)                        # step 6
All I want to know is: how do you integrate steps #2 and #3 (preferably into a single command)?
trendsarima <- auto.arima(trends)
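The rest of the pipeline stays the same, e.g. (a sketch, using the forecast package that provides auto.arima):
library(forecast)
trends <- scan("c:/data_for_R/trends.dat")
trendsarima <- auto.arima(trends)               # replaces steps #2 and #3
trendsforecasts <- forecast(trendsarima, h=18)  # step #4
trendsforecasts
plot(trendsforecasts)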