I have a correlation matrix in R and I want to find the groups of elements that correlate with each other at more than 95% (and put those groups into vectors).
X <- matrix(0,3,5)
X[,1] <- c(1,2,3)
X[,2] <- c(1,2.2,3)*2
X[,3] <- c(1,2,3.3)*3
X[,4] <- c(6,5,1)
X[,5] <- c(6.1,5,1.2)*4
cor.matrix <- cor(X)
cor.matrix <- cor.matrix*lower.tri(cor.matrix)
cor.vector <- which(cor.matrix>0.95, arr.ind=TRUE)
cor.vector then contains:
row col
[1,] 2 1
[2,] 3 1
[3,] 3 2
[4,] 5 4
That means, as expected, that vectors 1, 2 and 3 correlate with each other, and so do 4 and 5.
What I would need is to get two vectors c(1,2,3) and c(4,5) as the final result.
This is a simple example; the matrices I am actually processing are large, though.
Here's an approach using the igraph package:
require(igraph)
g <- graph.data.frame(cor.vector, directed = FALSE)
split(unique(as.vector(cor.vector)), clusters(g)$membership)
# $`1`
# [1] 2 3 1
# $`2`
# [1] 5 4
What this essentially does is find the clusters in the graph g (its disconnected components). Since the vertices are added to the graph in the order they appear in your cor.vector, the cluster membership comes back in that same order. That is, for vertices c(2,3,5,1,4) the clusters are c(1,1,2,1,2), giving two clusters in total (cluster 1 and cluster 2). We then simply split the vertex set by cluster group.
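If you want to see exactly what the split is based on, you can inspect the vertex names and the membership vector directly (a small sketch reusing the graph g built above):
# vertex names are taken from cor.vector in order of appearance,
# and clusters() returns one membership id per vertex in that same order
V(g)$name
clusters(g)$membership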
I have obtained an mppm object by fitting a model on several independent datasets using the mppm function from the R package spatstat. How can I generate simulated realisations of this model and obtain the x, y, and marks attributes of the simulations?
I fitted my model as such:
data <- listof(NMJ1,NMJ2,NMJ3)
data <- hyperframe(X=1:3, Points=data)
model <- mppm(Points ~marks*sqrt(x^2+y^2), data)
where NMJ1, NMJ2, and NMJ3 are marked ppp and are independent realisations of the same experiment.
sim <- simulate(model) allows me to generate simulated realisations of this model, and plot(sim, axes = TRUE) to plot them. sim itself is a hyperframe object:
> sim
Hyperframe:
Sim1
1 (ppp)
2 (ppp)
3 (ppp)
How can I access the values (x, y, and marks) in this hyperframe? My goal is to generate a large number of independent realisations from my model and to use the simulated values for another task. Is there a practical way to obtain, retrieve and save these values?
Since you say you want to simulate this many times, I have shown the code here with two simulations (rather than the single one in the question):
library(spatstat)
data <- list(amacrine, amacrine, amacrine)
data <- hyperframe(X=1:3, Points=data)
model <- mppm(Points ~marks*sqrt(x^2+y^2), data)
sim <- simulate(model, nsim = 2)
#> Generating simulated realisations of 3 models..
#> 1, 2, 3.
Now sim is a hyperframe with 2 columns (one for each simulation). Each
column is a list of 3 point patterns. To get the three sets of coordinates
and marks for the first simulation use as.data.frame on each point pattern:
co1 <- lapply(sim$Sim1, as.data.frame)
Then co1 is a list of length 3, and we can print out the first few
coordinates with the head() command, e.g. the coordinates of the third
point pattern:
head(co1[[3]])
#> x y marks
#> 1 0.4942587 0.7889985 off
#> 2 0.6987270 0.7637359 on
#> 3 0.3926415 0.6819965 on
#> 4 0.7982686 0.9060733 off
#> 5 1.3507722 0.9731363 on
#> 6 0.6450985 0.6924126 on
We can extract the coordinates and marks for each simulation by another lapply that
runs over every simulation (in this case 2):
co <- lapply(sim, function(x) lapply(x, as.data.frame))
Now co is a list with 2 elements, and each element is a list of 3 sets of
coordinates:
length(co)
#> [1] 2
length(co[[2]])
#> [1] 3
head(co[[2]][[3]])
#> x y marks
#> 1 0.1660580 0.04180501 on
#> 2 0.7840025 0.71727782 on
#> 3 1.2011733 0.17109112 on
#> 4 1.0429867 0.49284639 on
#> 5 1.1411869 0.86711072 off
#> 6 1.0375942 0.06427601 on
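If you want to keep the simulated values for later use, one option (a sketch, assuming R's native serialisation is acceptable; the file name and the object co1_all are just illustrative) is to flatten a simulation into a single data frame and/or save the whole list to disk:
# bind the 3 patterns of the first simulation into one data frame,
# with a column recording which pattern each point came from
co1_all <- do.call(rbind, Map(cbind, pattern = seq_along(co1), co1))
saveRDS(co, file = "simulated_points.rds")  # save every simulation at once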
How do I determine the optimal number of clusters when using hierarchical clustering? I only have the distance matrix, because I am measuring pairwise (Levenshtein) distances. I looked at other posts, but they all use k-means or hierarchical clustering on numeric data, not on string data like that shown below. Any suggestions on how to use R to find the number of clusters?
set.seed(1)
rstr <- function(n, k) {  # vector of n random k-character strings
  sapply(1:n, function(i) do.call(paste0, as.list(sample(letters, k, replace = TRUE))))
}
str <- c(paste0("aa", rstr(10, 3)), paste0("bb", rstr(10, 3)), paste0("cc", rstr(10, 3)))
# Levenshtein Distance
d <- adist(str)
rownames(d) <- str
hc <- hclust(as.dist(d))
Several statistics can be used.
Look, for example, at the WeightedCluster package, which can compute and plot a series of such statistics.
To illustrate, you get the optimal number of groups for each available statistic as follows:
library("WeightedCluster")
hcRange <- as.clustrange(hc, diss=as.dist(d), ncluster=6)
summary(hcRange)
## 1. N groups 1. stat
## PBC 3 0.8799136
## HG 3 1.0000000
## HGSD 3 0.9987651
## ASW 3 0.4136550
## ASWw 3 0.4722895
## CH 3 8.3605263
## R2 6 0.4734561
## CHsq 3 20.6538462
## R2sq 6 0.6735039
## HC 3 0.0000000
You can also plot the statistics for all the computed solutions (here we show the average silhouette width ASWw, Huber's Gamma HG, and the point-biserial correlation PBC):
plot(hcRange, stat = c("ASWw", "HG", "PBC"), lwd = 2)
The best solution seems to be the three-group one.
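Once you have settled on three groups, a minimal sketch for retrieving the actual group membership of each string (using base R's cutree on the hc object from the question):
groups <- cutree(hc, k = 3)  # cut the dendrogram into 3 groups
table(groups)                # group sizes
split(str, groups)           # the strings in each group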
I want to create a predict function that predicts which cluster an observation belongs to.
data(iris)
mydata=iris
m=mydata[1:4]
train=head(m,100)
xNew=head(m,10)
rownames(train)<-1:nrow(train)
norm_eucl=function(train)
train/apply(train,1,function(x)sum(x^2)^.5)
m_norm=norm_eucl(train)
result=kmeans(m_norm,3,30)
predict.kmean <- function(cluster, newdata)
{
simMat <- m_norm(rbind(cluster, newdata),
sel=(1:nrow(newdata)) + nrow(cluster))[1:nrow(cluster), ]
unname(apply(simMat, 2, which.max))
}
## assign new data samples to exemplars
predict.kmean(m_norm, x[result$cluster, ], xNew)
After that I get the error:
Error in predict.kmean(m_norm, x[result$cluster, ], xNew) :
unused argument (xNew)
I understand that I am writing the function incorrectly, since I'm just learning how to do this, but I can't see where exactly the problem is.
In fact, I want to adapt a similar function from apcluster (I had seen a similar topic, but for apcluster):
predict.apcluster <- function(s, exemplars, newdata)
{
simMat <- s(rbind(exemplars, newdata),
sel=(1:nrow(newdata)) + nrow(exemplars))[1:nrow(exemplars), ]
unname(apply(simMat, 2, which.max))
}
## assign new data samples to exemplars
predict.apcluster(negDistMat(r=2), x[apres@exemplars, ], xNew)
How can I do it?
Rather than trying to replicate something, let's come up with our own function. For a given vector x, we want to assign a cluster using some prior k-means output. Given how the k-means algorithm works, what we want is to find which cluster's center is closest to x. That can be done as
predict.kmeans <- function(x, newdata)
apply(newdata, 1, function(r) which.min(colSums((t(x$centers) - r)^2)))
That is, we go over newdata row by row and compute the corresponding row's distance to each of the centers and find the minimal one. Then, e.g.,
head(predict(result, train / sqrt(rowSums(train^2))), 3)
# 1 2 3
# 2 2 2
all.equal(predict(result, train / sqrt(rowSums(train^2))), result$cluster)
# [1] TRUE
which confirms that our predicting function assigned all the same clusters to the training observations. Then also
predict(result, xNew / sqrt(rowSums(xNew^2)))
# 1 2 3 4 5 6 7 8 9 10
# 2 2 2 2 2 2 2 2 2 2
Notice also that I'm simply calling predict rather than predict.kmeans. That is because result is of class kmeans, so the right method is chosen automatically. Also notice how I normalize the data in a vectorized manner, without using apply.
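If you normalise repeatedly, you could wrap that one-liner in a small helper (the name norm_rows is my own, not from any package) and reuse it on any new data before predicting:
# scale every row to unit Euclidean length, fully vectorized
norm_rows <- function(m) m / sqrt(rowSums(m^2))
predict(result, norm_rows(xNew))  # same result as the call above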
Let's assume I have done several operations and created the cluster vectors of correlation values shown below:
D <- matrix(rexp(10*10,rate=.1), ncol=10) #create a randomly filled 10x10 matrix
C <- matrix(rexp(10*10,rate=.1),ncol=10)
DCor <- cor(D) # generate correlation matrix
CCor <- cor(C)
DUpper<- DCor[upper.tri(DCor)] # extract upper triangle
CUpper<- CCor[upper.tri(CCor)]
ClusterD <- kmeans(DUpper,3) # cluster correlations
ClusterC <- kmeans(CUpper,3)
ClusterC <- cbind(c(1:45),matrix(ClusterC$cluster)) # add row numbers as column
ClusterD <- cbind(c(1:45),matrix(ClusterD$cluster))
I would like to generate a matrix that shows the intersection of each cluster group. In such a matrix a cell value of 5, for example, would mean that 5 rows belong to both group C1 and group D2.
How can I generate a matrix like this?
Before the cbind lines, you could do:
table(ClusterC$cluster, ClusterD$cluster)
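A minimal sketch of how that fits into the code above, run right after the two kmeans calls (the name intersection is just illustrative):
# cell [i, j] counts the correlations falling in cluster i of C and cluster j of D
intersection <- table(C = ClusterC$cluster, D = ClusterD$cluster)
intersection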
I need to write a program that does the following in R:
I have a data set (42 rows, 2 columns) of y variables and x variables.
I want to randomly select 12 rows from this matrix and record the coefficients (slope and intercept) of a linear regression fitted to those rows. I would also like to write a loop so I can repeat this 1000 times and end up with a 1000 x 2 matrix filled with the slopes and intercepts of the 1000 randomly selected sets of 12 rows from my data set.
I am able to get this far, but I do not know how to incorporate a loop into the code or how to store the coefficients in a matrix.
#Box.Z and Box.DC.gm are columns of data used to generate my initial matrix of data
A <- matrix(c(Box.Z, Box.DC.gm), nrow=42)
B <- A[sample(42, 12), ]
C <- lm(B[,2] ~ B[,1])
D <- matrix(c(coefficients(C)), ncol =2)
Something like this maybe:
# set.seed(23)  # uncomment for reproducible samples
A <- matrix(runif(84), ncol = 2)  # stand-in for the 42 x 2 data matrix
randco <- function(A) {
  B <- A[sample(42, 12), ]           # draw 12 of the 42 rows
  lm(B[, 2] ~ B[, 1])$coefficients   # intercept and slope of the fit
}
t(replicate(10,randco(A)))
# (Intercept) B[, 1]
# [1,] 0.6018459 -0.1643174222
# [2,] 0.4411607 0.0005322798
#...
# [9,] 0.3201649 0.4848679516
#[10,] 0.5413830 0.1850853748
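To get the full 1000 x 2 matrix of coefficients asked for in the question, the same helper can simply be replicated 1000 times (a sketch; the column names come from lm):
coefs <- t(replicate(1000, randco(A)))  # one (intercept, slope) pair per row
dim(coefs)                              # 1000 rows, 2 columns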