I'm trying to perform k-means on a data frame with 69 columns and 1000 rows. First, I need to decide on the optimal number of clusters using the Davies-Bouldin index. This algorithm requires the input to be a matrix, so I used this code first:
totalm <- data.matrix(total)
Followed by this code (Davies-Bouldin index):
clusternumber <- 0
max_cluster_number <- 30
# Davies-Bouldin algorithm
library(clusterCrit)
smallest <- 99999
for(b in 2:max_cluster_number){
  a <- 99999
  for(i in 1:200){
    cl <- kmeans(totalm, b)
    cl <- as.numeric(cl)
    intCriteria(totalm, cl$cluster, c("dav"))
    if(intCriteria(totalm, cl$cluster, c("dav"))$davies_bouldin < a){
      a <- intCriteria(totalm, cl$cluster, c("dav"))$davies_bouldin
    }
  }
  if(a < smallest){
    smallest <- a
    clusternumber <- b
  }
}
print("##clusternumber##")
print(clusternumber)
print("##smallest##")
print(smallest)
I keep getting this error: (list) object cannot be coerced to type 'double'.
How can I solve this?
Reproducible example:
a <- c(0,0,1,0,1,0,0)
b <- c(0,0,1,0,0,0,0)
c <- c(1,1,0,0,0,0,1)
d <- c(1,1,0,0,0,0,0)
total <- cbind(a,b,c,d)
The error is coming from cl <- as.numeric(cl). The result of a call to kmeans is an object of class "kmeans", which is a list containing various pieces of information about the model, so it cannot be coerced to a numeric vector.
Run ?kmeans
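For illustration, here is what kmeans() returns on a small toy matrix (a quick sketch, not your totalm):
cl <- kmeans(matrix(rnorm(40), ncol = 2), centers = 2)   # toy data, 2 clusters
str(cl, max.level = 1)   # a list: cluster, centers, totss, withinss, ...
head(cl$cluster)         # the cluster assignments you want to pass to intCriteria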
I would also recommend adding nstart = 20 to your kmeans call. k-means clustering is a random process; this will run the algorithm 20 times for each number of centers and keep the best fit.
for(b in 2:max_cluster_number){
  a <- 99999
  for(i in 1:200){
    cl <- kmeans(totalm, centers = b, nstart = 20)
    # cl <- as.numeric(cl)   # removed: this line caused the error
    db <- intCriteria(totalm, cl$cluster, c("dav"))$davies_bouldin
    if(db < a){
      a <- db
    }
  }
  if(a < smallest){
    smallest <- a
    clusternumber <- b
  }
}
This gave me
[1] "##clusternumber##"
[1] 4
[1] "##smallest##"
[1] 0.138675
(temporarily changing max_cluster_number to 4, since the reproducible data set is small)
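As a side note, the same search can be written more compactly; this is only a sketch, equivalent to the loop above:
# for each candidate number of clusters, keep the best Davies-Bouldin value
# over 200 restarts, then pick the cluster number with the smallest value
db_per_k <- sapply(2:max_cluster_number, function(b) {
  min(replicate(200, {
    cl <- kmeans(totalm, centers = b, nstart = 20)
    intCriteria(totalm, cl$cluster, c("dav"))$davies_bouldin
  }))
})
clusternumber <- (2:max_cluster_number)[which.min(db_per_k)]
smallest <- min(db_per_k)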
EDIT: Integer error
I was able to reproduce your error using
a <- as.integer(c(0,0,1,0,1,0,0))
b <- as.integer(c(0,0,1,0,0,0,0))
c <- as.integer(c(1,1,0,0,0,0,1))
d <- as.integer(c(1,1,0,0,0,0,0))
totalm <- cbind(a,b,c,d)
So that an integer matrix is created.
I was then able to remove the error by using
storage.mode(totalm) <- "double"
Note that
total <- cbind(a,b,c,d)
totalm <- data.matrix(total)
is unnecessary for the data in this example
> identical(total,totalm)
[1] TRUE
I am trying to measure the phylogenetic signal in two variables, a discrete one and a continuous one. To do so, I use the δ-statistic (Borges et al. 2018) and the K-statistic (Blomberg 2003), respectively. I have a tree and two vectors corresponding to my variables. The lines of code for these two statistics are the following:
1) delta(vector, tree, lambda0, se, sim, thin, burn)
2) phylosig(tree, vector, method = "K")
I get a single value each time, but I would like to randomize my vectors to test the significance of the original values. I would like to do 1000 repetitions and then carry out a simple significance test, but as I am a new R user, I have no idea how to do it. I am thinking of something like this:
For δ:
# first repetition
random_vector <- sample(vector)
random_delta <- delta(random_vector, tree, lambda0, se, sim, thin, burn)
write.xlsx(random_delta, path)
# second repetition
random_vector <- sample(random_vector)
random_delta <- delta(random_vector, tree, lambda0, se, sim, thin, burn)
write.xlsx(random_delta, path, append = TRUE)
And so on, until 1000 δ-statistics are stored in a single .xlsx file, ready to be used in a test.
For K, I guess it is a bit different since it is not a vector anymore but a table with two columns (species, values):
# first repetition
random_vector <- sample(vector)
names(random_vector) <- tree$tip.label
random_K <- phylosig(tree, random_vector, method = "K")
write.xlsx(random_K, path)
# second repetition
random_vector <- sample(random_vector)
names(random_vector) <- tree$tip.label
random_K <- phylosig(tree, random_vector, method = "K")
write.xlsx(random_K, path, append = TRUE)
Etc.
That is what I thought of, but maybe someone has a better idea. Either way, I am open to suggestions.
I hope I have made myself clear.
EDIT:
Thank you for your answers. Yes, a loop is exactly what I need. And yes, writing all the values into one vector seems more appropriate, I agree with you.
With the δ-statistic, the higher δ is, the stronger the phylogenetic signal in the data. But what counts as high? That is why I want to do 1000 iterations: to calculate a p-value and show whether the original value is 'exceptional' or not. The same goes for K, which lies between 0 and 1 in the presence of a phylogenetic signal.
Here is a more explicit example:
> library(phytools)
> trait_delta <- c(2,1,3,1,1,3,1,3,2,1,1,2,2,2,2,1,1,3,1,1)
> trait_K <- c(2,1,3,1,1,3,1,3,2,1,1,2,2,2,2,1,1,3,1,1)
> set.seed(25)
> ns <- 20
> tree <- rtree(ns)
> plot(tree)
>
> #delta
> lambda0 <- 0.1
> se <- 0.5
> sim <- 10000
> thin <- 10
> burn <- 100
>
> delta <- delta(trait_delta,tree,lambda0,se,sim,thin,burn)
> rand_values_delta <- c(print(delta))
>
> #to loop 999 times
> rand_trait_delta <- sample(trait_delta)
> rand_delta <- delta(rand_trait_delta,tree,lambda0,se,sim,thin,burn)
> rand_values_delta <- append(rand_values_delta, print(rand_delta), after = length(rand_values_delta+1))
>
>
> #K
> names(trait_K) <- tree$tip.label
> K <- phylosig(tree, trait_K, method = "K")
> rand_values_K <- c(K)
>
> #to loop 999 times
> rand_trait_K <- sample(trait_K)
> names(rand_trait_K) <- tree$tip.label
> rand_K <- phylosig(tree, rand_trait_K, method = "K")
> rand_values_K <- append(rand_values_K, rand_K, after = length(rand_values_K+1))
I'm still a bit hazy on exactly what you are doing, but hopefully this points you in the right direction:
n <- 10000               # run this many iterations
results <- rep(NA, n)    # create an empty vector to store all the values
for (i in 1:n){
  # get the random vector for this iteration
  random_vector <- sample(vector)
  # save the value computed from the randomized vector
  results[i] <- delta(random_vector, tree, lambda0, se, sim, thin, burn)
}
# write the single vector of results to file
# (write.xlsx() comes from a package such as openxlsx or xlsx)
write.xlsx(results, path)
This is untested, as I don't know what vector or path are; I just reused them from your code.
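If you then want the significance test you describe, one possible follow-up (a sketch reusing results, vector and the other parameters above; untested for the same reason) is to compare the randomized values with the observed one:
# empirical p-value: how often is a randomized delta at least as large as the
# observed delta? (the +1 keeps the p-value from being exactly zero)
obs_delta <- delta(vector, tree, lambda0, se, sim, thin, burn)
p_value <- (sum(results >= obs_delta) + 1) / (length(results) + 1)
p_value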
I want to perform an IDW cross-validation and find out which power value gives the smallest RMSE. To do this, I want to store the power and RMSE values in a list and sort them by the smallest RMSE.
I'd like something like this:
RMSE Power
[1,] 1.230 2.5
[2,] 1.464 1.5
[3,] 1.698 2.0
[4,] 1.932 3.0
What I have so far is this:
require(sp)
require(gstat)
data("meuse")
#### create grid:
pixels <- 500 #define resolution
#define extent
raster.grd <- expand.grid(x=seq(floor(min(x=meuse$x)),
ceiling(max(x=meuse$x)),
length.out=pixels),
y=seq(floor(min(y=meuse$y)),
ceiling(max(y=meuse$y)),
length.out=pixels))
# convert the dataframe to a spatial points and then to a spatial pixels
grd.pts <- SpatialPixels(SpatialPoints((raster.grd)))
grd <- as(grd.pts, "SpatialGrid")
gridded(grd) = TRUE
#### perform IDW and loop through different power-values
power = seq(from = 1.5, to = 3, by = 0.5)
results=list()
results.cv=list()
for(i in power) {
  results[[paste0(i,"P")]] <- gstat::idw(meuse$zinc ~ 1, meuse, grd, idp = i)
  results.cv[[paste0(i,"P")]] <- krige.cv(zinc ~ 1, meuse, nfold = nrow(meuse), set = list(idp = i))
}
Now my attempt to calculate and store the RMSE with a for-loop:
results_rmse <- list()
pwr <- names(results.cv)
for(i in results.cv){          # for each element (1.5P, 2P, etc.) in results.cv
  for(j in 1:length(pwr)){     # for each power
    results_rmse <- sqrt(mean(i$residual^2))
    print(pwr[j])
  }
  print(paste("RMSE", results_rmse))
}
But this loop prints each RMSE individually, so I changed the assignment to this:
results_rmse[[i]] <- sqrt(mean(i$residual^2))
But then I get this error:
Error in results_rmse[[i]] <- sqrt(mean(i$residual^2)) : invalid subscript type 'S4'
I tried several versions of the for loop, but I couldn't even figure out how to store the values in a list, let alone sort them by the smallest RMSE.
The inner loop over j in the RMSE calculation is not needed, as far as I understand the problem. I also rearranged the loop so that it cycles over indices (1:length(results.cv)) instead of over the elements themselves; in your version i is an S4 object, which is why results_rmse[[i]] fails with "invalid subscript type 'S4'".
# Dummy residuals, because your script doesn't run for me. The rest is identical to your code
for(i in power) {
  results.cv[[paste0(i,"P")]]$residual <- rnorm(50)
}
# Fixed loop
for(i in 1:length(results.cv)){
  results_rmse[[i]] <- sqrt(mean(results.cv[[i]]$residual^2))
}
names(results_rmse) <- names(results.cv)
Alternatively, the for loop can be avoided with lapply. The result is a named list matching the input names, so the names() line above can be omitted and you still get the same results_rmse.
results_rmse <- lapply(results.cv, function(x) sqrt(mean(x$residual^2)))
To print the data as you showed in your question:
cbind(RMSE=unlist(results_rmse), Power=power)
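And since you also want the table ordered by the smallest RMSE, one way to do that with the objects above (a sketch):
tab <- cbind(RMSE = unlist(results_rmse), Power = power)
tab[order(tab[, "RMSE"]), ]   # rows sorted by increasing RMSE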
I have two matrices of equal dimensions (p and e) and I would like to compute a Spearman correlation between columns of the same name. I want to collect the pairwise correlations in a matrix (M).
I used the corr.test() function from the psych package, and here is what I did:
library(psych)
M <- data.frame(matrix(ncol = 3, nrow = ncol(p)))
M[,1] <- as.character()
G <- colnames(p)
for(rs in 1:ncol(p)){
  M[rs,1] <- G[rs]
  cor <- corr.test(p[,rs], e[,rs], method = "spearman", adjust = "none")
  M[rs,2] <- cor$r
  M[rs,3] <- cor$p
}
But I get an error message:
Error in 1:ncol(y) : argument of length 0
Could you please show me what is wrong, or suggest another method?
No need for all this looping and indexing etc:
# test data (column names added so the columns can be matched by name)
p <- matrix(data = rnorm(100), nrow = 10, dimnames = list(NULL, letters[1:10]))
e <- matrix(data = rnorm(100), nrow = 10, dimnames = list(NULL, letters[1:10]))
cor <- corr.test(p, e, method="spearman", adjust="none")
data.frame(name=colnames(p), r=diag(cor$r), p=diag(cor$p))
# name r p
#a a 0.36969697 0.2930501
#b b 0.16363636 0.6514773
#c c -0.15151515 0.6760652
# etc etc
If the names of the matrices don't already match, then match them:
cor <- corr.test(p, e[,match(colnames(p),colnames(e))], method="spearman", adjust="none")
Since the two matrices are huge, it would take a very long time to run corr.test() on all possible column pairs, but the loop that finally worked is as follows:
library(psych)
M <- data.frame(matrix(ncol=3,nrow=ncol(p)))
M[,1] <- as.character()
G <- colnames(p)
for(rs in 1:ncol(p)){
  M[rs,1] <- G[rs]
  cor <- corr.test(as.data.frame(p[,rs]), as.data.frame(e[,rs]),
                   method = "spearman", adjust = "none")
  M[rs,2] <- cor$r
  M[rs,3] <- cor$p
}
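A possible alternative for the huge-matrix case (a sketch, not from the answers above, assuming p and e share column names): base R's cor.test() handles one pair of columns at a time, so the full cross-correlation matrix is never built:
res <- sapply(colnames(p), function(nm) {
  ct <- cor.test(p[, nm], e[, nm], method = "spearman")
  c(r = unname(ct$estimate), p = ct$p.value)   # Spearman's rho and its p-value
})
M <- data.frame(name = colnames(p), r = res["r", ], p = res["p", ])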
Given a set of xy coordinates, how can I choose n points such that those n points are most distant from each other?
An inefficient method that probably wouldn't do too well with a big dataset would be the following (identify 20 points out of 1000 that are most distant):
xy <- cbind(rnorm(1000),rnorm(1000))
n <- 20
bestavg <- 0
bestSet <- NA
for (i in 1:1000){
  subset <- xy[sample(1:nrow(xy), n), ]
  avg <- mean(dist(subset))
  if (avg > bestavg) {
    bestavg <- avg
    bestSet <- subset
  }
}
This code, based on Pascal's code, iteratively drops the point with the smallest row sum in the distance matrix (i.e. the point closest to all the others) until n points remain.
m2 <- function(xy, n){
  subset <- xy
  alldist <- as.matrix(dist(subset))
  while (nrow(subset) > n) {
    cdists <- rowSums(alldist)                  # total distance from each point to all others
    closest <- which(cdists == min(cdists))[1]  # index of the most "central" point
    subset <- subset[-closest, ]                # drop it
    alldist <- alldist[-closest, -closest]
  }
  return(subset)
}
Run on a Gaussian cloud, where m1 is @Pascal's function:
> set.seed(310366)
> xy <- cbind(rnorm(1000),rnorm(1000))
> m1s = m1(xy,20)
> m2s = m2(xy,20)
See who did best by looking at the sum of the interpoint distances:
> sum(dist(m1s))
[1] 646.0357
> sum(dist(m2s))
[1] 811.7975
Method 2 wins! And compare with a random sample of 20 points:
> sum(dist(xy[sample(1000,20),]))
[1] 349.3905
which does pretty poorly as expected.
So what's going on? Let's plot:
> plot(xy,asp=1)
> points(m2s,col="blue",pch=19)
> points(m1s,col="red",pch=19,cex=0.8)
Method 1 generates the red points, which are evenly spaced out over the space. Method 2 creates the blue points, which almost define the perimeter. I suspect the reason for this is easy to work out (and even easier in one dimension...).
Using a bimodal pattern of initial points also illustrates this:
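(The plot itself is not reproduced here. For reference, a bimodal cloud of this kind could be generated roughly as below; this is only a hypothetical sketch, the exact data behind the numbers that follow is not shown, and m1 is @Pascal's function as above.)
# hypothetical bimodal point cloud: two shifted Gaussian clusters
xy2 <- rbind(cbind(rnorm(500, -2), rnorm(500, -2)),
             cbind(rnorm(500,  2), rnorm(500,  2)))
m1s2 <- m1(xy2, 20)
m2s2 <- m2(xy2, 20)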
and again method 2 produces a much larger total distance than method 1, but both do better than random sampling:
> sum(dist(m1s2))
[1] 958.3518
> sum(dist(m2s2))
[1] 1206.439
> sum(dist(xy2[sample(1000,20),]))
[1] 574.34
Following @Spacedman's suggestion, I have written some code that drops a point from the closest pair until the desired number of points remains. It seems to work well; however, it slows down pretty quickly as the number of points grows.
xy <- cbind(rnorm(1000), rnorm(1000))
n <- 20
subset <- xy
alldist <- as.matrix(dist(subset))
diag(alldist) <- NA
alldist[upper.tri(alldist)] <- NA
while (nrow(subset) > n) {
  closest <- which(alldist == min(alldist, na.rm = TRUE), arr.ind = TRUE)
  subset <- subset[-closest[1,1], ]
  alldist <- alldist[-closest[1,1], -closest[1,1]]
}
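Wrapped up as a function for convenience (just a sketch, so it can be compared with m1() and m2() above in the same way):
m3 <- function(xy, n) {
  subset <- xy
  alldist <- as.matrix(dist(subset))
  diag(alldist) <- NA
  alldist[upper.tri(alldist)] <- NA
  while (nrow(subset) > n) {
    # find the closest remaining pair and drop one of its two points
    closest <- which(alldist == min(alldist, na.rm = TRUE), arr.ind = TRUE)
    subset <- subset[-closest[1,1], ]
    alldist <- alldist[-closest[1,1], -closest[1,1]]
  }
  subset
}
sum(dist(m3(xy, 20)))   # total interpoint distance, comparable with m1/m2 above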
I'm new to this site. I was wondering if anyone had experience with turning a list of grid coordinates (shown in the example code below as df) into an adjacency matrix. I've written a function that handles the job for very small data sets, but the run time increases rapidly with the size of the data set (I think 800 pixels would take about 25 hours). It's because of the nested for loops, but I don't know how to get around them.
## Dummy Data
x <- c(1,1,2,2,2,3,3)
y <- c(3,4,2,3,4,1,2)
df <- as.data.frame(cbind(x,y))
df
## Here's what it looks like as an image
a <- c(NA,NA,1,1)
b <- c(NA,1,1,1)
c <- c(1,1,NA,NA)
image <- cbind(a,b,c)
f <- function(m) t(m)[,nrow(m):1]
image(f(image))
## Here's my adjacency matrix function that's slowwwwww
adjacency.coordinates <- function(x, y) {
  df <- as.data.frame(cbind(x, y))
  colnames(df) <- c("V1", "V2")
  df <- df[with(df, order(V1, V2)), ]
  adj.mat <- diag(1, dim(df)[1])
  for (i in 1:dim(df)[1]) {
    for (j in 1:dim(df)[1]) {
      # two cells are adjacent if they differ by 1 in exactly one coordinate
      if((df[i,1] - df[j,1] == 0) & (abs(df[i,2] - df[j,2]) == 1) |
         (df[i,2] - df[j,2] == 0) & (abs(df[i,1] - df[j,1]) == 1)) {
        adj.mat[i,j] <- 1
      }
    }
  }
  return(adj.mat)
}
## Here's the adjacency matrix
adjacency.coordinates(x,y)
Does anyone know of a way to do this that will work well on a set of coordinates a couple of thousand pixels long? I've tried converting to a SpatialGridDataFrame and going from there, but that doesn't produce the correct adjacency matrix. Thank you so much for your time.
While I thought igraph might be the way to go here, I think you can do it more simply, like this:
result <- apply(df, 1, function(pt)
  (pt["x"] == df$x & abs(pt["y"] - df$y) == 1) |
  (abs(pt["x"] - df$x) == 1 & pt["y"] == df$y)
)
diag(result) <- 1
This avoids the loops and gives the same result:
> identical(adjacency.coordinates(x,y),result)
[1] TRUE
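Another possible route (a sketch, not part of the answer above): two grid cells are 4-neighbours exactly when their Manhattan distance is 1, so the same matrix can also be built from dist(). It is still quadratic in memory, like the apply() version, but pushes the pairwise work into compiled code:
m <- as.matrix(dist(df, method = "manhattan"))  # pairwise |dx| + |dy|
adj <- (m == 1) * 1                             # 1 where cells touch horizontally or vertically
diag(adj) <- 1                                  # keep the self-adjacency used above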