Clustering function R Hclust Loop and develop a table - r

I'm working on a text mining/clustering project and am trying to create a table with one row per number of clusters and 6 columns for the following metrics:
max.diameter, min.separation, average.within, average.between, avg.silwidth, dunn.
I need to create the tables for 3 methods: kmeans, pam and hclust.
I was able to create something for kmeans:
dtm0.90Dist = dist(dtm0.90)
foreachcluster = function(k) {
kmeans.result = kmeans(dtm0.90, k);
kmeans.stats = cluster.stats(dtm0.90Dist,kmeans.result$cluster);
c(kmeans.stats$min.separation, kmeans.stats$max.diameter,
kmeans.stats$average.within, kmeans.stats$avearge.between,
kmeans.stats$avg.silwidth, kmeans.stats$dunn)
}
rbind(foreachcluster(2), foreachcluster(3), foreachcluster(4), foreachcluster(5),
foreachcluster(6), foreachcluster(7),foreachcluster(8))
and I get the following output
[,1] [,2] [,3] [,4] [,5]
[1,] 3.162278 30.19934 5.831550 0.5403872 0.10471348
[2,] 2.236068 28.37252 5.006058 0.3923446 0.07881104
[3,] 1.000000 28.37252 4.995478 0.2496066 0.03524537
[4,] 1.000000 26.40076 4.387212 0.2633338 0.03787770
[5,] 1.000000 26.40076 4.353248 0.2681947 0.03787770
[6,] 1.000000 26.40076 4.163757 0.1633954 0.03787770
[7,] 1.000000 26.40076 4.128927 0.2676423 0.03787770
I need similar output for the hclust and pam methods, but for the life of me I can't get the same function to work for either of them.
OK, so I was able to write a function for hclust:
forhclust=function(k){dfDist = dist(dtm0.90);
hclust.result = hclust(dfDist);
hclust.cluster = (cutree(hclust.result, k));
cluster.stats(dfDist,hclust.cluster);c(cluster.stats$min.separation)}
But I get an error when I run this:
Error in cluster.stats$min.separation :
object of type 'closure' is not subsettable
What I need is for it to print the "min.separation" output.
I would really appreciate any help, and some guidance in understanding why my approach fails for hclust.
Also, is there a good source that explains how these methods work and how to apply them, step by step and in detail?
Thank you

foreachcluster2 = function(k) {
hc = hclust(mDist, method = "ave")
hresult = cutree(hc, k)
h.stats = cluster.stats(mDist,hresult);
c( max.dia=h.stats$max.diameter,
min.sep=h.stats$min.separation,
avg.wi=h.stats$average.within,
avg.bw=h.stats$average.between,
silwidth=h.stats$avg.silwidth,
dunn=h.stats$dunn)
}
t2 = rbind(foreachcluster2(2), foreachcluster2(3), foreachcluster2(4), foreachcluster2(5),foreachcluster2(6),
foreachcluster2(7), foreachcluster2(8), foreachcluster2(9), foreachcluster2(10),
foreachcluster2(11), foreachcluster2(12),foreachcluster2(13),foreachcluster2(14))
rownames(t2) = 2:14
t2
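This should work. Your hclust version failed because the result of cluster.stats() was never assigned: cluster.stats$min.separation tries to subset the function itself (a closure) rather than its result. Note also that the kmeans version misspells average.between as avearge.between, which returns NULL and is why that output has only five columns. For pam():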
pamC <- pam(x=m, k=2)
pamC
pamC$clustering
Use $clustering instead of $cluster; the rest is the same.
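The same idea can be written once and reused for both pam and hclust. The following is only a sketch under assumptions: it uses simulated data in place of the document-term matrix, and the helper name stats_for is made up for illustration. It assumes the cluster and fpc packages are installed.

```r
library(cluster)  # pam()
library(fpc)      # cluster.stats()

# Simulated stand-in for the real data (assumption: two loose groups of points)
set.seed(42)
m <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 4), ncol = 2))
mDist <- dist(m)

# Hypothetical helper: store the cluster.stats() result, then pick the six metrics
stats_for <- function(clustering) {
  s <- cluster.stats(mDist, clustering)
  c(max.dia = s$max.diameter, min.sep = s$min.separation,
    avg.wi = s$average.within, avg.bw = s$average.between,
    silwidth = s$avg.silwidth, dunn = s$dunn)
}

ks <- 2:6

# pam: the cluster assignment lives in $clustering
pam_table <- t(sapply(ks, function(k) stats_for(pam(m, k = k)$clustering)))

# hclust: build the tree once, then cut it at each k
hc <- hclust(mDist, method = "average")
hc_table <- t(sapply(ks, function(k) stats_for(cutree(hc, k))))

rownames(pam_table) <- rownames(hc_table) <- ks
pam_table
hc_table
```

Using sapply over the values of k avoids writing out one rbind argument per cluster count, and naming the elements of the returned vector gives the table its column headers.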

Related

Decomposed variance ill-defined in Analysis of heterogeneity (ANOHE)

I am trying to run a meta-analysis using the package "gemtc", and the code performs very well on my test data.
The code is listed:
data <- read.csv("input.txt", sep=",", header=T)
network <- mtc.network(data, description="Example")
result.anohe <- mtc.anohe(network, n.adapt=10000, n.iter=50000)
#The file (problem.txt) is also attached.
However, when I use my real data, it fails with the following error:
Error in decompose.study(study.samples[, colIndexes, drop = FALSE], studies[i]) :
Decomposed variance ill-defined for 1. Most likely the USE did not converge:
[,1] [,2] [,3] [,4]
[1,] 0.000 2478.307 2491.482 2485.044
[2,] 2478.307 0.000 1106288.727 -440067.825
[3,] 2491.482 1106288.727 0.000 -1459996.199
[4,] 2485.044 -440067.825 -1459996.199 0.000
Thanks very much in advance!
The input file causing the problem is attached:
file

Creating spatialpolygons dataframe from list of polygons

I am currently trying to create a polygon shapefile from a list of polygons (study areas for biodiversity research).
Currently these polygons are stored in a list in this format:
$SEW22
[,1] [,2]
[1,] 427260.4 5879458
[2,] 427161.4 5879472
[3,] 427175.0 5879571
[4,] 427273.9 5879557
[5,] 427260.4 5879458
$SEW23
[,1] [,2]
[1,] 418011.0 5867216
[2,] 417912.0 5867230
[3,] 417925.5 5867329
[4,] 418024.5 5867315
[5,] 418011.0 5867216
I tried to simply write them to a shapefile with writeOGR, but the following error occurs:
> #write polygons to shp
> filenameshp <- paste('Forestplots')
> layername <- paste('Forestplots')
> writeOGR(obj=forest, dsn = filenameshp,
+ layer=layername, driver="ESRI Shapefile", overwrite_layer = TRUE)
Error in writeOGR(obj = forest, dsn = filenameshp, layer = layername, :
inherits(obj, "Spatial") is not TRUE
I read this tutorial by Barry Rowlingson to create spatialpolygons and thought I should probably first create a dataframe and did this:
forestm<-do.call(rbind,forest)
but this returned nothing useful, as you can imagine, and it lost the names of the plots.
As I am still new to R, I also tried lots of other approaches whose merit I could not fully judge. None returned what I hoped for, so I will spare you the details.
I am looking forward to your suggestions.
Many thanks
P.S. I also tried the following, as described in the documentation for SpatialPolygons in the sp package:
> Polygons(forest, ID)
Error in Polygons(forest, ID) : srl not a list of Polygon objects
You can follow the approach described in this answer: https://gis.stackexchange.com/questions/18311/instantiating-spatial-polygon-without-using-a-shapefile-in-r.
Here's how to apply the approach to your case. First, I create a list of matrices as in your sample data:
forest <- list(
"SEW22" = matrix(c(427260.4, 5879458, 427161.4, 5879472, 427175.0, 5879571, 427273.9, 5879557, 427260.4, 5879458),
nc = 2, byrow = TRUE),
"SEW23" = matrix(c(418011.0, 5867216, 417912.0, 5867230, 417925.5, 5867329, 418024.5, 5867315, 418011.0, 5867216),
nc = 2, byrow = TRUE)
)
Now build the spatial objects:
library(sp)
p <- lapply(forest, Polygon)
ps <- lapply(seq_along(p), function(i) Polygons(list(p[[i]]), ID = names(p)[i]))
sps <- SpatialPolygons(ps)
sps_df <- SpatialPolygonsDataFrame(sps, data.frame(x = rep(NA, length(p)), row.names = names(p)))
In the first step, we iterate through the list of matrices and apply the Polygon function to each matrix to create a list of Polygon objects. In the second step, we iterate through this list to create a Polygons object, setting the ID of each element in this object to the corresponding name in the original list (e.g. "SEW22", "SEW23"). The third step creates a SpatialPolygons object. Finally, we create a SpatialPolygonsDataFrame object. Here I have a dummy dataframe populated with NAs (note that the row names must correspond to the polygon IDs).
Finally, write the data:
rgdal::writeOGR(obj = sps_df,
dsn = "Forestplots",
layer = "Forestplots",
driver = "ESRI Shapefile",
overwrite_layer = TRUE)
This creates a new folder in your working directory:
list.files()
# [1] "Forestplots"
list.files("Forestplots")
# [1] "Forestplots.dbf" "Forestplots.shp" "Forestplots.shx"
Consult the linked answer for more details.
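The rgdal package used above has since been retired from CRAN, so here is a sketch of the same workflow with the sf package instead. This is a swapped-in technique, not the approach of the linked answer. The column name plot and the temporary output path are assumptions chosen for illustration, and no coordinate reference system is set because the question does not state one.

```r
library(sf)

# Same sample data as above: one closed ring (first point == last point) per plot
forest <- list(
  SEW22 = matrix(c(427260.4, 5879458, 427161.4, 5879472, 427175.0, 5879571,
                   427273.9, 5879557, 427260.4, 5879458),
                 ncol = 2, byrow = TRUE),
  SEW23 = matrix(c(418011.0, 5867216, 417912.0, 5867230, 417925.5, 5867329,
                   418024.5, 5867315, 418011.0, 5867216),
                 ncol = 2, byrow = TRUE)
)

# Turn each coordinate matrix into a POLYGON geometry
geoms <- st_sfc(lapply(forest, function(m) st_polygon(list(m))))

# Attach the plot names as an attribute column (column name is an assumption)
plots_sf <- st_sf(plot = names(forest), geometry = geoms)

# Write a shapefile; a temporary directory is used here to keep the sketch tidy
out <- file.path(tempdir(), "Forestplots.shp")
st_write(plots_sf, out, append = FALSE, quiet = TRUE)
```

If the coordinate reference system of the plots is known, it can be passed to st_sfc() through its crs argument so that the shapefile is written with projection information.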

Why does not the 'outer' function work properly for some argument values in R?

When I run the R command:
outer(37:42, 37:42, complex, 1)
I get an error
"Error in dim(robj) <- c(dX, dY) : dims [product 36] do not match the length of object [37]"
in my R session. But when I run
outer(36:42, 36:42, complex, 1)
I get a valid matrix as a result. The problem persists for all starting values greater than 36, and there is no problem for starting values less than 37.
Is this a bug?
My system: Microsoft R Open 3.4.4 / RStudio 1.1.447 / Ubuntu 16.04
More specifically, when running the function with arguments m:n, m:n it returns the error whenever m > (n - m + 1)^2. Try for example outer(20:23, 20:23, complex, 1) and outer(20:24, 20:24, complex, 1): the first will fail but the latter won't, because 20 > (23-20+1)^2 = 16 while 20 < (24-20+1)^2 = 25. I suspect this has to do with the first argument of complex being length.out, which defines the length of the vector to return. outer passes its first two arguments to the function positionally, so your first argument 37:42 is passed to the length.out parameter, and the error message suggests the returned vector then has 37 elements instead of the 36 that outer expects. This does not make a lot of sense, so please correct me if I am wrong, but I think what you want to do is the following:
outer(37:42, 37:42, function(x,y) {complex(1, real = x, imaginary = y)})
Which outputs:
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 37+37i 37+38i 37+39i 37+40i 37+41i 37+42i
[2,] 38+37i 38+38i 38+39i 38+40i 38+41i 38+42i
[3,] 39+37i 39+38i 39+39i 39+40i 39+41i 39+42i
[4,] 40+37i 40+38i 40+39i 40+40i 40+41i 40+42i
[5,] 41+37i 41+38i 41+39i 41+40i 41+41i 41+42i
[6,] 42+37i 42+38i 42+39i 42+40i 42+41i 42+42i
Hope this helps.
The problem is in the 4th argument, which should be named:
outer(37:42, 37:42, complex, length.out = 1)
With length.out named, the two vectors are matched to real and imaginary instead, and the call works fine.
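As a small check of the argument matching described in both answers, the sketch below compares the named-argument call with an explicit wrapper function. The variable names z_named and z_wrapped are made up for illustration.

```r
# Naming length.out frees the first two positions, so outer's x and y
# are matched to complex()'s real and imaginary parameters
z_named <- outer(37:42, 37:42, complex, length.out = 1)

# Explicit wrapper: states the mapping of x and y directly
z_wrapped <- outer(37:42, 37:42, function(x, y) complex(real = x, imaginary = y))

dim(z_named)
z_named[1, 2]                 # real part from row value, imaginary part from column value
all(z_named == z_wrapped)     # both calls should build the same matrix
```

The wrapper is a little longer, but it does not rely on the reader knowing the parameter order of complex().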

R subscript out of bounds corrected?

I generated the following data matrix called arrayDataMatrixQuantile in R:
DNp73flflV2324I DNp73flflV2324J DNp73flflV2324K DNp73nullV2523B DNp73nullV2523C DNp73nullV2523E
ENSMUSG00000028180 8.185794 5.6914560 5.693373 6.9734687 8.8689120 5.9152113
ENSMUSG00000028182 0.000000 0.1749128 0.000000 0.1685122 0.1784736 0.1229401
ENSMUSG00000028185 0.000000 0.0000000 0.000000 0.0000000 0.0000000 0.0000000
ENSMUSG00000028184 7.439927 8.8635180 10.288115 11.8621800 13.4530467 13.4414667
ENSMUSG00000028187 7.458357 10.0175407 14.108493 11.7789400 19.7581400 12.1482933
ENSMUSG00000028186 0.400568 0.1346390 3.450423 0.1643176 0.0000000 0.0000000
I want to generate log2 of each of the values and output that. The R code I wrote:
log2_matrix<-matrix( nrow(arrayDataMatrixQuantile),ncol(arrayDataMatrixQuantile)) # creates the new matrix
for (i in 1:nrow(arrayDataMatrixQuantile)) {
for (j in 1:ncol(arrayDataMatrixQuantile)) {
add <- ((arrayDataMatrixQuantile[i,j])+10^-5) # add 10^-5 to avoid log2(0) for 0 values
log2_matrix[i,j] <- log2(add) }
}
This code gives the following error:
Error in [<-(*tmp*, i, j, value = 2.50880030780749) : subscript out of bounds
However, once I change the line :
log2_matrix<-matrix( nrow(arrayDataMatrixQuantile),ncol(arrayDataMatrixQuantile))
to
log2_matrix<-matrix(0, nrow(arrayDataMatrixQuantile),ncol(arrayDataMatrixQuantile))
it works. I don't know how adding a "0" to the new matrix gets rid of the error. I used it because I saw other users adding a 0 at the start of each new matrix. Any advice on this?
We could do this either using apply
apply(arrayDataMatrixQuantile, 2, FUN=function(x) x+ 10^-5)
Or directly add the number to the entire dataset
arrayDataMatrixQuantile+10^-5
Regarding the error in the OP's code, it happened because the matrix created did not have the same dimensions as "arrayDataMatrixQuantile":
log2_matrix<- matrix( nrow(arrayDataMatrixQuantile),
ncol(arrayDataMatrixQuantile))
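Here matrix() has no explicit data argument, so nrow(...) is matched to data and ncol(...) to nrow. For the 6 x 6 sample above, "log2_matrix" therefore has dimensions 6,1 with 6 as the value (from the nrow(...)), and any access to column 2 is out of bounds. Passing 0 as the first argument fixes this because it fills the data slot. Alternatively, we can add a , before the nrow(..) so that we get a matrix of NA with dimensions 6,6: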
log2_matrix <- matrix(, nrow(arrayDataMatrixQuantile),
ncol(arrayDataMatrixQuantile))
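To make the positional matching concrete, here is a small sketch on a made-up 3 x 2 matrix (the gene and sample names are placeholders, not the real data). It also shows the vectorised log2 transform, which needs no pre-allocated matrix or loop.

```r
# Made-up stand-in for arrayDataMatrixQuantile
x <- matrix(c(8.185794, 0, 0, 5.691456, 0.1749128, 0), nrow = 3,
            dimnames = list(c("g1", "g2", "g3"), c("s1", "s2")))

# Positional call: nrow(x) becomes the data and ncol(x) becomes nrow
bad <- matrix(nrow(x), ncol(x))
dim(bad)   # 2 rows, 1 column, every cell holding the value 3

# Naming the arguments gives the intended shape
good <- matrix(NA_real_, nrow = nrow(x), ncol = ncol(x))
dim(good)

# Vectorised alternative: log2 of every value, with a small offset for zeros
log2_matrix <- log2(x + 1e-5)
log2_matrix
```

Because arithmetic and log2() work elementwise on matrices, the result keeps the dimensions and the row and column names of the input.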

How to fix this PCA in R

I am creating a PCA plot from data:
label <- read.table('label_clusters.tsv')
mydata <- read.table('raw_clusters.tsv')
GP.svd = svd(mydata)
dat = data.frame("pc1"= GP.svd$u[,1],
"pc2"= GP.svd$u[,2],
"Data"= c(rep("my", nsamples(our.obj2)), rep("zeller", nsamples(z.obj))))
GP.svd is a large list in the form of:
[,97] [,98] [,99] [,100] [,101] [,102]
[1,] -9.616173e-02 -0.0779788701 -0.1087899396 -0.0653396699 -0.140911786 -5.064931e-02
[2,] 1.101038e-01 0.0465664554 0.0237686772 0.1344639223 0.035536326 2.715842e-02
[3,] -3.247248e-02 0.0295960109 0.0148926826 0.0021550661 -0.003509716 -1.887659e-02
When I run the code thus far, I get this error:
Error in data.frame(pc1 = GP.svd$u[, 1], pc2 = GP.svd$u[, 2], Data = c(rep("my", :
could not find function "nsamples"
I am not sure why this is happening, any help is appreciated
Your code cannot find the nsamples function. This means that you either:
have to load a package that contains nsamples, or
write an nsamples function yourself that works correctly on our.obj2, or
use a different function, for example nrow if our.obj2 is a data.frame.
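The third option can be sketched as follows. The objects ours and zeller are simulated stand-ins for our.obj2 and z.obj, and the sketch assumes samples are stored as rows, which may not match the real data.

```r
# Simulated stand-ins (assumption: one row per sample)
set.seed(1)
ours   <- matrix(rnorm(20), nrow = 4)
zeller <- matrix(rnorm(30), nrow = 6)
mydata <- rbind(ours, zeller)

GP.svd <- svd(mydata)

# nrow() replaces nsamples() when samples are rows of a matrix or data.frame
dat <- data.frame(pc1 = GP.svd$u[, 1],
                  pc2 = GP.svd$u[, 2],
                  Data = c(rep("my", nrow(ours)), rep("zeller", nrow(zeller))))
table(dat$Data)
```

If samples are stored as columns instead, ncol() would be the matching replacement, and the label vector must have as many entries as there are rows in GP.svd$u.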
