I'm doing hierarchical clustering with an R package called pvclust, which builds on hclust by incorporating bootstrapping to calculate significance levels for the clusters obtained.
Consider the following data set with 3 dimensions and 10 observations:
mat <- as.matrix(data.frame("A"=c(9000,2,238),"B"=c(10000,6,224),"C"=c(1001,3,259),
"D"=c(9580,94,51),"E"=c(9328,5,248),"F"=c(10000,100,50),
"G"=c(1020,2,240),"H"=c(1012,3,260),"I"=c(1012,3,260),
"J"=c(984,98,49)))
When I use hclust alone, the clustering runs fine for both Euclidean measures and correlation measures:
# euclidean-based distance
dist1 <- dist(t(mat),method="euclidean")
mat.cl1 <- hclust(dist1,method="average")
# correlation-based distance
dist2 <- as.dist(1 - cor(mat))
mat.cl2 <- hclust(dist2, method="average")
However, when using the each set up with pvclust, as follows:
library(pvclust)
# euclidean-based distance
mat.pcl1 <- pvclust(mat, method.hclust="average", method.dist="euclidean", nboot=1000)
# correlation-based distance
mat.pcl2 <- pvclust(mat, method.hclust="average", method.dist="correlation", nboot=1000)
... I get the following errors:
Euclidean: Error in hclust(distance, method = method.hclust) :
must have n >= 2 objects to cluster
Correlation: Error in cor(x, method = "pearson", use = use.cor) :
supply both 'x' and 'y' or a matrix-like 'x'.
Note that the distance is calculated by pvclust so there is no need for a distance calculation beforehand. Also note that the hclust method (average, median, etc.) does not affect the problem.
When I increase the dimensionality of the data set to 4, pvclust now runs fine. Why is it that I'm getting these errors for pvclust at 3 dimensions and below but not for hclust? Furthermore, why do the errors disappear when I use a data set above 4 dimensions?
At the end of function pvclust we see a line
mboot <- lapply(r, boot.hclust, data = data, object.hclust = data.hclust,
nboot = nboot, method.dist = method.dist, use.cor = use.cor,
method.hclust = method.hclust, store = store, weight = weight)
then digging deeper we find
getAnywhere("boot.hclust")
function (r, data, object.hclust, method.dist, use.cor, method.hclust,
nboot, store, weight = F)
{
n <- nrow(data)
size <- round(n * r, digits = 0)
....
smpl <- sample(1:n, size, replace = TRUE)
suppressWarnings(distance <- dist.pvclust(data[smpl,
], method = method.dist, use.cor = use.cor))
....
}
also note, that the default value of parameter r for function pvclust is r=seq(.5,1.4,by=.1). Well, actually as we can see this value is being changed somewhere:
Bootstrap (r = 0.33)...
so what we get is size <- round(3 * 0.33, digits =0) which is 1, finally data[smpl,] has only 1 row, which is less than 2. After correction of r it returns some error which possibly is harmless and output is given too:
mat.pcl1 <- pvclust(mat, method.hclust="average", method.dist="euclidean",
nboot=1000, r=seq(0.7,1.4,by=.1))
Bootstrap (r = 0.67)... Done.
....
Bootstrap (r = 1.33)... Done.
Warning message:
In a$p[] <- c(1, bp[r == 1]) :
number of items to replace is not a multiple of replacement length
Let me know if the results is satisfactory.
Related
I am trying to learn about the "kohonen" package in R. In particular, there is a function called "supersom()" (https://www.rdocumentation.org/packages/kohonen/versions/3.0.10/topics/supersom , corresponding to the SOM (Self Organizing Maps) algorithm used in unsupervised machine learning) that I am trying to apply on some data.
Below, (from a previous question: R error: "Error in check.data : Argument Should be Numeric") I learned how to apply the "supersom()" function on some artificially created data with both "factor" and "numeric" variables.
#the following code works
#load libraries
library(kohonen)
library(dplyr)
#create and format data
a =rnorm(1000,10,10)
b = rnorm(1000,10,5)
c = rnorm(1000,5,5)
d = rnorm(1000,5,10)
e <- sample( LETTERS[1:4], 100 , replace=TRUE, prob=c(0.25, 0.25, 0.25, 0.25) )
f <- sample( LETTERS[1:5], 100 , replace=TRUE, prob=c(0.2, 0.2, 0.2, 0.2, 0.2) )
g <- sample( LETTERS[1:2], 100 , replace=TRUE, prob=c(0.5, 0.5) )
data = data.frame(a,b,c,d,e,f,g)
data$e = as.factor(data$e)
data$f = as.factor(data$f)
data$g = as.factor(data$g)
cols <- 1:4
data[cols] <- scale(data[cols])
#som model
som <- supersom(data= as.list(data), grid = somgrid(10,10, "hexagonal"),
dist.fct = "euclidean", keep.data = TRUE)
Everything works well - the problem is, when I try to apply the "supersom()" function on " more realistic and bigger data", I get the following error:
"Error: Non-informative layers present : mean distances between objects zero"
When I look at the source code for this function (https://rdrr.io/cran/kohonen/src/R/supersom.R), I notice a reference for the same error:
if (any(sapply(meanDistances, mean) < .Machine$double.eps))
stop("Non-informative layers present: mean distance between objects zero")
Can someone please show me how I might be able to resolve this error, i.e. make the "supersom()" function work with factor and numeric data?
I thought that perhaps removing duplicate rows and NA's might fix this problem:
data <- na.omit(data)
data <- unique(data)
However the same error ("Non-informative layers present : mean distances between objects zero") is still there.
Can someone please help me figure out what might be causing this error? Note: when I remove the "factor" variables, everything works fine.
Sources:
https://cran.r-project.org/web/packages/kohonen/kohonen.pdf
https://www.rdocumentation.org/packages/kohonen/versions/2.0.5/topics/supersom
https://rdrr.io/cran/kohonen/src/R/supersom.R
The error happens if you have certain numeric columns whose mean is 0. You can reproduce the error by turning any 1 column to 0.
data$a <- 0
som <- supersom(data= as.list(data), grid = somgrid(10,10, "hexagonal"),
dist.fct = "euclidean", keep.data = TRUE)
Error in supersom(data = as.list(data), grid = somgrid(10, 10, "hexagonal"), :
Non-informative layers present: mean distance between objects zero
Maybe you can investigate why those column have 0 mean or remove the columns with 0 means from the data.
library(kohonen)
library(dplyr)
data <- data %>% select(where(~(is.numeric(.) && mean(.) > 0) | !is.numeric(.)))
#som model
som <- supersom(data= as.list(data), grid = somgrid(10,10, "hexagonal"),
dist.fct = "euclidean", keep.data = TRUE)
I am trying to get the optimal number of clusters from a dendrogram obtained with the cluster package and diana method (used euclidean dissimilarity) with the following:
mydatad <- diana(mydata, stand = FALSE)
mydata_dend <- fviz_dend(mydatad, cex = 0.5, k = 2, palette = "jco")
## use factoextra and ggplot2 for visualization
then I try to use the dissimilarity matrix with NbClust:
NC <- NbClust(data = NULL, diss = mydatadd, distance = NULL, min.nc = 2, max.nc = 50, method= "single", index = "silhouette")
and I get this error message":
Error in if (is.na(n) || n > 65536L) stop("size cannot be NA nor exceed 65536") :
missing value where TRUE/FALSE needed
however my data has no missing values, the lowest values in the dissim matrix should be 0 for identical elements (mydata is ~2000 elements , 11 variables).
Then I tried to estimate the dissim . distance using the original data and estimating the euclidean distance, thus not using the dissimilarity matrix
NC <- NbClust(T2141d, diss = NULL, distance = "euclidean", min.nc = 2, max.nc = 50, method= "single", index = "silhouette")
and I get the following error message:
Error in t(jeu) %*% jeu :
requires numeric/complex matrix/vector arguments
any suggestions are welcome, even suggestion for different methods to obtain the optimal number of cluster. thnx.
The NbClust function requires a dissimilarity matrix. When you run diana, the object returned is not a dissimilarity matrix, although it was calculated. You need to set keep.diss=TRUE. Before I show how to return the dissimilarity matrix using an example dataset:
library(factoextra)
library(cluster)
mydata=data.frame(matrix(runif(2000*11),ncol=11))
mydatad <- diana(mydata, stand = FALSE,keep.diss=TRUE)
# check the dissimilarity matrix stored
class(mydatad$diss)
[1] "dissimilarity" "dist"
NC <- NbClust(data = NULL, diss = mydatad$diss,
distance = NULL, min.nc = 2, max.nc = 50,
method= "single", index = "silhouette")
I am trying to use a mallet topic model with the LDAvis package. To do so you must extract a number of parameters from the topic.model object: phi, theta, vocab, doc.length, and term.frequency.
The mallet documentation makes no mention of these parameters. How can I extract them from a topic.model object generated from data using mallet.import() and MalletLDA()?
So far, I've used mallet to fit the topic model:
id_numbers <- as.integer(c(1, 2, 3))
comments <- c("words to be used for text mining", "that may or may not be interesting", "but could serve as a good example")
df <- data.frame(id_numbers, comments, stringsAsFactors = F)
# Set up topic model
library(mallet)
stoplist <- c("to", "be", "or")
write.csv(stoplist, file = "example_stoplist.csv")
mallet.instances <- mallet.import(
as.character(df$id_numbers),
as.character(df$comments),
"example_stoplist.csv",
FALSE,
token.regexp="[\\p{L}']+")
topic.model <- MalletLDA(num.topics=10)
topic.model$loadDocuments(mallet.instances)
vocabulary <- topic.model$getVocabulary()
word.freqs <- mallet.word.freqs(topic.model)
topic.model$setAlphaOptimization(40, 80) # tweaking optimization interval and burn-in iterations)
topic.model$train(400)
topic.words.m <- mallet.topic.words(topic.model, smoothed=TRUE,
normalized=TRUE)
dim(topic.words.m)
vocabulary <- topic.model$getVocabulary()
colnames(topic.words.m) <- vocabulary
doc.topics.m <- mallet.doc.topics(topic.model, smoothed=T,
normalized=T)
doc.topics.df <- as.data.frame(doc.topics.m)
doc.topics.df <- cbind(id_numbers, doc.topics.df)
doc.topic.means.df <- aggregate(doc.topics.df[, 2:ncol(doc.topics.df)],
list(doc.topics.df[,1]),
mean)
Out of this I now need to generate the JSON for LDAvis. I tried the following:
# LDAvis
library(LDAvis)
phi <- t(mallet.topic.words(topic.model, smoothed = TRUE, normalized = TRUE))
phi.count <- mallet.topic.words(topic.model, smoothed = TRUE, normalized = FALSE)
topic.words <- mallet.topic.words(topic.model, smoothed=TRUE, normalized=TRUE)
topic.counts <- rowSums(topic.words)
topic.proportions <- topic.counts/sum(topic.counts)
vocab <- topic.model$getVocabulary()
doc.tokens <- data.frame(id=c(1:nrow(doc.topics.m)), tokens=0)
for(i in vocab){
# Find word if word in text
matched <- grepl(i, df$comments)
doc.tokens[matched,2] =doc.tokens[matched,2] + 1
}
createJSON(phi = phi,
theta = doc.topics.m,
doc.length = doc.tokens,
vocab = vocab,
term.frequency = apply(phi.count, 1, sum))
However, this gives me the following error message:
Error in createJSON(phi = phi, theta = doc.topics.m, doc.length = doc.tokens, :
Number of rows of phi does not match
number of columns of theta; both should be equal to the number of topics
in the model.
So I seem to be generating the phi and theta matrices in the wrong way.
Try removing the matrix transpose function t() from the line where you create phi.
RMallet is returning these matrices in the format expected by LDAvis: topics are columns for document topics (theta) and rows for topic words (phi). Sometimes it makes sense to flip one of them so that either rows or columns always means topics, but not here.
When I was learning kknn in R, I came across the IRIS data example, I found kknn does not take any argument of centroid numbers.
require(kknn)
data(iris)
m <- dim(iris)[1]
val <- sample(1:m, size = round(m/3), replace = FALSE, prob = rep(1/m, m))
iris.learn <- iris[-val,] # train
iris.valid <- iris[val,] # test
iris.kknn <- kknn(Species~., iris.learn, iris.valid, distance = 1, kernel ="triangular")
How does it optimize the number of centroid?
I remember knn takes argument of centroid number.
According to the documentation of kknn package there is an argument to determine the number of centroids and that is argument k.
From documentation:
k: Number of neighbors considered.
The default value is 7 centroids. Your can be written as:
require(kknn)
data(iris)
m <- dim(iris)[1]
val <- sample(1:m, size = round(m/3), replace = FALSE, prob = rep(1/m, m))
iris.learn <- iris[-val,] # train
iris.valid <- iris[val,] # test
iris.kknn <- kknn(Species~., iris.learn, iris.valid,
distance = 1, kernel ="triangular", k=7)
#you can add whichever number you want above
Often, I want to run a cross validation on a dataset which contains some factor variables and after running for a while, the cross validation routine fails with the error: factor x has new levels Y.
For example, using package boot:
library(boot)
d <- data.frame(x=c('A', 'A', 'B', 'B', 'C', 'C'), y=c(1, 2, 3, 4, 5, 6))
m <- glm(y ~ x, data=d)
m.cv <- cv.glm(d, m, K=2) # Sometimes succeeds
m.cv <- cv.glm(d, m, K=2)
# Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
# factor x has new levels B
Update: This is a toy example. The same problem occurs with larger datasets as well, where there are several occurrences of level C but none of them is present in the training partition.
The function createDataPartition function from the package caret does stratified sampling for the outcome variables and correctly warns:
Also, for ‘createDataPartition’, very small class sizes (<= 3) the classes may not show up in both the training and test data.
There are two solutions which spring to mind:
First, create a subset of the data by selecting one random sample of each factor level first, starting from the rarest class (by frequency) and then greedily satisfying the next rare class and so on. Then using createDataPartition on the rest of the dataset and merging the results to create a new train dataset which contains all levels.
Using createDataPartitions and and doing rejection sampling.
So far, option 2 has worked for me because of the data sizes, but I cannot help but think that there must be a better solution than a hand rolled out one.
Ideally, I would want a solution which just works for creating partitions and fails early if there is no way to create such partitions.
Is there a fundamental theoretical reason why packages do not offer this? Do they offer it and I just haven't been able to spot them because of a blind spot? Is there a better way of doing this stratified sampling?
Please leave a comment if I should ask this question on stats.stackoverflow.com.
Update:
This is what my hand rolled out solution (2) looks like:
get.cv.idx <- function(train.data, folds, factor.cols = NA) {
if (is.na(factor.cols)) {
all.cols <- colnames(train.data)
factor.cols <- all.cols[laply(llply(train.data[1, ], class), function (x) 'factor' %in% x)]
}
n <- nrow(train.data)
test.n <- floor(1 / folds * n)
cond.met <- FALSE
n.tries <- 0
while (!cond.met) {
n.tries <- n.tries + 1
test.idx <- sample(nrow(train.data), test.n)
train.idx <- setdiff(1:nrow(train.data), test.idx)
cond.met <- TRUE
for(factor.col in factor.cols) {
train.levels <- train.data[ train.idx, factor.col ]
test.levels <- train.data[ test.idx , factor.col ]
if (length(unique(train.levels)) < length(unique(test.levels))) {
cat('Factor level: ', factor.col, ' violated constraint, retrying.\n')
cond.met <- FALSE
}
}
}
cat('Done in ', n.tries, ' trie(s).\n')
list( train.idx = train.idx
, test.idx = test.idx
)
}
Everyone agrees that there sure is an optimal solution. But personally, I would just try the cv.glm call until it works usingwhile.
m.cv<- try(cv.glm(d, m, K=2)) #First try
class(m.cv) #Sometimes error, sometimes list
while ( inherits(m.cv, "try-error") ) {
m.cv<- try(cv.glm(d, m, K=2))
}
class(m.cv) #always list
I've tried it with 100,000 rows in the data.fame and it only takes a few seconds.
library(boot)
n <-100000
d <- data.frame(x=c(rep('A',n), rep('B', n), 'C', 'C'), y=1:(n*2+2))
m <- glm(y ~ x, data=d)
m.cv<- try(cv.glm(d, m, K=2))
class(m.cv) #Sometimes error, sometimes list
while ( inherits(m.cv, "try-error") ) {
m.cv<- try(cv.glm(d, m, K=2))
}
class(m.cv) #always list
When I call traceback I get this:
> traceback()
9: stop(sprintf(ngettext(length(m), "factor %s has new level %s",
"factor %s has new levels %s"), nm, paste(nxl[m], collapse = ", ")),
domain = NA)
8: model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels)
7: model.frame(Terms, newdata, na.action = na.action, xlev = object$xlevels)
6: predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type ==
"link", "response", type), terms = terms, na.action = na.action)
5: predict.glm(d.glm, data[j.out, , drop = FALSE], type = "response")
4: predict(d.glm, data[j.out, , drop = FALSE], type = "response")
3: mean((y - yhat)^2)
2: cost(glm.y[j.out], predict(d.glm, data[j.out, , drop = FALSE],
type = "response"))
1: cv.glm(d, m, K = 2)
And looking at the cv.glm function gives:
> cv.glm
function (data, glmfit, cost = function(y, yhat) mean((y - yhat)^2),
K = n)
{
call <- match.call()
if (!exists(".Random.seed", envir = .GlobalEnv, inherits = FALSE))
runif(1)
seed <- get(".Random.seed", envir = .GlobalEnv, inherits = FALSE)
n <- nrow(data)
out <- NULL
if ((K > n) || (K <= 1))
stop("'K' outside allowable range")
K.o <- K
K <- round(K)
kvals <- unique(round(n/(1L:floor(n/2))))
temp <- abs(kvals - K)
if (!any(temp == 0))
K <- kvals[temp == min(temp)][1L]
if (K != K.o)
warning(gettextf("'K' has been set to %f", K), domain = NA)
f <- ceiling(n/K)
s <- sample0(rep(1L:K, f), n)
n.s <- table(s)
glm.y <- glmfit$y
cost.0 <- cost(glm.y, fitted(glmfit))
ms <- max(s)
CV <- 0
Call <- glmfit$call
for (i in seq_len(ms)) {
j.out <- seq_len(n)[(s == i)]
j.in <- seq_len(n)[(s != i)]
Call$data <- data[j.in, , drop = FALSE]
d.glm <- eval.parent(Call)
p.alpha <- n.s[i]/n
cost.i <- cost(glm.y[j.out], predict(d.glm, data[j.out,
, drop = FALSE], type = "response"))
CV <- CV + p.alpha * cost.i
cost.0 <- cost.0 - p.alpha * cost(glm.y, predict(d.glm,
data, type = "response"))
}
list(call = call, K = K, delta = as.numeric(c(CV, CV + cost.0)),
seed = seed)
}
It seems the problem has to do with your extremely small sample size and categorical effect (with values "A", "B", and "C"). You are fitting a glm with 2 effects: "B:A" and "C:A". In each CV iteration you bootstrap from the sample dataset and fit a new model d.glm. Given the size, the bootstrapped data are guaranteed to come up with 1 or more iteration in which the value "C" is not sampled, hence the error comes from calculating fitted probabilities from the bootstrap model from the training data in which validation data has a "C" level for x not observed in the training data.
Frank Harrell (often on stats.stackexchange.com) wrote in Regression Modelling Strategies that one ought to favor against split sample validation when sample size is small and/or some cell counts are small in categorical data analysis. Singularity (as you are seeing here) is one of many reasons why I think this is true.
Given the small sample size here, you should consider some split sample cross validation alternatives like a permutation test, or a parametric bootstrap. Another important consideration is exactly why you feel model based inference isn't correct. As Tukey said of the bootstrap, he'd like to call it a shotgun. It will blow the head off of any problem, as long as you're willing to reassemble the pieces.
There don't seem to be many simple solutions around the web so here's one I worked out that should be easy to generalize to as many factors as you need. It uses pre-installed packages and Caret but you could get away with just base R if you really wanted.
To use cross-validation when you have multiple factors follow a two-step process. Convert the factors to numerics and then multiply them together. Use this new variable as the target variable in a stratified sampling function. Be sure to remove it or keep it out of your training set after creating your folds.
If y is your DV and x is a factor then:
#Simulated factors (which are conveniently distributed for the example)
dataset <-data.frame(x=as.factor(rep(c(1,10),1000)),y=as.factor(rep(c(1,2,3,4),250)[sample(1000)]))
#Convert the factors to numerics and multiply together in new variable
dataset$cv.variable <-as.numeric(levels(dataset$x))[dataset$x]*as.numeric(levels(dataset$y))[dataset$y]
prop.table(table(dataset$y)) #One way to view distribution of levels
ftable(dataset$x,dataset$y) #A full table of all x and y combinations
folds <- caret::createFolds(dataset$cv.variable,k=10)
testIndexes <- folds[[k]]
testData <- as.data.frame(dataset[testIndexes, ])
trainData <- as.data.frame(dataset[-testIndexes, ])
prop.table(table(testData$y))
ftable(testData$x,testData$y) #evaluate distribution
which should produce a result that is close to balanced.
Note: In real life, if your sample lacks the requisite unique combinations of factors then your problem is harder overcome and might be impossible. You can either drop some levels from consideration before creating folds or employ some kind of over-sampling.