Kappa Statistic Extremely Large/Sparse matrix - r

I have a large sparseMatrix (mat):
138493 x 17694 sparse Matrix of class "dgCMatrix", with 10000132 entries
I want to investigate inter-rater agreement using kappa statistics, but when I run Fleiss' kappa:
kappam.fleiss(mat)
I am shown the following error
Error in asMethod(object) :
Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105
Is this due to my matrix being too large?
Are there any other methods I can use to calculate kappa statistics for IRR on a matrix this large?

The best answer that I can offer is that this is not really possible due to the extreme sparsity in your matrix. The problem: with 10,000,132 entries in a 138,493 x 17,694 = 2,450,495,142 cell matrix, about 99.59% of the values are missing. The irr package allows for missing values, but here you are placing extreme demands on the system by asking it to compare ratings for users whose films do not overlap.
This is compounded by the fact that the methods in the irr package a) require dense matrices as input, and b) (at least in kripp.alpha()) loop over columns, making them very slow.
Here is an illustration constructing a matrix similar in nature to yours (but with no pattern - in reality your situation will be better because viewers tend to rate similar sets of movies).
Note that I used Krippendorff's alpha here, since it allows for ordinal or interval ratings (as your data suggests), and normally handles missing data fine.
require(Matrix)
require(irr)
seed <- 100 # note: never passed to set.seed(), so the sampled matrix differs between runs
(sparseness <- 1 - 10000132 / (138493 * 17694))
## [1] 0.9959191
138493 / 17694 # ratio of users to movies
## [1] 7.827117
# nmovies <- 17694
# nusers <- 138493
nmovies <- 100
nusers <- 783
raterMatrix <-
Matrix(sample(c(NA, seq(0, 5, by = .5)), nmovies * nusers, replace = TRUE,
prob = c(sparseness, rep((1-sparseness)/11, 11))),
nrow = nmovies, ncol = nusers)
kripp.alpha(t(as.matrix(raterMatrix)), method = "interval")
## Krippendorff's alpha
##
## Subjects = 100
## Raters = 783
## alpha = -0.0237
This worked for a matrix of that size. If I increase it 100x (10x on each dimension), keeping the same proportions as in your reported dataset, it fails to produce an answer even after 30 minutes, so I killed the process.
What to conclude: you are not really asking the right question of this data. It's not an issue of how many users agreed, but probably of what sort of dimensions exist in this data in terms of clusters of viewing and clusters of preferences. You probably want to use association rules or some dimensionality reduction methods that don't balk at the sparsity in your dataset.
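To see how little the raters overlap at this level of sparsity, one can count the co-rated movies for every pair of users. This is only a sketch on simulated data of the same small size as above; the names rated, co_rated and pair_overlap are made up for illustration, and the rating pattern is assumed to be purely random (real ratings would overlap more).

```r
# Sketch: how often do two users rate the same movie at ~99.6% sparsity?
set.seed(42)
n_movies <- 100
n_users  <- 783
sparseness <- 1 - 10000132 / (138493 * 17694)

# TRUE where a (movie, user) cell holds a rating; random pattern (assumption)
rated <- matrix(runif(n_movies * n_users) > sparseness,
                nrow = n_movies, ncol = n_users)

# users x users matrix: entry [i, j] = number of movies rated by both i and j
co_rated <- crossprod(rated * 1)

# overlap counts for each distinct pair of users
pair_overlap <- co_rated[upper.tri(co_rated)]
mean(pair_overlap == 0)   # share of user pairs with no movie in common
```

If nearly all pairs share no movie, an agreement coefficient has almost nothing to compare, whatever package computes it.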

Related

How to perform k-means clustering in R

I am trying to explore a credit card fraud dataset to learn R and k-means clustering, but I ran into an issue while looking for the optimal number of clusters. Unfortunately, I could not find much online about this warning or about how to perform k-means clustering in R. What is the warning about, and why does the result show only 1 cluster? Thanks in advance!
Code:
data = read.csv("creditcard.csv")
scaled_data <- scale(data )
wss <- (nrow(scaled_data)-1)*sum(apply(scaled_data,2,var))
for (i in 2:100) wss[i] <- sum(kmeans(scaled_data, centers=i)$withiness)
plot(1:100, wss, type='b', xlab="Clusters", ylab="WSS")
Warning:
Warning messages:
1: Quick-TRANSfer stage steps exceeded maximum (= 14240350)
2: did not converge in 10 iterations
3: Quick-TRANSfer stage steps exceeded maximum (= 14240350)
4: did not converge in 10 iterations
You have several issues with your code. Let's go through it using an example data set available on R since you did not provide reproducible data:
data(iris)
scaled_iris <- scale(iris[, -5])
Since the data have been scaled, all of the variances are 1 so this is all you need to compute the total:
wss <- sum(colSums(scaled_iris^2))
wss
# [1] 596
Now the clustering. I'll include the argument that @mhovd mentions with its default value (there is no argument for convergence). If you get the warning, increase iter.max= to 15 or 20 or more. This does not guarantee that your results for any number of groups are optimal. To increase the chances of that, you should use the nstart= argument and set a value of 5 or more:
for (i in 2:100) wss[i] <- kmeans(scaled_iris, centers=i, iter.max=10)$tot.withinss
head(wss);tail(wss)
# [1] 596.00000 220.87929 138.88836 113.97017 104.98669 81.03783
# [1] 3.188483 2.688470 2.716485 2.535701 2.497792 2.116150
plot(wss, type='b', xlab="Clusters", ylab="WSS")
Note you misspelled withinss and you did not realize that kmeans returns their sum as tot.withinss. It is always good to read the manual page ?kmeans. Note that you do not need 1:100 since the plot function will automatically supply consecutive integers if you provide only one vector.
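As a sketch of the nstart= and iter.max= advice above, the loop can be written with several random starts per value of k. The values 10 and 20 and the range 1 to 10 are arbitrary choices for illustration, not recommendations from the answer.

```r
# Sketch: within-cluster sum of squares for k = 1..10 with multiple starts
set.seed(1)                              # kmeans picks random starting centers
scaled_iris <- scale(iris[, -5])

ks  <- 1:10
wss <- numeric(length(ks))
wss[1] <- sum(scaled_iris^2)             # k = 1: total sum of squares

for (k in ks[-1]) {
  # nstart = 10 keeps the best of 10 random initializations,
  # iter.max = 20 allows more iterations than the default of 10
  fit <- kmeans(scaled_iris, centers = k, nstart = 10, iter.max = 20)
  wss[k] <- fit$tot.withinss
}
round(wss, 2)
```

Calling plot(ks, wss, type = "b") on the result gives the usual elbow plot.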

Error in NbClust: not enough objects to cluster

I am trying to use the NbClust method in R to determine the best number of clusters in a cluster analysis following the approach in the book from Manning.
However, I get an error message saying:
Error in hclust(md, method = "average"): must have n >= 2 objects to cluster.
This happens even though the hclust method itself appears to work. I therefore assume that the problem is (as the error message also states) that NbClust tries to create groups with only one object inside.
Here is my code:
mydata = read.table("PLR_2016_WM_55_5_Familienstand_aufbereitet.csv", skip = 0, sep = ";", header = TRUE)
mydata <- mydata[-1] # Drop the first column (int)
data.transformed <- t(mydata) # Transpose the matrix
data.scale <- scale(data.transformed) # Scale the table
data.dist <- dist(data.scale) # Calculate distances between points
fit.average <- hclust(data.dist, method = "average")
plot(fit.average, hang = -1, cex = .8, main = "Average Linkage Clustering")
library(NbClust)
nc <- NbClust(data.scale, distance="euclidean",
min.nc=2, max.nc=15, method="average")
I found a similar problem here, but I was not able to adapt the code.
There are some problems in your dataset.
The last 4 rows do not contain data and must be deleted.
mydata <- read.table("PLR_2016_WM_55_5_Familienstand_aufbereitet.csv", skip = 0, sep = ";", header = TRUE)
mydata <- mydata[1:(nrow(mydata)-4),]
mydata[,1] <- as.numeric(mydata[,1])
Now rescale the dataset:
data.transformed <- t(mydata) # Transpose the matrix
data.scale <- scale(data.transformed) # Scale the table
data.scale is not a full rank matrix, because centering each column makes the 72 rows sum to the zero vector, so its rank is at most 71:
dim(data.scale)
# [1] 72 447
qr(data.scale)$rank
# [1] 71
Hence, we delete a row from data.scale and transpose it:
data.scale <- t(data.scale[-72,])
Now the dataset is ready for NbClust.
library(NbClust)
nc <- NbClust(data=data.scale, distance="euclidean",
min.nc=2, max.nc=15, method="average")
The output is
[1] "Frey index : No clustering structure in this data set"
*** : The Hubert index is a graphical method of determining the number of clusters.
In the plot of Hubert index, we seek a significant knee that corresponds to a
significant increase of the value of the measure i.e the significant peak in Hubert
index second differences plot.
*** : The D index is a graphical method of determining the number of clusters.
In the plot of D index, we seek a significant knee (the significant peak in Dindex
second differences plot) that corresponds to a significant increase of the value of
the measure.
*******************************************************************
* Among all indices:
* 8 proposed 2 as the best number of clusters
* 4 proposed 3 as the best number of clusters
* 8 proposed 4 as the best number of clusters
* 1 proposed 5 as the best number of clusters
* 1 proposed 8 as the best number of clusters
* 1 proposed 11 as the best number of clusters
***** Conclusion *****
* According to the majority rule, the best number of clusters is 2
*******************************************************************
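The rank deficiency seen in the answer above can be reproduced on made-up data: column-centering (which scale() does by default) makes every column sum to zero, so the rows become linearly dependent. A small sketch, with a random 6 x 20 matrix standing in for the real dataset:

```r
# Sketch: scale() centers each column, which costs one rank
set.seed(10)
m <- matrix(rnorm(6 * 20), nrow = 6, ncol = 20)   # hypothetical stand-in data

qr(m)$rank                    # rank before scaling

m_scaled <- scale(m)          # center and scale each column
qr(m_scaled)$rank             # at most nrow(m) - 1 after centering

# every column of the scaled matrix sums to (numerically) zero
max(abs(colSums(m_scaled)))
```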

K-means: Initial centers are not distinct

I am using the GA Package and my aim is to find the optimal initial centroids positions for k-means clustering algorithm. My data is a sparse-matrix of words in TF-IDF score and is downloadable here. Below are some of the stages I have implemented:
0. Libraries and dataset
library(clusterSim) ## for index.DB()
library(GA) ## for ga()
corpus <- read.csv("Corpus_EnglishMalay_tfidf.csv") ## a dataset of 5000 x 1168
1. Binary encoding and generate initial population.
k_min <- 15
initial_population <- function(object) {
## generate a population to turn-on 15 cluster bits
init <- t(replicate(object@popSize, sample(rep(c(1, 0), c(k_min, object@nBits - k_min))), TRUE))
return(init)
}
2. Fitness function that minimizes the Davies-Bouldin (DB) index, where I evaluate the DBI for each solution generated from initial_population.
DBI2 <- function(x) {
## x is a vector of solution of nBits
## exclude first column of corpus
initial_centroid <- corpus[x==1, -1]
cl <- kmeans(corpus[-1], initial_centroid)
dbi <- index.DB(corpus[-1], cl=cl$cluster, centrotypes = "centroids")
score <- -dbi$DB
return(score)
}
3. Running GA. With these settings.
g2<- ga(type = "binary",
fitness = DBI2,
population = initial_population,
selection = ga_rwSelection,
crossover = gabin_spCrossover,
pcrossover = 0.8,
pmutation = 0.1,
popSize = 100,
nBits = nrow(corpus),
seed = 123)
4. The problem. Error in kmeans(corpus[-1], initial_centroid) : initial centers are not distinct.
I found a similar problem here, where the user also had to use a parameter to dynamically pass in the number of clusters. It was solved by hard-coding the number of clusters. In my case, however, I really need to pass in the number of clusters dynamically, since it comes from a randomly generated binary vector in which the 1's represent the initial centroids.
Checking with the kmeans() code, I noticed that the error is caused by duplicated centers:
if(any(duplicated(centers)))
stop("initial centers are not distinct")
I edited the kmeans function with trace to print out the duplicated centers. The output:
[1] "206" "520" "564" "1803" "2059" "2163" "2652" "2702" "3195" "3206" "3254" "3362" "3375"
[14] "4063" "4186"
Which shows no duplication in the randomly selected initial_centroids and I have no idea why this error keeps occurring. Is there anything else that would lead to this error?
P/S: I do understand some may suggest GA + K-means is not a good idea. But I do hope to finish what I have started. It is better to view this problem as a K-means problem (well at least in solving the initial centers are not distinct error).
Genetic algorithms are not well suited for optimizing k-means by the nature of the problem: initialization seeds interact too much, so ga will not be better than taking a random sample of all possible seeds.
So my main advice is to not use genetic algorithms at all here!
If you insist, what you would need to do is detect the bad parameters, then simply return a bad score for bad initialization so they don't "survive".
To answer your question just do:
any(corpus[520, -1] != corpus[564, -1])
Rows 520 and 564 of corpus are the same; the only difference is in the row.names attribute, see:
identical(colnames(corpus[520, -1]), colnames(corpus[564, -1])) # just to be sure
rownames(corpus[520, -1])
rownames(corpus[564, -1])
Regarding the GA and k-means, see e.g.:
Bashar Al-Shboul, Myaeng Sung-Hyon, "Initializing K-Means using Genetic Algorithms", World Academy of Science, Engineering & Technology, Jun 2009, Issue 30, p. 114 (especially section II B); or
Bain Khusul Khotimah, Firli Irhamni, and Tri Sundarwati, "A Genetic Algorithm for Optimized Initial Centers K-Means Clustering in SMEs", Journal of Theoretical and Applied Information Technology, 2016, Vol. 90, No. 1
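If you do keep the GA, the "return a bad score for bad initializations" idea could look roughly like the sketch below. Everything here is hypothetical: safe_fitness, the penalty value and the toy matrix are made up, and the negative total within-cluster sum of squares is only a stand-in for the Davies-Bouldin score so that the example runs with base R alone.

```r
# Sketch: penalize candidate solutions whose selected centers are duplicated
safe_fitness <- function(x, data, penalty = -1e6) {
  centers <- data[x == 1, , drop = FALSE]
  # kmeans() stops on duplicated centers, so catch them before calling it
  if (nrow(centers) < 2 || any(duplicated(centers))) {
    return(penalty)
  }
  cl <- kmeans(data, centers = centers)
  -cl$tot.withinss              # stand-in score (assumption), higher is better
}

# Toy data in which rows 2 and 5 are identical, like rows 520 and 564 above
set.seed(7)
toy <- matrix(rnorm(40), nrow = 10)
toy[5, ] <- toy[2, ]

x_bad <- as.integer(seq_len(10) %in% c(2, 5, 8))   # selects the duplicate pair
x_ok  <- as.integer(seq_len(10) %in% c(1, 2, 8))   # three distinct rows

score_bad <- safe_fitness(x_bad, toy)
score_ok  <- safe_fitness(x_ok, toy)
c(bad = score_bad, ok = score_ok)
```

Because the penalized solutions score far below any valid one, selection removes them from the population over time.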

R - warning for dissimilarity calculation, clustering with numeric matrix

Reproducible data:
Data <- data.frame(
X = sample(c(0,1), 10, replace = TRUE),
Y = sample(c(0,1), 10, replace = TRUE),
Z = sample(c(0,1), 10, replace = TRUE)
)
Convert dataframe to matrix
Matrix_from_Data <- data.matrix(Data)
Check the structure
str(Matrix_from_Data)
num [1:10, 1:3] 1 0 0 1 0 1 0 1 1 1 ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:3] "X" "Y" "Z"
The question:
I have a data frame of binary, symmetric variables (larger than the example), and I'd like to do some hierarchical clustering, which I've never tried before. There are no missing or NA values.
I convert the dataframe into a matrix before attempting to run the daisy function from the 'cluster' package, to get the dissimilarity matrix. I'd like to explore the options for calculating different dissimilarity metrics, but am running into a warning (not an error):
library(cluster)
Dissim_Euc_Matrix_from_Data <- daisy(Matrix_from_Data, metric = "euclidean", type = list(symm =c(1:ncol(Matrix_from_Data))))
Warning message:
In daisy(Matrix_from_Data, metric = "euclidean", type = list(symm = c(1:ncol(Matrix_from_Data)))) :
with mixed variables, metric "gower" is used automatically
...which seems weird to me, since "Matrix_from_Data" is all numeric variables, not mixed variables. Gower might be a fine metric, but I'd like to see how the others impact the clustering.
What am I missing?
Great question.
First, that message is a Warning and not an Error. I'm not personally familiar with daisy, but my ignorant guess is that this particular warning pops up whenever you run the function this way, without any check of whether the warning is actually relevant.
Regardless of why that warning appears, one simple way to compare the clustering done by several different distance measures in hierarchical clustering is to plot the dendrograms. For simplicity, let's compare the "euclidean" and "binary" distance metrics programmed into dist. You can use ?dist to read up on what the "binary" distance means here.
# When generating random data, always set a seed if you want your data to be reproducible
set.seed(1)
Data <- data.frame(
X = sample(c(0,1), 10, replace = TRUE),
Y = sample(c(0,1), 10, replace = TRUE),
Z = sample(c(0,1), 10, replace = TRUE)
)
# Create distance matrices
mat_euc <- dist(Data, method="euclidean")
mat_bin <- dist(Data, method="binary")
# Plot the dendrograms side-by-side
par(mfrow=c(1,2))
plot(hclust(mat_euc))
plot(hclust(mat_bin))
I generally read dendrograms from the bottom up, since points lower on the vertical axis are more similar (i.e. less distant) to one another than points higher on the vertical axis.
We can pick up a few things from these plots:
4/6, 5/10, and 7/8 are grouped together using both metrics. We should hope this is true if the rows are identical :)
3 is most strongly associated with 7/8 for both distance metrics, although the degree of association is a bit stronger in the binary distance as opposed to the Euclidean distance.
1, 2, and 9 have some notably different relationships between the two distance metrics (e.g. 1 is most strongly associated with 2 in Euclidean distance but with 9 in binary distance). It is in situations like this where the choice of distance metric can have a significant impact on the resulting clusters. At this point it pays to go back to your data and understand why there are differences between the distance metrics for these three points.
Also remember that there are different methods of hierarchical clustering (e.g. complete linkage and single linkage), but you can use this same approach to compare the differences between methods as well. See ?hclust for a complete list of methods provided by hclust.
Hope that helps!
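The same comparison can be made between linkage methods instead of distance metrics. A minimal sketch, assuming the simulated Data from above and an arbitrary cut into k = 3 groups; the cross-table is just one way to see where two linkages disagree:

```r
# Sketch: compare complete and single linkage on the same distance matrix
set.seed(1)
Data <- data.frame(
  X = sample(c(0, 1), 10, replace = TRUE),
  Y = sample(c(0, 1), 10, replace = TRUE),
  Z = sample(c(0, 1), 10, replace = TRUE)
)
d <- dist(Data, method = "euclidean")

hc_complete <- hclust(d, method = "complete")
hc_single   <- hclust(d, method = "single")

# Cut both trees into the same number of groups (k = 3 is an arbitrary choice)
groups_complete <- cutree(hc_complete, k = 3)
groups_single   <- cutree(hc_single, k = 3)

# Rows: complete-linkage groups, columns: single-linkage groups
table(complete = groups_complete, single = groups_single)
```

Passing hc_complete and hc_single to plot() gives the side-by-side dendrograms, as in the example above.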

Chi squared goodness of fit for a geometric distribution

As an assignment I had to develop an algorithm and generate samples for a given geometric distribution with PMF
Using the inverse transform method, I came up with the following expression for generating the values:
Where U represents a value, or n values depending on the size of the sample, drawn from a Unif(0,1) distribution and p is 0.3 as stated in the PMF above.
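For illustration, here is a sketch of an inverse transform sampler of this kind. It assumes the PMF is P(X = x) = p (1 - p)^(x - 1) for x = 1, 2, ..., and that the generating expression is ceiling(log(U) / log(1 - p)), as in the sim.geometric code quoted further down; the sample size and seed are arbitrary.

```r
# Sketch: inverse transform sampling for a geometric distribution on 1, 2, ...
set.seed(2024)
p <- 0.3
u <- runif(1e5)                      # U ~ Unif(0, 1)
x <- ceiling(log(u) / log(1 - p))    # generated geometric values

# Compare empirical frequencies of 1..5 with the theoretical PMF.
# dgeom() counts failures from 0, so value k corresponds to dgeom(k - 1, p).
emp  <- tabulate(x, nbins = 5) / length(x)
theo <- dgeom(0:4, p)
round(cbind(value = 1:5, empirical = emp, theoretical = theo), 4)
```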
I have the algorithm, the implementation in R and I already generated QQ Plots to visually assess the adjustment of the empirical values to the theoretical ones (generated with R), i.e., if the generated sample follows indeed the geometric distribution.
Now I wanted to submit the generated sample to a goodness of fit test, namely the Chi-square, yet I'm having trouble doing this in R.
[I think this was moved a little hastily, in spite of your response to whuber's question. Before solving the 'how do I write this algorithm in R' problem, it's probably more important to deal with the 'what you're doing is not the best approach to your problem' issue (which certainly belongs where you posted it). Since it's here, I will deal with the 'doing it in R' aspect, but I would urge you to go back and ask about the second question (as a new post).]
Firstly the chi-square test is a little different depending on whether you test
H0: the data come from a geometric distribution with parameter p
or
H0: the data come from a geometric distribution with parameter 0.3
If you want the second, it's quite straightforward. First, with the geometric, if you want to use the chi-square approximation to the distribution of the test statistic, you will need to group adjacent cells in the tail. The 'usual' rule - much too conservative - suggests that you need an expected count in every bin of at least 5.
I'll assume you have a nice large sample size. In that case, you'll have many bins with substantial expected counts and you don't need to worry so much about keeping it so high, but you will still need to choose how you will bin the tail (whether you just choose a single cut-off above which all values are grouped, for example).
I'll proceed as if n were say 1000 (though if you're testing your geometric random number generation, that's pretty low).
First, compute your expected counts:
dgeom(0:20,.3)*1000
[1] 300.0000000 210.0000000 147.0000000 102.9000000 72.0300000 50.4210000
[7] 35.2947000 24.7062900 17.2944030 12.1060821 8.4742575 5.9319802
[13] 4.1523862 2.9066703 2.0346692 1.4242685 0.9969879 0.6978915
[19] 0.4885241 0.3419669 0.2393768
Warning: dgeom and friends go from x=0, not x=1; while you can shift the inputs and outputs to the R functions, it's much easier if you subtract 1 from all your geometric values and test that. I will proceed as if your sample has had 1 subtracted so that it goes from 0.
I'll cut that off at the 15th term (x=14), and group 15+ into its own group (a single group in this case). If you wanted to follow the 'greater than five' rule of thumb, you'd cut it off after the 12th term (x=11). In some cases (such as smaller p), you might want to split the tail across several bins rather than one.
expec <- dgeom(0:14,.3)*1000
expec <- c(expec, 1000-sum(expec))
expec
[1] 300.000000 210.000000 147.000000 102.900000 72.030000 50.421000
[7] 35.294700 24.706290 17.294403 12.106082 8.474257 5.931980
[13] 4.152386 2.906670 2.034669 4.747562
The last cell is the "15+" category. We also need the probabilities.
Now we don't yet have a sample; I'll just generate one:
y <- rgeom(1000,0.3)
but now we want a table of observed counts:
(x <- table(factor(y,levels=0:14),exclude=NULL))
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 <NA>
292 203 150 96 79 59 47 25 16 10 6 7 0 2 5 3
Now you could compute the chi-square directly and then calculate the p-value:
(chisqstat <- sum((x-expec)^2/expec))
[1] 17.76835
(pval <- pchisq(chisqstat,15,lower.tail=FALSE))
[1] 0.2750401
but you can also get R to do it:
chisq.test(x,p=expec/1000)
Chi-squared test for given probabilities
data: x
X-squared = 17.7683, df = 15, p-value = 0.275
Warning message:
In chisq.test(x, p = expec/1000) :
Chi-squared approximation may be incorrect
The case for unspecified p is similar, but (to my knowledge) you can no longer get chisq.test to do it directly. You have to do it the first way: estimate the parameter from the data (by maximum likelihood or minimum chi-square), then test as above, with one fewer degree of freedom because of the estimated parameter.
See the example of doing a chi-square for a Poisson with estimated parameter here; the geometric follows much the same approach as above, with the adjustments as at the link (dealing with the unknown parameter, including the loss of 1 degree of freedom).
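A sketch of the unspecified-p case, under a few assumptions: the sample is simulated here, the cut-off at x = 14 is reused from above, and p is estimated by maximum likelihood from the ungrouped data (strictly, the chi-square reference distribution assumes estimation from the grouped counts, so treat the p-value as approximate).

```r
# Sketch: chi-square goodness of fit for a geometric with estimated p
set.seed(123)
n <- 1000
y <- rgeom(n, 0.3)                   # simulated sample on 0, 1, 2, ...

p_hat <- 1 / (1 + mean(y))           # ML estimate of p for support 0, 1, 2, ...

cutoff <- 14                         # cells 0..14 plus one "15+" cell
obs   <- c(tabulate(y + 1, nbins = cutoff + 1), sum(y > cutoff))
probs <- c(dgeom(0:cutoff, p_hat),
           pgeom(cutoff, p_hat, lower.tail = FALSE))
expec <- n * probs

chisqstat <- sum((obs - expec)^2 / expec)
df   <- length(obs) - 1 - 1          # one extra df lost for the estimated p
pval <- pchisq(chisqstat, df, lower.tail = FALSE)
c(p_hat = p_hat, chisq = chisqstat, df = df, p_value = pval)
```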
Let us assume you've got your randomly-generated variates in a vector x. You can do the following:
x <- rgeom(1000,0.2)
x_tbl <- table(x)
x_val <- as.numeric(names(x_tbl))
x_df <- data.frame(count=as.numeric(x_tbl), value=x_val)
# Expand to fill in "gaps" in the values caused by 0 counts
all_x_val <- data.frame(value = 0:max(x_val))
x_df <- merge(all_x_val, x_df, by="value", all.x=TRUE)
x_df$count[is.na(x_df$count)] <- 0
# Get theoretical probabilities
x_df$eprob <- dgeom(x_df$value, 0.2)
# Chi-square test: once with asymptotic dist'n,
# once with bootstrap evaluation of chi-sq test statistic
chisq.test(x=x_df$count, p=x_df$eprob, rescale.p=TRUE)
chisq.test(x=x_df$count, p=x_df$eprob, rescale.p=TRUE,
simulate.p.value=TRUE, B=10000)
There's a "goodfit" function described as "Goodness-of-fit Tests for Discrete Data" in package "vcd".
library(vcd) # provides goodfit()
G.fit <- goodfit(x, type = "nbinomial", par = list(size = 1))
I was going to use the code you had posted in an earlier question, but it now appears that you have deleted that code. I find that offensive. Are you using this forum to gather homework answers and then defacing it to remove the evidence? (Deleted questions can still be seen by those of us with sufficient rep, and the interface prevents deletion of question with upvoted answers so you should not be able to delete this one.)
Generate a QQ Plot for testing a geometrically distributed sample
--- question---
I have a sample of n elements generated in R with
sim.geometric <- function(nvals)
{
p <- 0.3
u <- runif(nvals)
ceiling(log(u)/log(1-p))
}
for which I want to test the distribution, specifically whether it indeed follows a geometric distribution. I want to generate a QQ plot but have no idea how to.
--------reposted answer----------
A QQ-plot should be a straight line when compared to a "true" sample drawn from a geometric distribution with the same probability parameter. One gives two vectors to the function, which essentially compares their inverse ECDFs at each quantile. (Your attempt is not particularly successful:)
sim.res <- sim.geometric(100)
sim.rgeom <- rgeom(100, 0.3)
qqplot(sim.res, sim.rgeom)
Here I follow the lead of the authors of qqplot's help page (which results in flipping that upper curve around the line of identity):
png("QQ.png")
qqplot(qgeom(ppoints(100),prob=0.3), sim.res,
main = expression("Q-Q plot for" ~~ {G}[n == 100]))
dev.off()
---image not included---
You can add a "line of good fit" by plotting a line through the 25th and 75th percentile points for each distribution. (I added a jittering feature to this to get a better idea where the "probability mass" was located:)
sim.res <- sim.geometric(500)
qqplot(jitter(qgeom(ppoints(500),prob=0.3)), jitter(sim.res),
main = expression("Q-Q plot for" ~~ {G}[n == 500]), ylim=c(0,max( qgeom(ppoints(500),prob=0.3),sim.res )),
xlim=c(0,max( qgeom(ppoints(500),prob=0.3),sim.res )))
qqline(sim.res, distribution = function(p) qgeom(p, 0.3),
prob = c(0.25, 0.75), col = "red")
