clusterboot function in the fpc package - R

I have a dataset of various measurements of eggs and coloration patterns etc.
I want to group these into clusters. I have used hierarchical clustering on the dataset, but I haven't found a good way to verify or validate the clusters.
I've heard discussion of cluster stability, and I want to use something like the clusterboot function in the fpc package. For some reason I can't get it to work though. I was wondering if there is anyone on here who has experience with this function.
Here is the code I was using:
dMOFF.2007<-dist(MOFF.2007)
cf1<-clusterboot(MOFF.2007,B=3,bootmethod=boot,bscompare=TRUE,multipleboot=TRUE,clustermethod=hclust)
I'm just starting to understand what all of this means. I have experience with R but not with this specific function or much with cluster analyses.
I get this error:
Error in if (is.na(n) || n > 65536L) stop("size cannot be NA nor exceed 65536") :
missing value where TRUE/FALSE needed
Any thoughts? What am I doing wrong?

Just came across this because I'm working with clusterboot too. Are you still stuck on this? I have two basic thoughts: 1) wouldn't you want to pass the distance matrix (dMOFF.2007) to clusterboot instead of the raw data (MOFF.2007)? 2) for the clustermethod argument, I believe it should be hclustCBI, not hclust. Hope you've got it working.
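Building on that answer, a minimal call might look like the sketch below. With hclustCBI you can pass the raw data, because the interface function computes the distance matrix itself; to cluster a precomputed dist object instead, pass it with distances = TRUE and use disthclustCBI. Note that method = "ward.D2", k = 3, and B = 100 here are placeholder assumptions, not values from the question:

```r
library(fpc)

# clustermethod must be an interface function such as hclustCBI (not
# hclust itself), and bootmethod must be a character string.
cf1 <- clusterboot(MOFF.2007,
                   B = 100,                 # 3 resamples is far too few
                   bootmethod = "boot",
                   bscompare = TRUE,
                   multipleboot = TRUE,
                   clustermethod = hclustCBI,
                   method = "ward.D2",      # linkage: your choice
                   k = 3)                   # number of clusters: your choice
print(cf1)  # per-cluster mean Jaccard stabilities
```

Clusterwise mean Jaccard values above roughly 0.75 are usually read as indicating stable clusters.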

Related

Using PERMANOVA in R to analyse the effect of 3 independent variables on reef systems

I am trying to understand how to run a PERMANOVA using adonis2 in R to analyse some data that I have collected. I have been looking online but, as often happens, the explanations are a bit convoluted, so I am asking for your help. I have some fish and coral groups as columns, as well as 3 independent variables (reef age, depth, and material). (Snapshot of my dataset structure omitted.)
I think I have understood that p-values are not the only important part of the output, and that the R2 values indicate how much each variable contributes to the model. Is there something wrong, or something I am missing, here? I also think I understood that I should check for homogeneity of variance, but I have not understood whether I should check it on each variable independently or include them all in the same bit of code (which does not seem to work). Here is the code I am using to run the PERMANOVA (1), and the code I am trying to use to assess homogeneity of variance, which does not work (2).
(1) adonis2(species ~ Age + Material + Depth, data = data.var, by = "margin")
'species' is the subset of the dataset including all the species counts, while 'data.var' is the subset including the 3 independent variables. Also, what is the difference between using '+' and '*' in the formula? When I use '*' I get 'Error in qr.X(object$CCA$QR) :
need larger value of 'ncol' as pivoting occurred'. What does this mean?
(2) variance.check <- betadisper(species.distance, data.var, type = "centroid", bias.adjust = FALSE)
'species.distance' is a matrix calculated with 'vegdist' using the Bray-Curtis method. I used 'data.var' to check variance on all 3 independent variables, but it does not work, while it does work if I check them independently (3). Why is that?
(3) variance.check <- betadisper(species.distance, data$Depth, type = "centroid", bias.adjust = FALSE)
Thank you in advance for your responses and your help. It will really help me get my head around this (and sorry for the many questions).
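On question (2): betadisper() expects a single grouping factor, not a data frame of several variables, which would explain why passing data.var fails while a single column works. One hedged sketch (assuming vegan and the object names from the question) is to collapse the three factors into one grouping with interaction():

```r
library(vegan)

# betadisper() takes one grouping vector. To test dispersion across all
# three variables jointly, combine them into a single factor first:
group.all <- interaction(data.var$Age, data.var$Material, data.var$Depth)
variance.check <- betadisper(species.distance, group.all,
                             type = "centroid", bias.adjust = FALSE)
anova(variance.check)      # test of homogeneity of dispersions
permutest(variance.check)  # permutation-based alternative
```

Whether a joint grouping or per-variable checks are more appropriate depends on the design; checking each variable separately, as in (3), is also a common approach.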

"system is computationally singular" error when I use 'winsorize'

I am trying to winsorize my dataset to remove some outliers, using the robustHD package. This is the first time I have run into this error. The dataset contains 50+ variables and 100+ observations.
How can I fix this? And why does matrix singularity matter for a calculation like winsorization? Thanks.
df_win<-winsorize(df,prob=0.95)
Error in solve.default(R) : system is computationally singular: reciprocal condition number = 1.26103e-18
The reason for this is that winsorize in robustHD uses solve. If you look into the code, winsorize on a data frame dispatches to the winsorize.data.frame method, which simply runs as.matrix and then calls the winsorize.matrix method. That in turn does a number of things, but the problem here is that it uses the solve function.
The error you get comes from solve. It probably occurs because you included some variables/columns that are very highly correlated, or rather, that are linear combinations of each other. You may want to check whether you have duplicated variables, or variables that are transformations of each other.
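A quick way to confirm this diagnosis: the rank of the numeric data matrix should equal the number of columns, and a lower rank means some columns are linear combinations of others. A small self-contained illustration:

```r
set.seed(42)
# Toy data: column c is an exact linear combination of a and b, which
# makes the matrix rank-deficient -- the situation that breaks solve().
df <- data.frame(a = rnorm(100), b = rnorm(100))
df$c <- 2 * df$a - df$b
m <- as.matrix(df)
qr(m)$rank  # 2, not ncol(m) = 3, so one column is redundant
```

On your own data, compare qr(m)$rank with ncol(m) to see how many redundant columns you need to drop.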
There are several things you can do:
Remove one of the highly correlated variables and try again.
Check out a different package to use winsorize from.
Write your own winsorize function.
The quickest way to do the second option:
require(sos)
findFn("winsorize")
This will produce an overview of all functions that have the word "winsorize" in their description. Just look for functions described as being used for winsorization.
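For the last option, winsorization per column needs no matrix algebra at all. A minimal base-R sketch (clipping each column at its 5th and 95th percentiles; the cut-offs are an assumption you can change):

```r
# Clip a vector at its lower/upper quantiles; no covariance matrix is
# involved, so singularity cannot occur.
winsorize_simple <- function(x, probs = c(0.05, 0.95)) {
  q <- quantile(x, probs = probs, na.rm = TRUE)
  pmin(pmax(x, q[1]), q[2])
}

# Toy data frame with one extreme value per column:
df <- data.frame(a = c(1:9, 100), b = c(-50, 2:10))
df_win <- as.data.frame(lapply(df, winsorize_simple))
```

Note this clips each column independently, unlike robustHD's multivariate approach, which is exactly why it sidesteps the solve() error.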

Multiple predictors with the smbinning package

This might not be the right place to ask, but I'm not sure where else to ask it. I'm trying to use the smbinning package; in particular, I'm trying to bin by multiple predictor variables. The issue is that all the examples in the package documentation deal with only one predictor variable. I tried this naively:
result=smbinning(df=training,y="FlagGB",x=".,",p=.05)
which seemed to execute okay, but then if I tried to run result$ivtable I got the error
Error in result$ivtable : $ operator is invalid for atomic vectors
Does anyone know a) how to get smbinning to accept multiple predictors, or, if it can't, another package that can; b) how to resolve the specific error listed above?
I have solved the problem. It may be because training is not a data frame; you have to convert it with as.data.frame(training). If you look at the smbinning code (https://github.com/cran/smbinning/blob/master/R/smbinning.R#L490), there is this block:
i=which(names(df)==y) # Find Column for dependant
j=which(names(df)==x) # Find Column for independant
if (!is.numeric(df[,i]))
{
return("Target (y) not found or it is not numeric")
}
Secondly, the y variable FlagGB must be numeric. If your y variable is a factor, you have to convert it with as.numeric(as.character(y)), not as.numeric() directly.
The problem is similar to "Target (y) not found or it is not numeric" - Package smbinning - R.
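On the as.numeric(as.character(y)) point: calling as.numeric() directly on a factor returns the internal level codes rather than the labels, which silently corrupts a 0/1 target. A small illustration:

```r
f <- factor(c("0", "1", "1"))
as.numeric(f)                # internal level codes (1-based), wrong
as.numeric(as.character(f))  # the intended 0/1 values
```

Going through as.character() first recovers the labels, which is what a binary target needs.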
Have you looked into the "Information" package? It seems to do the job, but there is no facility to recode the variable, or if there is one, I haven't been able to find it. Otherwise, it is a really great package for exploration and analysis of variables.
To answer b): print result and you will (most probably) see that the function in fact did not execute, for the specific reason you get in return.
Indeed, it is a bit confusing that the smbinning package returns its errors silently and within the variable itself.
Question a), on the other hand, is hard to answer without looking at the data. You can try to cross/multiply your variables, but that may result in a very large number of factor levels. I would suggest applying the smbinning package to group each of your characteristics into a few groups and then trying to cross the groups.
For question a), you can use smbinning.sumiv, which calculates IV for all variables in one step:
sumivt=smbinning.sumiv(chileancredit.train,y="FlagGB")
sumivt # Display table with IV by characteristic

Sparse data clustering for extremely large dataset

I have tried using:
sparse k-means (KMeansSparseCluster) from the sparcl package (out-of-memory error);
bigkmeans from the biganalytics package (a weird error I couldn't find anything about online: Error in duplicated.default(centers[[length(centers)]]) :
duplicated() applies only to vectors);
skmeans from the skmeans package (similar results to kmeans);
but I am still not able to get proper clustering for my sparse data. The clusters are not well defined and have overlapping membership for the most part. Am I missing something in terms of handling sparse data?
What kind of pre-processing is suggested for the data? Should missing values be marked -1 instead of 0 for a clear distinction? Please feel free to ask for more details if you have any ideas that may help.

what could be the best tool or package to perform PCA on very large datasets?

This might seem similar to the question asked at this URL (Apply PCA on very large sparse matrix), but I am still not able to get my answer, so I need some help. I am trying to perform a PCA on a very large dataset of about 700 samples (columns) and > 400,000 loci (rows). I wish to plot the samples in the biplot, and hence want to use all 400,000 loci to calculate the principal components.
I did try using princomp(), but I get the following error:
Error in princomp.default(transposed.data, cor = TRUE) :
'princomp' can only be used with more units than variables
I checked the forums and saw that when there are fewer units than variables it is better to use prcomp() than princomp(), so I tried that as well, but I again get the following error:
Error in cor(transposed.data) : allocMatrix: too many elements specified
So I want to know if any of you could suggest another option that would be well suited to my very large data. I am a beginner in statistics, but I did read about how PCA works. Are there any other easy-to-use R packages or tools to perform this?
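As a starting point: prcomp() performs an SVD of the data matrix directly and never forms the p x p correlation matrix, so the cor() allocation error can be avoided by letting prcomp() do the standardization via scale. = TRUE. A self-contained sketch with a toy stand-in for transposed.data (for 400,000 loci you may still prefer a truncated SVD, e.g. prcomp_irlba() from the irlba package, which computes only the few PCs you actually plot):

```r
set.seed(1)
# Toy stand-in for transposed.data: 10 samples (rows) x 200 loci (columns)
transposed.data <- matrix(rnorm(10 * 200), nrow = 10)

# scale. = TRUE standardizes each locus without ever building cor(data)
pca <- prcomp(transposed.data, center = TRUE, scale. = TRUE)
head(pca$x[, 1:2])  # sample coordinates on PC1/PC2 for the biplot
```

With rows = samples, having far fewer samples than loci is fine here; pca$x has one row per sample, which is exactly what is needed to plot samples in the biplot.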
