I am applying the functions from the flexclust package for hard competitive learning clustering, and I am having trouble with the convergence.
I am using this algorithm because I was looking for a method to perform a weighed clustering, giving different weights to groups of variables. I chose hard competitive learning based on a response for a previous question (Weighted Kmeans R).
I am trying to find the optimal number of clusters, and to do so I am using the function stepFlexclust with the following code:
new("flexclustControl") ## check the default values
fc_control <- new("flexclustControl")
fc_control#iter.max <- 500 ### 500 iterations
fc_control#verbose <- 1 # this will set the verbose to TRUE
fc_control#tolerance <- 0.01
### I want to give more weight to the first 24 variables of the dataframe
my_weights <- rep(c(1, 0.064), c(24, 31))
set.seed(1908)
hardcl <- stepFlexclust(x=df, k=c(7:20), nrep=100, verbose=TRUE,
FUN = cclust, dist = "euclidean", method = "hardcl", weights=my_weights, #Parameters for hard competitive learning
control = fc_control,
multicore=TRUE)
However, the algorithm does not converge, even with 500 iterations. I would appreciate any suggestion. Should I increase the number of iterations? Is this an indicator that something else is not going well, or did I a mistake with the R commands?
Thanks in advance.
Two things that answer my question (as well as a comment on weighted variables for kmeans, or better said, with hard competitive learning):
The weights are for observations (=rows of x), not variables (=columns of x). so using hardcl for weighting variables is wrong.
In hardcl or neural gas you need much more iterations compared to standard k-means: In k-means one iteration uses the complete data set to change the centroids, hard competitive learning and uses only a single observation. In comparison to k-means multiply the number of iterations by your sample size.
Related
I am trying to carry out hierarchical cluster analysis (based on Ward's method) on a large dataset (thousands of records and 13 variables) representing multi-species observations of marine predators, to identify possible significant clusters in species composition.
Each record has date, time etc and presence/absence data (0 / 1) for each species.
I attempted hierarchical clustering with the function pvclust. I transposed the data (pvclust works on transposed tables), then I ran pvclust on the data selecting Jacquard distances (“binary” in R) as a distance measure (suitable for species pres/abs data) and Ward’s method (“ward.D2”). I used “parallel = TRUE” to reduce computation time. However, using a default of nboots= 1000, my computer was not able to finish the computation in hours and finally I got ann error, so I tried with lower nboots (100).
I cannot provide my dataset here, and I do not think it makes sense to provide a small test dataset, as one of the main issues here seems to be the size itself of the dataset. However, I am providing the lines of code I used for the transposition, clustering and plotting:
tdata <- t(data)
cluster <- pvclust(tdata, method.hclust="ward.D2", method.dist="binary",
nboot=100, parallel=TRUE)
plot(cluster, labels=FALSE)
This is the dendrogram I obtained (never mind the confusion at the lower levels due to overlap of branches).
As you can see, the p-values for the higher ramifications of the dendrogram all seem to be 0.
Now, I understand that my data may not be perfect, but I still think there is something wrong with the method I am using, as I would not expect all these values to be zero even with very low significance in the clusters.
So my questions would be
is there anything I got wrong in the pvclust function itself?
may my low nboots (due to “weak” computer) be a reason for the non-significance of my results?
are there other functions in R I could try for hierarchical clustering that also deliver p-values?
Thanks in advance!
.............
I have tried to run the same code on a subset of 500 records with nboots = 1000. This worked in a reasonable computation time, but the output is still not very satisfying - see dendrogram2 .dendrogram obtained for a SUBSET of 500 records and nboots=1000
there i have two regression models ,rf1 and rf2 and i want o find value of variables that allow output of rf1 to be between 20 and 26 and output of rf2 should be inferior to 10 :
i tried grid search but i found nothing,please i you know how to do it with a heuristic (simulated annealing or genetic algorithm) please help me
you can find the code for this example in this repository here
library(randomForest)
model_rf_fines<- readRDS(file = paste0("rf1.rds"))
model_rf_gros<- readRDS(file = paste0("rf2.rds"))
#grid------
grid_input_test = expand.grid(
"Poste" ="P1",
"Qualité" ="BTNBA",
"CPT_2500" =13.83,
"CPT400" = 46.04,
"CPT160" =15.12,
"CPT125" =5.9,
"CPT40"=15.09,
"CPT_40"=4.02,
"retart"=0,
"dure"=0,
'Débit_CV004'=seq(1300,1400,10),
"Dilution_SB002"=seq(334.68,400,10),
"Arrosage_Crible_SC003"=seq(250,300,10),
"Dilution_HP14"=1200,
"Dilution_HP15"=631.1,
"Dilution_HP18"=500,
"Dilution_HP19"=seq(760.47,800,10),
"Pression_PK12"=c(0.59,0.4),
"Pression_PK13"=c(0.8,0.7),
"Pression_PK14"=c(0.8,0.9,0.99,1),
"Pression_PK16"=c(0.5),
"Pression_PK18"=c(0.4,0.5)
)
#levels correction ----
levels(grid_input_test$Qualité) = model_rf_fines$forest$xlevels$Qualité
levels(grid_input_test$Poste) = model_rf_fines$forest$xlevels$Poste
for(i in 1:nrow(grid_input_test)){
#fines
print("----------------------------")
print(i)
print(paste0('Fines :', predict(object = model_rf_fines,newdata = grid_input_test[i,]) ))
#gros
print(paste0('Gros :',predict(object = model_rf_gros,newdata = grid_input_test[i,]) ))
if(predict(object = model_rf_gros,newdata = grid_input_test[i,])<=10){break}
}
any suggestions will be greatly appreciated
thanks.
It might be such variables/input does not exists. If rf1 and rf2 represent two Random Forest models, with say >50 trees, the number of trees will average out spikes/edges of the model.
Similar to the law of large numbers, the more trees in each forest, the more closer output of rf1 and rf2 will be. This is all if indeed rf_ represent random forests both trained on same data, indeed than the more trees the more impossible your input that satisfies the conditions.
Indeed try a naive grid search first, and keep track of minimum value of rf2 while rf1 satisfies your condition. Call this minimum M_grid
If you want to implement simulated annealing, I would start with a simple neighbour scheme, say take a random input variable and vary it a bit. Use python packages for the annealing scheme. If this simple scheme beats your M_grid by quite a bit and you feel you are close to the solution, you can play around with slower cooling schemes, or more complicated neighbour proposals.
Also, the objective for both SA and GA should not be chosen too fast. Probably you want a objective that steers rf1 close to its lowest edge of 20, and rf2 as minium as possible, with maybe a exp() or **3 to reward going down plenty.
I made some assumptions here, maybe wrong. But hope this helps anyway.
I have a large dataset (3.5+ million observations) of a binary response variable that I am trying to compute a Hierarchical GAM with a global smoother with individual effects that have a Shared penalty (e.g. 'GS' in Pedersen et al. 2019). Specifically I am trying to estimate the following structure: Global > Geographic Zone (N=2) > Bioregion (N=20) > Season (N varies by bioregion). In total, I am trying to estimate 36 different nested parameters.
Here is the the code I am currently using:
modGS <- bam(
outbreak ~
te(days_diff,NDVI_mean,bs=c("tp","tp"),k=c(5,5)) +
t2(days_diff, NDVI_mean, Zone, Bioregion, Season, bs=c("tp", "tp","re","re","re"),k=c(5, 5), m=2, full=TRUE) +
s(Latitude,Longitude,k=50),
family=binomial(),select = TRUE,data=dat)
My main issue is that it is taking a long time (5+ days) to construct the model. This nesting structure cannot be discretized, so I cannot compute it in parallel. Further I have tried gamm4 but I ran into memory limit issues. Here is the gamm4 code:
modGS <- gamm4(
outbreak ~
t2(days_diff,NDVI_mean,bs=c("tp","tp"),k=c(5,5)) +
t2(days_diff, NDVI_mean, Zone, Bioregion, Season, bs=c("tp", "tp","re","re","re"),k=c(5, 5), m=2, full=TRUE) +
s(Latitude,Longitude,k=50),
family=binomial(),select = TRUE,data=dat)
What is the best/most computationally feasible way to run this model?
I cut down the computational time by reducing the amount of bioregion levels and randomly sampling ca. 60% of the data. This actually allow me to calculate OOB error for the model.
There is an article I read recently that has a specific section on decreasing computational time. The main things they highlight are:
Use the bam function with it's useful fREML estimation, which refactorizes the model matrix to make calculation faster. Here it seems you have already done that.
Adding the discrete = TRUE argument, which assumes only a smaller finite number of unique values for estimation.
Manipulating nthreads in this function so it runs more than one core in parallel in your computer.
As the authors caution, the second option can reduce the amount of accuracy in your estimates. I fit some large models recently doing this and found that it was not always the same as the default bam function, so its best to use this as a quick inspection rather than the full result you are looking for.
I am running agglomerative clustering on a data set of 130K rows (130K unique keys) and 7 columns, each column ranging from 20 to 2000 unique levels. The data are categorical, specifically alphanumeric codes. At most they can be thought of as factors. I am experimenting with what results I might get from a couple of alternatives to k-modes, including hierarchical clustering and MCA.
My question is, is there any good way to visualize the results up to a certain level with the tree structure?
Standard steps are not a problem:
library{cluster}
Compute Gower distance,
ptm <- proc.time()
gower.dist <- daisy(df[,colnams], metric = c("gower"))
elapsed <- proc.time() - ptm
c(elapsed[3],elapsed[3]/60)
Compute agglomerative clustering object from Gower distance
aggl.clust.c <- hclust(gower.dist, method = "complete")
Now to plotting it. The following line works, but the plot is humanly unreadable
plot(aggl.clust.c, main = "Agglomerative, complete linkages")
Ideally what I am looking for would be something like so (the below is pseudocode that failed on my system)
plot(cutree(aggl.clust.c, k=7), main = "Agglomerative, complete linkages")
I am running R version 3.2.3. That version cannot change (and I don't believe it ought to make a difference for what I am trying to do).
I'd be interested in doing the same in Python, if anyone has good pointers.
I found a useful answer to my question re plotting part of a tree using the as.dendogram() method. Link: http://www.sthda.com/english/wiki/beautiful-dendrogram-visualizations-in-r-5-must-known-methods-unsupervised-machine-learning
I have 150 Experimental substances. 80 characteristics were measured for each of these substances separately. I applied PCA to compute its PCs and determined first three components.Now, I want to apply k-means clustering in R. software (www.R-project.org) with 1000 iterations on low-dimensional data to separate the individuals to their respective populations.
Can anyone see how this can be done? thanks
See adegenet package and try DAPC.
Please, read http://bmcgenet.biomedcentral.com/articles/10.1186/1471-2156-11-94 I think it does what You wish. It is implemented in adegenet R package as DAPC. This implementation is designed for multi locus genotype data, but the principle is very well described, so that You can modify it for Your own data or find something similar.
It performs K-means clustering on PC-transformed ("cleared") data, which significantly speeds up whole calculations. Finally it performs discriminant analysis to get the best clustering. It is very efficient method.
http://www.statmethods.net/advstats/cluster.html Provides nice and easy examples to cluster data.
For your question:
Consider some random normal data and some simple code to fit Kmeans clustering. Note, 3 clusters will be fit to this data (purely arbitrarily).
data = matrix(rnorm(450),ncol=3)
fit = kmeans(data, centers = 3, iter.max = 1000)
cluster.data = data.frame(data, fit$cluster)
Has this answered your question?