cluster::daisy drops labels - r

I am trying to cluster data using cluster::daisy function and dissimilarity matrix. The data looks as shown below.
> head(score_mat_unique_3)
ID_1 ID_2 Score
1: 1000035849 1000532512 2.49e-60
2: 1000035849 1000682765 3.87e-08
3: 1000086994 1000658924 8.90e-18
4: 1000234640 1000535109 1.20e-87
5: 1000235015 1000754236 6.29e-34
6: 1000258002 1000598768 8.34e-36
Score shows how different the objects are (the larger the value, the more different objects.), so I use the dissimilarity matrix and the daisy function.
diss3 <- daisy(score_mat_unique_3, metric = "gower")
But when I try to plot hclust, some numbers are printed instead of ID.
fit3 <- hclust(diss3, method = "ward.D2")
plot(fit3)
Accordingly, information about objects in clusters is lost. How can I return information about the initial IDs and understand which IDs are in the clusters?

You just need to use the labels argument to plot. Since you don't provide your data, I will illustrate with the built-in iris data.
DAT = iris[sample(26),1:4]
fit3 = hclust(dist(DAT))
plot(fit3, labels=LETTERS)
In your case, try plot(fit3, labels=score_mat_unique_3$ID_1)

Related

Clustering leads to very concentrated clusters

To understand my problem, you will need the whole dataset: https://pastebin.com/82paf0G8
Pre-processing: I had a list of orders and 696 unique item numbers, and wanted to cluster them, based on how frequent each pair of items are ordered together. I calculated for each pair of items, number of frequency of occurence within the same order. I.e the highest number of occurrence was 489 between two items. I then "calculated" the similarity/correlation, by: Frequency / "max frequency of all pairs" (489). Now I have the dataset that I have uploaded.
Similarity/correlation: I don't know if my similarity approach is the best in this case. I also tried with something called "Jaccard’s coefficient/index", but get almost same results.
The dataset: The dataset contains material numbers V1 and V2. and N is the correlation between the two material numbers between 0 - 1.
With help from another one, I managed to create a distance matrix and use the PAM clustering.
Why PAM clustering? A data scientist suggest this: You have more than 95% of pairs without information, this makes all these materials are at the same distance and a single cluster very dispersed. This problem can be solved using a PAM algorithm, but still you will have a very concentrated group. Another solution is to increase the weight of the distances other than one.
Problem 1: The matrix is only 567x567. I think for clustering I need the 696x696 full matrix, even though a lot of them are zeros. But i'm not sure.
Problem 2: Clustering does not do very well. I get very concentrated clusters. A lot of items are clustered in the first cluster. Also, according to how you verify PAM clusters, my clustering results are poor. Is it due to the similarity analysis? What else should I use? Is it due to the 95% of data being zeros? Should I change the zeros to something else?
The whole code and results:
#Suppose X is the dataset
df <- data.table(X)
ss <- dcast(rbind(df, df[, .(V1 = V2, V2 = V1, N)]), V1~V2, value.var = "N")[, -1]
ss <- ss/max(ss, na.rm = TRUE)
ss[is.na(ss)] <- 0
diag(ss) <- 1
Now using the PAM clustering
dd2 <- as.dist(1 - sqrt(ss))
pam2 <- pam(dd2, 4)
summary(as.factor(pam2$clustering))
But I get very concentrated clusters, as:
1 2 3 4
382 100 23 62
I'm not sure where you get the 696 number from. After you rbind, you have a dataframe with 567 unique values for V1 and V2, and then you perform the dcast, and end up with a matrix as expected 567 x 567. Clustering wise I see no issue with your clusters.
dim(df) # [1] 7659 3
test <- rbind(df, df[, .(V1 = V2, V2 = V1, N)])
dim(test) # [1] 15318 3
length(unique(test$V1)) # 567
length(unique(test$V2)) # 567
test2 <- dcast(test, V1~V2, value.var = "N")[,-1]
dim(test2) # [1] 567 567
#Mayo, forget what the data scientist said about PAM. Since you've mentioned this work is for a thesis. Then from an academic viewpoint, your current justification to why PAM is required, does not hold any merit. Essentially, you need to either prove or justify why PAM is a necessity for your case study. And given the nature of (continuous) variables in the dataset, V1, V2, N, I do not see the logic on why PAM is applicable here (like I mentioned in the comments, PAM works best for mixed variables).
Continuing further, See this post on correlation detection in R;
# Objective: Detect Highly Correlated variables, visualize them and remove them
data("mtcars")
my_data <- mtcars[, c(1,3,4,5,6,7)]
# print the first 6 rows
head(my_data, 6)
# compute correlation matrix using the cor()
res<- cor(my_data)
round(res, 2) # Unfortunately, the function cor() returns only the correlation coefficients between variables.
# Visualize the correlation
# install.packages("corrplot")
library(corrplot)
corrplot(res, type = "upper", order = "hclust",
tl.col = "black", tl.srt = 45)
# Positive correlations are displayed in blue and negative correlations in red color. Color intensity and the size of the circle are proportional to the correlation coefficients. In the right side of the correlogram, the legend color shows the correlation coefficients and the corresponding colors.
# tl.col (for text label color) and tl.srt (for text label string rotation) are used to change text colors and rotations.
#Apply correlation filter at 0.80,
#install.packages("caret", dependencies = TRUE)
library(caret)
highlyCor <- colnames(my_data)[findCorrelation(res, cutoff = 0.80, verbose = TRUE)]
# show highly correlated variables
highlyCor
[1] "disp" "mpg"
removeHighCor<- findCorrelation(res, cutoff = 0.80) # returns indices of highly correlated variables
# remove highly correlated variables from the dataset
my_data<- my_data[,-removeHighCor]
[1] 32 4
Hope this helps.

Visualization on Cluster for Mixed Data

So, i'm working with fuzzy clustering for Mixed data. Then i want to do Visualization for clustering result.
Here is my data
> head(x)
x1 x2 x3 x4
A C 8.461373 27.62996
B C 10.962334 27.22474
A C 9.452127 27.57246
B D 8.196687 27.29332
A D 8.961367 26.72793
B C 8.009029 27.97227
i followed this step https://www.r-bloggers.com/clustering-mixed-data-types-in-r/
gower_dist <- daisy(x,
metric = "gower")
#type = list(logratio = 1))
tsne_obj <- Rtsne(gower_dist1, dims=2 ,is_distance = TRUE)
tsne_data = data.frame(tsne_obj1$Y, factor(g1$clusters))
colnames(tsne_data1)[3] = "cluster"
ggplot(aes(x = X1, y = X2), data = tsne_data1) +
geom_point(aes(color = cluster))
Based on the website, the first step has transformed the data using Gower distance (i guess), then applying R-tsne.
So My question is :
Is it good using Rtsne for mixed data (as Representative the points)? I have doubt, with Gower distance in the first step, its like force your categorical data to be numeric data.
But one thing that amazed me, my method always give a better result than a classic method based on the plot. so this is important for me to know better about this, can I use the plot as a tool to measure the goodness of clustering result? because based on the plot, it's not difficult to determine which method is better (by plotting clustering result), I give plot images below, I really impressed with it.
Classic Method
My Method

Specify a blocking factor in H2O

In the R version of H2O, is it possible to specify a blocking factor when splitting data in training/validation/test sets and/or when doing cross-validation?
I'm working on a clinical dataset with multiple observations from the same patient that should be kept together during these operations.
If this is not possible to do within the H2O framework then suggestions on how to achieve this in R and integrate with H2O functions would be great.
Thanks!
When using H2O-3 with cross validation, you can tell the training algorithm which fold number an observation belongs to with the fold_column parameter. See:
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/fold_column.html
The code example below (copied from the link above) shows folds being assigned randomly. But you could alternately write a piece of code to assign them specifically yourself.
library(h2o)
h2o.init()
# import the cars dataset:
# this dataset is used to classify whether or not a car is economical based on
# the car's displacement, power, weight, and acceleration, and the year it was made
cars <- h2o.importFile("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
# convert response column to a factor
cars["economy_20mpg"] <- as.factor(cars["economy_20mpg"])
# set the predictor names and the response column name
predictors <- c("displacement","power","weight","acceleration","year")
response <- "economy_20mpg"
# create a fold column with 5 folds
# randomly assign fold numbers 0 through 4 for each row in the column
fold_numbers <- h2o.kfold_column(cars, nfolds=5)
# rename the column "fold_numbers"
names(fold_numbers) <- "fold_numbers"
# print the fold_assignment column
print(fold_numbers)
# append the fold_numbers column to the cars dataset
cars <- h2o.cbind(cars,fold_numbers)
# try using the fold_column parameter:
cars_gbm <- h2o.gbm(x = predictors, y = response, training_frame = cars,
fold_column="fold_numbers", seed = 1234)
# print the auc for your model
print(h2o.auc(cars_gbm, xval = TRUE))

Why does gstat.predict() function often return NaN values (GSTAT Package)? (R version 3.3.2, Windows 10)

I am trying to simulate a combination of two different random fields (yy1 and yy2 with different mean and correlation length) with an irregular boundary using Gstat package in R. I have attached the picture of my expected outcome. The code is not giving such output consistently and I am frequently getting atleast one of the yy1 and yy2 as NaNs, which results in the Undesired output as shown in image.
The key steps I used are:
1) Created two gstat objects with different means and psill (rf1 and rf2)
2) Created two computational grids (one for each random field) in the form of data frame with two variables “x” and “y” coordinates.
3) Predicted two random fields using unconditional simulation.
Any help in this regard would be highly appreciated.
Attachments: 2 images (link provided) and 1 R code
1) Expected Outcome
2) Undesired Outcome
library(gstat)
xy <- expand.grid(1:150, 1:200) # 150 x 200 grid is created in the form of a dataframe with x and y vectors
names(xy)<-c('x','y') # giving names to the variables
# creating gsat objects
rf1<-gstat(formula=z~1,locations=~x+y,dummy = T,beta=c(1,0,0), model=vgm(psill=0.025, range=5, model='Exp'), nmax=20) # dummy=T treats this as a unditional simulation
rf2<-gstat(formula=z~1,locations=~x+y,dummy = T,beta=c(4,0,0), model=vgm(psill=0.025, range=10, model='Exp'), nmax=20) # dummy=T treats this as a unditional simulation
# creating two computational grid
rows<-nrow(xy)
xy_shift <- expand.grid(60:90, 75:100)
names(xy_shift)<-c('x','y')
library(dplyr) # for antijoin
xy1<-xy[1:(rows/2),]
xy1<-anti_join(xy1, xy_shift, by = c("x","y")) # creating the irregular boundary
xy2<-rbind(xy[(rows/2+1):rows,],xy_shift)
library(sp)
yy1<- predict(rf1, newdata=xy1, nsim=1) # random field 1
yy2<- predict(rf2, newdata=xy2, nsim=1) # random field 2
rf1_label<-gl(1,length(yy1[,1]),labels="field1")
rf2_label<-gl(1,length(yy2[,1]),labels="field2")
yy1<-cbind(yy1,field=rf1_label)
yy2<-cbind(yy2,field=rf2_label)
yy<-rbind(yy1,yy2)
yyplot<-yy[,c(1,2,3)]
# plotting the field
gridded(yyplot) = ~x+y
spplot(obj=yyplot[1],scales=list(draw = TRUE))

cforest party unbalanced classes

I want to measure the features importance with the cforest function from the party library.
My output variable has something like 2000 samples in class 0 and 100 samples in class 1.
I think a good way to avoid bias due to class unbalance is to train each tree of the forest using a subsample such that the number of elements of class 1 is the same of the number of element in class 0.
Is there anyway to do that? I am thinking to an option like n_samples = c(20, 20)
EDIT:
An example of code
> iris.cf <- cforest(Species ~ ., data = iris,
+ control = cforest_unbiased(mtry = 2)) #<--- Here I would like to train the forest using a balanced subsample of the data
> varimp(object = iris.cf)
Sepal.Length Sepal.Width Petal.Length Petal.Width
0.048981818 0.002254545 0.305818182 0.271163636
>
EDIT:
Maybe my question is not clear enough.
Random forest is a set of decision trees. In general the decision trees are constructed using only a random subsample of the data. I would like that the used subsample has the same numbers of element in the class 1 and in the class 0.
EDIT:
The function that I am looking for is for sure available in the randomForest package
sampsize
Size(s) of sample to draw. For classification, if sampsize is a vector of the length the number of strata, then sampling is stratified by strata, and the elements of sampsize indicate the numbers to be drawn from the strata.
I need the same for the party package. Is there any way to get it?
I will assume you know what you want to accomplish, but don't know enough R to do that.
Not sure if the function provides balancing of data as an argument, but you can do it manually. Below is the code I quickly threw together. More elegant solution might exist.
# just in case
myData <- iris
# replicate everything *10* times. Replicate is just a "loop 10 times".
replicate(10,
{
# split dataset by class and add separate classes to list
splitList <- split(myData, myData$Species)
# sample *20* random rows from each matrix in a list
sampledList <- lapply(splitList, function(dat) { dat[sample(20),] })
# combine sampled rows to a data.frame
sampledData <- do.call(rbind, sampledList)
# your code below
res.cf <- cforest(Species ~ ., data = sampledData,
control = cforest_unbiased(mtry = 2)
)
varimp(object = res.cf)
}
)
Hope you can take it from here.

Resources