How to determine Dunn Index using clValid package in R?

I am trying to replicate the results of a journal paper in which the authors propose a clustering algorithm and compute the Dunn index of the resulting clusters using the clValid package in R. I was able to replicate the clusters. However, I am unable to reproduce the Dunn index.
I have the adjacency matrix (adj) as shown below:
adj <-matrix(c(0,1,1,1,1,1,1,1,1,0,1,1,1,1,0,0,0,1,0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,0,
1,0,1,1,0,0,0,1,0,0,0,0,0,1,0,0,0,1,0,1,0,1,0,0,0,0,0,0,0,0,1,0,0,0,
1,1,0,1,0,0,0,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,1,0,
1,1,1,0,0,0,0,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
1,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
1,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
1,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,1,
0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,
1,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,
0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,
1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,
1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,1,0,0,1,1,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,1,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,
0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,1,
0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1,1,
0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,1,0,0,0,1,1,
0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,1,0,0,1,0,1,0,1,1,0,0,0,0,0,1,1,1,0,1,
0,0,0,0,0,0,0,0,1,1,0,0,0,1,1,1,0,0,1,1,1,0,1,1,0,0,1,1,1,1,1,1,1,0),
nrow=34, # number of rows
ncol=34,
byrow = TRUE)
The resulting cluster membership is:
membership <- c(1, 1, 1, 1, 1, 2, 2, 1, 3, 3, 1, 1, 1, 1, 3, 3, 2, 1, 3, 1, 3, 1, 3, 3, 1, 1, 3, 3, 1, 3, 3, 1, 3, 3)
I used the following code to compute the Dunn index:
library(clValid)
# Euclidean distance between the rows of the adjacency matrix
dist_dunn <- dist(adj, method = "euclidean")
dunn_value <- dunn(distance = dist_dunn, clusters = membership)
dunn_value
The resulting output is 0.2132. However, the actual output reported in the journal paper is 0.111. Can someone help me with this and let me know where I am going wrong?
Thanks in advance.
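For reference, the Dunn index is the smallest distance between observations in different clusters divided by the largest distance between observations in the same cluster, so its value depends entirely on how the distances are defined. The base R sketch below spells that ratio out on a toy example; dunn_manual is a hypothetical helper written for illustration, not part of clValid. For graph data it may be worth checking which distance the paper used (Euclidean distance between adjacency rows, as above, or something else such as shortest-path distance). That is only a guess at the source of the discrepancy, not a confirmed cause.

```r
# Hypothetical helper: Dunn index from a distance object and a membership vector
dunn_manual <- function(d, clusters) {
  d <- as.matrix(d)
  same <- outer(clusters, clusters, "==")   # TRUE where both points share a cluster
  off_diag <- row(d) != col(d)              # ignore the zero self-distances
  min_between <- min(d[!same])              # closest pair in different clusters
  max_within <- max(d[same & off_diag])     # widest pair inside any one cluster
  min_between / max_within
}

# Toy data: four points on a line, two clusters
toy_dist <- dist(c(0, 1, 10, 12))
toy_membership <- c(1, 1, 2, 2)
toy_dunn <- dunn_manual(toy_dist, toy_membership)
print(toy_dunn)  # 9 / 2 = 4.5
```

Passing a different distance matrix for the same membership vector changes the result, which makes it easy to compare candidate distance definitions against the published value.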


joining two dataframes on matching values of two common columns R

I have two dataframes, A and B, that both have multiple columns. They share the common columns "week" and "store". I would like to join these two dataframes on the matching values of the common columns.
For example this is a small subset of the data that I have:
A = data.frame(retailer = c(2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2),
store = c(5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6),
week = c(2021100301, 2021092601, 2021091901, 2021091201, 2021082901, 2021082201, 2021081501, 2021080801,
2021080101, 2021072501, 2021071801, 2021071101, 2021070401, 2021062701, 2021062001, 2021061301),
dollars = c(121817.9, 367566.7, 507674.5, 421257.8, 453330.3, 607551.4, 462674.8,
464329.1, 339342.3, 549271.5, 496720.1, 554858.7, 382675.5,
373210.9, 422534.2, 381668.6))
and
B = data.frame(
week = c("2020080901", "2017111101", "2017061801", "2020090701", "2020090701", "2020090701",
"2020091201","2020082301", "2019122201", "2017102901"),
store = c(14071, 11468, 2428, 17777, 14821, 10935, 5127, 14772, 14772, 14772),
fill = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
)
I would like to join these two tables on the matching week AND store values in order to incorporate the "fill" column from B into A. Where the values don't match, I would like to have a label "0" in the fill column, instead of a 1. Is there a way I can do this? I am not sure which join to use as well, or if "merge" would be better for this? Essentially I am NOT trying to get rid of any rows that do not have the matching values for the two common columns. Thanks for any help!
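We may do a left_join, which keeps every row of A and fills the columns from B with NA where there is no match. The NA values can then be replaced with 0.
library(dplyr)
library(tidyr)
A %>%
  # B$week is character, so convert A$week to the same type before joining
  mutate(week = as.character(week)) %>%
  # name the join keys explicitly instead of relying on the default
  left_join(B, by = c("week", "store")) %>%
  mutate(fill = replace_na(fill, 0))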
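Since the question also asks whether merge would work, here is a minimal base R sketch of the same idea. The small data frames A_small and B_small are made up for illustration and are not the asker's data; all.x = TRUE is what keeps the unmatched rows of the left table.

```r
# Made-up miniature versions of A and B (illustrative only)
A_small <- data.frame(
  store = c(5, 5, 6),
  week = c("2021100301", "2021092601", "2021100301"),
  dollars = c(10, 20, 30)
)
B_small <- data.frame(week = "2021092601", store = 5, fill = 1)

# all.x = TRUE keeps every row of A_small, matched or not (a left join)
joined <- merge(A_small, B_small, by = c("week", "store"), all.x = TRUE)

# Rows without a match get NA in fill; relabel them as 0
joined$fill[is.na(joined$fill)] <- 0
print(joined)
```

Note that merge sorts the result by the key columns, so the row order can differ from A.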

Extract cluster information and combine results

I am attempting to run a clustering algorithm over a list of dissimilarity matrices for different numbers of clusters k and extract some information for each run.
This first block of code produces the list of dissimilarity matrices
library(tidyverse)
library(cluster)
library(rje)
dat=mtcars[,1:3]
v_names=names(dat)
combos=rje::powerSet(v_names)
combos=combos[lengths(combos)>1]
df_list=list()
for (i in seq_along(combos)){
df_list[[i]]=dat[combos[[i]]]
}
gower_ls=lapply(df_list,daisy,metric="gower")
Here is the section of code I am having a problem with
set.seed(4)
model_num <-c(NA)
sil_width <-c(NA)
min_sil<-c(NA)
mincluster<-c(NA)
k_clusters <-c(NA)
lowest_sil <-c(NA)
maxcluster <-c(NA)
model_vars <- c(NA)
clust_4=lapply(gower_ls,pam,diss=TRUE,k=4)
for(m in 1:length(clust_4)){
sil_width[m] <-clust_4[[m]][7]$silinfo$avg.width
min_sil[m] <- min(clust_4[[m]][7]$silinfo$clus.avg.widths)
mincluster[m] <-min(clust_4[[m]][6]$clusinfo[,1])
maxcluster[m] <-max(clust_4[[m]][6]$clusinfo[,1])
k_clusters[m]<- nrow(clust_4[[m]][6]$clusinfo)
lowest_sil[m]<-min(clust_4[[m]][7]$silinfo$widths)
model_num[m] <-m
}
colresults_4=as.data.frame(cbind( sil_width, min_sil,mincluster,maxcluster,k_clusters,model_num,lowest_sil))
How can I convert this piece of code to run for a given range of k? I've tried a nested loop but I was not able to code it correctly. Here are the desired results for k= 4:6, thanks.
structure(list(sil_width = c(0.766467312788453, 0.543226669407726,
0.765018469447229, 0.705326458357873, 0.698351173575526, 0.480565022092276,
0.753366365875066, 0.644345251543097, 0.699437672202048, 0.430310752506775,
0.678224885117295, 0.576411380463116), min_sil = c(0.539324315243191,
0.508330909368204, 0.637090842537915, 0.622120627356455, 0.539324315243191,
0.334047777245833, 0.430814518122641, 0.568591550281139, 0.539324315243191,
0.295113900268025, 0.430814518122641, 0.19040716086259), mincluster = c(5,
3, 4, 5, 2, 3, 3, 3, 2, 3, 3, 3), maxcluster = c(14, 12, 11,
14, 12, 10, 11, 11, 9, 6, 7, 7), k_clusters = c(4, 4, 4, 4, 5,
5, 5, 5, 6, 6, 6, 6), model_num = c(1, 2, 3, 4, 1, 2, 3, 4, 1,
2, 3, 4), lowest_sil = c(-0.0726256983240229, 0.0367238314801671,
0.308069836672298, 0.294247157041013, -0.0726256983240229, -0.122804288130541,
-0.317748917748917, 0.218164082936686, -0.0726256983240229, -0.224849074123824,
-0.317748917748917, -0.459909237820881)), row.names = c(NA, -12L
), class = "data.frame")
I was able to come up with a solution by writing a function clus_func that extracts the cluster information and then using cross2 and map2 from the purrr package:
library(tidyverse)
library(cluster)
library(rje)
dat=mtcars[,1:3]
v_names=names(dat)
combos=rje::powerSet(v_names)
combos=combos[lengths(combos)>1]
clus_func=function(x,k){
clust=pam(x,k,diss=TRUE)
clust_stats=as.data.frame(cbind(
avg_sil_width=clust$silinfo$avg.width,
min_clus_width=min(clust$silinfo$clus.avg.widths),
min_individual_sil=min(clust$silinfo$widths[,3]),
max_individual_sil=max(clust$silinfo$widths[,3]),
mincluster= min(clust$clusinfo[,1]),
maxcluster= max(clust$clusinfo[,1]),
num_k=max(clust$clustering) ))
}
df_list=list()
for (i in seq_along(combos)){
df_list[[i]]=dat[combos[[i]]]
}
gower_ls=lapply(df_list,daisy,metric="gower")
begin_k=4
end_k=6
cross_list=cross2(gower_ls,begin_k:end_k)
k=c(NA)
for(i in 1:length(cross_list)){ k[i]=cross_list[[i]][2]}
diss=c(NA)
for(i in 1:length(cross_list)){ diss[i]=cross_list[[i]][1]}
model_stats=map2(diss, k, clus_func)
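# rbindlist() comes from data.table, which is not attached above, so call it with its namespace
model_stats=data.table::rbindlist(model_stats)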
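An alternative sketch of the same idea in base R plus the cluster package: build every (model, k) combination with expand.grid and fit one pam per row. It uses only two illustrative dissimilarity matrices rather than the full power set, and the output column names follow the desired result in the question. It also avoids purrr::cross2(), which newer purrr releases mark as deprecated.

```r
library(cluster)

dat <- mtcars[, 1:3]
# Two illustrative dissimilarity matrices (the thread builds one per variable subset)
gower_ls <- list(
  daisy(dat[c("mpg", "cyl")], metric = "gower"),
  daisy(dat[c("mpg", "disp")], metric = "gower")
)

# One row per (model, k) pair
grid <- expand.grid(model_num = seq_along(gower_ls), k = 4:6)

# Fit pam for each pair and collect the summary statistics in a one-row data frame
rows <- Map(function(m, k) {
  fit <- pam(gower_ls[[m]], k = k, diss = TRUE)
  data.frame(
    sil_width = fit$silinfo$avg.width,
    min_sil = min(fit$silinfo$clus.avg.widths),
    mincluster = min(fit$clusinfo[, "size"]),
    maxcluster = max(fit$clusinfo[, "size"]),
    k_clusters = k,
    model_num = m,
    lowest_sil = min(fit$silinfo$widths[, "sil_width"])
  )
}, grid$model_num, grid$k)

results <- do.call(rbind, rows)
print(results)
```

Changing 4:6 to any other range of k is the only edit needed to cover more cluster counts.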

Set bubble size according to categorical data

Keep in mind, I am very new to R.
I have a dataset from a public opinion survey, and would like to represent the answers through a bubble chart, though the data is categorical, not numeric.
From dataset "Arab4" I have question/variable "Q713" with all of the observations coded as 1, 2, 3, 4, or 5 as the response options. I would like to plot the bubbles (stacked on top of one another by "country") with the size of the bubble corresponding to the percent of the vote share that answer got. For example, if 49% of respondents in Israel voted for option 1 under question "Q", then the bubble size would represent 49% and be situated above the Israel category label with the color of the bubble corresponding to the response type (1, 2, 3, 4, or 5).
I have the following code, giving me a blank chart, and I know to eventually use the "points" command with more specifications.
What I need help with is defining the radius of the circles from the data I have.
plot(Arab4$Country, Arab4$Q713, type= "n", xlab = FALSE, ylab=FALSE)
points(Arab4$country, Arab4$q713)
Here is some dput from the data set
dput(Arab4$q713[1:50])
structure(c(3, 5, 3, 3, 1, 3, 5, 5, 5, 5, 3, 2, 2, 3, 1, 1, 4,
2, 3, 5, 5, 5, 2, 5, 4, 2, 5, 2, 5, 3, 5, 5, 2, 2, 5, 2, 1, 2,
1, 2, 5, 3, 4, 5, 1, 1, 1, 4, 5, 3), labels = structure(c(1,
2, 3, 4, 5, 98, 99), .Names = c("Promoting democracy", "Promoting economic development",
"Resolving the Arab-Israeli conflict", "Promoting women’s rights",
"The US should not get involved", "Don't know (Do not read)",
"Decline to answer (Do not read)")), class = "labelled")
Any ideas would help! Thanks!
As others have commented, this really is not a bubble chart as you only have 2 dimensions and the size of the circle does not add anything (other than perhaps visual appeal). But with that disclaimer, here is one approach to what I think you are trying to achieve. This requires the ggplot2 and reshape2 libraries.
library(ggplot2)
library(reshape2)
# create simulated data (replace = TRUE is required to draw 20 values from 5 options)
dat <- data.frame(Egypt=sample(1:5, 20, replace=TRUE), Libya=sample(1:5, 20, replace=TRUE))
# tabulate each country separately; lapply always returns a list, one table per country
dat.tab <- lapply(dat, table)
dat.long <- melt(dat.tab)
colnames(dat.long) <- c("Response", "Count", "Country")
ggplot(dat.long, aes(x=Country, y=Count, color=Country)) +
  geom_point(aes(size=Count))
EDIT Here is another approach, using the data manipulation tools of the dplyr package to get you all the way to proportions:
library(dplyr)
# using dat from above again
dat.long <- melt(dat)
colnames(dat.long) <- c("Country", "Response")
# count responses per country, then convert the counts to within-country proportions
dat.tab <- dat.long %>%
  group_by(Country) %>%
  count(Response) %>%
  mutate(prop = prop.table(n))
ggplot(dat.tab, aes(x=Country, y=prop, color=Country)) +
  geom_point(aes(size=prop))
You will need to do a little additional work to remove unwanted values (98, 99) if they are truly unwanted.
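As a rough sketch of that extra step, the base R snippet below drops the 98 and 99 codes and computes each response's share within a country. The survey data frame is invented for illustration; only the code values 98 and 99 come from the labels in the question.

```r
# Invented example data: two countries, a few responses each
survey <- data.frame(
  country = c("A", "A", "A", "A", "B", "B", "B", "B"),
  q713 = c(1, 1, 2, 98, 3, 3, 3, 99)
)

# Drop the "Don't know" (98) and "Decline to answer" (99) codes
valid <- survey[!survey$q713 %in% c(98, 99), ]

# Share of each response within each country (every row sums to 1)
shares <- prop.table(table(valid$country, valid$q713), margin = 1)
print(shares)

# Long format, one row per country/response pair, ready for plotting
shares_long <- as.data.frame(shares, responseName = "prop")
names(shares_long)[1:2] <- c("Country", "Response")
```

The prop column can then be mapped to point size in the same way as in the plots above.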
hth.

determining age from min max dates for each item in dataset [duplicate]

This question is very similar to one asked in another thread, which can be found here. I'm trying to achieve something similar: within groups (events), subtract the first date from the last date. I'm using the dplyr package and the code provided in the answers to that thread. The subtraction itself works, but the results are not satisfactory: the time difference is displayed as a bare number, with no distinction between time units (e.g., minutes and hours). The differences for the first two events are correct, but the third is not; it should be in minutes. How can I control the dplyr output so that the differences correctly reflect the elapsed time? Below is a sample of my data (1 group only) and the code that I used:
df<- structure(list(time = structure(c(1428082860, 1428083340, 1428084840,
1428086820, 1428086940, 1428087120, 1428087240, 1428087360, 1428087480,
1428087720, 1428088800, 1428089160, 1428089580, 1428089700, 1428090120,
1428090240, 1428090480, 1428090660, 1428090780, 1428090960, 1428091080,
1428091200, 1428091500, 1428091620, 1428096060, 1428096420, 1428096540,
1428096600, 1428097560, 1428097860, 1428100440, 1428100560, 1428100680,
1428100740, 1428100860, 1428101040, 1428101160, 1428101400, 1428101520,
1428101760, 1428101940, 1428102240, 1428102840, 1428103080, 1428103620,
1428103980, 1428104100, 1428104160, 1428104340, 1428104520, 1428104700,
1428108540, 1428108840, 1428108960, 1428110340, 1428110460, 1428110640
), class = c("POSIXct", "POSIXt"), tzone = ""), event = c(1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3)), .Names = c("time",
"event"), class = "data.frame", row.names = c(NA, 57L))
df1 <- df %>%
group_by(event) %>%
summarize(first(time),last(time),difference = last(time)-first(time))
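We can use difftime and specify the units explicitly, so that every difference is expressed in the same unit. Subtracting two POSIXct values directly lets R choose the unit for each result, which is why the groups end up on different scales.
df %>%
  group_by(event) %>%
  summarise(First = first(time),
            Last = last(time),
            difference = difftime(last(time), first(time), units = "hours"))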
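The unit behaviour can be seen in isolation with two made-up time gaps in base R. The timestamps below are illustrative and not taken from the question's data.

```r
start <- as.POSIXct("2015-04-03 10:00:00", tz = "UTC")

# Plain subtraction: R picks the unit separately for each result
short_gap <- (start + 35 * 60) - start    # 35 minutes later
long_gap <- (start + 3 * 3600) - start    # 3 hours later
print(units(short_gap))  # "mins"
print(units(long_gap))   # "hours"

# difftime with explicit units: both results are on the same scale
fixed_short <- difftime(start + 35 * 60, start, units = "mins")
fixed_long <- difftime(start + 3 * 3600, start, units = "mins")
print(as.numeric(fixed_short))  # 35
print(as.numeric(fixed_long))   # 180
```

Without fixed units, the bare numbers 35 and 3 would sit side by side in one column even though they are on different scales.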
