Finding number of identical groups - r

Consider a clustering problem where the true class labels are known (say g).
Suppose p denotes the predicted cluster labels (obtained by any clustering approach).
So both g and p split the data set into groups, though the number of groups need not be the same in the two cases.
Among these two sets of groups, in some cases a group defined by g will be identical to a group defined by p, even though the labels themselves may differ. I want to find the number of such groups, i.e. the number of cases where the clustering method detects a class perfectly.
I understand this is not a standard way to evaluate clustering (Rand Index, Dunn Index, etc. are recommended), but I am interested in it. I also understand that this number will be very small for most real-life data, maybe even 0, but the data set I am currently working with has a large number of classes (around 1500), with at most 15 observations in any one class. So in this case this number is likely to be quite high.
Here is a reproducible example and my attempt (working) at the solution:
# true labels
g <- c(1, 1, 2, 2, 2, 1, 3, 3, 3, 4)
# predicted labels
p <- c(3, 3, 1, 1, 1, 3, 4, 4, 1, 2)
# correctly detected groups
n_correct <- 2 # (1st class and 3rd cluster), (4th class and 2nd cluster)
# attempt
distinct_class_labels <- unique(x = g)
counter <- 0
for (i in seq_along(along.with = distinct_class_labels))
{
  cluster_labels_of_obs_in_ith_class <- subset(x = p,
                                               subset = (g == distinct_class_labels[i]))
  unique_cluster_labels_of_obs_in_ith_class <- unique(x = cluster_labels_of_obs_in_ith_class)
  if (length(x = unique_cluster_labels_of_obs_in_ith_class) == 1)
  {
    class_labels_of_obs_in_this_cluster <- subset(x = g,
                                                  subset = (p == unique_cluster_labels_of_obs_in_ith_class))
    if (length(x = unique(x = class_labels_of_obs_in_this_cluster)) == 1)
    {
      counter <- (counter + 1)
    }
  }
}
counter
#> [1] 2
Created on 2019-05-22 by the reprex package (v0.3.0)
This works correctly, but it takes time (and I do not like this approach). I suppose one could use dplyr::group_by with g and p separately and somehow compare the resulting groups. I expect there are better approaches, and I would highly appreciate such answers.
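For what it's worth, here is a rough sketch of that group_by idea (assuming dplyr; a class counts as perfectly detected when it maps to a single cluster and that cluster maps back to a single class):
library(dplyr)
tibble(g = g, p = p) %>%
  group_by(g) %>% mutate(one_cluster = n_distinct(p) == 1) %>% # class maps to one cluster
  group_by(p) %>% mutate(one_class   = n_distinct(g) == 1) %>% # cluster holds one class
  ungroup() %>%
  filter(one_cluster & one_class) %>%
  summarise(n_correct = n_distinct(g))
#> # A tibble: 1 x 1
#>   n_correct
#>       <int>
#> 1         2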
Thanks.

If you are also interested in which class pairs with which cluster among the correctly detected groups, you can try this:
library(tidyverse)
tibble(g = g, p = p) %>%
  distinct(g, p) %>%                 # unique combinations of g and p
  add_count(g, name = "g_count") %>% # count how often each label occurs in g and in p;
  add_count(p, name = "p_count") %>% # when the assignment is unambiguous, the count is 1
  filter(g_count == 1 & p_count == 1) %>%
  select(g, p)
# A tibble: 2 x 2
      g     p
  <dbl> <dbl>
1     1     3
2     4     2
The number of rows (which you can get with nrow()) gives you the number of correctly detected groups.
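For example, to get only the count:
tibble(g = g, p = p) %>%
  distinct(g, p) %>%
  add_count(g, name = "g_count") %>%
  add_count(p, name = "p_count") %>%
  filter(g_count == 1 & p_count == 1) %>%
  nrow()
#> [1] 2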

Convert g and p to factors with levels ordered by their first occurrence in each vector, and count the frequencies that match.
sum(table(factor(p, levels = unique(p))) == table(factor(g, levels = unique(g))))
#[1] 2
To understand, see
table(factor(p, levels = unique(p)))
#3 1 4 2
#3 4 2 1
table(factor(g, levels = unique(g)))
#1 2 3 4
#3 3 3 1
We can ignore the labels (since the group labels are not the same in g and p) and focus only on the frequencies. The first and fourth values have the same frequency, hence the count of 2.
If you want to find out which groups are similar, you can do
inds <- table(factor(p, levels = unique(p))) == table(factor(g, levels = unique(g)))
unique(p)[inds]
#[1] 3 2
unique(g)[inds]
#[1] 1 4
This says that group 3 in p corresponds to group 1 in g, and likewise group 2 in p corresponds to group 4 in g.
Before solving it with table, I did it with split, although the underlying logic is the same.
sum(lengths(split(p, factor(p, levels = unique(p)))) ==
lengths(split(g, factor(g, levels = unique(g)))))
EDIT
If g and p can contain different numbers of distinct labels, we need to combine the levels so that both tables cover all of them. For example,
g1 <- c(g, 5)
p1 <- c(p, 1)
sum(table(factor(p1, levels = unique(c(p1, g1)))) ==
table(factor(g1, levels = unique(c(g1, p1)))))
#[1] 2

Related

Control which nodes change size in igraph

This may look simple, but I am not able to do it. I want to plot two types of nodes, small and big, using a cutoff. The values are in the attribute degree.
Here is a small toy example:
library(igraph)

g1 <- graph(edges = c(1,2, 2,3, 3,1, 4,2), n = 4, directed = FALSE) %>%
  set_vertex_attr("names", value = LETTERS[1:4])
g1_degree <- degree(g1, mode = "total")
g1_degree
[1] 2 3 2 1
g1 <- set_vertex_attr(g1, "degree", value = g1_degree)
plot(g1, vertex.size = V(g1)$degree)
This sizes every node according to its degree, but I want nodes of degree 2 and 3 to be big and nodes of degree 1 to be small.
So I tried to edit the values within V(g1)$degree
ifelse(V(g1)$degree < 2, yes = V(g1)$degree==1, no = V(g1)$degree==5)
FALSE FALSE FALSE TRUE
OK, this checks my degree values against the cutoff, but how can I replace the TRUE or FALSE values with the actual node sizes I want?
Here are two solutions.
One with ifelse, like in the question.
g1 <- set_vertex_attr(g1, "degree", value = ifelse(V(g1)$degree < 2, 1, 5))
V(g1)$degree
#[1] 5 5 5 1
And another with findInterval. This has better performance than ifelse, which can be important in large data sets.
i <- findInterval(V(g1)$degree, c(0, 2, Inf))
g1 <- set_vertex_attr(g1, "degree", value = c(1, 5)[i])
V(g1)$degree
#[1] 5 5 5 1
With larger sizes set from the same findInterval index, c(10, 50)[i], the difference between the big and small nodes becomes clearly visible in the plot.
g1 <- set_vertex_attr(g1, "degree", value = c(10, 50)[i])
plot(g1, vertex.size = V(g1)$degree)
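A small variant (a sketch, not from the original answer): since degree() can be recomputed from the graph at any time, the sizes can also be set directly at plot time without overwriting the attribute.
# recompute degree and size nodes on the fly
plot(g1, vertex.size = ifelse(degree(g1) < 2, 10, 50))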

Iterate over combinations from one row in each index by group in r

I have a dataset (example) as below,
data <- data.frame(pc = c("A","A","A","A","A","A", "B","B","B","B"), # categorical
                   index = c(1, 1, 2, 2, 2, 3, 4, 5, 5, 5),          # categorical
                   g = c(1, 2, 4, 3, 6, 7, 8, 5, 9, 3),              # numeric
                   h = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4))              # categorical
I want to group by 'pc', iterate over all combinations of one row per 'index', compute for each combination the sum of the values in 'g' and the number of distinct categories in 'h', and keep the combination that yields the highest (sum of 'g' + number of categories in 'h').
For example, in the pc = A group, index = 1 has two rows, index = 2 has three, and index = 3 has one, so in total there are 2 x 3 x 1 = 6 combinations (each combination has three rows: one with index = 1, one with index = 2, one with index = 3). I want to keep the rows (one row from each unique index) that yield the highest (sum of values from 'g' + number of categories from 'h'). The number of indices and the number of rows per index differ across pc groups.
Just an example to visualise the combinations for the pc = A group:
combination  sum_of_values_in_g  number_of_categories_in_h
#1           12                  2
#2           11                  3
#3           14                  3
#4           13                  2
#5           12                  3
#6           15                  3
My desired result in this example will be
pc index g h
A 1 2 1
A 2 6 2
A 3 7 3
B 4 8 3
B 5 9 3
I have done some research on how to get combinations
(Iterate over unique combination of groups in a data frame, How to iterate through all combinations of columns and apply function by group in R?, and
Combinations by group in R),
but I couldn't figure out how to get the right combinations within each group and run further operations on each combination. Any input or direction will be appreciated!
Here is a brute force solution. The run time could be really long given a large dataset.
We need functions from these packages:
library(tidyr)
library(dplyr)
library(purrr)
In the first step, we need a function that splits your data into groups (split(transpose(df), df[[split_by]])), finds all possible row combinations across them (cross(...)), and finally merges each combination into a single data frame (lapply(..., bind_rows)).
perm_all <- function(df, split_by) {
  lapply(cross(split(transpose(df), df[[split_by]])), bind_rows)
}
(transpose turns an n-row data frame into an n-element list, with one list element per row)
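As a quick sanity check on the example data (matching the 2 x 3 x 1 = 6 combinations described in the question):
combos_A <- perm_all(data[data$pc == "A", ], "index")
length(combos_A)
#> [1] 6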
In the second step, we loop through all data frames in that list to see which one satisfies your requirements.
which_max <- function(ls_of_df, numer, categ) {
  test_stats <- vapply(
    ls_of_df,
    function(df) {
      temp <- length(unique(df[[categ]]))
      c(sum(df[[numer]]) + temp, temp)
    },
    double(2L)
  )
  # You could have multiple maxima for those sums
  out <- which(test_stats[1L, ] == max(test_stats[1L, ]))
  # but after the second test (i.e. find the greatest number of categories),
  # you should have one and only one combination left
  out[[which.max(test_stats[2L, out])]]
}
Now, we use a single function to perform these two steps.
max_of_all_perm <- function(df, group_var, numer, categ) {
  l <- perm_all(df, group_var)
  l[[which_max(l, numer, categ)]]
}
And run it across all groups defined by pc
data %>%
  nest(data = -pc) %>%
  mutate(data = lapply(data, max_of_all_perm, "index", "g", "h")) %>%
  unnest(data)
Output
# A tibble: 5 x 4
  pc    index     g     h
  <chr> <dbl> <dbl> <dbl>
1 A         1     2     1
2 A         2     6     2
3 A         3     7     3
4 B         4     8     3
5 B         5     9     3

Plot every 10 datapoint in a vector by different color in R

I have a one-dimensional vector in R which I would like to plot so that every 10 data points have a different color. How do I do this in R with the normal plot function, with ggplot, and with plotly?
In base R you can try this. I changed the data a little bit compared to the other answer.
# The data
set.seed(2017);
df <- data.frame(x = 1:100, y = 0.001 * 1:100 + runif(100));
nCol <- 10;
df$col <- rep(1:10, each = 10);
# base R plot
plot(df[1:2]) #add `type="n"` to remove the points
sapply(1:nrow(df), function(x) lines(df[x+0:1,1:2], col=df$col[x], lwd=2))
Because lines() uses only a single colour per call (the col parameter is not applied per segment), you have to loop (here with sapply) over the rows and plot the segments one by one.
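As an aside, segments() does accept a vector of colours (one per segment), so the same picture could be drawn without the sapply loop; a sketch, reusing df from above:
plot(df[1:2], type = "n")
with(df, segments(head(x, -1), head(y, -1), x[-1], y[-1],
                  col = head(col, -1), lwd = 2))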
Here is a ggplot solution; unfortunately you don't provide sample data, so I'm generating some random data.
# Sample data
set.seed(2017);
df <- data.frame(x = 1:100, y = 0.001 * 1:100 + runif(100));
# The number of different colours
nCol <- 5;
df$col <- rep(1:nCol, each = 10);
# ggplot
library(tidyverse);
ggplot(df, aes(x = x, y = y, col = as.factor(col), group = 1)) +
geom_line();
For plotly, just wrap the ggplot call in ggplotly.
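A minimal sketch of that (assuming the plotly package is installed):
library(plotly)
p <- ggplot(df, aes(x = x, y = y, col = as.factor(col), group = 1)) +
  geom_line()
ggplotly(p)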
This answer doesn't show you how to do it in a specific plotting package, but instead shows how to assign random colors to your data according to your specifications. The benefit of this approach is that it gives you control over which colors you use if you choose.
library(dplyr) # assumed okay given ggplot2 mention
df = data_frame(v1=rnorm(100))
n = nrow(df)
df$group = (1:n - (1:n %% -10)) / 10
colors = sample(colors(), max(df$group), replace=FALSE)
df$color = colors[df$group]
df %>% group_by(group) %>% filter(row_number() <= 2) %>% ungroup()
# A tibble: 20 x 3
              v1 group color
           <dbl> <dbl> <chr>
 1 -0.6941434087     1 lightsteelblue2
 2 -0.4559695973     1 lightsteelblue2
 3  0.7567737300     2 darkgoldenrod2
 4  0.9478937275     2 darkgoldenrod2
 5 -1.2358486079     3 slategray3
 6 -0.7068140340     3 slategray3
 7  1.3625895045     4 cornsilk
 8 -2.0416315923     4 cornsilk
 9 -0.6273386846     5 darkgoldenrod4
10 -0.5884521130     5 darkgoldenrod4
11  0.0645078975     6 antiquewhite1
12  1.3176727205     6 antiquewhite1
13 -1.9082708004     7 khaki
14  0.2898018693     7 khaki
15  0.7276799336     8 greenyellow
16  0.2601492048     8 greenyellow
17 -0.0514811315     9 seagreen1
18  0.8122600269     9 seagreen1
19  0.0004641533    10 darkseagreen4
20 -0.9032770589    10 darkseagreen4
The above code first creates a fake dataset with 100 rows of data and sets n equal to 100. df$group is set by taking the row numbers (1:n) and performing a rather convoluted evaluation to get a vector of numbers like c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, ..., 10). It then samples the colors available in base R, returning as many colors as there are groups (max(df$group)), and uses the group variable to index the color vector to get the color. The final output is just the first two rows of each group, to show that the colors are the same within a group but different between groups. This can now be passed in as a variable in your various plotting environments.
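As a side note, that convoluted group expression is just a way of writing ceiling(row_number / 10); a quick check:
all((1:n - (1:n %% -10)) / 10 == ceiling(1:n / 10))
#> [1] TRUE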

How to add a column that gives the result of an operation on every row relative to current row?

I have a data frame with a group of x and y points. I need to calculate the Euclidean distance of every point relative to every other point. Then I have to figure out, for each row, how many points are within a given range.
For example, if I had this data frame:
x y
- -
1 2
2 2
9 9
I should add a column that indicates how many other points (if we consider these points to be in a Cartesian plane) are within a distance of 3 units of each point.
x y n
- - -
1 2 1
2 2 1
9 9 0
Thus, the first point (1,2) has one other point (2,2) that is within that range, whereas the point (9,9) has 0 points at a distance of 3 units.
I could do this with a couple of nested for loops, but I am interested in solving this in R in an idiomatic way, preferably using dplyr or other library.
This is what I have:
ddply(.data=mydataframe, .variables('x', 'y'), .fun=count.in.range)
count.in.range <- function (df) {
xp <- df$x
yp <- df$y
return(nrow(filter(df, dist( rbind(c(x,y), c(xp,yp)) ) < 3 )))
}
But, for some reason, this doesn't work. I think it has to do with filter.
Given
df_ <- data.frame(x = c(1, 2, 9),
y = c(2, 2, 9))
You can use the function "dist" to get all pairwise distances and count, for each row, how many are within 3 units (subtracting 1 so the point itself is not counted):
matrix_dist <- as.matrix(dist(df_))
df_$n <- rowSums(matrix_dist <= 3) - 1
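A quick check against the expected output from the question:
df_
#   x y n
# 1 1 2 1
# 2 2 2 1
# 3 9 9 0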
This is a base R approach with a straightforward application of a "distance function", but only on a row-by-row basis:
apply(df_, 1, function(x) sum((x[1] - df_[['x']])^2 + (x[2] - df_[['y']])^2 <= 9) - 1) # the -1 excludes the point itself
#[1] 1 1 0
It's also really a "sweep" operation, although I wouldn't really expect a performance improvement.
I would suggest you work with pairs of points in long format and then use a data.table solution, which is probably one of the fastest alternatives for working with large datasets.
library(data.table)
library(reshape)
df <- data.frame(x = c(1, 2, 9),
y = c(2, 2, 9))
The first thing you need to do is to reshape your data to long format with all possible combinations of pairs of points:
df_long <- expand.grid.df(df, df)
# convert to data.table and rename columns
setDT(df_long)
setnames(df_long, c("x", "y", "x1", "y1"))
Now you only need to do this:
# calculate distance between pairs
df_long[ , mydist := dist ( matrix(c(x,x1,y,y1), ncol = 2, nrow = 2) ) , by=.(x,y,x1,y1)]
# count how many points are within a distance of 3 units
df_long[mydist <3 , .(count = .N), by=.(x,y)]
#> x y count
#> 1: 1 2 2
#> 2: 2 2 2
#> 3: 9 9 1
Note that this count includes each point itself (distance 0), so subtract 1 from count if you only want the other points within range.

Wordclouds with absolute word sizes

I'm trying to make several wordclouds to compare terms, which themselves are nested within groups. I would like to make one wordcloud per group. The wordcloud package in R can make the wordclouds I need, but in each new wordcloud the word sizes are scaled relative to that cloud's own maximum and minimum word frequency. This is controlled by the scale parameter.
My aim is to make wordclouds where the size of a word relates to its absolute frequency, so that different wordclouds can be compared visually.
library(wordcloud)
dat <- data.frame(word = rep(LETTERS[1:3], 2), freq = c(10, 5, 3, 20, 10, 6), group = c(1, 1, 1, 2, 2, 2))
dat
# word freq group
#1 A 10 1
#2 B 5 1
#3 C 3 1
#4 A 20 2
#5 B 10 2
#6 C 6 2
wordcloud(dat$word[dat$group == 1], dat$freq[dat$group == 1])
wordcloud(dat$word[dat$group == 2], dat$freq[dat$group == 2]) # Currently the same
Currently both commands produce wordclouds with identically sized words (although the exact placement varies randomly with each run). I would like each letter in the second group's wordcloud to be twice as large as in the first, in line with the data (or for there to be some sensible scaled difference, even if it is not linear).
How can this be achieved?
Hm, this might be a roundabout way, but what if we set the scale for all groups using a single anchor?
anchor <- max(dat$freq)
wordcloud(dat$word[dat$group == 1], dat$freq[dat$group == 1], scale = c(8*max(dat$freq[dat$group == 1])/anchor, 0.5))
wordcloud(dat$word[dat$group == 2], dat$freq[dat$group == 2], scale = c(8*max(dat$freq[dat$group == 2])/anchor, 0.5))
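If you have many groups, the same idea could be wrapped in a small helper; a sketch (cloud_with_anchor is a made-up name, not part of the wordcloud package):
cloud_with_anchor <- function(d, grp, anchor = max(d$freq), max_cex = 8) {
  sub <- d[d$group == grp, ]
  # scale this group's largest word relative to the overall anchor frequency
  wordcloud(sub$word, sub$freq, scale = c(max_cex * max(sub$freq) / anchor, 0.5))
}
cloud_with_anchor(dat, 1)
cloud_with_anchor(dat, 2)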
