Clustering similar strings based on another column in R

I have a large data frame that shows the distance between strings and their counts.
For example, row 1 shows the distance between apple and pple, along with the number of times I counted apple (counts_col1 = 100) and the number of times I counted pple (counts_col2 = 2).
library(tidyverse)
df <- tibble(col1 = c("apple","apple","pple", "banana", "banana","bananna"),
col2 = c("pple","app","app", "bananna", "banan", "banan"),
distance = c(1,2,3,1,1,2),
counts_col1 = c(100,100,2,200,200,2),
counts_col2 = c(2,50,50,2,20,20))
df
#> # A tibble: 6 × 5
#> col1 col2 distance counts_col1 counts_col2
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 apple pple 1 100 2
#> 2 apple app 2 100 50
#> 3 pple app 3 2 50
#> 4 banana bananna 1 200 2
#> 5 banana banan 1 200 20
#> 6 bananna banan 2 2 20
Created on 2022-03-15 by the reprex package (v2.0.1)
Now I want to cluster the apples and the bananas around the string with the highest count in each group, which is apple (100) and banana (200).
I want my data to look something like this:
cluster elements sum_counts
apple apple 152
NA pple NA
NA app NA
banana banana 222
NA bananna NA
NA banan NA
The format of the output does not have to be like this. I am really struggling to break this problem down and cluster the groups.
Any help or comments are really appreciated!

You can try using random walk clustering from igraph:
count_df <- data.table::melt(
data.table::as.data.table(df),
measure = list(c("col1", "col2"), c("counts_col1", "counts_col2")),
value.name = c("col", "counts")
) %>%
select(col, counts) %>%
unique()
df %>%
igraph::graph_from_data_frame(directed = FALSE) %>%
igraph::walktrap.community(weights = igraph::E(.)$distance) %>%
# igraph::components() %>%
igraph::membership() %>%
split(names(.), .) %>%
map_dfr(
~tibble(col = .x) %>%
semi_join(count_df, ., by = "col") %>%
arrange(desc(counts)) %>%
summarise(cluster = first(col), elements = list(col), sum_count = sum(counts))
)
cluster elements sum_count
1 apple apple, app, pple 152
2 banana banana, banan, bananna 222
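This works on this toy example, but I think the example is too simple and probably does not reflect your real problem. It might be even easier if you only need connected components (two words are in the same cluster whenever a chain of edges connects them). In that case, replace walktrap.community with components (the commented-out line above). Also note that, as far as I know, walktrap treats larger edge weights as stronger connections, so a similarity such as 1 / distance may suit it better than the raw distance.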
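The connected-components idea can also be sketched in base R, without igraph. This is only an illustration on the toy data: the names edges, counts, label and clusters are made up for the example, and it assumes every string always appears with the same count.

```r
# Toy data from the question
edges <- data.frame(
  col1 = c("apple", "apple", "pple", "banana", "banana", "bananna"),
  col2 = c("pple", "app", "app", "bananna", "banan", "banan"),
  counts_col1 = c(100, 100, 2, 200, 200, 2),
  counts_col2 = c(2, 50, 50, 2, 20, 20),
  stringsAsFactors = FALSE
)

# One row per distinct string with its count
counts <- unique(data.frame(
  col = c(edges$col1, edges$col2),
  counts = c(edges$counts_col1, edges$counts_col2),
  stringsAsFactors = FALSE
))

# Every string starts in its own component; for each edge, relabel the
# whole component of one endpoint with the label of the other
label <- setNames(seq_len(nrow(counts)), counts$col)
for (i in seq_len(nrow(edges))) {
  a <- label[[edges$col1[i]]]
  b <- label[[edges$col2[i]]]
  label[label == max(a, b)] <- min(a, b)
}
counts$component <- unname(label[counts$col])

# Per component: the most frequent string names the cluster
clusters <- do.call(rbind, lapply(split(counts, counts$component), function(g) {
  g <- g[order(-g$counts), ]
  data.frame(cluster = g$col[1],
             elements = paste(g$col, collapse = ", "),
             sum_counts = sum(g$counts),
             stringsAsFactors = FALSE)
}))
rownames(clusters) <- NULL
clusters
```

Relabelling a whole component per edge is quadratic in the worst case, so for a large data frame igraph::components is likely the better choice.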

Here is one approach. I first add a group identifier for the sets (I presume you have this in your actual data). Then, after reshaping to a longer dataset, I group by this id and identify the "word" that has the largest count. I then inner join the initial df with this resulting set of key rows, summarize, and rename. All the variants go into a list column.
df <- df %>% mutate(id=c(1,1,1,2,2,2))
df %>% inner_join(
rbind(
df %>% select(id,distance,col=col1, counts=counts_col1),
df %>% select(id,distance,col=col2, counts=counts_col2)
) %>%
group_by(id) %>%
slice_max(counts) %>%
distinct(col),
by=c("col1"="col")
) %>%
group_by(col1) %>%
summarize(variants = list(c(col2, cur_group()$col1)),
total = min(counts_col1) + sum(counts_col2)) %>%
rename_all(~c("cluster", "elements", "sum_counts"))
# A tibble: 2 x 3
cluster elements sum_counts
<chr> <list> <dbl>
1 apple <chr [3]> 152
2 banana <chr [3]> 222
A similar approach in data.table (also depends on having that id column)
setDT(df)
df[rbind(
df[,.(id,col=col1,counts=counts_col1)],
df[,.(id,col=col2,counts=counts_col2)]
)[order(-counts),.SD[1], by=id],on=.(col1=col)][
, .(elements=list(c(col2,.BY$cluster)),
sum_counts = min(counts_col1) + sum(counts_col2)),
by=.(cluster=col1)]
cluster elements sum_counts
<char> <list> <num>
1: banana bananna,banan,banana 222
2: apple pple,app,apple 152

Related

How to aggregate a data frame based on the max value of the group in R

I have a large data with many groups that looks like this.
I want in each group to use the fruit with the most counts as the central fruit,
and aggregate the other fruits based on it!
library(tidyverse)
df <- tibble(col1 = c("apple","apple","pple", "banana", "banana","bananna"),
col2 = c("pple","app","app", "bananna", "banan", "banan"),
counts_col1 = c(100,100,2,200,200,2),
counts_col2 = c(2,50,50,2,20,20),
id=c(1,1,1,2,2,2))
df
#> # A tibble: 6 × 5
#> col1 col2 counts_col1 counts_col2 id
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 apple pple 100 2 1
#> 2 apple app 100 50 1
#> 3 pple app 2 50 1
#> 4 banana bananna 200 2 2
#> 5 banana banan 200 20 2
#> 6 bananna banan 2 20 2
Created on 2022-03-16 by the reprex package (v2.0.1)
I want my data frame to look like this
id central_fruit fruits counts sum_counts
1 apple apple,pple,app 100,50,2 152
2 banana banana,bananna,banan 200,20,2 222
The format of the output does not have to be like this. This is just an example. It can be a list of characters or just characters.
Any help or guidance is appreciated.

We can do this by first reshaping to 'long' format (pivot_longer). Grouped by 'id' and 'grp', create a frequency count (add_count). Then, grouped by 'id', summarise the 'central_fruit' as the fruit with the highest frequency, paste (toString) the unique fruits and the unique counts, and take the sum of the unique counts.
library(dplyr)
library(stringr)
library(tidyr)
df %>%
rename_with(~ str_c("fruit_", .x), starts_with('col')) %>%
pivot_longer(cols = -id, names_to = c(".value", "grp"),
names_pattern = "(.*)_(col\\d+)") %>%
group_by(id, grp) %>%
add_count(fruit) %>%
group_by(id) %>%
summarise(central_fruit = fruit[which.max(n)],
fruits = toString(unique(fruit)),
sum_counts = sum(unique(counts)),
counts = toString(sort(unique(counts), decreasing = TRUE)),
.groups = 'drop' ) %>%
relocate(counts, .before = 'sum_counts')
Output
# A tibble: 2 × 5
id central_fruit fruits counts sum_counts
<dbl> <chr> <chr> <chr> <dbl>
1 1 apple apple, pple, app 100, 50, 2 152
2 2 banana banana, bananna, banan 200, 20, 2 222
NOTE: It may be better to store the values of 'counts' in a list column instead of pasting them together, i.e. instead of counts = toString(sort(unique(counts), decreasing = TRUE)), it would be
counts = list(sort(unique(counts), decreasing = TRUE))
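A small base R sketch of why the list form can be preferable: the pasted version is a single string, while the list version keeps the numbers usable. The names pasted and listed are only for illustration.

```r
values <- c(2, 100, 50)

# toString() collapses the numbers into one character string
pasted <- toString(sort(values, decreasing = TRUE))

# list() keeps them as a numeric vector inside a list element
listed <- list(sort(values, decreasing = TRUE))

pasted              # a single string
sum(listed[[1]])    # arithmetic still works on the list element
```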

Using data.table, you could do:
Reprex
Code
library(tidyverse) # to read your tibble
library(data.table)
setDT(df)[, .(central_fruit = col1[which.max(counts_col1)],
fruits = .(unique(c(col1, col2))),
counts = .(sort(unique(c(counts_col1, counts_col2)), decreasing = TRUE)),
sum_counts = sum(unique(c(counts_col1, counts_col2)))),
by = id]
Output
#> id central_fruit fruits counts sum_counts
#> <num> <char> <list> <list> <num>
#> 1: 1 apple apple,pple,app 100, 50, 2 152
#> 2: 2 banana banana,bananna,banan 200, 20, 2 222
Created on 2022-03-16 by the reprex package (v2.0.1)

Sum data frame rows according to column date

I have a data frame resembling this structure:
Name 2021-01-01 2021-01-02 2021-01-03
Banana 5 23 23
Apple 90 2 15
Pear 39 7 18
The actual dataframe has dates spanning a much larger period of time.
How do I aggregate the columns together so that each column represents a week, with the data from each day being summed to form the weekly value? Giving something like this:
Name 2021-01-01 2021-01-08 2021-01-15
Banana 50 23 62
Apple 34 34 81
Pear 13 18 29
I've looked at the aggregate function but it doesn't seem quite right for this purpose.

I found a nice solution from which I learnt a lot. R really is powerful. After the edit, the column names of the output are the start dates of the respective weeks; see below.
Data
example <- data.frame(Name = "Banana",
"2021-01-01" = 1,
"2021-01-02" = 3,
"2021-01-10" = 2,
"2021-02-02" = 3)
> example
Name X2021.01.01 X2021.01.02 X2021.01.10 X2021.02.02
1 Banana 1 3 2 3
Code
library(dplyr) # provides %>%, mutate(), group_by(), select()
out <- example %>%
tidyr::pivot_longer(cols = c(-Name)) %>%
# data.frame() changed the names to "X2021.01.01", so parse that format
mutate(Name2 = as.Date(name, format = "X%Y.%m.%d")) %>%
mutate(week = lubridate::week(Name2)) %>%
# group by Name as well, so weekly sums are not mixed across rows
group_by(Name, week) %>%
mutate(Sum = sum(value)) %>%
mutate(Dates = lubridate::ymd("2021-01-01") + lubridate::weeks(week - 1)) %>%
ungroup %>%
select(-name, -value, -Name2, -week) %>%
group_by_all %>%
unique %>%
tidyr::pivot_wider(id_cols = Name, values_from = Sum, names_from = Dates)
Output
# A tibble: 1 x 4
# Groups: Name [1]
Name `2021-01-01` `2021-01-08` `2021-01-29`
<chr> <dbl> <dbl> <dbl>
1 Banana 4 2 3
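The same weekly roll-up can be sketched in base R for several names at once. This is a rough alternative, not the method above: it counts 7-day blocks from the earliest date in the data rather than using lubridate::week, and the Apple values are invented for the example.

```r
# Wide data with one column per day (the Apple row is hypothetical)
daily <- data.frame(
  Name = c("Banana", "Apple"),
  `2021-01-01` = c(1, 10),
  `2021-01-02` = c(3, 20),
  `2021-01-10` = c(2, 30),
  `2021-02-02` = c(3, 40),
  check.names = FALSE
)

dates <- as.Date(names(daily)[-1])
# First day of the 7-day block (counted from the earliest date) of each column
week_start <- min(dates) + 7 * (as.integer(dates - min(dates)) %/% 7)

# rowsum() adds up rows sharing a group label, so transpose to put the days
# in rows, sum them by week, and transpose back
weekly <- t(rowsum(t(as.matrix(daily[-1])), group = as.character(week_start)))
weekly <- data.frame(Name = daily$Name, weekly, check.names = FALSE)
weekly
```

On this input the week columns come out as 2021-01-01, 2021-01-08 and 2021-01-29, matching the column names in the output above.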

Count string length using external table

Suppose you have a table of data:
df<-tibble(person = c("Alice", "Bob", "Mary"),
colour = c("Red", "Green", "Blue"),
city = c("London", "Paris", "New York"))
# A tibble: 3 x 3
person colour city
<chr> <chr> <chr>
1 Alice Red London
2 Bob Green Paris
3 Mary Blue New York
And a second table which contains the field names and the maximum string length of each field:
len<-tibble(field_name = c("person", "colour", "city"),
field_length = c(12, 4, 6))
# A tibble: 3 x 2
field_name field_length
<chr> <dbl>
1 person 12
2 colour 4
3 city 6
How can I check, for each field in len, whether a string in df is less than or equal to len$field_length, returning rows which fail the test?
As an example:
Output Row 1 in df would pass because:
'Alice' <= 12 characters long,
'Red' is <= 4 characters long and
'London' is <= 6 characters long.
However,
Row 2 would fail because:
'Green' > 4 characters long and
Row 3 would fail because:
'New York' > 6 characters long.
Thus the returned data frame should only display Rows 2 and Row 3 of the original df.

A dplyr solution with c_across():
library(dplyr)
df %>%
rowwise() %>%
filter(any(nchar(c_across(everything())) > len$field_length)) %>%
ungroup()
# # A tibble: 2 x 3
# person colour city
# <chr> <chr> <chr>
# 1 Bob Green Paris
# 2 Mary Blue New York

Using base R with mapply :
df[rowSums(mapply(function(x, y) nchar(x) > y, df, len$field_length)) > 0, ]
# A tibble: 2 x 3
# person colour city
# <chr> <chr> <chr>
#1 Bob Green Paris
#2 Mary Blue New York
If column names in df are not in the same order as len$field_name use df[len$field_name] in mapply.
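A minimal sketch of that reordering, using a copy of the data with the columns deliberately shuffled. The names people, limits, too_long and failing are illustrative only.

```r
# Same data as in the question, but with city moved to the front
people <- data.frame(
  city = c("London", "Paris", "New York"),
  person = c("Alice", "Bob", "Mary"),
  colour = c("Red", "Green", "Blue"),
  stringsAsFactors = FALSE
)
limits <- data.frame(
  field_name = c("person", "colour", "city"),
  field_length = c(12, 4, 6),
  stringsAsFactors = FALSE
)

# Select the columns in the order of field_name so that column i is
# compared with limit i
too_long <- mapply(function(x, y) nchar(x) > y,
                   people[limits$field_name], limits$field_length)

# Keep the rows with at least one value over its limit
failing <- people[rowSums(too_long) > 0, ]
failing
```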
In the tidyverse, we can reshape the data to long format, join it with the len data by column name, keep the rows that fail, and reshape back to wide format.
library(dplyr)
library(tidyr)
df %>%
mutate(row = row_number()) %>%
pivot_longer(-row) %>%
left_join(len, by = c('name' = 'field_name')) %>%
group_by(row) %>%
filter(any(nchar(value) > field_length)) %>%
dplyr::select(-field_length) %>%
pivot_wider()

It's easier to solve your problem in terms of two matrices. First, the length of each of your entries:
nchar(as.matrix(df))
person colour city
[1,] 5 3 6
[2,] 3 5 5
[3,] 4 4 8
And a corresponding matrix of allowed lengths, filled by row so that each column holds the limit for the matching column of df:
allowed = matrix(len$field_length[match(colnames(df),len$field_name)],
nrow = nrow(df), ncol = ncol(df), byrow = TRUE)
allowed
[,1] [,2] [,3]
[1,] 12 4 6
[2,] 12 4 6
[3,] 12 4 6
Then do a matrix-wise comparison and keep only the rows where rowMeans() of the result is greater than 0, i.e. where at least one entry is too long:
df[rowMeans(nchar(as.matrix(df)) > allowed)>0,]
# A tibble: 2 x 3
person colour city
<chr> <chr> <chr>
1 Bob Green Paris
2 Mary Blue New York
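If the columns of df are already in the same order as len$field_name, as in your example, you can skip the allowed matrix (thanks to @zx8754 for pointing it out). A bare vector would be recycled down the columns rather than across them, so use sweep() to apply one limit per column:
df[rowMeans(sweep(nchar(as.matrix(df)), 2, len$field_length, ">")) > 0, ]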
# A tibble: 2 x 3
person colour city
<chr> <chr> <chr>
1 Bob Green Paris
2 Mary Blue New York

Pivot df into the same format as len and join the two. After this, it is trivial to compare each string to the field_length.
library(tidyverse)
test_result_df <- df %>%
mutate(id = row_number()) %>%
pivot_longer(-id, names_to = 'field_name') %>%
left_join(len, by = 'field_name') %>%
mutate(test_passed = str_length(value) <= field_length) %>%
group_by(id) %>%
summarise(all_passed = all(test_passed))
df[!test_result_df$all_passed,]
# A tibble: 2 x 3
person colour city
<chr> <chr> <chr>
1 Bob Green Paris
2 Mary Blue New York

R collapse rows by group with non-missing values when values are character

I'm trying to collapse/aggregate/summarise rows by group keeping only non-missing values, where values are characters. Here's a reproducible example.
df = data.frame(store = c("A","A", "B","B"),
item1=c("apple","","milk",""),
item2=c("","pear","","bread"))
df
store item1 item2
1 A apple
2 A pear
3 B milk
4 B bread
I hope to change df as the following
df2
store item1 item2
1 A apple pear
2 B milk bread
I've tried using summarise_all with nchar(.) > 0 as follows, but it doesn't seem to work.
df %>%
group_by(store) %>%
summarise_all( ~ + any(nchar(.) > 0))
Any comments would be appreciated!

You can do:
df %>%
group_by(store) %>%
summarise_all(~ .[nchar(.) > 0])
store item1 item2
<chr> <chr> <chr>
1 A apple pear
2 B milk bread

1) pivot Reshape to long form, remove the "" elements and reshape back.
library(dplyr)
library(tidyr)
df %>%
pivot_longer(-1) %>%
filter(value != "") %>%
pivot_wider
giving:
# A tibble: 2 x 3
store item1 item2
<fct> <fct> <fct>
1 A apple pear
2 B milk bread
2) max Another approach is to take the maximum value within group excluding NA's. This converts the item columns to character whereas the output of (1) is factor columns.
library(dplyr)
df %>%
group_by(store) %>%
summarize_all(~ max(as.character(.), na.rm = TRUE)) %>%
ungroup
giving:
# A tibble: 2 x 3
store item1 item2
<fct> <chr> <chr>
1 A apple pear
2 B milk bread
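For comparison, a base R sketch of the same collapse with aggregate(). The helper first_non_empty is a made-up name, and taking the first non-empty string per store is an assumption that suits the example, where each store has at most one value per column.

```r
stores <- data.frame(
  store = c("A", "A", "B", "B"),
  item1 = c("apple", "", "milk", ""),
  item2 = c("", "pear", "", "bread"),
  stringsAsFactors = FALSE
)

# Return the first non-empty string, or NA if there is none
first_non_empty <- function(x) {
  x <- x[!is.na(x) & nzchar(x)]
  if (length(x) > 0) x[1] else NA_character_
}

# Apply the helper to every item column within each store
collapsed <- aggregate(stores[c("item1", "item2")],
                       by = stores["store"],
                       FUN = first_non_empty)
collapsed
```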

R: counting distinct combinations found in a data frame where columns are interchangable

I'm not sure what this problem is even called. Let's say I'm counting distinct combinations of 2 columns, but I want distinct across the order of the two columns. Here's what I mean:
df = data.frame(fruit1 = c("apple", "orange", "orange", "banana", "kiwi"),
fruit2 = c("orange", "apple", "banana", "orange", "apple"),
stringsAsFactors = FALSE)
# What I want: total number of fruit combinations, regardless of
# which fruit comes first and which second.
# Eg 2 apple-orange, 2 banana-orange, 1 kiwi-apple
# What I know *doesn't* work:
table(df$fruit1, df$fruit2)
# What *does* work:
library(dplyr)
df %>% group_by(fruit1, fruit2) %>%
transmute(fruitA = sort(c(fruit1, fruit2))[1],
fruitB = sort(c(fruit1, fruit2))[2]) %>%
group_by(fruitA, fruitB) %>%
summarise(combinations = n())
I've got a way to make this work, as you can see, but is there a name for this general problem? It's sort of a combinatorics problem but counting, not generating combinations. And what if I had three or four columns of similar type? The above method is poorly generalizable. Tidyverse approaches most welcome!

Use apply and sort to order the values within each row, then group by all columns and count:
data.frame(t(apply(df,1,sort)))%>%group_by_all(.)%>%count()
# A tibble: 3 x 3
# Groups: X1, X2 [3]
X1 X2 n
<fctr> <fctr> <int>
1 apple kiwi 1
2 apple orange 2
3 banana orange 2

Here is an option using pmap with count
library(tidyverse)
library(rlang)
pmap_df(df, ~ sort(c(...)) %>%
as.list %>%
as_tibble %>%
set_names(names(df))) %>%
count(!!! rlang::syms(names(.)))
# A tibble: 3 x 3
# fruit1 fruit2 n
# <chr> <chr> <int>
#1 apple kiwi 1
#2 apple orange 2
#3 banana orange 2
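On the question of three or four columns: sorting within each row works for any number of columns. A base R sketch, assuming none of the values contains the "-" separator (baskets, key and combo_counts are illustrative names):

```r
baskets <- data.frame(
  fruit1 = c("apple", "orange", "orange", "banana", "kiwi"),
  fruit2 = c("orange", "apple", "banana", "orange", "apple"),
  stringsAsFactors = FALSE
)

# Sort the values within each row and paste them into one key, so that
# apple/orange and orange/apple produce the same string
key <- apply(baskets, 1, function(row) paste(sort(row), collapse = "-"))

# Count how often each key occurs
combo_counts <- as.data.frame(table(combination = key), stringsAsFactors = FALSE)
combo_counts
```

Adding a fruit3 or fruit4 column requires no change to the code, since apply() passes the whole row to sort().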
