I want to sum a subset of the categories contained in a single variable, with the data organized as tidy data in R.
It seems like it should be simple, but I can only think of a large number of lines of code to do it.
Here is an example:
df = data.frame(food = c("carbs", "protein", "apple", "pear"), value = c(10, 12, 4, 3))
df
food value
1 carbs 10
2 protein 12
3 apple 4
4 pear 3
I want the data frame to look like this (combining apple and pear into fruit):
food value
1 carbs 10
2 protein 12
3 fruit 7
The way I can think to do this is:
library(dplyr)
library(tidyr)
df %>%
spread(key = "food", value = "value") %>%
mutate(fruit = apple + pear) %>%
select(-c(apple, pear)) %>%
gather(key = "food", value = "value")
food value
1 carbs 10
2 protein 12
3 fruit 7
This seems too long for something so simple. I could also subset the data, sum the rows and then rbind, but that also seems laborious.
Any quicker options?
A factor can be recoded with forcats::fct_recode but this isn't necessarily shorter.
library(dplyr)
library(forcats)
df %>%
mutate(food = fct_recode(food, fruit = 'apple', fruit = 'pear')) %>%
group_by(food) %>%
summarise(value = sum(value))
## A tibble: 3 x 2
# food value
# <fct> <dbl>
#1 fruit 7
#2 carbs 10
#3 protein 12
Edit.
I will post the code from this comment here, since comments are deleted more often than answers. The result is the same as above.
df %>%
group_by(food = fct_recode(food, fruit = 'apple', fruit = 'pear')) %>%
summarise(value = sum(value))
What about:
df %>%
group_by(food = if_else(food %in% c("apple", "pear"), "fruit", food)) %>%
summarise_all(sum)
food value
<chr> <dbl>
1 carbs 10
2 fruit 7
3 protein 12
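For comparison, the same relabel-then-sum idea can be sketched in base R, without any packages. This is only a minimal sketch under the assumption that the categories to merge are known in advance; the name result is introduced here for illustration and is not from the answers above.

```r
df <- data.frame(food = c("carbs", "protein", "apple", "pear"),
                 value = c(10, 12, 4, 3))

# Relabel the categories to merge; every other label is left untouched.
# as.character() guards against food being a factor (the default before R 4.0).
df$food <- ifelse(df$food %in% c("apple", "pear"), "fruit", as.character(df$food))

# Sum value within each (relabelled) category
result <- aggregate(value ~ food, data = df, FUN = sum)
result
```

The pattern is the same as in the dplyr answers: first map the old labels onto the new one, then aggregate by the new label.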
Related
I have a large data frame that shows the distance between strings and their counts.
For example, in row 1 you see the distance between apple and pple, as well as the number of times I have counted apple (counts_col1 = 100) and the number of times I've counted pple (counts_col2 = 2).
library(tidyverse)
df <- tibble(col1 = c("apple","apple","pple", "banana", "banana","bananna"),
col2 = c("pple","app","app", "bananna", "banan", "banan"),
distance = c(1,2,3,1,1,2),
counts_col1 = c(100,100,2,200,200,2),
counts_col2 = c(2,50,50,2,20,20))
df
#> # A tibble: 6 × 5
#> col1 col2 distance counts_col1 counts_col2
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 apple pple 1 100 2
#> 2 apple app 2 100 50
#> 3 pple app 3 2 50
#> 4 banana bananna 1 200 2
#> 5 banana banan 1 200 20
#> 6 bananna banan 2 2 20
Created on 2022-03-15 by the reprex package (v2.0.1)
Now I want to cluster the apples and the bananas based on the string that has the maximum number of counts, which is the apple (100) and the banana (200).
I want my data to look somehow like this
cluster elements sum_counts
apple apple 152
NA pple NA
NA app NA
banana banana 222
NA bananna NA
NA banan NA
The format of the output does not have to be like this. I am really struggling to break down this problem and cluster the groups.
Any help or comments are really appreciated!
You can try using random walk clustering from igraph:
count_df <- data.table::melt(
data.table::as.data.table(df),
measure = list(c("col1", "col2"), c("counts_col1", "counts_col2")),
value.name = c("col", "counts")
) %>%
select(col, counts) %>%
unique()
df %>%
igraph::graph_from_data_frame(directed = FALSE) %>%
igraph::walktrap.community(weights = igraph::E(.)$distance) %>%
# igraph::components() %>%
igraph::membership() %>%
split(names(.), .) %>%
map_dfr(
~tibble(col = .x) %>%
semi_join(count_df, ., by = "col") %>%
arrange(desc(counts)) %>%
summarise(cluster = first(col), elements = list(col), sum_count = sum(counts))
)
cluster elements sum_count
1 apple apple, app, pple 152
2 banana banana, banan, bananna 222
This works on this toy example, but I think your example is too simple and probably does not reflect your real problem. It might be even easier if you are only interested in connected components (two words that are connected end up in the same cluster). In that case, replace walktrap.community with components.
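The connected-components variant could look roughly like the sketch below. It assumes that every pair listed in col1/col2 counts as an edge regardless of distance, and it keeps the per-string counts in a hand-written named vector (counts, copied from the question's data) instead of deriving them with melt; both are simplifications made for illustration.

```r
library(igraph)

edges <- data.frame(
  col1 = c("apple", "apple", "pple", "banana", "banana", "bananna"),
  col2 = c("pple", "app", "app", "bananna", "banan", "banan")
)
# One count per distinct string, taken from counts_col1 / counts_col2 in the question
counts <- c(apple = 100, pple = 2, app = 50, banana = 200, bananna = 2, banan = 20)

g <- graph_from_data_frame(edges, directed = FALSE)
memb <- components(g)$membership      # named vector: string -> component id
groups <- split(names(memb), memb)    # list of strings per component

# Label each component by its most frequent string and sum the counts
clusters <- do.call(rbind, lapply(groups, function(words) {
  data.frame(cluster = words[which.max(counts[words])],
             elements = paste(words, collapse = ", "),
             sum_counts = sum(counts[words]))
}))
clusters
```

Unlike walktrap, this ignores the distance column entirely, so on real data one long chain of near-matches would collapse into a single cluster.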
Here is one approach. I first add a group identifier for the sets (I presume you have this in your actual data). After reshaping to a longer dataset, I group by this id and identify the "word" that has the largest count. I then inner join the initial df with the resulting set of key rows (the words with the largest count), summarize, and rename. All the variants are pushed into a list column.
df <- df %>% mutate(id=c(1,1,1,2,2,2))
df %>% inner_join(
rbind(
df %>% select(id,distance,col=col1, counts=counts_col1),
df %>% select(id,distance,col=col2, counts=counts_col2)
) %>%
group_by(id) %>%
slice_max(counts) %>%
distinct(col),
by=c("col1"="col")
) %>%
group_by(col1) %>%
summarize(variants = list(c(col2, cur_group()$col1)),
total = min(counts_col1) + sum(counts_col2)) %>%
rename_all(~c("cluster", "elements", "sum_counts"))
# A tibble: 2 x 3
cluster elements sum_counts
<chr> <list> <dbl>
1 apple <chr [3]> 152
2 banana <chr [3]> 222
A similar approach in data.table (also depends on having that id column)
setDT(df)
df[rbind(
df[,.(id,col=col1,counts=counts_col1)],
df[,.(id,col=col2,counts=counts_col2)]
)[order(-counts),.SD[1], by=id],on=.(col1=col)][
, .(elements=list(c(col2,.BY$cluster)),
sum_counts = min(counts_col1) + sum(counts_col2)),
by=.(cluster=col1)]
cluster elements sum_counts
<char> <list> <num>
1: banana bananna,banan,banana 222
2: apple pple,app,apple 152
I'm fairly new to R and am sure there's a way to do the following without using loops, which I'm more familiar with.
Take the following example where you have a bunch of names and fruits each person likes:
name <- c("Alice", "Bob")
preference <- list(c("apple", "pear"), c("banana", "apple"))
df <- as.data.frame(cbind(name, preference))
How do I convert it to the following?
apple <- c(1, 1)
pear <- c(1, 0)
banana <- c(0, 1)
df2 <- data.frame(name, apple, pear, banana)
My basic instinct is to first extract all the fruits then do a loop to check if each fruit is in each row's preference:
fruits <- unique(unlist(df$preference))
for (fruit in fruits) {
df <- df %>% rowwise %>% mutate("{fruit}" := fruit %in% preference)
}
This seems to work, but I'm pretty sure there's a better way to do this.
library(tidyverse)
df %>%
unnest(everything()) %>%
xtabs(~., .) %>%
as.data.frame.matrix() %>%
rownames_to_column('name')
name apple banana pear
1 Alice 1 0 1
2 Bob 1 1 0
With the tidyverse (assuming 'preference' is a list column), unnest 'preference' and then use pivot_wider to reshape back to wide format, with values_fn = length to count occurrences and values_fill = 0 for fruits a person does not like.
library(dplyr)
library(tidyr)
df %>%
unnest_longer(preference) %>%
pivot_wider(names_from = preference, values_from = preference,
values_fn = length, values_fill = 0)
-output
# A tibble: 2 × 4
name apple pear banana
<chr> <int> <int> <int>
1 Alice 1 1 0
2 Bob 1 0 1
data
df <- data.frame(name, preference = I(preference))
Another possible solution, based on tidyr::separate_rows and janitor::tabyl:
library(tidyverse)
df %>%
separate_rows(everything(), sep="(?<=\\w), (?=\\w)") %>%
janitor::tabyl(name, preference)
#> name apple banana pear
#> Alice 1 0 1
#> Bob 1 1 0
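If you would rather avoid extra packages, a loop-free base R version is also possible. The following is only a sketch that assumes preference is a plain list of character vectors, as in the question; the names fruits and indicator are introduced here for illustration.

```r
name <- c("Alice", "Bob")
preference <- list(c("apple", "pear"), c("banana", "apple"))

# All distinct fruits, in order of first appearance
fruits <- unique(unlist(preference))

# For each person, flag which of the fruits appear in their preference.
# sapply() returns a fruits x people matrix, so transpose it.
indicator <- t(sapply(preference, function(p) as.integer(fruits %in% p)))
colnames(indicator) <- fruits

df2 <- data.frame(name, indicator)
df2
```

This yields one 0/1 column per fruit, in the order apple, pear, banana.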
I have a data frame resembling this structure:
Name 2021-01-01 2021-01-02 2021-01-03
Banana 5 23 23
Apple 90 2 15
Pear 39 7 18
The actual dataframe has dates spanning a much larger period of time.
How do I aggregate the columns together so that each column represents a week, with the data from each day being summed to form the weekly value? Giving something like this:
Name 2021-01-01 2021-01-08 2021-01-15
Banana 50 23 62
Apple 34 34 81
Pear 13 18 29
I've looked at the aggregate function but it doesn't seem quite right for this purpose.
I found a nice solution from which I learnt a lot. R really is powerful. After the edit, the column names of the output are the start dates of the respective weeks, see below.
Data
example <- data.frame(Name = "Banana",
"2021-01-01" = 1,
"2021-01-02" = 3,
"2021-01-10" = 2,
"2021-02-02" = 3)
> example
Name X2021.01.01 X2021.01.02 X2021.01.10 X2021.02.02
1 Banana 1 3 2 3
Code
library(dplyr)
out <- example %>%
tidyr::pivot_longer(cols = c(-Name)) %>%
mutate(Name2 = as.Date(name, format = "X%Y.%m.%d")) %>%
mutate(week = lubridate::week(Name2)) %>%
group_by(week) %>%
mutate(Sum = sum(value)) %>%
mutate(Dates = lubridate::ymd("2021-01-01") + lubridate::weeks(week - 1)) %>%
ungroup %>%
select(-name, -value, -Name2, -week) %>%
group_by_all %>%
unique %>%
tidyr::pivot_wider(id_cols = Name, values_from = Sum, names_from = Dates)
Output
# A tibble: 1 x 4
# Groups: Name [1]
Name `2021-01-01` `2021-01-08` `2021-01-29`
<chr> <dbl> <dbl> <dbl>
1 Banana 4 2 3
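A variation that does not depend on lubridate::week is to bin the dates into 7-day blocks counted from the earliest date. The sketch below is base R only and rests on a few assumptions: the date columns are named in ISO format (created with check.names = FALSE so the names are not mangled), weeks start at the first date in the data, and the Apple row is made-up data added so that there is more than one row.

```r
example <- data.frame(
  Name = c("Banana", "Apple"),
  "2021-01-01" = c(1, 2),
  "2021-01-02" = c(3, 0),
  "2021-01-10" = c(2, 5),
  "2021-02-02" = c(3, 1),
  check.names = FALSE
)

dates <- as.Date(names(example)[-1])

# Start date of the 7-day block each column falls into
offset_days <- as.integer(dates - min(dates))
week_start <- format(min(dates) + 7 * (offset_days %/% 7))

# Sum the day columns belonging to each block, row by row
weekly <- sapply(split(seq_along(dates), week_start),
                 function(idx) rowSums(example[-1][idx]))

result <- data.frame(Name = example$Name, weekly, check.names = FALSE)
result
```

Note that sapply only returns a matrix here because there are at least two rows; with a single-row data frame the result would need to be reshaped first.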
I have a large df like the one below. Using the terms of this made-up df, I want to know which id holds the same fruit for the longest period of time in a biennial event, i.e. the opportunity to hold a fruit only occurs every other year.
df<-data.frame("id"=c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3),
"Year"=c(1981, 1981, 1985, 2011, 2011, 2013, 2015, 1921, 1923, 1955),
"fruit"=c("banana", "apple", "banana", "orange", "melon", "orange",
"orange", "melon", "melon", "melon"))
I have tried different kinds of group_by and cumsum; see below.
df<-df %>% mutate(year_diff=cumsum(c(1, diff(df$Year)>1)))
df %>% group_by(id, fruit) %>% filter(year_diff==2)
And the one below (after reloading the df)
df %>% group_by(id, fruit) %>% mutate(year_diff=cumsum(c(1, diff(df$Year)>1)))
And played around with:
df %>% group_by(id, fruit) %>% mutate(summarise(n_years=n_distinct(Year)))
In the end I ideally want a tibble like the one below arranging the id's (with their fruits) in order of who have the most consecutive "holds" of a fruit in the events (over time). Remember that the event only takes place every other year.
id fruit occurence
2 orange 3
3 melon 2
1 banana 1
1 apple 1
2 melon 1
3 melon 1
I understand that there are several steps.
EDIT:
Maybe there is a way to modify this:
df %>% group_by(id, fruit) %>% summarise(n_years=n_distinct(Year)) %>% arrange(desc(n_years)) %>% ungroup()
so that it creates a new column in the original tibble (which I am unable to do, but you might be), and then I can filter consecutive events?
Using dplyr we arrange rows by id, fruit and Year and create a new grouping variable (group) showing consecutive years for each id and fruit and then count the number of rows in each group.
library(dplyr)
df %>%
arrange(id, fruit, Year) %>%
group_by(id, fruit, group = cumsum(c(2, diff(Year)) != 2)) %>%
summarise(n = n()) %>%
ungroup() %>%
select(-group)
# id fruit n
# <dbl> <fct> <int>
#1 1 apple 1
#2 1 banana 1
#3 1 banana 1
#4 2 melon 1
#5 2 orange 3
#6 3 melon 2
#7 3 melon 1
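The run-detection step can also be written out explicitly in base R, which makes it clearer when a new run starts. This is a sketch, not the answer's method: it assumes a run is broken whenever the id changes, the fruit changes, or the gap to the previous row is not exactly 2 years; the names new_run, run and out are illustrative.

```r
df <- data.frame(id = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3),
                 Year = c(1981, 1981, 1985, 2011, 2011, 2013, 2015, 1921, 1923, 1955),
                 fruit = c("banana", "apple", "banana", "orange", "melon", "orange",
                           "orange", "melon", "melon", "melon"),
                 stringsAsFactors = FALSE)

df <- df[order(df$id, df$fruit, df$Year), ]
n <- nrow(df)

# A new run starts on the first row, or when id/fruit changes, or when
# the gap to the previous row is not exactly two years
new_run <- c(TRUE,
             df$id[-1] != df$id[-n] |
             df$fruit[-1] != df$fruit[-n] |
             diff(df$Year) != 2)
df$run <- cumsum(new_run)

# Count the rows in each run and sort the longest runs first
out <- aggregate(Year ~ id + fruit + run, data = df, FUN = length)
out <- out[order(-out$Year), c("id", "fruit", "Year")]
names(out)[3] <- "occurrence"
out
```

The first row of out is id 2 with orange and three consecutive holds, followed by id 3 with melon and two.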
I'm not sure what this problem is even called. Let's say I'm counting distinct combinations of 2 columns, but I want distinct across the order of the two columns. Here's what I mean:
df = data.frame(fruit1 = c("apple", "orange", "orange", "banana", "kiwi"),
fruit2 = c("orange", "apple", "banana", "orange", "apple"),
stringsAsFactors = FALSE)
# What I want: total number of fruit combinations, regardless of
# which fruit comes first and which second.
# Eg 2 apple-orange, 2 banana-orange, 1 kiwi-apple
# What I know *doesn't* work:
table(df$fruit1, df$fruit2)
# What *does* work:
library(dplyr)
df %>% group_by(fruit1, fruit2) %>%
transmute(fruitA = sort(c(fruit1, fruit2))[1],
fruitB = sort(c(fruit1, fruit2))[2]) %>%
group_by(fruitA, fruitB) %>%
summarise(combinations = n())
I've got a way to make this work, as you can see, but is there a name for this general problem? It's sort of a combinatorics problem but counting, not generating combinations. And what if I had three or four columns of similar type? The above method is poorly generalizable. Tidyverse approaches most welcome!
Use apply with sort to put the values of each row in a consistent order, then group by all columns and count:
data.frame(t(apply(df,1,sort)))%>%group_by_all(.)%>%count()
# A tibble: 3 x 3
# Groups: X1, X2 [3]
X1 X2 n
<fctr> <fctr> <int>
1 apple kiwi 1
2 apple orange 2
3 banana orange 2
Here is an option using pmap with count
library(tidyverse)
library(rlang)
pmap_df(df, ~ sort(c(...)) %>%
as.list %>%
as_tibble %>%
set_names(names(df))) %>%
count(!!! rlang::syms(names(.)))
# A tibble: 3 x 3
# fruit1 fruit2 n
# <chr> <chr> <int>
#1 apple kiwi 1
#2 apple orange 2
#3 banana orange 2
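For three or four columns, one way to generalize is to collapse each sorted row into a single key and tabulate the keys. The sketch below is base R and assumes that all columns are character and that "-" never occurs inside a value; the names key and counts are illustrative.

```r
df <- data.frame(fruit1 = c("apple", "orange", "orange", "banana", "kiwi"),
                 fruit2 = c("orange", "apple", "banana", "orange", "apple"),
                 stringsAsFactors = FALSE)

# Sort the values within each row and paste them into one order-independent key.
# This works unchanged for any number of columns.
key <- apply(df, 1, function(row) paste(sort(row), collapse = "-"))

# Count how often each unordered combination occurs
counts <- table(key)
counts
```

Because the key is built from the sorted row, apple-orange and orange-apple fall into the same bucket.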