Using group_by to count difference in values - r

I have a large df like the one below, where I want to know (using the terms of the made-up df) which id has held the same fruit for the longest period of time in this biennial event, i.e. the opportunity to hold a fruit only occurs every other year.
df <- data.frame("id" = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3),
                 "Year" = c(1981, 1981, 1985, 2011, 2011, 2013, 2015, 1921, 1923, 1955),
                 "fruit" = c("banana", "apple", "banana", "orange", "melon", "orange",
                             "orange", "melon", "melon", "melon"))
I have tried different kinds of group_by and cumsum, see below.
df<-df %>% mutate(year_diff=cumsum(c(1, diff(df$Year)>1)))
df %>% group_by(id, fruit) %>% filter(year_diff==2)
And the one below (after reloading the df)
df %>% group_by(id, fruit) %>% mutate(year_diff=cumsum(c(1, diff(df$Year)>1)))
And played around with:
df %>% group_by(id, fruit) %>% mutate(summarise(n_years=n_distinct(Year)))
In the end I would ideally like a tibble like the one below, arranging the ids (with their fruits) in order of who has the most consecutive "holds" of a fruit across the events (over time). Remember that the event only takes place every other year.
id fruit  occurrence
2  orange 3
3  melon  2
1  banana 1
1  apple  1
2  melon  1
3  melon  1
I understand that there are several steps.
EDIT:
Maybe there is a way to modify this:
df %>% group_by(id, fruit) %>% summarise(n_years=n_distinct(Year)) %>% arrange(desc(n_years)) %>% ungroup()
so that it creates a new column in the original tibble (which I am unable to do, but you might be), and then I can filter consecutive events?

Using dplyr, we arrange the rows by id, fruit and Year, create a new grouping variable (group) that starts a new run whenever the gap between successive Years is not exactly two (i.e. the hold was not kept at the next event), and then count the number of rows in each group.
library(dplyr)
df %>%
  arrange(id, fruit, Year) %>%
  group_by(id, fruit, group = cumsum(c(2, diff(Year)) != 2)) %>%
  summarise(n = n()) %>%
  ungroup() %>%
  select(-group)
# id fruit n
# <dbl> <fct> <int>
#1 1 apple 1
#2 1 banana 1
#3 1 banana 1
#4 2 melon 1
#5 2 orange 3
#6 3 melon 2
#7 3 melon 1
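If you also want the rows ordered with the longest consecutive hold first, as in the desired output, one option is to append an arrange(desc(n)) step to the same pipeline (a sketch building on the answer above, not part of the original answer):
df %>%
  arrange(id, fruit, Year) %>%
  group_by(id, fruit, group = cumsum(c(2, diff(Year)) != 2)) %>%
  summarise(n = n()) %>%
  ungroup() %>%
  select(-group) %>%
  arrange(desc(n))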

Related

Turning vectors of strings in a dataframe into categorical variables in R

I'm fairly new to R and am sure there's a way to do the following without using loops, which I'm more familiar with.
Take the following example where you have a bunch of names and fruits each person likes:
name <- c("Alice", "Bob")
preference <- list(c("apple", "pear"), c("banana", "apple"))
df <- as.data.frame(cbind(name, preference))
How do I convert it to the following?
apple <- c(1, 1)
pear <- c(1, 0)
banana <- c(0, 1)
df2 <- data.frame(name, apple, pear, banana)
My basic instinct is to first extract all the fruits then do a loop to check if each fruit is in each row's preference:
fruits <- unique(unlist(df$preference))
for (fruit in fruits) {
  df <- df %>% rowwise() %>% mutate("{fruit}" := fruit %in% preference)
}
This seems to work, but I'm pretty sure there's a better way to do this.
One option is to unnest the list column, build a contingency table with xtabs, and convert it back to a data frame:
library(dplyr)
library(tidyr)
library(tibble)
df %>%
  unnest(everything()) %>%
  xtabs(~., .) %>%
  as.data.frame.matrix() %>%
  rownames_to_column('name')
name apple banana pear
1 Alice 1 0 1
2 Bob 1 1 0
In tidyverse (assuming 'preference' is a list column), unnest 'preference' with unnest_longer and then use pivot_wider to reshape back to 'wide' format, with values_fn = length.
library(dplyr)
library(tidyr)
df %>%
  unnest_longer(preference) %>%
  pivot_wider(names_from = preference, values_from = preference,
              values_fn = length, values_fill = 0)
Output
# A tibble: 2 × 4
name apple pear banana
<chr> <int> <int> <int>
1 Alice 1 1 0
2 Bob 1 0 1
Data
df <- data.frame(name, preference = I(preference))
Another possible solution, based on tidyr::separate_rows and janitor::tabyl:
library(tidyverse)
df %>%
  separate_rows(everything(), sep = "(?<=\\w), (?=\\w)") %>%
  janitor::tabyl(name, preference)
#> name apple banana pear
#> Alice 1 0 1
#> Bob 1 1 0

Sum data frame rows according to column date

I have a data frame resembling this structure:
Name 2021-01-01 2021-01-02 2021-01-03
Banana 5 23 23
Apple 90 2 15
Pear 39 7 18
The actual dataframe has dates spanning a much larger period of time.
How do I aggregate the columns together so that each column represents a week, with the data from each day being summed to form the weekly value? Giving something like this:
Name 2021-01-01 2021-01-08 2021-01-15
Banana 50 23 62
Apple 34 34 81
Pear 13 18 29
I've looked at the aggregate function but it doesn't seem quite right for this purpose.
I found a nice solution from which I learnt a lot; R really is powerful. After the edit, the output now uses the dates of the start of the respective weeks as column names, see below.
Data
example <- data.frame(Name = "Banana",
                      "2021-01-01" = 1,
                      "2021-01-02" = 3,
                      "2021-01-10" = 2,
                      "2021-02-02" = 3)
> example
Name X2021.01.01 X2021.01.02 X2021.01.10 X2021.02.02
1 Banana 1 3 2 3
Code
library(dplyr)

out <- example %>%
  tidyr::pivot_longer(cols = c(-Name)) %>%
  mutate(Name2 = as.Date(name, format = "X%Y.%m.%d")) %>%
  mutate(week = lubridate::week(Name2)) %>%
  group_by(week) %>%
  mutate(Sum = sum(value)) %>%
  mutate(Dates = lubridate::ymd("2021-01-01") + lubridate::weeks(week - 1)) %>%
  ungroup() %>%
  select(-name, -value, -Name2, -week) %>%
  group_by_all() %>%
  unique() %>%
  tidyr::pivot_wider(id_cols = Name, values_from = Sum, names_from = Dates)
Output
# A tibble: 1 x 4
# Groups: Name [1]
Name `2021-01-01` `2021-01-08` `2021-01-29`
<chr> <dbl> <dbl> <dbl>
1 Banana 4 2 3
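For what it's worth, the same week logic can be written a bit more compactly by letting summarise() do the collapsing instead of mutate() plus unique(); this is a sketch along the same lines, not the original answer:
library(dplyr)
library(tidyr)
library(lubridate)

example %>%
  pivot_longer(-Name, names_to = "day", values_to = "value") %>%
  mutate(day = as.Date(day, format = "X%Y.%m.%d"),
         week_start = ymd("2021-01-01") + weeks(week(day) - 1)) %>%
  group_by(Name, week_start) %>%
  summarise(value = sum(value), .groups = "drop") %>%
  pivot_wider(names_from = week_start, values_from = value)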

Sum subset of a variable for tidy data r

I want to sum a subset of categories contained within a single variable, organized as tidy data in r.
It seems like it should be simple, but I can only think of a large number of lines of code to do it.
Here is an example:
df = data.frame(food = c("carbs", "protein", "apple", "pear"), value = c(10, 12, 4, 3))
df
food value
1 carbs 10
2 protein 12
3 apple 4
4 pear 3
I want the data frame to look like this (combining apple and pear into fruit):
food value
1 carbs 10
2 protein 12
3 fruit 7
The way I can think of doing this is:
library(dplyr)
library(tidyr)
df %>%
  spread(key = "food", value = "value") %>%
  mutate(fruit = apple + pear) %>%
  select(-c(apple, pear)) %>%
  gather(key = "food", value = "value")
food value
1 carbs 10
2 protein 12
3 fruit 7
This seems too long for something so simple. I could also subset the data, sum the rows and then rbind, but that also seems laborious.
Any quicker options?
A factor can be recoded with forcats::fct_recode but this isn't necessarily shorter.
library(dplyr)
library(forcats)
df %>%
  mutate(food = fct_recode(food, fruit = 'apple', fruit = 'pear')) %>%
  group_by(food) %>%
  summarise(value = sum(value))
## A tibble: 3 x 2
# food value
# <fct> <dbl>
#1 fruit 7
#2 carbs 10
#3 protein 12
Edit: I will post the code from the comments here, since comments are deleted more often than answers. The result is the same as above.
df %>%
  group_by(food = fct_recode(food, fruit = 'apple', fruit = 'pear')) %>%
  summarise(value = sum(value))
What about:
df %>%
  group_by(food = if_else(food %in% c("apple", "pear"), "fruit", food)) %>%
  summarise_all(sum)
food value
<chr> <dbl>
1 carbs 10
2 fruit 7
3 protein 12

dplyr: include all elements in filter list, even if not in data set

df1
Row Taste Quantity
#1 Vanilla 3
#2 Chocolate 1
#3 Strawberry 6
I would like to filter the data frame with a c(...) vector that contains more flavors. But if a flavor in the vector doesn't exist in the Taste column, I would like a new row to be added for it.
df1 %>% filter(Taste %in% c("Chocolate", "Strawberry", "Banana"))
but this only returns the chocolate and strawberry rows. I would like it to return:
Row Taste Quantity
#2 Chocolate 1
#3 Strawberry 6
#4 Banana 0 (or could be NA)
Is there a way to append the items in the list to the results even if the data doesn't exist in df1?
# example data frame
df = read.table(text = "
Row Taste Quantity
1 Vanilla 3
2 Chocolate 1
3 Strawberry 6
", header=T)
# vector of tastes to have in output
taste_vector = c("Chocolate", "Strawberry", "Banana")
library(dplyr)
data.frame(taste_vector) %>%                           # start with the vector of tastes you want to have
  left_join(df, by = c("taste_vector" = "Taste")) %>%  # join original data to see what was found and what wasn't
  mutate(Row = ifelse(is.na(Row), max(Row, na.rm = T) + cumsum(is.na(Row)), Row)) # update Row column
# taste_vector Row Quantity
# 1 Chocolate 2 1
# 2 Strawberry 3 6
# 3 Banana 4 NA
You can add mutate(Quantity = coalesce(Quantity, 0L)) if you don't want NAs in your Quantity column.
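In full, that would look roughly like this (a sketch of the same pipeline with the coalesce() step appended):
data.frame(taste_vector) %>%
  left_join(df, by = c("taste_vector" = "Taste")) %>%
  mutate(Row = ifelse(is.na(Row), max(Row, na.rm = TRUE) + cumsum(is.na(Row)), Row),
         Quantity = coalesce(Quantity, 0L))  # replace NA quantities with 0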
Using tidyverse (dplyr, forcats and tidyr)
First create a filter object (filter_vals) with the values you want to filter on. In a mutate (assuming the variable is not yet a factor), we turn Taste into a factor and expand its levels with the values from the filter object. Next we use complete to expand the data.frame with the missing factor levels, filling the empty quantities with 0. Finally we filter the data.frame with the filter object.
library(tidyverse)
filter_vals <- c("Chocolate", "Strawberry", "Banana")
df1 %>%
  mutate(Taste = as_factor(Taste),
         Taste = fct_expand(Taste, filter_vals)) %>%
  complete(Taste, fill = list(Quantity = 0)) %>%
  filter(Taste %in% filter_vals)
# A tibble: 3 x 2
Taste Quantity
<fct> <dbl>
1 Chocolate 1
2 Strawberry 6
3 Banana 0

R: counting distinct combinations found in a data frame where columns are interchangable

I'm not sure what this problem is even called. Let's say I'm counting distinct combinations of 2 columns, but I want distinct across the order of the two columns. Here's what I mean:
df = data.frame(fruit1 = c("apple", "orange", "orange", "banana", "kiwi"),
                fruit2 = c("orange", "apple", "banana", "orange", "apple"),
                stringsAsFactors = FALSE)
# What I want: total number of fruit combinations, regardless of
# which fruit comes first and which second.
# Eg 2 apple-orange, 2 banana-orange, 1 kiwi-apple
# What I know *doesn't* work:
table(df$fruit1, df$fruit2)
# What *does* work:
library(dplyr)
df %>%
  group_by(fruit1, fruit2) %>%
  transmute(fruitA = sort(c(fruit1, fruit2))[1],
            fruitB = sort(c(fruit1, fruit2))[2]) %>%
  group_by(fruitA, fruitB) %>%
  summarise(combinations = n())
I've got a way to make this work, as you can see, but is there a name for this general problem? It's sort of a combinatorics problem, but counting rather than generating combinations. And what if I had three or four columns of similar type? The above method generalizes poorly. Tidyverse approaches most welcome!
Sort each row with apply, then group by all columns and count:
data.frame(t(apply(df, 1, sort))) %>% group_by_all() %>% count()
# A tibble: 3 x 3
# Groups: X1, X2 [3]
X1 X2 n
<fctr> <fctr> <int>
1 apple kiwi 1
2 apple orange 2
3 banana orange 2
Here is an option using pmap with count
library(tidyverse)
library(rlang)
pmap_df(df, ~ sort(c(...)) %>%
          as.list %>%
          as_tibble %>%
          set_names(names(df))) %>%
  count(!!! rlang::syms(names(.)))
# A tibble: 3 x 3
# fruit1 fruit2 n
# <chr> <chr> <int>
#1 apple kiwi 1
#2 apple orange 2
#3 banana orange 2
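Regarding the follow-up about three or four columns of similar type: both answers above generalize, because the sort happens within each row rather than over a fixed pair of columns. A small sketch with a hypothetical third column (not part of the original data):
library(dplyr)

df3 <- data.frame(fruit1 = c("apple", "orange"),
                  fruit2 = c("orange", "apple"),
                  fruit3 = c("kiwi", "kiwi"),
                  stringsAsFactors = FALSE)

data.frame(t(apply(df3, 1, sort))) %>%
  group_by_all() %>%
  count()
# both rows collapse to the same apple / kiwi / orange combination, so n = 2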
