How to add additional columns using dplyr group_by function in R?

This question is a follow-up to an answer I received on an earlier post.
Data
df1 <- structure(list(Date = c("6/24/2020", "6/24/2020", "6/24/2020",
"6/24/2020", "6/25/2020", "6/25/2020"), Market = c("A", "A",
"A", "A", "A", "A"), Salesman = c("MF", "RP", "RP", "FR", "MF",
"MF"), Product = c("Apple", "Apple", "Banana", "Orange", "Apple",
"Banana"), Quantity = c(20L, 15L, 20L, 20L, 10L, 15L), Price = c(1L,
1L, 2L, 3L, 1L, 1L), Cost = c(0.5, 0.5, 0.5, 0.5, 0.6, 0.6)),
class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))
Solution
library(dplyr) # 1.0.0
library(tidyr)
df1 %>%
group_by(Date, Market) %>%
group_by(Revenue = c(Quantity %*% Price),
TotalCost = c(Quantity %*% Cost),
Product, .add = TRUE) %>%
summarise(Sold = sum(Quantity)) %>%
pivot_wider(names_from = Product, values_from = Sold)
# A tibble: 2 x 7
# Groups: Date, Market, Revenue, TotalCost [2]
# Date Market Revenue TotalCost Apple Banana Orange
# <chr> <chr> <dbl> <dbl> <int> <int> <int>
#1 6/24/2020 A 135 37.5 35 20 20
#2 6/25/2020 A 25 15 10 15 NA
akrun's solution works well. Now I'd like to know how to add three more columns for quantity sold by salesmen to the existing results so the final output will look like this:
Date       Market  Revenue  Total Cost  Apples Sold  Bananas Sold  Oranges Sold  MF  RP  FR
6/24/2020  A       135      37.5        35           20            20            20  35  20
6/25/2020  A       25       15          15           25            NA            25  NA  NA

One option is to do the two grouped summaries separately, since they work on different columns, and then join the results on the common columns, i.e. 'Date' and 'Market'.
library(dplyr)
library(tidyr)
out1 <- df1 %>%
group_by(Date, Market) %>%
group_by(Revenue = c(Quantity %*% Price),
TotalCost = c(Quantity %*% Cost),
Product, .add = TRUE) %>%
summarise(Sold = sum(Quantity)) %>%
pivot_wider(names_from = Product, values_from = Sold)
out2 <- df1 %>%
group_by(Date, Market, Salesman) %>%
summarise(SalesSold = sum(Quantity)) %>%
pivot_wider(names_from = Salesman, values_from = SalesSold)
left_join(out1, out2, by = c("Date", "Market"))
# A tibble: 2 x 10
# Groups: Date, Market, Revenue, TotalCost [2]
# Date Market Revenue TotalCost Apple Banana Orange FR MF RP
# <chr> <chr> <dbl> <dbl> <int> <int> <int> <int> <int> <int>
#1 6/24/2020 A 135 37.5 35 20 20 20 20 35
#2 6/25/2020 A 25 15 10 15 NA NA 25 NA
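The `c(Quantity %*% Price)` terms in the pipelines above are dot products: within each Date/Market group they multiply the two columns row by row and add up the results, so they should match `sum(Quantity * Price)`. A minimal base R check on the 6/24/2020 rows of `df1` (the vector names `quantity`, `price` and `cost` are introduced only for this illustration):

```r
# Rows of df1 with Date == "6/24/2020" and Market == "A"
quantity <- c(20L, 15L, 20L, 20L)
price    <- c(1L, 1L, 2L, 3L)
cost     <- c(0.5, 0.5, 0.5, 0.5)

# %*% returns a 1 x 1 matrix; c() drops it to a plain number
revenue    <- c(quantity %*% price)
total_cost <- c(quantity %*% cost)

revenue     # 135
total_cost  # 37.5
```

Because the result is a single value per Date/Market group, using it as a grouping variable does not split the group any further; it only carries the value through `summarise()`.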

Related

Expand table of counts to a dataframe

Given a table of counts specified in 'dat', I would like to create a dataframe with 3 columns (race, grp and outcome) and 206 rows. The variable outcome would be 1 for 'ascertained' and 0 for 'missed'.
dat <- structure(list(race = structure(c(1L, 2L, 1L, 2L), levels = c("black",
"nonblack"), class = "factor"), grp = structure(c(1L, 1L, 2L,
2L), levels = c("hbpm", "uc"), class = "factor"), ascertained = c(63,
32, 24, 21), missed = c(5, 3, 49, 9), total = c(68, 35, 73, 30
)), class = "data.frame", row.names = c(NA, -4L))
1) For each row set race in the output to that race, grp in the output to that group and then generate the appropriate number of 1s and 0s for outcome. The result is 206 x 3.
library(dplyr)
dat %>%
rowwise %>%
summarize(race = race, grp = grp, outcome = rep(1:0, c(ascertained, missed)))
2) In the example data there are no duplicate race/grp combinations, and if that is true in general then it can alternatively be written as:
dat %>%
group_by(race, grp) %>%
summarize(outcome = rep(1:0, c(ascertained, missed)), .groups = "drop")
3) A base R solution would be the following. If each combination of race/grp occurs on only one row of the input then 1:nrow(dat) could optionally be replaced with dat[1:2].
do.call("rbind",
by(dat,
1:nrow(dat),
with,
data.frame(race = race, grp = grp, outcome = rep(1:0, c(ascertained, missed)))
)
)
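For comparison, the same expansion can be sketched in base R without `by()`, by repeating row indices. This is an illustrative variant rather than part of the answer above; the names `idx`, `outcome` and `expanded` are made up for the example, and `dat` is rebuilt without the `total` column so the block runs on its own.

```r
dat <- data.frame(
  race = factor(c("black", "nonblack", "black", "nonblack")),
  grp = factor(c("hbpm", "hbpm", "uc", "uc")),
  ascertained = c(63, 32, 24, 21),
  missed = c(5, 3, 49, 9)
)

# Repeat each row index once per individual counted in that row
idx <- rep(seq_len(nrow(dat)), times = dat$ascertained + dat$missed)

# For each input row: `ascertained` ones followed by `missed` zeros
outcome <- unlist(Map(function(a, m) rep(1:0, times = c(a, m)),
                      dat$ascertained, dat$missed))

expanded <- data.frame(dat[idx, c("race", "grp")],
                       outcome = outcome,
                       row.names = NULL)
dim(expanded)  # 206 3
```

The row indices and the outcome vector are built in the same row order, so each block of ones and zeros lines up with its race/grp combination.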
How about this:
library(tidyverse)
dat <- structure(list(race = structure(c(1L, 2L, 1L, 2L), levels = c("black",
"nonblack"), class = "factor"), grp = structure(c(1L, 1L, 2L,
2L), levels = c("hbpm", "uc"), class = "factor"), ascertained = c(63,
32, 24, 21), missed = c(5, 3, 49, 9), total = c(68, 35, 73, 30
)), class = "data.frame", row.names = c(NA, -4L))
dat2 <- dat %>% select(-total) %>%
pivot_longer(c(ascertained, missed), names_to = "var", values_to="vals") %>%
uncount(vals) %>%
mutate(outcome = case_when(var == "ascertained" ~ 1,
TRUE ~ 0)) %>%
select(-var)
head(dat2)
#> # A tibble: 6 × 3
#> race grp outcome
#> <fct> <fct> <dbl>
#> 1 black hbpm 1
#> 2 black hbpm 1
#> 3 black hbpm 1
#> 4 black hbpm 1
#> 5 black hbpm 1
#> 6 black hbpm 1
dat2 %>%
group_by(race, grp, outcome) %>%
tally()
#> # A tibble: 8 × 4
#> # Groups: race, grp [4]
#> race grp outcome n
#> <fct> <fct> <dbl> <int>
#> 1 black hbpm 0 5
#> 2 black hbpm 1 63
#> 3 black uc 0 49
#> 4 black uc 1 24
#> 5 nonblack hbpm 0 3
#> 6 nonblack hbpm 1 32
#> 7 nonblack uc 0 9
#> 8 nonblack uc 1 21
This is based partially on the linked question from Limey in the comments:
library(tidyverse)
bind_rows(
dat %>% uncount(ascertained) %>% mutate(outcome = 1) %>% select(-missed, -total),
dat %>% uncount(missed) %>% mutate(outcome = 0) %>% select(-ascertained, -total)
)
Here is a relatively simple answer that is based, in part, on the answer suggested in a comment, but adapted to work for your problem, since you need multiple "uncounts". This answer uses functions from the packages tibble, dplyr, and tidyr. These are all in the tidyverse.
The exact method is to create two sub-lists, one listing out the "ascertained", and one listing out the "missed", formatting the ascertained column as you wanted, and then mashing these two together with a basic tibble::add_row.
The relevant code is:
library(tidyverse)
dat2 <- uncount(dat, ascertained, .remove = FALSE) %>%
  mutate(ascertained = 1) %>%
  select(-missed)
dat3 <- uncount(dat, missed, .remove = TRUE) %>%
  mutate(ascertained = 0)
dat4 <- add_row(dat2, dat3) %>% select(-total) %>%
  rename(outcome = ascertained)
dat4 should be the data as you asked for it. I would suggest also generating an id column to make things easier to work with, but obviously that is up to you.

group categories according to whether they meet a sequence of conditions with tidyverse

I have a problem on how to recategorize a variable according to whether it meets a certain condition or not. That is, if the category does not meet the criteria, it is assigned to another category that does.
My data has the following form:
data = data.frame(firm_size = c("Micro", "Small", "Medium","Big"),
employees = c(5,10,100,1000))
> data
firm_size employees
1 Micro 5
2 Small 10
3 Medium 100
4 Big 1000
So, if my condition is that companies with 10 or fewer employees must be grouped together and combined into a single category, the result should be:
> new_data
firm_size employees
1 Micro-Small 15
3 Medium 100
4 Big 1000
What I'm trying to do is write a function that generalizes this procedure, for example, that also works if my data is
> data
firm_size employees
1 Micro 5
2 Small 8
3 Medium 9
4 Big 1000
> new_data
firm_size employees
1 Micro-Small-Medium 22
4 Big 1000
I think that this can be done with the tools of the tidyverse.
Thanks in advance
Here's an approach with tally:
library(dplyr)
size <- 10
data %>%
arrange(firm_size,desc(employees)) %>%
group_by(firm_size = c(as.character(firm_size[employees > size]),
rep(paste(firm_size[employees <= size], collapse = "-"),
sum(employees <= size)))) %>%
tally(employees, name = "employees")
## A tibble: 3 x 2
# firm_size employees
# <chr> <dbl>
#1 Big 1000
#2 Medium 100
#3 Small-Micro 15
And for your second set of data:
data2 %>%
arrange(firm_size,desc(employees)) %>%
group_by(firm_size = c(as.character(firm_size[employees > size]),
rep(paste(firm_size[employees <= size], collapse = "-"),
sum(employees <= size)))) %>%
tally(employees, name = "employees")
## A tibble: 2 x 2
# firm_size employees
# <chr> <int>
#1 Big 1000
#2 Medium-Small-Micro 22
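The grouping above builds the new labels by position, so it appears to rely on `arrange()` putting the firms above the cutoff first. A base R sketch that assigns the combined label row by row, and so should not depend on row order, could look like this (the names `small`, `label` and `new_data` are illustrative, and the data is rebuilt so the block runs on its own):

```r
data <- data.frame(
  firm_size = factor(c("Micro", "Small", "Medium", "Big")),
  employees = c(5, 10, 100, 1000)
)
size <- 10

# Flag the firms at or below the cutoff
small <- data$employees <= size

# Every small firm gets the same combined label; the others keep their own
label <- ifelse(small,
                paste(data$firm_size[small], collapse = "-"),
                as.character(data$firm_size))

# Sum employees within each label
new_data <- aggregate(employees ~ label,
                      data = data.frame(label, employees = data$employees),
                      FUN = sum)
new_data
```

Here the combined label keeps the firms in their original row order ("Micro-Small"), and `aggregate()` returns the groups sorted by label.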
Data
data <- structure(list(firm_size = structure(c(3L, 4L, 2L, 1L), .Label = c("Big",
"Medium", "Micro", "Small"), class = "factor"), employees = c(5,
10, 100, 1000)), class = "data.frame", row.names = c(NA, -4L))
data2 <- structure(list(firm_size = structure(c(3L, 4L, 2L, 1L), .Label = c("Big",
"Medium", "Micro", "Small"), class = "factor"), employees = c(5L,
8L, 9L, 1000L)), class = "data.frame", row.names = c("1", "2",
"3", "4"))
You can use the great forcats package
library(tidyverse)
data <- data.frame(
firm_size = c("Micro", "Small", "Medium", "Big", "Small"),
employees = c(5, 10, 100, 1000, 10)
)
# If you need n groups
data %>%
mutate(firm_size2 = firm_size %>% as_factor() %>% fct_lump(n = 2, w = employees)) %>%
group_by(firm_size2) %>%
summarise(sum_emp = sum(employees),.groups = "drop")
#> # A tibble: 3 x 2
#> firm_size2 sum_emp
#> <fct> <dbl>
#> 1 Medium 100
#> 2 Big 1000
#> 3 Other 25
# If you need at least x on the sum of a vector
data %>%
mutate(firm_size2 = firm_size %>% as_factor() %>% fct_lump_min(min = 10, w = employees)) %>%
group_by(firm_size2) %>%
summarise(sum_emp = sum(employees),.groups = "drop")
#> # A tibble: 4 x 2
#> firm_size2 sum_emp
#> <fct> <dbl>
#> 1 Small 20
#> 2 Medium 100
#> 3 Big 1000
#> 4 Other 5
Created on 2020-06-11 by the reprex package (v0.3.0)
Yet another solution, set into a custom function:
library(tidyverse)
mymerge <- function(dat, min) {
  # Collapse all firms at or below the cutoff into a single row
  merged_dat <- dat %>%
    filter(employees <= min) %>%
    summarize(firm_size = str_flatten(firm_size, collapse = " - "),
              employees = sum(employees))
  # Keep the firms above the cutoff and append the merged row
  dat %>%
    filter(employees > min) %>%
    bind_rows(merged_dat)
}
mymerge(data, 30)
firm_size employees
1 Medium 100
2 Big 1000
3 Micro - Small 15
mymerge(data, 300)
firm_size employees
1 Big 1000
2 Micro - Small - Medium 115

How do I cast data into non-equi columns?

I have a dataset of events, grouped by let like so:
set.seed(3)
events <- data.frame(
let = rep(LETTERS[1:2], each=3),
age = c(0,sample(1:20, size=2),
0,sample(1:20, size=2)),
value = sample(1:100, size=6))
let age value
1 A 0 61
2 A 4 60
3 A 16 13
4 B 0 29
5 B 8 56
6 B 7 99
How can I cast the data frame so that age is multiple columns grouped into weeks? So for each column, take the value of the largest age that is less than or equal to 0, 7, 14, 21 days.
events.cast <- data.frame(
let = LETTERS[1:2],
T0_value = c(61,29),
T1_value = c(60,99),
T2_value = c(60,56),
T3_value = c(13,56))
let T0_value T1_value T2_value T3_value
1 A 61 60 60 13
2 B 29 99 56 56
One option is to cut the 'age' into buckets, get the max row by that group and 'let', then reshape into 'wide' format
library(dplyr)
library(tidyr)
library(stringr)
events %>%
group_by(grp = cut(age, breaks = c(-Inf,0, 7, 14, 21),
labels = str_c("T", 0:3, "_value")), let) %>%
slice(which.max(value)) %>%
ungroup %>%
select(-age) %>%
group_by(let) %>%
complete(grp = unique(.$grp)) %>%
fill(value) %>%
pivot_wider(names_from = grp, values_from = value)
# A tibble: 2 x 5
# Groups: let [2]
# let T0_value T1_value T2_value T3_value
# <chr> <int> <int> <int> <int>
#1 A 61 60 60 13
#2 B 29 99 56 56
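The rule stated in the question ("take the value of the largest age that is less than or equal to 0, 7, 14, 21 days") can also be written out directly. The following base R sketch is an illustration under that reading of the rule, not the method used above; `cutoffs`, `value_at` and `events_cast` are names chosen for the example.

```r
events <- data.frame(
  let = c("A", "A", "A", "B", "B", "B"),
  age = c(0, 4, 16, 0, 8, 7),
  value = c(61, 60, 13, 29, 56, 99)
)

# Upper age limit (in days) for each output column
cutoffs <- c(T0_value = 0, T1_value = 7, T2_value = 14, T3_value = 21)

# Value recorded at the largest age that does not exceed the cutoff
value_at <- function(cutoff, d) {
  ok <- d[d$age <= cutoff, ]
  if (nrow(ok) == 0) return(NA_real_)
  ok$value[which.max(ok$age)]
}

# One named vector of values per group, then bind them into a wide table
rows <- lapply(split(events, events$let),
               function(d) sapply(cutoffs, value_at, d = d))
events_cast <- data.frame(let = names(rows),
                          do.call(rbind, rows),
                          row.names = NULL)
events_cast
```

This picks the row by age rather than by value, which matters only if a group has more than one event inside the same week.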
data
events <- structure(list(let = c("A", "A", "A", "B", "B", "B"), age = c(0L,
4L, 16L, 0L, 8L, 7L), value = c(61L, 60L, 13L, 29L, 56L, 99L)),
class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))

creating distinct values column till certain time

I have a question about how to count unique values up to a certain point in time. For example, I want to know how many unique locations a person has lived in up to that point.
created <- c(2009, 2010, 2010, 2011, 2012, 2011)
person <- c("A", "A", "A", "A", "B", "B")
location <- c("London", "Geneva", "London", "New York", "London", "London")
df <- data.frame(created, person, location)
I want to create a variable called unique that counts how many distinct places the person has lived in up to that point in time. I have tried the following; the last line shows the result I am after. Any suggestions?
library(dplyr)
df %>% group_by(person, location) %>% arrange(created, .by_group = TRUE) %>% mutate(unique = distinct(location))
unique <- c(1, 2, 2, 3, 1, 1)
One way is to use cumsum and duplicated
library(dplyr)
df %>% group_by(person) %>% mutate(unique = cumsum(!duplicated(location)))
# created person location unique
# <dbl> <fct> <fct> <int>
#1 2009 A London 1
#2 2010 A Geneva 2
#3 2010 A London 2
#4 2011 A New York 3
#5 2012 B London 1
#6 2011 B London 1
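To see why this works: `duplicated()` flags every repeat of a value already seen, so `!duplicated()` is TRUE only the first time a location appears, and `cumsum()` keeps a running count of those first appearances. A small base R sketch (the name `loc_a` is illustrative; the rows are sorted by year first, on the assumption that "up to that point" should follow `created`):

```r
df <- data.frame(
  created = c(2009, 2010, 2010, 2011, 2012, 2011),
  person = c("A", "A", "A", "A", "B", "B"),
  location = c("London", "Geneva", "London", "New York", "London", "London")
)

# Step by step for person A
loc_a <- df$location[df$person == "A"]
!duplicated(loc_a)          # TRUE TRUE FALSE TRUE
cumsum(!duplicated(loc_a))  # 1 2 2 3

# Same idea per person, after sorting by person and year
df <- df[order(df$person, df$created), ]
df$unique <- ave(seq_along(df$location), df$person,
                 FUN = function(i) cumsum(!duplicated(df$location[i])))
df
```

Passing row indices to `ave()` keeps the result an integer vector whether `location` is stored as character or as a factor.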
We can use cummax
library(dplyr)
df %>%
group_by(person) %>%
mutate(unique = cummax(match(location, unique(location))))
# A tibble: 6 x 4
# Groups: person [2]
# created person location unique
# <dbl> <fct> <fct> <int>
#1 2009 A London 1
#2 2010 A Geneva 2
#3 2010 A London 2
#4 2011 A New York 3
#5 2012 B London 1
#6 2011 B London 1
Or with base R
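# as.integer(factor(...)) gives ave() an integer vector to fill, whether location is a factor or character
df$unique <- with(df, ave(as.integer(factor(location)), person, FUN =
function(x) cummax(match(x, unique(x)))))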
data
df <- structure(list(created = c(2009, 2010, 2010, 2011, 2012, 2011
), person = structure(c(1L, 1L, 1L, 1L, 2L, 2L), .Label = c("A",
"B"), class = "factor"), location = structure(c(2L, 1L, 2L, 3L,
2L, 2L), .Label = c("Geneva", "London", "New York"), class = "factor")),
class = "data.frame", row.names = c(NA,
-6L))

Calculating cumulative proportion sales product count

I have a dataframe with sales at the Ppg and product level. I want to find out how many products contribute to a particular percentage (e.g. 75%) of sales, like testing the Pareto principle.
The data is
df= structure(list(Ppg = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L), .Label = c("p1",
"p2"), class = "factor"), product = structure(c(1L, 2L, 3L, 4L,
1L, 2L, 3L), .Label = c("A", "B", "C", "D"), class = "factor"),
sales = c(50, 40, 30, 80, 100, 70, 30)), .Names = c("Ppg",
"product", "sales"), row.names = c(NA, -7L), class = "data.frame")
> df
Ppg product sales
1 p1 A 50
2 p1 B 40
3 p1 C 30
4 p1 D 80
5 p2 A 100
6 p2 B 70
7 p2 C 30
I retrieved the cumulative sum using dplyr
df %>% group_by(Ppg) %>% mutate(c1 = cumsum(sales))
Ppg product sales c1
<fctr> <fctr> <dbl> <dbl>
1 p1 A 50 50
2 p1 B 40 90
3 p1 C 30 120
4 p1 D 80 200
5 p2 A 100 100
6 p2 B 70 170
7 p2 C 30 200
Is there any way to
i) calculate the proportion of sales (based on the cumsum)
ii) find how many distinct products contributed toward a certain % of sales?
For example, for Ppg p1, 2 distinct products (A & B combined give 75% of sales).
So finally something like the output below would be ideal:
ppg Number_Products_towards_75%
p1 2
p2 1
Assuming you're fine using the order that the products are currently in to get your answer (since reordering the rows would give you different results):
For 1, you can get the result with an extra mutate. Just divide the cumulative sum by the sum of all the sales in that group:
df %>%
group_by(Ppg) %>%
mutate(c1 = cumsum(sales)) %>%
mutate(percent = c1 / sum(sales))
Gets you:
# A tibble: 7 x 5
# Groups: Ppg [2]
Ppg product sales c1 percent
<fctr> <fctr> <dbl> <dbl> <dbl>
1 p1 A 50.0 50.0 0.250
2 p1 B 40.0 90.0 0.450
3 p1 C 30.0 120 0.600
4 p1 D 80.0 200 1.00
5 p2 A 100 100 0.500
6 p2 B 70.0 170 0.850
7 p2 C 30.0 200 1.00
For 2, you could then use a mutate to add a column for if that product is below the threshold and summarize to count the products below the threshold (and then add one to the count since one more would get you over it).
threshold <- 0.5
df %>%
group_by(Ppg) %>%
mutate(c1 = cumsum(sales)) %>%
mutate(percent = c1 / sum(sales)) %>%
mutate(isbelowthreshold = percent < threshold) %>% # add a column for if it's below the threshold
summarize(count = sum(isbelowthreshold) + 1) # we need to add one since one extra product will put you over the threshold
gets you:
# A tibble: 2 x 2
Ppg count
<fctr> <dbl>
1 p1 3.00
2 p2 1.00
But again, this depends on the order of the products. Consider ordering them from highest to lowest sales first, for example:
df %>%
group_by(Ppg) %>%
arrange(Ppg, desc(sales))
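Putting the two ideas together, a base R sketch of the count after sorting each group from highest to lowest sales might look as follows. The 0.75 threshold is taken from the question; the name `n_products` is made up for this example, and the data is rebuilt so the block runs on its own.

```r
df <- data.frame(
  Ppg = c("p1", "p1", "p1", "p1", "p2", "p2", "p2"),
  product = c("A", "B", "C", "D", "A", "B", "C"),
  sales = c(50, 40, 30, 80, 100, 70, 30)
)
threshold <- 0.75

n_products <- sapply(split(df$sales, df$Ppg), function(s) {
  # Cumulative share of sales, largest products first
  share <- cumsum(sort(s, decreasing = TRUE)) / sum(s)
  # Position of the first product that brings the share to the threshold
  which(share >= threshold)[1]
})
n_products
# p1 p2
#  3  2
```

Taking the first position where the share reaches the threshold is equivalent to counting the products below the threshold and adding one, as in the dplyr version above.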

Resources