Summarise? Count occurrences in column based on another column - r

I believe this may have a simple solution but I'm having trouble describing what I need to do (and hence what to search for). I think I need the summarize function. My goal output is at the very bottom.
I'm trying to count the occurrences of a value within each unique value of another column. Here is an example df that hopefully illustrates what I need to do.
library(dplyr)
set.seed(1)
df <- tibble("name" = c(rep("dinah",2),rep("lucy",4),rep("sora",9)),
"meal" = c(rep(c("chicken","beef","fish"),5)),
"date" = seq(as.Date("1999/1/1"),as.Date("2000/1/1"),25),
"num.wins" = sample(0:30)[1:15])
Among other things, I'm trying to summarize (sum) the types of meals each name had using this data.
df
# A tibble: 15 x 4
name meal date num.wins
<chr> <chr> <date> <int>
1 dinah chicken 1999-01-01 8
2 dinah beef 1999-01-26 11
3 lucy fish 1999-02-20 16
4 lucy chicken 1999-03-17 25
5 lucy beef 1999-04-11 5
6 lucy fish 1999-05-06 23
7 sora chicken 1999-05-31 27
8 sora beef 1999-06-25 15
9 sora fish 1999-07-20 14
10 sora chicken 1999-08-14 1
11 sora beef 1999-09-08 4
12 sora fish 1999-10-03 3
13 sora chicken 1999-10-28 13
14 sora beef 1999-11-22 6
15 sora fish 1999-12-17 18
I've made progress with other calculations I'm interested in, below:
df %>%
group_by(name) %>%
summarise(count=n(),
medianDate=median(date),
life=(max(date)-min(date)),
wins=sum(num.wins))
# A tibble: 3 x 5
name count medianDate life wins
<chr> <int> <date> <time> <int>
1 dinah 2 1999-01-13 25 days 19
2 lucy 4 1999-03-29 75 days 69
3 sora 9 1999-09-08 200 days 101
My goal is to add an additional column for each type of food, and have the sum of the occurrences of that food displayed in each row, like so:
name count medianDate life wins chicken beef fish
1 dinah 2 1999-01-13 25 days 19 1 1 0
2 lucy 4 1999-03-29 75 days 69 1 1 2
3 sora 9 1999-09-08 200 days 101 3 3 3

Though older, and possibly on a deprecation path, reshape2::dcast does this nicely:
reshape2::dcast(df, name ~ meal)
# name beef chicken fish
# 1 dinah 1 1 0
# 2 lucy 1 1 2
# 3 sora 3 3 3
You can read the formula as rows ~ columns. By default, dcast aggregates the values in each cell using the length function, which gives exactly what you want: the count of each meal per name.
This can be easily joined to your summary data:
df %>%
group_by(name) %>%
summarise(count=n(),
medianDate=median(date),
life=(max(date)-min(date)),
wins=sum(num.wins)) %>%
left_join(reshape2::dcast(df, name ~ meal))
# # A tibble: 3 x 8
# name count medianDate life wins beef chicken fish
# <chr> <int> <date> <time> <int> <int> <int> <int>
# 1 dinah 2 1999-01-13 25 days 19 1 1 0
# 2 lucy 4 1999-03-29 75 days 69 1 1 2
# 3 sora 9 1999-09-08 200 days 101 3 3 3

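One option is to use table inside summarise as a list column, then unnest it and spread it to 'wide' format (the name1 column below appears to come from unnest renaming the duplicated name column, as older tidyr versions did):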
library(tidyverse)
df %>%
group_by(name) %>%
summarise(count=n(),
medianDate=median(date),
life=(max(date)-min(date)),
wins=sum(num.wins),
n = list(enframe(table(meal))) ) %>%
unnest %>%
spread(name1, value, fill = 0)
# A tibble: 3 x 8
# name count medianDate life wins beef chicken fish
# <chr> <int> <date> <time> <int> <dbl> <dbl> <dbl>
#1 dinah 2 1999-01-13 25 days 19 1 1 0
#2 lucy 4 1999-03-29 75 days 69 1 1 2
#3 sora 9 1999-09-08 200 days 101 3 3 3
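With more recent tidyr versions, where spread has been superseded, the same counts can probably be obtained with count and pivot_wider. This is only a sketch: it rebuilds just the name and meal columns of the example data, and meal_counts is a name made up for illustration.

```r
library(dplyr)
library(tidyr)

# Only the two columns needed for the counts (same values as the question's df)
df <- tibble(
  name = c(rep("dinah", 2), rep("lucy", 4), rep("sora", 9)),
  meal = rep(c("chicken", "beef", "fish"), 5)
)

meal_counts <- df %>%
  count(name, meal) %>%                      # one row per name/meal pair
  pivot_wider(names_from = meal,
              values_from = n,
              values_fill = list(n = 0L))    # pairs that never occur become 0

meal_counts
```

The result has one row per name, so it can be attached to the summary table with left_join(..., by = "name") in the same way as the dcast result.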

I'm not entirely sure why I'm getting the funky formatting for life, but I think this gets at your need for a count of the meal types.
df %>%
group_by(name) %>%
summarise(count=n(),
medianDate=median(date),
life=(max(date)-min(date)),
wins=sum(num.wins),
chicken = sum(meal == "chicken"),
beef = sum(meal == "beef"),
fish = sum(meal == "fish"))
# A tibble: 3 x 8
name count medianDate life wins chicken beef fish
<chr> <int> <date> <time> <int> <int> <int> <int>
1 dinah 2 1999-01-13 " 25 days" 19 1 1 0
2 lucy 4 1999-03-29 " 75 days" 69 1 1 2
3 sora 9 1999-09-08 200 days 101 3 3 3
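As for the formatting of life: subtracting two Dates returns a difftime object, so the padded strings are most likely just how that column type gets printed. If a plain number of days is enough, converting it should sidestep the issue. A minimal sketch, using two dates taken from the example (life_days is a hypothetical name):

```r
dates <- as.Date(c("1999-01-01", "1999-01-26"))

life <- max(dates) - min(dates)
class(life)  # difftime, not a plain number

# Convert to a plain numeric count of days
life_days <- as.numeric(life, units = "days")
life_days
```

Inside summarise this would be life = as.numeric(max(date) - min(date), units = "days").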

Related

R - Number of days since last event in another dataframe

I have the following two data frames:
> head(Reaction_per_park_per_day_3)
Park Date Type_1_2 Number_AC_events
<chr> <date> <chr> <int>
1 Beaverdam Flats 2018-09-25 0 1
2 Nosehill 64 ave 2018-09-26 0 1
3 Nosehill 64 ave 2018-09-26 0 1
4 Nosehill Macewin 2018-09-26 0 1
5 Crestmont 2018-09-27 0 2
6 Country Hills G.C. - Nose Creek 2018-09-28 0 1
> head(All_reports_per_month2)
Month Park Code Reports_per_month
<date> <chr> <chr> <dbl>
1 2018-09-29 Beaverdam Flats 1 1
2 2018-10-12 Nosehill 64 ave 2 1
3 2018-10-25 Nosehill 64 ave 1 2
4 2018-09-21 Crestmont 1 1
5 2018-09-29 Crestmont 2 1
I would like to add a "days since last AC event" column to All_reports_per_month2 that takes into account the date and park of the AC event as well as the date and park of the report. If the report date is prior to the first AC event in a given park, NA should appear. See the example below:
Month Park Code Reports_per_month Days_since_last_AC
<date> <chr> <chr> <dbl> <chr>
1 2018-09-29 Beaverdam Flats 1 1 4
2 2018-10-12 Nosehill 64 ave 2 1 16
3 2018-10-25 Nosehill 64 ave 1 2 29
4 2018-09-21 Crestmont 1 1 NA
5 2018-09-29 Crestmont 2 1 2
Any help would be appreciated!
This is a joining and filtering operation that will use the dplyr package.
# import the packages
library( dplyr )
# join the data tables and filter so that we are always looking back in time
All_reports_per_month2 %>%
left_join( Reaction_per_park_per_day_3, by="Park" ) %>%
filter( Date <= Month ) %>%
group_by( Park, Month ) %>%
summarize( Days_since_last_AC = Month - max(Date) )
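Note that filtering on Date <= Month after the join drops any report that has no earlier AC event, whereas the question asks for NA in that case. One way to keep those rows is to look up the latest prior event row by row. The sketch below is an assumption-laden illustration, not the only way: events and reports are cut-down copies of the two data frames, and days_since_last is a hypothetical helper.

```r
library(dplyr)

# Cut-down copies of the two data frames from the question
events <- tibble(
  Park = c("Beaverdam Flats", "Nosehill 64 ave", "Nosehill 64 ave", "Crestmont"),
  Date = as.Date(c("2018-09-25", "2018-09-26", "2018-09-26", "2018-09-27"))
)
reports <- tibble(
  Month = as.Date(c("2018-09-29", "2018-10-12", "2018-10-25",
                    "2018-09-21", "2018-09-29")),
  Park = c("Beaverdam Flats", "Nosehill 64 ave", "Nosehill 64 ave",
           "Crestmont", "Crestmont"),
  Code = c("1", "2", "1", "1", "2"),
  Reports_per_month = c(1, 1, 2, 1, 1)
)

# Hypothetical helper: days between `day` and the latest event in `park`
# on or before `day`; NA when the park has no such event
days_since_last <- function(park, day, events) {
  prior <- events$Date[events$Park == park & events$Date <= day]
  if (length(prior) == 0) NA_real_ else as.numeric(day - max(prior))
}

result <- reports %>%
  rowwise() %>%
  mutate(Days_since_last_AC = days_since_last(Park, Month, events)) %>%
  ungroup()

result
```

This keeps every report row and all of its original columns. For large tables a join-based version would likely be faster, but the row-wise lookup is easier to check by eye.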

Convert monthly pay data to weekly using complete and fill in dplyr

I have data on worker pay and some workers are paid monthly and others weekly. I would like to combine the data into a panel by worker and week (of year). To do that, I need to expand the monthly rows.
The data look like:
library(dplyr)
library(lubridate)  # for ymd() and week()
pay_data <- tibble(worker="Jim", start=ymd("2020-1-3"), end=ymd("2020-2-2"), rate=10, hours=50, wages=rate*hours) %>%
mutate(f_week=week(start), l_week=week(end))
# A tibble: 1 x 8
worker start end rate hours wages f_week l_week
<chr> <date> <date> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Jim 2020-01-03 2020-02-02 10 50 500 1 5
Is there a way to use complete, fill or any other dplyr function to get the data to look like the below?
# A tibble: 5 x 5
worker week rate hours wage
<chr> <int> <dbl> <dbl> <dbl>
1 Jim 1 10 50 500
2 Jim 2 10 50 500
3 Jim 3 10 50 500
4 Jim 4 10 50 500
5 Jim 5 10 50 500
(I would then of course divide the amounts to put them all in common units).
Thanks!
Another tidyverse way would be:
library(tidyverse)
pay_data %>%
mutate(week = map2(f_week, l_week, seq)) %>%
unnest(week) %>%
select(worker, rate:wages, week)
# worker rate hours wages week
# <chr> <dbl> <dbl> <dbl> <int>
#1 Jim 10 50 500 1
#2 Jim 10 50 500 2
#3 Jim 10 50 500 3
#4 Jim 10 50 500 4
#5 Jim 10 50 500 5
A tidyverse approach making use of tidyr::separate_rows may look like so. To make the data more interesting I added data for a second worker.
library(tidyverse)
tbl %>%
rowwise() %>%
mutate(weeks = paste(seq(f_week, l_week, by = 1), collapse = ", ")) %>%
ungroup() %>%
separate_rows(weeks) %>%
select(-ends_with("_week"), -start, -end)
#> # A tibble: 13 x 5
#> worker rate hours wages weeks
#> <chr> <int> <int> <int> <chr>
#> 1 Jim 10 50 500 1
#> 2 Jim 10 50 500 2
#> 3 Jim 10 50 500 3
#> 4 Jim 10 50 500 4
#> 5 Jim 10 50 500 5
#> 6 John 20 100 1000 1
#> 7 John 20 100 1000 2
#> 8 John 20 100 1000 3
#> 9 John 20 100 1000 4
#> 10 John 20 100 1000 5
#> 11 John 20 100 1000 6
#> 12 John 20 100 1000 7
#> 13 John 20 100 1000 8
DATA
tbl <- read.table(text="worker start end rate hours wages f_week l_week
1 Jim 2020-01-03 2020-02-02 10 50 500 1 5\n
2 John 2020-01-03 2020-02-02 20 100 1000 1 8", header = TRUE)
tbl
#> worker start end rate hours wages f_week l_week
#> 1 Jim 2020-01-03 2020-02-02 10 50 500 1 5
#> 2 John 2020-01-03 2020-02-02 20 100 1000 1 8
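Try this, which repeats each row once per week it covers:
#Code
n_weeks <- pay_data$l_week - pay_data$f_week + 1
weeks <- unlist(Map(seq, pay_data$f_week, pay_data$l_week))
pay_data <- pay_data[rep(seq_len(nrow(pay_data)), n_weeks),
c('worker','rate','hours','wages')]
pay_data$week <- weeks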
Output:
# A tibble: 5 x 5
worker rate hours wages week
<chr> <dbl> <dbl> <dbl> <int>
1 Jim 10 50 500 1
2 Jim 10 50 500 2
3 Jim 10 50 500 3
4 Jim 10 50 500 4
5 Jim 10 50 500 5
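For the follow-up step mentioned in the question (putting the amounts into weekly units), a possible sketch uses tidyr::uncount to repeat the rows and then divides by the number of weeks. The two-worker data, the n_weeks and k columns, and the even split across weeks are all assumptions made for illustration.

```r
library(dplyr)
library(tidyr)

# Two pay periods with the week numbers already computed
pay_data <- tibble(
  worker = c("Jim", "John"),
  rate   = c(10, 20),
  hours  = c(50, 100),
  wages  = rate * hours,
  f_week = c(1, 1),
  l_week = c(5, 8)
)

weekly <- pay_data %>%
  mutate(n_weeks = l_week - f_week + 1) %>%
  # repeat each row n_weeks times; k counts 1..n_weeks within each original row
  uncount(n_weeks, .remove = FALSE, .id = "k") %>%
  mutate(week  = f_week + k - 1,
         hours = hours / n_weeks,   # assumes hours are spread evenly
         wages = wages / n_weeks) %>%
  select(worker, week, rate, hours, wages)

weekly
```

Because k is generated per original row, this should also work when a worker has several pay periods.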

Group records with time interval overlap

I have a data frame (N = 16) that contains ID, w_from (date), and w_to (date). Each record represents a task.
Here’s the data in R.
ID <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2)
w_from <- c("2010-01-01","2010-01-05","2010-01-29","2010-01-29",
"2010-03-01","2010-03-15","2010-07-15","2010-09-10",
"2010-11-01","2010-11-30","2010-12-15","2010-12-31",
"2011-02-01","2012-04-01","2011-07-01","2011-07-01")
w_to <- c("2010-01-31","2010-01-15", "2010-02-13","2010-02-28",
"2010-03-16","2010-03-16","2010-08-14","2010-10-10",
"2010-12-01","2010-12-30","2010-12-20","2011-02-19",
"2011-03-23","2012-06-30","2011-07-31","2011-07-06")
df <- data.frame(ID, w_from, w_to)
df$w_from <- as.Date(df$w_from)
df$w_to <- as.Date(df$w_to)
I need to generate a group number, by ID, for the records whose time intervals overlap. As an example, and in general terms, if record#1 overlaps with record#2, and record#2 overlaps with record#3, then record#1, record#2, and record#3 all belong to the same group.
Also, if record#1 overlaps with record#2 and record#3, but record#2 doesn't overlap with record#3, then record#1, record#2, and record#3 still all belong to the same group.
In the example above and for ID=1, the first four records overlap.
Here is the final output:
Also, if this can be done using dplyr, that would be great!
Try this:
library(dplyr)
df %>%
group_by(ID) %>%
arrange(w_from) %>%
mutate(group = 1+cumsum(
cummax(lag(as.numeric(w_to), default = first(as.numeric(w_to)))) < as.numeric(w_from)))
# A tibble: 16 x 4
# Groups: ID [2]
ID w_from w_to group
<dbl> <date> <date> <dbl>
1 1 2010-01-01 2010-01-31 1
2 1 2010-01-05 2010-01-15 1
3 1 2010-01-29 2010-02-13 1
4 1 2010-01-29 2010-02-28 1
5 1 2010-03-01 2010-03-16 2
6 1 2010-03-15 2010-03-16 2
7 1 2010-07-15 2010-08-14 3
8 1 2010-09-10 2010-10-10 4
9 1 2010-11-01 2010-12-01 5
10 1 2010-11-30 2010-12-30 5
11 1 2010-12-15 2010-12-20 5
12 1 2010-12-31 2011-02-19 6
13 1 2011-02-01 2011-03-23 6
14 2 2011-07-01 2011-07-31 1
15 2 2011-07-01 2011-07-06 1
16 2 2012-04-01 2012-06-30 2
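To see why this works, it may help to split the one-liner into named steps. Within each ID, sorted by w_from, cummax tracks the latest end date among all earlier records, and a new group starts whenever a record begins after that date. The sketch below uses the first few ID 1 records from the question; tasks, latest_end and new_group are names invented for illustration.

```r
library(dplyr)

tasks <- tibble(
  ID     = c(1, 1, 1, 1),
  w_from = as.Date(c("2010-01-01", "2010-01-05", "2010-01-29", "2010-03-01")),
  w_to   = as.Date(c("2010-01-31", "2010-01-15", "2010-02-13", "2010-03-16"))
)

grouped <- tasks %>%
  arrange(ID, w_from) %>%
  group_by(ID) %>%
  mutate(
    # latest end date seen among the previous records of this ID
    latest_end = cummax(lag(as.numeric(w_to), default = first(as.numeric(w_to)))),
    # a record opens a new group if it starts after everything before it has ended
    new_group  = as.numeric(w_from) > latest_end,
    group      = 1 + cumsum(new_group)
  ) %>%
  ungroup()

grouped
```

With the strict comparison, a record that starts on the very day the previous ones end is still treated as overlapping; switching to >= would change that.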

Get difference with closest previous row in a group which meets criterion

I'm trying, for each row, to calculate the difference with the closest previous row belonging to the same group which meets a certain criterion.
Suppose I have the following dataframe:
s <- read.table(text = "Visit_num Patient Day Admitted
1 1 2015/01/01 Yes
2 1 2015/01/10 No
3 1 2015/01/15 Yes
4 1 2015/02/10 No
5 1 2015/03/08 Yes
6 2 2015/01/01 Yes
7 2 2015/04/01 No
8 2 2015/04/10 No
9 3 2015/04/01 No
10 3 2015/04/10 No", header = T, sep = "")
For each Visit_num and for each Patient, I'd like to get the difference with the closest row for which the patient was admitted (i.e. Yes). Note column day is ordered by day, and time unit for this example is days.
Here is what I wanted my dataframe to look like:
Visit_num Patient Day Admitted Diff_days
1 1 2015/01/01 Yes NA
2 1 2015/01/10 No 9
3 1 2015/01/15 Yes 14
4 1 2015/02/10 No 26
5 1 2015/03/08 Yes 52
6 2 2015/01/01 Yes NA
7 2 2015/04/01 No 90
8 2 2015/04/10 No 99
9 3 2015/04/01 No NA
10 3 2015/04/10 No NA
Any help is appreciated.
Here is an option with tidyverse. Convert 'Day' to Date class and arrange by 'Patient' and 'Day'. Then, grouped by 'Patient', get the difference between adjacent 'Day' values, create a group 'grp' that starts a new block after each 'Yes' in 'Admitted', and take the cumulative sum of 'Diff_days' within each block:
library(tidyverse)
library(lubridate)  # ymd(); not attached by older tidyverse versions
s %>%
mutate(Day = ymd(Day)) %>%
arrange(Patient, Day) %>%
group_by(Patient) %>%
mutate(Diff_days = c(NA, diff(Day))) %>%
group_by(grp = cumsum(lag(Admitted == "Yes", default = TRUE)), .add = TRUE) %>%  # `add` was renamed `.add` in dplyr 1.0.0
mutate(Diff_days = cumsum(replace_na(Diff_days, 0))) %>%
ungroup %>%
select(-grp) %>%
mutate(Diff_days = na_if(Diff_days, 0))
# A tibble: 8 x 5
# Visit_num Patient Day Admitted Diff_days
# <int> <int> <date> <fct> <dbl>
#1 1 1 2015-01-01 Yes NA
#2 2 1 2015-01-10 No 9
#3 3 1 2015-01-15 Yes 14
#4 4 1 2015-02-10 No 26
#5 5 1 2015-03-08 Yes 52
#6 6 2 2015-01-01 Yes NA
#7 7 2 2015-04-01 No 90
#8 8 2 2015-04-10 No 99
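An alternative that might be easier to follow is to carry the date of the most recent earlier admission forward and subtract it directly. It should also return NA for every visit of a patient with no admission at all, such as patient 3 in the question. This is a sketch under the assumption that Day has already been converted to Date; last_adm is a hypothetical helper column.

```r
library(dplyr)
library(tidyr)

s <- tibble(
  Visit_num = 1:10,
  Patient   = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3),
  Day       = as.Date(c("2015-01-01", "2015-01-10", "2015-01-15", "2015-02-10",
                        "2015-03-08", "2015-01-01", "2015-04-01", "2015-04-10",
                        "2015-04-01", "2015-04-10")),
  Admitted  = c("Yes", "No", "Yes", "No", "Yes", "Yes", "No", "No", "No", "No")
)

res <- s %>%
  arrange(Patient, Day) %>%
  group_by(Patient) %>%
  # date of the previous visit if it was an admission, otherwise NA
  mutate(last_adm = lag(if_else(Admitted == "Yes", Day, as.Date(NA)))) %>%
  # carry the most recent admission date forward within each patient
  fill(last_adm, .direction = "down") %>%
  mutate(Diff_days = as.numeric(Day - last_adm)) %>%
  ungroup() %>%
  select(-last_adm)

res
```

The lag makes sure a visit is never compared with itself, so an admission is measured against the previous admission rather than its own date.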

Create weekly cumulative totals from R dataframe

I have a dataframe that has launch weeks for products across markets. Here is a snapshot of the dataframe.
Prod_ID Market_Name START_WEEK
11044913000 PHOENIX, AZ 1397
11044913000 WEST TEX/NEW MEX 1206
11159402003 PORTLAND,OR 1188
11159402003 SEATTLE/TACOMA 1188
11159402003 SPOKANE 1195
11159410010 PORTLAND,OR 1186
11159410010 SALT LAKE CITY 1190
11159410010 SEATTLE/TACOMA 1186
11159410010 SPOKANE 1187
11159410010 WEST TEX/NEW MEX 1197
11159410014 PORTLAND,OR 1198
11159410014 SEATTLE/TACOMA 1239
I would like to create another dataframe that gives, for each Prod_ID, the cumulative number of markets the product has been launched in, week by week, for the first 6 weeks. For the above snippet of data, the output should look something like this.
Prod_ID Week1 Week2 Week3 Week4 Week5 Week6
11044913000 1 1 1 1 1 1
11159402003 2 2 2 2 2 2
11159410010 2 3 3 3 4 4
11159410014 1 1 1 1 1 1
For ease of display, I have shown the output only up to Week 6, but I need to track up to Week 12. Week is denoted by a 4-digit number in my dataset and is not in date format. Please note that not all products have the same starting week, so I need to infer the earliest week for a Prod_ID from the START_WEEK variable, and then identify the next 6 weeks to generate the total number of markets launched in each week.
Any help to do this is appreciated.
I think I understand your problem. Here is my shot. There are several phases to this solution.
The first step is to calculate the cumulative sum of markets for the weeks and the week number for each Prod_ID since they opened. This is done with the following code chunk.
df1 <- df %>%
group_by(Prod_ID, START_WEEK) %>%
count() %>%
arrange(Prod_ID, START_WEEK) %>%
ungroup() %>%
group_by(Prod_ID) %>%
mutate(tot_market = cumsum(n)) %>%
ungroup() %>%
group_by(Prod_ID) %>%
mutate(min_START_WEEK = min(START_WEEK)) %>%
mutate(week = START_WEEK - min_START_WEEK + 1)
df1
# # A tibble: 10 x 6
# # Groups: Prod_ID [4]
# Prod_ID START_WEEK n tot_market min_START_WEEK week
# <dbl> <int> <int> <int> <dbl> <dbl>
# 1 11044913000. 1206 1 1 1206. 1.
# 2 11044913000. 1397 1 2 1206. 192.
# 3 11159402003. 1188 2 2 1188. 1.
# 4 11159402003. 1195 1 3 1188. 8.
# 5 11159410010. 1186 2 2 1186. 1.
# 6 11159410010. 1187 1 3 1186. 2.
# 7 11159410010. 1190 1 4 1186. 5.
# 8 11159410010. 1197 1 5 1186. 12.
# 9 11159410014. 1198 1 1 1198. 1.
# 10 11159410014. 1239 1 2 1198. 42.
The second phase is to expand the week and Prod_ID to the maximum number of weeks in week.
df2 <- expand.grid(min(df1$week):max(df1$week), unique(df1$Prod_ID))
colnames(df2) <- c("week", "Prod_ID")
The third phase is done by merging df1 and df2 and using zoo::na.locf to fill the NA's in tot_market (total market) by Prod_ID with the preceding value.
df2 %>% left_join(df1) %>% select(-START_WEEK, -n, -min_START_WEEK) %>%
group_by(Prod_ID) %>%
arrange(Prod_ID, week) %>%
mutate(tot_market = zoo::na.locf(tot_market)) %>%
spread(week, tot_market) %>%
ungroup() %>%
mutate_at(vars(Prod_ID), as.character) %>%
rename_if(is.integer, function(x) paste0("Week", x))
# # A tibble: 4 x 193
# Prod_ID Week1 Week2 Week3 Week4 Week5 Week6 Week7 Week8 Week9 Week10 Week11
# <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
# 1 11044913000 1 1 1 1 1 1 1 1 1 1 1
# 2 11159402003 2 2 2 2 2 2 2 3 3 3 3
# 3 11159410010 2 3 3 3 4 4 4 4 4 4 4
# 4 11159410014 1 1 1 1 1 1 1 1 1 1 1
# # ... with 181 more variables
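If only the first few weeks are needed, a shorter route may be to pair every launch with each week offset and count how many launches had already happened by then. The sketch below is one possible way, not a drop-in replacement: launches is a copy of the question's snippet with Prod_ID kept as character, n_weeks is a hypothetical parameter (set it to 12 for the full window), and base merge is used for the cross join.

```r
library(dplyr)
library(tidyr)

launches <- tibble(
  Prod_ID = c("11044913000", "11044913000",
              "11159402003", "11159402003", "11159402003",
              "11159410010", "11159410010", "11159410010",
              "11159410010", "11159410010",
              "11159410014", "11159410014"),
  START_WEEK = c(1397, 1206, 1188, 1188, 1195,
                 1186, 1190, 1186, 1187, 1197, 1198, 1239)
)

n_weeks <- 6  # number of weeks to track after each product's first launch

weekly <- launches %>%
  group_by(Prod_ID) %>%
  mutate(rel_week = START_WEEK - min(START_WEEK) + 1) %>%  # 1 = first launch week
  ungroup() %>%
  merge(data.frame(week = 1:n_weeks)) %>%  # no shared columns, so a cross join
  group_by(Prod_ID, week) %>%
  summarise(markets = sum(rel_week <= week), .groups = "drop") %>%
  pivot_wider(names_from = week, values_from = markets, names_prefix = "Week")

weekly
```

This avoids building the full grid out to the largest week offset (193 columns in the output above) and needs no NA filling.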
