Use dplyr to summarize but preserve date of group row - r

I have a data frame like the following:
Date Flare Painmed_Use
1 2015-12-01 0 0
2 2015-12-02 0 0
3 2015-12-03 0 0
4 2015-12-04 0 0
5 2015-12-05 0 0
6 2015-12-06 0 1
7 2015-12-07 1 4
8 2015-12-08 1 3
9 2015-12-09 1 1
10 2015-12-10 1 0
11 2015-12-11 0 0
12 2015-12-12 0 0
13 2015-12-13 1 2
14 2015-12-14 1 3
15 2015-12-15 1 1
16 2015-12-16 0 0
I'm trying to find the length of each flare as well as the total med use during each flare using dplyr. My current solution (inspired by Use rle to group by runs when using dplyr),
df %>%
group_by(yy = {yy = rle(Flare); rep(seq_along(yy$lengths), yy$lengths)}, Flare) %>%
summarize(Painmed_UseCum = sum(Painmed_Use),FlareLength = n())
gives the following output:
yy Flare Painmed_UseCum FlareLength
<int> <int> <dbl> <int>
1 1 0 1 6
2 2 1 8 4
3 3 0 0 2
4 4 1 6 3
5 5 0 0 1
This is almost exactly what I need. However, I can't figure out how to preserve other columns, the critical one being the date that corresponds to the last row of a particular flare. So, the output I'm seeking is the same as above but with the addition of the Dates, like so:
Date yy Flare Painmed_UseCum FlareLength
<date> <int> <int> <dbl> <int>
1 2015-12-06 1 0 1 6
2 2015-12-10 2 1 8 4
3 2015-12-12 3 0 0 2
4 2015-12-15 4 1 6 3
5 2015-12-16 5 0 0 1
Note: In some ways this is a follow up from a previous question of mine (R code to get max count of time series data by group) but my attempt to keep that question simpler, though perhaps useful to others, ended up necessitating this further question.

You could either include Date in summarise, keeping Flare in the grouping so that it survives the summary:
library(dplyr)
df %>%
  group_by(yy = {yy = rle(Flare); rep(seq_along(yy$lengths), yy$lengths)}, Flare) %>%
  summarize(Painmed_UseCum = sum(Painmed_Use), FlareLength = n(), Date = max(Date))
#      yy Flare Painmed_UseCum FlareLength Date
#   <int> <int>          <dbl>       <int> <date>
# 1     1     0              1           6 2015-12-06
# 2     2     1              8           4 2015-12-10
# 3     3     0              0           2 2015-12-12
# 4     4     1              6           3 2015-12-15
# 5     5     0              0           1 2015-12-16
Or, if there are more columns to preserve, a better approach is to use mutate and then keep the last row in each group:
df %>%
  group_by(yy = {yy = rle(Flare); rep(seq_along(yy$lengths), yy$lengths)}) %>%
  mutate(Painmed_UseCum = sum(Painmed_Use), FlareLength = n()) %>%
  slice(n())
To create groups, we can replace rle with rleid from data.table which would be simpler.
group_by(yy = data.table::rleid(Flare))
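For a version that can be run end to end, here is a minimal sketch that rebuilds the sample data and swaps the rle construction for dplyr's own consecutive_id(). It assumes dplyr 1.1.0 or later (where consecutive_id() was added) and that Date is stored as a Date; the data frame below is typed in from the question, and the name flares is only illustrative.

```r
library(dplyr)

# Sample data from the question, rebuilt by hand
df <- data.frame(
  Date = seq(as.Date("2015-12-01"), as.Date("2015-12-16"), by = "day"),
  Flare = c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0),
  Painmed_Use = c(0, 0, 0, 0, 0, 1, 4, 3, 1, 0, 0, 0, 2, 3, 1, 0)
)

flares <- df %>%
  # consecutive_id() gives each run of identical Flare values its own id
  group_by(yy = consecutive_id(Flare), Flare) %>%
  summarise(
    Painmed_UseCum = sum(Painmed_Use),
    FlareLength = n(),
    Date = last(Date),   # date of the last row in the run
    .groups = "drop"
  )

print(flares)
```

last(Date) relies on the rows already being in date order; max(Date) gives the same result here and does not depend on the ordering.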

Related

Create new columns based on 2 columns

So I have this kind of table df
Id Type QTY unit
1  A    5   1
2  B    10  2
3  C    5   3
2  A    10  4
3  B    5   5
1  C    10  6
I want to create this data frame df2
Id A_QTY A_unit B_QTY B_unit C_QTY C_unit
1  5     1      0     0      10    6
2  10    4      10    2      0     0
3  0     0      5     5      5     3
This means that I want to create new columns holding each Type's "QTY" and "unit" for each "Id". I was thinking of using a loop to first create a new column for each Type, to get something like this:
Id Type QTY unit A_QTY A_unit B_QTY B_unit C_QTY C_unit
1  A    5   1    5     1      0     0      0     0
2  B    10  2    0     0      10    2      0     0
3  C    5   3    0     0      0     0      5     3
2  A    10  4    10    4      0     0      0     0
3  B    5   5    0     0      5     5      0     0
1  C    10  6    0     0      0     0      10    6
and then using group_by() to aggregate them, resulting in df2. But I get stuck when it comes to creating the new columns. I have tried a for loop, but my level in R is still not that great. I can't manage to create new columns from those existing columns...
I'll appreciate any suggestions you have for me!
You can use pivot_wider from the tidyr package:
library(dplyr)
library(tidyr)
df %>%
  pivot_wider(names_from = "Type",              # Columns to get the names from
              values_from = c("QTY", "unit"),   # Columns to get the values from
              names_glue = "{Type}_{.value}",   # Column naming
              values_fill = 0,                  # Fill NAs with 0
              names_vary = "slowest")           # To get the right column ordering
output
# A tibble: 3 × 7
Id A_QTY A_unit B_QTY B_unit C_QTY C_unit
<int> <int> <int> <int> <int> <int> <int>
1 1 5 1 0 0 10 6
2 2 10 4 10 2 0 0
3 3 0 0 5 5 5 3
Alternatively, reshape to long format first, build the combined names, and then widen again:
library(tidyverse)
df %>%
  pivot_longer(-c(Id, Type)) %>%
  mutate(name = str_c(Type, name, sep = "_")) %>%
  select(-Type) %>%
  pivot_wider(names_from = "name", values_from = "value", values_fill = 0)
# A tibble: 3 × 7
Id A_QTY A_unit B_QTY B_unit C_QTY C_unit
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 5 1 0 0 10 6
2 2 10 4 10 2 0 0
3 3 0 0 5 5 5 3
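Since the question only shows df as a table, here is a small reproducible sketch that types the data in by hand and runs the pivot_wider call on it. It assumes tidyr 1.2.0 or later (for names_vary); the as.data.frame() step is an optional addition for readers who want a plain data frame back rather than a tibble.

```r
library(tidyr)

# df rebuilt by hand from the table in the question
df <- data.frame(
  Id = c(1, 2, 3, 2, 3, 1),
  Type = c("A", "B", "C", "A", "B", "C"),
  QTY = c(5, 10, 5, 10, 5, 10),
  unit = c(1, 2, 3, 4, 5, 6)
)

df2 <- pivot_wider(
  df,
  names_from = Type,
  values_from = c(QTY, unit),
  names_glue = "{Type}_{.value}",  # e.g. A_QTY, A_unit
  values_fill = 0,                 # missing Id/Type combinations become 0
  names_vary = "slowest"           # keep QTY and unit adjacent per Type
)

# Optional: convert the tibble back to a base data frame
df2 <- as.data.frame(df2)
print(df2)
```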

Time since last event of grouped data in R

I have a data frame that contains a grouping variable (ID), a date and an event column with numeric values, in which 0 represents no event and >0 represents an event. An example data frame can be generated with the following code:
df <- data.frame(ID = c (1, 1, 1, 1, 2, 2, 2),
date = as.Date(c("2014-08-03", "2014-08-04", "2014-08-07", "2014-08-10", "2015-07-01", "2015-07-03", "2015-08-01")),
event = c(1, 0, 3, 0, 0, 4, 0))
df
> df
ID date event
1 1 2014-08-03 1
2 1 2014-08-04 0
3 1 2014-08-07 3
4 1 2014-08-10 0
5 2 2015-07-01 0
6 2 2015-07-03 4
7 2 2015-08-01 0
Now, I want to calculate the time that has passed since the last event (>0) occurred. In the particular case that the first entry or entries for an ID contain no event, NA should be generated. My desired output would look like this:
> df
ID date event tae
1 1 2014-08-03 1 0
2 1 2014-08-04 0 1
3 1 2014-08-07 3 0
4 1 2014-08-10 0 3
5 2 2015-07-01 0 NA
6 2 2015-07-03 4 0
7 2 2015-08-01 0 29
I have tried several different approaches. The closest I got was this:
library(dplyr)
df %>%
mutate(tmpG = cumsum(c(FALSE, as.logical(diff(event))))) %>%
group_by(ID) %>%
mutate(tmp = c(0, diff(date)) * !event) %>%
group_by(tmpG) %>%
mutate(tae = cumsum(tmp)) %>%
ungroup() %>%
select(-c(tmp, tmpG))
# A tibble: 7 x 4
ID date event tae
<dbl> <date> <dbl> <dbl>
1 1 2014-08-03 1 0
2 1 2014-08-04 0 1
3 1 2014-08-07 3 0
4 1 2014-08-10 0 3
5 2 2015-07-01 0 3
6 2 2015-07-03 4 0
7 2 2015-08-01 0 29
Any suggestions on how to get that code running (or any other alternative) would be greatly appreciated.
Here is another tidyverse approach that uses fill to carry forward the date of the most recent event.
library(tidyverse)
df %>%
  group_by(ID) %>%
  mutate(last_event = if_else(event > 0, date, as.Date(NA))) %>%
  fill(last_event) %>%
  mutate(tae = as.numeric(date - last_event))
Output
ID date event last_event tae
<dbl> <date> <dbl> <date> <dbl>
1 1 2014-08-03 1 2014-08-03 0
2 1 2014-08-04 0 2014-08-03 1
3 1 2014-08-07 3 2014-08-07 0
4 1 2014-08-10 0 2014-08-07 3
5 2 2015-07-01 0 NA NA
6 2 2015-07-03 4 2015-07-03 0
7 2 2015-08-01 0 2015-07-03 29
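A shorter alternative takes the difference to the previous row. Note that it measures the gap to the previous row rather than to the last event, so it matches the desired output only when every event-free row directly follows an event row, as in the sample data:
df %>%
  group_by(ID) %>%
  mutate(tae = as.double(if_else(event == 0, date - lag(date), 0)))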
Output:
ID date event tae
<dbl> <date> <dbl> <dbl>
1 1 2014-08-03 1 0
2 1 2014-08-04 0 1
3 1 2014-08-07 3 0
4 1 2014-08-10 0 3
5 2 2015-07-01 0 NA
6 2 2015-07-03 4 0
7 2 2015-08-01 0 29
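To see where the two answers diverge, here is a small sketch on hypothetical data (df_gap is invented for illustration and is not from the question) in which one ID has two consecutive event-free rows. The column names tae_lag and tae_fill are likewise illustrative.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: an event followed by two event-free rows
df_gap <- data.frame(
  ID = c(1, 1, 1),
  date = as.Date(c("2014-08-03", "2014-08-04", "2014-08-07")),
  event = c(1, 0, 0)
)

res <- df_gap %>%
  group_by(ID) %>%
  mutate(
    # lag-based: gap to the previous row, whatever that row was
    tae_lag = if_else(event == 0, as.numeric(date - lag(date)), 0),
    # fill-based: remember the date of the most recent event
    last_event = if_else(event > 0, date, as.Date(NA))
  ) %>%
  fill(last_event) %>%
  mutate(tae_fill = as.numeric(date - last_event)) %>%
  ungroup()

print(res)
```

For the third row the lag-based column reports 3 days (the distance to 2014-08-04), while the fill-based column reports 4 days (the distance to the event on 2014-08-03), which is the quantity the question asks for.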

Calculated Column Based on Rows with Date Range

I have a dataframe as follows:
ID Col1   RespID Col3 Col4 Year Month Day
1  blue   729Ad  3.2  A    2021 April 2
2  orange 295gS  6.5  A    2021 April 1
3  red    729Ad  8.4  B    2021 April 20
4  yellow 592Jd  2.9  A    2021 March 12
5  green  937sa  3.5  B    2021 May   13
I would like to calculate a new column, Col5, such that its value is 1 if the row has a Col4 value of A and there exists another row somewhere in the dataset with the same RespID but a Col4 value of B. Otherwise its value is 0. Then I will drop all rows with a Col4 value of B, to keep just those with A. I'd also like to account for the date fields (Year, Month, Day) so that this is done within, say, a 30-day timeframe: if 'B' appears within 30 days of when 'A' appears in the dataset, only then is there a 1 (if 'B' appears only within 60 days, there is no 1). Additionally, I'd like to keep everything as data.frames.
Here is what the desired output table would look like prior to dropping rows with Col4 value of B:
ID Col1   RespID Col3 Col4 Col5
1  blue   729Ad  3.2  A    1
2  orange 295gS  6.5  A    0
3  red    729Ad  8.4  B    0
4  yellow 592Jd  2.9  A    0
5  green  937sa  3.5  B    0
I have found Ronak's solution in this thread (Calculated Column Based on Rows in Tidymodels Recipe) to be useful, but I would like to modify it to account for the date range.
A lot of things to unpack here.
I think you're tripping up over your own feet by trying to do too many things at once. I've broken down the code into four distinct steps to make the thought process easy to follow. Obviously, for use in a production environment it should be rewritten more efficiently.
1. Generate some data
library(tidyverse)
set.seed(42)
df <- tibble(
  id = c(1:10),
  resp_id = c(1701, seq(2286, 2289), 1701, seq(2290, 2293)),
  grouping = sample(c("A", "B"), size = 10, replace = TRUE),
  date = seq.Date(as.Date("2363-10-04"), as.Date("2363-11-17"), length.out = 10)
)
Resulting data:
# A tibble: 10 × 4
id resp_id grouping date
<int> <dbl> <chr> <date>
1 1 1701 A 2363-10-04
2 2 2286 A 2363-10-08
3 3 2287 A 2363-10-13
4 4 2288 A 2363-10-18
5 5 2289 B 2363-10-23
6 6 1701 B 2363-10-28
7 7 2290 B 2363-11-02
8 8 2291 B 2363-11-07
9 9 2292 A 2363-11-12
10 10 2293 B 2363-11-17
2. Check grouping
df <- df %>%
  mutate(
    is_a = ifelse(grouping == "A", 1, 0),
    is_b = ifelse(grouping == "B", 1, 0)
  )
We have the grouping now as easy-to-use dummy variables:
> df
# A tibble: 10 × 6
id resp_id grouping date is_a is_b
<int> <dbl> <chr> <date> <dbl> <dbl>
1 1 1701 A 2363-10-04 1 0
2 2 2286 A 2363-10-08 1 0
3 3 2287 A 2363-10-13 1 0
4 4 2288 A 2363-10-18 1 0
5 5 2289 B 2363-10-23 0 1
6 6 1701 B 2363-10-28 0 1
7 7 2290 B 2363-11-02 0 1
8 8 2291 B 2363-11-07 0 1
9 9 2292 A 2363-11-12 1 0
10 10 2293 B 2363-11-17 0 1
3. Check completeness
df <- df %>%
  group_by(
    resp_id
  ) %>%
  mutate(
    # Check if the grouping has both "A" and "B" values
    is_complete = ifelse(
      sum(is_a) > 0 & sum(is_b) > 0,
      1,
      0
    )
  ) %>%
  ungroup()
We see that there is only one resp_id value that is complete — 1701:
> df
# A tibble: 10 × 7
id resp_id grouping date is_a is_b is_complete
<int> <dbl> <chr> <date> <dbl> <dbl> <dbl>
1 1 1701 A 2363-10-04 1 0 1
2 2 2286 A 2363-10-08 1 0 0
3 3 2287 A 2363-10-13 1 0 0
4 4 2288 A 2363-10-18 1 0 0
5 5 2289 B 2363-10-23 0 1 0
6 6 1701 B 2363-10-28 0 1 1
7 7 2290 B 2363-11-02 0 1 0
8 8 2291 B 2363-11-07 0 1 0
9 9 2292 A 2363-11-12 1 0 0
10 10 2293 B 2363-11-17 0 1 0
4. Assign target value
df <- df %>%
  group_by(
    resp_id
  ) %>%
  mutate(
    # Check if the "A" part of a complete grouping has another value within 30 days
    is_within_timeframe = ifelse(
      is_complete == 1 & is_a == 1 & max(date) - min(date) <= 30,
      1,
      0
    )
  ) %>%
  ungroup()
We see that our one complete set does in fact have a B value that falls within 30 days of the A observation (Caveat: this only works if there are always exactly one or two observations per grouping!). Column is_within_timeframe corresponds to your Col5:
> df
# A tibble: 10 × 8
id resp_id grouping date is_a is_b is_complete is_within_timeframe
<int> <dbl> <chr> <date> <dbl> <dbl> <dbl> <dbl>
1 1 1701 A 2363-10-04 1 0 1 1
2 2 2286 A 2363-10-08 1 0 0 0
3 3 2287 A 2363-10-13 1 0 0 0
4 4 2288 A 2363-10-18 1 0 0 0
5 5 2289 B 2363-10-23 0 1 0 0
6 6 1701 B 2363-10-28 0 1 1 0
7 7 2290 B 2363-11-02 0 1 0 0
8 8 2291 B 2363-11-07 0 1 0 0
9 9 2292 A 2363-11-12 1 0 0 0
10 10 2293 B 2363-11-17 0 1 0 0
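If a RespID can have more than two rows, the max(date) - min(date) shortcut no longer applies. The following is a hedged sketch of one way to relax that caveat on the asker's original table: for each A row it checks whether any B row with the same RespID lies within 30 days. It assumes "within 30 days" means an absolute difference in either direction, and the helper columns date and day_num, along with the names flagged and result, are illustrative.

```r
library(dplyr)

# Data typed in from the question
df <- data.frame(
  ID = 1:5,
  Col1 = c("blue", "orange", "red", "yellow", "green"),
  RespID = c("729Ad", "295gS", "729Ad", "592Jd", "937sa"),
  Col3 = c(3.2, 6.5, 8.4, 2.9, 3.5),
  Col4 = c("A", "A", "B", "A", "B"),
  Year = 2021,
  Month = c("April", "April", "April", "March", "May"),
  Day = c(2, 1, 20, 12, 13)
)

flagged <- df %>%
  mutate(
    # Build a Date from Year/Month/Day; month.name avoids locale-dependent parsing
    date = as.Date(sprintf("%d-%02d-%02d", as.integer(Year),
                           match(Month, month.name), as.integer(Day))),
    day_num = as.numeric(date)
  ) %>%
  group_by(RespID) %>%
  mutate(
    # 1 if this is an A row and some B row in the same RespID is within 30 days
    Col5 = as.integer(
      Col4 == "A" &
        vapply(day_num,
               function(d) any(abs(day_num[Col4 == "B"] - d) <= 30),
               logical(1))
    )
  ) %>%
  ungroup() %>%
  select(-date, -day_num) %>%
  as.data.frame()

# Drop the B rows, keeping a plain data.frame
result <- flagged[flagged$Col4 == "A", ]
print(result)
```

On the sample data only ID 1 is flagged: its RespID 729Ad has a B row 18 days later. To count only B rows that come after the A row, the abs() call would be replaced by a one-sided comparison.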

How to have R sum nonexistent or null data

This is a bit convoluted, so I will start with the basic concept. The data is employment by area and sizeclass. From it, I produce a data frame that has the sizeclass, area, total employment by sizeclass, and number of worksites by sizeclass. The bigger the sizeclass, the more employment: 1 means employing between 0 and 4, and 9 means employing 1000+. Obviously some areas do not have large employers. However, I need the end result to always have 9 rows per area, even if there is 0 employment for that sizeclass. Sample data is below.
area <- c(01,01,01,01,01,01,01,03,03,03,03)
employment <- c(1,5,9,10,11,12,67,100,4,444,149)
sizeclass <- c(1,2,2,3,3,3,5,6,1,7,6)
df2 <- data.frame(area,employment,sizeclass)
This is the code that I am using and while it works, it does not produce a result for sizeclass 4 in area 01 for example. How would I have it sum by sizeclass even if there is nothing to sum or count?
sizeclassreport <- df2 %>%
  select(area, employment, sizeclass) %>%
  group_by(area, sizeclass) %>%
  summarise(employment = sum(employment), worksites = n())
The desired result would be 18 rows in length with the sum of employment by sizeclass for each sizeclass and number of worksites even if there is no employment.
We can use complete from tidyr to get every value in the range 1 to 9 for 'sizeclass' within each area. By default, the other columns are filled with NA in the added rows; the fill argument replaces that with a custom value, here 0.
library(dplyr)
library(tidyr)
sizeclassreport %>%
  group_by(area) %>%
  complete(sizeclass = 1:9,
           fill = list(employment = 0, worksites = 0)) %>%
  ungroup()
Output
# A tibble: 18 x 4
area sizeclass employment worksites
<dbl> <dbl> <dbl> <dbl>
1 1 1 1 1
2 1 2 14 2
3 1 3 33 3
4 1 4 0 0
5 1 5 67 1
6 1 6 0 0
7 1 7 0 0
8 1 8 0 0
9 1 9 0 0
10 3 1 4 1
11 3 2 0 0
12 3 3 0 0
13 3 4 0 0
14 3 5 0 0
15 3 6 249 2
16 3 7 444 1
17 3 8 0 0
18 3 9 0 0
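The same idea can be sketched as a single pipeline starting from the raw df2, so that the summary and the padding happen in one step. This is a variation rather than the answer's exact code: it drops the grouping with .groups = "drop" and passes area to complete() directly, which crosses every observed area with sizeclass 1 to 9.

```r
library(dplyr)
library(tidyr)

# Sample data from the question
area <- c(1, 1, 1, 1, 1, 1, 1, 3, 3, 3, 3)
employment <- c(1, 5, 9, 10, 11, 12, 67, 100, 4, 444, 149)
sizeclass <- c(1, 2, 2, 3, 3, 3, 5, 6, 1, 7, 6)
df2 <- data.frame(area, employment, sizeclass)

sizeclassreport <- df2 %>%
  group_by(area, sizeclass) %>%
  summarise(employment = sum(employment), worksites = n(), .groups = "drop") %>%
  # Add the missing area/sizeclass combinations and fill them with 0
  complete(area, sizeclass = 1:9,
           fill = list(employment = 0, worksites = 0))

print(sizeclassreport, n = 18)
```

Note that complete(area, ...) only pads areas that occur in the data; an area with no rows at all would need to be supplied explicitly.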

Identifying duplicate within groups by latest date

I currently have a data frame that looks like this:
ID Value Date
1 1 A 1/1/2018
2 1 B 2/3/1988
3 1 B 6/3/1994
4 2 A 12/6/1999
5 2 B 24/12/1957
6 3 A 9/8/1968
7 3 B 20/9/2016
8 3 C 15/4/1993
9 3 C 9/8/1994
10 4 A 8/8/1988
11 4 C 6/4/2001
Within each ID I would like to identify a row where there is a duplicate Value. The Value that I would like to identify is the duplicate with the most recent Date.
The resulting data frame should look like this:
ID Value Date mostRecentDuplicate
1 1 A 1/1/2018 0
2 1 B 2/3/1988 0
3 1 B 6/3/1994 1
4 2 A 12/6/1999 0
5 2 B 24/12/1957 0
6 3 A 9/8/1968 0
7 3 B 20/9/2016 0
8 3 C 15/4/1993 0
9 3 C 9/8/1994 1
10 4 A 8/8/1988 0
11 4 C 6/4/2001 0
How do I go about doing this?
Using dplyr, we can first convert Date to an actual Date value, then group_by ID and Value, and assign 1 to the row whose group has more than one row and whose row number matches that of the maximum Date.
library(dplyr)
df %>%
  mutate(Date = as.Date(Date, "%d/%m/%Y")) %>%
  group_by(ID, Value) %>%
  mutate(mostRecentDuplicate = +(n() > 1 & row_number() == which.max(Date))) %>%
  ungroup()
# A tibble: 11 x 4
# ID Value Date mostRecentDuplicate
# <int> <fct> <date> <int>
# 1 1 A 2018-01-01 0
# 2 1 B 1988-03-02 0
# 3 1 B 1994-03-06 1
# 4 2 A 1999-06-12 0
# 5 2 B 1957-12-24 0
# 6 3 A 1968-08-09 0
# 7 3 B 2016-09-20 0
# 8 3 C 1993-04-15 0
# 9 3 C 1994-08-09 1
#10 4 A 1988-08-08 0
#11 4 C 2001-04-06 0
