R: How to lag a dataframe by groups

I have the following data set:
Name Year VarA VarB Data.1 Data.2
A 2016 L H 100 101
A 2017 L H 105 99
A 2018 L H 103 105
A 2016 L A 90 95
A 2017 L A 99 92
A 2018 L A 102 101
I want to add a lagged variable by the grouping: Name, VarA, VarB so that my data would look like:
Name Year VarA VarB Data.1 Data.2 Lg1.Data.1 Lg2.Data.1
A 2016 L H 100 101 NA NA
A 2017 L H 105 99 100 NA
A 2018 L H 103 105 105 100
A 2016 L A 90 95 NA NA
A 2017 L A 99 92 90 NA
A 2018 L A 102 101 99 90
I found the following link, which is helpful: debugging: function to create multiple lags for multiple columns (dplyr)
And am using the following code:
df <- df %>%
  group_by(Name) %>%
  arrange(Name, VarA, VarB, Year) %>%
  do(data.frame(., setNames(shift(.[, c(5:6)], 1:2), c(seq(1:8)))))
However, the lag is offsetting across all data associated with each Name instead of resetting within the grouping I want, so only the 2018 rows are accurately lagged.
Name Year VarA VarB Data.1 Data.2 Lg1.Data.1 Lg2.Data.1
A 2016 L H 100 101 NA NA
A 2017 L H 105 99 100 NA
A 2018 L H 103 105 105 100
A 2016 L A 90 95 103 105
A 2017 L A 99 92 90 103
A 2018 L A 102 101 99 90
How do I get the lag to reset for each new grouping combination (e.g. Name / VarA / VarB)?

dplyr::lag lets you set the distance you want to lag by. You can group by whatever variables you want—in this case, Name, VarA, and VarB—before making your lagged variables.
library(dplyr)
df %>%
  group_by(Name, VarA, VarB) %>%
  mutate(Lg1.Data.1 = lag(Data.1, n = 1),
         Lg2.Data.1 = lag(Data.1, n = 2))
#> # A tibble: 6 x 8
#> # Groups: Name, VarA, VarB [2]
#> Name Year VarA VarB Data.1 Data.2 Lg1.Data.1 Lg2.Data.1
#> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 A 2016 L H 100 101 NA NA
#> 2 A 2017 L H 105 99 100 NA
#> 3 A 2018 L H 103 105 105 100
#> 4 A 2016 L A 90 95 NA NA
#> 5 A 2017 L A 99 92 90 NA
#> 6 A 2018 L A 102 101 99 90
If you want a version that scales to more lags, you can use some non-standard evaluation to create new lagged columns dynamically. I'll do this with purrr::map to iterate over a set of n to lag by, make a list of data frames with the new columns added, then join all the data frames together. There are probably better NSE ways to do this, so hopefully someone can improve upon it.
I'm making up some new data, just to have a wider range of years to illustrate. Inside mutate, you can create column names with quo_name.
library(dplyr)
library(purrr)
set.seed(127)
df <- tibble(
  Name = "A", Year = rep(2016:2020, 2), VarA = "L", VarB = rep(c("H", "A"), each = 5),
  Data.1 = sample(1:10, 10, replace = T), Data.2 = sample(1:10, 10, replace = T)
)
df_list <- purrr::map(1:4, function(i) {
  df %>%
    group_by(Name, VarA, VarB) %>%
    mutate(!!quo_name(paste0("Lag", i)) := dplyr::lag(Data.1, n = i))
})
You don't need to save this list—I'm just doing it to show an example of one of the data frames. You could instead go straight into reduce.
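For reference, going straight into reduce would look something like this (a sketch; it should give the same result as the step-by-step version below):
# same map as above, piped straight into reduce without saving the list
purrr::map(1:4, function(i) {
  df %>%
    group_by(Name, VarA, VarB) %>%
    mutate(!!quo_name(paste0("Lag", i)) := dplyr::lag(Data.1, n = i))
}) %>%
  reduce(inner_join)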
df_list[[3]]
#> # A tibble: 10 x 7
#> # Groups: Name, VarA, VarB [2]
#> Name Year VarA VarB Data.1 Data.2 Lag3
#> <chr> <int> <chr> <chr> <int> <int> <int>
#> 1 A 2016 L H 3 9 NA
#> 2 A 2017 L H 1 4 NA
#> 3 A 2018 L H 3 8 NA
#> 4 A 2019 L H 2 2 3
#> 5 A 2020 L H 4 5 1
#> 6 A 2016 L A 8 4 NA
#> 7 A 2017 L A 6 8 NA
#> 8 A 2018 L A 3 2 NA
#> 9 A 2019 L A 8 6 8
#> 10 A 2020 L A 9 1 6
Then use purrr::reduce to join all the data frames in the list. Since there are columns that are the same in each of the data frames, and those are the ones you want to join by, you can get away with not specifying join-by columns in inner_join.
reduce(df_list, inner_join)
#> Joining, by = c("Name", "Year", "VarA", "VarB", "Data.1", "Data.2")
#> Joining, by = c("Name", "Year", "VarA", "VarB", "Data.1", "Data.2")
#> Joining, by = c("Name", "Year", "VarA", "VarB", "Data.1", "Data.2")
#> # A tibble: 10 x 10
#> # Groups: Name, VarA, VarB [?]
#> Name Year VarA VarB Data.1 Data.2 Lag1 Lag2 Lag3 Lag4
#> <chr> <int> <chr> <chr> <int> <int> <int> <int> <int> <int>
#> 1 A 2016 L H 3 9 NA NA NA NA
#> 2 A 2017 L H 1 4 3 NA NA NA
#> 3 A 2018 L H 3 8 1 3 NA NA
#> 4 A 2019 L H 2 2 3 1 3 NA
#> 5 A 2020 L H 4 5 2 3 1 3
#> 6 A 2016 L A 8 4 NA NA NA NA
#> 7 A 2017 L A 6 8 8 NA NA NA
#> 8 A 2018 L A 3 2 6 8 NA NA
#> 9 A 2019 L A 8 6 3 6 8 NA
#> 10 A 2020 L A 9 1 8 3 6 8
Created on 2018-12-07 by the reprex package (v0.2.1)
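As an aside: in newer dplyr (1.0+), an unnamed data-frame result inside mutate is spliced into new columns, so all the lags can be built in a single call and the joins skipped entirely. A sketch of that variant:
library(dplyr)
library(purrr)
df %>%
  group_by(Name, VarA, VarB) %>%
  # map_dfc builds a data frame with columns Lag1..Lag4, which mutate splices in
  mutate(map_dfc(setNames(1:4, paste0("Lag", 1:4)),
                 ~ dplyr::lag(Data.1, n = .x)))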

Related

How to take the mean of two subsequent rows iteratively, thereby reducing the number of rows?

I have a tibble like so:
library(dplyr)
set.seed(1)
my_tib <- tibble(identifier = rep(letters[1:3], each = 4),
                 year = rep(seq(2005, 2020, 5), 3),
                 value = rnorm(12, mean = 1000, 100) %>% round())
my_tib
# A tibble: 12 × 3
identifier year value
<chr> <dbl> <dbl>
1 a 2005 937
2 a 2010 1018
3 a 2015 916
4 a 2020 1160
5 b 2005 1033
6 b 2010 918
7 b 2015 1049
8 b 2020 1074
9 c 2005 1058
10 c 2010 969
11 c 2015 1151
12 c 2020 1039
Now I'd like to shrink down my tibble by taking the mean value for two years each, creating a new column for the year bracket. For example, I'd like to take the mean of 937 and 1018 (977.5) for the new year_bracket 2005-2010.
I'd like to repeat this for all years and all identifiers.
So the first 5 rows of my new tibble would look like this:
head(my_new_tib, 5)
# A tibble: 5 × 3
identifier year_bracket value
<chr> <chr> <dbl>
1 a 2005-2010 977.5
2 a 2010-2015 967
3 a 2015-2020 1038
4 b 2005-2010 975.5
5 b 2010-2015 983.5
Ideally, I'm looking for a piped dplyr solution but I'm also curious regarding other solutions.
Using dplyr:
library(dplyr)
my_tib |>
  group_by(identifier) |>
  mutate(value = (value + lag(value)) / 2,
         year_bracket = paste0(lag(year), " - ", year),
         .keep = "unused",
         .before = 2) |>
  filter(!is.na(value)) |>
  ungroup()
Output:
# A tibble: 9 x 3
identifier year_bracket value
<chr> <chr> <dbl>
1 a 2005 - 2010 978.
2 a 2010 - 2015 967
3 a 2015 - 2020 1038
4 b 2005 - 2010 976.
5 b 2010 - 2015 984.
6 b 2015 - 2020 1062.
7 c 2005 - 2010 1014.
8 c 2010 - 2015 1060
9 c 2015 - 2020 1095
Another possible solution: duplicate the interior rows with slice so each year lands in two consecutive pairs, then group each pair and summarise:
library(tidyverse)
my_tib %>%
  group_by(identifier) %>%
  slice(c(1, rep(2:(n() - 1), each = 2), n())) %>%
  group_by(identifier, aux = rep(1:n(), each = 2, length.out = n())) %>%
  summarise(year_bracket = str_c(year, collapse = "_"), value = mean(value),
            .groups = "drop") %>%
  select(-aux)
#> # A tibble: 9 × 3
#> identifier year_bracket value
#> <chr> <chr> <dbl>
#> 1 a 2005_2010 978.
#> 2 a 2010_2015 967
#> 3 a 2015_2020 1038
#> 4 b 2005_2010 976.
#> 5 b 2010_2015 984.
#> 6 b 2015_2020 1062.
#> 7 c 2005_2010 1014.
#> 8 c 2010_2015 1060
#> 9 c 2015_2020 1095
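Since other approaches were explicitly welcome, here's a base R sketch of the same pairing idea: split by identifier, then average each row with the one before it.
# base R sketch: average adjacent rows within each identifier
do.call(rbind, lapply(split(my_tib, my_tib$identifier), function(d) {
  data.frame(identifier   = d$identifier[-1],
             year_bracket = paste(head(d$year, -1), d$year[-1], sep = "-"),
             value        = (head(d$value, -1) + d$value[-1]) / 2)
}))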

R: Turning row data from one dataframe into column data by group in another

I have data in the following format:
ID Age Sex
1  29  M
2  32  F
3  18  F
4  89  M
5  45  M
and:
ID subID Type       Status Year
1  3     Car        Y
1  11    Toyota     NULL   2011
1  23    Kia        NULL   2009
2  5     Car        N
3  2     Car        Y
3  4     Honda      NULL   2019
3  7     Fiat       NULL   2006
3  8     Mitsubishi NULL   2020
4  1     Car        N
5  7     Car        Y
Each ID in the second table has a row specifying whether they have a car, plus additional rows stating the brand of car(s) they own. Each person has a maximum of 3 cars. I want to simplify this data into a single table, like so:
ID Age Sex Car? Car.1  Car1.year Car.2 Car2.year Car.3      Car3.year
1  29  M   Y    Toyota 2011      Kia   2009      NULL       NULL
2  32  F   N    NULL   NULL      NULL  NULL      NULL       NULL
3  18  F   Y    Honda  2019      Fiat  2006      Mitsubishi 2020
4  89  M   N    NULL   NULL      NULL  NULL      NULL       NULL
5  45  M   Y    NULL   NULL      NULL  NULL      NULL       NULL
I've tried using the mutate function in dplyr with the case_when function, but I can't check conditions in another dataframe. If I join the tables together, I get multiple rows for each ID, which I want to avoid. The non-standard setup of the second table complicates things. My only remaining idea is to switch to Python/pandas and write a for loop that slowly walks through each ID, searches the second dataframe to see whether the person has a car and which brands, then mutates a column in the first dataframe. But given the size of my dataset, this would be inefficient and take a long time.
What is the best way to do this?
You can try the following code:
library(tidyverse)
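First, a sketch re-typing the question's example data (the construction wasn't shown in the original answer, so this is just my reading of the tables above):
# hypothetical reconstruction of the question's two tables
df1 <- tibble(ID  = c(1, 2, 3, 4, 5),
              Age = c(29, 32, 18, 89, 45),
              Sex = c("M", "F", "F", "M", "M"))
df2 <- tibble(ID     = c(1, 1, 1, 2, 3, 3, 3, 3, 4, 5),
              subID  = c(3, 11, 23, 5, 2, 4, 7, 8, 1, 7),
              Type   = c("Car", "Toyota", "Kia", "Car", "Car",
                         "Honda", "Fiat", "Mitsubishi", "Car", "Car"),
              Status = c("Y", "NULL", "NULL", "N", "Y",
                         "NULL", "NULL", "NULL", "N", "Y"),
              Year   = c(NA, 2011, 2009, NA, NA, 2019, 2006, 2020, NA, NA))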
df1
# A tibble: 5 x 3
ID Age Sex
<dbl> <dbl> <chr>
1 1 29 M
2 2 32 F
3 3 18 F
4 4 89 M
5 5 45 M
df2
# A tibble: 10 x 5
ID subID Type Status Year
<dbl> <dbl> <chr> <chr> <dbl>
1 1 3 Car Y NA
2 1 11 Toyota NULL 2011
3 1 23 Kia NULL 2009
4 2 5 Car N NA
5 3 2 Car Y NA
6 3 4 Honda NULL 2019
7 3 7 Fiat NULL 2006
8 3 8 Mitsubishi NULL 2020
9 4 1 Car N NA
10 5 7 Car Y NA
df2 <- df2 %>% mutate(Status = if_else(Status == "NULL", "Y", Status))
df3 <- df2 %>% filter(!is.na(Year)) %>% group_by(ID) %>% mutate(index = row_number())
df4 <- df3 %>% pivot_wider(id_cols = c(ID), values_from = c(Type, Year), names_from = index )
So your desired output will be produced:
df1 %>% left_join(df2 %>% select(ID, Status) %>% distinct()) %>% left_join(df4)
# A tibble: 5 x 10
ID Age Sex Status Type_1 Type_2 Type_3 Year_1 Year_2 Year_3
<dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 1 29 M Y Toyota Kia NA 2011 2009 NA
2 2 32 F N NA NA NA NA NA NA
3 3 18 F Y Honda Fiat Mitsubishi 2019 2006 2020
4 4 89 M N NA NA NA NA NA NA
5 5 45 M Y NA NA NA NA NA NA

R calculating differences on a pivoted tibble

I'm struggling with some beginner issues with R and tables. I spend most of my data visualisation time in Tableau, but I want to be able to replicate work in R to take advantage of the report-generation capacity of RMarkdown and of the statcanR library, which lets me pull data from Statistics Canada's CANSIM/CODR tables. My coding experience is along the lines of C, C++, Java, JavaScript and Python, with all but Python learnt in college around the turn of the millennium.
I am extracting rates of certain types of crimes and have created the following table.
# A tibble: 4 × 11
Violations `2011` `2012` `2013` `2014` `2015` `2016` `2017` `2018` `2019` `2020`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Total, all Criminal Code violati… 5780. 5638. 5206. 5061. 5232. 5297. 5375. 5513. 5878. 5301.
2 Total violent Criminal Code viol… 1236. 1199. 1096. 1044. 1070. 1076. 1113. 1152. 1279. 1254.
3 Total property crime violations … 3536. 3438. 3154. 3100. 3231. 3239. 3265. 3348. 3512. 3071.
4 Total drug violations [401] 330. 317. 311. 295. 280. 267. 254. 229. 186. 176.
I have filtered out data more than ten years old and kept only certain crimes.
# Pivot the data
table_01 <- pivot_wider(table_01 %>% select("REF_DATE", "Violations", "VALUE"),
                        names_from = REF_DATE, values_from = VALUE)
table01a <- table_01 %>%
  select(2020, 2019, 2011) %>%
  mutate(
    ten_year_change = 2020 - 2011,
    one_year_change = 2020 - 2019
  )
I've been messing around with different libraries, including tidyverse and dplyr. I want the code to calculate the difference between the most recent two years, and the difference between the most recent year and the year ten years before it. The idea is to generate a new report whenever Statistics Canada updates their data.
The code above is absolutely not what I want. I also don't want the years I calculate differences over to be hard-coded, so I don't have to edit the code in six months.
My suspicion is that I'm not getting my head around the R way of doing things, but if I can get a push in the right direction, I'd appreciate it.
Below is the TLDR full RMarkdown script:
---
title: "CJS Statistical Summary"
output: word_document
date: '2021-10-05'
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
#load libraries
#install.packages("tidyverse")
#install.packages("statcanR")
#install.packages("flextable")
#install.packages("dplyr")
library("tidyverse")
library("statcanR")
library("flextable")
library("dplyr")
setwd("~/R_Scripts") # change to a Windows-style path if run on Windows.
#set language
language <-"eng"
# Load dataset Incident-based crime statistics, by detailed violations
CODR_0177 <- statcan_data('35-10-0177-01', language)
# Code not written for these CODR tables
#CODR_0027 <- statcan_data('35-10-0027-01', language)
#CODR_0038 <- statcan_data('35-10-0038-01', language)
#CODR_0029 <- statcan_data('35-10-0029-01', language)
#CODR_0022 <- statcan_data('35-10-0022-01', language)
#CODR_0006 <- statcan_data('35-10-0006-01', language)
```
## Table 1
```{r table_01, echo=FALSE}
# Develop table 1 - Crime Stats
# =============================
# Find most recent ten years
years <- distinct(CODR_0177 %>% select("REF_DATE"))
years <- arrange(years, desc(REF_DATE)) %>% slice(1:10)
# Copying the crime stats table so it isn't altered in case we need to reuse it.
table_01 <- CODR_0177
# Remove unused columns
table_01 <- table_01 %>% select("REF_DATE","GEO","Violations","Statistics","UOM","VALUE") %>% filter(REF_DATE %in% years$REF_DATE)
# Keep only national data
table_01 <- table_01 %>% filter(GEO == "Canada")
# Keep only crime rate
table_01 <- table_01 %>% filter(Statistics == "Rate per 100,000 population")
# Keep only certain Violations
display_violations <- c("Total, all Criminal Code violations (excluding traffic) [50]","Total violent Criminal Code violations [100]","Total property crime violations [200]","Total drug violations [401]" )
table_01 <- table_01 %>% filter(Violations %in% display_violations)
# Pivot the data
table_01 <- pivot_wider(table_01 %>% select("REF_DATE", "Violations", "VALUE"),
                        names_from = REF_DATE, values_from = VALUE)
# calculating year-to-year differences
table01a <- table_01 %>%
  select(2020, 2019, 2011) %>%
  mutate(
    ten_year_change = 2020 - 2011,
    one_year_change = 2020 - 2019
  )
# Edit look and feel for report using Flextable
flex_table_01 <- flextable(table_01)
flex_table_01 <- theme_vanilla(flex_table_01)
flex_table_01 <- add_header_row(
  flex_table_01,
  values = c("", "Rates per 100,000 population", "% change"),
  colwidths = c(1, 10, 2)
)
flex_table_01 <- add_header_row(
  flex_table_01,
  values = c("Incidents Reported to Police (Crime Rate)"),
  colwidths = c(13)
)
flex_table_01 <- align(flex_table_01, i = 1, part = "header", align = "center")
flex_table_01 <- fontsize(flex_table_01, i = NULL, j = NULL, size = 8, part = "all")
flex_table_01 <- colformat_double(flex_table_01, big.mark=",", digits = 0, na_str = "N/A")
flex_table_01
#remove temporary files
rm(years)
rm(display_violations)
rm(table_01)
```
This is much easier with the data in "long" format. Below is an example with fake data. We use the lag function to get changes over different time ranges. Once you've added the changes over various timescales, you can subset and reshape the data as needed to create your final tables.
library(tidyverse)
# Fake data
set.seed(2)
d = tibble(
  REF_DATE = rep(2010:2020, each=4),
  Violations = rep(LETTERS[1:4], 11),
  value = sample(100:200, 44)
)
d
#> # A tibble: 44 × 3
#> REF_DATE Violations value
#> <int> <chr> <int>
#> 1 2010 A 184
#> 2 2010 B 178
#> 3 2010 C 169
#> 4 2010 D 105
#> 5 2011 A 131
#> 6 2011 B 107
#> 7 2011 C 116
#> 8 2011 D 192
#> 9 2012 A 180
#> 10 2012 B 175
#> # … with 34 more rows
d1 = d %>%
  arrange(Violations, REF_DATE) %>%
  group_by(Violations) %>%
  mutate(lag1 = value - lag(value),
         lag10 = value - lag(value, n=10))
print(d1, n=23)
#> # A tibble: 44 × 5
#> # Groups: Violations [4]
#> REF_DATE Violations value lag1 lag10
#> <int> <chr> <int> <int> <int>
#> 1 2010 A 184 NA NA
#> 2 2011 A 131 -53 NA
#> 3 2012 A 180 49 NA
#> 4 2013 A 174 -6 NA
#> 5 2014 A 189 15 NA
#> 6 2015 A 132 -57 NA
#> 7 2016 A 139 7 NA
#> 8 2017 A 108 -31 NA
#> 9 2018 A 101 -7 NA
#> 10 2019 A 147 46 NA
#> 11 2020 A 193 46 9
#> 12 2010 B 178 NA NA
#> 13 2011 B 107 -71 NA
#> 14 2012 B 175 68 NA
#> 15 2013 B 164 -11 NA
#> 16 2014 B 154 -10 NA
#> 17 2015 B 153 -1 NA
#> 18 2016 B 115 -38 NA
#> 19 2017 B 171 56 NA
#> 20 2018 B 166 -5 NA
#> 21 2019 B 190 24 NA
#> 22 2020 B 117 -73 -61
#> 23 2010 C 169 NA NA
#> # … with 21 more rows
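Once the change columns exist, the numbers for the final report drop out of a filter on the latest year, with no years hard-coded. A sketch against the fake data (the selected column names are illustrative):
# latest-year snapshot with 1- and 10-year changes, nothing hard-coded
d1 %>%
  filter(REF_DATE == max(REF_DATE)) %>%
  select(Violations, value, one_year_change = lag1, ten_year_change = lag10)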
We can also do multiple lags at once:
d2 = d %>%
  arrange(Violations, REF_DATE) %>%
  group_by(Violations) %>%
  mutate(map_dfc(1:10 %>% set_names(paste0("lag.", .)),
                 ~ value - lag(value, n=.x)))
d2
#> # A tibble: 44 × 13
#> # Groups: Violations [4]
#> REF_DATE Violations value lag.1 lag.2 lag.3 lag.4 lag.5 lag.6 lag.7 lag.8
#> <int> <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1 2010 A 184 NA NA NA NA NA NA NA NA
#> 2 2011 A 131 -53 NA NA NA NA NA NA NA
#> 3 2012 A 180 49 -4 NA NA NA NA NA NA
#> 4 2013 A 174 -6 43 -10 NA NA NA NA NA
#> 5 2014 A 189 15 9 58 5 NA NA NA NA
#> 6 2015 A 132 -57 -42 -48 1 -52 NA NA NA
#> 7 2016 A 139 7 -50 -35 -41 8 -45 NA NA
#> 8 2017 A 108 -31 -24 -81 -66 -72 -23 -76 NA
#> 9 2018 A 101 -7 -38 -31 -88 -73 -79 -30 -83
#> 10 2019 A 147 46 39 8 15 -42 -27 -33 16
#> # … with 34 more rows, and 2 more variables: lag.9 <int>, lag.10 <int>
Created on 2021-10-05 by the reprex package (v2.0.1)

Sum up with the next line into a new column

I'm having some trouble figuring out how to create a new column with the sum of 2 subsequent cells.
I have:
df1 <- tibble(Years = c(1990, 2000, 2010, 2020, 2030, 2050, 2060, 2070, 2080),
              Values = c(1, 2, 3, 4, 5, 6, 7, 8, 9))
Now, I want a new column where the first line is the sum of 1+2, the second line is the sum of 1+2+3, the third line is the sum of 1+2+3+4, and so on.
As 1, 2, 3, 4... are hypothetical values, I need to measure the absolute growth from one decade to the next so I can later create a new variable measuring the percentage change between decades.
library(tibble)
df1 <- tibble(Years = c(1990, 2000, 2010, 2020, 2030, 2050, 2060, 2070, 2080),
              Values = c(1, 2, 3, 4, 5, 6, 7, 8, 9))
library(slider)
library(dplyr, warn.conflicts = F)
df1 %>%
  mutate(xx = slide_sum(Values, after = 1, before = Inf))
#> # A tibble: 9 x 3
#> Years Values xx
#> <dbl> <dbl> <dbl>
#> 1 1990 1 3
#> 2 2000 2 6
#> 3 2010 3 10
#> 4 2020 4 15
#> 5 2030 5 21
#> 6 2050 6 28
#> 7 2060 7 36
#> 8 2070 8 45
#> 9 2080 9 45
Created on 2021-08-12 by the reprex package (v2.0.0)
Assuming the last row is to be repeated. Otherwise the fill part can be skipped.
library(dplyr)
library(tidyr)
df1 %>%
  mutate(x = lead(cumsum(Values))) %>%
  fill(x)
# Years Values x
# <dbl> <dbl> <dbl>
# 1 1990 1 3
# 2 2000 2 6
# 3 2010 3 10
# 4 2020 4 15
# 5 2030 5 21
# 6 2050 6 28
# 7 2060 7 36
# 8 2070 8 45
# 9 2080 9 45
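Since the question mentions computing percentage change between decades later on, that's one more mutate on top of this (a sketch; pct_change is a made-up column name):
df1 %>%
  mutate(x = lead(cumsum(Values))) %>%
  fill(x) %>%
  # % change relative to the previous row's running total
  mutate(pct_change = 100 * (x - lag(x)) / lag(x))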
Using base R
v1 <- cumsum(df1$Values)[-1]
df1$new <- c(v1, v1[length(v1)])
You want the cumsum() function. Here are two ways to do it.
### Base R
df1$cumsum <- cumsum(df1$Values)
### Using dplyr
library(dplyr)
df1 <- df1 %>%
  mutate(cumsum = cumsum(Values))
Here is the output in either case.
df1
# A tibble: 9 x 3
Years Values cumsum
<dbl> <dbl> <dbl>
1 1990 1 1
2 2000 2 3
3 2010 3 6
4 2020 4 10
5 2030 5 15
6 2050 6 21
7 2060 7 28
8 2070 8 36
9 2080 9 45
A data.table option:
library(data.table)
setDT(df1)[, newCol := shift(cumsum(Values), -1, fill = sum(Values))][]
Years Values newCol
1: 1990 1 3
2: 2000 2 6
3: 2010 3 10
4: 2020 4 15
5: 2030 5 21
6: 2050 6 28
7: 2060 7 36
8: 2070 8 45
9: 2080 9 45
or a base R option following a similar idea:
transform(
  df1,
  newCol = c(cumsum(Values)[-1], sum(Values))
)

dplyr: keep empty levels of factor but not empty levels of a combination of factors that don't appear in data

When grouping and summarising with dplyr, what is the correct way to keep empty levels of each grouping factor but not keep empty combinations from multiple grouping factors?
As an example, consider data recorded at different times at multiple sites. I might filter and then calculate something for each year at each site. I'd like the summary to take its default value (on an empty vector) if the filter removes a year completely. Site "a" has 10 years and site "b" has 1 year, so I'd always like 11 rows in the summary.
If I use .drop = TRUE in group_by I lose years:
library(dplyr)
library(zoo)
library(lubridate)
set.seed(1)
df <- data.frame(site = factor(c(rep("a", 120), rep("b", 12))),
                 date = c(seq.Date(as.Date("2000/1/1"), by = "month", length.out = 120),
                          seq.Date(as.Date("2000/1/1"), by = "month", length.out = 12)),
                 value = rnorm(132, 50, 10))
df$year <- factor(lubridate::year(df$date))
df %>%
  filter(value > 65) %>%
  group_by(site, year, .drop = TRUE) %>%
  summarise(f = first(date))
#> # A tibble: 6 x 3
#> # Groups: site [1]
#> site year f
#> <fct> <fct> <date>
#> 1 a 2000 2000-04-01
#> 2 a 2004 2004-08-01
#> 3 a 2005 2005-01-01
#> 4 a 2007 2007-11-01
#> 5 a 2008 2008-10-01
#> 6 a 2009 2009-02-01
and with .drop = FALSE I gain all the extra years for site "b" which were not in the original data:
df %>%
  filter(value > 65) %>%
  group_by(site, year, .drop = FALSE) %>%
  summarise(f = first(date))
#> # A tibble: 20 x 3
#> # Groups: site [2]
#> site year f
#> <fct> <fct> <date>
#> 1 a 2000 2000-04-01
#> 2 a 2001 NA
#> 3 a 2002 NA
#> 4 a 2003 NA
#> 5 a 2004 2004-08-01
#> 6 a 2005 2005-01-01
#> 7 a 2006 NA
#> 8 a 2007 2007-11-01
#> 9 a 2008 2008-10-01
#> 10 a 2009 2009-02-01
#> 11 b 2000 NA
#> 12 b 2001 NA
#> 13 b 2002 NA
#> 14 b 2003 NA
#> 15 b 2004 NA
#> 16 b 2005 NA
#> 17 b 2006 NA
#> 18 b 2007 NA
#> 19 b 2008 NA
#> 20 b 2009 NA
The best way I could think of was to calculate counts, then merge, then filter, then drop the count variable, but that's pretty messy.
I know .drop was only recently added to dplyr, which is very useful for one factor, but is there a clean way yet to do this for multiple factors?
df %>%
  filter(value > 65) %>%
  group_by(site, year, .drop = FALSE) %>%
  summarise(f = first(date)) %>%
  left_join(df %>% count(site, year, .drop = FALSE), by = c("site", "year")) %>%
  filter(n > 0) %>%
  select(-n)
#> # A tibble: 11 x 3
#> # Groups: site [2]
#> site year f
#> <fct> <fct> <date>
#> 1 a 2000 2000-04-01
#> 2 a 2001 NA
#> 3 a 2002 NA
#> 4 a 2003 NA
#> 5 a 2004 2004-08-01
#> 6 a 2005 2005-01-01
#> 7 a 2006 NA
#> 8 a 2007 2007-11-01
#> 9 a 2008 2008-10-01
#> 10 a 2009 2009-02-01
#> 11 b 2000 NA
Not sure if this is what you'd like: if you replace the dates where value < 65 with NA, instead of filtering those rows out, you can proceed as usual.
df %>%
  mutate(date = replace(date, value < 65, NA)) %>%
  group_by(site, year) %>%
  summarise(f = first(date[!is.na(date)]))
# A tibble: 11 x 3
# Groups: site [2]
site year f
<fct> <fct> <date>
1 a 2000 NA
2 a 2001 NA
3 a 2002 2002-03-01
4 a 2003 NA
5 a 2004 NA
6 a 2005 NA
7 a 2006 2006-02-01
8 a 2007 NA
9 a 2008 2008-07-01
10 a 2009 2009-02-01
11 b 2000 2000-08-01
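For what it's worth, the count-and-join workaround from the question can also be condensed with a semi_join back against the site/year combinations present in the unfiltered data (a sketch; it should give the same 11 rows):
df %>%
  filter(value > 65) %>%
  group_by(site, year, .drop = FALSE) %>%
  summarise(f = first(date)) %>%
  # keep only combinations that actually occur in the original data
  semi_join(distinct(df, site, year), by = c("site", "year"))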
