R - Dataframe (group_by/aggregate/pivot_wider) Manipulation [duplicate]

I'm currently having an issue manipulating/aggregating my dataframe. The current data frame I have is as follows:
Farm    Year  Cow  Duck  Chicken  Sheep  Horse
Farm 1  2020  22   12    100      30     25
Farm 1  2020  0    12    120      20     20
Farm 1  2019  16   6     80       10     16
Farm 1  2019  12   0     50       0      11
Farm 1  2018  8    0     0        16     0
Farm 1  2018  0    0     10       13     12
Farm 2  2020  31   28    27       10     14
Farm 2  2020  0    13    31       20     0
Farm 2  2019  3    31    0        20     43
Farm 2  2019  20   50    43       17     42
Farm 2  2018  39   33    0        48     10
Farm 2  2018  34   20    28       12     12
Farm 3  2020  27   0     37       30     42
Farm 3  2020  50   9     0        0      0
Farm 3  2019  0    19    0        20     16
Farm 3  2019  0    2     0        0      7
Farm 3  2018  0    0     5        27     0
Farm 3  2018  0    7     43       49     42
For simplicity, the code for the data frame is as follows:
Farms = c(rep("Farm 1", 6), rep("Farm 2", 6), rep("Farm 3", 6))
Year = rep(c(2020,2020,2019,2019,2018,2018),3)
Cow = c(22,0,16,12,8,0,31,0,3,20,39,34,27,50,0,0,0,0)
Duck = c(12,12,6,0,0,0,28,13,31,50,33,20,0,9,19,2,0,7)
Chicken = c(100,120,80,50,0,10,27,31,0,43,0,28,37,0,0,0,5,43)
Sheep = c(30,20,10,0,16,13,10,20,20,17,48,12,30,0,20,0,27,49)
Horse = c(25,20,16,11,0,12,14,0,43,42,10,12,42,0,16,7,0,42)
Data = data.frame(Farms, Year, Cow, Duck, Chicken, Sheep, Horse)
Does anyone know how I can turn the dataframe into the table below using group_by and/or aggregate and/or pivot_wider, or any other way? The table below groups the rows by farm and year and takes the average of each animal for that year.
Farm    Year  Cow   Duck  Chicken  Sheep  Horse
Farm 1  2020  11    12    110      25     22.5
Farm 1  2019  14    3     65       5      13.5
Farm 1  2018  4     0     5        14.5   6
Farm 2  2020  15.5  20.5  29       15     7
Farm 2  2019  11.5  40.5  21.5     18.5   42.5
Farm 2  2018  36.5  26.5  14       30     11
Farm 3  2020  38.5  4.5   18.5     15     21
Farm 3  2019  0     10.5  0        10     11.5
Farm 3  2018  0     3.5   24       38     21
(Farm 1, 2020, Cow: average of 2020 = (22+0)/2 = 11)
Thank you in Advance and a happy 2022 to all!

aggregate(.~Year + Farms, Data, mean)
Year Farms Cow Duck Chicken Sheep Horse
1 2018 Farm 1 4.0 0.0 5.0 14.5 6.0
2 2019 Farm 1 14.0 3.0 65.0 5.0 13.5
3 2020 Farm 1 11.0 12.0 110.0 25.0 22.5
4 2018 Farm 2 36.5 26.5 14.0 30.0 11.0
5 2019 Farm 2 11.5 40.5 21.5 18.5 42.5
6 2020 Farm 2 15.5 20.5 29.0 15.0 7.0
7 2018 Farm 3 0.0 3.5 24.0 38.0 21.0
8 2019 Farm 3 0.0 10.5 0.0 10.0 11.5
9 2020 Farm 3 38.5 4.5 18.5 15.0 21.0
aggregate(.~Farms + Year, Data, mean)
Farms Year Cow Duck Chicken Sheep Horse
1 Farm 1 2018 4.0 0.0 5.0 14.5 6.0
2 Farm 2 2018 36.5 26.5 14.0 30.0 11.0
3 Farm 3 2018 0.0 3.5 24.0 38.0 21.0
4 Farm 1 2019 14.0 3.0 65.0 5.0 13.5
5 Farm 2 2019 11.5 40.5 21.5 18.5 42.5
6 Farm 3 2019 0.0 10.5 0.0 10.0 11.5
7 Farm 1 2020 11.0 12.0 110.0 25.0 22.5
8 Farm 2 2020 15.5 20.5 29.0 15.0 7.0
9 Farm 3 2020 38.5 4.5 18.5 15.0 21.0
library(dplyr)
Data %>%
  group_by(Farms, Year) %>%
  summarise(across(everything(), mean), .groups = 'drop')
# A tibble: 9 x 7
Farms Year Cow Duck Chicken Sheep Horse
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Farm 1 2018 4 0 5 14.5 6
2 Farm 1 2019 14 3 65 5 13.5
3 Farm 1 2020 11 12 110 25 22.5
4 Farm 2 2018 36.5 26.5 14 30 11
5 Farm 2 2019 11.5 40.5 21.5 18.5 42.5
6 Farm 2 2020 15.5 20.5 29 15 7
7 Farm 3 2018 0 3.5 24 38 21
8 Farm 3 2019 0 10.5 0 10 11.5
9 Farm 3 2020 38.5 4.5 18.5 15 21

Onyambu's answer is good. One small thing, and I know you didn't ask for this: you might want to consider whether by "average" you want the mean or the median. At first glance the data looks rather skewed, so the median might suit you better. A density plot per animal shows the shape:
library(tidyverse)
Data %>%
  pivot_longer(cols = 3:7, names_to = 'names', values_to = 'values') %>%
  ggplot(aes(x = values)) + geom_density() + facet_wrap(~names)
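To make the mean/median point concrete, here is a minimal base R sketch using the Cow and Year vectors from the question. Note that with only two rows per farm and year the mean and the median are always identical, so the difference only shows up with larger groups; grouping by Year alone is my own choice for illustration, not something the question asks for.

```r
# Cow and Year vectors copied from the question
Year <- rep(c(2020, 2020, 2019, 2019, 2018, 2018), 3)
Cow  <- c(22, 0, 16, 12, 8, 0, 31, 0, 3, 20, 39, 34, 27, 50, 0, 0, 0, 0)

# Six values per year (two rows from each of the three farms)
cow_mean   <- tapply(Cow, Year, mean)
cow_median <- tapply(Cow, Year, median)

# Side-by-side comparison; the zeros pull the two statistics apart
print(rbind(mean = cow_mean, median = cow_median))
```

If the median turns out to be the better summary, the same swap (median instead of mean) works in the aggregate and summarise calls above.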

Related

merge of 2 data frames based on several columns defining 1 variable in r

I have 2 data frames. The key columns are: year, pd, treatm and rep.
The variable in the first data frame is LAI; the variables in the second are cimer, himv and nőv.
I would like to add the variable LAI to the other variables/columns.
I am not sure how to get the LAI values into the correct order, since each value is identified by 4 key columns.
Could you help me solve this problem, please?
Thank you very much!
Data frames are:
> sample1
year treatm pd rep LAI
1 2020 1 A 1 2.58
2 2020 1 A 2 2.08
3 2020 1 A 3 2.48
4 2020 1 A 4 2.98
5 2020 2 A 1 3.34
6 2020 2 A 2 3.11
7 2020 2 A 3 3.20
8 2020 2 A 4 2.56
9 2020 1 B 1 2.14
10 2020 1 B 2 2.17
11 2020 1 B 3 2.24
12 2020 1 B 4 2.29
13 2020 2 B 1 3.41
14 2020 2 B 2 3.12
15 2020 2 B 3 2.81
16 2020 2 B 4 2.63
17 2021 1 A 1 2.15
18 2021 1 A 2 2.25
19 2021 1 A 3 2.52
20 2021 1 A 4 2.57
21 2021 2 A 1 2.95
22 2021 2 A 2 2.82
23 2021 2 A 3 3.11
24 2021 2 A 4 3.04
25 2021 1 B 1 3.25
26 2021 1 B 2 2.33
27 2021 1 B 3 2.75
28 2021 1 B 4 3.09
29 2021 2 B 1 3.18
30 2021 2 B 2 2.75
31 2021 2 B 3 3.21
32 2021 2 B 4 3.57
> sample2
year.pd.treatm.rep.cimer.himv.nőv
1 2020,A,1,1,92,93,94
2 2020,A,2,1,91,92,93
3 2020,B,1,1,72,73,75
4 2020,B,2,1,73,74,75
5 2020,A,1,2,95,96,100
6 2020,A,2,2,90,91,94
7 2020,B,1,2,74,76,78
8 2020,B,2,2,71,72,74
9 2020,A,1,3,94,95,96
10 2020,A,2,3,92,93,96
11 2020,B,1,3,76,77,77
12 2020,B,2,3,74,75,76
13 2020,A,1,4,90,91,97
14 2020,A,2,4,90,91,94
15 2020,B,1,4,74,75,NA
16 2020,B,2,4,73,75,NA
17 2021,A,1,1,92,93,94
18 2021,A,2,1,91,92,93
19 2021,B,1,1,72,73,75
20 2021,B,2,1,73,74,75
21 2021,A,1,2,95,96,100
22 2021,A,2,2,90,91,94
23 2021,B,1,2,74,76,78
24 2021,B,2,2,71,72,74
25 2021,A,1,3,94,95,96
26 2021,A,2,3,92,93,96
27 2021,B,1,3,76,77,77
28 2021,B,2,3,74,75,76
29 2021,A,1,4,90,91,97
30 2021,A,2,4,90,91,94
31 2021,B,1,4,74,75,NA
32 2021,B,2,4,73,75,NA

You can use inner_join from dplyr:
library(tidyverse)
inner_join(sample2,sample1, by=c("year","pd", "treatm", "rep"))
Output (first six lines)
year pd treatm rep cimer himv nov LAI
1: 2020 A 1 1 92 93 94 2.58
2: 2020 A 2 1 91 92 93 3.34
3: 2020 B 1 1 72 73 75 2.14
4: 2020 B 2 1 73 74 75 3.41
5: 2020 A 1 2 95 96 100 2.08
6: 2020 A 2 2 90 91 94 3.11
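You can also use data.table (both objects must be data.tables, which setDT() converts in place):
library(data.table)
setDT(sample2)[setDT(sample1), on = .(year, pd, treatm, rep)]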
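If you would rather avoid extra packages, base R's merge() does the same multi-column match. Below is a small sketch on three rows taken from the two samples above; the names left and right are just placeholders, and the rows of right are deliberately shuffled to show that matching is done by key, not by position.

```r
# Three rows from sample1 (keys + LAI)
left <- data.frame(
  year   = c(2020, 2020, 2021),
  pd     = c("A", "B", "A"),
  treatm = c(1, 1, 2),
  rep    = c(1, 1, 1),
  LAI    = c(2.58, 2.14, 2.95)
)

# The matching rows from sample2 (keys + cimer), in a different order
right <- data.frame(
  year   = c(2021, 2020, 2020),
  pd     = c("A", "B", "A"),
  treatm = c(2, 1, 1),
  rep    = c(1, 1, 1),
  cimer  = c(91, 72, 92)
)

# Inner join on all four key columns
merged <- merge(right, left, by = c("year", "pd", "treatm", "rep"))
print(merged)
```

By default merge() keeps only rows whose keys appear in both data frames, like inner_join; all.x = TRUE or all.y = TRUE would keep unmatched rows as well.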

Loop to sum weekly rolling average

I am new to coding. I have a data set of daily stream flow averages over 20 years. Following is an example:
DATE FLOW
1 10/1/2001 88.2
2 10/2/2001 77.6
3 10/3/2001 68.4
4 10/4/2001 61.5
5 10/5/2001 55.3
6 10/6/2001 52.5
7 10/7/2001 49.7
8 10/8/2001 46.7
9 10/9/2001 43.3
10 10/10/2001 41.3
11 10/11/2001 39.3
12 10/12/2001 37.7
13 10/13/2001 35.8
14 10/14/2001 34.1
15 10/15/2001 39.8
I need to create a loop summing the previous 6 days as well as the current day (rolling weekly average), and print it to an array for the designated water year. I have already created an aggregate function to separate yearly average daily means into their designated water years.
# Separating dates into specific water years
wtr_yr <- function(dates, start_month = 9) {
  # Convert dates into POSIXlt
  POSIDATE <- as.POSIXlt(dates)
  # Year offset: months from start_month onward count toward the next water year
  offset <- ifelse(POSIDATE$mon >= start_month - 1, 1, 0)
  # Water year
  POSIDATE$year + 1900 + offset
}
# NEW_DATE is assumed to be the vector of parsed dates
adj.year <- wtr_yr(NEW_DATE)
# Aggregating by water year to take the mean
mean.FLOW <- aggregate(data_set$FLOW, list(adj.year), mean)

It seems that it can be done much more easily.
But first I need to prepare a bit more data.
library(tidyverse)
library(lubridate)
df = tibble(
DATE = seq(mdy("1/1/2010"), mdy("12/31/2022"), 1),
FLOW = rnorm(length(DATE), 40, 10)
)
output
# A tibble: 4,748 x 2
DATE FLOW
<date> <dbl>
1 2010-01-01 34.4
2 2010-01-02 37.7
3 2010-01-03 55.6
4 2010-01-04 40.7
5 2010-01-05 41.3
6 2010-01-06 57.2
7 2010-01-07 44.6
8 2010-01-08 27.3
9 2010-01-09 33.1
10 2010-01-10 35.5
# ... with 4,738 more rows
Now let's do the aggregation by year and week number
df %>%
group_by(year(DATE), week(DATE)) %>%
summarise(mean = mean(FLOW))
output
# A tibble: 689 x 3
# Groups: year(DATE) [13]
`year(DATE)` `week(DATE)` mean
<dbl> <dbl> <dbl>
1 2010 1 44.5
2 2010 2 39.6
3 2010 3 38.5
4 2010 4 35.3
5 2010 5 44.1
6 2010 6 39.4
7 2010 7 41.3
8 2010 8 43.9
9 2010 9 38.5
10 2010 10 42.4
# ... with 679 more rows
Note that for the week function, week 1 always starts on January 1st. If you want to number the weeks according to the ISO 8601 standard, use the isoweek function. Alternatively, you can use epiweek, which follows the US CDC convention.
df %>%
group_by(year(DATE), isoweek(DATE)) %>%
summarise(mean = mean(FLOW))
output
# A tibble: 681 x 3
# Groups: year(DATE) [13]
`year(DATE)` `isoweek(DATE)` mean
<dbl> <dbl> <dbl>
1 2010 1 40.0
2 2010 2 45.5
3 2010 3 33.2
4 2010 4 38.9
5 2010 5 45.0
6 2010 6 40.7
7 2010 7 38.5
8 2010 8 42.5
9 2010 9 37.1
10 2010 10 42.4
# ... with 671 more rows
If you want to better understand how these functions work, please follow the code below
df %>%
mutate(
w1 = week(DATE),
w2 = isoweek(DATE),
w3 = epiweek(DATE)
)
output
# A tibble: 4,748 x 5
DATE FLOW w1 w2 w3
<date> <dbl> <dbl> <dbl> <dbl>
1 2010-01-01 34.4 1 53 52
2 2010-01-02 37.7 1 53 52
3 2010-01-03 55.6 1 53 1
4 2010-01-04 40.7 1 1 1
5 2010-01-05 41.3 1 1 1
6 2010-01-06 57.2 1 1 1
7 2010-01-07 44.6 1 1 1
8 2010-01-08 27.3 2 1 1
9 2010-01-09 33.1 2 1 1
10 2010-01-10 35.5 2 1 2
# ... with 4,738 more rows
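The grouping above gives one mean per calendar week. If what you need is the rolling version from the question (the current day plus the previous 6 days), a base R sketch with stats::filter could look like this. It uses the 15 sample flows from the question; the names flow, roll_mean and roll_sum are my own.

```r
# The 15 daily flows from the question's sample
flow <- c(88.2, 77.6, 68.4, 61.5, 55.3, 52.5, 49.7, 46.7,
          43.3, 41.3, 39.3, 37.7, 35.8, 34.1, 39.8)

# Trailing 7-day window: sides = 1 uses the current value and the 6 before it.
# The stats:: prefix matters because dplyr masks filter() once it is loaded.
roll_mean <- as.numeric(stats::filter(flow, rep(1 / 7, 7), sides = 1))

# The rolling sum is the rolling mean times the window length
roll_sum <- roll_mean * 7

# The first 6 entries are NA because a full 7-day window is not yet available
print(round(roll_mean, 2))
```

To get one array per water year, the same call can be applied to the flows of each water year separately, keeping in mind that the first 6 days of each year would then be NA.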

Error for NA using group_by or aggregate function [aggregate.data.frame(lhs, mf[-1L], FUN = FUN, ...) : no rows to aggregate]

I've recently picked up R programming and have been looking through some group_by/aggregate questions posted here to help me learn better. A question came to my mind earlier today on how group_by/aggregate can incorporate NA data rather than 0.
Given the table and code below (credit to max_lim for allowing me to use his data set), what happens if some of the fields are NA (which happens quite often)?
Farms = c(rep("Farm 1", 6), rep("Farm 2", 6), rep("Farm 3", 6))
Year = rep(c(2020,2020,2019,2019,2018,2018),3)
Cow = c(22,NA,16,12,8,NA,31,NA,3,20,39,34,27,50,NA,NA,NA,NA)
Duck = c(12,12,6,NA,NA,NA,28,13,31,50,33,20,NA,9,19,2,NA,7)
Chicken = c(100,120,80,50,NA,10,27,31,NA,43,NA,28,37,NA,NA,NA,5,43)
Sheep = c(30,20,10,NA,16,13,10,20,20,17,48,12,30,NA,20,NA,27,49)
Horse = c(25,20,16,11,NA,12,14,NA,43,42,10,12,42,NA,16,7,NA,42)
Data = data.frame(Farms, Year, Cow, Duck, Chicken, Sheep, Horse)
Farm    Year  Cow  Duck  Chicken  Sheep  Horse
Farm 1  2020  22   12    100      30     25
Farm 1  2020  NA   12    120      20     20
Farm 1  2019  16   6     80       10     16
Farm 1  2019  12   NA    50       NA     11
Farm 1  2018  8    NA    NA       16     NA
Farm 1  2018  NA   NA    10       13     12
Farm 2  2020  31   28    27       10     14
Farm 2  2020  NA   13    31       20     NA
Farm 2  2019  3    31    NA       20     43
Farm 2  2019  20   50    43       17     42
Farm 2  2018  39   33    NA       48     10
Farm 2  2018  34   20    28       12     12
Farm 3  2020  27   NA    37       30     42
Farm 3  2020  50   9     NA       NA     NA
Farm 3  2019  NA   19    NA       20     16
Farm 3  2019  NA   2     NA       NA     7
Farm 3  2018  NA   NA    5        27     NA
Farm 3  2018  NA   7     43       49     42
If I use aggregate(.~Farms + Year, Data, mean) here, I get "Error in aggregate.data.frame(lhs, mf[-1L], FUN = FUN, ...) : no rows to aggregate", which I assume is because the mean function can't handle NA.
Does anyone know how we can modify the aggregate/group_by call so that the average is calculated using only the entries that are not NA? For example:
2020: 10, 2019: NA, 2018: 20, 2017: NA, 2016: 15 -> the average (after discarding the NA years 2019 and 2017) will be (10 + 20 + 15) / 3 = 15.
The ideal output would be as follows:
Farm    Year  Cow   Duck  Chicken  Sheep  Horse
Farm 1  2020  22    12    110      25     22.5
Farm 1  2019  14    6     65       10     13.5
Farm 1  2018  8     N.A.  10       14.5   12
Farm 2  2020  31    20.5  29       15     14
Farm 2  2019  11.5  40.5  43       18.5   42.5
Farm 2  2018  36.5  26.5  28       30     11
Farm 3  2020  ...   ...   ...      ...    ...
Farm 3  2019  ...   ...   ...      ...    ...
Farm 3  2018  ...   ...   ...      ...    ...
(Farm 1, 2020, Cow: avg = 22/1 as one entry is NA)
(Farm 1, 2018, Duck: N.A. as it's all NA)

Here is a way to create the wanted data.frame. Note that in row 2 (Sheep) the mean of NA and 10 with the NA removed is 10, as in your expected output; without na.rm = TRUE it would be NA.
library(dplyr)
Using aggregate
Data %>%
aggregate(.~Year+Farms,., FUN=mean, na.rm=T, na.action=NULL) %>%
arrange(Farms, desc(Year)) %>%
as.data.frame() %>%
mutate_at(names(.), ~replace(., is.nan(.), NA))
Using summarize
Data %>%
group_by(Year, Farms) %>%
summarize(MeanCow = mean(Cow, na.rm=T),
MeanDuck = mean(Duck, na.rm=T),
MeanChicken = mean(Chicken, na.rm=T),
MeanSheep = mean(Sheep, na.rm=T),
MeanHorse = mean(Horse, na.rm=T)) %>%
arrange(Farms, desc(Year)) %>%
as.data.frame() %>%
mutate_at(names(.), ~replace(., is.nan(.), NA))
Solution for both
Year Farms Cow Duck Chicken Sheep Horse
1 2020 Farm 1 22.0 12.0 110 25.0 22.5
2 2019 Farm 1 14.0 6.0 65 10.0 13.5
3 2018 Farm 1 8.0 NA 10 14.5 12.0
4 2020 Farm 2 31.0 20.5 29 15.0 14.0
5 2019 Farm 2 11.5 40.5 43 18.5 42.5
6 2018 Farm 2 36.5 26.5 28 30.0 11.0
7 2020 Farm 3 38.5 9.0 37 30.0 42.0
8 2019 Farm 3 NA 10.5 NA 20.0 11.5
9 2018 Farm 3 NA 7.0 24 38.0 42.0
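The two pieces doing the work above are na.rm = TRUE (ignore missing values inside each group) and the NaN-to-NA replacement (a group with nothing left has mean NaN). Here is a minimal base R sketch of the same idea on toy data; the data frame toy and the helper mean_or_na are hypothetical names of mine, and na.action = na.pass is used here as the documented way to keep rows with NA in a formula call.

```r
# Toy data (made-up values): group "a" has one NA, group "b" is all NA
toy <- data.frame(
  grp = c("a", "a", "b", "b"),
  x   = c(10, NA, NA, NA)
)

# Mean that ignores NA, and returns NA instead of NaN when nothing is left
mean_or_na <- function(v) {
  if (all(is.na(v))) NA_real_ else mean(v, na.rm = TRUE)
}

# na.action = na.pass keeps the rows containing NA, so FUN gets to see them
res <- aggregate(x ~ grp, data = toy, FUN = mean_or_na, na.action = na.pass)
print(res)
```

Handling the all-NA case inside the helper avoids the separate is.nan() clean-up step, at the cost of one extra function.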

Purrr Multiply index data frame with dataframe

Thank you all for reading this problem.
What I would like to do is multiply my test data by my index file while matching columns.
So Dp_water is multiplied by Dp_water, and so on, iterating over all index variables: kcal, fat, prot, carbs.
My test data holds, for 10 individuals, the consumption of 4 food groups in grams.
For each individual I would like to calculate the kcal, fat, prot and carbs intake.
For each individual I would like to create new variables:
Dp_water_kcal, Dp_coffee_kcal, Dp_soup_kcal, Dp_soda_kcal
Dp_water_fat, Dp_coffee_fat, Dp_soup_fat, Dp_soda_fat
etc.
library(tidyverse)
Sample data
Index file
index <- data.frame(Variable=c("Dp_water","Dp_coffee","Dp_soup","Dp_soda"),
kcal=c(0,10,20,40),
fat=c(0,5,10,15),
prot=c(2,4,6,8),
carbs=c(3,6,9,12))
index <- index %>%
pivot_longer(c(kcal,fat,prot,carbs)) %>%
pivot_wider(names_from = Variable, values_from = value)
> index
# A tibble: 4 x 5
name Dp_water Dp_coffee Dp_soup Dp_soda
<chr> <dbl> <dbl> <dbl> <dbl>
1 kcal 0 10 20 40
2 fat 0 5 10 15
3 prot 2 4 6 8
4 carbs 3 6 9 12
Below subject data consumption of 4 foodgroups.
test_data <- data.frame(Dp_water=c(11:20),
Dp_coffee=c(31:40),
Dp_soup=c(21:30),
Dp_soda=c(41:50),
id=1:10)
Dp_water Dp_coffee Dp_soup Dp_soda id
1 11 31 21 41 1
2 12 32 22 42 2
3 13 33 23 43 3
4 14 34 24 44 4
5 15 35 25 45 5
6 16 36 26 46 6
7 17 37 27 47 7
8 18 38 28 48 8
9 19 39 29 49 9
10 20 40 30 50 10
If I do the following, it works. But I would like to do this for all variables, not only kcal, and I would like to keep the id column.
test_data %>%
select(-id) %>%
map2_dfr(., test_data[match(names(.), names(test_data))], ~.x/100 * .y) %>%
set_names(paste0(names(.), "_kcal"))
# A tibble: 10 x 4
Dp_water_kcal Dp_coffee_kcal Dp_soup_kcal Dp_soda_kcal
<dbl> <dbl> <dbl> <dbl>
1 1.21 9.61 4.41 16.8
2 1.44 10.2 4.84 17.6
3 1.69 10.9 5.29 18.5
4 1.96 11.6 5.76 19.4
5 2.25 12.2 6.25 20.2
6 2.56 13.0 6.76 21.2
7 2.89 13.7 7.29 22.1
8 3.24 14.4 7.84 23.0
9 3.61 15.2 8.41 24.0
10 4 16 9 25
Thank you all for any help!
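One possible way to get all food/nutrient combinations while keeping id is a plain double loop in base R rather than purrr. This is only a sketch under two assumptions of mine: the index values are per 100 g (suggested by the /100 in the code above), and the index is used in its original wide form, before the pivot. The name result is hypothetical.

```r
# Index in its original form: one row per food, one column per nutrient
index <- data.frame(
  Variable = c("Dp_water", "Dp_coffee", "Dp_soup", "Dp_soda"),
  kcal  = c(0, 10, 20, 40),
  fat   = c(0, 5, 10, 15),
  prot  = c(2, 4, 6, 8),
  carbs = c(3, 6, 9, 12),
  stringsAsFactors = FALSE
)

# Consumption in grams for 10 individuals, as in the question
test_data <- data.frame(Dp_water = 11:20, Dp_coffee = 31:40,
                        Dp_soup = 21:30, Dp_soda = 41:50, id = 1:10)

# Start from the id column and add one column per food/nutrient pair
result <- test_data["id"]
for (food in index$Variable) {
  for (nutrient in c("kcal", "fat", "prot", "carbs")) {
    # Assumed to be the nutrient content per 100 g of this food
    per100 <- index[index$Variable == food, nutrient]
    result[[paste(food, nutrient, sep = "_")]] <- test_data[[food]] / 100 * per100
  }
}
print(head(result, 3))
```

The loop matches columns by name, so the order of the foods in index and test_data does not have to agree.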

Filter a dataframe by keeping row dates of three days in a row preferably with dplyr

I would like to filter a dataframe based on its date column, keeping only the rows that belong to a run of at least 3 consecutive days. I would like to do this as efficiently and quickly as possible, so a vectorized approach would be welcome.
I tried to take inspiration from the following link, but it didn't go well, as it is a different problem:
How to filter rows based on difference in dates between rows in R?
I tried to do it with a for loop. I managed to flag the dates that are not consecutive, but it didn't give me the desired result, because it keeps all dates that are in a row even if there are fewer than 3 in a row.
tf is my dataframe
for(i in 2:(nrow(tf)-1)){
if(tf$Date[i] != tf$Date[i+1] %m+% days(-1)){
if(tf$Date[i] != tf$Date[i-1] %m+% days(1)){
tf$Date[i] = as.Date(0)
}
}
}
The first 22 rows of my dataframe look something like this:
Date RR.x RR.y Y
1 1984-10-20 1 10.8 1984
2 1984-11-04 1 12.5 1984
3 1984-11-05 1 7.0 1984
4 1984-11-09 1 22.9 1984
5 1984-11-10 1 24.4 1984
6 1984-11-11 1 19.0 1984
7 1984-11-13 1 5.9 1984
8 1986-10-15 1 10.3 1986
9 1986-10-16 1 18.1 1986
10 1986-10-17 1 11.3 1986
11 1986-11-17 1 14.1 1986
12 2003-10-17 1 7.8 2003
13 2003-10-25 1 7.6 2003
14 2003-10-26 1 5.0 2003
15 2003-10-27 1 6.6 2003
16 2003-11-15 1 26.4 2003
17 2003-11-20 1 10.0 2003
18 2011-10-29 1 10.0 2011
19 2011-11-04 1 11.4 2011
20 2011-11-21 1 9.8 2011
21 2011-11-22 1 5.6 2011
22 2011-11-23 1 20.4 2011
The result should be:
Date RR.x RR.y Y
4 1984-11-09 1 22.9 1984
5 1984-11-10 1 24.4 1984
6 1984-11-11 1 19.0 1984
8 1986-10-15 1 10.3 1986
9 1986-10-16 1 18.1 1986
10 1986-10-17 1 11.3 1986
13 2003-10-25 1 7.6 2003
14 2003-10-26 1 5.0 2003
15 2003-10-27 1 6.6 2003
20 2011-11-21 1 9.8 2011
21 2011-11-22 1 5.6 2011
22 2011-11-23 1 20.4 2011

One possibility could be:
df %>%
mutate(Date = as.Date(Date, format = "%Y-%m-%d"),
diff = c(0, diff(Date))) %>%
group_by(grp = cumsum(diff > 1 & lead(diff, default = last(diff)) == 1)) %>%
filter(if_else(diff > 1 & lead(diff, default = last(diff)) == 1, 1, diff) == 1) %>%
filter(n() >= 3) %>%
ungroup() %>%
select(-diff, -grp)
Date RR.x RR.y Y
<date> <int> <dbl> <int>
1 1984-11-09 1 22.9 1984
2 1984-11-10 1 24.4 1984
3 1984-11-11 1 19 1984
4 1986-10-15 1 10.3 1986
5 1986-10-16 1 18.1 1986
6 1986-10-17 1 11.3 1986
7 2003-10-25 1 7.6 2003
8 2003-10-26 1 5 2003
9 2003-10-27 1 6.6 2003
10 2011-11-21 1 9.8 2011
11 2011-11-22 1 5.6 2011
12 2011-11-23 1 20.4 2011

Here's a base solution:
DF$Date <- as.Date(DF$Date)
rles <- rle(cumsum(c(1,diff(DF$Date)!=1)))
rles$values <- rles$lengths >= 3
DF[inverse.rle(rles), ]
Date RR.x RR.y Y
4 1984-11-09 1 22.9 1984
5 1984-11-10 1 24.4 1984
6 1984-11-11 1 19.0 1984
8 1986-10-15 1 10.3 1986
9 1986-10-16 1 18.1 1986
10 1986-10-17 1 11.3 1986
13 2003-10-25 1 7.6 2003
14 2003-10-26 1 5.0 2003
15 2003-10-27 1 6.6 2003
20 2011-11-21 1 9.8 2011
21 2011-11-22 1 5.6 2011
22 2011-11-23 1 20.4 2011
Similar approach in dplyr
DF%>%
mutate(Date = as.Date(Date))%>%
add_count(IDs = cumsum(c(1, diff(Date) !=1)))%>%
filter(n >= 3)
# A tibble: 12 x 6
Date RR.x RR.y Y IDs n
<date> <int> <dbl> <int> <dbl> <int>
1 1984-11-09 1 22.9 1984 3 3
2 1984-11-10 1 24.4 1984 3 3
3 1984-11-11 1 19 1984 3 3
4 1986-10-15 1 10.3 1986 5 3
5 1986-10-16 1 18.1 1986 5 3
6 1986-10-17 1 11.3 1986 5 3
7 2003-10-25 1 7.6 2003 8 3
8 2003-10-26 1 5 2003 8 3
9 2003-10-27 1 6.6 2003 8 3
10 2011-11-21 1 9.8 2011 13 3
11 2011-11-22 1 5.6 2011 13 3
12 2011-11-23 1 20.4 2011 13 3
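To see how the cumsum(c(1, diff(Date) != 1)) trick in both versions works, here is a small self-contained sketch on the first seven dates from the question. Instead of rle or add_count it uses ave() to attach the run length to each date, which is my own variation; the names run_id, run_len and keep are hypothetical.

```r
# First seven dates from the question
dates <- as.Date(c("1984-10-20", "1984-11-04", "1984-11-05",
                   "1984-11-09", "1984-11-10", "1984-11-11", "1984-11-13"))

# A new run starts whenever the gap to the previous date is not exactly 1 day
run_id <- cumsum(c(1, diff(dates) != 1))

# Length of the run that each date belongs to
run_len <- ave(seq_along(dates), run_id, FUN = length)

# Keep only dates that are part of a run of 3 or more consecutive days
keep <- dates[run_len >= 3]
print(keep)
```

Here run_id comes out as 1, 2, 2, 3, 3, 3, 4, so only the third run (9 to 11 November 1984) is long enough to be kept.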