Group by Consecutive Dates in R

I want to group animals based on consecutive months they were found within the same burrow, but also split up those groups if the months were not consecutive.
# Input data (burrow.data; a reproducible version appears in the Note at the end of the answer below)
Animal Burrow Date
1 027 B0961 2022-03-01
2 027 B0961 2022-04-26
3 033 1920 2021-11-02
4 033 1955 2022-03-29
5 033 1955 2022-04-26
6 063 B0540 2021-04-21
7 063 B0540 2022-01-04
8 063 B0540 2022-03-01
9 101 B0021 2020-11-23
10 101 B0021 2020-12-23
11 101 B0021 2021-11-04
12 101 B0021 2022-01-06
13 101 B0021 2022-02-04
14 101 B0021 2022-03-03
# Expected output
Animal Burrow grp Date.Start Date.End
1 033 1920 1 2021-11-02 2021-11-02
2 033 1955 1 2022-03-29 2022-04-26
3 101 B0021 1 2020-11-23 2020-12-23
4 101 B0021 2 2022-01-06 2022-03-03
5 063 B0540 1 2021-04-21 2022-03-01
6 027 B0961 1 2022-03-01 2022-04-26
I used code from another post: Group consecutive dates in R
And wrote:
burrow.input <- burrow.data[order(burrow.data$Date), ]
burrow.input$grp <- ave(as.integer(burrow.input$Date), burrow.input[-4],
                        FUN = function(z) cumsum(c(TRUE, diff(z) > 1)))
burrow.input

out <- aggregate(Date ~ Animal + Burrow + grp, data = burrow.input,
                 FUN = function(z) setNames(range(z), c("Start", "End")))
out <- do.call(data.frame, out)
out[, 4:5] <- lapply(out[, 4:5], as.Date, origin = "1970-01-01")
out
The code keeps grouping animal 101 into a single group instead of two groups split at the date gap (see below). How can I fix this?
Animal Burrow grp Date.Start Date.End
1 033 1920 1 2021-11-02 2021-11-02
2 033 1955 1 2022-03-29 2022-04-26
3 101 B0021 1 2020-11-23 2022-03-03
4 063 B0540 1 2021-04-21 2022-03-01
5 027 B0961 1 2022-03-01 2022-04-26

Group the data by Animal, Burrow and a grouping variable that changes each time the date jumps by more than one month. Here as.yearmon converts each date to a yearmon object, which internally is the year plus 0 for January, 1/12 for February, ..., 11/12 for December, so multiply it by 12 and check whether the difference from the prior value is greater than 1. Take the cumulative sum of that to generate a grouping variable. Finally, summarise each group, sort, and remove the grouping variable that was added.
library(dplyr)
library(zoo)

burrow.data %>%
  group_by(Animal, Burrow,
           diff = cumsum(c(1, diff(12 * as.yearmon(Date)) > 1))) %>%
  summarize(Date.start = first(Date), Date.end = last(Date), .groups = "drop") %>%
  arrange(Burrow) %>%
  select(-diff)
giving:
# A tibble: 7 × 4
Animal Burrow Date.start Date.end
<int> <chr> <chr> <chr>
1 33 1920 2021-11-02 2021-11-02
2 33 1955 2022-03-29 2022-04-26
3 101 B0021 2020-11-23 2021-11-04
4 101 B0021 2022-01-06 2022-03-03
5 63 B0540 2021-04-21 2022-01-04
6 63 B0540 2022-03-01 2022-03-01
7 27 B0961 2022-03-01 2022-04-26
Note
The input data in reproducible form is:
burrow.data <-
structure(list(Animal = c(27L, 27L, 33L, 33L, 33L, 63L, 63L,
63L, 101L, 101L, 101L, 101L, 101L, 101L), Burrow = c("B0961",
"B0961", "1920", "1955", "1955", "B0540", "B0540", "B0540", "B0021",
"B0021", "B0021", "B0021", "B0021", "B0021"), Date = c("2022-03-01",
"2022-04-26", "2021-11-02", "2022-03-29", "2022-04-26", "2021-04-21",
"2022-01-04", "2022-03-01", "2020-11-23", "2020-12-23", "2021-11-04",
"2022-01-06", "2022-02-04", "2022-03-03")), class = "data.frame",
row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10",
"11", "12", "13", "14"))

Related

How to create and populate dummy rows in tidyverse?

I am working with some monthly data and I would like to convert it to daily data by creating and populating some dummy rows, as the title suggests.
For example, say I have the following data:
date index
2013-04-30 232
2013-05-31 232
2013-06-30 233
Is there an "easy" way, preferably through tidyverse, that I could convert the above data into daily data, assuming I keep the index constant throughout the month? For example, I would like to create another 29 rows for April, ranging from 2013-04-01 to 2013-04-29 with the index of the last day of the month which would be 232 for April. The same should be applied to the rest of months (I have more data than just those three months).
Any intuitive suggestions will be greatly appreciated :)
Using complete and fill from tidyr you could do:
dat <- structure(list(
  date = structure(c(15825, 15856, 15886), class = "Date"),
  index = c(232L, 232L, 233L)
), class = "data.frame", row.names = c(NA, -3L))

library(tidyr)

dat |>
  complete(date = seq(as.Date("2013-04-01"), as.Date("2013-06-30"), "day")) |>
  fill(index, .direction = "up")
#> # A tibble: 91 × 2
#> date index
#> <date> <int>
#> 1 2013-04-01 232
#> 2 2013-04-02 232
#> 3 2013-04-03 232
#> 4 2013-04-04 232
#> 5 2013-04-05 232
#> 6 2013-04-06 232
#> 7 2013-04-07 232
#> 8 2013-04-08 232
#> 9 2013-04-09 232
#> 10 2013-04-10 232
#> # … with 81 more rows
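If the sequence endpoints should come from the data itself rather than being typed by hand, the same idea works with the bounds derived from date; a small variation (floor_date here is lubridate's, used to back min(date) up to the first of the month):
library(tidyr)
library(lubridate)

dat |>
  # full daily sequence from the first of the earliest month to the last observed date
  complete(date = seq(floor_date(min(date), "month"), max(date), by = "day")) |>
  # fill the month's index upward from the month-end observation
  fill(index, .direction = "up")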

R: flag rows in 1st dataframe if the date is between two dates in second dataframe

I have two datasets. The first dataset has two date columns (Start, Stop) for each Id; sometimes it contains multiple Start-Stop ranges per Id.
Dataset1
Id Code Start Stop
431 279 2017-11-15 2019-08-15
431 578 2019-09-15 2021-01-15
832 590 2008-04-15 2020-05-15
832 519 2020-06-15 2021-04-15
The second dataset has Id and many time stamped rows per ID, like this below
Id Weight Date
431 12.23 2018
832 15.12 2020
832 6.78 2020
832 4.27 2007
My goal is to create a column Between that indicates "Yes" if the row in the 2nd dataset falls between the two dates in the first dataset and "No" otherwise, carrying over the other columns from the 1st dataset, like this.
Id Weight Date Between Code Start Stop
431 12.23 2018 Yes 279 2017-11-15 2019-08-15
832 15.12 2020 Yes 590 2008-04-15 2020-05-15
832 6.78 2020 Yes 590 2008-04-15 2020-05-15
832 4.27 2007 No NA NA NA
I could do this with for loops, but I would prefer a solution using dplyr, an inner join, or other options without for loops. Thanks in advance.
It's pretty messy but you may try,
df2 %>%
  full_join(df1, by = "Id") %>%
  mutate(Date = as.Date(ISOdate(Date, 1, 1)),
         Start = as.Date(Start),
         Stop = as.Date(Stop)) %>%
  rowwise() %>%
  mutate(Between = between(Date, Start, Stop)) %>%
  group_by(Id, Date) %>%
  mutate(check = any(Between)) %>%
  filter(!(Between == FALSE & check == TRUE)) %>%
  mutate(Start = ifelse(check, Start, NA),
         Stop = ifelse(check, Stop, NA),
         Code = ifelse(check, Code, NA)) %>%
  distinct() %>%
  select(-check)
Id Weight Date Code Start Stop Between
<int> <dbl> <date> <int> <dbl> <dbl> <lgl>
1 431 12.2 2018-01-01 279 17485 18123 TRUE
2 832 15.1 2020-01-01 590 13984 18397 TRUE
3 832 6.78 2020-01-01 590 13984 18397 TRUE
4 832 4.27 2007-01-01 NA NA NA FALSE
Here's a somewhat shorter version that also achieves what you are after.
library(tidyverse)
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
df1 <- tibble::tribble(
  ~Id,  ~Code, ~Start,       ~Stop,
  431L, 279L,  "2017-11-15", "2019-08-15",
  431L, 578L,  "2019-09-15", "2021-01-15",
  832L, 590L,  "2008-04-15", "2020-05-15",
  832L, 519L,  "2020-06-15", "2021-04-15"
)
df2 <- tibble::tribble(
  ~Id,  ~Weight, ~Date,
  431L, 12.23,   2018L,
  832L, 15.12,   2020L,
  832L, 6.78,    2020L,
  832L, 4.27,    2007L
)

df1 <- df1 %>%
  mutate(Start = ymd(Start),
         Stop = ymd(Stop))
df2 <- df2 %>%
  mutate(Date = ymd(Date, truncated = 2L))

full_join(df1, df2) %>%
  mutate(Between = case_when((Date %within% interval(ymd(Start), ymd(Stop))) == TRUE ~ TRUE,
                             TRUE ~ FALSE))
#> Joining, by = "Id"
#> # A tibble: 8 × 7
#> Id Code Start Stop Weight Date Between
#> <int> <int> <date> <date> <dbl> <date> <lgl>
#> 1 431 279 2017-11-15 2019-08-15 12.2 2018-01-01 TRUE
#> 2 431 578 2019-09-15 2021-01-15 12.2 2018-01-01 FALSE
#> 3 832 590 2008-04-15 2020-05-15 15.1 2020-01-01 TRUE
#> 4 832 590 2008-04-15 2020-05-15 6.78 2020-01-01 TRUE
#> 5 832 590 2008-04-15 2020-05-15 4.27 2007-01-01 FALSE
#> 6 832 519 2020-06-15 2021-04-15 15.1 2020-01-01 FALSE
#> 7 832 519 2020-06-15 2021-04-15 6.78 2020-01-01 FALSE
#> 8 832 519 2020-06-15 2021-04-15 4.27 2007-01-01 FALSE
Created on 2021-10-11 by the reprex package (v2.0.1)
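If dplyr 1.1.0 or later is available, the rowwise filtering can be avoided entirely with a non-equi join via join_by(). A sketch, reusing the df1 and df2 objects prepared above (Start, Stop, and Date already converted to Date class):
library(dplyr)  # join_by() requires dplyr >= 1.1.0

df2 %>%
  # keep every df2 row; attach the df1 range (if any) whose interval contains Date
  left_join(df1, by = join_by(Id, between(Date, Start, Stop))) %>%
  # rows with no matching interval have NA Code/Start/Stop
  mutate(Between = if_else(is.na(Code), "No", "Yes"))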

'Interpolation' of a missing date/value in R?

I have a dataframe like so:
Month CumulativeSum
2019-02-01 40
2019-03-01 70
2019-04-01 80
2019-07-01 100
2019-08-01 120
The problem is that nothing happened in May and June, hence there is no data. Plotting this in bar charts results in some empty space on the x-axis.
Is there some way to "fill" the missing spots like so, using the last known value?
Month CumulativeSum
2019-02-01 40
2019-03-01 70
2019-04-01 80
**2019-05-01 80** <--
**2019-06-01 80** <--
2019-07-01 100
2019-08-01 120
We can use complete
library(dplyr)
library(tidyr)

df1 %>%
  complete(Month = seq(min(Month), max(Month), by = '1 month')) %>%
  fill(CumulativeSum)
-output
# A tibble: 7 x 2
# Month CumulativeSum
# <date> <int>
#1 2019-02-01 40
#2 2019-03-01 70
#3 2019-04-01 80
#4 2019-05-01 80
#5 2019-06-01 80
#6 2019-07-01 100
#7 2019-08-01 120
data
df1 <- structure(list(Month = structure(c(17928, 17956, 17987, 18078,
18109), class = "Date"), CumulativeSum = c(40L, 70L, 80L, 100L,
120L)), row.names = c(NA, -5L), class = "data.frame")
Here is a base R option using cummax
transform(
  data.frame(
    Month = seq(min(df1$Month), max(df1$Month), by = "1 month"),
    CumulativeSum = -Inf
  ),
  CumulativeSum = cummax(replace(CumulativeSum, Month %in% df1$Month, df1$CumulativeSum))
)
which gives
Month CumulativeSum
1 2019-02-01 40
2 2019-03-01 70
3 2019-04-01 80
4 2019-05-01 80
5 2019-06-01 80
6 2019-07-01 100
7 2019-08-01 120
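Another option along the same lines, sketched here with zoo (using the df1 object from the data block above), is to merge against the full month sequence and carry the last observation forward with na.locf:
library(zoo)

# full month sequence; months absent from df1 get NA after the merge
full <- data.frame(Month = seq(min(df1$Month), max(df1$Month), by = "1 month"))
out  <- merge(full, df1, all.x = TRUE)

# carry the last known cumulative sum forward into the missing months
out$CumulativeSum <- na.locf(out$CumulativeSum)
out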

R code (Rstats) calculating unemployment rate based off columns in long form data

I am trying to calculate the unemployment rate based on the data below and add it as new rows to the data table. I want to divide unemployed by labourforce for each date and add each data point as a new row.
Essentially, I am trying to go from this
date        series_1      value
2021-01-01  labourforce   13793
2021-02-01  labourforce   13812
2021-03-01  labourforce   13856
2021-01-01  unemployed      875
2021-02-01  unemployed      805
2021-03-01  unemployed      778
to this
date        series_1           value
2021-01-01  labourforce        13793
2021-02-01  labourforce        13812
2021-03-01  labourforce        13856
2021-01-01  unemployed           875
2021-02-01  unemployed           805
2021-03-01  unemployed           778
2021-01-01  unemploymentrate     6.3
2021-02-01  unemploymentrate     5.8
2021-03-01  unemploymentrate     5.6
Here is my code so far. I know the last line is wrong. Any suggestions or ideas are welcome!
longdata %>%
  group_by(date) %>%
  summarise(series_1 = 'unemploymentrate',
            value = series_1$unemployed/series_1$labourforce))
For each day, you can get the ratio of 'unemployed' to 'labourforce' and add it as new rows to your original dataset.
library(dplyr)

df %>%
  group_by(date) %>%
  summarise(value = value[series_1 == 'unemployed'] / value[series_1 == 'labourforce'] * 100,
            series_1 = 'unemploymentrate') %>%
  bind_rows(df) %>%
  arrange(series_1)
# date value series_1
# <chr> <dbl> <chr>
#1 2021-01-01 13793 labourforce
#2 2021-02-01 13812 labourforce
#3 2021-03-01 13856 labourforce
#4 2021-01-01 875 unemployed
#5 2021-02-01 805 unemployed
#6 2021-03-01 778 unemployed
#7 2021-01-01 6.34 unemploymentrate
#8 2021-02-01 5.83 unemploymentrate
#9 2021-03-01 5.61 unemploymentrate
Try:
library(dplyr)
library(tidyr)

df %>%
  pivot_wider(names_from = series_1, values_from = value) %>%
  mutate(unempolymentrate = round(unemployed * 100 / labourforce, 2)) %>%
  pivot_longer(-1, names_to = "series_1", values_to = "value") %>%
  mutate(series_1 = factor(series_1, levels = c("labourforce", "unemployed", "unempolymentrate"))) %>%
  arrange(series_1, date)
#> # A tibble: 9 x 3
#> date series_1 value
#> <chr> <fct> <dbl>
#> 1 2021-01-01 labourforce 13793
#> 2 2021-02-01 labourforce 13812
#> 3 2021-03-01 labourforce 13856
#> 4 2021-01-01 unemployed 875
#> 5 2021-02-01 unemployed 805
#> 6 2021-03-01 unemployed 778
#> 7 2021-01-01 unempolymentrate 6.34
#> 8 2021-02-01 unempolymentrate 5.83
#> 9 2021-03-01 unempolymentrate 5.61
Created on 2021-04-23 by the reprex package (v2.0.0)
data
df <- structure(list(date = c("2021-01-01", "2021-02-01", "2021-03-01",
"2021-01-01", "2021-02-01", "2021-03-01"), series_1 = c("labourforce",
"labourforce", "labourforce", "unemployed", "unemployed", "unemployed"
), value = c(13793L, 13812L, 13856L, 875L, 805L, 778L)), class = "data.frame", row.names = c(NA,
-6L))
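For what it's worth, the same split-compute-bind logic also translates to data.table if that is already in use; a sketch using the df defined above:
library(data.table)

dt <- as.data.table(df)

# one unemployment-rate row per date, computed from the two long-form rows
rate <- dt[, .(series_1 = "unemploymentrate",
               value = round(100 * value[series_1 == "unemployed"] /
                               value[series_1 == "labourforce"], 1)),
           by = date]

# append the new rows and order like the expected output
rbind(dt, rate)[order(series_1, date)]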

R Aggregate days and count specific observations for each day

Day Time Numbers
6388 2017-02-01 10:43 R33
7129 2017-02-04 15:32 N39.0, N39.0, N39.0
9689 2017-02-17 08:54 S72.11, S72.11, S72.11, S72.11
6703 2017-02-02 18:55 R11
9026 2017-02-13 17:34 S06.0, S06.0, S06.0
5013 2017-01-25 00:33 J18.1, J18.1, J18.1, J18.1
5849 2017-01-29 17:57 I21.4, I21.4, I21.4
9245 2017-02-14 19:03 J18.0, J18.0, J18.0
1978 2017-01-09 21:23 K59.0
5021 2017-01-25 02:46 I47.1, I47.1, I47.1
9258 2017-02-14 20:19 S42.3
541 2017-01-03 11:44 I63.8, I63.8, I63.8
4207 2017-01-20 19:52 E83.58, E83.58, E83.58
8650 2017-02-11 18:39 R55, R55, S06.0, S06.0, R55
9442 2017-02-15 21:30 K86.1
4186 2017-01-20 18:27 S05.1
4231 2017-01-20 22:10 M17.9
6847 2017-02-03 11:45 L02.4
1739 2017-01-08 21:19 S20.2
3685 2017-01-18 09:56 G40.9
9497 2017-02-16 09:52 S83.6
2563 2017-01-12 20:47 M13.16, M25.56, M25.56
9731 2017-02-17 13:10 B99, B99, N39.0, N39.0
7759 2017-02-07 14:25 R51, G43.0, G43.0
368 2017-01-02 15:05 T83.0, T83.0, T83.0, N13.3, N13.6
I want to aggregate this df in a special way. I want to count how many Numbers start with e.g. "A" on each day. I want a new dataframe that looks like this:
Day GroupA GroupB GroupC .....
1 2017-01-01 2 2 0
2 2017-01-02 ..................
GroupA means Numbers starting with A. If there are multiple Numbers starting with A in one single row, it should be counted as one. The class of my Number column is character.
> class(df[1,3])
[1] "character"
> df[1,3]
[1] "A41.8, A41.51, A41.51"**
My problem is how I can combine the aggregate command with the counts. My real df is a lot bigger (it spans more than two years), so I need an automated solution.
EDIT: See data down below
df1 <- structure(list(Day = c("2017-01-07", "2017-01-23", "2017-01-08",
"2017-01-13", "2017-02-10", "2017-01-07", "2017-01-24", "2017-01-02",
"2017-01-03", "2017-01-06", "2017-01-11", "2017-01-21", "2017-01-13",
"2017-01-10", "2017-02-18", "2017-01-10", "2017-01-31", "2017-01-27",
"2017-01-23", "2017-01-13", "2017-02-10", "2017-01-09", "2017-01-23",
"2017-01-09", "2017-01-08"), Time= c("02:02", "14:51", "02:12",
"17:49", "00:00", "21:30", "22:28", "17:27", "12:14", "22:52",
"14:19", "11:40", "19:33", "04:01", "15:59", "14:57", "08:34",
"13:21", "02:01", "14:29", "20:17", "14:30", "02:34", "04:56",
"14:34"), Number= c("H10.9", "K85.80, K85.20, K85.80, K85.20",
"R09.1", "I10.90", "I48.9, I48.0, I48.9, I48.0", "A09.0, A09.0, R42, R42",
"H16.1", "K92.1, K92.1, K92.1", "K40.90, J12.2, J18.0, J96.01, J12.2",
"B99, J15.8, J18.0, J15.8", "S01.55", "M21.33", "I10.01, I10.01, J44.81, J44.81",
"S00.95", "B08.2", "S05.1", "M20.1", "G40.2, S93.40, S93.40",
"M25.51", "J44.19, J44.11, J44.19, J44.11", "G40.9, G40.2, G40.2",
"E87.1, E87.1, J18.0, J18.0", "I10.91", "R22.0", "S06.5, S06.5, S06.5, R06.88, S12.22"
)), .Names = c("Day", "Time", "Number"), row.names = c(1336L,
4687L, 1536L, 2737L, 8272L, 1507L, 4994L, 400L, 550L, 1305L,
2325L, 4292L, 2748L, 2008L, 9974L, 2113L, 6144L, 5433L, 4577L,
2697L, 8468L, 1883L, 4578L, 1783L, 1657L), class = "data.frame")
This is a pretty interesting problem that takes a little digging into. The first thing to do is get all the unique capital letters in each set in Number per row. stringr::str_extract_all gets you a list-column of string vectors that match this regex, and after taking unique values from each list entry, you have this:
library(dplyr)
library(tidyr)

as_tibble(df1) %>%
  mutate(Day = lubridate::ymd(Day),
         letters = purrr::map(stringr::str_extract_all(Number, "[A-Z]"), unique)) %>%
  select(-Number) %>%
  head()
#> # A tibble: 6 x 3
#> Day Time letters
#> <date> <chr> <list>
#> 1 2017-01-07 02:02 <chr [1]>
#> 2 2017-01-23 14:51 <chr [1]>
#> 3 2017-01-08 02:12 <chr [1]>
#> 4 2017-01-13 17:49 <chr [1]>
#> 5 2017-02-10 00:00 <chr [1]>
#> 6 2017-01-07 21:30 <chr [2]>
Unnest it so you have one row per date and time per letter, then count the number of observations of each letter per day (the order of operations matters here). Then reshape it into a wide format so each group gets a column.
as_tibble(df1) %>%
  mutate(Day = lubridate::ymd(Day),
         letters = purrr::map(stringr::str_extract_all(Number, "[A-Z]"), unique)) %>%
  select(-Number) %>%
  unnest(letters) %>%
  count(Day, letters) %>%
  arrange(letters) %>%
  pivot_wider(names_from = letters, names_prefix = "group",
              values_from = n, values_fill = list(n = 0)) %>%
  head()
#> # A tibble: 6 x 12
#> Day groupA groupB groupE groupG groupH groupI groupJ groupK groupM
#> <date> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1 2017-01-07 1 0 0 0 1 0 0 0 0
#> 2 2017-01-06 0 1 0 0 0 0 1 0 0
#> 3 2017-02-18 0 1 0 0 0 0 0 0 0
#> 4 2017-01-09 0 0 1 0 0 0 1 0 0
#> 5 2017-01-27 0 0 0 1 0 0 0 0 0
#> 6 2017-02-10 0 0 0 1 0 1 0 0 0
#> # … with 2 more variables: groupR <int>, groupS <int>
In the first few rows of this sample data there aren't any 2s, but there are some later on in the data frame. (I don't yet understand how pivot_wider orders things, but you can arrange by Day after this if you want.)
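If a tidyverse-free version is ever needed, the same per-row letter extraction and counting can be sketched in base R (using the df1 assigned from the question's structure() block above):
# unique leading capital letters found in each row's Number string
letts <- lapply(regmatches(df1$Number, gregexpr("[A-Z]", df1$Number)), unique)

# long form: one row per (Day, letter) pair present in that row
long <- data.frame(Day = rep(df1$Day, lengths(letts)),
                   letter = unlist(letts))

# wide count table: one column per letter group, one row per day
counts <- as.data.frame.matrix(xtabs(~ Day + letter, data = long))
names(counts) <- paste0("group", names(counts))
head(counts)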
