Merge two files by date and postcode, calculating a day-weighted average of air pollution data in R - r

I have two data frames: one called Pollution, the second called Med. Examples of both data frames follow:
Pollution
year month postcode PM10 NO2
2001 1 12345 40 20
2001 2 12345 30 25
2001 3 12345 35 25
2001 4 12345 30 20
2001 1 62346 40 20
2001 2 62346 30 25
2001 3 62346 35 25
2001 4 62346 30 20
2002 1 12345 44 24
2002 2 12345 36 26
2002 3 12345 30 20
2002 4 12345 32 22
2002 1 62346 48 28
2002 2 62346 20 35
2002 3 62346 89 101
2002 4 62346 37 27
Med
ID postcode treatmentDate_start treatmentDate_end
1 12345 2001-01-15 2001-03-16
2 62346 2001-01-15 2001-02-16
3 12345 2002-02-21 2002-03-16
4 12345 2002-02-15 2002-04-16
4 62346 2002-03-16 2002-04-30
The idea is to link the Pollution data with the Med data by date and postcode.
To do that, I need to calculate an average pollution level based on the number of days of exposure at a particular level (PM10, NO2).
First, I create a column called num_of_day holding, for each month, the number of days that fall between the treatment start and end dates in the medical data frame. Simply subtracting the start date from the end date has turned out not to be precise enough.
For example (I will take ID 1 with postcode 12345 to explain how I calculated the PM10 and NO2 averages while taking the days into account):
Med
ID postcode treatmentDate_start treatmentDate_end
1 12345 2001-01-15 2001-03-16
Pollution
year month postcode PM10 NO2
2001 1 12345 40 20
2001 2 12345 30 25
2001 3 12345 35 25
The PM10 and NO2 values for the period between 2001-01-15 and 2001-03-16 are as follows:
For the start month, containing treatmentDate_start (2001-01-15): PM10 = 40 and NO2 = 20.
For the month in between (2001-02): PM10 = 30 and NO2 = 25.
For the end month, containing treatmentDate_end (2001-03-16): PM10 = 35 and NO2 = 25.
I then have to calculate the exposure fraction for each of those periods:
For treatmentDate_start (2001-01-15) [January has a total of 31 days]:
15/31 = 0.48 of the month exposed
For the month in between (2001-02) [February 2001 has a total of 28 days],
the PM10 and NO2 values stay the same because the pollution measurements in the file are monthly, so:
28/28 = 1 of the month exposed
For treatmentDate_end (2001-03-16) [March has a total of 31 days]:
16/31 = 0.51 of the month exposed
Then I can calculate the pollution average weighted by those exposure fractions:
treatmentDate_start (2001-01-15): 0.48 * 40 = 19.2 (PM10) and 0.48 * 20 = 9.6 (NO2)
the month in between (2001-02): 1 * 30 = 30 (PM10) and 1 * 25 = 25 (NO2)
treatmentDate_end (2001-03-16): 0.51 * 35 = 17.85 (PM10) and 0.51 * 25 = 12.75 (NO2)
Adding the PM10 terms together gives 19.2 + 30 + 17.85 = 67.05.
I then divide 67.05 by 3 months (the 3 months over which the person was exposed to air pollution during the first treatment), which equals 22.35.
The output should be like below:
ID postcode treatmentDate_start treatmentDate_end PM10 NO2
1 12345 2001-01-15 2001-03-16 22.35 15.78
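As a quick sanity check, the hand calculation above fits in a few lines of R (a sketch; it uses the exact fractions 15/31 and 16/31 instead of the rounded 0.48 and 0.51, so the result differs slightly from 22.35):

```r
# Day-weighted average for ID 1 (postcode 12345), months Jan-Mar 2001
pm10 <- c(40, 30, 35)            # monthly PM10 values
no2  <- c(20, 25, 25)            # monthly NO2 values
w    <- c(15/31, 28/28, 16/31)   # exposed fraction of each month
pm10_avg <- sum(w * pm10) / 3    # divide by the 3 months of exposure
no2_avg  <- sum(w * no2)  / 3
round(c(pm10_avg, no2_avg), 2)
#> [1] 22.47 15.86
```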
zoowalk created the code below based on my previous requirement, before I updated the question with the precise-day calculation. It worked perfectly.
I saw this post: stackoverflow.com/questions/15569333/…. I think it could shorten the precise-day calculation I explained above, since it accounts for the fact that not all months and years have the same number of days, e.g., leap years. I still can't figure out how to combine it in code with the other lookups by postcode, year, and month.
I would appreciate extra help with this; I find it too complex for me.
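Not a definitive implementation, but one way to sketch the precise-days idea end to end: expand each treatment interval into one row per calendar day, join the monthly pollution values, and let every day contribute value / days-in-month (leap years are then handled automatically by lubridate::days_in_month()). Note that this counts the 17 exposed days of January (Jan 15-31), whereas the hand calculation above used 15/31, so the numbers differ; adapt the day counting to your own definition. Only ID 1 is included here for brevity.

```r
library(dplyr)
library(tidyr)
library(lubridate)

med <- data.frame(ID = 1L, postcode = 12345L,
                  treatmentDate_start = as.Date("2001-01-15"),
                  treatmentDate_end   = as.Date("2001-03-16"))
pollution <- data.frame(year = 2001L, month = 1:3, postcode = 12345L,
                        PM10 = c(40, 30, 35), NO2 = c(20, 25, 25))

res <- med %>%
  # one row per calendar day of each treatment interval
  mutate(date = Map(seq, treatmentDate_start, treatmentDate_end,
                    MoreArgs = list(by = "day"))) %>%
  unnest(date) %>%
  mutate(year = year(date), month = month(date)) %>%
  left_join(pollution, by = c("postcode", "year", "month")) %>%
  group_by(ID, postcode, treatmentDate_start, treatmentDate_end) %>%
  # each day contributes value / days-in-its-month; then average over the months
  summarise(PM10 = sum(PM10 / days_in_month(date)) / n_distinct(floor_date(date, "month")),
            NO2  = sum(NO2  / days_in_month(date)) / n_distinct(floor_date(date, "month")),
            .groups = "drop")
res
```

For ID 1 this gives PM10 ≈ 23.33 and NO2 ≈ 16.29 rather than 22.35/15.78, purely because of the January day count.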

Does this help? I am not sure I fully understand what kind of average you are looking for. Why is 70 / 32 days = 40.93?
library(tidyverse)
pollution <- data.frame(
year = c(2001L,2001L,2001L,2001L,2001L,
2001L,2001L,2001L,2002L,2002L,2002L,2002L,2002L,2002L,
2002L,2002L),
month = c(1L,2L,3L,4L,1L,2L,3L,4L,1L,2L,
3L,4L,1L,2L,3L,4L),
postcode = c(12345L,12345L,12345L,12345L,62346L,
62346L,62346L,62346L,12345L,12345L,12345L,12345L,
62346L,62346L,62346L,62346L),
PM10 = c(40L,30L,35L,30L,40L,30L,35L,30L,
44L,36L,30L,32L,48L,20L,89L,37L),
NO2 = c(20L,25L,25L,20L,20L,25L,25L,20L,
24L,26L,20L,22L,28L,35L,101L,27L)
) %>%
mutate(date_floor=paste(year,month, 01, sep="-") %>%
lubridate::ymd())
pollution
#> year month postcode PM10 NO2 date_floor
#> 1 2001 1 12345 40 20 2001-01-01
#> 2 2001 2 12345 30 25 2001-02-01
#> 3 2001 3 12345 35 25 2001-03-01
#> 4 2001 4 12345 30 20 2001-04-01
#> 5 2001 1 62346 40 20 2001-01-01
#> 6 2001 2 62346 30 25 2001-02-01
#> 7 2001 3 62346 35 25 2001-03-01
#> 8 2001 4 62346 30 20 2001-04-01
#> 9 2002 1 12345 44 24 2002-01-01
#> 10 2002 2 12345 36 26 2002-02-01
#> 11 2002 3 12345 30 20 2002-03-01
#> 12 2002 4 12345 32 22 2002-04-01
#> 13 2002 1 62346 48 28 2002-01-01
#> 14 2002 2 62346 20 35 2002-02-01
#> 15 2002 3 62346 89 101 2002-03-01
#> 16 2002 4 62346 37 27 2002-04-01
med <- data.frame(
stringsAsFactors = FALSE,
ID = c(1L, 2L, 3L, 4L, 4L),
postcode = c(12345L, 62346L, 12345L, 12345L, 62346L),
treatmentDate_start = c("15/01/2001",
"15/01/2001","21/02/2002","15/03/2002","16/04/2002"),
treatmentDate_end = c("16/02/2001",
"16/02/2001","16/03/2002","16/04/2002","30/04/2002")
)
med <- med %>%
mutate(across(.cols=contains("Date"), lubridate::dmy)) %>% #convert to class date
pivot_longer(cols=contains("treatment"),
names_to = "date_type",
values_to = "date") %>%
mutate(date_floor=lubridate::floor_date(date,
unit="month"))
df_join <- med %>%
left_join(., pollution) %>%
select(-date_floor)
#> Joining, by = c("postcode", "date_floor")
df_join
#> # A tibble: 10 x 8
#> ID postcode date_type date year month PM10 NO2
#> <int> <int> <chr> <date> <int> <int> <int> <int>
#> 1 1 12345 treatmentDate_start 2001-01-15 2001 1 40 20
#> 2 1 12345 treatmentDate_end 2001-02-16 2001 2 30 25
#> 3 2 62346 treatmentDate_start 2001-01-15 2001 1 40 20
#> 4 2 62346 treatmentDate_end 2001-02-16 2001 2 30 25
#> 5 3 12345 treatmentDate_start 2002-02-21 2002 2 36 26
#> 6 3 12345 treatmentDate_end 2002-03-16 2002 3 30 20
#> 7 4 12345 treatmentDate_start 2002-03-15 2002 3 30 20
#> 8 4 12345 treatmentDate_end 2002-04-16 2002 4 32 22
#> 9 4 62346 treatmentDate_start 2002-04-16 2002 4 37 27
#> 10 4 62346 treatmentDate_end 2002-04-30 2002 4 37 27
df_join <- df_join %>%
  pivot_wider(id_cols = c(ID, postcode, year, month, PM10, NO2),
              names_from = date_type,
              values_from = date) %>%
  mutate(treatmentDate_start = case_when(
    is.na(treatmentDate_start) ~ lubridate::floor_date(treatmentDate_end, unit = "month"),
    TRUE ~ as.Date(treatmentDate_start))) %>%
  mutate(treatmentDate_end = case_when(
    is.na(treatmentDate_end) ~ lubridate::ceiling_date(treatmentDate_start, unit = "month"),
    TRUE ~ as.Date(treatmentDate_end))) %>%
  mutate(duration = treatmentDate_end - treatmentDate_start)
#this is basically all the info you need.
glimpse(df_join)
#> Rows: 9
#> Columns: 9
#> $ ID <int> 1, 1, 2, 2, 3, 3, 4, 4, 4
#> $ postcode <int> 12345, 12345, 62346, 62346, 12345, 12345, 12345...
#> $ year <int> 2001, 2001, 2001, 2001, 2002, 2002, 2002, 2002,...
#> $ month <int> 1, 2, 1, 2, 2, 3, 3, 4, 4
#> $ PM10 <int> 40, 30, 40, 30, 36, 30, 30, 32, 37
#> $ NO2 <int> 20, 25, 20, 25, 26, 20, 20, 22, 27
#> $ treatmentDate_start <date> 2001-01-15, 2001-02-01, 2001-01-15, 2001-02-01...
#> $ treatmentDate_end <date> 2001-02-01, 2001-02-16, 2001-02-01, 2001-02-16...
#> $ duration <drtn> 17 days, 15 days, 17 days, 15 days, 8 days, 15...
df_join %>%
  group_by(ID, postcode) %>%
  summarise(across(.cols = c(PM10, NO2, duration), .fns = sum)) %>%
  mutate(across(.cols = c(PM10, NO2), .fns = function(x) x / as.numeric(duration)))
#> `summarise()` regrouping output by 'ID' (override with `.groups` argument)
#> # A tibble: 5 x 5
#> # Groups: ID [4]
#> ID postcode PM10 NO2 duration
#> <int> <int> <dbl> <dbl> <drtn>
#> 1 1 12345 2.19 1.41 32 days
#> 2 2 62346 2.19 1.41 32 days
#> 3 3 12345 2.87 2 23 days
#> 4 4 12345 1.94 1.31 32 days
#> 5 4 62346 2.64 1.93 14 days
Created on 2020-12-01 by the reprex package (v0.3.0)

I updated #zoowalk's code to meet my updated requirement, since no one has provided further help so far. Still, I found it did not generate the correct answer. In the end I solved my problem in Excel; I found R difficult for this case.
library(tidyverse)
library(lubridate)
pollution <- data.frame(
year = c(2001L,2001L,2001L,2001L,2001L,
2001L,2001L,2001L,2002L,2002L,2002L,2002L,2002L,2002L,
2002L,2002L),
month = c(1L,2L,3L,4L,1L,2L,3L,4L,1L,2L,
3L,4L,1L,2L,3L,4L),
postcode = c(12345L,12345L,12345L,12345L,62346L,
62346L,62346L,62346L,12345L,12345L,12345L,12345L,
62346L,62346L,62346L,62346L),
PM10 = c(40L,30L,35L,30L,40L,30L,35L,30L,
44L,36L,30L,32L,48L,20L,89L,37L),
NO2 = c(20L,25L,25L,20L,20L,25L,25L,20L,
24L,26L,20L,22L,28L,35L,101L,27L)
) %>%
mutate(date_floor=paste(year,month, 01, sep="-") %>%
lubridate::ymd())
med <- data.frame(
stringsAsFactors = FALSE,
ID = c(1L, 2L, 3L, 4L, 4L),
postcode = c(12345L, 62346L, 12345L, 12345L, 62346L),
treatmentDate_start = c("15/01/2001",
"15/01/2001","21/02/2002","15/02/2002","16/03/2002"),
treatmentDate_end = c("16/03/2001",
"16/02/2001","16/03/2002","16/04/2002","30/04/2002")
)
med <- med %>%
mutate(across(.cols=contains("Date"), lubridate::dmy)) %>% #convert to class date
pivot_longer(cols=contains("treatment"),
names_to = "date_type",
values_to = "date") %>%
mutate(date_floor=lubridate::floor_date(date,
unit="month"))
df_join <- med %>%
left_join(., pollution) %>%
select(-date_floor)
#> Joining, by = c("postcode", "date_floor")
df_join <- df_join %>%
  pivot_wider(id_cols = c(ID, postcode, year, month, PM10, NO2),
              names_from = date_type,
              values_from = date) %>%
  mutate(treatmentDate_start = case_when(
    is.na(treatmentDate_start) ~ lubridate::floor_date(treatmentDate_end, unit = "month"),
    TRUE ~ as.Date(treatmentDate_start))) %>%
  mutate(treatmentDate_end = case_when(
    is.na(treatmentDate_end) ~ lubridate::ceiling_date(treatmentDate_start, unit = "month"),
    TRUE ~ as.Date(treatmentDate_end))) %>%
  mutate(duration = lubridate::time_length(difftime(treatmentDate_end, treatmentDate_start), "years"))
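A note on the time_length() call in the last line (an illustration, not part of the original code): with unit "years", lubridate divides by an average year length of 365.25 days, so the result is a fractional year rather than a day count.

```r
library(lubridate)

d <- difftime(as.Date("2001-03-16"), as.Date("2001-01-15"))  # 60 days
time_length(d, "days")
#> [1] 60
time_length(d, "years")   # 60 / 365.25, about 0.164
```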

Related

how to find the growth rate of applicants per year

I have this data set with 20 variables, and I want to find the growth rate of applicants per year. The data provided are from 2020-2022. How would I go about that? I tried subsetting the data, but I'm stuck on how to approach it. Essentially, I want to assign each applicant to its corresponding year and then calculate the growth rate.
Observations ID# Date
1 1226 2022-10-16
2 1225 2021-10-15
3 1224 2020-08-14
4 1223 2021-12-02
5 1222 2022-02-25
One option is to use lubridate::year() to extract the year from your year-month-day variable and then count applicants per year with dplyr::summarize().
library(tidyverse)
library(lubridate)
set.seed(123)
id <- 1:100
date <- as.Date(sample(as.numeric(as.Date('2017-01-01')):as.numeric(as.Date('2023-01-01')), 100,
                       replace = TRUE),
                origin = '1970-01-01')
df <- data.frame(id, date) %>%
mutate(year = year(date))
head(df)
#> id date year
#> 1 1 2018-06-10 2018
#> 2 2 2017-07-14 2017
#> 3 3 2022-01-16 2022
#> 4 4 2020-02-16 2020
#> 5 5 2020-06-06 2020
#> 6 6 2020-06-21 2020
df <- df %>%
group_by(year) %>%
summarize(n = n())
head(df)
#> # A tibble: 6 × 2
#> year n
#> <dbl> <int>
#> 1 2017 17
#> 2 2018 14
#> 3 2019 17
#> 4 2020 18
#> 5 2021 11
#> 6 2022 23
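The per-year counts above stop short of the growth rate the asker wanted; a sketch of the final step with lag(), using the counts from the output above:

```r
library(dplyr)

counts <- data.frame(year = 2017:2022, n = c(17, 14, 17, 18, 11, 23))
growth_tbl <- counts %>%
  arrange(year) %>%
  mutate(growth = (n - lag(n)) / lag(n))  # NA for the first year, then year-over-year change
growth_tbl
```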

Reformat Dataframe to start a new row at a certain column

I have a dataframe that looks like this:
ID x.2019 x.2020
1 10 20
2 20 30
3 30 40
4 40 50
5 50 60
and I would like to reformat it to look like this:
ID time x
1 2019 10
1 2020 20
2 2019 20
2 2020 30
3 2019 30
3 2020 40
4 2019 40
4 2020 50
5 2019 50
5 2020 60
Any idea how to achieve this?
This is a rather simple task, and you can probably find it solved in other answers. Still, you can achieve what you want with data.table as follows:
library(data.table)
df = data.table( ID = 1:5,
x.2019 = seq(10, 50, by = 10),
x.2020 = seq(20, 60, by = 10)
)
# change column names conveniently
setnames(df, c("x.2019", "x.2020"), c("2019", "2020"))
# transform the dataset from wide to long format
out = melt(df, id.vars = "ID", variable.name = "time", value.name = "x", variable.factor = FALSE)
# cast time to integer
out[ , time := as.integer(time)]
# reorder by ID
setorder(out, ID)
out
#> ID time x
#> 1: 1 2019 10
#> 2: 1 2020 20
#> 3: 2 2019 20
#> 4: 2 2020 30
#> 5: 3 2019 30
#> 6: 3 2020 40
#> 7: 4 2019 40
#> 8: 4 2020 50
#> 9: 5 2019 50
#> 10: 5 2020 60
Created on 2022-01-20 by the reprex package (v2.0.1)
You can use pivot_longer:
library(dplyr)
library(tidyr)
df = data.frame(ID=1:5,
x.2019=c(10, 20, 30, 40, 50),
x.2020=c(20, 30, 40, 50, 60))
df %>%
pivot_longer(cols = c(2, 3), names_to = 'time', values_to = 'x') %>%
mutate(time = as.integer(stringr::str_replace(time, 'x.', '')))
Result:
# A tibble: 10 x 3
ID time x
<int> <int> <dbl>
1 1 2019 10
2 1 2020 20
3 2 2019 20
4 2 2020 30
5 3 2019 30
6 3 2020 40
7 4 2019 40
8 4 2020 50
9 5 2019 50
10 5 2020 60
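A variant of the same pivot_longer() call that avoids the separate str_replace() step, using the names_prefix and names_transform arguments (available in tidyr >= 1.1; a sketch, not the answerer's code):

```r
library(dplyr)
library(tidyr)

df <- data.frame(ID = 1:5,
                 x.2019 = c(10, 20, 30, 40, 50),
                 x.2020 = c(20, 30, 40, 50, 60))
long_df <- df %>%
  pivot_longer(cols = starts_with("x."),
               names_to = "time",
               names_prefix = "x\\.",   # regex that strips the literal "x." prefix
               names_transform = list(time = as.integer),
               values_to = "x")
long_df
```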

Assigning Values in R by Date Range

I am trying to create a "week" variable in my dataset of daily observations that takes a new value (1, 2, 3, et cetera) whenever a new Monday arrives. My dataset has observations beginning on April 6th, 2020, and the dates are stored in "YYYY-MM-DD" as.Date() format. In this example, an observation between April 6th and April 12th would be a "1", an observation between April 13th and April 19th would be a "2", et cetera.
I am aware of the week() function in lubridate, but unfortunately it doesn't work for my purposes because a year is not an exact number of weeks, so the last "week" of the year would only be a few days long. In other words, I would like the days of December 28th, 2020 to January 3rd, 2021 to be categorized as the same week.
Does anyone have a good solution to this problem? I appreciate any insight folks might have.
This will also do it:
df <- data.frame(date = as.Date("2020-04-06")+ 0:365)
library(dplyr)
library(lubridate)
df %>%
  group_by(d = year(date), week = isoweek(date)) %>%
  mutate(week = cur_group_id()) %>%
  ungroup() %>%
  select(-d)
# A tibble: 366 x 2
date week
<date> <int>
1 2020-04-06 1
2 2020-04-07 1
3 2020-04-08 1
4 2020-04-09 1
5 2020-04-10 1
6 2020-04-11 1
7 2020-04-12 1
8 2020-04-13 2
9 2020-04-14 2
10 2020-04-15 2
# ... with 356 more rows
Subtract the dates with the minimum date, divide the difference by 7 and use floor to get 1 number for each 7 days.
x <- as.Date(c('2020-04-06','2020-04-07','2020-04-13','2020-12-28','2021-01-03'))
as.integer(floor((x - min(x))/7) + 1)
#[1] 1 1 2 39 39
Maybe lubridate::isoweek() and lubridate::isoyear() are what you want?
Some data:
df1 <- data.frame(date = seq.Date(as.Date("2020-04-06"),
as.Date("2021-01-04"),
by = "1 day"))
Example code:
library(dplyr)
library(lubridate)
df1 <- df1 %>%
mutate(week = isoweek(date),
year = isoyear(date)) %>%
group_by(year) %>%
mutate(week2 = 1 + (week - min(week))) %>%
ungroup()
head(df1, 8)
# A tibble: 8 x 4
date week year week2
<date> <dbl> <dbl> <dbl>
1 2020-04-06 15 2020 1
2 2020-04-07 15 2020 1
3 2020-04-08 15 2020 1
4 2020-04-09 15 2020 1
5 2020-04-10 15 2020 1
6 2020-04-11 15 2020 1
7 2020-04-12 15 2020 1
8 2020-04-13 16 2020 2
tail(df1, 8)
# A tibble: 8 x 4
date week year week2
<date> <dbl> <dbl> <dbl>
1 2020-12-28 53 2020 39
2 2020-12-29 53 2020 39
3 2020-12-30 53 2020 39
4 2020-12-31 53 2020 39
5 2021-01-01 53 2020 39
6 2021-01-02 53 2020 39
7 2021-01-03 53 2020 39
8 2021-01-04 1 2021 1

Is there a R function which can undo cumsum() and recreate the original non-cumulative column in a dataset?

For simplicity, I have created a small dummy dataset.
Please note: dates are in yyyy-mm-dd format
Here is dataset DF:
DF <- tibble(country = rep(c("France", "England", "Spain"), each = 4),
date = rep(c("2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01"), times = 3),
visits = c(10, 16, 14, 12, 11, 9, 12, 14, 13, 13, 15, 10))
# A tibble: 12 x 3
country date visits
<chr> <chr> <dbl>
1 France 2020-01-01 10
2 France 2020-01-02 16
3 France 2020-01-03 14
4 France 2020-01-04 12
5 England 2020-01-01 11
6 England 2020-01-02 9
7 England 2020-01-03 12
8 England 2020-01-04 14
9 Spain 2020-01-01 13
10 Spain 2020-01-02 13
11 Spain 2020-01-03 15
12 Spain 2020-01-04 10
Here is dataset DFc:
DFc <- DF %>% group_by(country) %>% mutate(cumulative_visits = cumsum(visits))
# A tibble: 12 x 3
# Groups: country [3]
country date cumulative_visits
<chr> <chr> <dbl>
1 France 2020-01-01 10
2 France 2020-01-02 26
3 France 2020-01-03 40
4 France 2020-01-04 52
5 England 2020-01-01 11
6 England 2020-01-02 20
7 England 2020-01-03 32
8 England 2020-01-04 46
9 Spain 2020-01-01 13
10 Spain 2020-01-02 26
11 Spain 2020-01-03 41
12 Spain 2020-01-04 51
Let's say I only have dataset DFc. Which R functions can I use to recreate the visits column (as shown in dataset DF) and essentially "undo/reverse" cumsum()?
I have been told that I can incorporate the lag() function but I am not sure how to do this.
Also, how would the code change if the dates were spaced weeks apart, rather than one day?
Any help would be much appreciated :)
Starting from your toy example:
library(dplyr)
DF <- tibble(country = rep(c("France", "England", "Spain"), each = 4),
date = rep(c("2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01"), times = 3),
visits = c(10, 16, 14, 12, 11, 9, 12, 14, 13, 13, 15, 10))
DF <- DF %>%
group_by(country) %>%
mutate(cumulative_visits = cumsum(visits)) %>%
ungroup()
I propose two methods:
diff()
lag() [as you specifically required]
DF %>%
group_by(country) %>%
mutate(decum_visits1 = c(cumulative_visits[1], diff(cumulative_visits)),
decum_visits2 = cumulative_visits - lag(cumulative_visits, default = 0)) %>%
ungroup()
#> # A tibble: 12 x 6
#> country date visits cumulative_visits decum_visits1 decum_visits2
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 France 2020-01-01 10 10 10 10
#> 2 France 2020-02-01 16 26 16 16
#> 3 France 2020-03-01 14 40 14 14
#> 4 France 2020-04-01 12 52 12 12
#> 5 England 2020-01-01 11 11 11 11
#> 6 England 2020-02-01 9 20 9 9
#> 7 England 2020-03-01 12 32 12 12
#> 8 England 2020-04-01 14 46 14 14
#> 9 Spain 2020-01-01 13 13 13 13
#> 10 Spain 2020-02-01 13 26 13 13
#> 11 Spain 2020-03-01 15 41 15 15
#> 12 Spain 2020-04-01 10 51 10 10
If one date is missing, let's say, like in the following example:
DF1 <- DF %>%
# set to date!
mutate(date = as.Date(date)) %>%
# remove one date just for the sake of the example
filter(date != as.Date("2020-02-01"))
Then I advise you to complete the missing dates, filling visits with zero and cumulative_visits with the last seen value. Then you can undo cumsum() in the same way as before.
DF1 %>%
group_by(country) %>%
# complete and fill with zero!
tidyr::complete(date = seq.Date(min(date), max(date), by = "month"), fill = list(visits = 0)) %>%
# fill cumulative with the last available value
tidyr::fill(cumulative_visits) %>%
# reset in the same way
mutate(decum_visits1 = c(cumulative_visits[1], diff(cumulative_visits)),
decum_visits2 = cumulative_visits - lag(cumulative_visits, default = 0)) %>%
ungroup()
Here's a generic solution. It's sloppy because, as you can see, it doesn't return foo[1], and the output of the last line comes out reversed, but both can be fixed. I'll leave that "as an exercise for the reader".
foo <- sample(1:20,10)
[1] 16 11 13 5 6 12 19 10 3 4
bar <- cumsum(foo)
[1] 16 27 40 45 51 63 82 92 95 99
rev(bar[-1])-rev(bar[-length(bar)])
[1] 4 3 10 19 12 6 5 13 11
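For completeness, the usual non-reversed form of that exercise: prepend the first cumulative value and use diff().

```r
foo <- c(16, 11, 13, 5, 6, 12, 19, 10, 3, 4)
bar <- cumsum(foo)
recovered <- c(bar[1], diff(bar))   # first element, then successive differences
identical(recovered, foo)
#> [1] TRUE
```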

Calculate average number of individuals present on each date in R

I have a dataset that contains the residence period (start.date to end.date) of marked individuals (ID) at different sites. My goal is to generate a column that tells me the average number of other individuals per day that were also present at the same site (across the total residence period of each individual).
To do this, I need to determine the total number of individuals that were present per site on each date, summed across the total residence period of each individual. Ultimately, I will divide this sum by the total residence days of each individual to calculate the average. Can anyone help me accomplish this?
I calculated the total number of residence days (total.days) using lubridate and dplyr
mutate(total.days = end.date - start.date + 1)
site ID start.date end.date total.days
1 1 16 5/24/17 6/5/17 13
2 1 46 4/30/17 5/20/17 21
3 1 26 4/30/17 5/23/17 24
4 1 89 5/5/17 5/13/17 9
5 1 12 5/11/17 5/14/17 4
6 2 14 5/4/17 5/10/17 7
7 2 18 5/9/17 5/29/17 21
8 2 19 5/24/17 6/10/17 18
9 2 39 5/5/17 5/18/17 14
First of all, it is always advisable to give a sample of your data in a friendlier format using dput(yourData) so that others can easily regenerate it. Here is the dput() output you could have shared:
> dput(dat)
structure(list(site = c(1, 1, 1, 1, 1, 2, 2, 2, 2), ID = c(16,
46, 26, 89, 12, 14, 18, 19, 39), start.date = structure(c(17310,
17286, 17286, 17291, 17297, 17290, 17295, 17310, 17291), class = "Date"),
end.date = structure(c(17322, 17306, 17309, 17299, 17300,
17296, 17315, 17327, 17304), class = "Date")), class = "data.frame", row.names =
c(NA,
-9L))
To do this easily we first need to unpack the start.date and end.date to individual dates:
newDat <- data.frame()
for (i in 1:nrow(dat)){
expand <- data.frame(site = dat$site[i],
ID = dat$ID[i],
Dates = seq.Date(dat$start.date[i], dat$end.date[i], 1))
newDat <- rbind(newDat, expand)
}
newDat
site ID Dates
1 1 16 2017-05-24
2 1 16 2017-05-25
3 1 16 2017-05-26
4 1 16 2017-05-27
5 1 16 2017-05-28
6 1 16 2017-05-29
7 1 16 2017-05-30
. . .
. . .
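If you prefer to avoid growing a data frame inside a loop, the same expansion can be sketched with a list-column and tidyr::unnest() (same rows as the rbind() version above):

```r
library(dplyr)
library(tidyr)

dat <- structure(list(site = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
                      ID = c(16, 46, 26, 89, 12, 14, 18, 19, 39),
                      start.date = structure(c(17310, 17286, 17286, 17291, 17297,
                                               17290, 17295, 17310, 17291), class = "Date"),
                      end.date = structure(c(17322, 17306, 17309, 17299, 17300,
                                             17296, 17315, 17327, 17304), class = "Date")),
                 class = "data.frame", row.names = c(NA, -9L))

# one row per site/ID/day, without rbind() inside a loop
newDat <- dat %>%
  rowwise() %>%
  mutate(Dates = list(seq.Date(start.date, end.date, by = 1))) %>%
  ungroup() %>%
  select(site, ID, Dates) %>%
  unnest(Dates)
nrow(newDat)
#> [1] 131
```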
Then we calculate the number of other individuals present at each site on each day:
individualCount = newDat %>%
group_by(site, Dates) %>%
summarise(individuals = n_distinct(ID) - 1)
individualCount
# A tibble: 75 x 3
# Groups: site [?]
site Dates individuals
<dbl> <date> <int>
1 1 2017-04-30 1
2 1 2017-05-01 1
3 1 2017-05-02 1
4 1 2017-05-03 1
5 1 2017-05-04 1
6 1 2017-05-05 2
7 1 2017-05-06 2
8 1 2017-05-07 2
9 1 2017-05-08 2
10 1 2017-05-09 2
# ... with 65 more rows
Then, we augment our data with the new information using left_join() and calculate the required average:
newDat <- left_join(newDat, individualCount, by = c("site", "Dates")) %>%
group_by(site, ID) %>%
summarise(duration = max(Dates) - min(Dates)+1,
av.individuals = mean(individuals))
newDat
# A tibble: 9 x 4
# Groups: site [?]
site ID duration av.individuals
<dbl> <dbl> <time> <dbl>
1 1 12 4 2.75
2 1 16 13 0
3 1 26 24 1.42
4 1 46 21 1.62
5 1 89 9 2.33
6 2 14 7 1.14
7 2 18 21 0.857
8 2 19 18 0.333
9 2 39 14 1.14
The final step is to add the required columns to the original dataset (dat), again with left_join() (note the assignment, so that dat actually keeps the new columns):
dat <- dat %>% left_join(newDat, by = c("site", "ID"))
dat
site ID start.date end.date duration av.individuals
1 1 16 2017-05-24 2017-06-05 13 days 0.000000
2 1 46 2017-04-30 2017-05-20 21 days 1.619048
3 1 26 2017-04-30 2017-05-23 24 days 1.416667
4 1 89 2017-05-05 2017-05-13 9 days 2.333333
5 1 12 2017-05-11 2017-05-14 4 days 2.750000
6 2 14 2017-05-04 2017-05-10 7 days 1.142857
7 2 18 2017-05-09 2017-05-29 21 days 0.857143
8 2 19 2017-05-24 2017-06-10 18 days 0.333333
9 2 39 2017-05-05 2017-05-18 14 days 1.142857
