Is There an R Function to Map a Date Index?

I have a data set with a lot of countries and currency data, like this:
iso date SPOT
<chr> <date> <dbl>
1 AUD 2000-01-03 0.658
2 AUD 2000-01-04 0.655
3 AUD 2000-01-05 0.658
4 AUD 2000-01-06 0.653
5 AUD 2000-01-07 0.655
6 AUD 2000-01-10 0.656
7 AUD 2000-01-11 0.658
8 AUD 2000-01-12 0.659
9 AUD 2000-01-13 0.668
10 AUD 2000-01-14 0.666
and I want to create an exact date index that maps each row's data to the same day one year earlier, i.e. a new column LAG1 = date - years(1), like this:
iso date SPOT LAG1
<chr> <date> <dbl> <date>
1 AUD 2000-01-03 0.658 1999-01-03
2 AUD 2000-01-04 0.655 1999-01-04
3 AUD 2000-01-05 0.658 1999-01-05
4 AUD 2000-01-06 0.653 1999-01-06
5 AUD 2000-01-07 0.655 1999-01-07
6 AUD 2000-01-10 0.656 1999-01-10
7 AUD 2000-01-11 0.658 1999-01-11
8 AUD 2000-01-12 0.659 1999-01-12
9 AUD 2000-01-13 0.668 1999-01-13
10 AUD 2000-01-14 0.666 1999-01-14
This was my solution:
df %>%
  mutate(LAG1 = date - years(1)) %>%
  select(iso, LAG1 = date, LAG1_SPOT = SPOT) %>%
  right_join(., df, by = c("iso", "LAG1")) %>%
  as_tibble()
but I don't like it because it's a bunch of lines for something I think should be simpler, and I want to make it into a function.
Is there a better way to do this?

I think your intent of merging/joining is the right way to go. In fact, it's "right" because it will naturally deal with data anomalies better. I also think there are a couple of small logic errors in your code.
Since your sample data doesn't reach back a full year, here is some fake data. I'm making SPOT a plain sequence so the pairing is easy to see; otherwise its values don't matter much. I'm also going to introduce two anomalies into the data to demonstrate how they show up in the end.
library(dplyr)
library(lubridate)
dates <- seq.Date(as.Date("2020-03-15"), by = "day", length.out = 5)
df <- tibble(
  iso = rep(c("AUD", "USD"), each = 10),
  date = rep(c(dates - years(1), dates), times = 2),
  SPOT = 1:20
)
# data missingness
df <- df[-3,]
# repeated date
df$date[12] <- df$date[13]
df
# # A tibble: 19 x 3
# iso date SPOT
# <chr> <date> <int>
# 1 AUD 2019-03-15 1
# 2 AUD 2019-03-16 2
# 3 AUD 2019-03-18 4
# 4 AUD 2019-03-19 5
# 5 AUD 2020-03-15 6
# 6 AUD 2020-03-16 7
# 7 AUD 2020-03-17 8
# 8 AUD 2020-03-18 9
# 9 AUD 2020-03-19 10
# 10 USD 2019-03-15 11
# 11 USD 2019-03-16 12
# 12 USD 2019-03-18 13
# 13 USD 2019-03-18 14
# 14 USD 2019-03-19 15
# 15 USD 2020-03-15 16
# 16 USD 2020-03-16 17
# 17 USD 2020-03-17 18
# 18 USD 2020-03-18 19
# 19 USD 2020-03-19 20
Using your code from above, we see this:
df %>%
  mutate(date = date - years(1)) %>%
  rename(LAG1_SPOT = SPOT) %>%
  right_join(., df, by = c("iso", "date"))
# # A tibble: 19 x 4
# iso date LAG1_SPOT SPOT
# <chr> <date> <int> <int>
# 1 AUD 2019-03-15 6 1
# 2 AUD 2019-03-16 7 2
# 3 AUD 2019-03-18 9 4
# 4 AUD 2019-03-19 10 5
# 5 AUD 2020-03-15 NA 6
# 6 AUD 2020-03-16 NA 7
# 7 AUD 2020-03-17 NA 8
# 8 AUD 2020-03-18 NA 9
# 9 AUD 2020-03-19 NA 10
# 10 USD 2019-03-15 16 11
# 11 USD 2019-03-16 17 12
# 12 USD 2019-03-18 19 13
# 13 USD 2019-03-18 19 14
# 14 USD 2019-03-19 20 15
# 15 USD 2020-03-15 NA 16
# 16 USD 2020-03-16 NA 17
# 17 USD 2020-03-17 NA 18
# 18 USD 2020-03-18 NA 19
# 19 USD 2020-03-19 NA 20
Since I believe your intent is to compare this year's data with last year's, the above shows that we have paired them, but the date of reference is last year's. I suggest you should be using + instead:
df %>%
  mutate(date = date + years(1)) %>%
  rename(LAG1_SPOT = SPOT) %>%
  right_join(., df, by = c("iso", "date"))
# # A tibble: 20 x 4
# iso date LAG1_SPOT SPOT
# <chr> <date> <int> <int>
# 1 AUD 2019-03-15 NA 1
# 2 AUD 2019-03-16 NA 2
# 3 AUD 2019-03-18 NA 4
# 4 AUD 2019-03-19 NA 5
# 5 AUD 2020-03-15 1 6
# 6 AUD 2020-03-16 2 7
# 7 AUD 2020-03-17 NA 8
# 8 AUD 2020-03-18 4 9
# 9 AUD 2020-03-19 5 10
# 10 USD 2019-03-15 NA 11
# 11 USD 2019-03-16 NA 12
# 12 USD 2019-03-18 NA 13
# 13 USD 2019-03-18 NA 14
# 14 USD 2019-03-19 NA 15
# 15 USD 2020-03-15 11 16
# 16 USD 2020-03-16 12 17
# 17 USD 2020-03-17 NA 18
# 18 USD 2020-03-18 13 19
# 19 USD 2020-03-18 14 19
# 20 USD 2020-03-19 15 20
This also shows how data anomalies will present. First, in AUD we see that 03-17 is missing data from last year, so we have nothing to compare the 8 spot against. This is just a fact of missing data: unavoidable, whereas a lag here would have given us data, likely from the wrong date. Second, because of our duplicated date (acquisition systems are imperfect!), we now have two rows for USD on 2020-03-18, which is certainly suspect (but outside the scope of your question); both of 2019's values have been compared with the single 2020 value.
Even if these anomalies never show up in your data, I still think a join is the correct method for dealing with this: if there is ever a time that lag() finds the wrong row (leap years?), you will never know that it failed. You'll get data and use it with no indication, as the sketch below illustrates.
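Here is a minimal sketch of that failure mode using the fake data above; the fixed offset of 5 rows standing in for "one year" is my own simplification for illustration:
df %>%
  group_by(iso) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(LAG1_SPOT = dplyr::lag(SPOT, 5)) %>%  # 5 rows back stands in for "1 year"
  ungroup() %>%
  filter(iso == "AUD")
# Because AUD is missing 2019-03-17, 2020-03-16 is paired with
# 2019-03-15's value (1) instead of 2019-03-16's (2): no NA, no warning.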
BTW: if you are just looking to reduce the four lines of code, this is perfectly equivalent:
transmute(df, iso, date = date + years(1), LAG1_SPOT = SPOT) %>%
  right_join(., df, by = c("iso", "date"))
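And since you mentioned wanting a function, here is a minimal sketch of one; the name lag_join and the n argument are my own, and it assumes your iso/date/SPOT column names:
lag_join <- function(data, n = 1) {
  lagged <- transmute(data, iso, date = date + years(n), LAG1_SPOT = SPOT)
  right_join(lagged, data, by = c("iso", "date"))
}
lag_join(df)  # same result as the pipeline above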

Related

Given a series of dates and a birth day, is there a way to obtain the age at every date entry along with a final age using the lubridate package?

I have a database of information pertaining to individuals observed over time. I would like to find a way to obtain the age of these individuals whenever a record was taken. Assuming BIRTH corresponds to an age of 0, I would like to obtain the age, in days or months, for the visits after. It would also be helpful to obtain a final age (in days or months) for each individual (not included in the code); for example, for ID (A) the final age would be 10 months. I would like to use lubridate, as its built-in date features make it easier to work with dates. Any help with this is much appreciated.
date<-c("2000-01-01","2000-01-14","2000-01-25","2000-02-12","2000-02-27","2000-06-05","2000-10-30",
"2001-02-04","2001-06-15","2001-12-26","2002-05-22","2002-06-04",
"2000-01-08","2000-07-11","2000-08-18","2000-11-27")
ID<-c("A","A","A","A","A","A","A",
"B","B","B","B","B",
"C","C","C","C")
status<-c("BIRTH","ETC","ETC","ETC","ETC","ETC","ETC",
"BIRTH","ETC","ETC","ETC","ETC",
"BIRTH","ETC","ETC","ETC")
df1<-data.frame(date,ID,status)
print(df1)
date ID status
1 2000-01-01 A BIRTH
2 2000-01-14 A ETC
3 2000-01-25 A ETC
4 2000-02-12 A ETC
5 2000-02-27 A ETC
6 2000-06-05 A ETC
7 2000-10-30 A ETC
8 2001-02-04 B BIRTH
9 2001-06-15 B ETC
10 2001-12-26 B ETC
11 2002-05-22 B ETC
12 2002-06-04 B ETC
13 2000-01-08 C BIRTH
14 2000-07-11 C ETC
15 2000-08-18 C ETC
16 2000-11-27 C ETC
date.new<-c("2000-01-01","2000-01-14","2000-01-25","2000-02-12","2000-02-27","2000-06-05","2000-10-30",
"2001-02-04","2001-06-15","2001-12-26","2002-05-22","2001-02-04",
"2000-01-08","2000-07-11","2000-08-18","2000-11-27")
ID.new<-c("A","A","A","A","A","A","A",
"B","B","B","B","B",
"C","C","C","C")
status.new<-c("BIRTH","ETC","ETC","ETC","ETC","ETC","ETC",
"BIRTH","ETC","ETC","ETC","ETC",
"BIRTH","ETC","ETC","ETC")
age<-c(0,1,1,2,2,6,10,
0,4,10,15,16,
0,6,7,10)
df2<-data.frame(date.new,ID.new,status.new,age)
print(df2)
date.new ID.new status.new age
1 2000-01-01 A BIRTH 0
2 2000-01-14 A ETC 1
3 2000-01-25 A ETC 1
4 2000-02-12 A ETC 2
5 2000-02-27 A ETC 2
6 2000-06-05 A ETC 6
7 2000-10-30 A ETC 10
8 2001-02-04 B BIRTH 0
9 2001-06-15 B ETC 4
10 2001-12-26 B ETC 10
11 2002-05-22 B ETC 15
12 2001-02-04 B ETC 16
13 2000-01-08 C BIRTH 0
14 2000-07-11 C ETC 6
15 2000-08-18 C ETC 7
16 2000-11-27 C ETC 10
For calculations related to age in years or months, I'd like to encourage you to try the clock package rather than lubridate. lubridate is a great package, but produces some unexpected results with these kinds of calculations if you aren't 100% sure of what you are doing. In clock, the function to do this is date_count_between(). Notice that one of the results is different between clock and lubridate here:
library(clock)
library(lubridate, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
df <- tibble(
  date = c("2000-01-01", "2000-01-14", "2000-01-25", "2000-02-12",
           "2000-02-27", "2000-06-05", "2000-10-30", "2001-02-04",
           "2001-06-15", "2001-12-26", "2002-05-22", "2002-06-04",
           "2000-01-08", "2000-07-11", "2000-08-18", "2000-11-27"),
  ID = c("A", "A", "A", "A", "A", "A", "A",
         "B", "B", "B", "B", "B",
         "C", "C", "C", "C"),
  status = c("BIRTH", "ETC", "ETC", "ETC", "ETC", "ETC", "ETC",
             "BIRTH", "ETC", "ETC", "ETC", "ETC",
             "BIRTH", "ETC", "ETC", "ETC")
)
df %>%
  mutate(date = date_parse(date)) %>%
  group_by(ID) %>%
  mutate(birth_date = date[status == "BIRTH"]) %>%
  ungroup() %>%
  mutate(
    age_clock = date_count_between(birth_date, date, "month"),
    age_lubridate = as.period(date - birth_date) %/% months(1)
  )
#> # A tibble: 16 × 6
#> date ID status birth_date age_clock age_lubridate
#> <date> <chr> <chr> <date> <int> <dbl>
#> 1 2000-01-01 A BIRTH 2000-01-01 0 0
#> 2 2000-01-14 A ETC 2000-01-01 0 0
#> 3 2000-01-25 A ETC 2000-01-01 0 0
#> 4 2000-02-12 A ETC 2000-01-01 1 1
#> 5 2000-02-27 A ETC 2000-01-01 1 1
#> 6 2000-06-05 A ETC 2000-01-01 5 5
#> 7 2000-10-30 A ETC 2000-01-01 9 9
#> 8 2001-02-04 B BIRTH 2001-02-04 0 0
#> 9 2001-06-15 B ETC 2001-02-04 4 4
#> 10 2001-12-26 B ETC 2001-02-04 10 10
#> 11 2002-05-22 B ETC 2001-02-04 15 15
#> 12 2002-06-04 B ETC 2001-02-04 16 15
#> 13 2000-01-08 C BIRTH 2000-01-08 0 0
#> 14 2000-07-11 C ETC 2000-01-08 6 6
#> 15 2000-08-18 C ETC 2000-01-08 7 7
#> 16 2000-11-27 C ETC 2000-01-08 10 10
clock says that 2001-02-04 to 2002-06-04 is 16 months, while the lubridate method here only says it is 15 months. This has to do with the fact that the lubridate calculation uses the length of an average month, which doesn't always accurately reflect how we think about months.
Consider this simple example. I think most people would agree that a child born on the February date below is considered "1 month and 1 day" old on the March date. But lubridate shows 0 months!
library(clock)
library(lubridate, warn.conflicts = FALSE)
# "1 month and 1 day apart"
feb <- as.Date("2020-02-28")
mar <- as.Date("2020-03-29")
# As expected when thinking about age in months
date_count_between(feb, mar, "month")
#> [1] 1
# Not expected
as.period(mar - feb) %/% months(1)
#> [1] 0
secs_in_day <- 86400
secs_in_month <- as.numeric(months(1))
secs_in_month / secs_in_day
#> [1] 30.4375
# Less than 30.4375 days, so not 1 month
mar - feb
#> Time difference of 30 days
The issue is that lubridate uses the length of an average month in the computation, which is 30.4375 days. But there are only 30 days between these two dates, so it isn't considered a full month.
clock, on the other hand, uses the day component of the starting date to determine if a "full month" has passed or not. In other words, because we have passed the 28th of March, clock decides that 1 month has passed, which is consistent with how we generally think about age.
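A quick check of that boundary behavior (my own example; assuming I am reading clock's semantics correctly, the count ticks over exactly when the start date's day-of-month is reached):
# One day short of the 28th: not yet a full month
date_count_between(as.Date("2020-02-28"), as.Date("2020-03-27"), "month")
#> [1] 0
# The 28th is reached: one full month
date_count_between(as.Date("2020-02-28"), as.Date("2020-03-28"), "month")
#> [1] 1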
Using dplyr and lubridate, we can do the following. We first turn the date column into a date. Then we group by ID, find the birth date and calculate the number of months since that date via some lubridate magic (see How do I use the lubridate package to calculate the number of months between two date vectors where one of the vectors has NA values?).
library(dplyr)
library(lubridate)
df1 %>%
  mutate(date = as_date(date)) %>%
  group_by(ID) %>%
  mutate(birth_date = date[status == "BIRTH"],
         age = as.period(date - birth_date) %/% months(1)) %>%
  ungroup()
Which gives:
date ID status birth_date age
<date> <fct> <fct> <date> <dbl>
1 2000-01-01 A BIRTH 2000-01-01 0
2 2000-01-14 A ETC 2000-01-01 0
3 2000-01-25 A ETC 2000-01-01 0
4 2000-02-12 A ETC 2000-01-01 1
5 2000-02-27 A ETC 2000-01-01 1
6 2000-06-05 A ETC 2000-01-01 5
7 2000-10-30 A ETC 2000-01-01 9
8 2001-02-04 B BIRTH 2001-02-04 0
9 2001-06-15 B ETC 2001-02-04 4
10 2001-12-26 B ETC 2001-02-04 10
11 2002-05-22 B ETC 2001-02-04 15
12 2002-06-04 B ETC 2001-02-04 15
13 2000-01-08 C BIRTH 2000-01-08 0
14 2000-07-11 C ETC 2000-01-08 6
15 2000-08-18 C ETC 2000-01-08 7
16 2000-11-27 C ETC 2000-01-08 10
Which is your expected output except for some rounding differences. See my comment on your question.

Obtaining trading days from rollapply in R

I have the following simulated dataset of a y column over the fixed trading days (say 250) of 2018.
data
# A tibble: 249 × 2
Date y
<dttm> <dbl>
1 2018-01-02 00:00:00 0.409
2 2018-01-03 00:00:00 -1.90
3 2018-01-04 00:00:00 0.131
4 2018-01-05 00:00:00 -0.619
5 2018-01-08 00:00:00 0.449
6 2018-01-09 00:00:00 0.448
7 2018-01-10 00:00:00 0.124
8 2018-01-11 00:00:00 -0.346
9 2018-01-12 00:00:00 0.775
10 2018-01-15 00:00:00 -0.948
# … with 239 more rows
with tail
> tail(data,n=10)
# A tibble: 10 × 2
Date y
<dttm> <dbl>
1 2018-12-13 00:00:00 -0.00736
2 2018-12-14 00:00:00 -1.30
3 2018-12-17 00:00:00 0.227
4 2018-12-18 00:00:00 -0.671
5 2018-12-19 00:00:00 -0.750
6 2018-12-20 00:00:00 -0.906
7 2018-12-21 00:00:00 -1.74
8 2018-12-27 00:00:00 0.331
9 2018-12-28 00:00:00 -0.768
10 2018-12-31 00:00:00 0.649
I want to calculate the rolling sd of column y with a window of 60 and then find the window's span in exact trading days, not actual calendar days (maybe it can be done from the index? I don't know).
data2 <- data %>%
  mutate(date = as.Date(Date))
data3 <- data2[, -1]; head(data3)
roll_win <- 60
data3$a <- c(rep(NA_real_, roll_win - 1), zoo::rollapply(data3$y, roll_win, sd))
dat <- subset(data3, !is.na(a))
dat_max <- dat[dat$a == max(dat$a, na.rm = TRUE), ]
dat_max$date_start <- dat_max$date - (roll_win - 1)
dat_max
Turns out that the period of high volatility is:
dat_max
# A tibble: 1 × 4
y date a date_start
<dbl> <date> <dbl> <date>
1 0.931 2018-04-24 1.18 2018-02-24
Now if I subtract the two dates I get:
> dat_max$date - dat_max$date_start
Time difference of 59 days
Which is true in calendar days, but these are NOT THE TRADING DAYS.
I have asked a similar question here, but it didn't solve the problem; that question was about how to obtain the days of high volatility.
Any help on how I can obtain these trading days? Thanks in advance.
EDIT
FOR FULL DATA
library(gsheet)
data= gsheet2tbl("https://docs.google.com/spreadsheets/d/1PdZDb3OgqSaO6znUWsAh7p_MVLHgNbQM/edit?usp=sharing&ouid=109626011108852110510&rtpof=true&sd=true")
data
Start date for each time window
If the question is how to calculate the start date for each window, then using the data in the Note at the end and a window of 3:
w <- 3
out <- mutate(data,
  sd = zoo::rollapplyr(y, w, sd, fill = NA),
  start = dplyr::lag(Date, w - 1)
)
out
giving:
Date y sd start
1 2018-12-13 -0.00736 NA <NA>
2 2018-12-14 -1.30000 NA <NA>
3 2018-12-17 0.22700 0.8223515 2018-12-13
4 2018-12-18 -0.67100 0.7674388 2018-12-14
5 2018-12-19 -0.75000 0.5427053 2018-12-17
6 2018-12-20 -0.90600 0.1195840 2018-12-18
7 2018-12-21 -1.74000 0.5322894 2018-12-19
8 2018-12-27 0.33100 1.0420146 2018-12-20
9 2018-12-28 -0.76800 1.0361488 2018-12-21
10 2018-12-31 0.64900 0.7435068 2018-12-27
Largest sd's with their start and end dates
and the largest 4 sd's and their start and end dates are:
head(dplyr::arrange(out, -sd), 4)
giving:
Date y sd start
8 2018-12-27 0.331 1.0420146 2018-12-20
9 2018-12-28 -0.768 1.0361488 2018-12-21
3 2018-12-17 0.227 0.8223515 2018-12-13
4 2018-12-18 -0.671 0.7674388 2018-12-14
Rows between two dates
If the question is how many rows lie between and including two dates that appear in the data, then:
d1 <- as.Date("2018-12-14")
d2 <- as.Date("2018-12-20")
diff(match(c(d1, d2), data$Date)) + 1
## [1] 5
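The same idea wrapped as a small helper, if you need it repeatedly (the name trading_days_between is my own):
trading_days_between <- function(d1, d2, dates) {
  # rows between two dates that actually occur in `dates`, inclusive
  diff(match(as.Date(c(d1, d2)), as.Date(dates))) + 1
}
trading_days_between("2018-12-14", "2018-12-20", data$Date)
## [1] 5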
Note
Lines <- " Date y
1 2018-12-13T00:00:00 -0.00736
2 2018-12-14T00:00:00 -1.30
3 2018-12-17T00:00:00 0.227
4 2018-12-18T00:00:00 -0.671
5 2018-12-19T00:00:00 -0.750
6 2018-12-20T00:00:00 -0.906
7 2018-12-21T00:00:00 -1.74
8 2018-12-27T00:00:00 0.331
9 2018-12-28T00:00:00 -0.768
10 2018-12-31T00:00:00 0.649"
data <- read.table(text = Lines)
data$Date <- as.Date(data$Date)

R: counting timestamps per week

I have a dataframe containing a lot of tweets. Each tweet has a unique timestamp. Now, what I would like to calculate is how many tweets were published in each week, based on the timestamp. Any ideas? I tried to do it with the tidyverse and dplyr, but sadly it didn't work.
Kind regards,
Daniel
library(dplyr)
set.seed(42)
tweets <- tibble(timestamp = sort(Sys.time() - runif(1000, 0, 365*86400)), tweet = paste("tweet", 1:1000))
tweets
# # A tibble: 1,000 x 2
# timestamp tweet
# <dttm> <chr>
# 1 2021-01-27 09:39:47 tweet 1
# 2 2021-01-28 02:38:29 tweet 2
# 3 2021-01-28 07:33:02 tweet 3
# 4 2021-01-29 08:42:47 tweet 4
# 5 2021-01-29 09:21:58 tweet 5
# 6 2021-01-29 16:01:09 tweet 6
# 7 2021-01-30 05:04:18 tweet 7
# 8 2021-01-30 21:45:05 tweet 8
# 9 2021-01-31 18:32:24 tweet 9
# 10 2021-02-02 02:57:51 tweet 10
# # ... with 990 more rows
tweets %>%
  group_by(yearweek = format(timestamp, format = "%Y-%U")) %>%
  summarize(date = min(as.Date(timestamp)), ntweets = n(), .groups = "drop")
# # A tibble: 54 x 3
# yearweek date ntweets
# <chr> <date> <int>
# 1 2021-04 2021-01-27 8
# 2 2021-05 2021-01-31 15
# 3 2021-06 2021-02-07 19
# 4 2021-07 2021-02-14 24
# 5 2021-08 2021-02-21 28
# 6 2021-09 2021-02-28 22
# 7 2021-10 2021-03-07 16
# 8 2021-11 2021-03-15 13
# 9 2021-12 2021-03-21 15
# 10 2021-13 2021-03-28 19
# # ... with 44 more rows
See ?strptime for definitions of the various "week of the year" options ("%U", "%V", "%W").
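As an alternative sketch, lubridate::floor_date() bins each timestamp to the first day of its week (week_start = 1 starts weeks on Monday) and sidesteps the year-boundary quirks of the format-string approach:
library(lubridate)
tweets %>%
  count(week = floor_date(as.Date(timestamp), "week", week_start = 1), name = "ntweets")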

Calculate average number of individuals present on each date in R

I have a dataset that contains the residence period (start.date to end.date) of marked individuals (ID) at different sites. My goal is to generate a column that tells me the average number of other individuals per day that were also present at the same site (across the total residence period of each individual).
To do this, I need to determine the total number of individuals that were present per site on each date, summed across the total residence period of each individual. Ultimately, I will divide this sum by the total residence days of each individual to calculate the average. Can anyone help me accomplish this?
I calculated the total number of residence days (total.days) using lubridate and dplyr
mutate(total.days = end.date - start.date + 1)
site ID start.date end.date total.days
1 1 16 5/24/17 6/5/17 13
2 1 46 4/30/17 5/20/17 21
3 1 26 4/30/17 5/23/17 24
4 1 89 5/5/17 5/13/17 9
5 1 12 5/11/17 5/14/17 4
6 2 14 5/4/17 5/10/17 7
7 2 18 5/9/17 5/29/17 21
8 2 19 5/24/17 6/10/17 18
9 2 39 5/5/17 5/18/17 14
First of all, it is always advisable to give a sample of the data in a friendlier format using dput(yourData) so that others can easily regenerate your data. Here is the dput() output you could have shared:
> dput(dat)
structure(list(site = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
    ID = c(16, 46, 26, 89, 12, 14, 18, 19, 39),
    start.date = structure(c(17310, 17286, 17286, 17291, 17297,
        17290, 17295, 17310, 17291), class = "Date"),
    end.date = structure(c(17322, 17306, 17309, 17299, 17300,
        17296, 17315, 17327, 17304), class = "Date")),
    class = "data.frame", row.names = c(NA, -9L))
To do this easily we first need to unpack the start.date and end.date to individual dates:
newDat <- data.frame()
for (i in 1:nrow(dat)) {
  expand <- data.frame(site = dat$site[i],
                       ID = dat$ID[i],
                       Dates = seq.Date(dat$start.date[i], dat$end.date[i], 1))
  newDat <- rbind(newDat, expand)
}
newDat
site ID Dates
1 1 16 2017-05-24
2 1 16 2017-05-25
3 1 16 2017-05-26
4 1 16 2017-05-27
5 1 16 2017-05-28
6 1 16 2017-05-29
7 1 16 2017-05-30
. . .
. . .
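For reference, the same unpacking can be written without an explicit loop; this loop-free variant with tidyr::unnest() is my own sketch, not part of the original answer:
library(dplyr)
library(tidyr)
newDat <- dat %>%
  mutate(Dates = purrr::map2(start.date, end.date, seq.Date, by = "day")) %>%
  select(site, ID, Dates) %>%
  unnest(Dates)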
Then we calculate the number of other individuals present at each site on each day:
individualCount <- newDat %>%
  group_by(site, Dates) %>%
  summarise(individuals = n_distinct(ID) - 1)
individualCount
# A tibble: 75 x 3
# Groups: site [?]
site Dates individuals
<dbl> <date> <int>
1 1 2017-04-30 1
2 1 2017-05-01 1
3 1 2017-05-02 1
4 1 2017-05-03 1
5 1 2017-05-04 1
6 1 2017-05-05 2
7 1 2017-05-06 2
8 1 2017-05-07 2
9 1 2017-05-08 2
10 1 2017-05-09 2
# ... with 65 more rows
Then, we augment our data with the new information using left_join() and calculate the required average:
newDat <- left_join(newDat, individualCount, by = c("site", "Dates")) %>%
  group_by(site, ID) %>%
  summarise(duration = max(Dates) - min(Dates) + 1,
            av.individuals = mean(individuals))
newDat
# A tibble: 9 x 4
# Groups: site [?]
site ID duration av.individuals
<dbl> <dbl> <time> <dbl>
1 1 12 4 0.75
2 1 16 13 0
3 1 26 24 1.42
4 1 46 21 1.62
5 1 89 9 1.33
6 2 14 7 1.14
7 2 18 21 0.875
8 2 19 18 0.333
9 2 39 14 1.14
The final step is to add the required columns to the original dataset (dat), again with left_join() (note the result must be assigned back for dat to show them):
dat <- dat %>% left_join(newDat, by = c("site", "ID"))
dat
site ID start.date end.date duration av.individuals
1 1 16 2017-05-24 2017-06-05 13 days 0.000000
2 1 46 2017-04-30 2017-05-20 21 days 1.619048
3 1 26 2017-04-30 2017-05-23 24 days 1.416667
4 1 89 2017-05-05 2017-05-13 9 days 2.333333
5 1 12 2017-05-11 2017-05-14 4 days 2.750000
6 2 14 2017-05-04 2017-05-10 7 days 1.142857
7 2 18 2017-05-09 2017-05-29 21 days 0.857143
8 2 19 2017-05-24 2017-06-10 18 days 0.333333
9 2 39 2017-05-05 2017-05-18 14 days 1.142857

How can I fill missing data points in R for a given dataframe

I have a dataframe which contains dates, products and amounts. However, product b is not present on every date; I would like it to appear with an NA or 0 balance. Is this possible?
Summary_Date <-
as.Date(c("2017-01-31",
"2017-02-28",
"2017-03-31",
"2017-03-31",
"2017-04-30",
"2017-05-31",
"2017-05-31",
"2017-06-30"))
Product <-
as.character(c("a","a","a","b","a","a","b","a"))
Amounts <-
as.numeric(c(10,10,10,20,10,10,20,10))
df <- data.frame(Summary_Date,Product,Amounts)
Regards,
Aksel
You can use tidyr:
> library(tidyr)
> complete(data = df,Summary_Date,Product)
# A tibble: 12 x 3
Summary_Date Product Amounts
<date> <fctr> <dbl>
1 2017-01-31 a 10
2 2017-01-31 b NA
3 2017-02-28 a 10
4 2017-02-28 b NA
5 2017-03-31 a 10
6 2017-03-31 b 20
7 2017-04-30 a 10
8 2017-04-30 b NA
9 2017-05-31 a 10
10 2017-05-31 b 20
11 2017-06-30 a 10
12 2017-06-30 b NA
