I have a database of information pertaining to individuals observed over time. I would like to find a way to obtain the age of these individuals at the time each record was taken. Assuming that BIRTH corresponds to an age of 0, I would like to obtain the age, in either days or months, at each subsequent visit. It would also be helpful to obtain a final age (in days or months) for each individual (*not included in the code). For example, for ID (A), the final age would be 10 months. I would like to use the lubridate package, as its built-in date features make it easier to work with dates. Any help with this is much appreciated.
date<-c("2000-01-01","2000-01-14","2000-01-25","2000-02-12","2000-02-27","2000-06-05","2000-10-30",
"2001-02-04","2001-06-15","2001-12-26","2002-05-22","2002-06-04",
"2000-01-08","2000-07-11","2000-08-18","2000-11-27")
ID<-c("A","A","A","A","A","A","A",
"B","B","B","B","B",
"C","C","C","C")
status<-c("BIRTH","ETC","ETC","ETC","ETC","ETC","ETC",
"BIRTH","ETC","ETC","ETC","ETC",
"BIRTH","ETC","ETC","ETC")
df1<-data.frame(date,ID,status)
print(df1)
date ID status
1 2000-01-01 A BIRTH
2 2000-01-14 A ETC
3 2000-01-25 A ETC
4 2000-02-12 A ETC
5 2000-02-27 A ETC
6 2000-06-05 A ETC
7 2000-10-30 A ETC
8 2001-02-04 B BIRTH
9 2001-06-15 B ETC
10 2001-12-26 B ETC
11 2002-05-22 B ETC
12 2002-06-04 B ETC
13 2000-01-08 C BIRTH
14 2000-07-11 C ETC
15 2000-08-18 C ETC
16 2000-11-27 C ETC
date.new<-c("2000-01-01","2000-01-14","2000-01-25","2000-02-12","2000-02-27","2000-06-05","2000-10-30",
"2001-02-04","2001-06-15","2001-12-26","2002-05-22","2002-06-04",
"2000-01-08","2000-07-11","2000-08-18","2000-11-27")
ID.new<-c("A","A","A","A","A","A","A",
"B","B","B","B","B",
"C","C","C","C")
status.new<-c("BIRTH","ETC","ETC","ETC","ETC","ETC","ETC",
"BIRTH","ETC","ETC","ETC","ETC",
"BIRTH","ETC","ETC","ETC")
age<-c(0,1,1,2,2,6,10,
0,4,10,15,16,
0,6,7,10)
df2<-data.frame(date.new,ID.new,status.new,age)
print(df2)
date.new ID.new status.new age
1 2000-01-01 A BIRTH 0
2 2000-01-14 A ETC 1
3 2000-01-25 A ETC 1
4 2000-02-12 A ETC 2
5 2000-02-27 A ETC 2
6 2000-06-05 A ETC 6
7 2000-10-30 A ETC 10
8 2001-02-04 B BIRTH 0
9 2001-06-15 B ETC 4
10 2001-12-26 B ETC 10
11 2002-05-22 B ETC 15
12 2002-06-04 B ETC 16
13 2000-01-08 C BIRTH 0
14 2000-07-11 C ETC 6
15 2000-08-18 C ETC 7
16 2000-11-27 C ETC 10
For calculations related to age in years or months, I'd like to encourage you to try the clock package rather than lubridate. lubridate is a great package, but produces some unexpected results with these kinds of calculations if you aren't 100% sure of what you are doing. In clock, the function to do this is date_count_between(). Notice that one of the results is different between clock and lubridate here:
library(clock)
library(lubridate, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
df <- tibble(
date = c("2000-01-01","2000-01-14",
"2000-01-25","2000-02-12","2000-02-27","2000-06-05",
"2000-10-30","2001-02-04","2001-06-15","2001-12-26",
"2002-05-22","2002-06-04","2000-01-08","2000-07-11",
"2000-08-18","2000-11-27"),
ID = c("A","A","A","A","A","A",
"A","B","B","B","B","B","C","C","C","C"),
status = c("BIRTH","ETC","ETC","ETC",
"ETC","ETC","ETC","BIRTH","ETC","ETC","ETC","ETC",
"BIRTH","ETC","ETC","ETC")
)
df %>%
mutate(date = date_parse(date)) %>%
group_by(ID) %>%
mutate(birth_date = date[status == "BIRTH"]) %>%
ungroup() %>%
mutate(
age_clock = date_count_between(birth_date, date, "month"),
age_lubridate = as.period(date - birth_date) %/% months(1))
#> # A tibble: 16 × 6
#> date ID status birth_date age_clock age_lubridate
#> <date> <chr> <chr> <date> <int> <dbl>
#> 1 2000-01-01 A BIRTH 2000-01-01 0 0
#> 2 2000-01-14 A ETC 2000-01-01 0 0
#> 3 2000-01-25 A ETC 2000-01-01 0 0
#> 4 2000-02-12 A ETC 2000-01-01 1 1
#> 5 2000-02-27 A ETC 2000-01-01 1 1
#> 6 2000-06-05 A ETC 2000-01-01 5 5
#> 7 2000-10-30 A ETC 2000-01-01 9 9
#> 8 2001-02-04 B BIRTH 2001-02-04 0 0
#> 9 2001-06-15 B ETC 2001-02-04 4 4
#> 10 2001-12-26 B ETC 2001-02-04 10 10
#> 11 2002-05-22 B ETC 2001-02-04 15 15
#> 12 2002-06-04 B ETC 2001-02-04 16 15
#> 13 2000-01-08 C BIRTH 2000-01-08 0 0
#> 14 2000-07-11 C ETC 2000-01-08 6 6
#> 15 2000-08-18 C ETC 2000-01-08 7 7
#> 16 2000-11-27 C ETC 2000-01-08 10 10
clock says that 2001-02-04 to 2002-06-04 is 16 months, while the lubridate method here only says it is 15 months. This has to do with the fact that the lubridate calculation uses the length of an average month, which doesn't always accurately reflect how we think about months.
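The arithmetic behind that 15 can be checked in base R. This is only a sketch of the average-month reasoning, assuming an average month of 365.25 / 12 = 30.4375 days; it is not lubridate's internal code, and the variable names are made up for illustration.

```r
# Calendar days between B's birth and B's last visit
days <- as.numeric(as.Date("2002-06-04") - as.Date("2001-02-04"))
days
#> [1] 485

# Length of an "average" month in days (assumption: 365.25 / 12)
avg_month <- 365.25 / 12

# Whole average months that fit into the interval
days %/% avg_month
#> [1] 15

# 16 average months would need 487 days, two more than are available
16 * avg_month
#> [1] 487
```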
Consider this simple example. I think most people would agree that a child born on February 28 is "1 month and 1 day" old on March 29. But lubridate shows 0 months!
library(clock)
library(lubridate, warn.conflicts = FALSE)
# "1 month and 1 day apart"
feb <- as.Date("2020-02-28")
mar <- as.Date("2020-03-29")
# As expected when thinking about age in months
date_count_between(feb, mar, "month")
#> [1] 1
# Not expected
as.period(mar - feb) %/% months(1)
#> [1] 0
secs_in_day <- 86400
secs_in_month <- as.numeric(months(1))
secs_in_month / secs_in_day
#> [1] 30.4375
# Less than 30.4375 days, so not 1 month
mar - feb
#> Time difference of 30 days
The issue is that lubridate uses the length of an average month in the computation, which is 30.4375 days. But there are only 30 days between these two dates, so it isn't considered a full month.
clock, on the other hand, uses the day component of the starting date to determine if a "full month" has passed or not. In other words, because we have passed the 28th of March, clock decides that 1 month has passed, which is consistent with how we generally think about age.
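The day-of-month rule described above can be sketched in a few lines of base R. `full_months()` is a hypothetical helper written for illustration, not a clock or lubridate function, and it makes no attempt to reproduce how clock treats start dates that do not exist in a later month (for example a birth on the 31st).

```r
# Hypothetical helper: count completed calendar months by comparing
# the day of the month of the end date with that of the start date.
full_months <- function(start, end) {
  s <- as.POSIXlt(start)
  e <- as.POSIXlt(end)
  months <- (e$year - s$year) * 12 + (e$mon - s$mon)
  # Drop one month if the end day has not yet reached the start day
  months - (e$mday < s$mday)
}

full_months(as.Date("2020-02-28"), as.Date("2020-03-29"))
#> [1] 1
full_months(as.Date("2001-02-04"), as.Date("2002-06-04"))
#> [1] 16
```

Both results agree with the clock output shown earlier, which is all this sketch is meant to illustrate.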
Using dplyr and lubridate, we can do the following. We first turn the date column into a date. Then we group by ID, find the birth date and calculate the number of months since that date via some lubridate magic (see How do I use the lubridate package to calculate the number of months between two date vectors where one of the vectors has NA values?).
library(dplyr)
library(lubridate)
df1 %>%
mutate(date = as_date(date)) %>%
group_by(ID) %>%
mutate(birth_date = date[status == "BIRTH"],
age = as.period(date - birth_date) %/% months(1)) %>%
ungroup()
Which gives:
date ID status birth_date age
<date> <fct> <fct> <date> <dbl>
1 2000-01-01 A BIRTH 2000-01-01 0
2 2000-01-14 A ETC 2000-01-01 0
3 2000-01-25 A ETC 2000-01-01 0
4 2000-02-12 A ETC 2000-01-01 1
5 2000-02-27 A ETC 2000-01-01 1
6 2000-06-05 A ETC 2000-01-01 5
7 2000-10-30 A ETC 2000-01-01 9
8 2001-02-04 B BIRTH 2001-02-04 0
9 2001-06-15 B ETC 2001-02-04 4
10 2001-12-26 B ETC 2001-02-04 10
11 2002-05-22 B ETC 2001-02-04 15
12 2002-06-04 B ETC 2001-02-04 15
13 2000-01-08 C BIRTH 2000-01-08 0
14 2000-07-11 C ETC 2000-01-08 6
15 2000-08-18 C ETC 2000-01-08 7
16 2000-11-27 C ETC 2000-01-08 10
This matches your expected output except for some rounding differences. See my comment on your question.
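The question also asks for a final age per individual. A minimal base R sketch, assuming "final age" means the age in days at the last recorded visit; the columns `birth_date` and `age_days` and the object `final_age` are names introduced here for illustration.

```r
# Same data as df1 in the question, with dates parsed up front
df1 <- data.frame(
  date = as.Date(c("2000-01-01", "2000-01-14", "2000-01-25", "2000-02-12",
                   "2000-02-27", "2000-06-05", "2000-10-30",
                   "2001-02-04", "2001-06-15", "2001-12-26", "2002-05-22",
                   "2002-06-04",
                   "2000-01-08", "2000-07-11", "2000-08-18", "2000-11-27")),
  ID = rep(c("A", "B", "C"), times = c(7, 5, 4)),
  status = c("BIRTH", rep("ETC", 6), "BIRTH", rep("ETC", 4),
             "BIRTH", rep("ETC", 3)),
  stringsAsFactors = FALSE
)

# Look up each individual's birth date and attach it to every row
births <- df1[df1$status == "BIRTH", ]
df1$birth_date <- births$date[match(df1$ID, births$ID)]

# Age in days at every visit
df1$age_days <- as.numeric(df1$date - df1$birth_date)

# Final age = largest age observed for each individual
final_age <- tapply(df1$age_days, df1$ID, max)
final_age
#>   A   B   C 
#> 303 485 324
```

Ages in days avoid the month-length question entirely; converting them to months still requires choosing one of the conventions discussed above.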
I have the following simulated dataset with a column y observed on the fixed trading days (say 250) of 2018.
data
# A tibble: 249 × 2
Date y
<dttm> <dbl>
1 2018-01-02 00:00:00 0.409
2 2018-01-03 00:00:00 -1.90
3 2018-01-04 00:00:00 0.131
4 2018-01-05 00:00:00 -0.619
5 2018-01-08 00:00:00 0.449
6 2018-01-09 00:00:00 0.448
7 2018-01-10 00:00:00 0.124
8 2018-01-11 00:00:00 -0.346
9 2018-01-12 00:00:00 0.775
10 2018-01-15 00:00:00 -0.948
# … with 239 more rows
with tail
> tail(data,n=10)
# A tibble: 10 × 2
Date y
<dttm> <dbl>
1 2018-12-13 00:00:00 -0.00736
2 2018-12-14 00:00:00 -1.30
3 2018-12-17 00:00:00 0.227
4 2018-12-18 00:00:00 -0.671
5 2018-12-19 00:00:00 -0.750
6 2018-12-20 00:00:00 -0.906
7 2018-12-21 00:00:00 -1.74
8 2018-12-27 00:00:00 0.331
9 2018-12-28 00:00:00 -0.768
10 2018-12-31 00:00:00 0.649
I want to calculate the rolling sd of column y with a window of 60 and then find the span of the window in trading days rather than ordinary calendar days (can this be done from the row index? I don't know).
data2 = data%>%
mutate(date = as.Date(Date))
data3=data2[,-1];head(data3)
roll_win = 60
data3$a = c(rep(NA_real_, roll_win - 1), zoo::rollapply(data3$y, roll_win ,sd))
dat = subset(data3, !is.na(a))
dat_max = dat[dat$a == max(dat$a, na.rm = TRUE), ]
dat_max$date_start = dat_max$date - (roll_win - 1)
dat_max
It turns out that the period of highest volatility is:
dat_max
# A tibble: 1 × 4
y date a date_start
<dbl> <date> <dbl> <date>
1 0.931 2018-04-24 1.18 2018-02-24
Now if I subtract the two dates I will have :
> dat_max$date - dat_max$date_start
Time difference of 59 days
This is correct for calendar days, but these are NOT TRADING DAYS.
I have asked a similar question here, but it didn't solve the problem. That question asked how to obtain the days of high volatility.
How can I obtain these trading days? Thanks in advance.
EDIT
FOR FULL DATA
library(gsheet)
data= gsheet2tbl("https://docs.google.com/spreadsheets/d/1PdZDb3OgqSaO6znUWsAh7p_MVLHgNbQM/edit?usp=sharing&ouid=109626011108852110510&rtpof=true&sd=true")
data
Start date for each time window
If the question is how to calculate the start date for each window then using the data in the Note at the end and a window of 3:
library(dplyr)
w <- 3
out <- mutate(data,
  sd = zoo::rollapplyr(y, w, sd, fill = NA),
  start = dplyr::lag(Date, w - 1)
)
out
giving:
Date y sd start
1 2018-12-13 -0.00736 NA <NA>
2 2018-12-14 -1.30000 NA <NA>
3 2018-12-17 0.22700 0.8223515 2018-12-13
4 2018-12-18 -0.67100 0.7674388 2018-12-14
5 2018-12-19 -0.75000 0.5427053 2018-12-17
6 2018-12-20 -0.90600 0.1195840 2018-12-18
7 2018-12-21 -1.74000 0.5322894 2018-12-19
8 2018-12-27 0.33100 1.0420146 2018-12-20
9 2018-12-28 -0.76800 1.0361488 2018-12-21
10 2018-12-31 0.64900 0.7435068 2018-12-27
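The same start dates can be recovered in base R by stepping back `w - 1` rows rather than `w - 1` calendar days. This is a sketch on the ten dates from the Note; `idx`, `start` and `calendar_span` are names chosen here for illustration.

```r
dates <- as.Date(c("2018-12-13", "2018-12-14", "2018-12-17", "2018-12-18",
                   "2018-12-19", "2018-12-20", "2018-12-21", "2018-12-27",
                   "2018-12-28", "2018-12-31"))
w <- 3

# Row index of the first observation in each window; NA where the
# window would start before the first row
idx <- seq_along(dates) - (w - 1)
idx[idx < 1] <- NA

# Window start dates, taken from the rows that are actually present
start <- dates[idx]

# Every complete window holds w trading days, but the number of
# calendar days it spans varies with weekends and holidays
calendar_span <- as.numeric(dates - start)
data.frame(end = dates, start = start, calendar_span = calendar_span)
```

For example, the window ending 2018-12-27 starts on 2018-12-20: three trading days, but seven calendar days apart.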
Largest sd's with their start and end dates
The largest 4 sd's and their start and end dates are:
head(dplyr::arrange(out, -sd), 4)
giving:
Date y sd start
8 2018-12-27 0.331 1.0420146 2018-12-20
9 2018-12-28 -0.768 1.0361488 2018-12-21
3 2018-12-17 0.227 0.8223515 2018-12-13
4 2018-12-18 -0.671 0.7674388 2018-12-14
Rows between two dates
If the question is how many rows lie between two dates that appear in data, counting both end points, then:
d1 <- as.Date("2018-12-14")
d2 <- as.Date("2018-12-20")
diff(match(c(d1, d2), data$Date)) + 1
## [1] 5
Note
Lines <- " Date y
1 2018-12-13T00:00:00 -0.00736
2 2018-12-14T00:00:00 -1.30
3 2018-12-17T00:00:00 0.227
4 2018-12-18T00:00:00 -0.671
5 2018-12-19T00:00:00 -0.750
6 2018-12-20T00:00:00 -0.906
7 2018-12-21T00:00:00 -1.74
8 2018-12-27T00:00:00 0.331
9 2018-12-28T00:00:00 -0.768
10 2018-12-31T00:00:00 0.649"
data <- read.table(text = Lines)
data$Date <- as.Date(data$Date)
I have a dataframe containing a lot of tweets. Each tweet has a unique timestamp. Now, what I would like to calculate is how many tweets have been published in each week, based on the timestamp. Any ideas? I tried to do it with tidyverse and dplyr, sadly it didn't work.
Kind regards,
Daniel
library(dplyr)
set.seed(42)
tweets <- tibble(timestamp = sort(Sys.time() - runif(1000, 0, 365*86400)), tweet = paste("tweet", 1:1000))
tweets
# # A tibble: 1,000 x 2
# timestamp tweet
# <dttm> <chr>
# 1 2021-01-27 09:39:47 tweet 1
# 2 2021-01-28 02:38:29 tweet 2
# 3 2021-01-28 07:33:02 tweet 3
# 4 2021-01-29 08:42:47 tweet 4
# 5 2021-01-29 09:21:58 tweet 5
# 6 2021-01-29 16:01:09 tweet 6
# 7 2021-01-30 05:04:18 tweet 7
# 8 2021-01-30 21:45:05 tweet 8
# 9 2021-01-31 18:32:24 tweet 9
# 10 2021-02-02 02:57:51 tweet 10
# # ... with 990 more rows
tweets %>%
group_by(yearweek = format(timestamp, format = "%Y-%U")) %>%
summarize(date = min(as.Date(timestamp)), ntweets = n(), .groups = "drop")
# # A tibble: 54 x 3
# yearweek date ntweets
# <chr> <date> <int>
# 1 2021-04 2021-01-27 8
# 2 2021-05 2021-01-31 15
# 3 2021-06 2021-02-07 19
# 4 2021-07 2021-02-14 24
# 5 2021-08 2021-02-21 28
# 6 2021-09 2021-02-28 22
# 7 2021-10 2021-03-07 16
# 8 2021-11 2021-03-15 13
# 9 2021-12 2021-03-21 15
# 10 2021-13 2021-03-28 19
# # ... with 44 more rows
See ?strptime for definitions of the various "week of the year" options ("%U", "%V", "%W").
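To see how the choice of week code changes the grouping, here is a small base R sketch on four made-up timestamps (they are not taken from the tweets data). It compares "%U" (weeks start on Sunday) with "%W" (weeks start on Monday) and uses `table()` in place of the dplyr pipeline.

```r
# Four illustrative timestamps: Sat, Sun, Mon, Sun
ts <- as.POSIXct(c("2021-01-30 12:00:00", "2021-01-31 08:00:00",
                   "2021-02-01 09:00:00", "2021-02-07 10:00:00"),
                 tz = "UTC")

# Weeks starting on Sunday: 2021-01-31 opens week 05, 2021-02-07 opens week 06
by_sunday_week <- table(format(ts, "%Y-%U"))
by_sunday_week
#> 
#> 2021-04 2021-05 2021-06 
#>       1       2       1

# Weeks starting on Monday: 2021-01-31 still belongs to week 04
by_monday_week <- table(format(ts, "%Y-%W"))
by_monday_week
#> 
#> 2021-04 2021-05 
#>       2       2
```

The same timestamps fall into three groups under one convention and two under the other, so it is worth picking the week definition deliberately.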
I have a dataset that contains the residence period (start.date to end.date) of marked individuals (ID) at different sites. My goal is to generate a column that tells me the average number of other individuals per day that were also present at the same site (across the total residence period of each individual).
To do this, I need to determine the total number of individuals that were present per site on each date, summed across the total residence period of each individual. Ultimately, I will divide this sum by the total residence days of each individual to calculate the average. Can anyone help me accomplish this?
I calculated the total number of residence days (total.days) using lubridate and dplyr
mutate(total.days = end.date - start.date + 1)
site ID start.date end.date total.days
1 1 16 5/24/17 6/5/17 13
2 1 46 4/30/17 5/20/17 21
3 1 26 4/30/17 5/23/17 24
4 1 89 5/5/17 5/13/17 9
5 1 12 5/11/17 5/14/17 4
6 2 14 5/4/17 5/10/17 7
7 2 18 5/9/17 5/29/17 21
8 2 19 5/24/17 6/10/17 18
9 2 39 5/5/17 5/18/17 14
First of all, it is always advisable to share a sample of your data in a friendlier format using dput(yourData), so that others can easily recreate it. Here is the dput() output that would have been better to share:
> dput(dat)
structure(list(site = c(1, 1, 1, 1, 1, 2, 2, 2, 2), ID = c(16,
46, 26, 89, 12, 14, 18, 19, 39), start.date = structure(c(17310,
17286, 17286, 17291, 17297, 17290, 17295, 17310, 17291), class = "Date"),
end.date = structure(c(17322, 17306, 17309, 17299, 17300,
17296, 17315, 17327, 17304), class = "Date")), class = "data.frame", row.names = c(NA,
-9L))
To do this easily we first need to unpack the start.date and end.date to individual dates:
newDat <- data.frame()
for (i in 1:nrow(dat)){
expand <- data.frame(site = dat$site[i],
ID = dat$ID[i],
Dates = seq.Date(dat$start.date[i], dat$end.date[i], 1))
newDat <- rbind(newDat, expand)
}
newDat
site ID Dates
1 1 16 2017-05-24
2 1 16 2017-05-25
3 1 16 2017-05-26
4 1 16 2017-05-27
5 1 16 2017-05-28
6 1 16 2017-05-29
7 1 16 2017-05-30
. . .
. . .
Then we calculate the number of other individuals present in each site in each day:
library(dplyr)
individualCount = newDat %>%
  group_by(site, Dates) %>%
  summarise(individuals = n_distinct(ID) - 1)
individualCount
# A tibble: 75 x 3
# Groups: site [?]
site Dates individuals
<dbl> <date> <int>
1 1 2017-04-30 1
2 1 2017-05-01 1
3 1 2017-05-02 1
4 1 2017-05-03 1
5 1 2017-05-04 1
6 1 2017-05-05 2
7 1 2017-05-06 2
8 1 2017-05-07 2
9 1 2017-05-08 2
10 1 2017-05-09 2
# ... with 65 more rows
Then, we augment our data with the new information using left_join() and calculate the required average:
newDat <- left_join(newDat, individualCount, by = c("site", "Dates")) %>%
group_by(site, ID) %>%
summarise(duration = max(Dates) - min(Dates)+1,
av.individuals = mean(individuals))
newDat
# A tibble: 9 x 4
# Groups: site [?]
site ID duration av.individuals
<dbl> <dbl> <time> <dbl>
1 1 12 4 2.75
2 1 16 13 0
3 1 26 24 1.42
4 1 46 21 1.62
5 1 89 9 2.33
6 2 14 7 1.14
7 2 18 21 0.857
8 2 19 18 0.333
9 2 39 14 1.14
The final step is to add the required column to the original dataset (dat) again with left_join():
dat <- dat %>% left_join(newDat, by = c("site", "ID"))
dat
site ID start.date end.date duration av.individuals
1 1 16 2017-05-24 2017-06-05 13 days 0.000000
2 1 46 2017-04-30 2017-05-20 21 days 1.619048
3 1 26 2017-04-30 2017-05-23 24 days 1.416667
4 1 89 2017-05-05 2017-05-13 9 days 2.333333
5 1 12 2017-05-11 2017-05-14 4 days 2.750000
6 2 14 2017-05-04 2017-05-10 7 days 1.142857
7 2 18 2017-05-09 2017-05-29 21 days 0.857143
8 2 19 2017-05-24 2017-06-10 18 days 0.333333
9 2 39 2017-05-05 2017-05-18 14 days 1.142857
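As a cross-check on these averages, the same numbers can be obtained in base R without expanding the periods into daily rows, by summing the overlap in days between each individual's stay and every other stay at the same site. This is a sketch that assumes each ID has a single residence period per site; `s`, `e` and `av.others` are names introduced here.

```r
dat <- data.frame(
  site = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
  ID = c(16, 46, 26, 89, 12, 14, 18, 19, 39),
  start.date = as.Date(c("2017-05-24", "2017-04-30", "2017-04-30",
                         "2017-05-05", "2017-05-11", "2017-05-04",
                         "2017-05-09", "2017-05-24", "2017-05-05")),
  end.date = as.Date(c("2017-06-05", "2017-05-20", "2017-05-23",
                       "2017-05-13", "2017-05-14", "2017-05-10",
                       "2017-05-29", "2017-06-10", "2017-05-18"))
)

# Work with day numbers so pmin()/pmax() operate on plain numerics
s <- as.numeric(dat$start.date)
e <- as.numeric(dat$end.date)

av.others <- sapply(seq_len(nrow(dat)), function(i) {
  # Other individuals at the same site
  others <- dat$site == dat$site[i] & dat$ID != dat$ID[i]
  # Shared days with each of them (inclusive), zero if the stays do not overlap
  overlap <- pmin(e[others], e[i]) - pmax(s[others], s[i]) + 1
  # Total shared individual-days divided by this individual's residence days
  sum(pmax(overlap, 0)) / (e[i] - s[i] + 1)
})

round(av.others, 3)
#> [1] 0.000 1.619 1.417 2.333 2.750 1.143 0.857 0.333 1.143
```

The values are in the row order of dat and agree with the av.individuals column in the final table.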
I have a dataframe which contains dates, products and amounts. However, product b does not appear on every date; I would like the missing dates to be filled in for it with an NA or 0 amount. Is this possible?
Summary_Date <-
as.Date(c("2017-01-31",
"2017-02-28",
"2017-03-31",
"2017-03-31",
"2017-04-30",
"2017-05-31",
"2017-05-31",
"2017-06-30"))
Product <-
as.character(c("a","a","a","b","a","a","b","a"))
Amounts <-
as.numeric(c(10,10,10,20,10,10,20,10))
df <- data.frame(Summary_Date,Product,Amounts)
Regards,
Aksel
You can use tidyr:
> library(tidyr)
> complete(data = df,Summary_Date,Product)
# A tibble: 12 x 3
Summary_Date Product Amounts
<date> <fctr> <dbl>
1 2017-01-31 a 10
2 2017-01-31 b NA
3 2017-02-28 a 10
4 2017-02-28 b NA
5 2017-03-31 a 10
6 2017-03-31 b 20
7 2017-04-30 a 10
8 2017-04-30 b NA
9 2017-05-31 a 10
10 2017-05-31 b 20
11 2017-06-30 a 10
12 2017-06-30 b NA
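If a 0 balance is preferred over NA and you want to stay in base R, the same completion can be sketched with expand.grid() and merge() in place of tidyr's complete(). The names `grid` and `full` are introduced here for illustration.

```r
df <- data.frame(
  Summary_Date = as.Date(c("2017-01-31", "2017-02-28", "2017-03-31",
                           "2017-03-31", "2017-04-30", "2017-05-31",
                           "2017-05-31", "2017-06-30")),
  Product = c("a", "a", "a", "b", "a", "a", "b", "a"),
  Amounts = c(10, 10, 10, 20, 10, 10, 20, 10),
  stringsAsFactors = FALSE
)

# Every combination of date and product
grid <- expand.grid(Summary_Date = unique(df$Summary_Date),
                    Product = unique(df$Product),
                    stringsAsFactors = FALSE)

# Left join the observed amounts onto the full grid;
# combinations that were never observed get NA
full <- merge(grid, df, all.x = TRUE)

# Replace the missing amounts with 0
full$Amounts[is.na(full$Amounts)] <- 0
full
```

This yields 12 rows, one per date and product, with a 0 amount for product b on the four dates where it was absent.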